Fish Audio
Fish Audio S2-Pro is the open-source alternative to ElevenLabs cloning, with two real differentiators: natural-language bracket control inside the prompt (`[whispering]`, `[laughing nervously]`, `[pause]`) and an open serving stack you can self-host. Voice cloning needs a clean reference audio sample plus a matching transcript — Fish Audio uses the transcript text to disambiguate phonemes, so a misaligned transcript hurts cloning quality more than it does on ElevenLabs. Coverage is 80+ languages with automatic detection. Critical: only clone voices you own or have explicit written permission to clone. Fish Audio is open-source, which means consent enforcement is on you, not the platform — make the rights clearance explicit before you upload reference audio.
Document the rights to the voice before you upload. Fish Audio S2-Pro is open-source, so the consent burden sits with you — keep written permission on file for any voice that is not your own. Then prepare the reference: 30+ seconds of clean speech (no background music, no second speaker, 44.1kHz WAV preferred) and a verbatim transcript that exactly matches what was spoken. Fish Audio uses the transcript to align phonemes during cloning; a transcript that says "we will" when the audio actually says "we'll" introduces small alignment errors that compound when you generate longer scripts. If you don't have a transcript, run the audio through an STT model first, then proofread the result.
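The reference checks above (length, sample rate, single speaker) can be automated before upload. A minimal sketch using Python's standard `wave` module; the 30-second and 44.1 kHz thresholds come from the guidance above, and the mono check is a cheap proxy for "no second speaker on a separate channel":

```python
import wave

def check_reference(path: str) -> list[str]:
    """Flag common reference-audio problems before uploading for cloning."""
    issues = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if duration < 30:
            issues.append(f"too short: {duration:.1f}s (want 30s+)")
        if wav.getframerate() != 44100:
            issues.append(f"sample rate {wav.getframerate()} Hz (44.1 kHz preferred)")
        if wav.getnchannels() != 1:
            issues.append("not mono: check for a second channel or source")
    return issues
```

An empty list means the file clears the basic bar; it does not check for background music, so still listen to the clip once before uploading.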
Add an Audio node, select Fish Audio S2-Pro from the model picker, and upload the reference audio + transcript pair. The model returns a cloned voice ID that you can reuse across every Audio node in the project. Note: Fish Audio in Martini is currently positioned as an SEO-comparison surface — the model exists in Martini's registry alongside ElevenLabs, MiniMax, and others, but production runtime integration depends on your workspace's configuration. If your workspace doesn't have Fish Audio runtime wired up, fall back to ElevenLabs for production, and use Fish Audio for self-hosted experimentation outside Martini.
Fish Audio S2-Pro's biggest differentiator is open-ended bracket control. Instead of a fixed list of tokens, you can write [whispering sweetly], [laughing nervously], [pause for two seconds], [angry whisper] — natural language inside brackets, placed at the exact word where delivery should change. Keep the bracket cues local: a tag near a word affects that word and the immediate phrase, not the rest of the paragraph. Compare this to ElevenLabs' fixed inline tag set ([whispers], [laughs], [excited]): Fish Audio gives you wider expressive range at the cost of slightly less predictable interpretation. Test 2-3 deliveries before committing to a long-form script.
For high-stakes content, place a Fish Audio S2-Pro Audio node and an ElevenLabs Eleven v3 Audio node side by side on the canvas, both reading the same script with cloned voices. The comparison is usually revealing: ElevenLabs is more polished and confident in English emotional delivery, while Fish Audio handles open-ended emotion tags more flexibly and covers more languages without a separate model swap. Many production teams use Fish Audio for prototype recordings and multilingual experiments, then ship the released cut on ElevenLabs. The shared canvas means you don't have to leave Martini to run the comparison.
Course intro with cloned host voice — bracket cues are placed exactly where delivery should shift. Compare against the same line on ElevenLabs Eleven v3 (fixed tags) to see Fish Audio's wider expressive range.
[calmly] Welcome to the studio. [pause] Today, [thoughtfully] we're going to look at three patterns that show up again and again — [emphasizing] every single time — in successful product launches.
Multi-speaker dialogue using Fish Audio S2-Pro's speaker tags. Each speaker can be a different cloned voice, all in one Audio node. The [curious] cue inside the host turn directs delivery on that line only.
[Speaker:Host] Hey, thanks for joining us. [Speaker:Guest] Happy to be here. [Speaker:Host] [curious] So tell me — what made you start this project?
Reference audio + transcript pairing is the single biggest quality lever. A misaligned transcript ("we will" when the audio says "we'll") accumulates phoneme errors across long generations.
Bracket cues are open-ended natural language: [whispering sweetly], [laughing nervously], [pause for two seconds]. Local placement (right next to the word) gives the most predictable result.
Fish Audio covers 80+ languages with automatic detection — no need to switch model variants between English, Mandarin, Japanese, or Spanish on the same cloned voice.
Open-source serving means you can self-host outside Martini if needed — useful for sensitive voice content where you don't want the audio leaving your infrastructure.
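The transcript-alignment tip above can be checked mechanically: diff your transcript against an STT pass of the reference audio and review every word span where they disagree. A sketch using Python's standard `difflib`; the sentence pair here is a placeholder, so plug in your own transcript and STT output:

```python
import difflib

def transcript_mismatches(transcript: str, stt_text: str) -> list[tuple[str, str]]:
    """Return (transcript, heard) word-span pairs where the two disagree."""
    a, b = transcript.split(), stt_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# The transcript says "we will" but the audio (per STT) says "we'll".
print(transcript_mismatches(
    "next week we will ship the update",
    "next week we'll ship the update",
))
```

Every pair this surfaces is a decision point: either the transcript is wrong and should match the audio verbatim, or the STT misheard and you can ignore it.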
Prompts written for the legacy S1 model use parenthesis emotion syntax; S2-Pro uses brackets. Rewrite old (excited) cues as [excited] to work with the new model.
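For a whole prompt library, the S1-to-S2-Pro migration can be scripted. A sketch using an explicit cue allowlist so ordinary parenthetical text is left alone; the cue list is illustrative, so extend it to match whatever cue words your S1 prompts actually use:

```python
import re

# Known emotion cues to migrate (illustrative; extend for your prompt library).
S1_CUES = r"(excited|calmly|whispering|angry|sad|curious)"

def migrate_s1_prompt(prompt: str) -> str:
    """Rewrite S1 parenthesis emotion cues, e.g. (excited), as S2-Pro brackets."""
    return re.sub(rf"\({S1_CUES}\)", r"[\1]", prompt)

print(migrate_s1_prompt("(excited) Big news today! (calmly) Let's dig in."))
# [excited] Big news today! [calmly] Let's dig in.
```

The allowlist matters: a blanket parenthesis-to-bracket replacement would also mangle normal asides like "(see below)", so only listed cue words are converted.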
Fish Audio S2-Pro voice cloning is the open-source choice — wider language coverage, more flexible emotion control via natural-language brackets, and self-hostable infrastructure. Trade-off vs. ElevenLabs: less polished English emotional delivery, slightly less predictable interpretation of bracket cues, and the consent burden sits with you (no platform-level voice verification). For production-grade English narration on a single voice, ElevenLabs Eleven v3 is the safer bet. For multilingual projects, voice prototyping, self-hosted deployments, or workflows that need natural-language emotion tags beyond a fixed set, Fish Audio S2-Pro is worth the trade. The Martini canvas lets you A/B both on the same script without switching tools.
Connect Fish Audio S2-Pro with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started Free