Fish Audio
Multi-speaker dialogue mode is exclusive to S2-Pro within the Fish Audio family; the older S1 model doesn't support it. Use [Speaker:Name] syntax to assign a different voice to each speaker, and use natural-language bracket cues like [whispering], [laughing nervously], or [pause for two seconds] to direct per-line delivery. Coverage is 80+ languages with automatic detection on the same voice IDs, which makes Fish Audio the strongest pick for multilingual dialogue scenes (an audio drama shipping in English, Mandarin, and Japanese, for example) or for scenes that need an expressive range beyond ElevenLabs' fixed inline tag set. Because the model is open-source, you can also self-host dialogue generation outside Martini for sensitive or pre-release content.
Fish Audio S2-Pro is the right pick over ElevenLabs Dialogue v3 in three specific cases: (1) the dialogue ships in multiple languages (Fish Audio handles 80+ languages with the same voice IDs, with no separate model swap); (2) you need natural-language emotion cues beyond ElevenLabs' fixed [whispers] / [laughs] / [excited] / [sighs] / [pause] set, for example [conspiratorial whisper], [nervous chuckle], or [exhausted sigh]; (3) you need self-hosted infrastructure outside Martini for sensitive content. For an English-only scripted animation where polish matters most, ElevenLabs Dialogue v3 is the safer default.
Pick or clone one voice per character before writing the scene. Then format the script with explicit speaker tags: [Speaker:Cole], [Speaker:Mira], [Speaker:Captain]. Each line is prefixed with its speaker; the model uses these tags to switch between voice IDs at the speaker boundary. Keep turns short (1-3 sentences) and put a line break between turns. For interruptions, end a turn with an em dash and have the next speaker pick up immediately. Voice consent: if you cloned voices for the characters rather than using prebuilt ones, document written permission for each. Fish Audio is open-source, so consent enforcement sits with you.
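The formatting rules above (one speaker tag per turn, an optional cue before the words it affects, one turn per line) can be sketched as a small helper. This is only an illustration of the script layout: the Turn class and format_script function are hypothetical names, not part of any Fish Audio or Martini API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One line of dialogue (hypothetical helper type for this sketch)."""
    speaker: str
    text: str
    cue: str = ""  # optional natural-language delivery cue, without brackets


def format_script(turns):
    """Render turns as one [Speaker:Name] line each, separated by line breaks."""
    lines = []
    for turn in turns:
        # The cue goes immediately before the words it should affect.
        cue = f"[{turn.cue}] " if turn.cue else ""
        lines.append(f"[Speaker:{turn.speaker}] {cue}{turn.text}")
    return "\n".join(lines)


scene = [
    Turn("Cole", "So you were home all night?"),
    Turn("Mira", "Yes. With my sister.", cue="pause for two seconds"),
]
script = format_script(scene)
print(script)
```

Keeping the turns as structured data rather than hand-typed text makes it harder to drop a speaker tag or merge two turns onto one line by accident.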
Fish Audio S2-Pro's open-ended bracket cues let you describe delivery in your own words: [conspiratorial whisper], [exhausted sigh], [pause for two seconds], [nervous chuckle]. Place each cue immediately before the words it should affect, scoped to one speaker's line. Compare against ElevenLabs' fixed tag set: Fish Audio's open-ended brackets give wider expressive range; the cost is slightly less predictable interpretation. For scenes with subtle emotional gradients (a character moving from calm to suspicious to alarmed across three lines), Fish Audio's descriptive cues hit closer than the fixed [excited] / [angry] tags. Test 2-3 deliveries per emotional beat before locking in the scene.
Fish Audio's 80+ language support shines for multilingual dialogue. Build the scene canvas once with the English script, then duplicate the canvas and translate the script, keeping the same [Speaker:Name] structure and the same voice IDs. Fish Audio uses the same cloned (or prebuilt) voices across all language editions, so each character's sonic identity stays consistent across English, Mandarin, Japanese, or Spanish editions. The canvas-as-template pattern means you ship a multilingual audio drama from one source canvas without rebuilding character voices per edition. Note: whether Fish Audio runs in production on Martini depends on workspace configuration. If Fish Audio isn't wired up in your workspace, fall back to ElevenLabs Multilingual v2 for the multilingual workflow.
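A minimal sketch of the translate-but-keep-structure step, assuming the script is plain text with one [Speaker:Name] turn per line. The localize function and its regular expression are hypothetical helpers written for this illustration; they only rewrite script text and do not call Fish Audio or Martini. The English lines are an assumed source for the Mandarin edition.

```python
import re

# Speaker tag, then zero or more bracket cues, then the spoken text.
TURN_RE = re.compile(
    r"^\[Speaker:(?P<speaker>[^\]]+)\]\s*"
    r"(?P<cues>(?:\[[^\]]+\]\s*)*)"
    r"(?P<text>.*)$"
)


def localize(script, translated_lines):
    """Swap in translated text while keeping speaker tags and cues unchanged."""
    source_lines = [line.strip() for line in script.splitlines() if line.strip()]
    if len(source_lines) != len(translated_lines):
        raise ValueError("need exactly one translated line per turn")
    out = []
    for line, new_text in zip(source_lines, translated_lines):
        match = TURN_RE.match(line)
        if match is None:
            raise ValueError(f"turn is missing a speaker tag: {line!r}")
        parts = [f"[Speaker:{match.group('speaker')}]"]
        cues = match.group("cues").strip()
        if cues:
            parts.append(cues)
        parts.append(new_text)
        out.append(" ".join(parts))
    return "\n".join(out)


english = (
    "[Speaker:Captain] Mission control, status check.\n"
    "[Speaker:Engineer] [excited] All systems go!"
)
mandarin = localize(english, ["任务控制,状态检查。", "所有系统正常!"])
print(mandarin)
```

Because the speaker tags are copied rather than retyped, the mapping from character to voice ID stays untouched between editions. Cues are left in English here, matching the Mandarin example in this guide.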
Two-character interrogation scene with Fish Audio bracket cues. The [skeptically, leaning in] and [nervous whisper] cues describe delivery in natural language — wider range than ElevenLabs' fixed tags would allow.
[Speaker:Cole] So you were home all night?
[Speaker:Mira] [pause for two seconds] Yes. With my sister.
[Speaker:Cole] [skeptically, leaning in] And she'll vouch for that?
[Speaker:Mira] [nervous whisper] She has to.
A three-character scene rendered in Mandarin Chinese. Fish Audio uses the same voice IDs across languages, so each character's sonic identity stays identical between the English and Mandarin editions.
[Speaker:Captain] 任务控制,状态检查。
[Speaker:Engineer] [excited] 所有系统正常!
[Speaker:Pilot] 准备好了,船长。
[Speaker:Captain] [confidently] 出发吧。
Multi-speaker dialogue is exclusive to S2-Pro in the Fish Audio family — S1 does not support [Speaker:Name] tags. Stay on S2-Pro for any scene with two or more characters.
Open-ended bracket cues ([conspiratorial whisper], [exhausted sigh]) give wider expressive range than ElevenLabs' fixed tags but slightly less predictable interpretation. Test before committing.
Same voice IDs across 80+ languages means a multilingual audio drama keeps consistent character voices across English, Mandarin, Japanese, etc. Build once, translate the script per edition.
Self-hosted serving is available for sensitive or pre-release dialogue content where you don't want audio leaving your infrastructure.
For animated shorts, split the rendered audio per character and feed each track to a separate Lipsync node (OmniHuman or Kling Avatar) on the same Martini canvas.
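Before splitting the rendered audio per character, it can help to list which turns belong to which speaker. The sketch below does only that bookkeeping on the script text. The turns_by_speaker function is a hypothetical helper; the actual audio splitting and the Lipsync nodes are outside its scope.

```python
import re
from collections import defaultdict

SPEAKER_RE = re.compile(r"^\[Speaker:([^\]]+)\]")


def turns_by_speaker(script):
    """Map each speaker name to the zero-based indices of their turns."""
    tracks = defaultdict(list)
    turns = [line.strip() for line in script.splitlines() if line.strip()]
    for index, line in enumerate(turns):
        match = SPEAKER_RE.match(line)
        if match is None:
            raise ValueError(f"turn is missing a speaker tag: {line!r}")
        tracks[match.group(1)].append(index)
    return dict(tracks)


interrogation = "\n".join([
    "[Speaker:Cole] So you were home all night?",
    "[Speaker:Mira] [pause for two seconds] Yes. With my sister.",
    "[Speaker:Cole] [skeptically, leaning in] And she'll vouch for that?",
    "[Speaker:Mira] [nervous whisper] She has to.",
])
plan = turns_by_speaker(interrogation)
print(plan)
```

Each entry in the resulting plan corresponds to one per-character track to prepare for its own Lipsync node.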
Fish Audio S2-Pro multi-speaker dialogue is the multilingual, open-source choice — wider language coverage, more flexible bracket-style emotion cues, self-hostable infrastructure. Trade-off vs. ElevenLabs Dialogue v3: less polished English emotional delivery, slightly less predictable bracket interpretation, consent burden sits with you. For an English-only animation where polish matters most, ElevenLabs is safer. For multilingual audio dramas, scenes with subtle emotional gradients beyond fixed tags, or self-hosted production, Fish Audio S2-Pro is worth the trade. Build the scene once on the Martini canvas; ship localized editions by duplicating the canvas and translating the script while keeping voice IDs constant.