Fish Audio

How to Generate AI Dialogue with Fish Audio S2-Pro

Multi-speaker dialogue mode is exclusive to S2-Pro within the Fish Audio family; the older S1 model doesn't support it. Use [Speaker:Name] syntax to assign a different voice to each speaker, and natural-language bracket cues like [whispering], [laughing nervously], or [pause two seconds] to direct per-line delivery. Coverage is 80+ languages with automatic detection on the same voice IDs, which makes Fish Audio the strongest pick for multilingual dialogue scenes (an audio drama shipping in English, Mandarin, and Japanese, for example) or for scenes that need an expressive range beyond ElevenLabs' fixed inline tag set. Because serving is open-source, you can also self-host dialogue generation outside Martini for sensitive or pre-release content.

Step-by-Step Guide

Decide if Fish Audio is the right pick for the scene

Fish Audio S2-Pro is the right pick over ElevenLabs Dialogue v3 in three specific cases: (1) the dialogue ships in multiple languages — Fish Audio handles 80+ languages with the same voice IDs, no separate model swap; (2) you need natural-language emotion cues beyond ElevenLabs' fixed [whispers] / [laughs] / [excited] / [sighs] / [pause] set — for example, [conspiratorial whisper], [nervous chuckle], [exhausted sigh]; (3) self-hosted infrastructure outside Martini for sensitive content. For an English-only scripted animation where polish matters most, ElevenLabs Dialogue v3 is the safer default.
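The three cases above can be written down as a small checklist. The sketch below is only an illustration of that decision rule: the `SceneRequirements` class and `pick_dialogue_model` function are hypothetical names, not part of Martini or either vendor's API, and the returned strings are plain labels.

```python
from dataclasses import dataclass


@dataclass
class SceneRequirements:
    """Hypothetical summary of what a dialogue scene needs."""
    languages: list[str]       # language codes the scene ships in, e.g. ["en", "zh"]
    needs_custom_cues: bool    # cues beyond the fixed ElevenLabs tag set
    must_self_host: bool       # audio must stay on your own infrastructure


def pick_dialogue_model(req: SceneRequirements) -> str:
    """Return a model label following the three cases in this guide."""
    multilingual = len(set(req.languages)) > 1
    if multilingual or req.needs_custom_cues or req.must_self_host:
        return "Fish Audio S2-Pro"
    # Single-language scene with no special needs: the guide's safer default.
    return "ElevenLabs Dialogue v3"


if __name__ == "__main__":
    drama = SceneRequirements(["en", "zh", "ja"], False, False)
    print(pick_dialogue_model(drama))
```

The rule treats any one of the three conditions as sufficient, which matches how the cases are listed; adjust it if your project weighs English polish more heavily.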

Cast voices and write the script with [Speaker:] tags

Pick or clone one voice per character before writing the scene. Then format the script with explicit speaker tags: [Speaker:Cole], [Speaker:Mira], [Speaker:Captain]. Each line is prefixed with its speaker; the model uses these tags to switch between voice IDs at the speaker boundary. Keep turns short (1-3 sentences) and put a line break between turns. For interruptions, end a turn with an em dash and have the next speaker pick up immediately. Voice consent: if you cloned voices for the characters rather than using prebuilt ones, document written permission for each one. Fish Audio is open-source, so consent enforcement sits with you.
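If you keep scripts as structured data, the tagged text can be generated rather than typed by hand. This is a minimal sketch that assumes turns are stored as (speaker, text) pairs; `build_script` is a hypothetical helper that only produces the text format described above and does not call Fish Audio or Martini.

```python
EM_DASH = "\u2014"  # ends a turn that the next speaker interrupts


def build_script(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, text) pairs into one [Speaker:Name] line per turn."""
    lines = []
    for speaker, text in turns:
        if not speaker or "]" in speaker:
            raise ValueError(f"invalid speaker name: {speaker!r}")
        lines.append(f"[Speaker:{speaker}] {text.strip()}")
    # One turn per line, as recommended for this format.
    return "\n".join(lines)


if __name__ == "__main__":
    scene = [
        ("Cole", "I told you not to open that" + EM_DASH),
        ("Mira", "It was already open."),
    ]
    print(build_script(scene))
```

Keeping the turn list as the source of truth also makes it easier to swap in translated text later without touching the speaker tags.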

Direct delivery with natural-language brackets

Fish Audio S2-Pro's open-ended bracket cues let you describe delivery in your own words: [conspiratorial whisper], [exhausted sigh], [pause for two seconds], [nervous chuckle]. Place each cue immediately before the words it should affect, scoped to one speaker's line. Compare against ElevenLabs' fixed tag set: Fish Audio's open-ended brackets give wider expressive range; the cost is slightly less predictable interpretation. For scenes with subtle emotional gradients (a character moving from calm to suspicious to alarmed across three lines), Fish Audio's descriptive cues hit closer than the fixed [excited] / [angry] tags. Test 2-3 deliveries per emotional beat before locking in the scene.
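To test two or three deliveries per beat, it helps to generate the candidate lines side by side. The sketch below assumes you render each variant yourself and compare by ear; `delivery_variants` is a hypothetical helper that only builds the text, with each cue placed immediately before the words it should affect.

```python
def delivery_variants(speaker: str, text: str, cues: list[str]) -> list[str]:
    """Return one tagged line per candidate cue for the same spoken text."""
    return [f"[Speaker:{speaker}] [{cue}] {text}" for cue in cues]


if __name__ == "__main__":
    candidates = delivery_variants(
        "Mira",
        "She has to.",
        ["nervous whisper", "exhausted sigh", "nervous chuckle"],
    )
    for line in candidates:
        print(line)
```

Render each printed line as its own take, then keep the cue whose interpretation lands closest to the beat you want.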

Render once, ship in multiple languages

Fish Audio's 80+ language support shines for multilingual dialogue. Build the scene canvas once with the English script, then duplicate the canvas and translate the script, keeping the same [Speaker:Name] structure and the same voice IDs. Fish Audio uses the same cloned (or prebuilt) voices across all language editions, so each character's sonic identity stays consistent across English, Mandarin, Japanese, or Spanish editions. With this canvas-as-template pattern, you ship a multilingual audio drama from one source canvas without rebuilding character voices per edition. Note: Fish Audio in Martini is currently SEO-positioned, and production runtime depends on workspace configuration; if Fish Audio isn't wired up, fall back to ElevenLabs Multilingual v2 for the multilingual workflow.
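Translation passes can accidentally drop, reorder, or rename a speaker tag. A quick structural check before rendering catches that. The sketch below assumes the [Speaker:Name] syntax shown in this guide; `speaker_sequence` and `same_structure` are hypothetical helper names, and the check compares only the order of speaker tags, not the translated text.

```python
import re

# Matches the [Speaker:Name] tags used throughout this guide.
SPEAKER_TAG = re.compile(r"\[Speaker:([^\]]+)\]")


def speaker_sequence(script: str) -> list[str]:
    """Return speaker names in the order their tags appear."""
    return SPEAKER_TAG.findall(script)


def same_structure(source: str, translated: str) -> bool:
    """True if both editions use the same speakers in the same order."""
    return speaker_sequence(source) == speaker_sequence(translated)


if __name__ == "__main__":
    english = "[Speaker:Captain] Mission control, status check. [Speaker:Pilot] Ready, Captain."
    mandarin = "[Speaker:Captain] 任务控制，状态检查。 [Speaker:Pilot] 准备好了，船长。"
    print(same_structure(english, mandarin))
```

Running this on every translated canvas before rendering is cheap, and a mismatch usually points to a tag that was translated or deleted by mistake.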

Prompt Examples

Two-character interrogation scene with Fish Audio bracket cues. The [skeptically, leaning in] and [nervous whisper] cues describe delivery in natural language — wider range than ElevenLabs' fixed tags would allow.

[Speaker:Cole] So you were home all night?
[Speaker:Mira] [pause for two seconds] Yes. With my sister.
[Speaker:Cole] [skeptically, leaning in] And she'll vouch for that?
[Speaker:Mira] [nervous whisper] She has to.
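When reviewing a script like this one, it can be useful to break it back into turns and see which cues each speaker carries. The parser below is a rough sketch that assumes the bracket syntax used in this guide and treats every bracket that is not a [Speaker:Name] tag as a delivery cue; `Turn` and `parse_script` are hypothetical names, and this is not how Fish Audio itself parses input.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    cues: list[str]
    text: str


# A turn runs from one [Speaker:Name] tag up to the next tag or end of script.
TURN = re.compile(r"\[Speaker:([^\]]+)\]\s*(.*?)(?=\[Speaker:|\Z)", re.DOTALL)
CUE = re.compile(r"\[([^\]]+)\]")


def parse_script(script: str) -> list[Turn]:
    """Split a tagged script into turns with their cues and spoken text."""
    turns = []
    for speaker, body in TURN.findall(script):
        cues = CUE.findall(body)
        # Remove cue brackets, then collapse leftover whitespace.
        text = " ".join(CUE.sub("", body).split())
        turns.append(Turn(speaker.strip(), cues, text))
    return turns


if __name__ == "__main__":
    script = (
        "[Speaker:Cole] So you were home all night? "
        "[Speaker:Mira] [pause for two seconds] Yes. With my sister."
    )
    for turn in parse_script(script):
        print(turn)
```

Grouping the parsed turns by speaker is also a convenient starting point for counting lines per character or reviewing one character's cues in isolation.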

A three-character scene rendered in Mandarin Chinese. Fish Audio uses the same voice IDs across languages, so each character's sonic identity stays identical between English and Mandarin editions.

[Speaker:Captain] 任务控制，状态检查。
[Speaker:Engineer] [excited] 所有系统正常！
[Speaker:Pilot] 准备好了，船长。
[Speaker:Captain] [confidently] 出发吧。

Parameter Tips

Multi-speaker dialogue is exclusive to S2-Pro in the Fish Audio family — S1 does not support [Speaker:Name] tags. Stay on S2-Pro for any scene with two or more characters.

Open-ended bracket cues ([conspiratorial whisper], [exhausted sigh]) give wider expressive range than ElevenLabs' fixed tags but slightly less predictable interpretation. Test before committing.

Same voice IDs across 80+ languages means a multilingual audio drama keeps consistent character voices across English, Mandarin, Japanese, etc. Build once, translate the script per edition.

Self-hosted serving is available for sensitive or pre-release dialogue content where you don't want audio leaving your infrastructure.

For animated shorts, split the rendered audio per character and feed each track to a separate Lipsync node (OmniHuman or Kling Avatar) on the same Martini canvas.

What to Expect

Fish Audio S2-Pro multi-speaker dialogue is the multilingual, open-source choice — wider language coverage, more flexible bracket-style emotion cues, self-hostable infrastructure. Trade-off vs. ElevenLabs Dialogue v3: less polished English emotional delivery, slightly less predictable bracket interpretation, consent burden sits with you. For an English-only animation where polish matters most, ElevenLabs is safer. For multilingual audio dramas, scenes with subtle emotional gradients beyond fixed tags, or self-hosted production, Fish Audio S2-Pro is worth the trade. Build the scene once on the Martini canvas; ship localized editions by duplicating the canvas and translating the script while keeping voice IDs constant.

Use Fish Audio S2-Pro on Martini

Connect Fish Audio S2-Pro with other AI models on Martini's infinite canvas. No GPU required — start free.

Try Other Models for This Task

ElevenLabs

ElevenLabs Dialogue v3

ElevenLabs Dialogue v3 is the multi-speaker endpoint of Eleven v3 — built for natural turn-taking between distinct character voices, with inline emotion tags ([whispers], [laughs], [excited], [sighs]) directing per-line delivery. Where standard Eleven v3 is one voice reading a paragraph, Dialogue v3 lets you assign different voices to different speakers and have them read a scripted scene with natural pacing, breath, and emotional response. On Martini, you build dialogue scenes as Audio nodes on the canvas — one node per character if you want fine-grained control, or a single Dialogue v3 node for the full multi-speaker generation. The 21-voice library covers the full range of character archetypes, and the cloned voice support lets you bring in custom characters when the prebuilt voices don't match.
