ElevenLabs
ElevenLabs Dialogue v3 is the multi-speaker endpoint of Eleven v3 — built for natural turn-taking between distinct character voices, with inline emotion tags ([whispers], [laughs], [excited], [sighs]) directing per-line delivery. Where standard Eleven v3 is one voice reading a paragraph, Dialogue v3 lets you assign different voices to different speakers and have them read a scripted scene with natural pacing, breath, and emotional response. On Martini, you build dialogue scenes as Audio nodes on the canvas — one node per character if you want fine-grained control, or a single Dialogue v3 node for the full multi-speaker generation. The 21-voice library covers the full range of character archetypes, and the cloned voice support lets you bring in custom characters when the prebuilt voices don't match.
Pick one voice per character from the 21-voice library before writing dialogue. The voice-character match drives believability more than any other production decision: a hardened detective wants Brian or Roger; a curious teenager wants Lily or Charlie; a wise grandmother wants Sarah or Matilda. Generate a 10-second test line per character — the same line read by 2-3 candidate voices — and listen back to back. Once committed, document the voice-to-character mapping; you'll reuse the same mapping across every scene in the project to maintain character consistency listeners notice subconsciously.
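If you'd rather script the audition than click through the canvas, a minimal sketch against the ElevenLabs text-to-speech REST endpoint could look like the following; the API key, the candidate voice IDs, and the eleven_v3 model ID are placeholders and assumptions to confirm against your own account.

import requests  # pip install requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"
TEST_LINE = "You really think I'd tell you where it is?"  # same line for every candidate

# Hypothetical voice IDs for the two candidates being auditioned for one character.
CANDIDATES = {"Brian": "VOICE_ID_BRIAN", "Roger": "VOICE_ID_ROGER"}

for name, voice_id in CANDIDATES.items():
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": API_KEY},
        json={"text": TEST_LINE, "model_id": "eleven_v3"},  # model ID assumed; confirm for your plan
    )
    resp.raise_for_status()
    with open(f"audition_{name}.mp3", "wb") as f:  # MP3 is the default output format
        f.write(resp.content)

Listen to the audition files back to back, then lock the mapping in before any scene writing starts.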
Dialogue v3 reads scripts with explicit speaker tags. Format: each line prefixed with the character name in brackets, like [Detective Cole], [Mira], [Captain]. Keep each turn short — 1-3 sentences per turn produces the most natural pacing; longer monologues feel like one voice reading a paragraph rather than a conversation. Use line breaks between turns; the model treats them as the natural breath/pause that separates speakers. For interruptions or overlapping speech, write a single character's line that ends mid-sentence with an em dash ("I was just about to —") and have the next speaker pick up immediately.
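A small helper keeps those formatting rules mechanical. This is an illustrative sketch rather than a required step; the function name and sample turns are hypothetical.

def build_dialogue_script(turns):
    """Assemble a Dialogue v3 script from (speaker, text) pairs.

    Each turn becomes one line prefixed with the bracketed character name;
    the line break between turns is what the model reads as the pause
    separating speakers. Keep each text to 1-3 sentences.
    """
    return "\n".join(f"[{speaker}] {text}" for speaker, text in turns)

script = build_dialogue_script([
    ("Detective Cole", "So you were home all night?"),
    ("Mira", "Yes. With my sister."),
    ("Detective Cole", "I was just about to —"),  # trailing dash: interrupted mid-sentence
    ("Mira", "Don't."),
])
print(script)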
Dialogue v3 inline tags work the same as Eleven v3: [whispers], [laughs], [excited], [sighs], [pause]. Place each tag immediately before the words it should affect, scoped to one speaker's line. Example: "[Mira] [whispers] Did you hear that?" makes Mira whisper the question, not the entire scene. Three tags per scene is plenty — a 60-second 4-character scene with one tag per character feels naturally inflected; ten tags makes the dialogue feel theatrical and overdirected. Reserve the strongest tags ([whispers], [angry], [terrified]) for plot beats; let the prebuilt voice character carry the everyday tone.
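To keep yourself honest about the tag budget, a quick count like the hypothetical check below flags overdirected scenes. It assumes the one-turn-per-line format above, where the first bracket on a line is the speaker and any later brackets are emotion tags.

import re

SCENE = """[Mira] [whispers] Did you hear that?
[Detective Cole] Hear what?
[Mira] [pause] Never mind."""

def count_emotion_tags(script):
    total = 0
    for line in script.splitlines():
        brackets = re.findall(r"\[[^\]]+\]", line)
        total += max(len(brackets) - 1, 0)  # skip the leading speaker tag
    return total

print(f"{count_emotion_tags(SCENE)} emotion tags in this scene")
if count_emotion_tags(SCENE) > 3:  # roughly three tags per minute of dialogue
    print("Consider cutting tags and letting the voices carry the everyday tone.")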
A Dialogue v3 Audio node outputs a single audio file with all speaker turns rendered. From there, the Martini canvas opens up the full production pipeline: connect the Audio output to a Video node for an animated short scene, route to OmniHuman or Kling Avatar for character portrait talking-head delivery (one node per character), or layer onto a video timeline as voiceover for animation. For a 4-character animated short, the standard architecture is one Image node per character (consistent portrait via Nano Banana 2 or Flux Kontext), one Lipsync node per character (OmniHuman for hero close-ups, Kling Avatar for ensemble shots), and the Dialogue v3 audio split per character feeding each Lipsync node.
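One way to do that per-character split, sketched under two assumptions: the rendered scene has been exported from the canvas as dialogue_v3_scene.mp3, and you've noted each turn's start and end times yourself (for example by scrubbing the file), since per-turn timestamps aren't guaranteed in the output.

from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

dialogue = AudioSegment.from_file("dialogue_v3_scene.mp3")

# (character, start_ms, end_ms) per turn; the times below are placeholders
# you replace after listening to the rendered scene.
turns = [
    ("Detective Cole", 0, 2400),
    ("Mira", 2400, 5100),
    ("Detective Cole", 5100, 7600),
    ("Mira", 7600, 9800),
]

tracks = {}
for character, start, end in turns:
    clip = dialogue[start:end]  # pydub slices in milliseconds
    # Pad each character's track to full scene length so every Lipsync node
    # receives audio aligned to the same timeline.
    base = tracks.get(character, AudioSegment.silent(duration=len(dialogue)))
    tracks[character] = base.overlay(clip, position=start)

for character, track in tracks.items():
    track.export(f"{character.replace(' ', '_')}_track.mp3", format="mp3")

Each exported track then feeds that character's Lipsync node (OmniHuman or Kling Avatar) while staying in sync with the rest of the scene.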
Two-character interrogation scene — Detective Cole as Brian (authoritative male), Mira as Sarah (warm female with edge). The [skeptically] tag bumps suspicion at the question; [whispers] makes Mira's answer ominous without raising the stakes too far.
[Detective Cole] So you were home all night?
[Mira] [pause] Yes. With my sister.
[Detective Cole] [skeptically] And she'll vouch for that?
[Mira] [whispers] She has to.
Three-character ensemble — Captain as Daniel (calm authority), Engineer as Charlie (energetic young), Pilot as Liam (composed professional). Each character gets one inline tag matching their archetype. Total runtime: ~12 seconds.
[Captain] Mission control, status check.
[Engineer] [excited] All systems nominal!
[Pilot] Ready when you are, Captain.
[Captain] [confidently] Let's go.
Cast voices to characters before writing the script. Voice-character match drives believability more than any production decision; document the mapping so it stays consistent across scenes.
Keep speaker turns to 1-3 sentences per turn. Long monologues lose dialogue rhythm; short turns produce natural turn-taking with breath and pause between speakers.
Use inline tags ([whispers], [laughs], [excited]) sparingly — three tags in a 60-second scene is plenty. Overdirection makes the scene feel theatrical; underdirection lets the voice character carry tone naturally.
For long scenes (>60s), consider splitting the script into multiple Dialogue v3 calls. Eleven v3 has a 5,000-character per-request limit, and pacing also benefits from natural breaks between scene beats (see the splitting sketch after these notes).
For animated shorts, split the Dialogue v3 output per character (one audio track each) and feed each track to a separate Lipsync node (OmniHuman for hero close-ups, Kling Avatar for ensemble shots).
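The long-scene split mentioned above can be mechanical too. The sketch below breaks a one-turn-per-line script into chunks that stay under the per-request character limit, cutting only at turn boundaries so no line is split mid-turn; where the break lands is still worth adjusting by hand so it falls on a scene beat rather than mid-exchange.

LIMIT = 5000  # per-request character limit cited above

def split_script(script, limit=LIMIT):
    """Split a one-turn-per-line dialogue script into limit-sized chunks."""
    chunks, current = [], ""
    for turn in script.splitlines():
        candidate = f"{current}\n{turn}" if current else turn
        if len(candidate) > limit and current:
            chunks.append(current)  # close the chunk before this turn
            current = turn
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks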
ElevenLabs Dialogue v3 produces the most polished multi-speaker English dialogue available — natural turn-taking, distinct character voices, and emotional inflection that rivals voice-acted recordings. Trade-off vs. Fish Audio S2-Pro: Dialogue v3 is more polished in English but limited to ElevenLabs' 21-voice library plus your cloned voices, and tags are a fixed set rather than open-ended natural language. Where Fish Audio uses [Speaker:Name] syntax with bracket emotion cues, Dialogue v3 uses [CharacterName] tags with the standard Eleven v3 inline emotion set. For an English animation, audio drama, or interactive prototype where polish matters most, Dialogue v3 is the safer pick. For multilingual or experimental scenes with custom emotion language, Fish Audio is worth comparing on the same canvas.
Connect ElevenLabs Dialogue v3 with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started Free