AI Voiceover Script Prompts
Pacing-aware scripts beat raw scripts. These recipes pair a copyable script with a delivery direction (tone, pace, inline emotion tags) that ElevenLabs v3 or Fish Audio S2 actually responds to. Inline tags like [whispers], [laughs], and [excited] amplify delivery on Eleven v3 — use them sparingly. For multilingual work, generate one voice per locale rather than auto-translating; cadence and emotional delivery differ across languages. Voiceover is rarely the deliverable on its own — chain into Kling Avatar or OmniHuman lip-sync for the talking-head finish.
When to use this prompt
- Producing weekly course module narration without recording sessions.
- Branding a 12-second podcast intro to play before every episode.
- Voicing a homepage explainer for a SaaS landing page.
- Localizing a campaign across 5 locales without booking native VO talent per language.
- Demoing brand voice variants (warm vs authoritative vs playful) for a stakeholder review.
Required inputs
- A voice selection (ElevenLabs voice ID or Fish Audio voice profile) consistent with brand tone.
- A pacing-aware script — short sentences, clear punctuation, no stacked clauses.
- A target duration (15s, 30s, 60s, 2min, etc.) so the script can be sized to the cut.
- Optional: inline emotion tags ([whispers], [laughs], [excited]) — used sparingly on Eleven v3.
- Consent to clone if cloning a real voice — only your own voice or licensed talent.
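The target-duration input above implies a word budget before you write a line. A minimal sizing sketch, assuming a conversational rate of roughly 150 words per minute (a rule-of-thumb assumption, not a provider spec — faster voices and languages will differ):

```python
import re

WORDS_PER_MINUTE = 150  # assumed conversational pace; tune per voice

def word_budget(target_seconds: float, wpm: int = WORDS_PER_MINUTE) -> int:
    """Approximate word count that fits the target duration."""
    return round(target_seconds * wpm / 60)

def script_fits(script: str, target_seconds: float, tolerance: float = 0.15) -> bool:
    """True if the script's word count lands within tolerance of the budget.
    Bracketed delivery/emotion tags are excluded from the count."""
    words = re.sub(r"\[[^\]]*\]", "", script).split()
    budget = word_budget(target_seconds)
    return abs(len(words) - budget) <= budget * tolerance
```

A 30-second spot budgets to about 75 words at this rate, which is roughly where the recipes below land.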
Prompt recipes
30-second product ad voiceover (warm, conversational)
Conversational ad VO with a short warm-tone direction. Eleven v3 reads the [smiles] tag as a softer delivery on the punchline.
[warm, conversational pacing, ~30 seconds] Mornings used to be chaos. Coffee, keys, somewhere half-eaten. Then [BRAND] showed up. One bottle. One pour. [smiles] Real fuel for real days. Try [BRAND] today — your morning is waiting.
~30s when delivered at a measured conversational pace.
60-second explainer voiceover (clear, neutral)
SaaS-explainer pacing. Short sentences, declarative beats, no inline tags — the clarity is in the pacing, not the delivery direction.
[clear, neutral pacing, ~60 seconds] If you have ever struggled to keep your team aligned on creative work, you are not alone. Most tools force you to switch tabs — design here, video there, audio somewhere else. [BRAND] brings every model onto one canvas. Drop a reference. Wire it into image. Wire it into video. Run them in parallel. The result lives in the same workspace where you started. No re-uploads. No re-prompting. One canvas, every model.
15-second social hook voiceover (high energy)
Social-platform hook for IG Reels or TikTok. The [excited] and [laughs] tags amplify delivery; pacing is short-sentence-driven.
[high energy, fast pacing, ~15 seconds] [excited] Wait. You can do all of this on ONE canvas? Image. Video. Audio. All from the same reference? [laughs] Okay. I am rebuilding my whole pipeline.
Best paired with on-screen text overlay — VO carries the energy.
Course module narration (calm, instructive)
Educational narration for course modules. Even pacing, no inline tags, structured enumeration that the model paces naturally.
[calm, instructive pacing, evenly delivered, ~45 seconds] In this module, we will cover three patterns. The first is multi-anchor wiring — wiring the product, the brand color, and the scene as separate canvas references. The second is per-shot model choice — picking the model that matches the shot intent rather than forcing a single tool. The third is the canvas template — saving the chain so the next campaign reuses it. Let us start with multi-anchor wiring.
Multi-character dialogue (two voices, turn-taking)
Two-voice scripted dialogue. Fish Audio S2 handles multi-speaker turn-taking cleanly with explicit speaker labels.
[two-voice dialogue, turn-taking, conversational] Voice A (warm, female, mid-30s): So, you ran the whole campaign on Martini? Voice B (clear, male, late-20s): Same canvas, image to video to audio. Took half a day. Voice A: And the model picks? Voice B: One per shot. Seedance for the hero, Runway for the lifestyle, Kling for the cinematic. Same canvas. Voice A: [impressed] That is the workflow.
Generate as one take with speaker labels; the model alternates voices.
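If you need to route each turn to a separate voice instead of generating one take, the speaker-labelled format above can be pre-parsed into turns. A sketch assuming `Voice A:` / `Voice B (description):`-style labels, as used in the recipe:

```python
import re

# Captures the speaker label; the optional parenthesised voice
# description is matched but discarded.
TURN = re.compile(r"(Voice [A-Z])(?:\s*\([^)]*\))?:\s*")

def parse_turns(script: str) -> list[tuple[str, str]]:
    """Split a labelled dialogue script into (speaker, line) turns.
    Any preamble before the first label is dropped."""
    parts = TURN.split(script)
    # split() yields: [preamble, speaker1, line1, speaker2, line2, ...]
    return [(speaker, line.strip())
            for speaker, line in zip(parts[1::2], parts[2::2])]
```

Each turn then becomes its own generation job against that speaker's voice.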
Multilingual dub (5 locales, same script)
Same brief, five locales, native-cadence VO per language. Generate one voice per locale rather than auto-translating; cadence differs.
Master script (English, ~30s): From one canvas, every model. Image to video to audio, no tab juggling. Try Martini today. Generate one voice per locale: [en-US, warm conversational], [es-ES, warm conversational], [fr-FR, warm conversational], [de-DE, warm conversational], [ja-JP, warm conversational].
Have a native speaker review the localized script before generating — auto-translated copy reads stilted.
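The one-voice-per-locale rule fans out naturally: one reviewed script per locale, one generation job each. A sketch with hypothetical job dicts — the voice IDs and field names are placeholders, not any provider's API:

```python
def locale_jobs(scripts: dict[str, str], voices: dict[str, str],
                direction: str = "warm conversational") -> list[dict]:
    """Build one generation job per locale from native-reviewed scripts.

    scripts: locale -> reviewed script text
    voices:  locale -> voice ID for that locale (placeholder IDs)
    """
    jobs = []
    for locale, text in scripts.items():
        if locale not in voices:
            raise KeyError(f"no voice configured for {locale}")
        jobs.append({
            "locale": locale,
            "voice_id": voices[locale],
            "text": f"[{locale}, {direction}] {text}",
        })
    return jobs
```

Missing a voice for a locale fails loudly rather than silently falling back to the English voice.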
Whisper / ASMR variant
Intimate brand spot or product-launch teaser. The [whispers] tag holds across the entire delivery on Eleven v3.
[whispers, slow pacing, intimate] [whispers] You barely have to do anything. Drop the reference. Let the canvas do the rest. [whispers] One canvas. Every model.
Variations
- Substitute [whispers] with [thoughtful] for a quieter conversational variant.
Podcast intro voiceover (branded host name + tagline)
Branded podcast intro reusable across every episode. Warm tone, branded host name, light pacing.
[warm, branded, ~12 seconds] Welcome to The Build Brief, with your host Maya Chen — the show where founders walk through the playbook they wish they had on day one. New episodes every Tuesday. Let us get into it.
Martini canvas workflow
Drop the script as a text node on the canvas. Pin a voice selection — either a stock ElevenLabs voice or a cloned voice (only with explicit consent for cloning). Wire script + voice into an audio node.
For Eleven v3, sprinkle inline emotion tags ([whispers], [laughs], [excited]) on key beats. Sparingly. Tags amplify delivery; over-using them flattens the read. Generate, listen, adjust the script (not the tags) if the pacing feels off.
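The "sparingly" rule can be linted before you generate. A heuristic sketch — the tag list comes from the recipes above, and the two-tag ceiling is an assumption to tune, not a model limit:

```python
import re

# Emotion tags named in the recipes above; delivery directions like
# "[warm, conversational pacing]" contain commas and are not matched.
EMOTION = re.compile(r"\[(?:whispers|laughs|excited|smiles|thoughtful|impressed)\]")

def tag_warnings(script: str, max_total: int = 2) -> list[str]:
    """Heuristic lint: flag stacked or over-dense emotion tags."""
    tags = EMOTION.findall(script)
    warnings = []
    if len(tags) > max_total:
        warnings.append(f"{len(tags)} emotion tags; trim to {max_total} or fewer")
    if re.search(EMOTION.pattern + r"\s*" + EMOTION.pattern, script):
        warnings.append("stacked emotion tags: the model averages them")
    return warnings
```

The whisper variant deliberately repeats [whispers], so raise `max_total` there; the check is a guardrail, not a rule.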
For multi-speaker dialogue or multilingual work, switch to Fish Audio S2 — it handles turn-taking and per-locale generation cleanly. Generate one voice per locale rather than auto-translating; cadence differs across languages.
Voiceover is rarely the final deliverable. Chain the audio output into a Kling Avatar or OmniHuman video node alongside a portrait reference for the talking-head finish — see /features/ai-talking-head-video and /features/ai-lip-sync. The portrait + voice + script becomes the talking-head video on one canvas.
Save the canvas as a template. The voice node, the avatar portrait, and the chain into video become the spokesperson template the next episode (or campaign) reuses end-to-end.
Variations
Warm conversational
Default brand tone — friendly, human, paced like a podcast guest. Use for ads, social spots, podcast intros.
Clear authoritative
Even pacing, declarative beats, no inline tags. Use for explainers, tutorials, B2B SaaS narration.
High-energy playful
Fast pacing, [excited] and [laughs] tags on key beats. Use for social hooks, Reels, TikTok.
Calm instructive
Slow even pacing, structured enumeration. Use for course modules, audiobook chapters, meditation content.
Whisper / intimate
Quiet [whispers] delivery throughout. Use for ASMR, premium product teasers, intimate brand spots.
Multilingual (per-locale)
One voice per locale rather than translation; native cadence and emotional delivery per language.
Frequently asked questions
- When do I use ElevenLabs vs Fish Audio?
- ElevenLabs v3 for single-voice delivery with inline emotion tags ([whispers], [laughs], [excited]) — best for ads, explainers, course narration, podcast intros. Fish Audio S2 for multi-speaker dialogue and multilingual work — best when the script has two or more voices or needs to ship in multiple locales with native cadence.
- How heavily should I use inline emotion tags?
- Sparingly. Inline tags amplify delivery — one or two tags on key beats give the read shape. Stacking tags ([whispers] [excited] [laughs]) flattens the read because the model averages them. The pacing of your script does most of the work; tags accent the read, they do not drive it.
- Can I clone a celebrity voice?
- No. Voice cloning requires explicit consent from the voice owner — only clone your own voice or a voice you have licensed permission to use. Cloning a celebrity, public figure, or unconsenting third party is both prohibited by the providers and exposes you to legal liability.
- How do I localize a script across 5 languages?
- Generate one voice per locale rather than auto-translating English audio. Have a native speaker review the localized script before generating — auto-translated copy reads stilted. Fish Audio S2 handles per-locale generation cleanly. Cadence and emotional delivery differ across languages, so cloning an English voice and translating after the fact rarely works.
- Why does the read sound flat or stilted?
- Almost always a script problem, not a model problem. Stacked clauses, long sentences, and abstract phrasing read flat in any language. Rewrite for short sentences, clear punctuation, and concrete nouns. Pacing-aware writing beats inline tags every time.
- How do I turn voiceover into a talking-head video?
- Generate the voiceover, generate or anchor a portrait image (see /prompts/image/consistent-character-prompts), then wire both into a Kling Avatar or OmniHuman video node — see /features/ai-talking-head-video and /features/ai-lip-sync. The same canvas hosts the script, voice, portrait, and final video.
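The short-sentence rule from the flat-read answer above is also checkable before generating. A heuristic sketch — the 18-word threshold is an assumption, not a model constraint:

```python
import re

def long_sentences(script: str, max_words: int = 18) -> list[str]:
    """Return sentences likely to read flat because they run long."""
    clean = re.sub(r"\[[^\]]*\]", "", script)  # drop delivery/emotion tags
    sentences = [s.strip() for s in re.split(r"[.!?]+", clean) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]
```

Anything this flags is a candidate for splitting into two declarative beats rather than patching with tags.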
Try this prompt on Martini
Copy a recipe above, drop it into a node, and run it inside a full canvas workflow.