AI Voiceover Script Prompts
Pacing-aware scripts beat raw scripts. These recipes pair a copyable script with a delivery direction (tone, pace, inline emotion tags) that ElevenLabs v3 or Fish Audio S2 actually responds to. Inline tags like [whispers], [laughs], and [excited] amplify delivery on Eleven v3 — use them sparingly. For multilingual work, generate one voice per locale rather than auto-translating; cadence and emotional delivery differ across languages. Voiceover is rarely the deliverable on its own — chain into Kling Avatar or OmniHuman lip-sync for the talking-head finish.
When to use this prompt
- Producing weekly course module narration without recording sessions.
- Branding a 12-second podcast intro to play before every episode.
- Voicing a homepage explainer for a SaaS landing page.
- Localizing a campaign across 5 locales without booking native VO talent per language.
- Demoing brand voice variants (warm vs authoritative vs playful) for a stakeholder review.
Required inputs
- A voice selection (ElevenLabs voice ID or Fish Audio voice profile) consistent with brand tone.
- A pacing-aware script — short sentences, clear punctuation, no stacked clauses.
- A target duration (15s, 30s, 60s, 2min, etc.) so the script can be sized to the cut.
- Optional: inline emotion tags ([whispers], [laughs], [excited]) — used sparingly on Eleven v3.
- Consent to clone if cloning a real voice — only your own voice or licensed talent.
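The target-duration input above implies a word budget before you write a line. A minimal sizing sketch, assuming a conversational rate of roughly 150 words per minute (a rule-of-thumb assumption, not a provider spec — faster voices and languages will differ):

```python
import re

WORDS_PER_MINUTE = 150  # assumed conversational pace; tune per voice

def word_budget(target_seconds: float, wpm: int = WORDS_PER_MINUTE) -> int:
    """Approximate word count that fits the target duration."""
    return round(target_seconds * wpm / 60)

def script_fits(script: str, target_seconds: float, tolerance: float = 0.15) -> bool:
    """True if the script's word count lands within tolerance of the budget.
    Bracketed delivery/emotion tags are excluded from the count."""
    words = re.sub(r"\[[^\]]*\]", "", script).split()
    budget = word_budget(target_seconds)
    return abs(len(words) - budget) <= budget * tolerance
```

A 30-second spot budgets to about 75 words at this rate, which is roughly where the recipes below land.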
Prompt recipes
30-second product ad voiceover (warm, conversational)
Conversational ad VO with a short warm-tone direction. Eleven v3 reads the [smiles] tag as a softer delivery on the punchline.
[warm, conversational pacing, ~30 seconds] Mornings used to be chaos. Coffee, keys, somewhere half-eaten. Then [BRAND] showed up. One bottle. One pour. [smiles] Real fuel for real days. Try [BRAND] today — your morning is waiting.
~30s when delivered at a measured conversational pace.
60-second explainer voiceover (clear, neutral)
SaaS-explainer pacing. Short sentences, declarative beats, no inline tags — the clarity is in the pacing, not the delivery direction.
[clear, neutral pacing, ~60 seconds] If you have ever struggled to keep your team aligned on creative work, you are not alone. Most tools force you to switch tabs — design here, video there, audio somewhere else. [BRAND] brings every model onto one canvas. Drop a reference. Wire it into image. Wire it into video. Run them in parallel. The result lives in the same workspace where you started. No re-uploads. No re-prompting. One canvas, every model.
15-second social hook voiceover (high energy)
Social-platform hook for IG Reels or TikTok. The [excited] and [laughs] tags amplify delivery; pacing is short-sentence-driven.
[high energy, fast pacing, ~15 seconds] [excited] Wait. You can do all of this on ONE canvas? Image. Video. Audio. All from the same reference? [laughs] Okay. I am rebuilding my whole pipeline.
Best paired with on-screen text overlay — VO carries the energy.
Course module narration (calm, instructive)
Educational narration for course modules. Even pacing, no inline tags, structured enumeration that the model paces naturally.
[calm, instructive pacing, evenly delivered, ~45 seconds] In this module, we will cover three patterns. The first is multi-anchor wiring — wiring the product, the brand color, and the scene as separate canvas references. The second is per-shot model choice — picking the model that matches the shot intent rather than forcing a single tool. The third is the canvas template — saving the chain so the next campaign reuses it. Let us start with multi-anchor wiring.
Multi-character dialogue (two voices, turn-taking)
Two-voice scripted dialogue. Fish Audio S2 handles multi-speaker turn-taking cleanly with explicit speaker labels.
[two-voice dialogue, turn-taking, conversational] Voice A (warm, female, mid-30s): So, you ran the whole campaign on Martini? Voice B (clear, male, late-20s): Same canvas, image to video to audio. Took half a day. Voice A: And the model picks? Voice B: One per shot. Seedance for the hero, Runway for the lifestyle, Kling for the cinematic. Same canvas. Voice A: [impressed] That is the workflow.
Generate as one take with speaker labels; the model alternates voices.
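If you need to route each turn to a separate voice instead of generating one take, the speaker-labelled format above can be pre-parsed into turns. A sketch assuming `Voice A:` / `Voice B (description):`-style labels, as used in the recipe:

```python
import re

# Captures the speaker label; the optional parenthesised voice
# description is matched but discarded.
TURN = re.compile(r"(Voice [A-Z])(?:\s*\([^)]*\))?:\s*")

def parse_turns(script: str) -> list[tuple[str, str]]:
    """Split a labelled dialogue script into (speaker, line) turns.
    Any preamble before the first label is dropped."""
    parts = TURN.split(script)
    # split() yields: [preamble, speaker1, line1, speaker2, line2, ...]
    return [(speaker, line.strip())
            for speaker, line in zip(parts[1::2], parts[2::2])]
```

Each turn then becomes its own generation job against that speaker's voice.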
Multilingual dub (5 locales, same script)
Same brief, five locales, native-cadence VO per language. Generate one voice per locale rather than auto-translating; cadence differs.
Master script (English, ~30s): From one canvas, every model. Image to video to audio, no tab juggling. Try Martini today. Generate one voice per locale: [en-US, warm conversational], [es-ES, warm conversational], [fr-FR, warm conversational], [de-DE, warm conversational], [ja-JP, warm conversational].
Have a native speaker review the localized script before generating — auto-translated copy reads stilted.
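The one-voice-per-locale rule fans out naturally: one reviewed script per locale, one generation job each. A sketch with hypothetical job dicts — the voice IDs and field names are placeholders, not any provider's API:

```python
def locale_jobs(scripts: dict[str, str], voices: dict[str, str],
                direction: str = "warm conversational") -> list[dict]:
    """Build one generation job per locale from native-reviewed scripts.

    scripts: locale -> reviewed script text
    voices:  locale -> voice ID for that locale (placeholder IDs)
    """
    jobs = []
    for locale, text in scripts.items():
        if locale not in voices:
            raise KeyError(f"no voice configured for {locale}")
        jobs.append({
            "locale": locale,
            "voice_id": voices[locale],
            "text": f"[{locale}, {direction}] {text}",
        })
    return jobs
```

Missing a voice for a locale fails loudly rather than silently falling back to the English voice.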
Whisper / ASMR variant
Intimate brand spot or product-launch teaser. The [whispers] tag holds across the entire delivery on Eleven v3.
[whispers, slow pacing, intimate] [whispers] You barely have to do anything. Drop the reference. Let the canvas do the rest. [whispers] One canvas. Every model.
Variations
- Substitute [whispers] with [thoughtful] for a quieter conversational variant.
Podcast intro voiceover (branded host name + tagline)
Branded podcast intro reusable across every episode. Warm tone, branded host name, light pacing.
[warm, branded, ~12 seconds] Welcome to The Build Brief, with your host Maya Chen — the show where founders walk through the playbook they wish they had on day one. New episodes every Tuesday. Let us get into it.
Martini canvas workflow
Drop the script as a text node on the canvas. Pin a voice selection — either a stock ElevenLabs voice or a cloned voice (only with explicit consent for cloning). Wire script + voice into an audio node.
For Eleven v3, sprinkle inline emotion tags ([whispers], [laughs], [excited]) on key beats. Sparingly. Tags amplify delivery; over-using them flattens the read. Generate, listen, adjust the script (not the tags) if the pacing feels off.
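The "sparingly" rule can be linted before you generate. A heuristic sketch — the tag list comes from the recipes above, and the two-tag ceiling is an assumption to tune, not a model limit:

```python
import re

# Emotion tags named in the recipes above; delivery directions like
# "[warm, conversational pacing]" contain commas and are not matched.
EMOTION = re.compile(r"\[(?:whispers|laughs|excited|smiles|thoughtful|impressed)\]")

def tag_warnings(script: str, max_total: int = 2) -> list[str]:
    """Heuristic lint: flag stacked or over-dense emotion tags."""
    tags = EMOTION.findall(script)
    warnings = []
    if len(tags) > max_total:
        warnings.append(f"{len(tags)} emotion tags; trim to {max_total} or fewer")
    if re.search(EMOTION.pattern + r"\s*" + EMOTION.pattern, script):
        warnings.append("stacked emotion tags: the model averages them")
    return warnings
```

The whisper variant deliberately repeats [whispers], so raise `max_total` there; the check is a guardrail, not a rule.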
For multi-speaker dialogue or multilingual work, switch to Fish Audio S2 — it handles turn-taking and per-locale generation cleanly. Generate one voice per locale rather than auto-translating; cadence differs across languages.
Voiceover is rarely the final deliverable. Chain the audio output into a Kling Avatar or OmniHuman video node alongside a portrait reference for the talking-head finish — see /features/ai-talking-head-video and /features/ai-lip-sync. The portrait + voice + script becomes the talking-head video on one canvas.
Save the canvas as a template. The voice node, the avatar portrait, and the chain into video become the spokesperson template the next episode (or campaign) reuses end-to-end.
Variations
Warm conversational
Default brand tone — friendly, human, paced like a podcast guest. Use for ads, social spots, podcast intros.
Clear authoritative
Even pacing, declarative beats, no inline tags. Use for explainers, tutorials, B2B SaaS narration.
High-energy playful
Fast pacing, [excited] and [laughs] tags on key beats. Use for social hooks, Reels, TikTok.
Calm instructive
Slow even pacing, structured enumeration. Use for course modules, audiobook chapters, meditation content.
Whisper / intimate
Quiet [whispers] delivery throughout. Use for ASMR, premium product teasers, intimate brand spots.
Multilingual (per-locale)
One voice per locale rather than translation; native cadence and emotional delivery per language.
Frequently asked questions
- When do I use ElevenLabs vs Fish Audio?
- ElevenLabs v3 for single-voice delivery with inline emotion tags ([whispers], [laughs], [excited]) — best for ads, explainers, course narration, podcast intros. Fish Audio S2 for multi-speaker dialogue and multilingual work — best when the script has two or more voices or needs to ship in multiple locales with native cadence.
- How heavily should I use inline emotion tags?
- Sparingly. Inline tags amplify delivery — one or two tags on key beats give the read shape. Stacking tags ([whispers] [excited] [laughs]) flattens the read because the model averages them. The pacing of your script does most of the work; tags accent the read, they do not drive it.
- Can I clone a celebrity voice?
- No. Voice cloning requires explicit consent from the voice owner — only clone your own voice or a voice you have licensed permission to use. Cloning a celebrity, public figure, or unconsenting third party is both prohibited by the providers and exposes you to legal liability.
- How do I localize a script across 5 languages?
- Generate one voice per locale rather than auto-translating English audio. Have a native speaker review the localized script before generating — auto-translated copy reads stilted. Fish Audio S2 handles per-locale generation cleanly. Cadence and emotional delivery differ across languages, so cloning an English voice and translating after the fact rarely works.
- Why does the read sound flat or stilted?
- Almost always a script problem, not a model problem. Stacked clauses, long sentences, and abstract phrasing read flat in any language. Rewrite for short sentences, clear punctuation, and concrete nouns. Pacing-aware writing beats inline tags every time.
- How do I turn voiceover into a talking-head video?
- Generate the voiceover, generate or anchor a portrait image (see /prompts/image/consistent-character-prompts), then wire both into a Kling Avatar or OmniHuman video node — see /features/ai-talking-head-video and /features/ai-lip-sync. The same canvas hosts the script, voice, portrait, and final video.
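The short-sentence rule from the flat-read answer above is also checkable before generating. A heuristic sketch — the 18-word threshold is an assumption, not a model constraint:

```python
import re

def long_sentences(script: str, max_words: int = 18) -> list[str]:
    """Return sentences likely to read flat because they run long."""
    clean = re.sub(r"\[[^\]]*\]", "", script)  # drop delivery/emotion tags
    sentences = [s.strip() for s in re.split(r"[.!?]+", clean) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]
```

Anything this flags is a candidate for splitting into two declarative beats rather than patching with tags.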
Try this prompt on Martini
Copy a recipe above, drop it into a node, and run it inside a full canvas workflow.