Audio
AI Voiceover Generator
You wrote the script and the video is locked — now you need narration that lands inside the cut, not in another tool. Martini generates voiceover with ElevenLabs and Fish Audio in audio nodes that wire directly into video, lip-sync, and sequence builder nodes. Voice, video, and timeline live on one canvas, end to end.
What this feature solves
Voiceover used to mean a booth, a voice actor, and a directed session — for thirty seconds of polished narration. AI voiceover collapses cost and time dramatically, but most tools end at the audio file. You generate the take, download the WAV, drop it into your NLE, and rebuild the connection between voice and picture by hand. For an explainer or course series with dozens of clips, that handoff burns hours per episode.
The integration gap matters even more for spokesperson and dialogue work. Voice has to drive lip-sync; lip-sync has to land on the right portrait; portrait has to match the brand. Without a canvas where audio chains directly into video and into sync, every spokesperson cut becomes a multi-tool relay race with identity drift at every handoff.
And there is the multi-language reality. International campaigns need the same script in five voices, the same persona in five locales, and consistent delivery quality across all of them. Doing that in five separate tabs of a TTS tool is brutal. Without a workflow that fans script into multiple voices and chains each into downstream video, multi-language production stays slow and inconsistent.
Why Martini is different
Audio is a first-class node type on Martini. Drop a script into a text node, wire it into an ElevenLabs or Fish Audio node, and the voice take generates as a real audio asset on the canvas. The output port connects directly into video nodes for lip-sync, into sequence builder for narration tracks, or into export for standalone audio deliverables. No download-and-re-import cycle.
The voice-to-video chain is where Martini becomes more than a TTS tool. Generate the voiceover, chain it into a Kling Avatar lip-sync node fed by your spokesperson portrait, and the spokesperson speaks the line on camera with locked identity. The same chain handles narrative dialogue, dub work, and explainer hosting — all with one upstream character and a fanout of voice takes per language or per episode.
Multi-language and multi-voice production runs as fanout. One script, five ElevenLabs nodes with different voices, five lip-sync video nodes downstream — five fully synced spokesperson cuts from one canvas. The character holds, the timing locks, and the editorial team has a workflow for global content production that scales beyond what tab-based tools can deliver.
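The fanout pattern above can be sketched as plain data: one script in, one voiceover job per locale out. This is an illustrative Python sketch, not Martini's API — the locale-to-voice mapping and the voice IDs are placeholder assumptions you would replace with your own.

```python
# Sketch: fan one script into per-language voiceover job specs.
# The voice IDs below are placeholders, not real voice identifiers.
SCRIPT = "Meet the new dashboard: one view for every metric that matters."

VOICES = {
    "en": "voice-en-placeholder",
    "de": "voice-de-placeholder",
    "fr": "voice-fr-placeholder",
    "ja": "voice-ja-placeholder",
    "es": "voice-es-placeholder",
}

def fan_out(script: str, voices: dict[str, str]) -> list[dict]:
    """One script in, one voiceover job spec out per locale."""
    return [
        {"locale": locale, "voice_id": voice_id, "text": script}
        for locale, voice_id in voices.items()
    ]

jobs = fan_out(SCRIPT, VOICES)
print(len(jobs))  # one job per locale
```

On the canvas, each job spec corresponds to one audio node wired to the same upstream text node, with a matching lip-sync node downstream.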
Common use cases
Narration for explainer and product videos
Generate clean, professional narration for explainer videos and chain it directly into the cut without a separate audio session.
Course narration at episode scale
Voice an entire course or curriculum with one consistent narrator and ship modules without booking talent per episode.
Spokesperson dialogue with lip-sync
Chain ElevenLabs voice into a lip-sync video node so your spokesperson speaks the line on camera with locked identity.
Multi-language dubs for global campaigns
Fan one script into multiple language voices and chain into matching lip-sync clips for localized delivery.
Documentary and editorial narration
Produce documentary-grade narration that sits inside the cut alongside b-roll, interviews, and music.
Internal training and corporate video
Build training video at scale with consistent corporate narration without a recurring studio booking line item.
Recommended model stack
elevenlabs (audio)
Industry-leading voice quality and language coverage for narration and dialogue.
fish-audio-s2 (audio)
High-quality voice synthesis with strong multi-language coverage.
kling-avatar (video)
Pair narration with lip-synced spokesperson cuts when video matches voice.
nano-banana-2 (image)
Generate the upstream spokesperson portrait that drives the lip-sync chain.
hailuo (video)
Fast portrait-to-talk iterations for spokesperson voiceover work.
How the workflow works in Martini
1. Drop the script into a text node
Hold the dialogue or narration script in a text node on the canvas. Keep lines natural and within typical spoken cadence — avoid stacked clauses that confuse pacing.
2. Pick a voice and generate the take
Wire the script into an ElevenLabs or Fish Audio node. Choose a voice that matches the brand or character — preview a few before committing.
3. Review the take and iterate
Listen end to end. Adjust pacing, emphasis, or voice selection on the same node and re-run. Audio nodes iterate in seconds, so test multiple voices before locking.
4. Chain the voice forward
Connect the audio output to its destination — a lip-sync video node for spokesperson cuts, a sequence builder track for narration, or an export node for standalone audio.
5. Run lip-sync if needed
For talking-head cuts, wire the voice and the spokesperson portrait into a Kling Avatar lip-sync node. The mouth shapes drive from the audio, identity holds from the portrait.
6. Export through NLE export
Push the finished sequence into Premiere, DaVinci, or Final Cut. Voice and video are aligned and the editor takes over for trim, color, and final mix.
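For readers who also script voiceover outside the canvas, step 2 maps onto a single call to ElevenLabs' public text-to-speech REST endpoint. The endpoint path and `xi-api-key` header below follow ElevenLabs' documented API; `VOICE_ID`, `API_KEY`, and the model choice are placeholders to swap for your own values. This is a sketch, not Martini's internal mechanism.

```python
# Sketch: assemble one ElevenLabs text-to-speech request.
# Built as a pure request-builder so it can be inspected without a network call.
def build_tts_request(voice_id: str, text: str, api_key: str) -> dict:
    """Assemble URL, headers, and JSON body for one voiceover take."""
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {
            "xi-api-key": api_key,
            "Content-Type": "application/json",
        },
        "json": {
            "text": text,
            "model_id": "eleven_multilingual_v2",  # multilingual model
        },
    }

req = build_tts_request("VOICE_ID", "Welcome to the product tour.", "API_KEY")
# To actually generate audio (the response body is the audio bytes):
#   import requests
#   audio = requests.post(req["url"], headers=req["headers"], json=req["json"]).content
```

On the canvas, the audio node handles this call for you and keeps the resulting take wired to its downstream video and sequence nodes.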
Example workflow
A SaaS company is launching a 4-episode product course narrated by their existing brand voice. They draft the four episode scripts as text nodes on one canvas. Each script wires into an ElevenLabs node using the brand voice ID. The four voiceover takes generate in minutes — pacing is tight, brand voice is consistent across episodes. For the cold-open hero shot of each episode, the voice chains into a Kling Avatar lip-sync node fed by the brand spokesperson portrait, so the host says the cold-open line on camera. The four episodes route into the sequence builder with their respective narration and hero cuts in place. NLE export ships the four cuts to Premiere for finishing. The course ships in days instead of weeks, and the brand voice is identical across every episode.
Tips and common mistakes
Tips
- Preview multiple voices before committing. The voice that sounds right in isolation can clash with the visual brand.
- Keep narration lines under 8-10 seconds for the cleanest delivery and easiest pacing edits.
- Use the same voice ID across an entire campaign or course series. Consistency is the whole point.
- For lip-sync chains, generate the voice take first and review for delivery before chaining into video.
- Save the voice + portrait combination as a template for recurring spokesperson work.
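The 8-10 second pacing tip above can be checked before generating anything. This is a rough sketch assuming a typical narration pace of about 150 spoken words per minute — the rate is an assumption; adjust it to your narrator's delivery.

```python
# Rough pre-flight check: flag narration lines likely to run long.
# Assumes ~150 spoken words per minute, a common narration pace.
WORDS_PER_MINUTE = 150

def estimated_seconds(line: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimate the spoken duration of one narration line."""
    return len(line.split()) / wpm * 60

def flag_long_lines(script_lines: list[str], limit: float = 10.0) -> list[str]:
    """Return lines whose estimated delivery exceeds the limit."""
    return [ln for ln in script_lines if estimated_seconds(ln) > limit]

script = [
    "Meet the new dashboard.",
    "It pulls every metric your team tracks into one view, "
    "refreshes in real time, keeps a full audit history, supports "
    "custom alerts, and exports to every format your stakeholders ask for.",
]
print(flag_long_lines(script))  # the second line runs long
```

Splitting any flagged line into two shorter ones usually fixes both the pacing and the downstream edit points.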
Common mistakes
- Writing voiceover scripts that read like written prose. Spoken language has shorter sentences and natural rhythm.
- Switching voice IDs between episodes. The audience hears the change immediately and trust drops.
- Skipping pacing review. A flat or rushed take ruins the cut even with perfect lip-sync downstream.
- Generating audio in one tool and dragging WAVs into another. The chain is the value — keep voice, video, and sync on the canvas.
- Using a stylized character voice for a corporate brand video. Match the voice to the brand, not the trend.
Related features
AI Lip Sync — Sync Voice and Dialogue to Portraits and Video
Sync voiceovers, dialogue, and music to portraits and video on Martini using lip-sync models.
AI Character Consistency Across Images and Video
Keep a subject consistent across image and video generations on Martini using reference workflows.
AI Sound Effects Generator — SFX for Scenes and Product Videos
Skip the SFX library hunt — generate scene-matching sound effects on Martini's canvas with ElevenLabs SFX and chain into video and voice workflows.
AI Music Generator — Background Music for AI Video
Generate background music and soundtracks for AI video projects on Martini.
AI Voice Cloning — Clone or Design Voices for Production
Clone a voice from 30 seconds of reference audio on Martini's canvas — ElevenLabs, Fish Audio, chained directly into video, lip-sync, and sequence.
Frequently asked questions
Which voice model should I use for brand work?
ElevenLabs is the default for English and most major language brand work — voice quality is the highest in the registry, and voice cloning lets you build a consistent brand voice ID. Fish Audio is the choice when your script needs strong Asian language coverage or a different vocal character that complements ElevenLabs.
Can I clone my own brand voice?
ElevenLabs supports voice cloning from a reference recording — drop the cloned voice ID into the audio node and reuse it across every project. This is how brands lock a single voice across an entire content library without having to re-record.
How does this connect to lip-sync?
Chain the audio node output directly into a lip-sync-aware video node like Kling Avatar, alongside your spokesperson portrait. The mouth shapes drive from the audio waveform, the identity holds from the portrait, and the result is a synced spokesperson cut without a separate dub workflow.
How do I handle multiple languages?
Fan the script into multiple audio nodes — one per language — and chain each one into its own downstream lip-sync or sequence node. The same canvas produces five localized cuts from one script, with consistent identity and pacing across all of them.
Will the voiceover land cleanly in my NLE?
Yes. NLE export bundles the voice tracks alongside the video sequence at standard sample rates and bit depths. Premiere, DaVinci, and Final Cut all import the audio aligned to the cut — your editor mixes from there without re-syncing.
How does this compare to running ElevenLabs directly?
ElevenLabs alone gives you the voice file. Martini gives you the voice file inside a chain — connected to script, video, lip-sync, sequence, and NLE export on one canvas. For standalone audio deliverables, ElevenLabs direct is fine. For voice that has to drive a video workflow, the canvas saves hours per project.
Build it on the canvas
Open Martini and wire this workflow up in minutes. Free to start — no card required.