How to Create an AI Voiceover Video on Martini
Audio narration plus video generation workflow on Martini.
Key takeaways
- A great AI voiceover video is the product of three inputs working together — script written for voice, voice generated or cloned, video that the voice syncs to.
- Use ElevenLabs (or Fish Audio S2 for more cost-sensitive runs) for voice generation, Kling Avatar for lip-synced talking-head video, and Seedance 2 or Veo for B-roll if the piece is narrated over visuals rather than spoken on-camera.
- Lip-sync is now a single-step operation when wired correctly — pass the character image, the audio, and a short motion prompt to Kling Avatar and the take is sync-ready.
- Voice consistency across episodes matters more than voice perfection on any single take — clone or pick a voice once, lock it as canonical, use it across every script.
- On the Martini canvas, the entire chain (script, voice, video, lip-sync, export) runs in one workspace with shared references — no download-and-re-upload between tools.
What makes an AI voiceover video work
A voiceover video has three load-bearing inputs: the script, the voice, and the video. The script is written for the ear, not the eye. The voice is generated or cloned to fit the brand or character. The video is either a talking-head shot of a character lip-synced to the audio, or a sequence of B-roll shots that the voiceover plays over. Get all three right and the video reads as professionally produced; let any one of them slip and the whole piece feels off.
The mistake most teams make is treating voiceover as a finishing step — generate the video first, then add a voiceover later. The correct order is the reverse. Write the script first, because the script determines the pacing of everything downstream. Generate the voice next, because the voice locks the timing. Generate or assemble the video last, because the visuals serve the voice.
On the Martini canvas, this entire chain runs as one connected pipeline. The script node holds the text. The voice node generates audio from the script. The video node (Kling Avatar for talking-head, or Seedance 2 / Veo for B-roll) generates visuals. The NLE export node assembles everything. Asset continuity across nodes means the voice from the audio node is wired directly into the lip-sync video node — no download, no re-upload, no inconsistency between what the voice node produced and what the video node received.
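The wiring is visual on the canvas, but it helps to picture the chain as a plain dependency graph. A minimal sketch in Python; the node names and fields here are illustrative stand-ins, not a Martini API:

```python
# Illustrative stand-ins only; Martini's canvas is visual, and these node
# names and fields are not a real Martini API.
pipeline = {
    "script":   {"kind": "text",  "content": "voiceover script goes here"},
    "voice":    {"kind": "audio", "model": "ElevenLabs",   "inputs": ["script"]},
    "portrait": {"kind": "image", "model": "Nano Banana 2"},
    "video":    {"kind": "video", "model": "Kling Avatar", "inputs": ["portrait", "voice"]},
    "music":    {"kind": "audio", "model": "music node"},
    "export":   {"kind": "nle",   "inputs": ["video", "music"]},
}

# The property that matters: "voice" feeds "video" directly, so the audio
# the lip-sync node receives is exactly what the voice node produced.
for name, node in pipeline.items():
    print(f'{name:>8} <- {node.get("inputs", [])}')
```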
Step 1 — Write the script for voice, not for reading
Voiceover scripts are read aloud. They are not blog posts. The grammar that works on the page (long sentences with multiple clauses, parenthetical asides, formal punctuation) does not work in the ear. The grammar that works for voice is short sentences, conversational rhythm, deliberate pauses, and one idea per beat. Read every draft of the script aloud at speed; if you stumble, the voice model will stumble too.
Length matters. A thirty-second video typically holds about seventy-five to ninety words of voiceover at a natural reading pace. A sixty-second video holds about a hundred and fifty to a hundred and eighty words. Plan the script length to fit the target duration; do not write more than fits and hope to speed up the voice in post.
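Those windows fall out of a natural narration pace of roughly two and a half to three words per second. A quick sanity check for any target duration, as a sketch (the pace range is the only assumption):

```python
def target_word_count(duration_seconds: float,
                      pace_wps: tuple[float, float] = (2.5, 3.0)) -> tuple[int, int]:
    """Word-count window for a voiceover read at a natural pace."""
    low, high = pace_wps
    return round(duration_seconds * low), round(duration_seconds * high)

print(target_word_count(30))  # (75, 90)   -> thirty-second video
print(target_word_count(60))  # (150, 180) -> sixty-second video
```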
Pace matters. Bake in pauses by writing them into the script — ellipses for short pauses, paragraph breaks for breath beats, em-dashes for thought-shifts. ElevenLabs and Fish Audio S2 both respect these markers in the input text and produce more natural-feeling pacing as a result. A script with no pacing markers will produce a voiceover that feels rushed even if the duration is right.
Step 2 — Choose or clone the voice
Voice is the second input. The two production patterns are: pick a voice from the ElevenLabs (or Fish Audio S2) library that fits the brand or character, or clone a voice from a sample audio source (with consent and rights cleared). Either pattern is valid; the decision is brand-driven. For a brand that wants a recognizable spokesperson voice across many videos, cloning is the structural choice. For a brand that just needs a high-quality narrator voice on individual videos, picking from the library is faster and meets the need.
On the Martini canvas, voice generation lives in an audio node — drop ElevenLabs or Fish Audio S2, paste the script, select the voice, generate. ElevenLabs is the deeper voice tool for cloning and emotional range; Fish Audio S2 is the more cost-sensitive choice for high-volume voiceover where the script is straightforward and the voice does not need to carry strong emotional performance. For most production work, ElevenLabs is the default; reserve Fish Audio S2 for cost-driven volume runs.
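Outside the canvas, the same generation step maps onto ElevenLabs' public REST API. A hedged sketch using the text-to-speech endpoint as documented at the time of writing (verify against current docs; the API key, voice ID, and model ID are placeholders):

```python
import requests

ELEVEN_API_KEY = "YOUR_API_KEY"    # placeholder
VOICE_ID = "YOUR_BRAND_VOICE_ID"   # the locked canonical voice

# Pacing markers (ellipses, short sentences) stay in the text itself.
script = "Meet the canvas... script, voice, video. One workspace."

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": ELEVEN_API_KEY},
    json={"text": script, "model_id": "eleven_multilingual_v2"},
    timeout=60,
)
resp.raise_for_status()

# The endpoint returns raw audio bytes.
with open("voiceover.mp3", "wb") as f:
    f.write(resp.content)
```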
Voice consistency across episodes matters more than voice perfection on any single take. Lock the canonical voice once — pick it from the library or clone it — and use it across every script in the series. Pin a sample audio clip in the canvas version tray as the reference for what the voice sounds like at brand spec. Every new generation references this canonical voice, which keeps the audio side of the brand consistent over hundreds of episodes.
Step 3 — Generate the video
The video step depends on the format. For a talking-head voiceover (the character is speaking on camera), the video and lip-sync collapse into a single Kling Avatar node — see Step 4. For a narrated-over-B-roll voiceover (the voice plays over visuals of something else), generate the B-roll shots independently, using the video models that fit each shot type, and assemble in the NLE export node downstream.
For talking-head, generate the character portrait first with Nano Banana 2. Pin the canonical front view. This still image becomes the input to the lip-sync step. For narrated-over-B-roll, identify the visual moments you want — the establishing shot, the product shot, the demonstration, the closing beat — and generate each with the appropriate model (Seedance 2 for cinematic image-to-video, Veo for environmental wides, Runway Gen4 for editor-grade kinetic shots).
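For narrated-over-B-roll it pays to plan the shots as data before generating anything: one entry per voiceover beat, tagged with the model that fits the shot type. A sketch of that planning convention (the beats and model picks follow the guidance above; the structure itself is a convention for planning, not a Martini format):

```python
# One entry per voiceover beat; model picks follow the shot-type guidance above.
shot_list = [
    {"beat": "establishing",  "vo": "Every brand has a story.", "model": "Veo"},
    {"beat": "product",       "vo": "Ours starts with this.",   "model": "Seedance 2"},
    {"beat": "demonstration", "vo": "Here is how it works.",    "model": "Runway Gen4"},
    {"beat": "closing",       "vo": "Try it on the canvas.",    "model": "Seedance 2"},
]

for shot in shot_list:
    print(f'{shot["beat"]:>13}: {shot["model"]:<11} <- "{shot["vo"]}"')
```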
The cardinal rule for narrated-over-B-roll: the visuals serve the voice. Cut the B-roll to the rhythm of the voiceover. Plan the shots based on what the voice is saying at each beat. The voice is the spine; the visuals are the supporting cast. This is the inverse of how silent video is planned and it changes the production order.
Step 4 — Lip-sync the voice to the video
For talking-head video, lip-sync is the step that turns "character image plus audio plus motion prompt" into a credible take. Kling Avatar is the node that handles this on the Martini canvas — wire the character image (from Nano Banana 2), the audio (from ElevenLabs or Fish Audio S2), and a short motion prompt covering body language and gaze. The output is a video where the character's mouth, jaw, and micro-expressions are synced to the audio at production-grade quality.
The motion prompt for Kling Avatar should focus on body language rather than mouth movement — Avatar handles the mouth automatically. Specify framing, gesture pattern, gaze direction, and emphasis behavior. For example: "Subtle gestures with the hands on emphasis points, eye contact with camera throughout, slight head tilt at the end of each sentence, medium close-up framing, soft three-point lighting." That is the kind of direction Avatar will execute.
For longer scripts, render the lip-sync in segments of one or two sentences each rather than one long take. Avatar handles short blocks more reliably than long monologues, and segmenting also gives you cleaner cut points if you need to edit the script later. Wire each segment into the NLE export node in order; the assembly is seamless because the voice and the character are consistent across segments.
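One way to produce those one-to-two-sentence blocks is to split the script on sentence boundaries and pair the sentences up. A minimal sketch (the regex split is naive; scripts with abbreviations or decimals need a smarter splitter):

```python
import re

def segment_script(script: str, sentences_per_segment: int = 2) -> list[str]:
    """Split a voiceover script into short blocks for per-segment lip-sync renders."""
    # Naive split after ., !, or ? followed by whitespace; fine for clean scripts.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_segment])
        for i in range(0, len(sentences), sentences_per_segment)
    ]

script = "Meet the canvas. It runs the whole pipeline. Script, voice, video. One workspace."
for i, segment in enumerate(segment_script(script), start=1):
    print(f"segment {i}: {segment}")
```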
Step 5 — Layer music and sound design
A voiceover video without music feels naked. Even quiet ambient music underneath the voice changes how the piece reads. On the Martini canvas, drop a music node alongside the voice node and generate or pull a music track that fits the brand sonic identity. For sound effects (a button click, an ambient room tone, a transitional whoosh), drop SFX nodes for each.
The mix matters. Keep the music underneath the voice — typically the voice rides ten to fifteen decibels above the music for clarity. Layer SFX at moments that punctuate the voiceover rather than competing with it. Do not over-layer; voiceover videos read best when the audio mix is clean and the voice is the priority.
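If the mix ever moves to a code-based finishing pass, that ten-to-fifteen-decibel spread is a one-line gain change. A sketch with pydub (assumes pydub and ffmpeg are installed and that voiceover.mp3 and music.mp3 exist):

```python
from pydub import AudioSegment

voice = AudioSegment.from_file("voiceover.mp3")
music = AudioSegment.from_file("music.mp3")

# Duck the music bed about 12 dB under the voice (the 10-15 dB rule of thumb).
bed = music - 12

# Loop or trim the bed to the voice length, then overlay the voice on top.
bed = (bed * (len(voice) // len(bed) + 1))[: len(voice)]
mix = bed.overlay(voice)

mix.export("final_mix.mp3", format="mp3")
```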
For brand work where the music is a fixed brand asset (a stinger, a recurring theme), pin the canonical track in the canvas and reference it across every voiceover video in the series. The music identity becomes part of the brand audio library, the same way the voice does. Consistency across episodes is what turns individual videos into a recognizable channel.
Step 6 — Export and distribute
The NLE export node sits at the end of the canvas and assembles the voice, video, music, and SFX into a finished file. For talking-head pieces, the lip-synced Kling Avatar takes carry both video and synced voice — drop them into the NLE node in order, layer the music underneath, layer the SFX where appropriate, export. For narrated-over-B-roll, drop the B-roll takes in order with the voiceover audio playing through the timeline; the NLE node respects the audio-driven cut.
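For intuition about what the NLE node is doing with the Step 4 segments, the same ordered assembly can be sketched in code. A hedged sketch with moviepy (filenames are placeholders; on moviepy 1.x these imports live under moviepy.editor):

```python
from moviepy import VideoFileClip, concatenate_videoclips

# Lip-synced Kling Avatar segments, in script order; filenames are placeholders.
takes = [VideoFileClip(f"segment_{i}.mp4") for i in range(1, 4)]

# Hard cuts between segments; the synced voice rides inside each take.
timeline = concatenate_videoclips(takes)
timeline.write_videofile("voiceover_video.mp4")
```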
The output format from the NLE export node is standard video — ready to upload directly to social platforms, embed on a website, or import into a traditional editor for additional finishing. For most voiceover videos, the canvas export is the deliverable; for higher-end work that needs color grading or audio mastering, the export becomes the source for a finishing pass in Premiere or Resolve.
For a series of voiceover videos (an episodic channel, a campaign of variants), the canvas pattern lets you reuse the canonical voice, the brand music, and the standing visual references across every episode. Production cost per episode drops dramatically after the first one because the references compound. The first episode sets up the canvas; every subsequent episode is a downstream variant.
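The compounding is easiest to see as a loop over scripts with everything else held fixed. A sketch in which render_episode is a hypothetical placeholder for the whole canvas chain, not a Martini API:

```python
# Hypothetical placeholder for the whole canvas chain; not a Martini API.
def render_episode(script: str, voice_id: str, portrait: str, music: str) -> str:
    return f"rendered: voice={voice_id}, portrait={portrait}, music={music}"

# Canonical references, locked once when episode one sets up the canvas.
VOICE_ID = "brand-spokesperson-v1"         # placeholder voice profile
PORTRAIT = "portrait_front_canonical.png"  # pinned front view
MUSIC = "brand_theme.mp3"                  # pinned brand track

# Every subsequent episode only changes the script.
for script in ["Episode one script...", "Episode two script..."]:
    print(render_episode(script, VOICE_ID, PORTRAIT, MUSIC))
```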
How Martini changes the voiceover workflow
Outside a canvas-based tool, AI voiceover video production is a tool-juggling exercise. Generate the voice in one product, download. Switch to a video tool, upload the voice (if the tool supports it) or generate video separately. Switch to a lip-sync tool, upload everything. Render. Switch to an editor, import, layer the music, export. Each step costs time and silently introduces inconsistencies.
On the Martini canvas, the entire chain — script, voice, video, lip-sync, music, export — runs in one workspace with the audio wired directly between the voice node and the lip-sync node, the video wired directly into the NLE node, and the canonical voice and brand music shared across every episode. Production cost per video drops dramatically; consistency across episodes becomes a structural property of the canvas. The canvas is the voiceover production studio, not just one tool inside it.
Workflow example
A complete sixty-second talking-head voiceover video on Martini:
1. Drop a Nano Banana 2 image node, generate the spokesperson portrait, and pin the front view as canonical.
2. Drop an ElevenLabs audio node, paste the sixty-second script, select the brand voice, and generate the audio.
3. Drop a Kling Avatar video node, wire in the canonical portrait and the ElevenLabs audio, and write a motion prompt covering body language and gaze.
4. Render two takes and pick the stronger.
5. Drop a music node and generate an ambient bed to sit underneath the voice.
6. Drop the lip-synced take, the music, and any SFX into the NLE export node, then export.
Total elapsed time: roughly forty-five minutes from blank canvas to finished sixty-second piece.
Related models and tools
- AI Lip Sync (tool): lip-sync tools on Martini for syncing voice and dialogue to portraits and video.
- AI Video Upscaling (tool): upscale generated video outputs on Martini's canvas.
- ElevenLabs (provider): ElevenLabs voiceover, lip-sync, and voice cloning workflows on Martini.
- Minimax (provider): Minimax's Hailuo video model and adjacent audio workflows on Martini.
- Suno (provider): Suno's AI music generation workflows for video on Martini.
- Kling (provider): Kling 3, O3, and Avatar video model workflows on Martini.
Related reading
- AI Influencer Production Workflow: Repeatable Pipeline. A repeatable content pipeline for AI influencers using Martini's character + voice + video chain.
- Kling 3 Guide: Variants, Use Cases, and How to Choose. Kling 3, O3, and Avatar variants — when to use each, on Martini.
- AI Video Production Pipeline: From Idea to NLE Export. From idea to NLE export with AI tools on Martini.
Frequently asked questions
- Do I need a separate lip-sync tool?
- No — Kling Avatar handles lip-sync as part of the same node that produces the talking-head video. Wire in the character image and the audio, and the take comes out sync-ready. There is no separate lip-sync step on the Martini canvas; it is one node that does both.
- Can I clone my own voice for the voiceover?
- Yes — ElevenLabs voice cloning is the most stable across many generations and works directly on the canvas. Provide sample audio (with rights cleared), generate the cloned voice profile, and use it across every voiceover script. This is the canonical pattern for brand-spokesperson voice consistency.
- Which TTS model has the best output for voiceover ads?
- ElevenLabs is the default pick for voiceover work on the Martini canvas — its emotional range, voice cloning, and overall naturalness are the strongest. Fish Audio S2 is the cost-sensitive alternative for high-volume runs where the script is straightforward and the voice does not need to carry strong emotional performance.
- How long should an AI voiceover script be?
- About seventy-five to ninety words for a thirty-second video, a hundred and fifty to a hundred and eighty words for a sixty-second video. Read every draft aloud at speed; if you stumble, the voice model will too. Length matches duration when you write for the natural reading pace.
- Can I do voiceover over B-roll instead of talking-head?
- Yes — generate the B-roll shots independently with Seedance 2, Veo, or Runway Gen4 (one shot at a time, picking the model that fits each shot type), then assemble in the NLE export node with the ElevenLabs audio playing across the timeline. Cut the B-roll to the rhythm of the voiceover.
- How do I keep the voice consistent across many episodes?
- Lock the canonical voice once — clone it or pick it from the library — and pin a sample audio clip in the canvas version tray as the brand-voice reference. Use this canonical voice for every script in the series. Consistency becomes structural across episodes because the same voice profile is referenced every time.
Ready to try it on the canvas?
Open Martini and fan your prompt across every frontier model in one workflow.