How to Create AI Explainer Videos: From Script to Export
Production workflow for AI-generated explainer videos on Martini — script breakdown to NLE-ready export.
Key takeaways
- A finished explainer video is the product of four locked inputs working together — a script written for the ear, a narration voice, a sequence of B-roll or talking-head shots, and on-frame typography timed against that narration.
- Use Sora 2 Pro Storyboard or Google Veo for cinematic B-roll, Ideogram or Nano Banana 2 for in-image text frames, ElevenLabs for narration, and Kling Avatar only when an on-screen presenter is part of the brief.
- Write the script before any shot generation — explainer pacing is voice-first, and visual generation should be planned beat-by-beat against the timed narration rather than improvised after the fact.
- On the Martini canvas, the entire chain (script, narration, shots, kinetic text, NLE export) lives in one workspace, with the audio wired directly into the NLE timeline so cut points line up with the narration.
- For a thirty- to ninety-second SaaS or education explainer, expect roughly two to four hours from blank canvas to NLE-ready export once the script is locked.
What an explainer video actually has to do
An explainer video has one job: take a viewer who does not understand the product, the concept, or the workflow, and leave them with a clear mental model in thirty to ninety seconds. Everything else is decoration. The pieces that carry the explanation are the script (which determines what gets said), the voice (which determines how it lands), the shots (which reinforce the words at each beat), and the on-frame typography (which anchors the takeaways the brain holds onto). When all four work together, the viewer leaves the video with the right idea; when any one slips, they bounce.
B2B SaaS and education are the categories where explainer videos earn their keep. A finance product explaining a new compliance feature, a developer tool walking through onboarding, a learning platform introducing a course module — these are videos that have to be clear before they can be clever. The temptation with AI tooling is to lead with visual spectacle and let the voiceover catch up. The discipline that produces a good explainer is the inverse: lead with the script, then build the visual and audio system around what the script needs to land.
On Martini, the entire production chain runs as one connected canvas. The script lives in a script node. The narration audio lives in an ElevenLabs node wired into the timeline. The B-roll shots live in Sora 2 or Veo nodes wired into the NLE export. The on-frame typography lives in Ideogram nodes that produce text frames matched to specific narration beats. The NLE export node assembles everything against the locked audio track, which means cuts line up with the narration without manual re-timing.
Step 1 — Script the explainer for the ear, not for slideware
The script is the spine. Write it before any visual generation, and write it for voice rather than for slideware. The grammar that works in a deck (bulleted phrases, abbreviations, dense clauses) does not survive narration. The grammar that lands in an explainer is short sentences, one idea per beat, deliberate pacing, and concrete language. If a sentence makes you stumble when you read it aloud, the voice model will stumble in the same place and the viewer will check out at that exact moment.
Length governs everything downstream. A thirty-second explainer is roughly seventy-five words of narration; a sixty-second explainer holds about a hundred and fifty; a ninety-second explainer tops out around two hundred and twenty. Write to the target. Resist the instinct to cram a fourth example or a second case study into a sixty-second piece — explainer videos earn comprehension by saying less, not more. If the brief demands more than ninety seconds of content, split it into two pieces rather than over-stuffing one.
Plan the script as a beat sheet rather than a paragraph. For a sixty-second SaaS explainer, the typical structure is: hook (eight seconds, the painful status quo), problem reframe (ten seconds, why existing solutions miss the point), product introduction (fifteen seconds, what this is), demonstration (twenty seconds, what it does in practice), call to action (seven seconds, what the viewer should do next). Each beat gets one or two sentences and one matching shot. The beat sheet becomes the brief for every visual node downstream.
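If it helps to sanity-check the beat sheet before any generation, the arithmetic is simple enough to script. The sketch below is illustrative only: the beat names and durations mirror the sixty-second structure above, and the 2.5-words-per-second pace is an assumed average read speed, not a Martini setting.

```python
# A minimal sketch of a sixty-second beat sheet as data (illustrative values,
# not a prescription). Word budgets assume ~2.5 words per second of narration.

WORDS_PER_SECOND = 2.5  # assumed average read pace

beat_sheet = [
    ("hook",    8),   # the painful status quo
    ("problem", 10),  # why existing solutions miss the point
    ("product", 15),  # what this is
    ("demo",    20),  # what it does in practice
    ("cta",     7),   # what the viewer should do next
]

total = sum(seconds for _, seconds in beat_sheet)
print(f"total runtime: {total}s")  # should hit the 60-second target

for name, seconds in beat_sheet:
    print(f"{name:>8}: {seconds:>2}s, ~{round(seconds * WORDS_PER_SECOND)} words of narration")
```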
Step 2 — Lock the narration voice on ElevenLabs
Narration is the second locked input. The voice is the brand of the video — it sets the energy, the authority, and the emotional register. ElevenLabs is the default narration node on Martini for explainers because of the emotional range and the consistency across long takes. Pick a voice from the library that matches the brand audio identity, or clone a voice from sample audio if the brand has an existing spokesperson sound. Pin the chosen voice as the canonical narration voice and reuse it across every explainer in the series.
Generate the full narration in one take rather than recording sentence by sentence. ElevenLabs handles paragraph-length input cleanly, and a single take preserves the natural pacing of the read. If a specific phrase needs a different inflection, regenerate that segment alone and splice it in at the NLE node — the canonical voice stays the same so the splice is undetectable. Bake pauses into the script with ellipses and paragraph breaks; the voice respects those markers and the resulting audio breathes naturally.
Once the narration is generated, the audio file becomes the timing source for the rest of the canvas. The NLE export node treats this audio as the master track. Every B-roll cut, every text frame, every transition lines up against the narration timecode rather than against an arbitrary visual timeline. This is the structural reason explainer videos built voice-first feel synced; explainer videos built visual-first usually drift and need re-timing in finishing.
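The same beat sheet gives you the cut points for free: each beat's start time on the narration track is the running sum of the durations before it. A minimal sketch of that arithmetic, reusing the illustrative durations from Step 1:

```python
# Sketch: deriving cut points from the narration timecode rather than from
# the visuals. Durations reuse the illustrative beat sheet from Step 1.

beat_sheet = [("hook", 8), ("problem", 10), ("product", 15), ("demo", 20), ("cta", 7)]

def timecode(seconds: float) -> str:
    m, s = divmod(seconds, 60)
    return f"{int(m):02d}:{s:05.2f}"

start = 0.0
for name, duration in beat_sheet:
    end = start + duration
    print(f"{name:>8}: cut in at {timecode(start)}, cut out at {timecode(end)}")
    start = end
```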
Step 3 — Generate B-roll with Sora 2 or Google Veo
For each beat in the script, plan one shot. Concrete is better than abstract — a shot of a real product surface beats a shot of vague glowing geometry every time. Write each shot prompt as a single take with subject, action, camera move, and lighting. Drop a Sora 2 Pro Storyboard node when you need cinematic motion with realistic lighting and reflections, especially for product surfaces, hands interacting with interfaces, or human gestures around the product. The Storyboard variant is the right slot when you want one node to deliver a tight multi-shot beat rather than one isolated take.
Drop a Google Veo node when the shot is environmental — a wide office, a classroom, a public space, weather in a city. Veo is the strongest video model right now for long-range motion coherence and depth-of-field falloff in wide environmental frames. The trade-off is cost and render time; reserve Veo for the establishing shots and the hero environment frames rather than every cut. For the in-between shots — character close-ups without dialogue, product motion, demonstrative gestures — Seedance 2 Pro is usually the right pick at lower cost and faster turnaround.
Wire each video node into the NLE export node in the order the script reads. Render two takes per shot, pick the stronger from the version tray, and pin the chosen take. The shots do not need to be exact-length matches to the narration beat — the NLE node handles trimming against the audio timeline. The discipline is to keep each shot long enough to cover its beat plus a half-second of head and tail for clean cuts, not to micro-time inside the model.
Step 4 — Add on-frame typography with Ideogram
Explainer videos rely on typography to anchor the takeaways the brain holds onto. The viewer remembers the words that appeared on screen better than the words they only heard once, and the typography is what survives the silent autoplay scroll. Ideogram is the default image node on Martini for in-frame text because the text rendering is consistently legible at production quality and the text behaves predictably under prompt direction. Drop an Ideogram node for each typographic frame the script calls for.
For each text frame, write the prompt as the visual context plus the exact copy. For example: "Bold sans-serif headline reading EVENT-DRIVEN BILLING centered on a clean off-white background, slight noise texture, brand-appropriate kerning, no other text." Generate the frame at the aspect ratio your final export will use, pin the strongest take, and wire it into the NLE timeline at the moment the narration says the matching phrase. The text frame should appear half a beat before the voice says the words and stay on screen through the end of that sentence.
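The half-beat lead is easy to formalize if you are planning frame timings against the narration. The sketch below is an assumption-heavy illustration: it treats "half a beat" as roughly half a second, and the phrase and sentence timestamps would come from however you align the narration (a transcript tool, or just scrubbing the audio).

```python
# Sketch of the text-frame timing rule: show the frame slightly before the
# narration says the matching phrase, hold it until that sentence ends.
# "Half a beat" is treated here as roughly half a second (an assumption).

LEAD_IN = 0.5  # seconds of lead before the spoken phrase

def text_frame_window(phrase_spoken_at: float, sentence_ends_at: float) -> tuple[float, float]:
    frame_in = max(0.0, phrase_spoken_at - LEAD_IN)
    frame_out = sentence_ends_at
    return frame_in, frame_out

# e.g. the narration hits "event-driven billing" 19.2s in; that sentence ends at 23.0s
print(text_frame_window(19.2, 23.0))  # -> (18.7, 23.0)
```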
For more complex typography sequences (numbered lists, animated reveals, kinetic emphasis), the explainer-friendly pattern is to generate the static frames in Ideogram and let the NLE node handle the reveal animation. Avoid trying to coax full animation out of an image model; static frames assembled with NLE animation produce more polished results. Nano Banana 2 is the fallback when the typographic frame needs to integrate with a product render rather than sit on a clean background — Nano Banana 2 handles in-image text inside a more complex composition more cleanly than Ideogram.
Step 5 — Add a presenter only when the brief calls for one
Some explainers genuinely benefit from an on-screen presenter — the founder talking to camera, a teacher introducing the lesson, a customer support agent walking through a feature. When the brief calls for it, the talking-head step is one node: Kling Avatar, wired to a Nano Banana 2 portrait of the presenter and the ElevenLabs narration. The Avatar node syncs the mouth, jaw, and micro-expressions to the audio and produces a credible take that reads as the same person on every video in the series.
Most explainers do not need a presenter, and adding one when the brief does not require it dilutes the explanation. The viewer's attention is finite; a talking head competes with the typography and the B-roll for that attention. Use the presenter when the credibility of a recognizable face matters (founder-led storytelling, instructor-led education, support-led help) and skip it when the explanation is product-led or concept-led. The structural rule is: presenter when the source matters, no presenter when the content matters.
When you do use a presenter, treat the talking-head shots as a recurring beat rather than the whole video. A typical pattern is a six-second open from the presenter, B-roll for the middle thirty to sixty seconds with narration carrying the explanation, and a six-second close from the presenter. The presenter frames the piece; the B-roll and the typography do the explanatory lifting. This pacing keeps the presenter's authority without consuming the runtime that explanation needs.
Step 6 — Assemble in the NLE export node
The NLE export node is where the chain becomes a finished video. Wire the ElevenLabs narration in as the master audio track. Wire each B-roll shot, each typography frame, and any presenter takes into the timeline in script order. The node uses the audio timecode to anchor the cuts — every shot starts where its beat starts in the narration and ends where the next beat starts. Drop a music node alongside, generate or pull a soft underscore that fits the brand sonic identity, and layer it ten to fifteen decibels under the narration for clarity.
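On the canvas the NLE node handles that level relationship for you; if you want to hear what ten to fifteen decibels of separation sounds like before committing, here is a rough sketch using pydub (which needs ffmpeg installed). The file names are placeholders, dBFS is only a crude proxy for perceived loudness, and the final check should still be by ear.

```python
# Sketch of the music-under-narration level relationship using pydub.
# File names are placeholders; dBFS is a rough stand-in for loudness.
from pydub import AudioSegment

narration = AudioSegment.from_file("narration.mp3")
music = AudioSegment.from_file("underscore.mp3")

# Target the underscore about 12 dB below the narration's average level
# (anywhere in the 10-15 dB range from the guide), then trust your ears.
gain = (narration.dBFS - 12) - music.dBFS
ducked_music = music.apply_gain(gain)

# Lay the quieter underscore beneath the narration for its full duration.
mix = narration.overlay(ducked_music[: len(narration)])
mix.export("draft_mix.mp3", format="mp3")
```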
Render a draft, watch it through twice, and note any beats where the visual lags behind or runs ahead of the narration. The fix is almost always a shot trim or a swap of the chosen take from the version tray, not a re-render. The version tray keeps every take from every node, so adjusting the cut is a thirty-second operation rather than a re-generation cycle. Two or three iterations on the draft usually lock the final cut.
Export the final video at the deliverable specs the distribution surface needs — 1:1 for some social, 9:16 for vertical, 16:9 for embed and YouTube. The NLE export node produces all three from the same canvas without re-rendering the source shots, because the cut is locked at the timeline level and the output specs change at the export step. Producing three platform variants from one canvas is the structural advantage of running the explainer pipeline this way rather than re-cutting in a separate editor.
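For intuition about what the vertical and square variants actually do to the frame, the geometry reduces to a crop ratio. The sketch below assumes a plain center crop of a 1920x1080 master, which is the naive version of the reframe; the export step on the canvas can make smarter reframing choices, so treat this as arithmetic, not workflow.

```python
# Sketch: what 16:9 -> 9:16 and 1:1 variants amount to geometrically,
# assuming a simple center crop of a 1920x1080 master.
# (Real exports would round crop dimensions to even pixel values.)

def center_crop(src_w, src_h, target_ratio):
    """Return (crop_w, crop_h, x_offset, y_offset) for a centered crop."""
    if src_w / src_h > target_ratio:   # source is wider than the target ratio
        crop_w, crop_h = int(src_h * target_ratio), src_h
    else:                              # source is taller than the target ratio
        crop_w, crop_h = src_w, int(src_w / target_ratio)
    return crop_w, crop_h, (src_w - crop_w) // 2, (src_h - crop_h) // 2

print(center_crop(1920, 1080, 9 / 16))  # vertical: (607, 1080, 656, 0)
print(center_crop(1920, 1080, 1.0))     # square:   (1080, 1080, 420, 0)
```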
How Martini changes the explainer workflow
Outside a canvas-based tool, AI explainer production is a tool-juggling exercise — generate narration in one product, download, switch to a video tool to render shots one at a time, switch to a typography tool for text frames, switch to an editor to assemble, switch to a finishing tool for color and audio mix, export. Each transition costs minutes and silently introduces inconsistencies between the audio timing and the visual timing. Teams that try to ship a weekly explainer cadence without a canvas usually plateau because the per-video overhead caps how many they can produce.
On Martini, the script, narration, B-roll, typography, presenter takes, music, and NLE assembly all live in one workspace. The narration is wired directly into the NLE timeline so cuts respect the audio timecode automatically. The version tray remembers every take from every node so cut changes are instant. The same canvas can produce platform variants at export rather than requiring a separate edit. A weekly explainer cadence becomes a structural property of the workflow rather than a heroic effort each cycle.
Workflow example
A typical sixty-second SaaS explainer on Martini: drop a script node and write the five-beat outline (hook, problem, product, demo, CTA). Drop an ElevenLabs node, paste the locked script, generate the canonical narration. Drop five video nodes — one Sora 2 Pro Storyboard for the hook close-up of the painful status quo, one Veo for the wide environmental frame on the problem reframe, two Seedance 2 Pro nodes for the product surface and demo motion, one Sora 2 for the CTA close-up. Drop three Ideogram nodes for the typographic anchors (PROBLEM, PRODUCT NAME, CTA URL). Wire all of them into the NLE export node with the narration as master audio, layer ambient music underneath, render at 16:9 and 9:16. Total elapsed time, roughly three hours from blank canvas to NLE-ready export.
Related models and tools
- AI Video Upscaling (tool): Upscale generated video outputs on Martini's canvas.
- AI Video Frame Extraction (tool): Extract frames from video for reference and image-to-video workflows.
- AI Lip Sync (tool): Lip-sync tools on Martini for syncing voice and dialogue to portraits and video.
- OpenAI (provider): OpenAI's GPT Image and Sora video model workflows available on Martini.
- Google (provider): Google's Veo video, Imagen image, and Nano Banana model workflows on Martini.
- ElevenLabs (provider): ElevenLabs voiceover, lip-sync, and voice cloning workflows on Martini.
- ByteDance (provider): ByteDance's Seedance video and Seedream image model families on Martini.
Related reading
- How to Create an AI Voiceover Video on Martini (audio narration plus video generation workflow on Martini)
- AI Video Production Pipeline: From Idea to NLE Export (from idea to NLE export with AI tools on Martini)
- Sora 2 Video Workflows on Martini (how to use Sora 2 inside multi-model production on Martini's canvas)
Frequently asked questions
- How long should an AI explainer video be?
- Thirty to ninety seconds is the working range for B2B SaaS and education explainers. Thirty seconds for a single-feature pitch, sixty seconds for a product overview, ninety seconds when the explanation genuinely requires more setup. Write to the target length rather than stretching content to fill a longer runtime.
- Should I write the script before or after generating the visuals?
- Always write the script first. Explainer pacing is voice-first — the narration sets the timing and the visuals serve the words. Scripts written after visual generation almost always need re-timing in finishing because the cuts and the speech do not naturally align.
- Which video model should I use for explainer B-roll?
- Sora 2 Pro Storyboard for cinematic product motion and human gesture beats, Google Veo for environmental wides, Seedance 2 Pro for in-between character close-ups and demonstrative motion. The mix matters more than picking one model for every shot. The Martini canvas lets you wire all three on the same timeline.
- How do I get clean text rendering in explainer frames?
- Drop an Ideogram node for each typographic frame and write the prompt as visual context plus the exact copy in quotes. Ideogram is the most reliable image model on Martini for legible production-quality text. Use Nano Banana 2 only when the text needs to integrate into a more complex product composition.
- Do explainer videos need an on-screen presenter?
- Most do not. Use a presenter (via Kling Avatar wired to a portrait and the ElevenLabs narration) when the credibility of a recognizable face matters — founder-led storytelling, instructor-led education. Skip the presenter when the explanation is product-led or concept-led; the viewer's attention is better spent on the typography and the B-roll.
- How do I adapt one explainer for multiple platforms?
- Lock the cut once at the NLE export node and change the output spec at export time. Render the same canvas at 16:9 for embed and YouTube, 9:16 for vertical social, and 1:1 where required. Three platform variants come from one canvas without re-rendering source shots.
Ready to try it on the canvas?
Open Martini and fan your prompt across every frontier model in one workflow.