Google Veo 3.1 ships native audio synthesis baked into the same generation pass as the video — describe the ambient sound right in the prompt and Veo synchronizes it to the picture. For an indie short film where the dialogue, footsteps, and music bed all have to land in time, Veo 3.1 is the cleanest end-to-end option. Output goes up to 1080p with Fast and Standard tiers, plus an Extend variant that continues an existing clip in V2V mode for seamless multi-clip assembly.
Veo 3.1's native audio synthesis means ambience is part of the prompt, not a separate node. Write: "rain on a tin roof, distant thunder, fire crackles in a wood stove, soft folk guitar in the background." Veo will synthesize all of those layered with the visual content. This is meaningfully tighter than chaining ElevenLabs SFX afterward.
Veo 3.1 supports reference images for style and character guidance. Pin the Nano Banana 2 character sheet to the canvas and feed it into each shot's Veo node. Identity is consistent across cuts — and because the audio renders in-pass, the character's footsteps and dialogue stay matched to their movement.
Veo 3.1 Fast returns drafts in 60-120s; Standard takes 120-180s but produces noticeably crisper detail and better audio fidelity. For a 3-5 shot short, run Fast on the first pass to lock prompt language, then re-render hero opener and closer on Standard with the proven prompt.
For a continuous take that runs longer than a single Veo render, the Extend variant takes an existing clip and continues it seamlessly in V2V mode. Render the first 8 seconds with Veo 3.1 Standard, then route the output into a Veo 3.1 Extend node with a continuation prompt. The result is a longer continuous shot without a visible cut.
Like Kling, Veo 3.1 generates dialogue lip-sync when you supply a script line in the prompt. For the climactic dialogue beat, write the line in quotes: "Character whispers: 'We need to leave now.' Soft golden hour light, medium close-up, ambient cricket sound, 6 seconds." Lip-sync renders in-pass.
Once all 3-5 shots are in the canvas, route them through the sequence builder and export as a 1080p native sequence. Veo 3.1 caps at 1080p — for 4K festival delivery, route the timeline through a video-upscale tool node (2x is enough; reserve 4x for hero only). Audio is already baked, so no separate audio export needed.
Opening with full ambient soundscape baked in. No separate SFX or music nodes needed.
Wide establishing shot of a remote cabin at dusk, rain on tin roof, distant thunder, fire crackles in wood stove, soft folk guitar in background, 8 seconds
Dialogue beat with in-pass lip-sync. Veo renders the spoken line synchronized to mouth movement.
Medium close-up. Character whispers: "We need to leave now." Soft golden hour light from camera right, ambient cricket sound, 6 seconds
First half of a long continuous chase. Pipe to Veo Extend for the next 8 seconds without a visible cut.
Continuous follow shot, character runs through wet forest at night, breathing heavily, leaves rustle, distant siren, handheld camera, 8 seconds (then continue with Veo Extend)
Veo 3.1 ambient sound renders in the same pass — describe the soundscape directly in the prompt.
Use Fast for blocking, Standard for hero shots — the audio fidelity gap is significant.
For dialogue, write the line in quotes inside the prompt; Veo renders lip-sync in-pass.
Veo 3.1 Extend is V2V-only — feed it an existing clip plus a continuation prompt for seamless multi-clip assembly.
Output is capped at 1080p — for 4K festival delivery, chain a video-upscale tool node downstream.
Veo 3.1 outputs at 720p or 1080p with native synchronized audio in the same generation pass — uniquely tight pairing of picture and sound. Render times: Fast 60-120s, Standard 120-180s. Reference images guide style and character. The Extend variant is V2V-only and useful for continuous shots that exceed a single render. For 4K final delivery, chain a video-upscale tool node downstream — Veo itself caps at 1080p.
Connect Google Veo 3.1 with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started FreeOpenAI
Sora 2 is OpenAI's flagship for cinematic short film work — realistic lighting, believable reflections, and camera moves that read like a real DP shot them. The base Sora 2 handles text-to-video and image-to-video at 1080p; Sora 2 Pro lifts fidelity and unlocks 15-second clips with clarity control. For an indie filmmaker drafting a 3-5 shot festival short over a weekend, Sora 2 hits the bar where the pre-viz looks production-ready before any crew is booked.
View guideKling
Kling 3.0 is the first major video model to render native 4K (3840x2160) at the diffusion stage rather than via post-process upscaling — sharper textures, accurate film grain, finer hair, fabric, and skin detail than any upscaler can recover. For a short film bound for a festival projector, that detail floor matters. Kling also bakes Omni Native Audio into the same pass (English, Chinese, Japanese, Korean, Spanish), so dialogue lip-sync and ambience can ship without a separate audio chain.
View guide