Google's Veo 3 is the only video model on Martini that generates synchronized audio alongside video. Every other model produces silent video that requires separate audio work. For ads, this removes a whole production step: you get ambient sound, sound effects, and even music in a single generation. The latest version (Veo 3.1) offers Standard and Fast tiers and supports reference images.
The key technique for Veo 3: describe sounds alongside visuals. Instead of just "a coffee shop scene," write "a barista steams milk with a high-pitched hiss, pours it into a ceramic cup with a soft splash, the gentle murmur of conversation in the background." The model synchronizes audio to visual action — the hiss happens when steam is visible, the splash aligns with the pour.
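One way to apply this consistently is to keep each visual action next to its sound cue while drafting, then join the pairs into the prompt text. The sketch below is plain string assembly in Python; `Beat` and `build_prompt` are hypothetical helpers invented for illustration, not part of Martini or Veo.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    """One visual action plus the sound that should accompany it (hypothetical helper)."""
    action: str
    sound: str


def build_prompt(beats: list[Beat], ambience: str = "") -> str:
    """Join action/sound pairs into one prompt so that no action is left silent."""
    for beat in beats:
        if not beat.sound.strip():
            raise ValueError(f"no sound cue for action: {beat.action!r}")
    clauses = [f"{beat.action} with {beat.sound}" for beat in beats]
    prompt = ", ".join(clauses)
    if ambience:
        # Background sound goes last, after the action-specific cues.
        prompt += f", {ambience}"
    return prompt + "."


coffee_shop = build_prompt(
    [
        Beat("a barista steams milk", "a high-pitched hiss"),
        Beat("pours it into a ceramic cup", "a soft splash"),
    ],
    ambience="the gentle murmur of conversation in the background",
)
print(coffee_shop)
```

Keeping the pairs as data makes a missing sound cue an error at drafting time rather than a silent moment in the generated clip.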
Standard tier produces better temporal consistency (smoother motion between frames) and more natural, better-synced audio, so use it for final ad output. Fast tier is cheaper and quicker, which suits testing concepts and iterating on prompt ideas. The audio gap between the two tiers is significant, so judge sound quality only on Standard renders.
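The draft-on-Fast, finish-on-Standard workflow can be sketched as a small planning helper. This is an illustration only: the tier strings `"fast"` and `"standard"` and the function names are assumptions made for the example, not Martini's actual option names or API.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    FINAL = "final"


# Placeholder tier labels (assumed); check the Martini UI for the real option names.
TIER_FOR_STAGE = {Stage.DRAFT: "fast", Stage.FINAL: "standard"}


def pick_tier(stage: Stage) -> str:
    """Return the tier label to use for a given production stage."""
    return TIER_FOR_STAGE[stage]


def plan_runs(variants: list[str], winner: str) -> list[tuple[str, str]]:
    """Draft every prompt variant on Fast, then render only the chosen one on Standard."""
    if winner not in variants:
        raise ValueError("winner must be one of the drafted variants")
    runs = [(variant, pick_tier(Stage.DRAFT)) for variant in variants]
    runs.append((winner, pick_tier(Stage.FINAL)))
    return runs


runs = plan_runs(["bakery at dawn", "bakery at dusk"], winner="bakery at dawn")
for prompt, tier in runs:
    print(f"{tier:>8}: {prompt}")
```

Only one Standard render appears in the plan, which keeps the more expensive tier for the version that actually ships.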
Veo 3.1 supports reference images — connect a product photo or scene setup to guide the visual composition while Veo handles the animation and audio generation. This combines the brand consistency of image-to-video with Veo's unique audio generation capability.
Write your prompt as a director's brief that covers all senses: "Wide shot establishing a cozy bakery at dawn. A baker slides fresh bread into a brick oven. The crackling of fire, soft morning birdsong outside, and the warm hum of the oven. Camera slowly pushes in through the window." Every sound description you include gives the model concrete audio targets to synchronize.
Food & beverage ad: the prompt below works because every visual action has a matching sound cue. The "satisfying clink" syncs with ice hitting the glass, and "liquid pouring" matches the visual pour. On any other model on Martini, you would need to add this audio separately.
A bartender crafts a colorful cocktail in slow motion — ice cubes tumbling into the glass with a satisfying clink, liquid swirling in vibrant layers, finished with a citrus twist. Ambient bar sounds, ice clinking, liquid pouring. Moody bar lighting with neon accents, 16:9
Lifestyle montage with diegetic sound: Veo 3 can generate a sequence of distinct, synchronized sounds (alarm, creak, hiss, click) that give the video a polished, production-ready feel without any post-production audio work.
Morning routine montage: alarm rings, feet touch wooden floor with a soft creak, coffee machine hisses and gurgles, toast pops up with a click. Quick cuts between actions, natural room sounds, warm morning light, lifestyle brand commercial
Always use Standard tier for final ad output — the audio sync quality is dramatically better than Fast. Save Fast for drafting and concept validation.
Be specific about sounds: "the soft thud of a box landing on a table" produces better audio than just "box sounds." The more descriptive your sound cues, the better the sync.
Veo 3 generates audio automatically. If you need a silent version of the same ad (for example, for social feeds that autoplay muted), you can mute the video in any editor, which is easier than adding audio to a silent model's output.
For ads where voiceover narration is needed, generate the video+ambient audio with Veo 3, then add a TTS voiceover track on the canvas. The ambient audio from Veo serves as the bed track.
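The advice to be specific about sounds can be turned into a quick pre-flight check on a prompt. The sketch below is a rough heuristic written for this example, not a Martini feature: it flags phrases that name a source followed by a generic word such as "sounds" or "noise", and the pattern and function name are assumptions.

```python
import re

# Matches "<word> sounds" / "<word> noise" style phrases, which name a source
# but not the sound itself. Phrases like "the sound of rain" are skipped
# because what follows "of" usually describes the sound.
VAGUE_CUE = re.compile(r"\b\w+\s+(?:sounds?|noises?)\b(?!\s+of\b)", re.IGNORECASE)


def find_vague_cues(prompt: str) -> list[str]:
    """Return generic sound phrases that may deserve a more descriptive rewrite."""
    return [match.group(0) for match in VAGUE_CUE.finditer(prompt)]


print(find_vague_cues("A courier sets a parcel down, box sounds"))
print(find_vague_cues("the soft thud of a box landing on a table"))
```

The check also flags ambient phrases such as "bar sounds", which can be a deliberate background bed, so treat the results as candidates for review rather than errors.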
Veo 3 is the only video model on Martini that generates native audio. This makes it uniquely efficient for ad production — you get a near-complete ad asset in a single generation step. The trade-off: its visual quality for human subjects is behind Kling 3.0 Pro, and its physics simulation is behind Sora 2. Use Veo 3 when audio is part of the creative concept; use Kling 3.0 for people-focused ads; use Sora 2 for product physics shots.
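The selection guidance above can be written down as a small decision function. This is a sketch under stated assumptions: the function name and boolean flags are hypothetical, and checking audio first is a choice made for this example, on the grounds that Veo 3 is the only model with native audio.

```python
from typing import Optional


def pick_model(needs_audio: bool, features_people: bool, physics_heavy: bool) -> Optional[str]:
    """Map an ad concept to a model, following the comparison in this guide.

    The priority order (audio, then people, then physics) is an assumption of
    this sketch; the guide itself does not rank the three criteria.
    """
    if needs_audio:
        return "Veo 3"
    if features_people:
        return "Kling 3.0"
    if physics_heavy:
        return "Sora 2"
    # The comparison above gives no recommendation for other cases.
    return None


print(pick_model(needs_audio=True, features_people=True, physics_heavy=False))
print(pick_model(needs_audio=False, features_people=False, physics_heavy=True))
```

When a concept needs both audio and convincing people, the function still returns Veo 3; whether that trade-off is acceptable depends on how prominent the human subject is.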
Connect Veo 3 with other AI models on Martini's infinite canvas. No GPU required — start free.
OpenAI
Sora 2 is OpenAI's video model, and its standout strength is physics simulation — liquids pour realistically, fabrics drape naturally, and objects interact with believable weight and momentum. For video ads, this means product shots look physically convincing without the uncanny "AI float" that plagues other models. On Martini, Sora 2 costs 100 credits for a 10-second clip or 150 credits for 15 seconds, with only two aspect ratios: 16:9 (landscape) and 9:16 (portrait). There are no quality tiers, speed options, or other knobs to tune — Sora 2 is a zero-config model where all your creative energy goes into the prompt and reference image.
Kling
Kling 3.0 is the best video model for ads featuring people. It generates the most natural human motion, facial expressions, and lip movement of any model on Martini. With Standard and Pro quality tiers, it scales from quick storyboarding to final ad-quality output. If your video ad shows a person — drinking coffee, unboxing a product, giving a testimonial — Kling 3.0 Pro should be your first choice.
Minimax
Hailuo 02 by Minimax is the workhorse for video ad production — reliably generating clean, well-composed product commercials with consistent color accuracy. Where Sora 2 excels at physics and Kling 3.0 at people, Hailuo 02 excels at commercial polish: product reveals, beauty shots, and food content with the kind of clean, controlled composition that clients expect from ad agencies. Its Standard and Pro tiers let you iterate cheaply and deliver expensively.