AI lip sync on Martini takes a portrait or short clip plus a voice track and produces a talking-head video where the mouth movements match the dialogue. Drop an avatar node, wire in TTS or an uploaded voice, and render synced takes for product explainers, UGC ads, podcasts, and dubbing without filming a face.
Lip sync turns a still portrait, a generated character image, or a short reference clip into a talking-head video that matches a target audio track. The audio can come from an ElevenLabs TTS node, a Fish Audio S2 voice, an uploaded WAV/MP3, or even a track you already produced in Martini's audio nodes. The model analyses phonemes, head pose, and micro-expressions, then redraws the mouth, jaw, and adjacent face region frame by frame so the talent appears to be saying the lines.
On Martini, lip sync is exposed through avatar-style video models like Kling Avatar and OmniHuman that accept a face input plus an audio input and emit MP4 video on the canvas. You wire an image or video into one input and audio into the other, and the resulting node produces a synced clip that you can keep extending, layer with B-roll on the canvas, or export to your editor. Outputs preserve the look of the source portrait while changing only the motion of the mouth region.
Start by adding an image or short video node holding the portrait you want to animate. Strong inputs are front-facing, well-lit, with the full mouth visible and minimal occlusion (no microphones, hands, or hair across the lips). If the source is generated, run an Image Upscale tool first so facial detail survives the lip-sync redraw.
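If you want to pre-flight a portrait before wiring it in, a small script can flag obviously weak inputs. A minimal sketch using Pillow; the file name and the 512 px floor are illustrative assumptions, not Martini requirements:

```python
from PIL import Image

MIN_SHORT_SIDE = 512     # assumed floor; upscale anything smaller before lip sync
MAX_ASPECT_RATIO = 2.0   # extreme crops rarely leave enough face to animate

def check_portrait(path: str) -> list[str]:
    """Return a list of warnings for a candidate lip-sync portrait."""
    warnings = []
    with Image.open(path) as img:
        w, h = img.size
        if min(w, h) < MIN_SHORT_SIDE:
            warnings.append(
                f"short side is {min(w, h)}px; run an upscale pass before lip sync"
            )
        ratio = max(w, h) / min(w, h)
        if ratio > MAX_ASPECT_RATIO:
            warnings.append(
                f"aspect ratio {ratio:.2f}:1 is extreme; crop closer to the face"
            )
    return warnings

for warning in check_portrait("portrait.png"):
    print("warning:", warning)
```

Checks like frontality or lip occlusion still need an eyeball pass; this only catches resolution and framing problems early.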
Next, add an audio source. Use an ElevenLabs node for natural English voice clones, Fish Audio S2 for multilingual voice synthesis, or simply upload an existing voice take. Keep takes under the model's max duration (commonly 30-60 seconds per call) and trim leading silence so the sync model has clean phoneme onsets to lock onto.
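Silence trimming and duration checks can happen before upload with any audio tool. A minimal sketch using pydub (an assumption on tooling, not something Martini requires), with a 60-second cap standing in for whichever limit the avatar model you pick enforces:

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

MAX_TAKE_SECONDS = 60  # stand-in for the model's per-call limit

audio = AudioSegment.from_file("voiceover.wav")

# Drop leading silence so the sync model sees a clean phoneme onset.
lead_ms = detect_leading_silence(audio, silence_threshold=-40.0)
trimmed = audio[lead_ms:]

duration_s = len(trimmed) / 1000
if duration_s > MAX_TAKE_SECONDS:
    print(f"take is {duration_s:.1f}s; split it into shorter lines")

trimmed.export("voiceover_trimmed.wav", format="wav")
```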
Drop a Kling Avatar or OmniHuman video node onto the canvas. Connect the portrait to the image/video input and the audio to the audio input. Pick an aspect ratio (vertical 9:16 for TikTok/Reels, square 1:1 for Instagram, horizontal 16:9 for YouTube) and submit. Generations queue through FAL and write back to the node when ready.
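Martini handles the FAL round trip for you, but the same queue pattern applies if you script outside the canvas. A rough sketch with the fal_client Python package; the endpoint ID and argument names below are placeholders, so check the model's page on fal.ai for the real schema before running it:

```python
import fal_client  # pip install fal-client; expects FAL_KEY in the environment

# Placeholder endpoint ID; look up the actual Kling Avatar / OmniHuman path on fal.ai.
ENDPOINT = "fal-ai/your-avatar-model"

portrait_url = fal_client.upload_file("portrait.png")
audio_url = fal_client.upload_file("voiceover_trimmed.wav")

# subscribe() enqueues the job, waits for it to finish, and returns the result payload.
result = fal_client.subscribe(
    ENDPOINT,
    arguments={
        "image_url": portrait_url,  # argument names vary per model schema
        "audio_url": audio_url,
    },
)
print(result)
```

The blocking subscribe call mirrors what the canvas does: the job sits in FAL's queue and the output is written back when rendering completes.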
Once the synced take returns, chain a Video Upscale tool downstream if you need 1080p or 4K delivery. To assemble multi-scene cuts, duplicate the avatar node per line, swap the audio per take, then drag the outputs into the export/timeline workflow. For longer videos, use the Pixverse Extend node to bridge takes while keeping the same character.
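If you prefer to stitch exported takes locally instead of on the canvas, ffmpeg's concat demuxer is enough. A minimal sketch that assumes every take shares the same codec, resolution, and frame rate, which holds when they come from the same avatar node settings:

```python
import subprocess
import tempfile

takes = ["take_01.mp4", "take_02.mp4", "take_03.mp4"]  # exported avatar takes, in order

# Write the concat list the ffmpeg demuxer expects: one "file '<path>'" per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for take in takes:
        f.write(f"file '{take}'\n")
    list_path = f.name

# Stream copy avoids re-encoding; it works because every take shares codec settings.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", "stitched.mp4"],
    check=True,
)
```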
Kling Avatar: primary lip-sync video model; accepts a portrait plus audio and renders synced talking-head clips.
OmniHuman: full-body avatar with lip sync, useful when you want gesture and torso motion alongside dialogue.
ElevenLabs: high-quality TTS and voice cloning for the dialogue track that drives the lip sync.
Fish Audio S2: multilingual voice synthesis for dubbing and localisation workflows.
Kling Avatar is the strongest general-purpose lip-sync model in the Martini library; OmniHuman is preferred when you also want body and gesture motion, not just mouth movement.
Non-English dialogue is supported: pair Fish Audio S2 (multilingual TTS) with Kling Avatar or OmniHuman. Quality is highest for the model's primary trained languages, so test a short take before committing to a full script.
Both real photos and generated characters work. Generate a consistent character with Nano Banana 2 or Flux, save it as a reference, then feed it into the avatar node alongside your dialogue audio.
Most avatar models cap individual generations at 30-60 seconds. For longer videos, generate per-sentence takes and stitch them on the canvas or extend with Pixverse Extend.
Kling Avatar mostly animates the face and head; OmniHuman drives the full upper body including gesture and torso motion. Pick based on whether you want a tight talking-head shot or a more naturalistic presenter.
Chain AI Lip Sync with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started Free