ByteDance
OmniHuman 1.5 is the full-upper-body lipsync model: it animates not just the face but the shoulders, arms, hands, and torso in response to the audio, producing presenter-style talking-head videos that look like recorded footage rather than a still portrait with moving lips. The pipeline is portrait + audio → synced video with natural micro-expressions, blink timing, head sway, and gesture. Where Kling AI Avatar gives you tight close-up framing, OmniHuman gives you a presenter who can read a script while gesturing naturally, the right pick for executive presentations, keynote-style marketing, courses with on-screen talent, or UGC ads where presence matters. Output renders at 720p in 1:1, 16:9, or 9:16 aspect ratios. The companion `tools/lip-sync` page covers tool routing; this how-to focuses on the OmniHuman-paired pipeline specifically.
Choose OmniHuman over Kling AI Avatar in three specific cases: (1) the framing is upper body and the audience needs to see shoulders, arms, and gesture; (2) the content is presenter-driven, such as an executive update, keynote talk, course lecture, or UGC explainer where natural body language signals authenticity; (3) maximum realism is required for a flagship marketing piece. For tight close-ups (course intros, multilingual lipsync where the face is the entire frame), Kling AI Avatar is the better choice. Decide upfront, because the prep differs: OmniHuman wants a head-to-mid-torso portrait, while Kling AI Avatar works with face-only crops. A rough routing rule is sketched below.
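To make the decision mechanical, here is a minimal sketch that encodes the three cases above as a routing function. The function name and its inputs are illustrative, not part of any Martini API; only the decision criteria come from this guide.

```ts
type Framing = "upper-body" | "close-up";
type Content = "presenter" | "talking-face";

// Route to OmniHuman 1.5 when upper-body framing, presenter-driven content,
// or flagship-level realism is in play; otherwise Kling AI Avatar.
function pickLipsyncModel(framing: Framing, content: Content, flagship = false): string {
  if (framing === "upper-body" || content === "presenter" || flagship) {
    return "OmniHuman 1.5";
  }
  return "Kling AI Avatar";
}

console.log(pickLipsyncModel("upper-body", "presenter")); // "OmniHuman 1.5"
console.log(pickLipsyncModel("close-up", "talking-face")); // "Kling AI Avatar"
```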
OmniHuman wants a head-to-mid-torso portrait with the subject facing the camera (or at a three-quarter angle), a neutral closed-mouth expression, hands visible but not blocking the face, and even lighting. The model animates the visible upper body (shoulders, arms when in frame, and hands), so the more of the body visible in the source, the more natural the animation. Avoid hands directly in front of the face, side profiles, heavy shadows, sunglasses, and motion blur. For AI-generated portraits, generate at 2K minimum from Nano Banana 2 or Flux. Resolution requirement: 512×512 minimum on the face area, and 1024×1024 or larger on the full crop for clean upper-body animation.
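A quick pre-flight check can catch undersized portraits before you spend a render. The thresholds below come from this guide; the `faceBox` input is assumed to come from whatever face detector you already use, and the function itself is a sketch, not a Martini API.

```ts
interface Box { width: number; height: number; }

// Validate a source portrait against the resolution floors above.
function validatePortrait(crop: Box, faceBox: Box): string[] {
  const issues: string[] = [];
  if (crop.width < 1024 || crop.height < 1024) {
    issues.push("full crop below 1024×1024: upper-body animation may look soft");
  }
  if (faceBox.width < 512 || faceBox.height < 512) {
    issues.push("face area below 512×512: lipsync detail will degrade");
  }
  return issues;
}

// A 2K portrait with a 600px face passes cleanly.
console.log(validatePortrait({ width: 2048, height: 2048 }, { width: 600, height: 600 })); // []
```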
OmniHuman 1.5 reads emotional inflection in the audio and translates it to body language: an excited line gets a matching gesture, a contemplative pause produces a head tilt, an emphatic word triggers a hand gesture. This is OmniHuman's biggest differentiator over face-only lipsync models. Use ElevenLabs Eleven v3 with inline tags ([excited], [pause], [confidently]) to direct emotional delivery, and OmniHuman will animate the body to match. A speaking pace of 130-160 WPM produces the most natural body motion alongside the lipsync; very fast speech causes the body to "vibrate" with too-rapid micro-gestures.
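Because inline tags are directions rather than spoken words, strip them before checking pace. This sketch estimates WPM from the script and the rendered audio duration; the 130-160 WPM band is from this guide, and everything else is illustrative.

```ts
// Estimate words-per-minute for a tagged ElevenLabs v3 script.
function estimateWpm(script: string, audioSeconds: number): number {
  const spoken = script.replace(/\[[^\]]*\]/g, " "); // drop inline tags like [excited]
  const words = spoken.trim().split(/\s+/).filter(Boolean).length;
  return (words / audioSeconds) * 60;
}

const script =
  "[confidently] Welcome back. [pause] Today we ship the new dashboard. [excited] And it is fast.";
const wpm = estimateWpm(script, 6.5); // 13 spoken words over 6.5s ≈ 120 WPM
if (wpm > 160) console.warn(`Pace ${wpm.toFixed(0)} WPM: body motion may "vibrate"`);
if (wpm < 130) console.warn(`Pace ${wpm.toFixed(0)} WPM: delivery may read as sluggish`);
```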
Add a Tool node, select OmniHuman 1.5, and connect both an Image node (upper-body portrait) and an Audio node (speech) as inputs. Pick the output aspect ratio: 9:16 for vertical social (TikTok, Reels, Shorts), 16:9 for landscape presentations and YouTube, 1:1 for LinkedIn and feed posts. Output renders at 720p; for higher-resolution delivery, route through the video upscaler tool node afterward (see the upscale-video-to-4k how-to). The per-call cap is typically 30-60 seconds, so chunk longer scripts into multiple OmniHuman nodes in sequence on the canvas, each with the same portrait and a different audio segment; the sketch below shows the chunking math.
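Here is one way to plan sentence-aligned chunks from an estimated speaking pace. The planning function is an assumption for illustration; the actual split still happens in your TTS step, producing one audio file per OmniHuman node on the canvas.

```ts
// Split a long script into segments that each fit the per-call cap,
// breaking only at sentence boundaries so no chunk ends mid-thought.
function planChunks(script: string, wpm = 145, capSeconds = 30): string[] {
  const sentences = script.match(/[^.!?]+[.!?]+/g) ?? [script];
  const chunks: string[] = [];
  let current = "";
  let currentSec = 0;
  for (const sentence of sentences) {
    const sec = (sentence.trim().split(/\s+/).length / wpm) * 60;
    if (currentSec + sec > capSeconds && current) {
      chunks.push(current.trim()); // close the chunk before it overruns the cap
      current = "";
      currentSec = 0;
    }
    current += sentence;
    currentSec += sec;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Each returned string becomes one audio segment, one OmniHuman node.
console.log(planChunks(longScript).length);
```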
OmniHuman is the upper-body presenter pick; for tight close-up framing where the face is the entire frame, Kling AI Avatar is more efficient.
The portrait should be head-to-mid-torso with hands visible but not blocking the face. The more of the body visible, the more natural the animation.
Use ElevenLabs v3 inline tags ([excited], [pause], [confidently]) in the audio source — OmniHuman 1.5 reads emotional inflection and translates it to body language.
720p is the model's native output resolution; route through the video upscaler tool node afterward for 4K delivery (see the upscale-video-to-4k how-to).
Companion tool page: `models/tools/lip-sync` covers tool routing and chunking patterns. This how-to is the OmniHuman-paired pipeline specifically.
OmniHuman 1.5 produces presenter-style talking-head videos where the upper body animates naturally to the audio: gesture, head sway, blink timing, and micro-expression all synced to speech rhythm and emotional inflection. The pipeline is portrait + audio → synced video; it runs as an async tool node on the canvas and chunks longer scripts across multiple nodes. The trade-off vs. Kling AI Avatar: OmniHuman wins when presenter presence matters (executive updates, keynote-style marketing, UGC ads with body language), while Kling AI Avatar wins for tight close-up content where the face is the entire frame. For multilingual presenter content, OmniHuman + ElevenLabs Multilingual v2 + the same upper-body portrait produces a localized presenter who reads in 5+ languages with consistent body language across editions; a minimal localization loop is sketched below. The full pipeline runs on the Martini canvas; the companion `tools/lip-sync` page covers more advanced routing.
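This sketch shows the shape of that multilingual loop. `synthesize` and `renderOmniHuman` are hypothetical stand-ins for your TTS and render steps, not real Martini or ElevenLabs functions; only the model pairing (ElevenLabs Multilingual v2 + OmniHuman 1.5 + one shared portrait) comes from this guide.

```ts
// Hypothetical stand-ins for the TTS and render steps in the pipeline.
declare function synthesize(
  text: string, lang: string, model: "eleven_multilingual_v2",
): Promise<ArrayBuffer>;
declare function renderOmniHuman(
  portrait: string, audio: ArrayBuffer, aspect: "9:16" | "16:9" | "1:1",
): Promise<string>;

// One portrait, many languages: reusing the same source image keeps the
// presenter identity and body language consistent across every edition.
async function localizePresenter(
  portrait: string, scripts: Record<string, string>,
): Promise<Record<string, string>> {
  const videos: Record<string, string> = {};
  for (const [lang, text] of Object.entries(scripts)) {
    const audio = await synthesize(text, lang, "eleven_multilingual_v2");
    videos[lang] = await renderOmniHuman(portrait, audio, "16:9");
  }
  return videos;
}
```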
Connect OmniHuman 1.5 with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started Free