How to Build a Consistent AI Character Across Images and Video
Reference workflows that keep character identity stable across image and video generations on Martini.
Key takeaways
- Character consistency is a workflow problem, not a model problem — pick the right models, then build a reusable reference library once.
- Use Nano Banana 2 as the generation model for the character library; it is tuned for multi-image reference and identity stability.
- Five canonical references — front, three-quarter left, three-quarter right, profile, smiling close-up — are the minimum library to start.
- Pair Nano Banana 2 with Flux Kontext for outfit and accessory swaps without breaking the face.
- Carry identity into video by wiring the character image into Seedance 2 Omni, Kling Avatar, or Vidu — the image-side library is the prerequisite for video consistency.
What "consistent character" actually requires
A consistent AI character is one whose face, build, hair, and signature wardrobe are recognizable across every image and every video frame in the project — across hundreds of generations, across weeks of production, and across multiple creators on the same team. Most people who try to build this hit the wall on generation three or four; the face drifts, the hair changes, the eye color subtly shifts. The reason is almost never the model alone. It is that the workflow has not been set up to enforce consistency at every step.
Consistency is a property of the canvas, not of any single generation. If you treat each new shot as an independent text-to-image prompt, you will get drift. If you treat each new shot as a multi-reference generation that pulls from a canonical library, you will get consistency. The work is in building the library and wiring it into every downstream node, which is exactly the workflow the Martini canvas is designed for.
There are three layers to get right: the model choice (Nano Banana 2 is the canvas's strongest character model), the reference library (five canonical images, pinned), and the chaining pattern (every downstream image and video node references the library directly, never a derived take). Get those three right and consistency stops being a worry.
Step 1 — Build the canonical reference library
Start with one detailed character description as text: age range, build, ethnicity, hair color and style, eye color, distinguishing features (a freckle, a scar, a particular jawline), and signature wardrobe. The more specific this is, the better. "Mid-thirties woman, athletic build, dark brown hair in a low ponytail, hazel eyes, small mole above left eyebrow, usually wears a charcoal cashmere sweater" is workable. "Beautiful young woman" is not.
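If it helps to keep that description disciplined, you can hold it as structured data and render the prompt from it, so no field ships empty. A minimal Python sketch; the field names are illustrative, not a Martini schema:

```python
# Illustrative character spec; field names are our own, not a Martini schema.
CHARACTER = {
    "subject": "woman",
    "age": "mid-thirties",
    "build": "athletic",
    "hair": "dark brown hair in a low ponytail",
    "eyes": "hazel",
    "distinguishing": "small mole above left eyebrow",
    "wardrobe": "charcoal cashmere sweater",
}

def render_description(spec: dict) -> str:
    """Render the spec into one prompt sentence, failing loudly on gaps."""
    missing = [key for key, value in spec.items() if not value]
    if missing:
        raise ValueError(f"underspecified character: {missing}")
    return (f"{spec['age']} {spec['subject']}, {spec['build']} build, "
            f"{spec['hair']}, {spec['eyes']} eyes, "
            f"{spec['distinguishing']}, usually wears a {spec['wardrobe']}")

print(render_description(CHARACTER))
```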
Drop a Nano Banana 2 image node, paste the description, and generate four to six takes of the front view. Pick the strongest as your canonical front view and pin it in the version tray. This pinned image is now the source of truth — every downstream node will reference it.
Duplicate the Nano Banana 2 node four times. In each duplicate, wire the canonical front view in as a reference and prompt for an additional angle: three-quarter left, three-quarter right, profile, smiling close-up. Pin the strongest take from each. You now have a five-image canonical library that defines the character at the angles you will need downstream. For more demanding projects (an AI influencer feed, a recurring spokesperson), expand to ten or fifteen seed images covering full-body views, more outfits, and more emotional expressions.
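The four duplicates differ only in the angle line, which makes the prompts worth templating. A sketch of that templating in Python, with illustrative wording; the identity itself comes from the wired canonical front view, not from the text:

```python
# The four additional angles, each generated in a duplicate node that has
# the pinned canonical front view wired in as a reference.
ANGLES = ["three-quarter left", "three-quarter right",
          "profile", "smiling close-up"]

def angle_prompt(angle: str) -> str:
    # Prompt wording is illustrative; identity comes from the wired reference.
    return (f"Same character as the reference image, {angle} view, "
            "identical face, hair, and wardrobe, neutral studio lighting")

angle_prompts = {angle: angle_prompt(angle) for angle in ANGLES}
```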
Step 2 — Generate every new image against the library
For every new shot you produce, drop a fresh Nano Banana 2 node and wire in three or four images from the canonical library. Choose which references to wire based on what the shot demands: front view plus three-quarter for face anchoring, plus the smiling close-up if the new shot needs that expression, plus a full-body reference if the new shot is a wide.
Write the prompt for the new shot as: action and environment first, then explicit attribution of which reference governs which attribute. "Same character standing at a coffee bar in soft morning light, referencing image 1 for face and hair, image 2 for outfit, full-body proportions following image 4." Be explicit about which image controls which attribute and Nano Banana 2 will follow the mapping.
The cardinal rule: never chain references through derived takes. Every new generation should reference the canonical library directly, not a previous shot you happened to like. Each chained derivation introduces small drift; over twenty shots this drift is visible. Going back to the source library on each generation keeps drift bounded.
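Written as data, the rule is compact: the reference pool for any new shot is the pinned canonical set, and the prompt names which reference governs which attribute. A Python sketch, with placeholder pin ids of our own:

```python
# Pinned canonical takes. The ids are placeholders for whatever the version
# tray holds; the invariant is that this dict is the only reference pool
# any new shot draws from.
CANONICAL = {
    "front": "pin_front",
    "three_quarter_left": "pin_34L",
    "three_quarter_right": "pin_34R",
    "profile": "pin_profile",
    "smiling_closeup": "pin_smile",
}

def references_for_shot(needs_smile: bool) -> list[str]:
    refs = [CANONICAL["front"], CANONICAL["three_quarter_left"]]
    if needs_smile:
        refs.append(CANONICAL["smiling_closeup"])
    return refs  # always canonical pins, never a derived take

def shot_prompt(action_env: str, attribution: dict[int, str]) -> str:
    mapping = ", ".join(f"image {i} for {attr}"
                        for i, attr in attribution.items())
    return f"{action_env}, referencing {mapping}"

print(shot_prompt(
    "Same character standing at a coffee bar in soft morning light",
    {1: "face and hair", 2: "outfit"},
))
```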
Step 3 — Use Flux Kontext for wardrobe and accessory swaps
When you need the same character in a new outfit or with a different accessory, do not re-prompt Nano Banana 2 — that tends to subtly shift the face along with the outfit. Instead, drop a Flux Kontext node downstream of a chosen Nano Banana 2 take and use Kontext to swap only the clothing or accessory. The face stays. The pose stays. Only the targeted region changes.
On the canvas, this looks like: pick the Nano Banana 2 take that has the right pose and expression, drop a Flux Kontext node wired to it, mask the clothing region, and prompt for the new outfit. Repeat the Kontext step for each variant — same character, ten outfits, all from one Nano Banana 2 base. This is the production backbone of any character-driven content feed.
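The ten-outfit pattern reduces to one base take plus one edit spec per variant. A sketch of those specs as plain data; the mask is drawn on the canvas itself, and the field names here are ours, not canvas internals:

```python
BASE_TAKE = "pin_spokesperson_pose"  # the chosen Nano Banana 2 take
OUTFITS = ["navy blazer over a white tee", "olive field jacket",
           "cream turtleneck", "denim work shirt"]

def kontext_edit(outfit: str) -> dict:
    # One Flux Kontext node per variant, all wired to the same base take.
    return {
        "source": BASE_TAKE,
        "mask": "clothing region",  # masked on the canvas, not in text
        "prompt": (f"Replace the outfit with a {outfit}; keep face, pose, "
                   "lighting, and background unchanged"),
    }

variants = [kontext_edit(outfit) for outfit in OUTFITS]
```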
Use Kontext also for surgical fixes — a misplaced hand, an off-color background prop, a logo that drifted on a shirt. These are the kinds of edits that would force a re-roll on a generation model and that Kontext handles in seconds without disturbing the rest of the frame.
Step 4 — Carry identity into video
Image-side consistency is the prerequisite for video consistency. Without a stable image library, your video will drift. With one, video models with strong image conditioning will hold the identity through motion. The three video nodes to reach for are: Vidu when you need fast iteration on character video, Kling 3 (and especially Kling Avatar for dialogue) for character-driven shots with subtle performance, and Seedance 2 Omni for cinematic image-to-video where motion realism is the deciding factor.
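The choice can be held as a small dispatch table keyed on the shot's deciding factor; a sketch, with our own labels for the criteria:

```python
def pick_video_model(priority: str) -> str:
    # The shot's deciding factor maps to a model, per the criteria above.
    table = {
        "fast iteration": "Vidu",
        "dialogue": "Kling Avatar",
        "subtle performance": "Kling 3",
        "cinematic motion": "Seedance 2 Omni",
    }
    return table[priority]
```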
For each video shot, wire one or two images from the canonical library into the video node. The same multi-reference principle applies: choose the references for the angle and expression the shot needs. Write the motion prompt as a single take with one action — see the image-to-video guide for the full prompt structure. The video output will hold the identity because the reference is shared.
For talking-head video, the chain is Nano Banana 2 still into Kling Avatar with audio from ElevenLabs or Fish Audio S2. Avatar handles lip-sync; the Nano Banana 2 reference holds the face. This is the cleanest production pattern for a recurring spokesperson where the same character delivers different scripts across episodes — the character looks like the same person across every episode because the image library is the same across every Avatar node.
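Per episode, only the script audio changes; the still and the library stay fixed. A sketch of the chain as plain data, with node labels of our own rather than canvas internals:

```python
def episode_chain(script: str) -> list[dict]:
    # The still and its pin are fixed across episodes; only audio changes.
    return [
        {"node": "still", "model": "Nano Banana 2", "pin": "pin_front"},
        {"node": "audio", "model": "ElevenLabs", "text": script},
        {"node": "avatar", "model": "Kling Avatar",
         "inputs": ["still", "audio"],
         "prompt": ("subtle gestures, eye contact with camera, "
                    "slight head movement on emphasis")},
    ]
```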
Step 5 — Maintain the library over time
A character library is a living document, not a one-time setup. As the project runs, you will produce new canonical references — a new outfit becomes signature, a new pose becomes recurring, a new emotional expression earns a permanent slot. Pin those new references in the canvas's version tray and treat them as additions to the library. The next round of generations references the expanded set.
Avoid library bloat. Three to four references per generation is the sweet spot; if your library grows to fifty references and you start passing eight per generation, you will lose the precision that makes Nano Banana 2 worth picking. Keep the library curated. Retire references that are no longer the strongest for a given angle or expression.
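Curation amounts to replacing a slot's pin in place and capping how many references any one generation receives. A Python sketch of both rules:

```python
MAX_REFS_PER_GENERATION = 4

def promote(library: dict[str, str], slot: str, new_pin: str) -> None:
    """Replace the canonical take for a slot; the old pin stays in the tray."""
    library[slot] = new_pin

def select_refs(library: dict[str, str], wanted: list[str]) -> list[str]:
    refs = [library[slot] for slot in wanted if slot in library]
    return refs[:MAX_REFS_PER_GENERATION]  # guard against library bloat
```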
For team workflows, the canvas itself is the shared reference. A teammate opening the project sees the same library, the same pins, the same canonical takes. There is no separate file structure to maintain. The character lives in the canvas; everyone generates against it; consistency is automatic.
How Martini changes the workflow
Outside a canvas-based tool, character consistency is a discipline problem solved by spreadsheets, file naming, and remembering which version of which image is the "real" one. Most teams give up halfway and settle for "close enough." On the Martini canvas, consistency is a structural property of the workspace — the library is pinned, every node references it, and the version tray remembers everything.
The deeper unlock is editability. Update the canonical front view and every downstream node re-renders against the new reference. Swap the wardrobe in Kontext and every video node downstream picks up the new outfit. Reorganize the library and the whole project re-aligns. Character consistency stops being a battle you fight every shot and becomes a property of how the canvas is wired. That is the workflow change.
Workflow example
Recurring spokesperson series on Martini: build the five-image canonical library with Nano Banana 2 (front, three-quarter left, three-quarter right, profile, smiling close-up). For each weekly episode, drop a Nano Banana 2 node for the new scene (kitchen, office, outdoor walk, studio backdrop) and wire in the front and smiling references. Pin the chosen still. Drop a Kling Avatar node downstream, wire the still and an ElevenLabs audio node carrying the episode's script, prompt for "subtle gestures, eye contact with camera, slight head movement on emphasis." Render two takes, pick the stronger. Wire it into the NLE export node with intro and outro segments. Repeat for each episode. The character looks like the same person every week because the library is shared across every node.
Related reading
- Nano Banana 2 Workflows for Multi-Image Reference and Character Consistency — multi-image reference and character consistency workflows on Martini using Nano Banana 2.
- Kling 3 Guide: Variants, Use Cases, and How to Choose — Kling 3, O3, and Avatar variants, and when to use each, on Martini.
- How to Turn an Image Into Video With AI — end-to-end image-to-video workflow on Martini: model choice, motion control, and chaining shots.
Frequently asked questions
- Which model is best for AI character consistency?
- Nano Banana 2 on the image side, paired with Flux Kontext for edits. For video, Kling Avatar for talking heads, Kling 3 for character motion, Seedance 2 Omni for cinematic shots, Vidu for fast iteration. Identity carries from the shared image references.
- How many reference images do I need to start?
- Five canonical references — front, three-quarter left, three-quarter right, profile, smiling close-up — are the minimum. For demanding projects like an AI-influencer feed, expand to ten or fifteen seed images covering more outfits and emotional expressions.
- How do I change my character's outfit without breaking the face?
- Pick a Nano Banana 2 take that has the right pose and expression, then drop a Flux Kontext node downstream and use Kontext to swap only the clothing region. Re-prompting Nano Banana 2 for an outfit change tends to subtly shift the face.
- Why does my character drift after twenty generations?
- You are probably chaining references through derived takes. Always reference the canonical library directly, not a previous shot you happened to like. Each chained derivation introduces small drift; going back to the source on each generation keeps drift bounded.
- Can I keep the same character consistent in talking-head video?
- Yes — wire the canonical Nano Banana 2 still into a Kling Avatar node along with the dialogue audio. Avatar handles lip-sync; the still holds the face. This is the cleanest pattern for a recurring spokesperson across multiple episodes.
- Should I train a LoRA instead of using a reference library?
- Usually not. LoRA training takes time, requires curated data, and locks you to one model. A multi-image reference library on Nano Banana 2 gives you most of the consistency benefit, runs on the canvas immediately, and adapts as the character evolves. Consider LoRA only for very high-volume work where the cost difference per generation justifies the training overhead.
Ready to try it on the canvas?
Open Martini and fan your prompt across every frontier model in one workflow.