How to Turn an Image Into Video With AI
End-to-end image-to-video workflow on Martini — model choice, motion control, and chaining shots.
Key takeaways
- Start with the still — the quality of your input image is the single biggest predictor of how the video will look.
- Pick the model based on the shot type: Seedance 2 for cinematic image-to-video, Kling 3 for character motion, Veo for environmental wides.
- Write the motion prompt as a single shot, not a paragraph of mood. Subject + action + camera move + lens + lighting + atmosphere.
- Iterate on the prompt while the image stays pinned in the canvas. Do not re-upload between takes.
- For multi-shot sequences, share one reference image across multiple parallel video nodes and assemble downstream in the NLE export node.
What "image to video" actually means in production
Image-to-video is the workflow where you start with a still — generated, uploaded, or pulled from a library — and produce a moving clip from it. The image acts as the first frame, the visual reference, or both, depending on the model. This is the dominant entry point into AI video for most production teams because it gives you a level of control that pure text-to-video does not. You decide what the frame looks like; the model decides how it moves.
On the Martini canvas, image-to-video lives as a chain pattern: image node feeds into video node. The image node can be Nano Banana 2, GPT Image 2, Flux, Imagen 4, or any other generation model. The video node can be Seedance 2, Kling 3, Vidu, Veo, or whatever fits the shot. The connection between them is the key to the workflow — once you wire it, you iterate the motion prompt against a fixed reference image, which is the loop that produces consistent, controlled output.
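To make the chain pattern concrete, here is a minimal sketch of the wiring as plain data. The class and field names are hypothetical stand-ins, not Martini's actual API; the point is the shape of the pattern: one image node, one video node, one connection that lets you iterate the prompt against a fixed reference.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for canvas nodes -- illustrative only,
# not Martini's real node API.

@dataclass
class ImageNode:
    model: str                      # e.g. "Nano Banana 2", "GPT Image 2", "Flux"
    prompt: str
    pinned_take: str | None = None  # id of the chosen still

@dataclass
class VideoNode:
    model: str                      # e.g. "Seedance 2", "Kling 3", "Vidu", "Veo"
    motion_prompt: str
    image_input: ImageNode | None = None  # the wired reference

# Wire the chain: generate the image once, pin the take, then iterate
# the motion prompt while the reference stays fixed.
hero = ImageNode(model="Nano Banana 2", prompt="product on a marble counter")
hero.pinned_take = "take_03"

shot = VideoNode(
    model="Seedance 2",
    motion_prompt="slow dolly in from wide to medium close-up",
    image_input=hero,  # the key connection in the workflow
)
```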
The mistake to avoid is treating image-to-video as "generate an image, hand it to a video model, hope for the best." Every step in the chain is intentional. The image is composed for motion. The model is chosen for the shot type. The prompt is written for a single take. The take is reviewed and iterated. Skip any of those four steps and you lose control.
Step 1 — Prepare the input image for motion
A great still does not always make a great input for video. The properties that matter for motion are different from the ones that matter for a finished poster. Compositionally, leave room around the subject — the camera needs space to move into and around the frame. A face filling 90% of the canvas leaves the video model with no room to push in or pull back; a face occupying 40-50% gives it room to work. Similarly, depth of field cues (a foreground element, a background falloff) help the model understand the spatial layout and produce more plausible camera moves.
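As a quick, tool-agnostic sanity check on that framing rule before you wire the still in, you can compute how much of the frame the subject's bounding box covers. A minimal sketch; the thresholds mirror the guidance above, and where the bounding box comes from (a detector, a manual crop) is up to you.

```python
def subject_occupancy(box_w: int, box_h: int, frame_w: int, frame_h: int) -> float:
    """Fraction of the frame area covered by the subject's bounding box."""
    return (box_w * box_h) / (frame_w * frame_h)

# Illustrative thresholds from the guidance above: ~90% leaves no room
# for a push-in; 40-50% gives the camera room to work.
occ = subject_occupancy(box_w=860, box_h=1040, frame_w=1920, frame_h=1080)
if occ > 0.6:
    print(f"Subject fills {occ:.0%} of frame: consider a wider composition")
elif occ < 0.2:
    print(f"Subject fills {occ:.0%} of frame: reads closer to an environmental wide")
else:
    print(f"Subject fills {occ:.0%} of frame: room for the camera to work")
```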
Avoid frames that are too busy. Fine detail in every corner looks great as a still but gives the video model conflicting signals during motion — small elements may flicker, drift, or melt as the camera moves. If the still has a crowded background, run it through a Flux Kontext background simplification pass first, or generate a cleaner variant with a less detailed environment.
For character work, the image should clearly establish the character — face visible, identity readable, lighting consistent enough that the video model can extrapolate from it. Profile shots are harder to animate than three-quarter or front views; if you need a profile in the final video, it is often better to animate from a three-quarter view and let the model rotate to profile during the take.
Step 2 — Pick the right video model for the shot
The model choice is the most consequential decision in the workflow. The wrong model for a shot will fight you no matter how good the prompt is. The decision tree we use is straightforward: if the shot is dominated by a character speaking, pick Kling Avatar. If it is dominated by a character moving without dialogue, pick Kling 3. If it is a cinematic shot driven by camera motion and atmosphere, pick Seedance 2. If it is an environmental wide where the camera covers a lot of space, pick Google Veo.
Within those buckets, choose the variant by the cost-quality tradeoff. Seedance 2 Pro for hero shots, Lite for iteration. Kling 3.0 for hero, O3 for iteration. Veo only when you genuinely need its long-range coherence — it is the most expensive option. Vidu is a strong fast-iteration alternative for character motion at lower cost than Kling 3.0.
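The decision tree is simple enough to write down literally. In the sketch below, the shot-type labels are our own shorthand, not anything Martini or the models expose; it just encodes the two choices above, shot type first, cost-quality variant second.

```python
def pick_video_model(shot_type: str, hero_shot: bool = False) -> str:
    """Encode the decision tree: shot type picks the model family,
    hero vs. iteration picks the cost-quality variant."""
    if shot_type == "character_dialogue":
        return "Kling Avatar"
    if shot_type == "character_motion":
        # Vidu is the cheaper fast-iteration alternative here.
        return "Kling 3.0" if hero_shot else "Kling O3"
    if shot_type == "cinematic":
        return "Seedance 2 Pro" if hero_shot else "Seedance 2 Lite"
    if shot_type == "environmental_wide":
        return "Veo"  # most expensive; only for genuine long-range coherence
    raise ValueError(f"unknown shot type: {shot_type}")

print(pick_video_model("cinematic"))               # Seedance 2 Lite
print(pick_video_model("character_motion", True))  # Kling 3.0
```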
When you are uncertain, drop two parallel video nodes wired to the same image, set them to different models, and render both. The version tray shows you both takes side by side and the right pick is usually obvious. This is cheap with Lite and O3 variants; do it for any shot you are not confident about.
Step 3 — Write the motion prompt as a single shot
Write each motion prompt as if you were directing one shot for a DP. The structure that holds across Seedance 2, Kling 3, and Veo is: subject + action + camera move + lens + lighting + atmosphere. For example, "the woman remains still for the first beat, then slowly turns her head to camera left and offers a small smile, slow dolly in from medium-wide to medium close-up, anamorphic 35mm lens, warm golden-hour light, faint dust in the air." That single line directs the entire take.
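One way to hold yourself to that structure is to fill the six slots explicitly and join them. A hypothetical helper, not a Martini feature; it reproduces the example line above.

```python
def motion_prompt(subject: str, action: str, camera: str,
                  lens: str, lighting: str, atmosphere: str) -> str:
    """Assemble one shot's prompt in the fixed order:
    subject + action + camera move + lens + lighting + atmosphere."""
    return ", ".join([f"{subject} {action}", camera, lens, lighting, atmosphere])

print(motion_prompt(
    subject="the woman",
    action="remains still for the first beat, then slowly turns her head "
           "to camera left and offers a small smile",
    camera="slow dolly in from medium-wide to medium close-up",
    lens="anamorphic 35mm lens",
    lighting="warm golden-hour light",
    atmosphere="faint dust in the air",
))
```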
Keep the prompt to one action per shot. "She walks across the room, picks up the cup, then turns to the window" is three shots — not one. AI video models will compress that into a half-second blur and ruin the take. Split actions into separate generations and cut on the canvas. This is the single most common prompting mistake we see.
When you have an image input wired in, drop most of the visual description from the prompt and lean on the reference. The prompt becomes pure motion direction: "subject begins still, then slow head turn to camera left, micro smile in the last second, slow push-in." This is the most controllable mode and the one to use for any shot that needs to match other shots in a sequence.
Step 4 — Iterate against the pinned image
The biggest workflow advantage of generating image-to-video on the Martini canvas is that the reference image stays pinned while you iterate the motion prompt. Render a take, watch it, refine the prompt, render again. The image does not move; only the prompt changes. The version tray keeps every take so you can compare across iterations without losing earlier work. This loop typically produces a usable take within three to six iterations on a well-prepared image.
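The shape of that loop is worth seeing in miniature. In the sketch below, render_take is a placeholder for whatever actually produces the clip, not a real Martini call; the point is that the image reference never changes between takes, only the prompt does, and every take is kept.

```python
def render_take(image_ref: str, prompt: str) -> str:
    # Placeholder for the actual generation call.
    return f"clip(image={image_ref!r}, prompt={prompt!r})"

image_ref = "hero_still_take_03"          # pinned; never changes in the loop
version_tray: list[tuple[str, str]] = []  # every (prompt, take) is kept

base = "subject begins still, slow head turn to camera left, slow push-in"
refinements = [
    ", end framed as a medium close-up",   # take 1 drifted wide: fix camera language
    ", hold the stillness a full second",  # take 2 was mushy: sharpen the beat
]

prompt = base
version_tray.append((prompt, render_take(image_ref, prompt)))
for r in refinements:
    prompt += r  # only the prompt changes between takes
    version_tray.append((prompt, render_take(image_ref, prompt)))

best = version_tray[-1]  # pin the strongest take; keep the rest in the tray
```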
When the take is wrong, diagnose where the problem is. If the framing is off, the issue is the camera move language — be more specific about start and end frame positions. If the motion is mushy, the issue is the action description — be more direct about what happens in each beat. If the character drifts, the issue is the reference signal — make sure you are using the model variant that respects image input strongly (Seedance 2 Omni rather than 2 Pro, for example).
Pin the strongest take when you have it. Do not delete the others — the version tray costs nothing to keep, and a take you did not pick today might be the right one tomorrow when the brief changes.
Step 5 — Chain into a multi-shot sequence
Once you have one shot working, build the rest of the sequence by duplicating the video node and varying the motion prompt while keeping the same image input. Same character, same setting, different camera moves and actions across multiple takes. This is how you produce a multi-shot sequence that holds visual continuity without re-rolling the character or environment for each shot.
Wire all the chosen takes into an NLE export node downstream. The NLE node assembles the takes in the order you wire them, applies cuts, and exports a single edited piece. This is meaningfully cleaner than re-rendering inside an external editor because the canvas keeps the source takes editable — change a prompt upstream and the whole sequence updates.
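In the same hypothetical vocabulary as the earlier sketches, the fan-out is one pinned image shared by several video nodes, with the wiring order doubling as the edit order. Illustrative only; nle_assemble stands in for the NLE export node.

```python
# One pinned image, N motion prompts, assembled in wire order.
pinned_image = "hero_still_take_03"

shots = [  # duplicated video nodes: same image input, different motion prompts
    ("Seedance 2", "slow dolly in from wide to medium close-up, hold on label"),
    ("Seedance 2", "macro pull-back from extreme close-up of the cap to medium-wide"),
    ("Seedance 2", "slow arc around the bottle, shallow depth of field"),
]

def nle_assemble(image: str, takes: list[tuple[str, str]]) -> list[str]:
    """Stand-in for the NLE export node: cuts the takes together in wire order."""
    return [f"{model}: {prompt} [ref={image}]" for model, prompt in takes]

for i, cut in enumerate(nle_assemble(pinned_image, shots), start=1):
    print(f"shot {i}: {cut}")
```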
For longer pieces, build the canvas as a row of parallel image-to-video chains, one per shot. Cluster related shots so the canvas reads as a storyboard from left to right. The NLE node sits at the right edge and pulls from each chain in order. This pattern scales to ten- or twenty-shot sequences without becoming unmanageable.
How Martini changes the workflow
Outside the Martini canvas, image-to-video is a multi-tool dance — generate the image somewhere, download it, upload to the video tool, prompt, download the take, edit elsewhere. Each step loses fidelity, breaks the iteration loop, and silently makes consistency harder. On the canvas, the entire chain — image, video, edit, export — runs in one place with the references shared between nodes and the version tray remembering every take.
The unlock is not just convenience. It is that the workflow becomes a shape you can edit. Change the source image upstream and every downstream video node re-renders against the new reference. Swap a video model on one shot without touching the others. Reorder the cut by re-wiring the NLE node. The canvas becomes the production document, not just a tool to make one take.
Workflow example
Three-shot product reveal turning a single still into a finished video on Martini: drop a Nano Banana 2 image node and generate the hero shot of the product on a marble counter. Pin the chosen take. Drop three Seedance 2 nodes downstream, all wired to the same image. Prompt the first for "slow dolly in from wide to medium close-up, hold on label," the second for "macro pull-back from extreme close-up of the cap to medium-wide, soft window light," and the third for "sweep around the bottle in a slow arc, shallow depth of field." Render two takes per node, pick the strongest of each. Wire all three chosen takes into the NLE export node in order. Export. Total elapsed time: roughly twenty minutes from blank canvas to finished sequence.
Related reading
Seedance 2 Handbook: Variants, Best Workflows, and How to Use It on Martini
Hands-on guide to Seedance 2 — variants, strengths, and the production workflows it fits on Martini's canvas.
Kling 3 Guide: Variants, Use Cases, and How to Choose
Kling 3, O3, and Avatar variants — when to use each, on Martini.
How to Build a Consistent AI Character Across Images and Video
Reference workflows that keep character identity stable across image and video generations on Martini.
Frequently asked questions
- What's the best AI model to turn an image into a video?
- It depends on the shot. Seedance 2 for cinematic image-to-video, Kling 3 for character motion, Kling Avatar for talking heads, Veo for environmental wides. The right answer is usually two parallel takes against the same image so you can compare directly.
- How long should the motion prompt be?
- One shot, one prompt — typically two or three sentences. Cover subject, action, camera move, lens, lighting, atmosphere. Resist the urge to describe multiple actions; AI video models compress them into mush.
- Should the input image be busy or simple?
- Simpler is almost always better for motion. Leave room around the subject for the camera to move. If the still is detail-heavy, run it through a Flux Kontext background simplification pass before wiring it into the video node.
- How do I extend the clip past the model length cap?
- Chain the video node into a Runway Aleph or Wan continuation node. Aleph holds tonal grade most cleanly. Re-rolling the same model usually produces a visible cut at the splice — use a continuation model instead.
- Can I keep the same character across multiple shots?
- Yes — generate the character once with Nano Banana 2, pin the still, and wire it into every video node in the sequence. Use Seedance 2 Omni or Kling Avatar variants that respect image input strongly. Identity carries through.
- Do I need to download the image and re-upload to the video tool?
- No — that is the whole point of running this workflow on Martini. The image stays pinned in the canvas and the video nodes reference it directly. Iterate the prompt without ever re-uploading.
Ready to try it on the canvas?
Open Martini and fan your prompt across every frontier model in one workflow.