OpenAI
Use Sora 2 as the downstream camera-move engine for a Text-to-3D-Scene workflow on Martini — captured stills from the navigable Marble scene feed into Sora 2 video nodes for cinematographic shots that respect the scene's spatial structure. Sora 2 does not generate the scene itself; the scene comes from a text-conditioned Marble 3D node (or from an upstream Midjourney/FLUX.2 frame routed into Marble). Marble's output is a canvas-internal navigable preview, not a portable .obj, .fbx, .glb, or USD mesh — Sora 2 takes the captured stills as starting frames and produces motion clips that all share the same locked location.
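To make the graph shape concrete, here is a minimal sketch of the node wiring as plain Python data. Martini is a visual canvas, so the `Node` class and the node-type strings below are illustrative stand-ins for the description above, not a real SDK.

```python
# Illustrative only: plain-Python stand-ins for the canvas nodes described above.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "marble_3d", "image_capture", "sora2_i2v", "sequence"
    config: dict = field(default_factory=dict)

# Scene generation: text-conditioned, or image-conditioned via a Midjourney/FLUX.2 frame
marble = Node("marble_3d", {"prompt": "foggy alley at dusk, neon signs, wet cobblestones"})

# Stills captured inside the navigable preview become the Sora 2 starting frames
captures = [Node("image_capture", {"angle": a})
            for a in ("front", "three_quarter_left", "three_quarter_right", "back_over_shoulder")]

# One image-to-video node per captured angle; outputs land in the sequence builder
shots = [Node("sora2_i2v", {"start_frame": c, "duration_s": 8}) for c in captures]
sequence = Node("sequence", {"clips": shots})
```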
Sora 2 needs a starting frame from a real spatial source. Upstream: drop a 3D node configured for Marble. For text-only: write the location prompt directly ("foggy alley at dusk, neon signs, wet cobblestones"). For stronger reconstruction: generate a concept frame on Midjourney or FLUX.2 first and wire it into Marble as image conditioning. Marble runs ~5 minutes; output is a navigable canvas-internal preview.
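A minimal sketch of the two conditioning modes, assuming a hypothetical `run_marble` helper that stands in for the Marble 3D node (the name and signature are assumptions, not a documented call):

```python
# Hypothetical helper: the name and signature are illustrative, not a real SDK.
def run_marble(prompt: str, reference_image: str | None = None) -> str:
    """Stand-in for the Marble 3D node: returns a handle to the canvas-internal preview."""
    mode = "image-conditioned" if reference_image else "text-only"
    print(f"Marble run ({mode}), ~5 min: {prompt!r}")
    return "marble_preview_handle"

# Text-only: the location prompt goes straight into the node
preview = run_marble("foggy alley at dusk, neon signs, wet cobblestones")

# Stronger reconstruction: a Midjourney/FLUX.2 concept frame as image conditioning
preview = run_marble("foggy alley at dusk, neon signs, wet cobblestones",
                     reference_image="concept_frame.png")
```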
Inside the navigable Marble preview, capture stills from the four-angle pattern: front view, three-quarter left, three-quarter right, back/over-shoulder. Each capture lands as an image node. Capture more than you need — re-running Marble produces a different scene, so screenshot first, iterate later. These are the Sora 2 starting frames.
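The capture pass, sketched as a checklist; the filenames are hypothetical, since the actual captures happen by hand inside the navigable preview:

```python
# The four-angle capture pattern as a checklist. Captures are manual, inside the
# Marble preview; this just records which angles exist before any Marble re-run.
ANGLES = ["front", "three_quarter_left", "three_quarter_right", "back_over_shoulder"]

captured_stills = {}
for angle in ANGLES:
    # Hypothetical filenames: each capture lands on the canvas as an image node.
    captured_stills[angle] = f"marble_capture_{angle}.png"

# Capture more than you need: a Marble re-run produces a different scene,
# so every still you want must exist before iterating.
assert len(captured_stills) == len(ANGLES)
```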
Sora 2 has a deep understanding of 3D space, motion, and scene continuity, and captured stills from a Marble scene are a clean input format for it. Wire each captured angle into its own Sora 2 image-to-video node. The video model inherits the spatial structure from the still and produces motion that respects parallax, occlusion, and depth.
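Sketching the fan-out under stated assumptions: `sora2_image_to_video` and the capture filenames below are illustrative stand-ins, not documented endpoints.

```python
# Hypothetical stand-in for a Sora 2 image-to-video node; not a documented endpoint.
def sora2_image_to_video(start_frame: str, prompt: str, seconds: int = 8,
                         aspect_ratio: str = "16:9") -> str:
    """Returns a path to the generated clip (illustrative only)."""
    return start_frame.replace(".png", "_clip.mp4")

captured_stills = {                      # stills captured from the Marble preview
    "front": "marble_capture_front.png",
    "three_quarter_left": "marble_capture_three_quarter_left.png",
}

# Fan out: one Sora 2 node per captured angle, so each clip inherits the
# parallax, occlusion, and depth cues baked into its own still.
clips = {angle: sora2_image_to_video(still,
                                     prompt="slow camera push forward through the alley")
         for angle, still in captured_stills.items()}
```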
Use cinematographic verbs in the Sora 2 prompt: "slow camera push forward through the alley toward the ramen shop," "gentle orbit clockwise around the central fixture," "static camera, neon flickers in the foreground." Sora 2 maps these directly to its training distribution. Avoid generic verbs ("move closer," "spin") — they leave the model guessing.
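As a concrete contrast, a small lookup from generic verbs to the cinematographic phrasing used in this guide's examples (the mapping paraphrases this page, not an official prompt spec):

```python
# Generic verb -> cinematographic phrasing, paraphrased from the examples in this guide.
PROMPT_REWRITES = {
    "move closer": "slow camera push forward through the alley toward the ramen shop",
    "spin":        "gentle orbit clockwise around the central fixture",
    "hold still":  "static camera, neon flickers in the foreground",
}

def sharpen(prompt: str) -> str:
    """Swap a generic camera verb for its cinematographic equivalent, if one is known."""
    return PROMPT_REWRITES.get(prompt, prompt)

print(sharpen("move closer"))
```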
For sequences longer than one Sora 2 clip, route the last frame of clip N into a frame-extraction tool node, then feed it as the starting frame of clip N+1. Combined with the locked Marble scene, this gives both spatial AND temporal continuity. The scene locks the location; frame chaining locks the motion thread.
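The chaining loop as a sketch; `extract_last_frame` and `sora2_image_to_video` are hypothetical stand-ins for the frame-extraction tool node and the Sora 2 node.

```python
# Hypothetical stand-ins for the frame-extraction tool node and the Sora 2 node.
def extract_last_frame(clip_path: str) -> str:
    return clip_path.replace(".mp4", "_last.png")

def sora2_image_to_video(start_frame: str, prompt: str, seconds: int = 8) -> str:
    return start_frame.replace(".png", "_clip.mp4")

shot_prompts = [
    "slow camera push forward through the alley toward the ramen shop",
    "gentle orbit clockwise around the lantern outside the shop",
    "dolly forward with parallax revealing depth of the alley behind",
]

# Clip N's last frame becomes clip N+1's starting frame: the Marble scene locks
# the location, this loop locks the motion thread.
start_frame = "marble_capture_front.png"
clips = []
for prompt in shot_prompts:
    clip = sora2_image_to_video(start_frame, prompt, seconds=8)
    clips.append(clip)
    start_frame = extract_last_frame(clip)
```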
Drop the Sora 2 outputs into Martini's sequence builder in story order. Each clip is 5-10s. Layer audio (ElevenLabs Eleven v3 + Minimax Music). Export as native sequence to Premiere, DaVinci Resolve, or Final Cut. The locked Marble scene made the multi-shot read as one place; the NLE export is the final delivery.
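The assembly step, sketched with illustrative stand-ins for the sequence builder and the NLE export (none of the names below are a real export API):

```python
# Illustrative stand-ins for the sequence builder and NLE export; not a real API.
from dataclasses import dataclass

@dataclass
class SequenceClip:
    video: str                     # Sora 2 output, 5-10 s
    dialogue: str | None = None    # e.g. an ElevenLabs Eleven v3 render
    music: str | None = None       # e.g. a Minimax Music cue

timeline = [
    SequenceClip("shot_01_push.mp4", music="alley_theme.mp3"),
    SequenceClip("shot_02_orbit.mp4"),
    SequenceClip("shot_03_closeup.mp4", dialogue="vo_line_01.mp3"),
]

def export_sequence(clips: list[SequenceClip], target: str = "premiere") -> str:
    """Stand-in for the native sequence export (Premiere, DaVinci Resolve, Final Cut)."""
    return f"sequence_for_{target}.xml"

export_sequence(timeline, target="davinci_resolve")
```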
Establishing shot via slow push. The captured still locks the location; Sora 2 adds the camera move with parallax through the alley.
[Captured still from Marble: front view of foggy Tokyo alley] + Sora 2 prompt: slow camera push forward through the alley toward the ramen shop, neon signs flickering in mid-distance, rain particles in air, atmospheric, 8 seconds, 16:9.
Medium shot via orbit. Same scene; new angle. Orbit instruction maps to Sora 2's 3D-aware training.
[Captured still: three-quarter angle on the ramen shop entrance] + Sora 2 prompt: gentle orbit clockwise around the lantern outside the shop, soft lantern light, no character, 6 seconds, 16:9.
Detail close-up via static zoom. Static instruction tells Sora 2 not to add unwanted parallax.
[Captured still: tight close on the vending machine] + Sora 2 prompt: static camera, slow zoom in toward the vending machine, neon reflections shimmer on wet cobblestone in the foreground, 5 seconds, 16:9.
Reverse shot with parallax dolly. Sora 2's strongest move type — depth structure of the still drives parallax.
[Captured still: reverse over-shoulder, looking out of the alley] + Sora 2 prompt: dolly forward with parallax revealing depth of the alley behind, rain falls heavier in foreground, 7 seconds, 16:9.
Sora 2 is the camera-move engine, not the 3D scene generator. Marble (text or image conditioned) generates the scene upstream.
For best results, route a Midjourney or FLUX.2 frame into Marble as image conditioning. Text-only Marble runs are weaker than image-conditioned.
Capture stills BEFORE iterating Marble. Re-running produces a different scene; capture once, fan out to many Sora 2 nodes.
Use cinematographic verbs (dolly, orbit, push, pull, static, parallax) — they map to Sora 2's training distribution.
For sequences, use last-frame chaining: clip N's last frame = clip N+1's starting frame. Combined with the locked Marble scene, spatial and temporal continuity are preserved.
The Marble scene is canvas-internal — Sora 2 uses captured stills, not the navigable scene directly. Export from Martini = NLE-ready video, not a 3D file.
Sora 2 returns 5-10s 1080p video clips per node, with strong 3D spatial reasoning that respects the depth structure of captured stills from a Marble scene. Generation time 60-120s per clip. Cinematographic camera moves are Sora 2's strongest territory. The Marble scene remains canvas-internal (not exportable as .obj/.fbx/.glb/USD); Sora 2 outputs are exportable video deliverables. Chain via sequence builder for multi-shot delivery, NLE export for native Premiere/DaVinci sequences.
Connect Sora 2 with other AI models on Martini's infinite canvas. No GPU required — start free.
Midjourney
Generate the cinematic concept frame on Martini using Midjourney v7 — then feed that frame into the Marble 3D node to draft a navigable scene from a description that started as text. Marble's output is a viewable canvas-internal scene preview, not a clean .obj, .fbx, .glb, or USD mesh file. Directors with no concept frame use Midjourney to produce the painterly, mood-rich anchor first ("foggy alley at dusk, neon signs, wet cobblestones"), then route the locked frame into Marble for the spatial draft. Image-conditioned Marble runs hold geometry and lighting more reliably than text-only — Midjourney + Marble is the cleanest text-to-3D-scene pipeline on the canvas.
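A minimal sketch of that hand-off, with hypothetical stand-ins for both nodes; `midjourney_frame` and `run_marble` are illustrative names, not documented calls:

```python
# Hypothetical stand-ins for the Midjourney v7 node and the Marble 3D node.
def midjourney_frame(prompt: str) -> str:
    return "concept_frame.png"          # the painterly, mood-rich anchor

def run_marble(prompt: str, reference_image: str | None = None) -> str:
    return "marble_preview_handle"      # canvas-internal navigable preview, not a mesh file

prompt = "foggy alley at dusk, neon signs, wet cobblestones"
frame = midjourney_frame(prompt)                       # concept frame first
preview = run_marble(prompt, reference_image=frame)    # image-conditioned Marble run
```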
Black Forest Labs
Generate the literal-staging concept frame on Martini using FLUX.2 — then feed that frame into the Marble 3D node to produce a navigable scene from a text description. Marble's output is a viewable canvas-internal scene preview, not a clean .obj, .fbx, .glb, or USD mesh file. Where Midjourney provides the painterly atmosphere, FLUX.2 is the prompt-fidelity pick: it renders the scene with literal foreground/mid-ground/background depth structure, which is exactly what Marble's image-conditioned mode needs to reconstruct geometry reliably.
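The same hand-off sketched for FLUX.2, again with illustrative stand-ins; the depth-structured prompt is an example in the spirit of this section, not a required format:

```python
# Hypothetical stand-ins for the FLUX.2 node and the Marble 3D node.
def flux2_frame(prompt: str) -> str:
    return "staging_frame.png"

def run_marble(prompt: str, reference_image: str | None = None) -> str:
    return "marble_preview_handle"

# FLUX.2 rewards literal staging: spell out foreground / mid-ground / background.
prompt = ("foreground: wet cobblestones and a vending machine; "
          "mid-ground: ramen shop entrance under neon signs; "
          "background: fog-dimmed alley receding into darkness")
preview = run_marble(prompt, reference_image=flux2_frame(prompt))
```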