2 min read

One clip, five shots, zero stitching

Seedance 2.0 generates coherent multi-shot video from a single API call. Here's how we wired it up.

video · seedance · pipeline

The hardest part of AI video for social content isn't generating a single shot. It's generating multiple shots where the same character appears in different scenes and still looks like the same person.

Until last week, our approach was: generate each shot individually, hope the character stays consistent across renders, stitch the clips together in post. Three API calls, three chances for the face to drift, and a Remotion composition step at the end to glue everything together. It was expensive (~$4-5 per video), slow, and fragile.

Then we got Seedance 2.0 working through PiAPI.

What changed

Seedance supports something called omni_reference mode — you pass up to 12 reference images, and the model locks them as persistent elements across the entire generated video. You reference them in the prompt with positional tokens: @image1, @image2, and so on. Every time the camera cuts, the character from @image1 maintains their face, hair, and skin tone. The background from @image2 stays consistent.

One API call. One video. Multiple shots. $1.50 for 15 seconds.
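Concretely, the request looks something like the sketch below. The endpoint path, field names, and response shape are our placeholders, not PiAPI's documented schema; what matters is the structure: one request, a stack of references, a prompt that addresses them by position.

```typescript
// Illustrative sketch only: the endpoint, field names, and response shape
// are assumptions, not PiAPI's documented schema.
interface SeedanceRequest {
  model: string;
  mode: "omni_reference";   // lock the reference images across every shot
  references: string[];     // up to 12 image URLs, addressed as @image1..@image12
  prompt: string;           // the prompt refers to references by position
  duration_seconds: number;
}

async function generateMultiShot(apiKey: string): Promise<string> {
  const body: SeedanceRequest = {
    model: "seedance-2.0",
    mode: "omni_reference",
    references: [
      "https://assets.example.com/natalie-face.png",    // @image1: character
      "https://assets.example.com/morning-bedroom.png", // @image2: scene
    ],
    prompt:
      "Shot 1: @image1 wakes slowly in @image2, soft light. " +
      "Shot 2: close on @image1's face as she checks her phone. " +
      "Shot 3: camera drifts back while @image1 sits up.",
    duration_seconds: 15,
  };

  const res = await fetch("https://api.example.com/v1/seedance/video", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  const { video_url } = await res.json();
  return video_url; // one video, multiple shots, no stitching step
}
```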

The composition system

The tricky part isn't the API — it's translating Naomi's creative intent into the right prompt format. When Naomi writes a video brief, she thinks in terms of handles: @natalie (our character), @morning-bedroom (a scene reference), @phone-close-up (a prop reference). These are human-readable names bound to actual assets in her workspace.

The composition system resolves these handles. It looks up each one — is it a pinned character, a session asset, a video frame, a URL? — and fetches the actual image data. Then it assigns roles based on context: the first handle referenced in a character context gets the identity role, a scene reference gets the scene role. Each role carries instructions for the model about what to preserve and what to adapt.
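In TypeScript terms, the resolution and role-assignment steps look roughly like this. Type names and lookup sources are invented for illustration; this is the shape of the system, not our code verbatim.

```typescript
type Role = "identity" | "scene" | "prop";

interface Workspace {
  pinnedCharacters: Map<string, string>; // @natalie -> image URL
  sessionAssets: Map<string, string>;    // assets generated this session
  videoFrames: Map<string, string>;      // frames extracted from earlier videos
}

interface ResolvedReference {
  handle: string;   // e.g. "@natalie"
  role: Role;       // tells the model what to preserve vs. adapt
  imageUrl: string;
}

// Try each source in order; fall through to treating the handle as a URL.
function resolveHandle(handle: string, ws: Workspace): string {
  const url =
    ws.pinnedCharacters.get(handle) ??
    ws.sessionAssets.get(handle) ??
    ws.videoFrames.get(handle) ??
    (/^https?:\/\//.test(handle) ? handle : undefined);
  if (!url) throw new Error(`unresolvable handle: ${handle}`);
  return url;
}

// First handle in a character context carries identity; scene references
// carry the scene role; everything else is treated as a prop.
function assignRole(index: number, context: "character" | "scene" | "prop"): Role {
  if (context === "character" && index === 0) return "identity";
  if (context === "scene") return "scene";
  return "prop";
}
```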

Finally, it renders two versions of the prompt: one agent-facing (with @natalie handles for readability) and one model-facing (with @image1 positional tokens for the API). The model-facing version includes a role preamble for each reference so the generation model knows exactly what to lock.
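A sketch of that dual render, reusing the Role and ResolvedReference types from the sketch above; the preamble wording is illustrative:

```typescript
const ROLE_PREAMBLES: Record<Role, string> = {
  identity: "preserve this character's face, hair, and skin tone exactly",
  scene: "keep this environment's layout and lighting consistent",
  prop: "keep this object's appearance consistent wherever it appears",
};

function renderPrompts(
  agentPrompt: string,
  refs: ResolvedReference[],
): { agentFacing: string; modelFacing: string } {
  const preambles: string[] = [];
  let modelFacing = agentPrompt;

  refs.forEach((ref, i) => {
    const token = `@image${i + 1}`;
    // swap the human-readable handle for the positional token
    modelFacing = modelFacing.split(ref.handle).join(token);
    preambles.push(`${token}: ${ROLE_PREAMBLES[ref.role]}.`);
  });

  return {
    agentFacing: agentPrompt, // keeps @natalie, readable in session logs
    modelFacing: `${preambles.join(" ")} ${modelFacing}`,
  };
}
```

So "@natalie wakes in @morning-bedroom" becomes "@image1: preserve this character's face... @image2: keep this environment's... @image1 wakes in @image2."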

The prompt style

Seedance responds well to what we're calling "sensory-dense" prompts. Instead of "a woman looks at her phone," we write "soft bedroom glow catches her face as she tilts the phone toward camera — the screen's blue light washes across her cheekbones, eyes narrowing slightly." The motion language matters: camera pushes, drifts, pans. Subject motion: breathing, blinking, subtle head turns.
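For concreteness, here's what a full model-facing prompt might look like once role preambles, positional tokens, and sensory-dense motion language come together. The wording is ours and illustrative, not a template from Seedance's docs.

```typescript
// One possible final prompt, assembled by the composition system.
const modelFacingPrompt = [
  "@image1: preserve this character's face, hair, and skin tone exactly.",
  "@image2: keep this environment's layout and lighting consistent.",
  "Shot 1: soft bedroom glow in @image2 catches @image1's face as she",
  "tilts the phone toward camera; the screen's blue light washes across",
  "her cheekbones, eyes narrowing slightly.",
  "Shot 2: camera drifts closer; she breathes, blinks, a subtle head",
  "turn as her expression shifts from exhaustion to resolve.",
].join(" ");
```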

The VIDEO_PLAYBOOK handbook teaches Naomi when to reach for video at all — motion should add emotional value, not just movement. A still frame of a quote is fine as a still frame. But a character's expression shifting from exhaustion to resolve? That's where video earns its cost.

What this replaces

Before Seedance, a multi-shot character-consistent video required:

  1. Generate shot 1 ($0.50+)
  2. Generate shot 2 with the same reference ($0.50+)
  3. Hope the face didn't drift
  4. Stitch via Remotion ($0 but slow, requires infra)
  5. Re-render if character drifted in any shot

Now it's:

  1. Compose references + prompt
  2. One Seedance call ($1.50)
  3. Done
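Back-of-envelope, using the numbers above. The per-shot and per-call prices come from this post; the re-render count is an assumption we're adding to show how drift pushed the old pipeline toward ~$4-5 per video.

```typescript
// Hedged cost math: prices are from the post; the re-render count is
// an assumed figure, since drift forced retries at $0.50+ apiece.
const OLD_COST_PER_SHOT = 0.5;  // "$0.50+" per individual generation
const NEW_COST_PER_VIDEO = 1.5; // one 15-second Seedance call

function oldPipelineCost(shots: number, rerenders: number): number {
  return (shots + rerenders) * OLD_COST_PER_SHOT; // stitching itself cost $0
}

// Three shots plus roughly six drift-driven re-renders lands at $4.50,
// squarely in the $4-5 range:
console.log(oldPipelineCost(3, 6)); // 4.5
console.log(NEW_COST_PER_VIDEO);    // 1.5, flat
```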

The character doesn't drift because it was never generated separately. It's one continuous generation with one consistent world model. That's not an optimization — it's a fundamentally different approach.