The character that doesn't drift
When you pass reference images to an AI model, everything bleeds — face, clothing, lighting. Labeled roles fix that.
Here's a problem that sounds simple and isn't: you have a character — Natalie, our AI-generated protagonist — and you want to generate her in three different scenes. Morning bedroom, rainy street, coffee shop. Same person, different contexts.
The naive approach: pass Natalie's reference photo to each generation. The model sees her face and... also her outfit. And her lighting. And her pose. By the third scene, the coffee shop has bedroom lighting, Natalie is wearing the same pajamas in the rain, and her face has subtly morphed toward the average of all three reference images.
This is the character drift problem. And it blocked us for weeks.
Why references bleed
Image generation models treat reference images holistically. When you say "generate an image that looks like this reference," the model interprets "looks like" as everything — the subject, the setting, the color grading, the composition. There's no API parameter for "just the face, please."
So if your reference is Natalie in a dim bedroom, and your prompt says "sunny coffee shop," the model compromises. You get a coffee shop that's a little too dim, with a character that's a little too different. The reference and the prompt fight each other, and nobody wins.
Labeled roles
The fix is surprisingly simple: tell the model, in natural language, exactly what to keep from each reference.
We built a role system into the media provider layer. Each reference image can carry a label: identity, scene, style, pose, object, screen_content, or supplementary. Each label comes with explicit instructions (sketched in code after this list):
- Identity anchor: "Preserve ONLY the subject's face, hair, and skin tone from this image. Do NOT inherit clothing, accessories, pose, lighting, background, or environment."
- Scene anchor: "Preserve composition, framing, lighting, color grading, background, and environment from this image. Do NOT inherit character pose, expression, or styling."
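As a rough sketch of how that mapping might look in code: the label names and the identity/scene instruction strings come straight from the list above, while the strings for the remaining roles are illustrative placeholders and `ROLE_INSTRUCTIONS` is a hypothetical name, not the actual implementation.

```python
# Hypothetical mapping from reference-image role labels to the natural-language
# instructions that accompany each image. The identity and scene strings are the
# ones quoted above; the other entries are placeholders for illustration.
ROLE_INSTRUCTIONS = {
    "identity": (
        "Preserve ONLY the subject's face, hair, and skin tone from this image. "
        "Do NOT inherit clothing, accessories, pose, lighting, background, or environment."
    ),
    "scene": (
        "Preserve composition, framing, lighting, color grading, background, and "
        "environment from this image. Do NOT inherit character pose, expression, or styling."
    ),
    "style": "Preserve the overall artistic style and color treatment from this image.",
    "pose": "Preserve the subject's body pose from this image; ignore identity and setting.",
    "object": "Preserve this object's appearance from this image; ignore everything else.",
    "screen_content": "Preserve the on-screen content shown in this image.",
    "supplementary": "Use this image as loose additional context only.",
}
```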
These instructions get interleaved with the image data in the API request — text part, then image blob, then text part, then image blob. Gemini's reasoning engine reads the labels and scopes each reference to the attributes its label names.
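A minimal sketch of that interleaving, assuming a Gemini-style request where a content body can mix text parts and inline image data. The `Reference` type, the `build_parts` helper, and the exact part dictionaries are illustrative, not the post's actual code; the `instructions` argument can be a mapping like the one sketched above.

```python
import base64
from dataclasses import dataclass

@dataclass
class Reference:
    role: str           # "identity", "scene", ...
    image_bytes: bytes  # raw encoded image (PNG/JPEG)
    mime_type: str = "image/png"

def build_parts(prompt: str, references: list[Reference],
                instructions: dict[str, str]) -> list[dict]:
    """Interleave role instructions with image data: text, image, text, image, ..."""
    parts: list[dict] = [{"text": prompt}]
    for ref in references:
        # Each label's instruction text sits immediately before its image, so the
        # model can attribute the constraint to the right reference.
        parts.append({"text": instructions[ref.role]})
        parts.append({
            "inline_data": {
                "mime_type": ref.mime_type,
                "data": base64.b64encode(ref.image_bytes).decode("ascii"),
            }
        })
    return parts
```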
The result
With labeled roles, Natalie stays Natalie across every scene. Her face is consistent. But her clothing changes, her lighting adapts to the environment, and her pose varies naturally. The coffee shop is warm and sunny. The rainy street is cold and blue. The bedroom is dim and amber. Same face in every frame.
The composition system handles the assignment automatically. When Naomi writes a prompt like @natalie walks through the rain, the system infers that @natalie is an identity reference and any scene images are scene references. If the automatic inference is wrong, Naomi can override with explicit role tags.
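A sketch of the kind of inference that could sit behind that: any reference whose name is mentioned as an @tag in the prompt is treated as an identity anchor, everything else defaults to a scene anchor, and explicit overrides win. The function name, the @tag parsing, and the override format are hypothetical, assumed for illustration only.

```python
import re

def infer_roles(prompt: str, reference_names: list[str],
                overrides: dict[str, str] | None = None) -> dict[str, str]:
    """Guess a role label for each named reference image.

    References mentioned as @name in the prompt become identity anchors;
    the rest default to scene anchors. Explicit overrides take precedence.
    """
    mentioned = set(re.findall(r"@(\w+)", prompt))
    roles = {name: ("identity" if name in mentioned else "scene")
             for name in reference_names}
    if overrides:
        roles.update(overrides)  # e.g. {"rainy_street": "style"}
    return roles

# infer_roles("@natalie walks through the rain", ["natalie", "rainy_street"])
# -> {"natalie": "identity", "rainy_street": "scene"}
```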
What this unlocks
Identity anchoring is the foundation for everything else in the video pipeline. Multi-shot Seedance videos work because each reference is role-labeled, so the character stays locked while the world changes around them. The content flywheel works because Naomi can research a trending video's visual style, extract it as a scene reference, and apply it to her own character without the styles blending.
Before this fix, every multi-shot sequence was a dice roll. Would the character stay consistent? Maybe. Would the lighting stay appropriate? Probably not. Now it's deterministic. The identity anchor says what to keep, the scene anchor says what to change, and the model respects the boundary.
It's a text hack, not a model feature. But it works.