Fifteen APIs, one router
Naomi needed to reason about which image or video model to use — not just call one and hope. So we built a catalog.
When you're generating images and videos for TikTok content, you don't have one tool. You have fifteen. Gemini Imagen for photorealism. Flux for stylized illustration. DALL-E 3 for text rendering. Kling for talking heads. Seedance for cinematic motion. Veo for long-form video. Each with its own API, its own quirks, its own content filter that blocks something the others don't.
For a while, Naomi had these wired up as separate tool implementations. `generate_image` called Gemini. `generate_video` called Veo. If you wanted Flux, you'd have to ask for it by name and hope the code path existed. It worked, but it meant every new provider required new tool code, and Naomi couldn't reason about which provider to use for a given task. She just called whatever was hardcoded.
The catalog
The fix was to separate the metadata from the execution. We built a static catalog — 19 entries, each describing a provider's capabilities in structured data:
- What media type it produces (image or video)
- What operations it supports (text-to-image, image-to-video, motion transfer, multi-shot)
- What it's good at (photorealism, character consistency, text rendering, animation)
- What it costs per call or per second
- How long it typically takes
- What its strengths and weaknesses are, in plain English
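A catalog entry along these lines can be sketched as a frozen dataclass. The field names and the sample values here are illustrative assumptions, not the real schema or real pricing:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One provider's capabilities, described as data rather than code."""
    name: str
    media_type: str          # "image" or "video"
    operations: tuple        # e.g. ("text-to-image", "image-to-video")
    strengths: tuple         # e.g. ("photorealism", "text rendering")
    cost: str                # per call or per second, in plain terms
    typical_latency_s: int   # how long a call usually takes
    notes: str               # strengths and weaknesses, in plain English


# Hypothetical entry -- values are made up for illustration.
FLUX = CatalogEntry(
    name="flux",
    media_type="image",
    operations=("text-to-image",),
    strengths=("stylized illustration",),
    cost="$0.05 per call",
    typical_latency_s=10,
    notes="Strong stylized output; weaker at legible text in images.",
)
```

Because entries are plain data, adding a provider means appending one of these, not writing a new tool.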
This catalog gets rendered into Naomi's system prompt at session start. She sees a formatted table of every available provider with costs, latencies, and use cases. When she decides to generate an image, she's choosing from a menu — not guessing.
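Rendering that menu is just string formatting over the catalog. A minimal sketch, using dicts with assumed field names rather than the real entry type:

```python
def render_catalog(entries):
    """Render catalog entries (dicts) as a plain-text table for a system prompt."""
    lines = [f"{'PROVIDER':<10}{'TYPE':<8}{'COST':<14}{'LATENCY':<9}NOTES"]
    for e in entries:
        lines.append(
            f"{e['name']:<10}{e['media_type']:<8}{e['cost']:<14}"
            f"{e['latency']:<9}{e['notes']}"
        )
    return "\n".join(lines)


# Illustrative entries -- not real pricing or latency.
catalog = [
    {"name": "flux", "media_type": "image", "cost": "$0.05/call",
     "latency": "~10s", "notes": "Stylized illustration; weak text rendering."},
    {"name": "veo", "media_type": "video", "cost": "$0.50/s",
     "latency": "~90s", "notes": "Long-form, cinematic motion."},
]
print(render_catalog(catalog))
```

The table lands in the system prompt once per session, so the cost of carrying all entries is a few hundred tokens.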
The registry
Underneath the catalog sits a runtime registry. It checks which API keys are configured, instantiates the corresponding providers, and routes generation requests to the right one. If Naomi asks for Flux but we don't have a FAL API key, the registry returns a clean error instead of crashing.
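The key-gated construction and the clean-error path might look like this. The environment-variable names and the `factories` shape are assumptions for the sketch:

```python
import os


class ProviderRegistry:
    """Instantiate only the providers whose API keys are configured."""

    # Hypothetical env-var mapping; the real key names are assumptions.
    KEY_VARS = {"flux": "FAL_API_KEY", "veo": "GOOGLE_API_KEY"}

    def __init__(self, factories):
        # factories: provider name -> zero-arg callable returning a client.
        # A provider is registered only if its API key is present.
        self._providers = {
            name: make()
            for name, make in factories.items()
            if os.environ.get(self.KEY_VARS.get(name, ""))
        }

    def generate(self, name, request):
        provider = self._providers.get(name)
        if provider is None:
            # Structured failure instead of a crash or an exception.
            return {"error": f"provider '{name}' not configured (missing API key)"}
        return provider.generate(request)
```

The agent gets the same shaped response either way, so a missing key degrades into a choice ("use another provider") rather than a stack trace.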
Every provider speaks the same language: `GenerationRequest` in, `GenerationResult` out. The request carries a prompt, optional reference images, and metadata. The result carries the generated asset, the cost, the latency, and an error field if something went wrong. No exceptions — the caller decides what to do with a failure.
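The shared contract can be sketched as two dataclasses. The fields named in the text (prompt, reference images, metadata, cost, latency, error) are from the source; the exact field names and the `ok` helper are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenerationRequest:
    prompt: str
    reference_images: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class GenerationResult:
    asset_path: Optional[str] = None  # hypothetical name for the generated asset
    cost_usd: float = 0.0
    latency_s: float = 0.0
    error: Optional[str] = None       # set instead of raising

    @property
    def ok(self) -> bool:
        return self.error is None
```

Returning errors as data keeps the routing layer uniform: a content-filter block and a timeout look the same to the caller, which can retry, swap providers, or surface the message.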
Reference roles
The most interesting part of the system is how it handles reference images. When Naomi generates a multi-shot sequence — say, the same character in three different scenes — she needs to tell the model which reference to preserve and which to adapt. The first reference might be an identity anchor (keep the face, ignore the clothing). The second might be a scene anchor (keep the lighting, ignore the person).
The media layer supports labeled roles on references: `identity`, `scene`, `style`, `pose`, `object`, `screen_content`, `supplementary`. Each role carries a natural-language description that gets interleaved with the image data in the API request. Gemini doesn't have API fields for per-reference weights, so we describe the intent in text and let the model's reasoning handle attribution.
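Interleaving the role descriptions with the image data might look like this. The role wording is invented for the sketch, and the `inline_data` parts shape follows the Gemini API's multimodal request format, though the surrounding structure is an assumption:

```python
# Hypothetical role descriptions; the real wording is an assumption.
ROLE_PROMPTS = {
    "identity": "Preserve this person's face and identity; clothing may change.",
    "scene": "Match this image's setting and lighting; ignore any people in it.",
    "style": "Match the visual style of this image.",
}


def interleave_references(prompt, references):
    """Build the parts list for a multimodal request: each reference image
    is preceded by a text part stating what the model should take from it."""
    parts = []
    for role, image_bytes in references:
        parts.append({"text": ROLE_PROMPTS[role]})
        parts.append({"inline_data": {"mime_type": "image/png", "data": image_bytes}})
    parts.append({"text": prompt})  # the generation prompt comes last
    return parts
```

Because the intent lives in text rather than API fields, the same role vocabulary works across providers — a provider with native reference controls can map roles to them, and one without can still receive the interleaved description.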
Why it matters
The real value isn't the abstraction itself — it's what it enables. When a provider gets blocked (Seedance's face filter rejects a reference), Naomi can swap to another provider with the same capabilities. When costs change, we update one catalog entry instead of rewriting tool code. When a new model drops — and they drop every week — we add a catalog entry and a provider implementation, and Naomi immediately knows when to use it.
Before, adding a provider was a code change. Now it's a data entry.