Inside the multi-agent orchestration behind the TAIKAI AI Arena

In the first post we told the story: ten frontier models entered a hackathon, built a queryable social graph of the 2026 World Cup, deployed it, and judged each other. Fable won. Kimi K2 reached the podium for a 22nd of the price. The interesting failures were never about intelligence.

This post is about the machine underneath that story: the multi-agent orchestration architecture that lets an agent register for a real hackathon, claim a sandbox and a GitHub repo, write and ship code, and cast a vote, all without a human touching the keyboard. If you are building with agents, the parts worth stealing are here.

The orchestrator runs no model at all

The first thing to get straight, because it surprises people: the orchestrator has no LLM in it.

When you hear "agent orchestration" you probably picture a planner model handing tasks to worker models. We don't do that. The orchestrator is a plain Node service running on a Mac Studio, bound to localhost, and every decision it makes is deterministic. It spawns containers, writes events to a database, enforces budgets, and unlocks phases when the operator says so. No inference, no prompt, no temperature.

The intelligence lives one layer down, inside each agent. The orchestrator is the stage and the lighting rig. The models are the only thing in the building that thinks.

That split matters for a reason that took us a while to appreciate. Anything stochastic is hard to operate. If the scheduler itself can hallucinate, you can never fully trust your own logs about what happened during a run. Because we keep the control plane dumb and the agents smart, every state transition in the system is something we can replay and explain. When Gemini spent twenty-seven dollars looping on a framework bug, we got an exact, ordered record of it, because the thing recording was a state machine, not a narrator.

Concretely, the orchestrator does five jobs:

Spawns, pauses, resumes, and kills Docker sandboxes through the Docker API.
Writes every event to Neon Postgres first, then fans it out live over Upstash Redis.
Subscribes to a control channel for operator commands coming from the dashboard.
Runs a set of background "breakers" that watch for stuck agents, dead sandboxes, and blown budgets.
Holds the run's phase and refuses to let agents past it until a human advances.

Durable before live, always. We write to Postgres before we publish to Redis, so if the live bus drops a message the history is still correct. The dashboard is just a reader on top of that history.
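
Here is a minimal sketch of that ordering, not the Arena's actual code. The `EventStore` and `LiveBus` interfaces, the `run:events` channel name, and the in-memory fakes are all assumptions made for illustration; the only thing taken from the text is the rule that the durable write happens first and the live publish is best-effort.

```typescript
// Hypothetical shapes: stand-ins for the Postgres writer and the Redis publisher.
interface ArenaEvent { agentId: string; kind: string; payload: unknown; }
interface EventStore { insert(event: ArenaEvent): Promise<number>; } // returns a sequence id
interface LiveBus { publish(channel: string, message: string): Promise<void>; }

// Persist first, then fan out. A failed publish is swallowed on purpose,
// because the stored row is already the source of truth.
async function recordEvent(
  store: EventStore,
  bus: LiveBus,
  event: ArenaEvent,
): Promise<number> {
  const seq = await store.insert(event); // throws if the durable write fails
  try {
    await bus.publish("run:events", JSON.stringify({ seq, ...event }));
  } catch {
    // Live delivery is best-effort; a reader can backfill from the store by seq.
  }
  return seq;
}

// In-memory fakes for exercising the ordering without real services.
class MemoryStore implements EventStore {
  rows: ArenaEvent[] = [];
  async insert(event: ArenaEvent): Promise<number> {
    this.rows.push(event);
    return this.rows.length;
  }
}

class FlakyBus implements LiveBus {
  async publish(): Promise<void> {
    throw new Error("bus unavailable");
  }
}
```

Even with a bus that always fails, `recordEvent` still resolves with a sequence id and the row is in the store, which is the property the dashboard relies on.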

One agent runtime, ten frontier models

Each agent is a single bundled Node process that boots inside its sandbox and runs a loop. The same bundle runs every model. There is no Anthropic build and no separate OpenAI build. This is what we mean by a model-agnostic agent runtime: the model is a config object injected at spawn time, and the loop doesn't know or care which provider answers.

We get that through the Vercel AI SDK and a thin adapter layer. Two adapters, really. One speaks to OpenRouter, which is how Opus and Fable from Anthropic, GPT-5.5 from OpenAI, Google's Gemini, Mistral Medium, DeepSeek V4, Alibaba's Qwen3 Max, Moonshot's Kimi K2, MiniMax M2, and xAI's Grok all reach the loop. The other speaks plain OpenAI-compatible HTTP, which is how a local Llama or Qwen on the same network gets wired in for offline runs. Resolving a model is a switch with two arms:

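export function resolveLanguageModel(config: ModelConfig): LanguageModel {
  switch (config.provider) {
    // Hosted models: everything that is routed through OpenRouter.
    case 'openrouter':
      return createOpenRouterAdapter({ apiKey, appName: 'arena' }).model(config);
    // Local models: any OpenAI-compatible HTTP endpoint on the same network.
    case 'local':
      return createLocalAdapter().model(config);
  }
}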

The registry behind it is a model_configs table. Each row carries the upstream model id, its pricing, its context window, whether it does native tool calls or needs JSON-mode coaxing, and what it's allowed to do (build, judge, or just think). Adding an eleventh model is a row, not a code change. That single decision is the reason a ten-model bake-off was a weekend of operating rather than a month of integration work.
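
To make the "row, not a code change" idea concrete, here is a rough sketch of what one registry entry could look like. The field names, the `registerModel` and `getModelFor` helpers, and the example values are assumptions; the post only tells us which facts a row carries, not how the columns are named.

```typescript
// Hypothetical shape of one model_configs row; column names are illustrative.
type Capability = "build" | "judge" | "think";

interface ModelConfig {
  provider: "openrouter" | "local";
  upstreamModelId: string;   // id the provider expects
  inputUsdPerMTok: number;   // pricing per million input tokens
  outputUsdPerMTok: number;  // pricing per million output tokens
  contextWindow: number;     // in tokens
  toolCalling: "native" | "json-mode";
  capabilities: Capability[];
}

// An in-memory map standing in for the table.
const registry = new Map<string, ModelConfig>();

function registerModel(key: string, config: ModelConfig): void {
  registry.set(key, config);
}

// Look a model up and check that it is allowed to play the requested part.
function getModelFor(key: string, needed: Capability): ModelConfig {
  const config = registry.get(key);
  if (!config) throw new Error(`unknown model: ${key}`);
  if (!config.capabilities.includes(needed)) {
    throw new Error(`${key} is not allowed to ${needed}`);
  }
  return config;
}

// Adding a model is one more entry. These values are placeholders, not real pricing.
registerModel("example-local", {
  provider: "local",
  upstreamModelId: "example-model",
  inputUsdPerMTok: 0,
  outputUsdPerMTok: 0,
  contextWindow: 32000,
  toolCalling: "json-mode",
  capabilities: ["think", "judge"],
});
```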

The loop itself

The loop is unremarkable in the good way. Fetch the brief, connect to tools, then for each phase: build the tool catalogue, generate the system prompt, and turn the crank up to two hundred turns. Each turn polls the orchestrator for pause or resume or a coaching message, compacts history if it's near the context limit, calls the model with its tools, records what the turn cost, and appends the whole exchange to a history.json file on disk.

That last detail is what makes pause and resume real. An agent that gets frozen and thawed an hour later just reads its own history back and keeps going. An agent whose sandbox crashes gets respawned and does the same.
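
A sketch of how resume-from-disk can work, under stated assumptions: the `Turn` shape is invented here because the post does not show the layout of history.json, and the write-then-rename step is our suggestion for crash safety rather than a description of what the Arena runtime does.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical turn record; the real history.json layout is not shown in this post.
interface Turn { index: number; role: "assistant" | "tool"; content: string; }

// Read prior turns if the file exists, otherwise start with an empty history.
function loadHistory(file: string): Turn[] {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, "utf8")) as Turn[];
}

// Append one turn by writing a temp file next to the target and renaming it over
// the original, so a crash mid-write does not leave a half-written history.
function appendTurn(file: string, history: Turn[], turn: Turn): Turn[] {
  const next = [...history, turn];
  const tmp = path.join(path.dirname(file), `.${path.basename(file)}.tmp`);
  fs.writeFileSync(tmp, JSON.stringify(next));
  fs.renameSync(tmp, file);
  return next;
}

// A thawed or respawned agent continues at the turn after the last one on disk.
function nextTurnIndex(history: Turn[]): number {
  const last = history[history.length - 1];
  return last ? last.index + 1 : 0;
}
```

On boot the agent would call `loadHistory`, ask `nextTurnIndex` where to continue, and carry on as if nothing had happened.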

Phases the operator unlocks one at a time

A run moves through four phases in lock-step with the hackathon itself: register, ideation, build, judging. The agents don't get to sprint ahead. Each one runs its phase, then parks and waits until the operator advances the run.

This was a deliberate choice, and it's the human-in-the-loop seam. We pace real runs by hand. The operator's only job is to keep the infrastructure alive and open the next door, and every time they do, it's logged. During ideation the agents can read the brief but TAIKAI is read-only and there are no build tools, so nobody writes a line of code before they've committed to an idea. Only in build do git, the shell, and the deploy CLIs appear in the catalogue.

The role an agent plays decides which phases it sees. A pure judge skips straight to judging. A hacker stops after build. The flagship run used hacker_judge agents, which do the whole arc: register, ideate, build, then turn around and judge everyone else.
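
The gating described in this section can be sketched as two small lookups. This is an illustration, not the orchestrator's code: the `mayRun` and `buildToolsFor` helpers are hypothetical, the tool labels are placeholders, and the role-to-phase table is our reading of the paragraph above (in particular, that a pure judge takes part in judging only).

```typescript
type Phase = "register" | "ideation" | "build" | "judging";
type Role = "hacker" | "judge" | "hacker_judge";

const PHASE_ORDER: Phase[] = ["register", "ideation", "build", "judging"];

// Which phases each role takes part in, as read from the text above.
const PHASES_BY_ROLE: Record<Role, Phase[]> = {
  hacker: ["register", "ideation", "build"],
  judge: ["judging"],
  hacker_judge: ["register", "ideation", "build", "judging"],
};

// An agent may work on a phase only if its role includes that phase and the
// operator has advanced the run at least that far.
function mayRun(role: Role, phase: Phase, unlockedThrough: Phase): boolean {
  const inRole = PHASES_BY_ROLE[role].includes(phase);
  const unlocked = PHASE_ORDER.indexOf(phase) <= PHASE_ORDER.indexOf(unlockedThrough);
  return inRole && unlocked;
}

// Build tooling appears in the catalogue only during the build phase.
// The labels are illustrative, not real tool ids.
const BUILD_TOOLS = ["git", "shell", "deploy"];

function buildToolsFor(phase: Phase): string[] {
  return phase === "build" ? BUILD_TOOLS : [];
}
```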

Underneath, an agent is a small state machine: spawning → running ↔ paused → done | failed | killed | stuck | submitting. The two-way arrow between running and paused is doing a lot of work. An operator can pause an agent. So can a breaker, when the agent blows its budget. Resume puts it back. Because status is persisted, a pause survives an orchestrator restart, which is the kind of unglamorous thing you only learn you need after the first time you restart mid-run and lose track of who was sleeping.
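
One way to encode that diagram is a transition table that rejects anything not listed. The table below is an assumption built from the arrow notation alone: it treats every status to the right of the last arrow as an end state reachable from both running and paused, which may be stricter or looser than the real rules.

```typescript
type Status =
  | "spawning" | "running" | "paused"
  | "done" | "failed" | "killed" | "stuck" | "submitting";

// Statuses listed after the final arrow in the diagram.
const END_STATES: Status[] = ["done", "failed", "killed", "stuck", "submitting"];

// Allowed moves; anything not listed here is refused.
const TRANSITIONS: Record<Status, Status[]> = {
  spawning: ["running"],
  running: ["paused", ...END_STATES],
  paused: ["running", ...END_STATES],
  done: [],
  failed: [],
  killed: [],
  stuck: [],
  submitting: [],
};

// Return the new status, or throw if the move is not in the table.
function transition(from: Status, to: Status): Status {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Persisting the returned status after every call is what lets a pause survive a restart: on boot, the orchestrator can read each agent's last stored status instead of guessing.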

Sandboxes that can phone out and nothing else

AI agent sandbox isolation is where most of the safety actually lives. Every agent gets its own Docker sandbox: its own user-defined network, its own volume mounted at /workspace, four CPUs and eight gigs of RAM. The image is Ubuntu with Node, git, the GitHub CLI, a headless Chromium for the browser tool, and the Vercel, Railway, and Cloudflare CLIs baked in so agents can actually deploy.

The network shape is the part we'd point other people at. Outbound HTTPS works, so an agent can reach OpenRouter, GitHub, its deploy target, and the TAIKAI API. It can reach back to the orchestrator on the host. It cannot reach another agent's sandbox, and nothing on the outside can reach in, because no ports are published. Ten models competing in the same run, and none of them can see, poke, or sabotage a rival. The isolation is enforced at spawn time by the orchestrator, not by asking the agents to behave.

At spawn the orchestrator injects everything the agent needs as environment: its identity, its model config as JSON, an OpenRouter key, a GitHub token that gets refreshed before it expires, its own repo name, per-agent deploy tokens, and a 30-day TAIKAI access token. Each agent owns its credentials. Nothing is shared that doesn't have to be.
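
As an illustration of the spawn-time shape, here is a sketch that builds the request body for the Docker Engine API's create-container call. The `HostConfig` field names (`NanoCpus`, `Memory`, `NetworkMode`, `Binds`) are the Engine API's own; the image tag, the network and volume naming scheme, and the environment variable names in the usage note are assumptions, and we read "eight gigs" as 8 GiB.

```typescript
// The subset of the Docker Engine API create-container body used in this sketch.
interface ContainerCreateBody {
  Image: string;
  Env: string[];           // "KEY=value" pairs
  HostConfig: {
    NanoCpus: number;      // CPU limit in units of 1e-9 CPUs
    Memory: number;        // memory limit in bytes
    NetworkMode: string;   // name of a user-defined network
    Binds: string[];       // "volume-name:/path/in/container"
  };
}

function sandboxSpec(agentId: string, env: Record<string, string>): ContainerCreateBody {
  return {
    Image: "arena-sandbox:latest", // hypothetical image tag
    Env: Object.entries(env).map(([key, value]) => `${key}=${value}`),
    HostConfig: {
      NanoCpus: 4 * 1e9,                          // four CPUs
      Memory: 8 * 1024 ** 3,                      // 8 GiB
      NetworkMode: `arena-net-${agentId}`,        // one network per agent
      Binds: [`arena-vol-${agentId}:/workspace`], // one volume per agent
      // No PortBindings: nothing is published, so nothing outside can connect in.
    },
  };
}
```

A call such as `sandboxSpec("a1", { AGENT_ID: "a1" })` yields a body with its own network and volume names. Because each agent sits on a separate user-defined network, containers cannot address each other unless something explicitly connects them.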

The same hackathon surface a human uses

Here's the piece that made the whole experiment possible. The agents enter TAIKAI through the exact interface a human builder uses, and they reach it through a Model Context Protocol integration.

At boot, each agent opens an MCP connection to the hosted TAIKAI server over streamable HTTP, authenticated with its personal token. That connection hands the agent a set of taikai_* tools, and those tools are the real platform: register for the hackathon, fill in the entry form, create a project, update it, submit it, read the other projects, fetch the leaderboard, add votes to a cart, and check out. The same mutations the website calls. No private agent API, no shadow path.

The catalogue is gated by role. A hacker gets the full set. A judge gets a read-only slice plus the voting tools, and cannot create or submit a project. A hacker can't cast a judgement. We enforce that twice, once at the tool catalogue and once at an orchestrator proxy the tool calls pass through, because a single gate is a single point of failure.
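
The two gates can be sketched as one policy table consulted in two places. Everything named here is hypothetical: the tool ids only follow the taikai_* prefix mentioned above, the `access` tags are our own grouping, and we model a hacker as read plus project writes because the text says a hacker cannot vote.

```typescript
type Role = "hacker" | "judge";
type Access = "read" | "write_project" | "vote";

interface ToolDef { name: string; access: Access; }

// Hypothetical tool ids; only the taikai_ prefix comes from the text.
const ALL_TOOLS: ToolDef[] = [
  { name: "taikai_list_projects", access: "read" },
  { name: "taikai_get_leaderboard", access: "read" },
  { name: "taikai_create_project", access: "write_project" },
  { name: "taikai_submit_project", access: "write_project" },
  { name: "taikai_add_vote_to_cart", access: "vote" },
  { name: "taikai_checkout", access: "vote" },
];

const ALLOWED: Record<Role, Access[]> = {
  hacker: ["read", "write_project"],
  judge: ["read", "vote"],
};

// Gate 1: the agent is only ever shown tools its role may use.
function catalogueFor(role: Role): ToolDef[] {
  return ALL_TOOLS.filter((tool) => ALLOWED[role].includes(tool.access));
}

// Gate 2: the proxy re-checks each call, so a tool name that was guessed or
// replayed is still refused even though it never appeared in the catalogue.
function proxyAllows(role: Role, toolName: string): boolean {
  const tool = ALL_TOOLS.find((t) => t.name === toolName);
  return tool !== undefined && ALLOWED[role].includes(tool.access);
}
```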

The connection is also allowed to fail without taking the agent down. If the MCP server is unreachable or a token has gone stale, the tools simply aren't in the catalogue that turn and the agent keeps working. A flaky dependency shouldn't end a run.
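
A sketch of that tolerance, assuming the remote tools arrive through some loader function. `ToolLoader`, `withTimeout`, and the five-second default are invented for this example and do not reflect a specific MCP client API; the timeout in particular is our addition, since an unreachable server often hangs rather than fails.

```typescript
interface Tool { name: string; }
type ToolLoader = () => Promise<Tool[]>;

// Reject if the work takes longer than ms; clear the timer either way.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Any failure (network error, stale token, timeout) yields an empty list
// instead of an exception, so the turn can proceed without the remote tools.
async function loadOptionalTools(load: ToolLoader, timeoutMs = 5000): Promise<Tool[]> {
  try {
    return await withTimeout(load(), timeoutMs);
  } catch {
    return [];
  }
}

// The catalogue for a turn is the local tools plus whatever the remote side gave us.
async function buildCatalogue(core: Tool[], loadRemote: ToolLoader): Promise<Tool[]> {
  const remote = await loadOptionalTools(loadRemote);
  return [...core, ...remote];
}
```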

If you want the longer version of how that toolkit is shaped, we wrote it up separately in the TAIKAI MCP deep-dive.

Three ways a project gets judged

The narrative post said the models judged each other, and they did. But the scoreboard is fed by three different judges, all writing to one scores table, and seeing AI agents judging each other from three angles at once is where the experiment got sharp.

The first judge is automated and deterministic. A standalone worker subscribes to every submission event, fetches the project's README, and runs flat checks: does a GET on the demo URL come back 2xx, and how many of the required sections (what, how, demo, architecture) are present. No model involved, just facts about the artifact.
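
The two flat checks are simple enough to sketch. How the real worker detects sections is not described, so matching on Markdown headings is an assumption, as are the helper names and the injected `HttpGet` function (in Node 18 or later it could be a thin wrapper around the global `fetch`).

```typescript
const REQUIRED_SECTIONS = ["what", "how", "demo", "architecture"];

// Collect heading text from a Markdown README (lines that start with '#').
function headings(readme: string): string[] {
  return readme
    .split("\n")
    .filter((line) => /^#{1,6}\s+/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim().toLowerCase());
}

// A section counts as present if some heading contains its name as a whole word.
function sectionsPresent(readme: string): string[] {
  const found = headings(readme);
  return REQUIRED_SECTIONS.filter((section) => {
    const word = new RegExp("\\b" + section + "\\b");
    return found.some((heading) => word.test(heading));
  });
}

// Hypothetical transport: resolves with the HTTP status of a GET request.
type HttpGet = (url: string) => Promise<{ status: number }>;

// True only for a 2xx answer. This proves the server responded,
// not that the app behind it works.
async function demoResponds(url: string, get: HttpGet): Promise<boolean> {
  try {
    const { status } = await get(url);
    return status >= 200 && status < 300;
  } catch {
    return false;
  }
}
```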

The second is an LLM jury. The same worker makes one structured call against the full rubric, to Opus 4.7 by default, and writes a score per rubric item. It runs as its own process on purpose. A slow or failing jury call should never block an agent from spawning, and a separate worker can score several submissions in parallel without the orchestrator having to manage a queue.

The third is the one the first post was about: the agents themselves. Every hacker_judge, in the judging phase, reads the rival project pages, clones repos, runs whatever tests it finds, probes the demos, and splits a thousand vote tokens across the others with a written rationale for each. No self-votes. Those peer votes land in the same table as the automated checks and the jury, tagged by kind, and the dashboard breaks the leaderboard down by source.
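
The ballot rules in that paragraph lend themselves to a small validator. This is a sketch under assumptions: the `Vote` shape and `validateBallot` are hypothetical, and since the text does not say whether a judge must spend all thousand tokens, the check only enforces the ceiling.

```typescript
// Hypothetical ballot entry: who receives the tokens, how many, and why.
interface Vote { projectOwner: string; tokens: number; rationale: string; }

const TOKENS_PER_JUDGE = 1000;

// Return a list of problems; an empty list means the ballot is acceptable.
function validateBallot(judgeId: string, votes: Vote[]): string[] {
  const errors: string[] = [];
  let total = 0;
  for (const vote of votes) {
    if (vote.projectOwner === judgeId) {
      errors.push("self-votes are not allowed");
    }
    if (!Number.isInteger(vote.tokens) || vote.tokens <= 0) {
      errors.push(`invalid token amount for ${vote.projectOwner}`);
    }
    if (vote.rationale.trim() === "") {
      errors.push(`missing rationale for ${vote.projectOwner}`);
    }
    total += vote.tokens;
  }
  if (total > TOKENS_PER_JUDGE) {
    errors.push(`ballot spends ${total} tokens, more than ${TOKENS_PER_JUDGE}`);
  }
  return errors;
}
```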

Having all three side by side is what let us see the demo gap from the last post in hard numbers. The automated probe and the agent judges both treated an HTTP 200 as "the app works." A human opening each app in a browser did not. Two of the three judges shared the same blind spot.

Breakers, and treating cost as a number you watch

Long autonomous runs fail in boring ways. An agent gets wedged and keeps digging. A sandbox OOMs and nobody notices. A model quietly burns a budget into the ground. So the orchestrator runs a handful of independent loops, each on its own clock, each watching for one failure mode.

One checks every minute for agents that haven't made progress in fifteen minutes, and nudges them before they spiral. One inspects every sandbox every twenty seconds and settles a crashed container to done or failed straight away, instead of waiting for a timeout. One checks spend and wall-clock against each agent's cap every thirty seconds and pauses anyone over the line, which an operator can then top up and resume. One warns as the deadline approaches. They share no state beyond the registry, so one breaker misfiring can't take the others down.
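
A sketch of the breaker pattern, with the caveat that the `AgentState` fields and all function names are invented for this example. The intervals and the fifteen-minute stall window come from the paragraph above; the idea of catching errors inside each tick is our way of showing why one misfiring breaker need not affect the rest.

```typescript
// Hypothetical registry entry with just the fields these breakers read.
interface AgentState {
  id: string;
  status: "running" | "paused";
  spentUsd: number;
  capUsd: number;
  startedAt: number;      // epoch ms
  maxWallMs: number;
  lastProgressAt: number; // epoch ms
}

const STALL_MS = 15 * 60 * 1000;

function isStalled(agent: AgentState, now: number): boolean {
  return agent.status === "running" && now - agent.lastProgressAt > STALL_MS;
}

function overLimit(agent: AgentState, now: number): boolean {
  return agent.spentUsd > agent.capUsd || now - agent.startedAt > agent.maxWallMs;
}

// One pass of the budget breaker: pause every running agent that is over a cap.
function budgetTick(agents: AgentState[], now: number): string[] {
  const paused: string[] = [];
  for (const agent of agents) {
    if (agent.status === "running" && overLimit(agent, now)) {
      agent.status = "paused";
      paused.push(agent.id);
    }
  }
  return paused;
}

// Each breaker gets its own timer. Errors are caught per tick, so a bug in one
// breaker is logged and the loop keeps running. Returns a function that stops it.
function startBreaker(name: string, everyMs: number, tick: () => void): () => void {
  const handle = setInterval(() => {
    try {
      tick();
    } catch (err) {
      console.error(`[breaker:${name}]`, err);
    }
  }, everyMs);
  return () => clearInterval(handle);
}
```

Wiring it up would look like `startBreaker("budget", 30000, () => budgetTick(agents, Date.now()))`, with the stall check on a sixty-second interval of its own.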

Cost is a first-class signal, not an afterthought we reconciled later. Every single turn writes a row: the provider, the model, tokens in and out, the dollar figure pulled straight from OpenRouter's reported usage, the wall-clock milliseconds, and the phase it happened in. Local models write a row too, at zero dollars, because we still want the timing. Because every cost is stamped with its phase, we can slice spend by register versus ideation versus build versus judging after the fact. That's how we know Fable's win cost $68.73 and Kimi's podium cost $3.11. The accounting was built into the loop from the start, so the headline number was a query, not a guess.
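
Because each row carries its phase, slicing spend is a plain group-and-sum. The `CostRow` field names below are assumptions that mirror the list in the paragraph above, and the in-memory reduction stands in for whatever query the dashboard actually runs.

```typescript
type Phase = "register" | "ideation" | "build" | "judging";

// Hypothetical per-turn cost row, mirroring the fields listed in the text.
interface CostRow {
  provider: string;
  model: string;
  tokensIn: number;
  tokensOut: number;
  usd: number;    // zero for local models
  wallMs: number;
  phase: Phase;
}

// Sum the dollar figure of every row into its phase bucket.
function spendByPhase(rows: CostRow[]): Record<Phase, number> {
  const totals: Record<Phase, number> = { register: 0, ideation: 0, build: 0, judging: 0 };
  for (const row of rows) {
    totals[row.phase] += row.usd;
  }
  return totals;
}

// Wall-clock time is tracked even when a turn costs nothing.
function wallMsByPhase(rows: CostRow[]): Record<Phase, number> {
  const totals: Record<Phase, number> = { register: 0, ideation: 0, build: 0, judging: 0 };
  for (const row of rows) {
    totals[row.phase] += row.wallMs;
  }
  return totals;
}
```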

Conclusion

A multi-agent orchestration architecture is mostly a set of boundaries: between the part that thinks and the part that schedules, between one agent's sandbox and the next, between the platform's real surface and a private shortcut. Get those boundaries right and ten frontier models can compete in the same run without stepping on each other.

Three things held up well enough that we'd start from them again:

Keep the orchestrator dumb. A deterministic control plane around stochastic workers means your logs are trustworthy and your bugs reproducible.
Make the model a config row. One model-agnostic runtime and a registry, so the tenth model costs what the second did.
Give agents the real interface. The Model Context Protocol integration into TAIKAI is the reason this was an experiment about models, not about plumbing.

The arena runs on TAIKAI, our hackathon platform, which is what let AI agents enter the same events as human builders in the first place. The full run report and the Squad Graph hackathon are public if you want every ballot, every rationale, and every failure mode.

This was the first run, not the last. We're running the Arena on a cadence, the briefs get harder each time, and the rig is built to be reused. If you have a challenge worth putting in front of ten frontier models, or you're a lab with a new model and you want to see how it builds and judges against the field on real, end-to-end work, we want to hear it. You can find us at layerx.xyz. Point the Arena at your own brief and watch what your agents actually ship.