Most teams build their first agent the way they wrote their first prompt: they pick the best model available and route everything through it. It works in the demo. Then the agent goes multi-step, the loop runs forty, eighty, two hundred times per task, and the bill arrives. Tokens are the cost of goods sold for any agentic product, and a single frontier model handling every trivial decision is the fastest way to make the unit economics impossible. The fix is not a cheaper model. It is a heterogeneous architecture — expensive reasoning where it pays, cheap execution everywhere else — and the design pattern that makes it concrete is Plan-and-Execute.
The Cost Trap of Frontier-for-Everything
A single-turn chatbot has a forgiving cost profile. The user sends one message, the model sends one back, and the token count is bounded by the conversation. An agent is a different animal. An agent reasons, calls a tool, reads the result, reasons again, calls another tool, and repeats until the task is done. Each iteration re-sends the accumulated context. A task that takes sixty steps does not cost sixty times a single turn — it costs far more, because the context window grows with every step and the whole transcript is re-billed on each pass.
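To see why the growth is worse than linear, here is a minimal sketch of the arithmetic. The function name and the token figures (a 2,000-token starting context, 500 tokens appended per step) are illustrative assumptions, not measurements, and the model ignores prompt caching, which some providers offer and which would soften the curve.

```python
def naive_loop_input_tokens(steps: int, base_context: int, tokens_per_step: int) -> int:
    """Total input tokens billed when every step re-sends the whole transcript."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the full transcript is billed as input on this pass
        context += tokens_per_step  # this step's reasoning and tool result are appended
    return total


# Illustrative numbers only (assumptions, not measurements).
single_turn = naive_loop_input_tokens(steps=1, base_context=2_000, tokens_per_step=500)
sixty_steps = naive_loop_input_tokens(steps=60, base_context=2_000, tokens_per_step=500)

print(single_turn)                # 2000
print(sixty_steps)                # 1005000
print(sixty_steps / single_turn)  # 502.5
```

Under these assumed figures, sixty steps bill about five hundred times the input of a single turn, not sixty times, because the per-pass context grows linearly and the total therefore grows roughly with the square of the step count.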
This is why token pricing quietly breaks budgets that looked comfortable on paper. We have written before about how token pricing is breaking enterprise AI coding budgets — Uber burning its 2026 AI budget in four months, Microsoft pulling internal Claude Code access because engineers used it too well. Those stories are about seat-based licensing colliding with consumption-based reality. The agent version of the same problem is worse, because the consumption is not driven by a human typing — it is driven by a loop that never gets tired.
The structural error in frontier-for-everything is treating every step as equally hard. It is not. In a real agent run, the genuine reasoning — deciding the approach, recovering from an unexpected tool result, weighing two ambiguous paths — is a small fraction of the steps. The rest is mechanical: format this output, extract that field, decide whether a string matches a pattern, call the next tool in an obvious sequence. Paying frontier prices for the mechanical majority is the trap. You are buying a surgeon to apply bandages.
"The economics of running agents at scale demand heterogeneous architectures: expensive frontier models for complex reasoning and orchestration, mid-tier models for standard tasks, and small language models for high-frequency execution."
Three Tiers, Not One Model
The mental shift is from "which model do I use?" to "which model does this step deserve?" A mature agent stack is heterogeneous by design, with at least three tiers. Frontier models — the most capable, most expensive — handle orchestration, planning, and the hard judgment calls. Mid-tier models handle standard work that needs competence but not brilliance: summarizing a document, drafting a function from a clear spec, classifying with nuance. Small language models (SLMs) handle the high-frequency, low-ambiguity execution that dominates the step count: field extraction, format conversion, routing decisions, boolean judgments.
The pricing gap between tiers is the entire reason this works. To make the spread concrete, look at the published numbers for one vendor's lineup. Claude Fable 5, the long-horizon agentic model we covered in our Fable 5 adoption guide, lists at roughly $10 per million input tokens and $50 per million output, while Opus 4.8 sits near $5 input and $25 output. A capable mid-tier model is a fraction of that, and an SLM is a fraction again. When a step that an SLM could handle for cents instead runs through Fable 5, the premium can reach 20x to 50x for an answer that did not need it.
Routing by complexity is not a vague aspiration; it is a concrete decision table. For each kind of step your agent takes, you can name the tier it deserves and the failure mode you accept at that tier. The discipline is to write that table down before you build, then enforce it in the orchestration layer rather than hoping the model "just figures it out."
Frontier tier: reasoning & orchestration
- Decomposing a goal into an ordered plan
- Recovering when a step fails unexpectedly
- Weighing two ambiguous, high-stakes paths
- Final review of a large, risky change
Mid tier: standard, bounded tasks
- Drafting code from an explicit spec
- Summarizing a document with nuance
- Multi-class classification with context
- Generating tests for scoped functions
SLM tier: high-frequency execution
- Extracting a field from structured text
- Format and schema conversion
- Boolean / yes-no routing decisions
- Templated transforms on known shapes
The reason this table has to live in the orchestration layer rather than inside a single model's reasoning is that models are bad at pricing their own effort. Ask a frontier model to "use the cheapest approach" and it will still spend its full reasoning budget deciding what cheap means, because that is the only mode it has. The economy comes from the system around the model deciding, deterministically, which model even gets invoked for a given step. That decision is code — a router — and the router is where your cost discipline either lives or leaks away.
A practical router keys off the shape of the step, not a vague difficulty score. Steps that produce a structured output against a known schema, transform data between formats, or answer a closed-set question are routed to the SLM tier by default. Steps that require synthesizing across multiple sources, drafting something open-ended, or making a classification with real nuance go mid-tier. Only steps that decide strategy, recover from an unexpected state, or carry irreversible consequences reach the frontier tier. The router does not need to be clever; it needs to be explicit, logged, and easy to audit when a step lands on the wrong tier.
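A router of this kind can be as plain as a lookup table plus a log line. The sketch below is one possible shape, not a reference implementation: the `StepShape` and `Tier` names, the shape categories, and the log format are all assumptions made for illustration.

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("router")


class Tier(Enum):
    SLM = "slm"
    MID = "mid"
    FRONTIER = "frontier"


class StepShape(Enum):
    # Hypothetical shape labels, derived from the routing rules described above.
    STRUCTURED_OUTPUT = "structured_output"
    FORMAT_TRANSFORM = "format_transform"
    CLOSED_SET_QUESTION = "closed_set_question"
    MULTI_SOURCE_SYNTHESIS = "multi_source_synthesis"
    OPEN_ENDED_DRAFT = "open_ended_draft"
    NUANCED_CLASSIFICATION = "nuanced_classification"
    STRATEGY_DECISION = "strategy_decision"
    UNEXPECTED_STATE_RECOVERY = "unexpected_state_recovery"
    IRREVERSIBLE_ACTION = "irreversible_action"


# The decision table, written down explicitly so it can be reviewed and audited.
ROUTING_TABLE: dict[StepShape, Tier] = {
    StepShape.STRUCTURED_OUTPUT: Tier.SLM,
    StepShape.FORMAT_TRANSFORM: Tier.SLM,
    StepShape.CLOSED_SET_QUESTION: Tier.SLM,
    StepShape.MULTI_SOURCE_SYNTHESIS: Tier.MID,
    StepShape.OPEN_ENDED_DRAFT: Tier.MID,
    StepShape.NUANCED_CLASSIFICATION: Tier.MID,
    StepShape.STRATEGY_DECISION: Tier.FRONTIER,
    StepShape.UNEXPECTED_STATE_RECOVERY: Tier.FRONTIER,
    StepShape.IRREVERSIBLE_ACTION: Tier.FRONTIER,
}


def route(step_id: str, shape: StepShape) -> Tier:
    """Pick a tier deterministically and log the decision for later audit."""
    tier = ROUTING_TABLE[shape]
    log.info("step=%s shape=%s tier=%s", step_id, shape.value, tier.value)
    return tier


route("doc-17/extract-date", StepShape.STRUCTURED_OUTPUT)   # Tier.SLM
route("doc-17/recover", StepShape.UNEXPECTED_STATE_RECOVERY)  # Tier.FRONTIER
```

Because the table is data rather than model behavior, a misrouted step shows up as a single log line and a one-line diff to fix.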
The Plan-and-Execute Pattern in Detail
Plan-and-Execute is the architecture that turns the tier table into a running system. It splits the agent into two roles. The planner — a frontier model — looks at the goal once, with full context, and produces an explicit, ordered plan: a list of steps, each with a clear input, a clear expected output, and the tool or model that should carry it out. The executor — a cheaper model, or a fleet of them — then runs each step in turn, never re-deliberating about strategy, only doing the bounded work the plan assigned.
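One way to make the plan explicit is to treat it as data that the executor walks. The sketch below assumes a hypothetical `PlanStep` and `Plan` structure and stands in a plain function for the model call; in a real system the planner would emit the plan and `run_step` would invoke whichever model the step's tier names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class PlanStep:
    step_id: str
    instruction: str
    input_keys: tuple[str, ...]  # which earlier results this step may read
    output_key: str              # where this step's result is stored
    tier: str                    # "slm", "mid", or "frontier" (assumed labels)


@dataclass(frozen=True)
class Plan:
    goal: str
    steps: tuple[PlanStep, ...]


Executor = Callable[[PlanStep, dict[str, Any]], Any]


def execute_plan(plan: Plan, initial: dict[str, Any], run_step: Executor) -> dict[str, Any]:
    """Run steps in order, handing each one only the inputs the plan assigned to it."""
    state = dict(initial)
    for step in plan.steps:
        scoped = {key: state[key] for key in step.input_keys}  # small context, not the transcript
        state[step.output_key] = run_step(step, scoped)
    return state


plan = Plan(
    goal="Normalize an invoice date",
    steps=(
        PlanStep("extract", "Extract the invoice date", ("document",), "raw_date", "slm"),
        PlanStep("normalize", "Convert the date to ISO 8601", ("raw_date",), "iso_date", "slm"),
    ),
)


def fake_executor(step: PlanStep, inputs: dict[str, Any]) -> Any:
    # Stand-in for a model call, so the sketch runs without any API.
    if step.step_id == "extract":
        return inputs["document"].split("Date: ")[1]
    day, month, year = inputs["raw_date"].split("/")
    return f"{year}-{month}-{day}"


result = execute_plan(plan, {"document": "Invoice 17. Date: 03/02/2026"}, fake_executor)
print(result["iso_date"])  # 2026-02-03
```

The design point is the `scoped` dictionary: each executor call sees only the inputs named in the plan, which is what keeps its context small.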
The savings come from where the expensive thinking happens. In a naive ReAct-style loop, the frontier model is invoked on every single iteration, re-reading the entire growing transcript to decide the next micro-action. In Plan-and-Execute, the frontier model is invoked once to plan, and then — ideally — only again when something goes off-script. The executor steps run on models that cost a fraction as much, and because each executor step is scoped to one bounded task, its context window stays small. You save on both the per-token rate and the token count.
The way you handle the off-script case determines whether the architecture holds. The robust version uses a re-planning threshold and escalation. An executor step that fails validation does not silently retry on the same cheap model forever. It escalates: first to a mid-tier model, then — if it still cannot produce a valid result — back to the planner, which can revise the remaining plan with full context. Most steps never escalate. The few that do are exactly the hard cases that justify frontier pricing, which is the whole point.
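The escalation ladder can be sketched in a few lines. Everything named here is hypothetical: `call_tier` stands in for a model call, `is_valid` for whatever check the step requires, and `attempts_per_tier` is one simple way to express a re-planning threshold.

```python
from typing import Any, Callable

EXECUTOR_LADDER = ("slm", "mid")  # assumed tier labels, cheapest first


class ReplanRequired(Exception):
    """Raised when no executor tier produced a valid result for a step."""


def run_with_escalation(
    task: str,
    call_tier: Callable[[str, str], Any],
    is_valid: Callable[[Any], bool],
    attempts_per_tier: int = 1,
) -> tuple[str, Any]:
    """Try each executor tier in order; hand back to the planner if all of them fail."""
    for tier in EXECUTOR_LADDER:
        for _ in range(attempts_per_tier):
            result = call_tier(tier, task)
            if is_valid(result):
                return tier, result
    # Nothing validated: the caller should ask the planner to revise the remaining plan.
    raise ReplanRequired(task)


def fake_call(tier: str, task: str) -> str:
    # Stand-in for model calls: the small model fumbles, the mid-tier model succeeds.
    return {"slm": "??", "mid": "2026-02-03"}[tier]


def looks_like_iso_date(value: Any) -> bool:
    return isinstance(value, str) and len(value) == 10 and value[4] == "-" and value[7] == "-"


print(run_with_escalation("normalize date", fake_call, looks_like_iso_date))
# ('mid', '2026-02-03')
```

Returning the tier alongside the result makes it easy to count how often steps escalate, which is the number to watch when tuning the routing table.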
Quality Guardrails That Make the Savings Real
A cheaper model is also a more error-prone model, and the entire cost argument falls apart if those errors leak into the output. The discipline that protects you is verification at the tier boundary. Every executor step should emit a structured, validatable result — a JSON object against a schema, a value in a known enum, a diff that applies cleanly. The orchestration layer checks that result deterministically before accepting it. If the check fails, the step escalates. This is the same maker-checker shape that disciplined agent systems use everywhere: the cheap maker proposes, a stricter checker verifies, and only validated work flows downstream.
Crucially, the checker does not have to be another expensive model. Most executor outputs can be validated with code: schema validation, type checks, "does this code compile," "is this value in range." Deterministic verification costs no tokens and runs far faster than any model call. Reserve model-based checking for the cases where correctness is genuinely a judgment call, and even then prefer a mid-tier checker over a frontier one. The cost of getting an executor step wrong is not the token price of that step; it is the downstream blast radius, which is exactly what guardrails contain.
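As a rough illustration of a code-only checker, the sketch below validates a hypothetical invoice-extraction result using nothing but the standard library. The field names, the allowed currencies, and the amount range are assumptions chosen for the example; a production system might use a schema library instead.

```python
import json
from datetime import date

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed closed set for this example


def validate_invoice(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output is accepted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]

    problems: list[str] = []

    amount = data.get("amount")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount must be a number")
    elif not 0 < amount < 1_000_000:
        problems.append("amount out of range")

    currency = data.get("currency")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        problems.append("currency not in allowed set")

    try:
        date.fromisoformat(data.get("issued_on", ""))
    except (TypeError, ValueError):
        problems.append("issued_on must be an ISO date")

    return problems


print(validate_invoice('{"amount": 12.5, "currency": "USD", "issued_on": "2026-02-03"}'))  # []
print(validate_invoice('{"amount": -1, "currency": "XYZ", "issued_on": "tomorrow"}'))
```

A non-empty list is the signal to escalate; the problem strings double as a log entry explaining why the cheap tier's output was rejected.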
The Quality Question, Answered Honestly
The objection every engineer raises first is fair: won't cheaper models produce worse work? On the steps you route to them, yes, they are individually less capable — and that is exactly why routing matters. The bet is not that an SLM is as smart as a frontier model; it is that the steps you send to the SLM are ones where smartness is not the binding constraint. Extracting a date from a structured record does not get better with more reasoning. Converting JSON to a known schema has a correct answer that a small model reliably hits. For these steps, the frontier model's extra capability is spent on nothing, and the cheaper model's output is indistinguishable.
The quality risk lives entirely at the routing boundary: a step that genuinely needed judgment getting sent to a model that lacks it. That is why the validation and escalation machinery is not optional polish — it is the mechanism that catches a misroute before it becomes a defect. A cheap step that fails its schema check escalates rather than ships. A cheap step that produces a valid-but-subtly-wrong result is the dangerous case, and it is the same "almost right" failure mode that plagues all AI output — which is why the steps you route down should be ones where "valid" and "correct" coincide, verifiable in code. Reserve the genuinely ambiguous work, where almost-right hides, for the tier that can actually tell the difference.
A Worked Cost Example
Numbers make the abstraction land. Take an agent that processes a batch of 1,000 documents. For each document it does one genuine reasoning step — deciding how to handle this particular case — and roughly twenty mechanical steps: extracting fields, normalizing formats, validating values, routing to the right downstream handler. That is 1,000 reasoning steps and 20,000 execution steps across the batch.
Frontier-for-everything: 21,000 frontier-tier calls
- 1,000 reasoning steps on the frontier model
- 20,000 execution steps on the frontier model
- Every mechanical step billed at premium input + output rates
- Context re-billed as each per-document loop grows
Plan-and-Execute: 1,000 frontier + 20,000 cheap calls
- 1,000 planning steps on the frontier model
- 20,000 execution steps on an SLM tier
- Mechanical steps billed at a fraction of frontier rates
- Small, scoped contexts per executor step keep token counts low
In the first design, the frontier model absorbs all 21,000 calls. In the second, the frontier model handles only the 1,000 that require judgment, and the 20,000 mechanical steps (the bulk of the spend in the first design) move to a tier that costs at least an order of magnitude less per token, with smaller contexts on top. When the dominant volume of calls drops from frontier pricing to SLM pricing, total cost falls by the kind of margin O'Reilly describes: up to 90%. The reasoning quality on the steps that matter is unchanged, because those steps still run on the frontier model.
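The comparison can be worked through with a few lines of arithmetic. The frontier price below is the $10 / $50 list price quoted earlier for Fable 5; the SLM price and every per-call token count are assumptions invented for this sketch, and it ignores escalations, retries, and validation overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # dollars per million input tokens
    output_per_mtok: float  # dollars per million output tokens

    def cost(self, calls: int, input_tokens: int, output_tokens: int) -> float:
        per_call = (input_tokens * self.input_per_mtok
                    + output_tokens * self.output_per_mtok) / 1_000_000
        return calls * per_call


FRONTIER = Price(10.0, 50.0)  # list price quoted in the article
SLM = Price(0.25, 1.25)       # ASSUMPTION: hypothetical small-model price


def frontier_for_everything() -> float:
    reasoning = FRONTIER.cost(calls=1_000, input_tokens=4_000, output_tokens=500)
    # ASSUMPTION: 2,000 input tokens is an average over the growing per-document loop.
    execution = FRONTIER.cost(calls=20_000, input_tokens=2_000, output_tokens=200)
    return reasoning + execution


def plan_and_execute() -> float:
    planning = FRONTIER.cost(calls=1_000, input_tokens=4_000, output_tokens=500)
    # ASSUMPTION: scoped executor steps carry a much smaller context.
    execution = SLM.cost(calls=20_000, input_tokens=800, output_tokens=200)
    return planning + execution


baseline = frontier_for_everything()
tiered = plan_and_execute()
print(f"baseline ${baseline:,.2f}  tiered ${tiered:,.2f}  saving {1 - tiered / baseline:.0%}")
```

With these made-up inputs the batch costs about $665 on the frontier model alone and about $74 when tiered, a saving of roughly 89%. The exact figure moves with the assumptions; the structure of the calculation is the point.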
"You do not save money by using a worse model. You save money by refusing to spend frontier dollars on steps that were never hard in the first place — and spending them, fully, on the steps that are."
When Heterogeneous Routing Is Not Worth It
The pattern earns its keep at scale and frequency. If your agent runs a handful of times a day, the engineering cost of building a router, a tier table, escalation paths, and per-tier validation will dwarf the token savings. Premature tier-splitting is its own trap — you can spend a week saving pennies. Start with a single capable model, measure where the tokens actually go, and introduce tiers only where the volume justifies the complexity. The decision table is most valuable when it is informed by real traffic, not guessed at on day one.
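Measuring where the tokens go does not need much tooling. The sketch below assumes you already log one record per model call with a step `kind` and token counts; the record shape and field names are hypothetical.

```python
from collections import defaultdict


def token_share_by_step_kind(call_log: list[dict]) -> dict[str, float]:
    """Return each step kind's share of total tokens, largest first."""
    totals: dict[str, int] = defaultdict(int)
    for record in call_log:
        totals[record["kind"]] += record["input_tokens"] + record["output_tokens"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {kind: tokens / grand_total for kind, tokens in ranked}


# Made-up sample records, for illustration only.
sample_log = [
    {"kind": "extract_field", "input_tokens": 900, "output_tokens": 100},
    {"kind": "plan", "input_tokens": 2_500, "output_tokens": 500},
    {"kind": "extract_field", "input_tokens": 900, "output_tokens": 100},
]

print(token_share_by_step_kind(sample_log))
# {'plan': 0.6, 'extract_field': 0.4}
```

If a mechanical step kind dominates the ranking on real traffic, that is the candidate for a cheaper tier; if nothing mechanical dominates, the single-model design may be fine as it is.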
Heterogeneous architecture also adds operational surface. You now depend on more models, more failure modes, and more places where a version bump on a mid-tier model silently shifts your quality. The escalation logic must be tested as carefully as the happy path, because a misrouted hard step on a cheap model is precisely the kind of "almost right" output that slips past review. Where this discipline pays off — high-volume, repetitive, long-running agents — it pays off enormously. Where it does not, it is premature optimization wearing an architecture diagram.
Conclusion: Spend Like an Engineer, Not a Tourist
The teams shipping agents profitably in 2026 are not the ones with access to the best model. They are the ones who treat token cost as a first-class engineering constraint and route work to the cheapest model that can do it correctly. The Plan-and-Execute pattern is the concrete expression of that discipline: think hard, once, at the top; execute cheaply, many times, below; and verify at every boundary so the savings never come at the cost of correctness.
Frontier models are not the enemy. Spending frontier dollars on mechanical work is. Used where it counts — planning, recovery, the genuinely hard judgment calls — the best model is worth every cent. Used for everything, it is the line item that kills the product before it ships. The architecture decides which one you get.