AI Economics·14 min read·June 10, 2026

Stop Paying Frontier Prices for Every Token: The Plan-and-Execute Pattern That Cuts Agent Costs 90%

XYZBytes Team

XYZBytes

Most teams build their first agent the way they wrote their first prompt: they pick the best model available and route everything through it. It works in the demo. Then the agent goes multi-step, the loop runs forty, eighty, two hundred times per task, and the bill arrives. Tokens are the cost of goods sold for any agentic product, and a single frontier model handling every trivial decision is the fastest way to make the unit economics impossible. The fix is not a cheaper model. It is a heterogeneous architecture — expensive reasoning where it pays, cheap execution everywhere else — and the design pattern that makes it concrete is Plan-and-Execute.

The Cost Trap of Frontier-for-Everything

A single-turn chatbot has a forgiving cost profile. The user sends one message, the model sends one back, and the token count is bounded by the conversation. An agent is a different animal. An agent reasons, calls a tool, reads the result, reasons again, calls another tool, and repeats until the task is done. Each iteration re-sends the accumulated context. A task that takes sixty steps does not cost sixty times a single turn — it costs far more, because the context window grows with every step and the whole transcript is re-billed on each pass.
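To see why the growth is worse than linear, here is a minimal sketch of the arithmetic. It assumes that every step appends a fixed number of tokens to the transcript and that the full transcript is re-sent as input on each pass. The function name and the token figures are hypothetical and illustrative, not measurements.

```python
def cumulative_input_tokens(steps: int, base_context: int, tokens_per_step: int) -> int:
    """Total input tokens billed when the whole transcript is re-sent on every step."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the full transcript is billed as input on this pass
        context += tokens_per_step  # this step's reasoning and tool output are appended
    return total


# Illustrative numbers only: a 1,000-token starting prompt, 500 tokens added per step.
single_turn = cumulative_input_tokens(1, 1_000, 500)
sixty_steps = cumulative_input_tokens(60, 1_000, 500)
ratio = sixty_steps / single_turn
```

Under these assumptions the sixty-step run bills 945,000 input tokens against 1,000 for a single turn: 945 times the input, not sixty times. The exact multiple depends entirely on how fast the context grows, but the shape is the point: input cost scales with the square of the step count.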

This is why token pricing quietly breaks budgets that looked comfortable on paper. We have written before about how token pricing is breaking enterprise AI coding budgets — Uber burning its 2026 AI budget in four months, Microsoft pulling internal Claude Code access because engineers used it too well. Those stories are about seat-based licensing colliding with consumption-based reality. The agent version of the same problem is worse, because the consumption is not driven by a human typing — it is driven by a loop that never gets tired.

FIG. 02 — COST REDUCTION, PLAN-AND-EXECUTE

Up to 90%

O'Reilly, 'The AI Agents Stack 2026' — heterogeneous routing where a capable planner directs cheaper executors, versus running a frontier model for every step

The structural error in frontier-for-everything is treating every step as equally hard. The steps are not equally hard. In a real agent run, the genuine reasoning (deciding the approach, recovering from an unexpected tool result, weighing two ambiguous paths) is a small fraction of the steps. The rest is mechanical: format this output, extract that field, decide whether a string matches a pattern, call the next tool in an obvious sequence. Paying frontier prices for the mechanical majority is the trap. You are buying a surgeon to apply bandages.

"The economics of running agents at scale demand heterogeneous architectures: expensive frontier models for complex reasoning and orchestration, mid-tier models for standard tasks, and small language models for high-frequency execution."

O'Reilly, 'The AI Agents Stack 2026'

Three Tiers, Not One Model

The mental shift is from "which model do I use?" to "which model does this step deserve?" A mature agent stack is heterogeneous by design, with at least three tiers. Frontier models — the most capable, most expensive — handle orchestration, planning, and the hard judgment calls. Mid-tier models handle standard work that needs competence but not brilliance: summarizing a document, drafting a function from a clear spec, classifying with nuance. Small language models (SLMs) handle the high-frequency, low-ambiguity execution that dominates the step count: field extraction, format conversion, routing decisions, boolean judgments.

The pricing gap between tiers is the entire reason this works. To make the spread concrete, look at the published numbers for one vendor's lineup. Claude Fable 5 — the long-horizon agentic model we covered in our Fable 5 adoption guide — lists at roughly $10 per million input tokens and $50 per million output, while Opus 4.8 sits near $5 input and $25 output. A capable mid-tier model is a fraction of that, and an SLM is a fraction again. When a step that an SLM could handle for cents instead runs through Fable 5, you are paying a 20x to 50x premium for an answer that did not need it.

$10 / $50

Fable 5 — input / output per 1M tokens

$5 / $25

Opus 4.8 — input / output per 1M tokens

FIG. 03 — Published frontier-tier pricing. Mid-tier and SLM execution sits an order of magnitude or more below this. Source: vendor pricing, XYZBytes analysis

Routing by complexity is not a vague aspiration; it is a concrete decision table. For each kind of step your agent takes, you can name the tier it deserves and the failure mode you accept at that tier. The discipline is to write that table down before you build, then enforce it in the orchestration layer rather than hoping the model "just figures it out."

FIG. 04 — FRONTIER TIER

Reasoning & orchestration

• Decomposing a goal into an ordered plan
• Recovering when a step fails unexpectedly
• Weighing two ambiguous, high-stakes paths
• Final review of a large, risky change

FIG. 04 — MID-TIER

Standard, bounded tasks

• Drafting code from an explicit spec
• Summarizing a document with nuance
• Multi-class classification with context
• Generating tests for scoped functions

FIG. 04 — SLM TIER

High-frequency execution

• Extracting a field from structured text
• Format and schema conversion
• Boolean / yes-no routing decisions
• Templated transforms on known shapes

The reason this table has to live in the orchestration layer rather than inside a single model's reasoning is that models are bad at pricing their own effort. Ask a frontier model to "use the cheapest approach" and it will still spend its full reasoning budget deciding what cheap means, because that is the only mode it has. The economy comes from the system around the model deciding, deterministically, which model even gets invoked for a given step. That decision is code — a router — and the router is where your cost discipline either lives or leaks away.

A practical router keys off the shape of the step, not a vague difficulty score. Steps that produce a structured output against a known schema, transform data between formats, or answer a closed-set question are routed to the SLM tier by default. Steps that require synthesizing across multiple sources, drafting something open-ended, or making a classification with real nuance go mid-tier. Only steps that decide strategy, recover from an unexpected state, or carry irreversible consequences reach the frontier tier. The router does not need to be clever; it needs to be explicit, logged, and easy to audit when a step lands on the wrong tier.
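A router of this kind can be very small. The sketch below is one possible shape, not a reference implementation: the step-shape labels, the `ROUTING_TABLE` name, and the choice to send unknown shapes to the frontier tier are all assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    SLM = "slm"
    MID = "mid"
    FRONTIER = "frontier"


# Hypothetical step-shape labels; a real system would define its own taxonomy.
ROUTING_TABLE = {
    "structured_extraction": Tier.SLM,
    "format_conversion": Tier.SLM,
    "closed_set_question": Tier.SLM,
    "multi_source_synthesis": Tier.MID,
    "open_ended_draft": Tier.MID,
    "nuanced_classification": Tier.MID,
    "strategy_decision": Tier.FRONTIER,
    "unexpected_state_recovery": Tier.FRONTIER,
}

# Every routing decision is recorded so a misrouted step can be audited later.
routing_log: list[dict] = []


def route(step_shape: str, irreversible: bool = False) -> Tier:
    """Pick a tier from the step's shape; irreversible or unknown steps go to frontier."""
    if irreversible:
        tier = Tier.FRONTIER
    else:
        # Assumption: an unrecognized shape fails safe to the most capable tier.
        tier = ROUTING_TABLE.get(step_shape, Tier.FRONTIER)
    routing_log.append({"shape": step_shape, "irreversible": irreversible, "tier": tier.value})
    return tier
```

Defaulting unknown shapes upward trades some cost for safety. A team that prefers the opposite trade could default to the mid tier and lean on validation to catch the misses.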

The Plan-and-Execute Pattern in Detail

Plan-and-Execute is the architecture that turns the tier table into a running system. It splits the agent into two roles. The planner — a frontier model — looks at the goal once, with full context, and produces an explicit, ordered plan: a list of steps, each with a clear input, a clear expected output, and the tool or model that should carry it out. The executor — a cheaper model, or a fleet of them — then runs each step in turn, never re-deliberating about strategy, only doing the bounded work the plan assigned.
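One way to make the plan explicit is to represent it as data the orchestration layer can walk. The sketch below assumes hypothetical `PlanStep` and `Plan` types and uses a stub function in place of real model calls; the field names are ours, chosen to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    step_id: int
    description: str      # the bounded work this step performs
    input_ref: str        # where the input comes from: "goal" or "step:<id>"
    expected_output: str  # the shape the result must have
    assignee: str         # the tool or model tier the planner assigned


@dataclass
class Plan:
    goal: str
    steps: list[PlanStep] = field(default_factory=list)


def run_plan(plan: Plan, executors: dict) -> dict:
    """Run steps in order; each executor sees only its own input, not the full transcript."""
    results = {"goal": plan.goal}
    for step in plan.steps:
        execute = executors[step.assignee]
        results[f"step:{step.step_id}"] = execute(step, results[step.input_ref])
    return results


# Stub executor standing in for a cheap model call (illustration only).
def stub_slm(step: PlanStep, text: str) -> str:
    if step.step_id == 1:
        return text.split()[-1]          # "extract" the trailing date
    year, month, day = text.split("-")
    return f"{day}/{month}/{year}"       # reformat the ISO date


plan = Plan(
    goal="Invoice 0042 dated 2026-06-10",
    steps=[
        PlanStep(1, "extract the date", "goal", "ISO date string", "slm"),
        PlanStep(2, "reformat the date", "step:1", "DD/MM/YYYY string", "slm"),
    ],
)
results = run_plan(plan, {"slm": stub_slm})
```

Because each step names its own input, the executor never needs the accumulated transcript, which is what keeps its context small.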

The savings come from where the expensive thinking happens. In a naive ReAct-style loop, the frontier model is invoked on every single iteration, re-reading the entire growing transcript to decide the next micro-action. In Plan-and-Execute, the frontier model is invoked once to plan, and then — ideally — only again when something goes off-script. The executor steps run on models that cost a fraction as much, and because each executor step is scoped to one bounded task, its context window stays small. You save on both the per-token rate and the token count.

The way you handle the off-script case determines whether the architecture holds. The robust version uses a re-planning threshold and escalation. An executor step that fails validation does not silently retry on the same cheap model forever. It escalates: first to a mid-tier model, then — if it still cannot produce a valid result — back to the planner, which can revise the remaining plan with full context. Most steps never escalate. The few that do are exactly the hard cases that justify frontier pricing, which is the whole point.
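The escalation ladder can be sketched as a short loop. This version assumes a hypothetical `run_with_escalation` helper, a per-tier retry limit standing in for the re-planning threshold, and stub functions in place of real models. Returning `None` is our convention for "hand this step back to the planner".

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], str]


def run_with_escalation(
    task: str,
    ladder: Sequence[Tuple[str, Model]],
    is_valid: Callable[[str], bool],
    max_attempts_per_tier: int = 2,
) -> Optional[Tuple[str, str]]:
    """Try each tier in order, cheapest first; return (tier_name, result) on the first valid result."""
    for tier_name, call_model in ladder:
        for _ in range(max_attempts_per_tier):
            result = call_model(task)
            if is_valid(result):
                return tier_name, result
    return None  # every tier failed: the caller should ask the planner to revise the plan


calls = []


def stub_slm(task: str) -> str:
    calls.append("slm")
    return "maybe"  # never passes validation in this example


def stub_mid(task: str) -> str:
    calls.append("mid")
    return "yes"


outcome = run_with_escalation(
    "Is the invoice overdue?",
    ladder=[("slm", stub_slm), ("mid", stub_mid)],
    is_valid=lambda r: r in {"yes", "no"},
)
```

Here the cheap tier is tried twice, fails validation both times, and the mid tier answers on its first attempt. The retry limit is the knob that stops a failing step from looping on the cheap model indefinitely.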

Quality Guardrails That Make the Savings Real

A cheaper model is also a more error-prone model, and the entire cost argument falls apart if those errors leak into the output. The discipline that protects you is verification at the tier boundary. Every executor step should emit a structured, validatable result — a JSON object against a schema, a value in a known enum, a diff that applies cleanly. The orchestration layer checks that result deterministically before accepting it. If the check fails, the step escalates. This is the same maker-checker shape that disciplined agent systems use everywhere: the cheap maker proposes, a stricter checker verifies, and only validated work flows downstream.

Crucially, the checker does not have to be another expensive model. Most executor outputs can be validated with code: schema validation, type checks, "does this code compile," "is this value in range." Deterministic verification costs almost nothing next to a model call and runs faster than one. Reserve model-based checking for the cases where correctness is genuinely a judgment call, and even then prefer a mid-tier checker over a frontier one. The cost of getting an executor step wrong is not the token price of that step; it is the downstream blast radius, which is exactly what guardrails contain.
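A code-only check at the tier boundary might look like the following sketch. The schema (a `status` enum and a bounded `amount`) is hypothetical, invented for illustration. Only the standard-library `json` module is used.

```python
import json

ALLOWED_STATUS = {"approved", "rejected", "needs_review"}  # hypothetical enum


def validate_executor_output(raw: str) -> tuple[bool, str]:
    """Check a model's raw text against a fixed schema using code only, with no model call."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        return False, "status not in allowed set"
    amount = data.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return False, "amount must be a number"
    if not 0 <= amount <= 1_000_000:
        return False, "amount out of range"
    return True, "ok"
```

A failed check returns a reason string alongside the verdict, so the orchestration layer can log why a step escalated.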

The Quality Question, Answered Honestly

The objection every engineer raises first is fair: won't cheaper models produce worse work? On the steps you route to them, yes, they are individually less capable — and that is exactly why routing matters. The bet is not that an SLM is as smart as a frontier model; it is that the steps you send to the SLM are ones where smartness is not the binding constraint. Extracting a date from a structured record does not get better with more reasoning. Converting JSON to a known schema has a correct answer that a small model reliably hits. For these steps, the frontier model's extra capability is spent on nothing, and the cheaper model's output is indistinguishable.

The quality risk lives entirely at the routing boundary: a step that genuinely needed judgment getting sent to a model that lacks it. That is why the validation and escalation machinery is not optional polish — it is the mechanism that catches a misroute before it becomes a defect. A cheap step that fails its schema check escalates rather than ships. A cheap step that produces a valid-but-subtly-wrong result is the dangerous case, and it is the same "almost right" failure mode that plagues all AI output — which is why the steps you route down should be ones where "valid" and "correct" coincide, verifiable in code. Reserve the genuinely ambiguous work, where almost-right hides, for the tier that can actually tell the difference.

A Worked Cost Example

Numbers make the abstraction land. Take an agent that processes a batch of 1,000 documents. For each document it does one genuine reasoning step — deciding how to handle this particular case — and roughly twenty mechanical steps: extracting fields, normalizing formats, validating values, routing to the right downstream handler. That is 1,000 reasoning steps and 20,000 execution steps across the batch.

FIG. 06 — FRONTIER-FOR-EVERYTHING

21,000 frontier-tier calls

• 1,000 reasoning steps on the frontier model
• 20,000 execution steps on the frontier model
• Every mechanical step billed at premium input + output rates
• Context re-billed as each per-document loop grows

FIG. 06 — PLAN-AND-EXECUTE

1,000 frontier + 20,000 cheap calls

• 1,000 planning steps on the frontier model
• 20,000 execution steps on an SLM tier
• Mechanical steps billed at a fraction of frontier rates
• Small, scoped contexts per executor step keep token counts low

In the first design, the frontier model absorbs all 21,000 calls. In the second, the frontier model handles only the 1,000 that require judgment. The 20,000 mechanical steps, which were the bulk of the spend in the first design, move to a tier that costs at least an order of magnitude less per token, with smaller contexts on top. When the dominant volume of calls drops from frontier pricing to SLM pricing, total cost falls by the kind of margin O'Reilly describes: up to 90%. The reasoning quality on the steps that matter is unchanged, because those steps still run on the frontier model.
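The arithmetic can be made concrete under stated assumptions. In the sketch below, the frontier prices are the Fable 5 list prices quoted earlier. The SLM prices and the per-step token counts are assumptions chosen for illustration, not published figures. Token counts are held equal across both designs, so the result isolates the per-token rate and ignores the further saving from smaller executor contexts.

```python
# USD per 1M tokens. Frontier values are the Fable 5 list prices quoted in the article.
FRONTIER = {"input": 10.00, "output": 50.00}
# ASSUMPTION: an SLM priced 20x below frontier; not a published price.
SLM = {"input": 0.50, "output": 2.50}

# ASSUMPTION: every step consumes the same token counts in both designs.
INPUT_TOKENS_PER_STEP = 2_000
OUTPUT_TOKENS_PER_STEP = 200


def step_cost(price: dict) -> float:
    """Cost in USD of one step at the given per-million-token prices."""
    return (
        INPUT_TOKENS_PER_STEP * price["input"]
        + OUTPUT_TOKENS_PER_STEP * price["output"]
    ) / 1_000_000


reasoning_steps = 1_000    # one per document
execution_steps = 20_000   # twenty per document

frontier_for_everything = (reasoning_steps + execution_steps) * step_cost(FRONTIER)
plan_and_execute = (
    reasoning_steps * step_cost(FRONTIER) + execution_steps * step_cost(SLM)
)
savings = 1 - plan_and_execute / frontier_for_everything
```

With these inputs the batch costs about $630 on the frontier model alone and about $60 under Plan-and-Execute, a reduction of roughly 90%. Change the assumed SLM price or token counts and the figure moves accordingly, which is why the table should be filled in with your own traffic.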

"You do not save money by using a worse model. You save money by refusing to spend frontier dollars on steps that were never hard in the first place — and spending them, fully, on the steps that are."

XYZBytes analysis, June 2026

When Heterogeneous Routing Is Not Worth It

The pattern earns its keep at scale and frequency. If your agent runs a handful of times a day, the engineering cost of building a router, a tier table, escalation paths, and per-tier validation will dwarf the token savings. Premature tier-splitting is its own trap — you can spend a week saving pennies. Start with a single capable model, measure where the tokens actually go, and introduce tiers only where the volume justifies the complexity. The decision table is most valuable when it is informed by real traffic, not guessed at on day one.
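Measuring where the tokens go does not require special tooling. A minimal sketch, assuming you already log one record per model call with a step type and token counts (the record fields and function name here are hypothetical):

```python
from collections import defaultdict


def tokens_by_step_type(call_log: list[dict]) -> dict:
    """Return {step_type: (total_tokens, share_of_all_tokens)}, largest consumer first."""
    totals = defaultdict(int)
    for record in call_log:
        totals[record["step_type"]] += record["input_tokens"] + record["output_tokens"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {
        step_type: (count, count / grand_total if grand_total else 0.0)
        for step_type, count in ranked
    }


# Illustrative log entries, not real traffic.
sample_log = [
    {"step_type": "extract", "input_tokens": 900, "output_tokens": 100},
    {"step_type": "plan", "input_tokens": 2_500, "output_tokens": 500},
    {"step_type": "extract", "input_tokens": 900, "output_tokens": 100},
]
report = tokens_by_step_type(sample_log)
```

Step types that consume a large share of tokens and have mechanically checkable outputs are the first candidates for a cheaper tier.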

Heterogeneous architecture also adds operational surface. You now depend on more models, more failure modes, and more places where a version bump on a mid-tier model silently shifts your quality. The escalation logic must be tested as carefully as the happy path, because a misrouted hard step on a cheap model is precisely the kind of "almost right" output that slips past review. Where this discipline pays off — high-volume, repetitive, long-running agents — it pays off enormously. Where it does not, it is premature optimization wearing an architecture diagram.

Conclusion: Spend Like an Engineer, Not a Tourist

The teams shipping agents profitably in 2026 are not the ones with access to the best model. They are the ones who treat token cost as a first-class engineering constraint and route work to the cheapest model that can do it correctly. The Plan-and-Execute pattern is the concrete expression of that discipline: think hard, once, at the top; execute cheaply, many times, below; and verify at every boundary so the savings never come at the cost of correctness.

Frontier models are not the enemy. Spending frontier dollars on mechanical work is. Used where it counts — planning, recovery, the genuinely hard judgment calls — the best model is worth every cent. Used for everything, it is the line item that kills the product before it ships. The architecture decides which one you get.

Keep reading

Developer Tools

17 min read·Jun 2026

Claude Fable 5 for Development Teams: The Pricing Math, the Safeguard Router, and the June 23 Cliff

Fable 5 costs 2x Opus 4.8 — and pays for itself on long-horizon work. A practical adoption guide: the safeguard router's product implications, the 30-day retention compliance trap, and a concrete two-week evaluation plan before the June 23 usage-credit cliff.

XYZBytes

AI & Automation

12 min read·Jun 2026

The 10x Engineer Is Dead. The 10-Agent Engineer Is Here

Gartner logged a 1,445% surge in multi-agent inquiries while pure implementation roles fell 17%. The engineers winning in 2026 orchestrate ten agents — decomposition, verification, and taste are the new leverage.

XYZBytes

Developer Productivity

19 min read·Jun 2026

Why Token Pricing Is Quietly Breaking Every Enterprise AI Coding Budget

Seat-based licensing is dead. Microsoft killed internal Claude Code because engineers used it too well, and Uber burned its 2026 AI budget in four months.

XYZBytes
