Developer Tools·14 min read·June 28, 2026

Eval-Driven Development: Why Evals Are the New Unit Tests for AI

XYZBytes Team

XYZBytes

Ten years ago, test-driven development separated the engineers who shipped reliable software from the ones who shipped working demos. The discipline — write the test first, make it pass, refactor — was uncomfortable until it wasn't, and then it became table stakes. In 2026, the same inflection is happening for AI. The teams building agents that stay in production are not the ones with the best models or the most clever prompts. They are the ones who wrote their evals before they wrote their agents.

TDD Professionalized Software. EDD Is Doing the Same for AI.

The analogy between test-driven development and eval-driven development is not a flattering metaphor; it is an accurate structural map. In both disciplines, the key move is identical: you specify the desired behavior before you implement it. In TDD, that specification is a unit test that starts out failing. In EDD, that specification is an eval the agent cannot yet pass. In both cases, the specification forces you to make your requirements concrete enough to measure. Vague requirements cannot be tested. They also cannot be built reliably, but the test reveals the vagueness first, before the build cost compounds.

The parallel continues at the process level. TDD did not succeed because testing was new — teams tested before TDD. It succeeded because it changed when and how testing happened. Writing the test first, rather than last, changed the feedback loop from "write code, ship, discover bugs" to "write spec, implement, verify." EDD changes the AI feedback loop the same way. The teams building LLM applications without evals are running the old loop: build a prompt, call the model, read the output, form a subjective impression. The teams running EDD have a number. That number is either moving in the right direction or it is not, and the difference is visible before the code ships.

The industry is converging on this quickly. What was an advanced practice among safety-conscious labs in 2024 is now the differentiating discipline between AI teams whose systems hold up in production and those whose systems degrade. The build cost of an LLM application has collapsed; the reliability cost has not moved. Evals are how you pay that cost deliberately rather than letting your users pay it for you.

What Eval-Driven Development Actually Means

Anthropic's engineering guidance on building with Claude is explicit on this point. The recommended practice is to build evals that define planned capabilities before an agent can actually fulfill them, and then iterate until the agent passes. Owning and iterating on evals should be as routine and as non-optional as maintaining unit tests. If a capability exists in the product, there should be an eval for it. If there is no eval, the capability is not tested — which, for an AI system that behaves non-deterministically, is equivalent to saying it is not understood.

"Build evals to define planned capabilities before agents can fulfill them, then iterate. Owning and iterating on evals should be as routine as maintaining unit tests."

Anthropic engineering guidance, 2026
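One way to read the guidance above is as a red/green loop, sketched minimally below. Everything here is an assumption made for illustration: the refund-status capability, the names `EvalCase`, `run_eval`, and `stub_agent`, and the substring check. None of it comes from a specific framework. The stub stands in for an agent that has not been built yet, so the eval starts out failing, much as a new unit test would.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One planned capability: an input plus a checkable expectation."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against the agent and record pass/fail per case."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


# Hypothetical capability: quote the order ID back and mention the refund.
CASES = [
    EvalCase(
        name="refund_status_mentions_order",
        prompt="What is the refund status for order A-1001?",
        check=lambda out: "A-1001" in out and "refund" in out.lower(),
    ),
]


def stub_agent(prompt: str) -> str:
    """Placeholder for an agent that does not exist yet."""
    return "Sorry, I can't help with that yet."


results = run_eval(stub_agent, CASES)
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

The eval is written first and fails against the stub. The work that follows is iterating on the agent until `pass_rate` moves.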

A secondary benefit of writing evals first is that it stress-tests whether your requirements are concrete enough to build. This sounds obvious but it rarely is. Teams regularly begin building agents with requirements like "it should summarize the customer conversation accurately" or "it should handle edge cases gracefully." Try writing an eval for either of those and you will immediately confront the ambiguity: accurately by whose standard? What counts as graceful? The eval cannot be written until the requirement is specified precisely enough to measure — and that precision is exactly what you need before you start building. EDD turns fuzzy product requirements into a forcing function for precision.
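As a rough illustration of that forcing function, the sketch below turns "summarize the customer conversation accurately" into three measurable checks. The criteria (required facts, forbidden claims, a word limit) and the names `SummaryRequirement` and `grade_summary` are assumptions chosen for the example. Substring matching is deliberately crude, and a real suite would likely need something more robust.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryRequirement:
    """A hypothetical, measurable reading of 'summarize accurately'."""
    required_facts: list[str]                                   # must appear
    forbidden_claims: list[str] = field(default_factory=list)   # must not appear
    max_words: int = 60


def grade_summary(summary: str, req: SummaryRequirement) -> dict[str, bool]:
    """Return one boolean per criterion so a failure names what went wrong."""
    text = summary.lower()
    return {
        "covers_required_facts": all(f.lower() in text for f in req.required_facts),
        "no_forbidden_claims": not any(c.lower() in text for c in req.forbidden_claims),
        "within_length": len(summary.split()) <= req.max_words,
    }


req = SummaryRequirement(
    required_facts=["late delivery", "refund"],
    forbidden_claims=["replacement"],
    max_words=30,
)
good = "Customer reported a late delivery and was issued a full refund."
bad = "Customer was happy and received a replacement."
print(grade_summary(good, req))
print(grade_summary(bad, req))
```

Writing `req` is where the ambiguity surfaces: someone has to decide which facts are required before the grader can run at all.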

EDD is also the right frame for ongoing iteration, not just initial development. Every time the model changes, the prompt changes, or the context window policy changes, the evals catch regressions before users do. This is where the unit-test analogy holds most tightly: a test suite that runs on every commit, blocks on failure, and reports a specific failing case is far more useful than a general impression that "the model seems worse this week." That vagueness is exactly what the trust crisis in AI development is made of. Stack Overflow's survey of 49,000+ developers found that 84% use AI tools but only 3% highly trust them, and the top frustration is output that is almost right but not quite. Evals are the mechanism that turns "almost right" into a measurable, improvable number instead of a recurring source of erosion.

Why Output-Only Evals Lie

The most common mistake in LLM evaluation is grading only the final output. An agent produces an answer; the eval checks whether the answer is correct. This approach misses everything that happened in between — every tool call, every retrieval step, every intermediate reasoning move that led to the conclusion. For simple single-turn tasks, output grading is sufficient. For the multi-step agent workflows that actually run in production, it is systematically misleading.

FIG. 02 — PASS-RATE INFLATION FROM OUTPUT-ONLY GRADING

20–40%

Industry eval benchmarking, 2026 — agents scored only on final outputs pass 20–40% more test cases than full trajectory evaluation reveals

The mechanism is straightforward once you see it. An agent can reach a correct final answer via an incorrect trajectory: it can stumble into the right answer after making mistakes that would fail in a slightly different context. When you only grade the output, those bad trajectories are scored as passing. When you grade the full trajectory (every step, every tool call, every branch decision), those cases are correctly identified as failures. Output-only grading typically passes 20–40% more test cases than trajectory grading does. That 20–40% is not a win you earned; it is a false signal hiding the production failures your users will eventually discover for you.

Trajectory evaluation catches three categories of failure that output evaluation misses entirely. First, lucky outputs — correct answers reached by incorrect means, which will fail when the lucky shortcut is not available. Second, efficiency failures — agents that reach the right answer after an excessive number of tool calls or retrieval steps, burning latency and tokens in ways that do not show up in a pass/fail grade on the output alone. Third, brittleness — agents that follow a fragile path which breaks on minor input variation. A trajectory eval catches all three; an output eval misses all three. The difference is the gap between a demo that impresses and a system that ships.
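The gap between the two grading styles can be sketched in a few lines. The trajectory shape (`Step`, `Trajectory`), the tool names, and the step budget below are all hypothetical. The example covers the lucky-output and efficiency cases. Brittleness would additionally need the same checks run across input variants, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str        # name of the tool the agent called
    ok: bool = True  # whether the call succeeded


@dataclass(frozen=True)
class Trajectory:
    steps: tuple[Step, ...]
    final_answer: str


def grade_output(traj: Trajectory, expected: str) -> bool:
    """Output-only grading: looks at nothing except the final answer."""
    return traj.final_answer.strip() == expected


def called_in_order(required: tuple[str, ...], traj: Trajectory) -> bool:
    """True if the required tools appear in the trajectory in the given order."""
    called = iter(step.tool for step in traj.steps)
    return all(tool in called for tool in required)


def grade_trajectory(traj: Trajectory, expected: str,
                     required: tuple[str, ...], max_steps: int) -> dict[str, bool]:
    """Trajectory grading: the answer plus how the agent got there."""
    return {
        "output_correct": grade_output(traj, expected),
        "required_tools_in_order": called_in_order(required, traj),
        "within_step_budget": len(traj.steps) <= max_steps,
        "no_failed_calls": all(step.ok for step in traj.steps),
    }


REQUIRED = ("lookup_order", "compute_refund")
EXPECTED = "refund: 42.00"

clean = Trajectory((Step("lookup_order"), Step("compute_refund")), EXPECTED)
lucky = Trajectory((Step("search_docs"),), EXPECTED)  # right answer, wrong path

for name, traj in (("clean", clean), ("lucky", lucky)):
    checks = grade_trajectory(traj, EXPECTED, REQUIRED, max_steps=5)
    print(name, grade_output(traj, EXPECTED), all(checks.values()))
```

Both trajectories pass `grade_output`; only `clean` passes every trajectory check. Returning a dictionary rather than a single boolean keeps the failing check visible in the report.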

The Four-Part Measurement Loop That Actually Ships

Knowing what to measure is the starting point. The harder question is building a measurement loop trustworthy enough to gate real decisions — to block a deploy, to approve a model upgrade, to sign off on a capability for production. That loop has four components, and all four need to be present for the measurement to mean anything.

FIG. 03 — THE MEASUREMENT LOOP THAT ACTUALLY SHIPS

Four components. All four required.

1. A golden dataset drawn from real failures. Synthetic test cases miss the distribution of actual production inputs. The most valuable evals come from real requests that failed — edge cases your agents surfaced in production, adversarial inputs your users tried, the exact queries that broke your system last month. Harvest these continuously and build them into the eval set as they accumulate. A golden dataset that never grows is a golden dataset that is slowly becoming irrelevant.

2. Graders you trust. Code-based graders — exact match, regex, schema validation, assertion on tool calls — are cheap, reliable, and should be used wherever deterministic criteria exist. Human graders are expensive but essential for calibrating everything else. Do not skip human grading and go straight to LLM judging. You will be calibrating your automated judge against an unknown standard, which means you have no standard.

3. An LLM judge calibrated to a human gold set. LLM judges are necessary for evaluating open-ended outputs that resist code-based grading. But an uncalibrated LLM judge is not a grader — it is a strong opinion. Before you trust the judge to gate CI, verify that its scores correlate with human labels on a representative sample. A judge that agrees with humans 90%+ of the time on your gold set is worth gating on. One that agrees 70% of the time is not.

4. A CI gate that blocks regressions. The eval suite runs on every meaningful change — model version, prompt version, retrieval policy, tool list. A regression in a tracked metric blocks the change from shipping. Without this gate, the eval suite is informational, not protective. It tells you things got worse after the fact rather than preventing the worse version from reaching users.
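For the third component, a minimal calibration check might look like the sketch below. It assumes binary pass/fail labels and uses raw agreement as the measure. The 0.90 default mirrors the "90%+" figure above, but where to draw the line between 70% and 90% is a team decision, not something this code settles. Raw agreement can also look flattering when labels are heavily imbalanced, so a chance-corrected statistic may be worth adding.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of gold-set items where the LLM judge matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def judge_is_gateworthy(human: list[bool], judge: list[bool],
                        threshold: float = 0.90) -> bool:
    """Hypothetical policy: only gate CI on a judge that clears the threshold."""
    return agreement_rate(human, judge) >= threshold


# Illustrative labels only, not real data.
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_close = [True, True, False, True, False, True, True, False, True, False]
judge_loose = [True, False, True, True, False, True, False, False, True, True]

print(agreement_rate(human_labels, judge_close), judge_is_gateworthy(human_labels, judge_close))
print(agreement_rate(human_labels, judge_loose), judge_is_gateworthy(human_labels, judge_loose))
```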

The four components reinforce each other. The golden dataset gives the graders real signal to measure. The calibrated judge makes it possible to grade what deterministic graders cannot reach. The CI gate turns the measurement into a constraint on the engineering process, not just a dashboard number. Remove any one of them and the system has a gap: a golden dataset without a CI gate is monitoring after the fact. A CI gate grading only outputs is blocking on a partial signal. The loop has to be complete to be useful.
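The gate itself can be very small. The sketch below assumes metrics where higher is better, a stored baseline from the last shipped version, and a tolerance of 0.01 to absorb noise. The metric names and all three assumptions are illustrative, not a prescribed setup.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Return a message for every tracked metric that dropped by more than `tolerance`."""
    problems = []
    for metric, base_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None:
            # A metric that silently disappears is treated as a regression.
            problems.append(f"{metric}: missing from candidate run")
        elif new_value < base_value - tolerance:
            problems.append(f"{metric}: {base_value:.3f} -> {new_value:.3f}")
    return problems


def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.01) -> int:
    """Return a process exit code: 0 lets the change ship, 1 blocks it."""
    problems = find_regressions(baseline, candidate, tolerance)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0


# Hypothetical numbers for a prompt change under review.
baseline = {"trajectory_pass_rate": 0.82, "all_runs_consistency": 0.61}
candidate = {"trajectory_pass_rate": 0.83, "all_runs_consistency": 0.52}
exit_code = gate(baseline, candidate)
# In a CI job this would end with: raise SystemExit(exit_code)
```

Here the per-trial number improved slightly while consistency fell, so the change is blocked. A non-zero exit code is what makes the suite protective rather than informational.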

The Consistency Math No One Talks About

LLM applications are non-deterministic. The same input produces different outputs across runs, and that variability creates a specific measurement trap that traditional software testing does not prepare engineers for. In unit testing, a test either passes or it does not — there is no concept of "this test passes 70% of the time." In LLM eval, that is exactly the reality, and how you handle it determines whether your evals give you accurate signals or optimistic ones.

Panels: 70% per-trial pass rate (looks production-ready); pass^k, all-runs consistency (the number that governs production); share of developers who fully trust AI-generated code.

FIG. 04 — The consistency gap and the trust gap. Source: industry eval benchmarking and Stack Overflow Developer Survey 2026

Consider a 70% per-trial pass rate. Run the agent once and it passes. Run it again and it passes. The eval dashboard looks healthy. But pass^k, the probability that every run in a sequence of k trials succeeds, tells a different story. For a 70%-per-trial agent over four trials, and assuming the trials are independent, pass^4 is 0.7^4 ≈ 24%. For a production workflow that executes dozens of times a day, an agent at 70% per trial should be expected to fail several times a day. The best-case pass rate that makes a demo look good is not the number that governs production behavior.
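The arithmetic is easy to check directly. The helper below is a sketch that assumes every trial is independent and has the same pass probability, which real agents may not satisfy. The name `pass_hat_k` is made up for this example.

```python
def pass_hat_k(per_trial_pass_rate: float, k: int) -> float:
    """Probability that all k independent trials pass: p ** k."""
    if not 0.0 <= per_trial_pass_rate <= 1.0:
        raise ValueError("per_trial_pass_rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return per_trial_pass_rate ** k


# How quickly a 70% per-trial rate decays as k grows.
for k in (1, 2, 4, 8):
    print(f"k={k}: pass^k = {pass_hat_k(0.7, k):.4f}")
```

For k = 4 this gives 0.2401, the roughly 24% quoted above, and the value keeps shrinking as k grows toward realistic daily run counts.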

The correct practice is to report both metrics, per-trial pass rate and all-runs consistency across a fixed k appropriate to your deployment frequency, and to gate CI on the all-runs metric, not just the per-trial one. This forces the team to build agents that are actually reliable at the k that matches how often they run in production, not agents that pass the eval on their best day. It also helps explain why only 3% of developers highly trust AI tools: they have seen the 70% become 24% in practice and internalized that best-case performance is not production performance. Evals that report pass^k honestly are the mechanism that closes that gap, not by eliminating non-determinism, but by measuring it accurately enough to manage it.
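Reporting both numbers from the same run data might look like the following sketch. It assumes each eval case was executed the same fixed number of times and that results are stored as lists of booleans keyed by case name. That layout and the name `summarize_trials` are assumptions made for the example.

```python
def summarize_trials(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Compute per-trial pass rate and all-runs consistency from repeated runs.

    `outcomes` maps each eval case to the pass/fail result of every trial.
    """
    if not outcomes or any(not runs for runs in outcomes.values()):
        raise ValueError("every case needs at least one trial")
    total_trials = sum(len(runs) for runs in outcomes.values())
    total_passes = sum(sum(runs) for runs in outcomes.values())
    cases_all_pass = sum(all(runs) for runs in outcomes.values())
    return {
        "per_trial_pass_rate": total_passes / total_trials,
        "all_runs_consistency": cases_all_pass / len(outcomes),
    }


# Illustrative results: four cases, k = 4 trials each.
outcomes = {
    "case_a": [True, True, True, True],
    "case_b": [True, True, False, True],
    "case_c": [True, False, True, True],
    "case_d": [True, True, True, False],
}
report = summarize_trials(outcomes)
print(report)
```

In this made-up data, 13 of 16 trials pass, yet only one case in four passes every time. Gating on `all_runs_consistency` is what exposes that difference.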

Embracing Non-Determinism Without Losing Rigor

The measurement rigor demanded by pass^k should not be confused with demanding that LLM outputs be deterministic. They are not, and trying to force determinism through aggressive temperature suppression or excessive output constraint typically degrades quality without actually solving the underlying variability. The right framing — which Red Hat's engineering team has articulated for production agentic systems — is to agree on what "good enough" means early, accept that there are many correct answers per turn, and build the eval suite to accommodate that range rather than penalize it.

"Agree on 'good enough' early. Accept many correct answers per turn. Don't expect perfection — that variability makes a comprehensive eval suite more important, not less."

Red Hat agentic AI engineering practice, 2026

In practice, this means writing evals that grade on the properties that matter — accuracy, completeness, format compliance, safety — rather than on exact string match. It means building a range of acceptable outputs into the golden dataset, so the judge is not penalizing legitimate variation in wording or structure. And it means being deliberate about which metrics need to be maximized versus which ones need floors. An agent used for code generation needs to be functionally correct; it does not need to produce identical code every run. An agent handling customer-facing communication needs to hit a tone floor; it does not need to reproduce the same sentence twice.
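A small sketch of grading on properties rather than strings, using the code-generation example: two differently worded implementations both count as correct because they behave the same on the test cases. The helper names are hypothetical, and the use of `exec` is for illustration only, since generated code should run in a sandbox in any real harness. `meets_floors` shows the "floor" idea for metrics that only need to clear a bar.

```python
from typing import Any


def functionally_correct(source: str, func_name: str,
                         cases: list[tuple[tuple[Any, ...], Any]]) -> bool:
    """Grade generated code by behavior on test cases, not by its text."""
    namespace: dict[str, Any] = {}
    try:
        exec(source, namespace)  # sketch only: sandbox untrusted code in practice
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False  # syntax errors, missing function, or runtime errors all fail


def meets_floors(scores: dict[str, float], floors: dict[str, float]) -> bool:
    """True when every metric that has a floor is at or above it."""
    return all(scores.get(metric, 0.0) >= floor for metric, floor in floors.items())


CASES = [((1, 2), 3), ((-1, 1), 0)]
candidate_a = "def add(a, b):\n    return a + b\n"
candidate_b = "def add(x, y):\n    total = x\n    total += y\n    return total\n"
broken = "def add(a, b):\n    return a - b\n"

for label, src in (("a", candidate_a), ("b", candidate_b), ("broken", broken)):
    print(label, functionally_correct(src, "add", CASES))
print(meets_floors({"tone": 0.8, "completeness": 0.95}, {"tone": 0.7}))
```

An exact-match grader would reject `candidate_b` for differing from `candidate_a`. The behavioral grader accepts both and still rejects `broken`.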

This is also where context engineering and eval-driven development intersect directly. Evals end up testing precisely what context engineering produces — the exact slice of information the agent sees at inference time, assembled through selection, retrieval, compaction, and memory. A well-structured eval reveals whether your context assembly is giving the agent what it needs, in a format it can use, without the noise that drives non-determinism upward. Poorly assembled context appears in the eval results as higher variance and lower consistency. The eval is, among other things, a diagnostic on your context pipeline — a signal that no amount of model tuning can replace.

Post-Launch Monitoring: The Test You Forgot to Write

A complete eval suite at launch is necessary but not sufficient. The distribution of real production inputs shifts over time in ways that are difficult to anticipate. Users discover edge cases that never appeared in your golden dataset. External data sources change in ways that alter what your retrieval returns. Model providers can update weights without changing the version string. Any of these events can degrade an agent's behavior without triggering any of your pre-launch evals, because pre-launch evals measure the distribution you sampled when you built the test suite, not the distribution that emerges from real usage over months.

FIG. 05 — VIBES-BASED SHIPPING

Deploy and hope the distribution holds

• Evals run only before launch, then sit idle
• Failures discovered by users in production
• No systematic capture of production traces
• Judge calibration done once, never revisited
• Regressions noticed weeks after they appear
• Trust erodes with no measurable path to rebuild it

FIG. 05 — EVAL-DRIVEN SHIPPING

Deploy with a living measurement loop

• Evals run pre-launch and continuously in CI
• Production traces harvested into the golden dataset
• Systematic human review on a regular cadence
• Judge re-calibrated as production inputs evolve
• Regressions caught before users notice them
• Trust accumulates on a documented, improving metric

Post-launch monitoring has two components that must both be in place. The first is production-trace harvesting: a systematic sample of real requests, tool calls, and outputs captured from the live system and routed back into the eval pipeline. These traces populate the golden dataset with production-distribution inputs that no synthetic test suite can replicate, and they do it continuously rather than as a one-time seeding exercise. The second is systematic human review: a regular cadence of humans checking the calibration of the LLM judge against real production outputs, not just the original gold set. As production inputs drift — and they always drift — the judge's calibration drifts with them unless you actively maintain it.

Neither of these practices requires significant tooling. What they require is deliberateness: a named owner for the eval suite, a scheduled review cycle, and a pipeline that routes production failures back into the dataset rather than treating them as one-off incidents to be fixed and forgotten. This is the same discipline that keeps a mature test suite useful over time: treating the suite as a living artifact that reflects the current state of the system and its actual users, not a snapshot taken at launch. The runtime complement is the maker-checker pattern, in which an independent, adversarially prompted verification agent catches failures at inference time while the eval harness catches them during development. Together they form a full-stack reliability discipline: evals gate what ships, and maker-checker catches what slips through at runtime.

The Moat Is the Discipline, Not the Dashboard

The eval tooling ecosystem in 2026 is mature. There are capable open-source frameworks, commercial platforms, and model-provider-native eval suites available at every budget tier. Any team can stand up a dashboard displaying eval metrics within a day. That is not eval-driven development. Purchasing evaluation infrastructure without the practices that make it useful is equivalent to buying a CI server and never writing a test. The tool does not create the discipline; the discipline creates the value — and the two are easy to confuse when the tooling is shiny and the discipline is unglamorous.

The teams building durable competitive advantage with AI are not the ones with the best eval dashboard. They are the ones whose engineering culture treats evals the way mature engineering cultures treat unit tests — as a non-negotiable part of the build process, owned by the team, maintained with the same care as the product code, and extended with every new capability. When a new feature lands, an eval lands with it. When a production failure surfaces, the golden dataset grows. When a model version changes, the CI gate either clears or it blocks. This loop, run consistently over months, produces a system whose reliability is documented, measurable, and improvable — the only kind of reliability that survives contact with real users.

The uncomfortable truth is that evals are not exciting. They do not make a better demo. They do not change what the model can do. They do not show up in a product screenshot or a launch post. What they do is ensure that the capability you shipped today still works next month, that the behavior you promised is the behavior that gets delivered, and that when something breaks you know it in CI rather than in a customer complaint. That is the same value proposition unit tests had a decade ago. It is why TDD became table stakes, and it is exactly why EDD is becoming table stakes in 2026. The teams that internalize the discipline early build a measurement culture that is genuinely hard to replicate. The teams that skip it are running on vibes, and vibes do not scale.

