Ten years ago, test-driven development separated the engineers who shipped reliable software from the ones who shipped working demos. The discipline — write the test first, make it pass, refactor — was uncomfortable until it wasn't, and then it became table stakes. In 2026, the same inflection is happening for AI. The teams building agents that stay in production are not the ones with the best models or the most clever prompts. They are the ones who wrote their evals before they wrote their agents.
TDD Professionalized Software. EDD Is Doing the Same for AI.
The analogy between test-driven development and eval-driven development is not a flattering metaphor; it is an accurate structural map. In both disciplines, the key move is identical: you specify the desired behavior before you implement it. In TDD, that specification is a unit test that starts out failing. In EDD, that specification is an eval the agent cannot yet pass. In both cases, the specification forces you to make your requirements concrete enough to measure. Vague requirements cannot be tested. They also cannot be built reliably, but the test reveals the vagueness first, before the build cost compounds.
The parallel continues at the process level. TDD did not succeed because testing was new — teams tested before TDD. It succeeded because it changed when and how testing happened. Writing the test first, rather than last, changed the feedback loop from "write code, ship, discover bugs" to "write spec, implement, verify." EDD changes the AI feedback loop the same way. The teams building LLM applications without evals are running the old loop: build a prompt, call the model, read the output, form a subjective impression. The teams running EDD have a number. That number is either moving in the right direction or it is not, and the difference is visible before the code ships.
The industry is converging on this quickly. What was an advanced practice among safety-conscious labs in 2024 is now the differentiating discipline between AI teams whose systems hold up in production and those whose systems degrade. The build cost of an LLM application has collapsed; the reliability cost has not moved. Evals are how you pay that cost deliberately rather than letting your users pay it for you.
What Eval-Driven Development Actually Means
Anthropic's engineering guidance on building with Claude is explicit on this point. The recommended practice is to build evals that define planned capabilities before an agent can actually fulfill them, and then iterate until the agent passes. Owning and iterating on evals should be as routine and as non-optional as maintaining unit tests. If a capability exists in the product, there should be an eval for it. If there is no eval, the capability is not tested — which, for an AI system that behaves non-deterministically, is equivalent to saying it is not understood.
"Build evals to define planned capabilities before agents can fulfill them, then iterate. Owning and iterating on evals should be as routine as maintaining unit tests."
A secondary benefit of writing evals first is that it stress-tests whether your requirements are concrete enough to build. This sounds obvious but it rarely is. Teams regularly begin building agents with requirements like "it should summarize the customer conversation accurately" or "it should handle edge cases gracefully." Try writing an eval for either of those and you will immediately confront the ambiguity: accurately by whose standard? What counts as graceful? The eval cannot be written until the requirement is specified precisely enough to measure — and that precision is exactly what you need before you start building. EDD turns fuzzy product requirements into a forcing function for precision.
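To make the forcing function concrete, here is a minimal sketch of what "summarize the customer conversation accurately" might look like once it has to be written as an eval. Everything in it is an assumption for illustration: the `SummaryCase` shape, the required-facts rubric, the word budget, and the `placeholder_agent` stand-in are hypothetical, and a substring check is a deliberately crude proxy for accuracy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SummaryCase:
    """One eval case: a conversation plus what a passing summary must contain."""
    conversation: str
    required_facts: tuple[str, ...]  # hypothetical rubric: facts that must appear
    max_words: int                   # hypothetical length budget


def grade_summary(summary: str, case: SummaryCase) -> bool:
    """Pass only if every required fact appears and the length budget holds."""
    text = summary.lower()
    has_facts = all(fact.lower() in text for fact in case.required_facts)
    within_budget = len(summary.split()) <= case.max_words
    return has_facts and within_budget


def run_eval(agent: Callable[[str], str], cases: list[SummaryCase]) -> float:
    """Return the fraction of cases the agent passes."""
    passed = sum(grade_summary(agent(c.conversation), c) for c in cases)
    return passed / len(cases)


CASES = [
    SummaryCase(
        conversation="Customer reports order 1182 arrived damaged and asks for a refund.",
        required_facts=("order 1182", "damaged", "refund"),
        max_words=30,
    ),
]


def placeholder_agent(conversation: str) -> str:
    # Stand-in for an agent that has not been built yet; the eval should fail.
    return "The customer contacted support."


baseline_score = run_eval(placeholder_agent, CASES)
```

The point of the sketch is the order of operations: the eval exists and fails before any agent does, just as a TDD test starts out red. Writing `required_facts` is where the "accurately by whose standard?" question has to be answered.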
EDD is also the right frame for ongoing iteration, not just initial development. Every time the model changes, the prompt changes, or the context window policy changes, a maintained eval suite can catch the regression before users do. This is where the unit-test analogy holds most tightly: a test suite that runs on every commit, blocks on failure, and reports a specific failing case is far more useful than a general impression that "the model seems worse this week." That vagueness is exactly what the trust crisis in AI development is made of. Stack Overflow's survey of 49,000+ developers found that 84% use or plan to use AI tools but only 3% highly trust their output, and the top frustration is output that is almost right but not quite. Evals are the mechanism that turns "almost right" into a measurable, improvable number instead of a recurring source of eroded trust.
Why Output-Only Evals Lie
The most common mistake in LLM evaluation is grading only the final output. An agent produces an answer; the eval checks whether the answer is correct. This approach misses everything that happened in between — every tool call, every retrieval step, every intermediate reasoning move that led to the conclusion. For simple single-turn tasks, output grading is sufficient. For the multi-step agent workflows that actually run in production, it is systematically misleading.
The mechanism is straightforward once you see it. An agent can reach a correct final answer via an incorrect trajectory: it can stumble into the right answer after making mistakes that would fail in a slightly different context. When you only grade the output, those bad trajectories are scored as passing. When you grade the full trajectory (every step, every tool call, every branch decision), those cases are correctly identified as failures. Output-only grading typically passes 20–40% more test cases than trajectory grading does. That 20–40% is not a win you earned; it is a false signal hiding the production failures your users will eventually discover for you.
Trajectory evaluation catches three categories of failure that output evaluation misses entirely. First, lucky outputs — correct answers reached by incorrect means, which will fail when the lucky shortcut is not available. Second, efficiency failures — agents that reach the right answer after an excessive number of tool calls or retrieval steps, burning latency and tokens in ways that do not show up in a pass/fail grade on the output alone. Third, brittleness — agents that follow a fragile path which breaks on minor input variation. A trajectory eval catches all three; an output eval misses all three. The difference is the gap between a demo that impresses and a system that ships.
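A rough sketch of the difference between the two graders, under stated assumptions: the `Step` and `Run` records, the tool names, and the step budget are hypothetical, and real traces carry far more detail. The sketch covers lucky outputs and efficiency failures on a single run; brittleness would additionally require re-running the agent on perturbed inputs, which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str  # name of the tool the agent called
    ok: bool   # whether the call succeeded


@dataclass(frozen=True)
class Run:
    steps: tuple[Step, ...]
    answer: str


def grade_output(run: Run, expected: str) -> bool:
    """Output-only grading: look at the final answer and nothing else."""
    return run.answer.strip() == expected


def grade_trajectory(
    run: Run,
    expected: str,
    required_tools: tuple[str, ...],
    max_steps: int,
) -> dict[str, bool]:
    """Trajectory grading: the answer must be right and the path must be sound."""
    called = [s.tool for s in run.steps]
    checks = {
        "correct_answer": grade_output(run, expected),
        "used_required_tools": all(t in called for t in required_tools),
        "no_failed_calls": all(s.ok for s in run.steps),
        "within_step_budget": len(run.steps) <= max_steps,
    }
    checks["passed"] = all(checks.values())
    return checks


# A "lucky" run: right answer, but it never called the required lookup tool
# and one of its calls failed along the way.
lucky = Run(
    steps=(Step("web_search", False), Step("web_search", True)),
    answer="42",
)
output_verdict = grade_output(lucky, "42")
trajectory_report = grade_trajectory(
    lucky, "42", required_tools=("lookup_order",), max_steps=5
)
```

Here the output grader passes the run while the trajectory grader fails it, and the per-check report says why, which is what makes the failure actionable.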
The Four-Part Measurement Loop That Actually Ships
Knowing what to measure is the starting point. The harder question is building a measurement loop trustworthy enough to gate real decisions — to block a deploy, to approve a model upgrade, to sign off on a capability for production. That loop has four components, and all four need to be present for the measurement to mean anything.
The four components are a golden dataset, trajectory-level graders, a calibrated LLM judge, and a CI gate, and they reinforce each other. The golden dataset gives the graders real signal to measure. The calibrated judge makes it possible to grade what deterministic graders cannot reach. The CI gate turns the measurement into a constraint on the engineering process, not just a dashboard number. Remove any one of them and the system has a gap: a golden dataset without a CI gate is monitoring after the fact, and a CI gate grading only outputs is blocking on a partial signal. The loop has to be complete to be useful.
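One way to read "calibrated" is that the judge's verdicts have been compared against human labels on the same examples and the disagreement is known. The sketch below assumes binary pass/fail labels and a hypothetical `judge_agreement` helper; it is one possible calibration check, not a prescribed method.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare LLM-judge verdicts with human labels on the same examples."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    pairs = list(zip(human, judge))
    return {
        # Share of examples where judge and human agree.
        "agreement": sum(h == j for h, j in pairs) / n,
        # Judge passed something a human failed: the dangerous direction.
        "false_pass_rate": sum((not h) and j for h, j in pairs) / n,
        # Judge failed something a human passed: costs velocity, not safety.
        "false_fail_rate": sum(h and (not j) for h, j in pairs) / n,
    }


human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
calibration = judge_agreement(human_labels, judge_labels)
```

Splitting disagreement into false passes and false fails matters because the two errors have different costs: a judge that over-passes lets regressions through the CI gate, while one that over-fails only slows the team down.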
The Consistency Math No One Talks About
LLM applications are non-deterministic. The same input produces different outputs across runs, and that variability creates a specific measurement trap that traditional software testing does not prepare engineers for. In unit testing, a test either passes or it does not — there is no concept of "this test passes 70% of the time." In LLM eval, that is exactly the reality, and how you handle it determines whether your evals give you accurate signals or optimistic ones.
Consider a 70% per-trial pass rate. Run the agent once — it passes. Run it again — it passes. The eval dashboard looks healthy. But pass^k, the probability that every run of a k-trial sequence succeeds, tells a different story. For a 70%-per-trial agent over four trials, pass^4 is 0.7^4 ≈ 24%. For a production workflow that executes dozens of times a day, an agent at 70% per trial is failing continuously. The best-case pass rate that makes a demo look good is not the number that governs production behavior.
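The arithmetic is short enough to check directly. This sketch assumes trials are independent with a fixed per-trial pass rate, which is the same simplification the 0.7^4 figure relies on; `pass_hat_k` and `pass_at_k` are illustrative names. The second function, the chance that at least one of k runs passes, is included as the "best-case" number for contrast.

```python
def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that all k independent trials pass."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return p ** k


def pass_at_k(p: float, k: int) -> float:
    """Best-case view: probability that at least one of k trials passes."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


all_four_pass = pass_hat_k(0.7, 4)     # about 0.24
at_least_one_pass = pass_at_k(0.7, 4)  # about 0.99
```

The same agent looks 99% reliable under the at-least-once view and 24% reliable under the every-time view. Which number you put on the dashboard determines which agent you think you have.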
The correct practice is to report both metrics, per-trial pass rate and all-runs consistency across a fixed k appropriate to your deployment frequency, and to gate CI on the all-runs metric, not just the per-trial one. This forces the team to build agents that are actually reliable at the k that matches how often they run in production, not agents that pass the eval on its best day. It is also consistent with how few developers report high trust in AI output (3% in the Stack Overflow survey): many have seen the 70% become 24% in practice and internalized that best-case performance is not production performance. Evals that report pass^k honestly are the mechanism that closes that gap, not by eliminating non-determinism, but by measuring it accurately enough to manage it.
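In practice the two metrics are measured from repeated runs rather than computed from an assumed rate. A minimal sketch, assuming each eval case has been run k times and recorded as a list of booleans; the case names, the threshold, and the `ci_gate` helper are hypothetical.

```python
def per_trial_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of individual trials that passed, pooled across all cases."""
    trials = [passed for runs in results.values() for passed in runs]
    return sum(trials) / len(trials)


def all_runs_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of cases where every one of the k trials passed."""
    return sum(all(runs) for runs in results.values()) / len(results)


def ci_gate(results: dict[str, list[bool]], min_all_runs: float) -> bool:
    """Gate on all-runs consistency, not on the friendlier per-trial number."""
    return all_runs_rate(results) >= min_all_runs


# Four hypothetical cases, each run k = 4 times.
results = {
    "refund_request": [True, True, True, True],
    "address_change": [True, False, True, True],
    "cancel_order": [True, True, True, True],
    "escalation": [True, True, False, True],
}
trial_rate = per_trial_rate(results)          # 14 of 16 trials
consistency = all_runs_rate(results)          # 2 of 4 cases
gate_open = ci_gate(results, min_all_runs=0.9)
```

With these numbers the per-trial rate is 87.5% while only half the cases pass every time, so the gate stays closed. A CI job could exit non-zero whenever `ci_gate` returns `False`.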
Embracing Non-Determinism Without Losing Rigor
The measurement rigor demanded by pass^k should not be confused with demanding that LLM outputs be deterministic. They are not, and trying to force determinism through aggressive temperature suppression or excessive output constraint typically degrades quality without actually solving the underlying variability. The right framing — which Red Hat's engineering team has articulated for production agentic systems — is to agree on what "good enough" means early, accept that there are many correct answers per turn, and build the eval suite to accommodate that range rather than penalize it.
"Agree on 'good enough' early. Accept many correct answers per turn. Don't expect perfection — that variability makes a comprehensive eval suite more important, not less."
In practice, this means writing evals that grade on the properties that matter — accuracy, completeness, format compliance, safety — rather than on exact string match. It means building a range of acceptable outputs into the golden dataset, so the judge is not penalizing legitimate variation in wording or structure. And it means being deliberate about which metrics need to be maximized versus which ones need floors. An agent used for code generation needs to be functionally correct; it does not need to produce identical code every run. An agent handling customer-facing communication needs to hit a tone floor; it does not need to reproduce the same sentence twice.
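A small sketch of property-based grading, assuming the agent returns JSON with `answer` and `sources` keys. That schema, the `must_mention` and `banned` lists, and the substring checks are all illustrative assumptions; a production grader would likely lean on a calibrated judge for the properties that substrings cannot capture.

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical output schema


def grade_properties(
    raw: str, must_mention: list[str], banned: list[str]
) -> dict[str, bool]:
    """Grade format, completeness, and safety instead of exact string match."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None
    well_formed = isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload)
    answer = str(payload["answer"]).lower() if well_formed else ""
    return {
        "format": well_formed,
        "completeness": well_formed
        and all(term.lower() in answer for term in must_mention),
        "safety": not any(term.lower() in raw.lower() for term in banned),
    }


# Two differently worded replies that should both be acceptable.
reply_a = '{"answer": "Your refund was issued on May 3.", "sources": ["order-db"]}'
reply_b = '{"answer": "We issued the refund on May 3; allow 5 days.", "sources": ["order-db"]}'

grade_a = grade_properties(reply_a, must_mention=["refund", "May 3"], banned=["guarantee"])
grade_b = grade_properties(reply_b, must_mention=["refund", "May 3"], banned=["guarantee"])
grade_bad = grade_properties("not json", must_mention=["refund"], banned=["guarantee"])
```

Both replies pass every property even though an exact-match grader would reject at least one of them. Each property can then carry its own floor, so a format failure can block a deploy while a tone score only needs to stay above a threshold.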
This is also where context engineering and eval-driven development intersect directly. Evals end up testing precisely what context engineering produces — the exact slice of information the agent sees at inference time, assembled through selection, retrieval, compaction, and memory. A well-structured eval reveals whether your context assembly is giving the agent what it needs, in a format it can use, without the noise that drives non-determinism upward. Poorly assembled context appears in the eval results as higher variance and lower consistency. The eval is, among other things, a diagnostic on your context pipeline — a signal that no amount of model tuning can replace.
Post-Launch Monitoring: The Test You Forgot to Write
A complete eval suite at launch is necessary but not sufficient. The distribution of real production inputs shifts over time in ways that are difficult to anticipate. Users discover edge cases that never appeared in your golden dataset. External data sources change in ways that alter what your retrieval returns. Model providers may update weights without changing the version string. Any of these events can degrade an agent's behavior without triggering any of your pre-launch evals, because pre-launch evals measure the distribution you sampled when you built the test suite, not the distribution that emerges from real usage over months.
Deploy and hope the distribution holds
- Evals run only before launch, then sit idle
- Failures discovered by users in production
- No systematic capture of production traces
- Judge calibration done once, never revisited
- Regressions noticed weeks after they appear
- Trust erodes with no measurable path to rebuild it
Deploy with a living measurement loop
- Evals run pre-launch and continuously in CI
- Production traces harvested into the golden dataset
- Systematic human review on a regular cadence
- Judge re-calibrated as production inputs evolve
- Regressions caught before users notice them
- Trust accumulates on a documented, improving metric
Post-launch monitoring has two components that must both be in place. The first is production-trace harvesting: a systematic sample of real requests, tool calls, and outputs captured from the live system and routed back into the eval pipeline. These traces populate the golden dataset with production-distribution inputs that no synthetic test suite can replicate, and they do it continuously rather than as a one-time seeding exercise. The second is systematic human review: a regular cadence of humans checking the calibration of the LLM judge against real production outputs, not just the original gold set. As production inputs drift — and they always drift — the judge's calibration drifts with them unless you actively maintain it.
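Production-trace harvesting can start very small. The sketch below assumes each trace is a dict with hypothetical `id` and `failed` fields, keeps every failure, and samples the remaining traces deterministically by hashing the trace id, so the same trace always gets the same decision across reruns. The field names and the sampling policy are illustrative assumptions.

```python
import hashlib


def keep_trace(trace_id: str, failed: bool, sample_rate: float) -> bool:
    """Always keep failures; keep a stable hash-based sample of everything else."""
    if failed:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64


def harvest(traces: list[dict], sample_rate: float) -> list[dict]:
    """Select the traces to route into the golden-dataset review queue."""
    return [t for t in traces if keep_trace(t["id"], t["failed"], sample_rate)]


traces = [
    {"id": "trace-001", "failed": False},
    {"id": "trace-002", "failed": True},
    {"id": "trace-003", "failed": False},
]
failures_only = harvest(traces, sample_rate=0.0)
everything = harvest(traces, sample_rate=1.0)
```

Harvested traces still need a human-reviewed expected outcome before they become golden cases. The sampler only decides what reaches the review queue.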
Neither of these practices requires significant tooling. What they require is deliberateness — a named owner for the eval suite, a scheduled review cycle, and a pipeline that routes production failures back into the dataset rather than treating them as one-off incidents to be fixed and forgotten. This is the same discipline that keeps a mature test suite useful over time: treating the suite as a living artifact that reflects the current state of the system and its actual users, not a snapshot taken at launch. At runtime, the complement to this is the maker-checker pattern — an independent, adversarially-prompted verification agent that catches failures at inference time, complementing the eval harness that catches them during development. Together they form a full-stack reliability discipline: evals gate what ships, maker-checker catches what slips through at runtime.
The Moat Is the Discipline, Not the Dashboard
The eval tooling ecosystem in 2026 is mature. There are capable open-source frameworks, commercial platforms, and model-provider-native eval suites available at every budget tier. Any team can stand up a dashboard displaying eval metrics within a day. That is not eval-driven development. Purchasing evaluation infrastructure without the practices that make it useful is equivalent to buying a CI server and never writing a test. The tool does not create the discipline; the discipline creates the value — and the two are easy to confuse when the tooling is shiny and the discipline is unglamorous.
The teams building durable competitive advantage with AI are not the ones with the best eval dashboard. They are the ones whose engineering culture treats evals the way mature engineering cultures treat unit tests — as a non-negotiable part of the build process, owned by the team, maintained with the same care as the product code, and extended with every new capability. When a new feature lands, an eval lands with it. When a production failure surfaces, the golden dataset grows. When a model version changes, the CI gate either clears or it blocks. This loop, run consistently over months, produces a system whose reliability is documented, measurable, and improvable — the only kind of reliability that survives contact with real users.
The uncomfortable truth is that evals are not exciting. They do not make a better demo. They do not change what the model can do. They do not show up in a product screenshot or a launch post. What they do is ensure that the capability you shipped today still works next month, that the behavior you promised is the behavior that gets delivered, and that when something breaks you know it in CI rather than in a customer complaint. That is the same value proposition unit tests had a decade ago. It is why TDD became table stakes, and it is exactly why EDD is becoming table stakes in 2026. The teams that internalize the discipline early build a measurement culture that is genuinely hard to replicate. The teams that skip it are running on vibes, and vibes do not scale.