Amazon built an internal leaderboard called Kirorank that ranked employees by AI usage — tokens consumed, agents run, tools invoked. Then Amazon shut it down, for the most predictable reason in the history of management: employees gamed it. Engineers left agents running on make-work to climb the rankings, token bills ballooned, and the company discovered at scale what Goodhart's law has said for fifty years — when a measure becomes a target, it ceases to be a good measure. Kirorank's short life is the cleanest specimen yet of the defining pathology of enterprise AI in 2026: productivity theater, where organizations measure the consumption of AI and mistake it for the production of value.
Kirorank: A Perfect Experiment in Goodhart's Law
Give Amazon credit for one thing: the leaderboard made the implicit explicit. Most companies in 2025 and 2026 pressured employees to "use AI more" through softer channels — adoption dashboards reviewed by VPs, AI usage questions in performance reviews, all-hands slides celebrating token growth. Kirorank simply turned the pressure into a ranked list. And ranked lists get optimized.
The optimization took the form any economist would predict. Engineers spun up agents on tasks that did not need them. They re-ran generations they did not read. They let long-running agent loops grind through the night because the meter only counted up. Token consumption soared, costs soared with it, and the correlation between leaderboard position and useful output, weak to begin with, went negative: a ranking sorted by consumption puts the heaviest compute burners at the top by construction, and once make-work pays, those are disproportionately the people burning the most compute per unit of shipped work. Amazon killed the leaderboard. The instinct that built it survives in adoption dashboards at nearly every enterprise.
"When a measure becomes a target, it ceases to be a good measure. Token consumption was never a measure of productivity — making it a target just made it a measure of obedience."
The deeper problem is why leaderboards like Kirorank get built at all. Executives under board pressure to show "AI transformation" need a number that goes up. Real productivity numbers — cycle time, defect rates, delivery predictability — move slowly, are confounded by everything, and sometimes move the wrong way. Usage numbers move fast, climb reliably, and make a great slide. So organizations measure what is easy and call it what is important. That is the theater: the performance of transformation, staged with metrics chosen because they cooperate.
How Theater Becomes Policy
The escalation path is worth tracing, because it explains why intelligent organizations keep building Kiroranks. It starts at the board: directors read that competitors are "AI-native" and ask the CEO for evidence of transformation. The CEO delegates to a transformation office, which — possessing no lever over actual engineering outcomes — reaches for the lever it has: adoption tracking. Adoption tracking begets adoption targets; targets beget dashboards; dashboards beget comparisons between teams; and comparisons, once visible, function as leaderboards whether or not anyone builds the ranking UI. By the time the mandate reaches an individual engineer, "we believe AI will transform our business" has been transmuted into "your token usage is below the team median."
Each step in that chain is locally rational, which is what makes the pattern so durable. The board is right to ask; the CEO is right to delegate; the program office is right that you cannot manage what you do not measure. The failure is in the substitution — measuring the input because the outcome is hard, then forgetting the substitution happened. Engineers, who can detect a gameable metric from across the building, respond exactly as Amazon's did. The dashboard goes green, the belief hardens, and the organization becomes structurally incapable of noticing that nothing downstream has improved. Theater is not a lie anyone tells; it is a measurement error that everyone has an incentive not to correct.
The Bill Arrives: Uber's Four-Month Budget
Productivity theater would be merely embarrassing if it were free. It is not free. Uber reportedly blew through its entire 2026 AI budget in the first four months of the year — and the damning part is not the overrun but the accounting that came with it. Uber's COO said the spending had not produced measurable productivity gains. Read that carefully: not "gains were smaller than hoped," but that the company could not measure any. A spend large enough to exhaust an annual budget by April, with no detectable signal in output.
Uber is not an outlier; it is just unusually candid. The structural forces are general. Token-metered pricing means costs scale with enthusiasm rather than with value (a dynamic we dissected in our piece on why token pricing is breaking enterprise AI budgets), and adoption mandates manufacture enthusiasm by decree. Put metered pricing under a usage mandate and the budget outcome is arithmetic. The only surprising thing about Uber's April is that more companies have not announced their own.
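The budget arithmetic can be sketched in a few lines. This is a minimal illustration, not a model of Uber's finances: the budget, the starting spend, and the growth rate are made-up inputs, and `months_until_exhausted` is a hypothetical helper name. It assumes metered spend compounds month over month while a usage mandate pushes consumption up.

```python
def months_until_exhausted(annual_budget: float,
                           first_month_spend: float,
                           monthly_growth: float) -> int:
    """Return the month in which cumulative metered spend reaches the budget.

    All inputs are illustrative assumptions: spend starts at
    `first_month_spend` and grows by `monthly_growth` (0.5 = +50%) each month.
    """
    if annual_budget <= 0 or first_month_spend <= 0 or monthly_growth < 0:
        raise ValueError("budget and spend must be positive, growth non-negative")
    spent = 0.0
    spend = first_month_spend
    month = 0
    while spent < annual_budget:
        month += 1
        spent += spend
        spend *= 1.0 + monthly_growth
    return month


# A budget planned as 12 equal monthly units lasts the year when usage is flat.
print(months_until_exhausted(12.0, 1.0, 0.0))  # 12
# Hypothetical mandate: spend starts at twice plan and grows 50% per month.
print(months_until_exhausted(12.0, 2.0, 0.5))  # 4
```

Nothing in the second scenario requires waste on a dramatic scale. A starting run rate above plan and steady month-over-month growth are enough to empty an annual budget in the spring.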
Then there is the churn inside the consumption itself. The CEO of Entelligence AI made a claim viral enough to become shorthand: companies spend 44% of their tokens on bug fixes — for bugs their AI generated. Treat the number as directional rather than precise; it comes from a vendor with a dashboard to sell. But the mechanism it describes is well documented, and the accounting insult is exquisite. A usage dashboard counts the bug-generating tokens as productivity, then counts the bug-fixing tokens as productivity again. The meter spins twice for negative work. By Kirorank logic, the engineer whose agent writes and then repairs its own defects is twice as productive as the one who shipped clean code.
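The double counting is easy to make concrete. The sketch below uses two hypothetical engineers with made-up token counts; `EngineerUsage` and its fields are illustrative names, not any real dashboard's schema. It compares what a consumption dashboard sees with what each shipped feature cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineerUsage:
    name: str
    generation_tokens: int  # tokens spent writing code
    rework_tokens: int      # tokens spent fixing defects in that code
    features_shipped: int

    @property
    def dashboard_tokens(self) -> int:
        # A usage dashboard sums everything; it cannot tell work from rework.
        return self.generation_tokens + self.rework_tokens

    @property
    def rework_share(self) -> float:
        total = self.dashboard_tokens
        return self.rework_tokens / total if total else 0.0

    @property
    def tokens_per_feature(self) -> float:
        if self.features_shipped == 0:
            return float("inf")
        return self.dashboard_tokens / self.features_shipped


# Hypothetical numbers: both engineers ship one feature.
clean = EngineerUsage("clean", generation_tokens=1000, rework_tokens=0, features_shipped=1)
churn = EngineerUsage("churn", generation_tokens=1000, rework_tokens=1000, features_shipped=1)

# Ranked the way a consumption leaderboard would rank them.
for e in sorted([clean, churn], key=lambda e: e.dashboard_tokens, reverse=True):
    print(e.name, e.dashboard_tokens, f"{e.rework_share:.0%}", e.tokens_per_feature)
```

The leaderboard puts the churning engineer first with double the tokens, while the per-feature column shows the same output bought at twice the cost.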
The Speed Boost That Bills You Later
Two essays defined the skeptical turn in this year's discourse, both reaching the front page of Hacker News within weeks of each other. The first, from James Shore, attacked the assumption that generated code is cheap code. His argument: AI-generated code does not reduce maintenance burden and may increase it, because the cost of software was never in the typing — it is in the decade of reading, debugging, and modifying that follows. Code produced faster than it is understood accumulates as a liability the organization has not priced. His phrase for the bargain stuck: "You're trading a temporary speed boost for permanent indenture."
The empirical record backs the polemic. As we documented in our analysis of the AI maintenance bill, codebases under heavy AI assistance show duplicate code multiplying roughly eightfold while refactoring activity collapses — exactly the signature you would expect when generation is free and comprehension is not. None of that debt appears on a usage dashboard. All of it appears, eighteen months later, as the mysterious slowdown that gets blamed on the remaining engineers.
The second essay — "I don't think AI will make your processes go faster" — supplied the systems argument. Engineering cycle time is dominated not by code production but by organizational latency: review queues, deployment windows, approval chains, decision-making delay. If coding is 20% of your lead time and AI doubles coding speed, you have bought a 10% improvement — before subtracting the new review burden that faster generation creates. Speeding up the step that was not the bottleneck is the oldest mistake in operations, and AI adoption programs are making it at industrial scale. The constraint is the organization, and tokens do not dissolve organizations.
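The 20%-of-lead-time example is an instance of Amdahl's law, and a short sketch makes the ceiling visible. The function name and the input figures are illustrative; the only assumption is that the accelerated step and the rest of the pipeline run one after the other.

```python
def lead_time_reduction(step_share: float, step_speedup: float) -> float:
    """Fraction of total lead time saved when one step gets faster.

    step_share: fraction of lead time spent in the accelerated step (0..1).
    step_speedup: how many times faster that step becomes (2.0 = doubled).
    """
    if not 0.0 <= step_share <= 1.0:
        raise ValueError("step_share must be between 0 and 1")
    if step_speedup <= 0:
        raise ValueError("step_speedup must be positive")
    new_total = (1.0 - step_share) + step_share / step_speedup
    return 1.0 - new_total


# Coding is 20% of lead time and AI doubles coding speed.
print(f"{lead_time_reduction(0.20, 2.0):.0%}")           # 10%
# Even infinitely fast coding cannot save more than coding's own share.
print(f"{lead_time_reduction(0.20, float('inf')):.0%}")  # 20%
```

The second line is the part adoption programs tend to skip: the accelerated step's share of lead time is a hard cap on the gain, whatever the tool does to that step.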
Vanity Metrics vs. Outcome Metrics
The dividing line between theater and measurement is simple to state: vanity metrics count activity that correlates with spending; outcome metrics count results that correlate with the business. Tokens consumed, lines of AI code merged, agent-hours run, percentage of employees "active on AI" — every one of these goes up when you waste money, which is precisely what disqualifies them as success measures. They are cost metrics wearing a productivity costume.
Measure consumption (the theater)
- Tokens consumed per team or engineer
- Lines or share of AI-generated code merged
- Agent-hours run, prompts sent, tools invoked
- "% of engineers using AI daily"
- Leaderboard rank (RIP Kirorank)
- All rise when money is wasted: that is the tell
Measure results (the business)
- Lead time for change, idea to production
- Change failure rate and MTTR
- Maintenance load: % capacity on rework and defects
- Escaped defects per release
- Revenue (or features shipped) per engineer
- AI spend per unit of outcome, and which way it is trending
None of the outcome metrics is exotic: lead time, change failure rate, and MTTR are standard DORA metrics, and the rest are close relatives plus an honest cost line. What makes them rare in AI programs is that they are capable of delivering bad news. A usage dashboard cannot tell you your AI program is failing; it has no axis for failure. Lead time and change failure rate can, which is exactly why they belong in the review and exactly why the theater avoids them.
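Three of the outcome metrics can be computed from nothing more than a deployment log and the AI invoice. The sketch below is a minimal illustration under stated assumptions: `Change` is a hypothetical record shape, the timestamps and the 3000.0 spend figure are made up, and "failure" is whatever your team already tracks as a rollback or hotfix.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Change:
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool  # True if the change needed a rollback or hotfix


def median_lead_time_hours(changes: list[Change]) -> float:
    if not changes:
        raise ValueError("no changes to measure")
    hours = [(c.deployed_at - c.committed_at).total_seconds() / 3600 for c in changes]
    return median(hours)


def change_failure_rate(changes: list[Change]) -> float:
    if not changes:
        raise ValueError("no changes to measure")
    return sum(c.caused_failure for c in changes) / len(changes)


def ai_spend_per_good_change(ai_spend: float, changes: list[Change]) -> float:
    good = sum(not c.caused_failure for c in changes)
    return ai_spend / good if good else float("inf")


# Made-up log: four changes committed at the same time, deployed 1 to 4 days later.
start = datetime(2026, 1, 5, 9, 0)
changes = [
    Change(start, start + timedelta(hours=24), False),
    Change(start, start + timedelta(hours=48), False),
    Change(start, start + timedelta(hours=72), True),
    Change(start, start + timedelta(hours=96), False),
]

print(median_lead_time_hours(changes))             # 60.0
print(change_failure_rate(changes))                # 0.25
print(ai_spend_per_good_change(3000.0, changes))   # 1000.0
```

Unlike a token count, every one of these numbers can move in the wrong direction, which is what makes them usable as evidence.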
The Honest Numbers We Do Have
Skeptics of the skeptics will ask: where is the rigorous measurement, then? The most careful study to date remains METR's randomized trial of experienced developers working in their own mature codebases, the setting where most enterprise engineering actually happens. The result is the most uncomfortable data point in the industry: developers using AI assistance were 19% slower on their assigned tasks while estimating themselves to be roughly 20% faster. That is a gap of roughly forty points between perceived and measured productivity, in the population most confident in its own calibration. We unpacked the study and its 2026 follow-up, which partially collapsed because participants refused to work without their tools, in our METR analysis.
The point is not that AI makes everyone slower: the study covers one setting, and other task classes show genuine gains. The point is that self-report, the substrate of nearly every vendor ROI study and internal adoption survey, runs roughly 40 points optimistic in the best-measured case we have. Every AI program whose evidence base is "engineers say they are faster" is building on exactly the signal the METR trial showed to be unreliable. That is not a reason to stop; it is a reason to instrument. Organizations get the measurement quality they insist on, and so far most have insisted on vibes.
How to Measure AI ROI Honestly
For organizations ready to leave the theater, the discipline looks like this. First, baseline before you scale: capture lead time, change failure rate, maintenance share, and cost-per-engineer for at least a quarter before the rollout, or accept that you will never be able to attribute anything. Second, treat AI spend as an investment with a thesis — each deployment names the constraint it targets, the metric that should move, and the date by which it should move. Third, segment by task class rather than averaging: AI assistance produces real gains on well-scoped greenfield work and near-zero on legacy debugging, and a blended average buries the signal both ways. Fourth, count the full cost — tokens, tooling, review time absorbing the larger PR volume, and the rework share. The 44% figure may be a vendor's number, but your own rework share is measurable in your own tracker, and it belongs on the same slide as the savings claim.
And fifth — the cultural piece — stop rewarding usage. Developer sentiment is already running ahead of the measurement: engineers describe themselves as dependent on tools that the best available evidence says sometimes slow them down — the perception gap the METR trial quantified. When the workforce already over-attributes its productivity to AI, a leaderboard pushing for more usage is pouring accelerant on a calibration problem. The organizations getting real returns run the opposite culture: AI use is unremarkable, defaults are sensible, and the only thing celebrated is shipped outcomes.
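The segmentation step is the one most easily shown with numbers. The sketch below uses made-up task records, with hypothetical task-class labels and hours, to show how a blended average can hide both a real gain and a flat result.

```python
from collections import defaultdict

# (task_class, baseline_hours, hours_with_ai): made-up illustrative records.
records = [
    ("greenfield", 10.0, 6.0),
    ("greenfield", 10.0, 6.0),
    ("legacy_debugging", 10.0, 10.0),
    ("legacy_debugging", 10.0, 10.0),
]


def pct_change(before: float, after: float) -> float:
    return (after - before) / before


def change_by_class(rows):
    # Sum baseline and with-AI hours per task class, then compare the totals.
    totals = defaultdict(lambda: [0.0, 0.0])
    for task_class, before, after in rows:
        totals[task_class][0] += before
        totals[task_class][1] += after
    return {k: pct_change(b, a) for k, (b, a) in totals.items()}


def blended_change(rows):
    before = sum(r[1] for r in rows)
    after = sum(r[2] for r in rows)
    return pct_change(before, after)


print(f"blended: {blended_change(records):+.0%}")
for task_class, change in change_by_class(records).items():
    print(f"{task_class}: {change:+.0%}")
```

On these invented figures the blended view reports a 20% reduction in hours. The segmented view shows a 40% reduction on greenfield work and no change on legacy debugging, which are two different spending decisions.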
What a Quarter of Honest Measurement Looks Like
The composite picture from teams that run this discipline is instructive precisely because it is undramatic. Week one hurts: the baseline reveals that nobody actually knows the current lead time, the deployment data lives in three systems, and the "we ship 40% faster with AI" folklore has no artifact behind it. Weeks two through six produce the first honest segmentation — AI assistance is cutting turnaround dramatically on test scaffolding and migration work, doing nothing detectable on the legacy services, and quietly inflating PR size in ways the review team feels but had not named. That segmentation alone usually redirects a third of the spend.
By the end of the quarter, the program has a different vocabulary. Instead of "adoption is at 87%," the review says: lead time on well-scoped feature work is down 22%, change failure rate is flat, maintenance share rose two points and is being watched, and the spend per shipped feature fell for the first time since rollout. Some of those numbers will be disappointing. That is the feature, not the bug: a measurement system incapable of disappointing you is a stage prop. And executives discover, usually with relief, that a true mixed result defends better in front of a board than a vanity dashboard, because it comes with a plan attached to the parts that are not working.
"The companies winning with AI do not know their token count off-hand. They know their lead time, their failure rate, and their cost per shipped outcome — and their AI spend answers to those numbers, not the other way around."
Conclusion: After the Theater
Kirorank deserves a footnote in management history, because it ran the experiment every adoption dashboard implies and published the result by dying. Rank people by AI consumption and you get AI consumption — purchased at full price, delivering bug-churn that the meter counts twice, while the organizational bottlenecks that actually govern delivery sit untouched. Amazon was merely the company honest enough to kill the metric. The pressure that created it — boards demanding transformation evidence, executives needing a number that goes up — still governs most AI programs, and it is currently writing Uber-shaped budget stories across the industry.
The correction is not less AI. Used against the right constraints and measured against outcomes, these tools produce real, bankable gains. The correction is the end of the theater: retiring consumption metrics, baselining honest ones, and letting AI spending face the same question every other line item faces — what did we get? Companies that ask it now will spend less and ship more than the ones still applauding the dashboard. The curtain call is overdue.