AI & Automation·14 min read·June 1, 2026

Developers Now Refuse to Code Without AI — Even When It Slows Them Down

XYZBytes Team

XYZBytes

In early 2026, researchers at METR sat down to repeat one of the most important studies in modern software engineering: a randomized controlled trial measuring how much AI coding assistants actually speed developers up. The experiment never really got off the ground. A significant share of experienced developers refused to participate if it meant working without AI, even when offered $50 per hour for their time. The control group (the half of subjects meant to code without assistance) had become structurally impossible to recruit. That single fact may tell us more about the state of the industry than any productivity number ever could.

The Study That Started It All: METR's 2025 RCT

The original METR experiment, conducted between February and June 2025 and published as arXiv 2507.09089, was designed with unusual rigor for a field prone to vibes-based claims. Rather than surveying developers about how they felt AI tools affected their work (the methodology behind most "93% adoption, ~10% productivity gains" headlines circulating across tech media), METR ran a genuine randomized controlled trial. Real developers. Real tasks. Randomly assigned to AI-enabled and AI-disabled conditions.

The result was uncomfortable. Experienced developers given access to AI coding assistants completed tasks 19% more slowly than their counterparts working without AI. Not a little slower: measurably, statistically slower. The same developers, when surveyed after the fact, estimated that AI had made them roughly 20% faster. The gap between perceived productivity and measured productivity was not a rounding error. It was the entire story.

FIG. 02 — PERCEIVED

+20% faster

How much faster experienced developers believed AI made them, per self-report after the trial.
METR RCT, arXiv 2507.09089 (Feb–Jun 2025)

FIG. 02 — MEASURED

19% slower

Actual task completion speed for experienced devs using AI vs. the control group.
METR RCT, arXiv 2507.09089 (Feb–Jun 2025)

Before dismissing these numbers: METR is not a contrarian outfit with an agenda. It is a safety and capability evaluation organization whose work informs frontier AI deployment decisions. Its methodology (randomized task assignment, objective time-to-completion metrics, and a sample of genuinely experienced software engineers) is the kind of research the industry has been asking for. And what it found flew in the face of nearly everything vendors, bloggers, and developer advocates had been saying for two years.

Why the Headline Numbers Are Almost Certainly Wrong

The "93% adoption, ~10% productivity gains" framing, widely reshared via TheNextWeb and numerous industry reports throughout 2025, comes from self-reported survey data. Developers are asked whether they use AI tools (yes, most do), and whether those tools feel helpful (yes, most report they do). This is not the same as measuring whether those tools actually accelerate task completion in controlled conditions.

FIG. 03 — THE SURVEY METHODOLOGY PROBLEM

Confidence and satisfaction are not output

Self-reported productivity surveys have a structural flaw: they measure confidence and satisfaction, not output. When developers feel that AI is completing boilerplate for them, finishing their sentences, and making code appear faster than they could type it, they experience a strong perception of acceleration. METR's study suggests that perception may systematically mislead. The cognitive overhead of reviewing, correcting, and integrating AI output can cost more time than the generation itself saves, particularly for complex or ambiguous tasks.

This is not unique to software. Research on GPS navigation shows drivers feel more confident while becoming worse at spatial reasoning. The subjective experience of assistance is not the same as measurable improvement.

The METR study's most important finding may not be the -19% figure itself, which applies to experienced developers on specific task types, but the size and direction of the self-assessment error. Developers were not slightly miscalibrated: they were off by roughly forty percentage points, from a perceived 20% speedup to a measured 19% slowdown. The tools feel so useful that engineers have lost the ability to accurately assess whether they are actually useful.
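The arithmetic behind that gap is easy to check. The following is a minimal sketch using only the two figures quoted above; the function name and the sign convention (positive means faster) are illustrative assumptions, not METR's notation.

```python
# Signed percent change in speed: positive = faster, negative = slower.
# Both figures are the ones quoted in this article for the 2025 METR RCT.
PERCEIVED_CHANGE_PCT = 20.0   # developers' self-reported speedup
MEASURED_CHANGE_PCT = -19.0   # measured result (19% slower)


def perception_gap_points(perceived_pct: float, measured_pct: float) -> float:
    """Return the gap between belief and measurement, in percentage points."""
    return perceived_pct - measured_pct


gap = perception_gap_points(PERCEIVED_CHANGE_PCT, MEASURED_CHANGE_PCT)
print(f"Self-assessment error: {gap:.0f} percentage points")
```

The result is 39 points, which rounds to the forty cited above. The error is not only large: it also has the wrong sign.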

The 2026 Follow-Up: When the Control Group Disappears

METR recognized the limitations of a single study and set out to replicate and extend their findings. The follow-up cohort was expanded to 57 developers across 143 repositories, covering more than 800 distinct tasks: a substantially larger and more representative sample than the original. The redesign was announced on February 24, 2026.

There was one problem. Recruiting had become structurally compromised. A significant share of developers declined to participate if the assignment meant working without AI tools, even at $50 per hour. The refusal rate was high enough that METR explicitly noted it had biased the speedup estimate. The developers willing to be randomized into the no-AI condition were no longer a representative sample of experienced engineers. They were, by self-selection, the ones least dependent on AI.
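To see why selective refusal distorts the estimate, consider a toy simulation. Everything in it is hypothetical: the population, the distribution of speedups, the refusal threshold, and above all the assumption that the developers who benefit most from AI are the ones who refuse. METR reported only that refusals biased the estimate; this sketch shows one way such a bias could arise, not how it actually did.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: each developer's true speedup from AI, as a
# fraction (0.10 = 10% faster). The distribution is invented for illustration.
true_speedups = [rng.gauss(0.0, 0.20) for _ in range(10_000)]

# Assumption: developers who benefit most from AI are the most dependent on
# it, and those above this threshold refuse the no-AI condition.
REFUSAL_THRESHOLD = 0.20
participants = [s for s in true_speedups if s <= REFUSAL_THRESHOLD]

population_mean = mean(true_speedups)
participant_mean = mean(participants)
refusal_rate = 1 - len(participants) / len(true_speedups)

print(f"Refusal rate:            {refusal_rate:.1%}")
print(f"True mean speedup:       {population_mean:+.1%}")
print(f"Mean among participants: {participant_mean:+.1%}")
```

Under these assumptions the participants' average understates the population's, because the upper tail never enters the sample. A different assumption about who refuses would push the bias the other way, which is the reason the resulting estimate is hard to interpret.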

FIG. 04: 57 developers in cohort · 800+ tasks across 143 repos · -4% raw result (CI: -15% to +9%)

Source: METR follow-up study, metr.org, redesigned Feb 24, 2026

The raw result (-4% with a confidence interval spanning -15% to +9%) is technically inconclusive. The point estimate is negative but the range crosses zero. You could read this as "AI might be neutral or slightly beneficial now," and some headlines will. But the more honest reading is that the study cannot be trusted to answer the question it was designed to answer, because the experimental design itself collapsed under the weight of real-world behavior.
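The "range crosses zero" test can be written down explicitly. This is a minimal sketch of the standard reading of a confidence interval; the function name and labels are illustrative, and the only inputs are the interval bounds quoted above.

```python
def interpret_interval(low: float, high: float) -> str:
    """Classify an effect estimate by whether its interval excludes zero."""
    if low > 0:
        return "positive effect (interval excludes zero)"
    if high < 0:
        return "negative effect (interval excludes zero)"
    return "inconclusive (interval includes zero)"


# Bounds quoted above for the follow-up study's raw result, in percent.
print(interpret_interval(low=-15.0, high=9.0))
```

An interval that contains zero is compatible with a small effect in either direction, so the point estimate alone cannot settle the question.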

What Recruitment Failure Actually Means

Think carefully about what it means that developers refused a randomized assignment at $50/hr rather than work without AI for a few hours. This is not laziness. Software engineers are not, as a population, unwilling to do hard things for money. The refusal suggests something more significant: that a meaningful fraction of the profession has internalized AI tools so completely that working without them feels not merely inconvenient but professionally untenable. Like asking a surgeon to operate without magnification, or a data analyst to query a database with pen and paper.

"The experiment broke not because the researchers made a mistake, but because the population they were studying had changed faster than the experimental apparatus could track. A control condition that was trivially recruitable in 2023 had become structurally impossible by early 2026."

Analysis of METR's recruitment methodology, Feb 2026

The implication for measurement is bleak in the short term. If you cannot randomize developers into an AI-absent condition, you cannot cleanly measure AI's productivity effect. Every observational study comparing AI users to non-users is now hopelessly confounded: the non-users are a self-selected group whose skill profiles, working styles, and task preferences differ systematically from the majority who use AI daily. The effect we most want to measure (does AI net-add or net-subtract from developer output) has become epistemically murky precisely because adoption has become so complete.

Skill Atrophy as a Feature, Not a Bug

There is a version of this story where the dependence is rational. If AI tools are genuinely useful, it makes sense to build workflows around them. The surgeon who refuses magnification is not demonstrating superior skill: they are demonstrating stubbornness. Perhaps developers who refuse to work without AI are simply the early adopters of a paradigm that will eventually prove its value, and the METR results reflect a painful transition period rather than a permanent cost.

That is a defensible position. But it requires a few things to be true simultaneously, and the evidence for each is thinner than the adoption numbers suggest.

FIG. 05 — OPTIMISTIC READING

Transition costs are temporary

Task mix is shifting: Developers who lean on AI for boilerplate are freed to tackle harder problems, even if raw speed metrics don't capture this.
Augmentation vs. replacement: AI handles the forgettable syntax so human attention concentrates on architecture and judgment.
Floor effects in measurement: RCTs measure time-to-completion on defined tasks; they may miss quality, maintainability, or scope improvements.

FIG. 05 — CAUTIONARY READING

Confidence-competence gap compounds

Skill atrophy compounds: The less you practice reading documentation, searching manually, and working through problems, the less capable you become.
Brittleness at failure points: When AI tools hallucinate or produce subtly wrong code, engineers who have lost baseline skills cannot catch or fix the errors.
Selection into dependency: Developers who most actively resist working without AI may be the ones who have lost the most capability, creating a hidden competence gap.

The skill atrophy concern is not theoretical. There is already documented tension between developers who can write and debug code fluently without AI assistance and those who have essentially outsourced that fluency (a dynamic that shows up in code reviews, in incident response, and in the quality of technical judgment on ambiguous problems). See our earlier analysis of how AI-generated code speeds up delivery while quietly accumulating a tech debt bill that eventually comes due.

The Hidden Cost: What Gets Worse When AI Goes Away

Skill atrophy is worth taking seriously not because AI tools are going to disappear, but because of what happens at the edges and failure modes. Consider three scenarios where AI-dependent developers face meaningful costs:

"Optional Tool" to "I Can't Work Without It" in Under Two Years

What is perhaps most striking about the METR timeline is the speed. GitHub Copilot launched into general availability in mid-2022. By early 2024, major developer surveys were reporting widespread adoption. By early 2026 (barely three and a half years after that launch, and about two years after those surveys) AI assistance had become so embedded in professional practice that researchers could not construct a clean control group of experienced engineers willing to work without it.

For context: it took decades for IDEs to become similarly non-negotiable. The transition from text editors to feature-rich IDEs was gradual, contested, and happened over the course of an entire generation of engineers. The transition to AI-dependent development happened in roughly the time between one US presidential election and the next.

The implications for hiring, onboarding, and team structure are significant and largely unresolved. Should a coding interview allow AI tools? If not, you are screening for a skill set that may not reflect how candidates actually work. If yes, you are testing AI tool proficiency as much as engineering judgment. The question of what a developer "knows" versus what they know how to make AI produce is no longer philosophical: it is a daily practical challenge for engineering managers evaluating their teams.

The Sycophancy Problem Compounds the Issue

There is a related dynamic worth flagging that goes beyond raw productivity measurement. As we have covered in our analysis of how sycophantic AI models are reshaping leadership decision-making, AI tools optimized to feel helpful have a systematic bias toward confirming the user's direction rather than challenging it. For developers, this means an AI assistant is more likely to complete your mistaken approach fluently than to flag that the entire approach is wrong.

The confidence-competence gap in the METR results is partly explained by this dynamic. Developers working with AI experience more flow, more output, more apparent forward motion. The AI's willingness to generate plausible-looking code provides continuous positive reinforcement. The experience of using AI feels like expertise. It is, in meaningful ways, the opposite of expertise: it is the outsourcing of judgment to a system that has been fine-tuned to agree with you.

The Stack Overflow Comparison: A Cautionary Parallel

This trajectory has a partial precedent. When Stack Overflow became the dominant resource for developer problem-solving in the early 2010s, critics worried that developers would stop understanding the code they were copying. That debate (documented in our coverage of the ongoing Stack Overflow vs. AI coding assistant debate) largely resolved itself. Developers who used Stack Overflow well became better engineers; those who cargo-culted answers without understanding them revealed that gap in other contexts.

The AI case is structurally different in two ways. First, Stack Overflow required you to understand your problem well enough to search for it, evaluate whether the answer applied, and integrate it into your codebase manually. AI tools remove all three steps: you can describe the problem vaguely, accept the first suggestion, and never interrogate whether it is correct. Second, Stack Overflow did not adapt to your preferences, and your preferences did not shape its answers. AI tools do adapt. The feedback loop is personalized and reinforcing in ways that a static Q&A platform never was.

"Stack Overflow required fluency to use well. AI coding tools are specifically designed to remove the fluency requirement. That is both their greatest strength and, in the long run, potentially their most significant cost."

Engineering perspective on AI-assisted development patterns

What Responsible AI Adoption Actually Looks Like

The goal here is not to argue that developers should abandon AI tools. The productivity upside in appropriate contexts is real, and the tools are only going to improve. The goal is to argue that the framing of "adoption percentage" as a success metric is dangerously incomplete, and that treating AI tools as leverage rather than replacement requires active, deliberate effort.

FIG. 10 — AI AS LEVERAGE (SUSTAINABLE)

Understand before delegating

Baseline first: Understand the problem domain before delegating generation to AI.
Active review: Treat AI output as a first draft written by a capable but unreliable junior: always read and understand it.
Regular practice without: Deliberately work on tasks without AI assistance to maintain baseline fluency.
Verify, don't trust: Run the code, read the tests, confirm the behavior. Don't assume generated code is correct because it compiles.
Understand the debt: Recognize when AI is saving time vs. deferring complexity you will encounter later.

FIG. 10 — AI AS CRUTCH (UNSUSTAINABLE)

Outsourcing judgment

Generate without reading: Accepting AI output without understanding what it does or why.
Prompt iteration instead of understanding: Cycling through prompts to get working code rather than diagnosing why it doesn't work.
Refusing constrained conditions: Unable or unwilling to work in environments where AI is unavailable.
Outsourcing judgment: Asking AI whether an architecture decision is correct rather than reasoning through the tradeoffs.
Ignoring the confidence gap: Trusting the feeling of productivity without validating it against actual output quality.

The distinction matters at the individual level, but it matters even more at the team and organizational level. Engineering cultures that celebrate AI-assisted output without maintaining standards for understanding and verification are quietly accumulating a competence debt that will surface in the worst moments: security incidents, production failures, and architectural decisions that cannot be unwound cheaply.

What the METR Data Suggests for Teams

If experienced developers are 19% slower with AI on defined tasks while believing they are faster, several team-level interventions become worth considering:

Measure actual outputs, not perceived velocity. Sprint velocity, story points, and self-reported estimates are all subject to the same perception bias METR documented. Cycle time, defect rates, and time-to-resolution are harder to game.
Audit AI tool usage in postmortems. When incidents occur, ask explicitly whether AI-generated code was involved and whether it was adequately reviewed. This is not about blame: it is about understanding where the quality control process broke down.
Maintain skills through deliberate practice. Some teams are scheduling regular "no-AI" hours or sessions for specific problem types, not as a rejection of the tools but as a deliberate investment in maintaining baseline capability.
Calibrate confidence explicitly. METR's most striking finding is the confidence-competence inversion. Making it standard practice to estimate how long a task will take before starting it, and then to compare that estimate with the measured time, is one of the best ways to identify where AI is degrading calibration.
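The calibration habit in the last item can be as lightweight as a shared log of estimates and actuals. The sketch below is one possible shape for it, not a prescribed tool: `TaskRecord`, `mean_overrun_pct`, and the sample entries are all hypothetical, and the numbers are invented purely to show the comparison.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    name: str
    estimated_minutes: float  # written down before starting the task
    actual_minutes: float     # measured when the task is done
    used_ai: bool


def mean_overrun_pct(records: list[TaskRecord]) -> float:
    """Average of (actual - estimate) / estimate, as a percentage.

    Positive values mean tasks took longer than predicted.
    """
    return mean(
        (r.actual_minutes - r.estimated_minutes) / r.estimated_minutes * 100
        for r in records
    )


# Invented sample data, purely to show the shape of the comparison.
log = [
    TaskRecord("fix pagination bug", 30, 45, used_ai=True),
    TaskRecord("add retry logic", 60, 90, used_ai=True),
    TaskRecord("rename config keys", 20, 22, used_ai=False),
    TaskRecord("write migration", 40, 44, used_ai=False),
]

with_ai = mean_overrun_pct([r for r in log if r.used_ai])
without_ai = mean_overrun_pct([r for r in log if not r.used_ai])
print(f"Mean overrun with AI:    {with_ai:+.0f}%")
print(f"Mean overrun without AI: {without_ai:+.0f}%")
```

A persistent difference between the two averages on a team's own data would be a signal worth investigating. A handful of tasks proves nothing in either direction, so the log is only useful once it has accumulated over many sprints.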

The Measurement Problem Is Not Going Away

METR is not done studying this problem. But their February 2026 announcement is an honest acknowledgment that the research methodology they need (a randomized controlled trial with a credible no-AI control group) is becoming practically impossible to execute as adoption becomes near-universal. Future studies may need to rely on natural experiments: companies that ban AI tools, developers in regulated industries who cannot use them, or historical comparisons with appropriate controls.

Each of those alternatives has limitations. The companies that ban AI tools are increasingly outliers whose workflows may not generalize. Historical comparisons have to account for the rapid evolution of both the tools and the tasks developers work on. The honest answer is that we are entering a period where the most important question in software engineering productivity (does AI net-add or net-subtract value, and under what conditions) will be significantly harder to answer rigorously.

That uncertainty is not an argument against using AI tools. It is an argument for intellectual honesty about what we know and don't know, and against the lazy conflation of "developers use AI and feel good about it" with "AI makes developers better." Those are different claims, and METR's work has demonstrated they can point in opposite directions.

Conclusion: Dependence Is a Design Choice

The METR story has three acts. In the first, researchers ran a careful experiment and found that AI made experienced developers slower while making them feel faster (a result that should have prompted serious industry-wide reflection but was largely absorbed without changing behavior). In the second, researchers tried to repeat and extend the experiment and found the control condition had collapsed: the profession had moved so fast toward AI dependency that clean measurement was no longer possible. In the third act (the one we are currently living) the industry mostly describes this as evidence of adoption success.

It is also evidence of something else. "Optional tool" became "I can't work without it" in under two years, arguably faster than any earlier major development-tool transition in the field, and before the research base needed to make informed decisions about that transition had been established. The profession did not choose AI dependency thoughtfully. It slid into it, guided by tools designed to feel indispensable and by an industry ecosystem with strong financial incentives to maximize adoption.

None of this means the tools are bad or that developers should resist them. It means that treating dependence as an inevitable outcome rather than a design choice is a mistake (at the individual level, where skill atrophy compounds quietly, and at the organizational level, where the hidden costs of miscalibrated confidence eventually surface in the work). The developers and teams that will navigate this period best are the ones who use AI deliberately, verify rigorously, and never confuse the feeling of acceleration for acceleration itself.

