...

The 40% Problem: AI & Code Refactoring

April 27, 2026

By Hubert Brychczynski

  • Artificial Intelligence

  • Refactoring

  • Software Engineering

...

The term “automated refactoring” can strike fear into the heart of a software engineer. That’s because LLMs have a tendency to optimize legacy code in ways that look elegant on the surface but often turn out broken on closer inspection.

Research from 2024 through early 2026 corroborates and quantifies this phenomenon. At the same time, it points to potential solutions.

This article distills several studies into a single picture of where AI refactoring stands today, where it goes wrong, and how it impacts software development.

In a follow-up, we will also discuss approaches that let teams capture the productivity gains from AI in refactoring without inheriting the bugs.

Refactoring and the Maintenance Problem

Writing new code is a small slice of software work. In a 2015 research paper, Minelli and colleagues determined that engineers spend 70% of their hours reading code to understand it; 25% on meetings and navigation; and only 5% writing or editing actual code (Figure 1).


Fig. 1: Software engineer’s time allocation - via IEEE, as cited by CodeScene.

Another paper points to why the imbalance exists. In their research on factors affecting software project maintenance, Dehaghani and Hajrahimi find that maintenance accounts for around 90% of a software product’s total cost.

The study identifies and ranks a total of 32 factors that affect software maintenance costs. Based on interviews with 40 experts, the highest-ranked factors include product complexity, application understanding, and documentation quality.

Businesses can improve those factors and drive down overall maintenance costs through code refactoring - provided it is effective. Failing that, refactoring can only exacerbate the issues it was supposed to mitigate.

If developers spend only a fraction of their time writing or editing code and the bulk of it trying to understand what’s already written, then we should hold AI-assisted refactoring to a higher standard than AI-assisted coding: coding attacks the 5% slice; refactoring attacks the 70%.

Refactoring Is a Harder AI Problem Than Code Generation

Refactoring sounds like a subset of coding, but the research community treats it as a distinct task with its own evaluation rules. The SWE-Refactor team at Concordia University put it plainly in their February 2026 paper: refactoring demands precise, behavior-preserving edits that improve program structure. That constraint is what makes automated evaluation so hard.

Code generation has a permissive success criterion: does the output solve the problem? Refactoring has a strict one: does the output solve the problem the same way the original code did, while being structurally better? An LLM that generates a working function from a spec has succeeded. An LLM that rewrites a working function and changes its semantics by one branch has failed, regardless of how clean the rewrite looks.
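To make the distinction concrete, here is a small invented example (not drawn from either study): a working discount calculation and a rewrite that looks cleaner but changes behavior in exactly one branch.

    // Original: member discount with an extra tier for large orders.
    function discount(total: number, isMember: boolean): number {
      if (total <= 0) return 0;
      if (isMember && total > 1000) return total * 0.15; // large-order tier
      if (isMember) return total * 0.1;
      return 0;
    }

    // "Cleaner" rewrite: fewer branches, but the large-order tier is gone,
    // so members spending over 1000 now get 10% instead of 15%.
    function discountRefactored(total: number, isMember: boolean): number {
      return isMember && total > 0 ? total * 0.1 : 0;
    }

If the task were simply “write a discount function,” either version could look acceptable. As a refactoring, the second one fails, because it does not return the same result for every input the original handled.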

So, how well does AI handle that stricter task?

The 40% Ceiling

Two studies published two years apart converge on roughly the same number.

In 2024, CodeScene collected over 100,000 real-world code smells from JavaScript and TypeScript codebases, pointed several state-of-the-art LLMs at them, and checked the results by running the repositories’ existing test suites and re-scoring code health. PaLM 2 code, the best performer, produced functionally correct refactorings in 37% of cases. GPT-3.5 landed at 30%. Spot checks against GPT-4 showed marginal gains at substantially higher cost and latency. Out-of-the-box, a majority of attempted refactorings broke the code.
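The shape of that check is worth spelling out. The sketch below is a simplified illustration of the idea rather than CodeScene’s actual harness: the model call is a stub, and the test command and file path are placeholders.

    import { execSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    // Placeholder for the model call; a real harness would hit an LLM API here.
    async function requestRefactoring(source: string): Promise<string> {
      return source; // identity stub keeps the sketch self-contained
    }

    // Apply a proposed refactoring, then use the repository's own test suite
    // as the behavior-preservation oracle.
    async function evaluateRefactoring(filePath: string): Promise<boolean> {
      const original = readFileSync(filePath, "utf8");
      const refactored = await requestRefactoring(original);
      writeFileSync(filePath, refactored);
      try {
        execSync("npm test", { stdio: "pipe" }); // throws if any test fails
        return true;  // counted as functionally correct
      } catch {
        return false; // the refactoring broke behavior
      } finally {
        writeFileSync(filePath, original); // restore the file either way
      }
    }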

The Concordia team released SWE-Refactor in February 2026 and reached a matching conclusion on Java. Their benchmark comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 production Java projects: 922 atomic cases (single transformations like Extract Method) and 177 compound cases (multi-step transformations like Extract and Move Method). Each instance carries the full repository context and original test suite.
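For readers unfamiliar with the terms, here is a rough TypeScript illustration of the two categories (the benchmark itself is Java, and the class names below are invented):

    type Order = { items: string[]; total: number };

    // Before: validation buried inside the handler.
    class OrderService {
      submit(order: Order) {
        if (order.items.length === 0 || order.total <= 0) {
          throw new Error("invalid order");
        }
        // ...persist the order...
      }
    }

    // Atomic (Extract Method): the check becomes its own method on the same class.
    class OrderServiceAfterExtract {
      submit(order: Order) {
        this.validate(order);
        // ...persist the order...
      }
      private validate(order: Order) {
        if (order.items.length === 0 || order.total <= 0) {
          throw new Error("invalid order");
        }
      }
    }

    // Compound (Extract and Move Method): the extracted check also moves to
    // another class, and every caller must be rewired to the new location.
    class OrderValidator {
      static validate(order: Order) {
        if (order.items.length === 0 || order.total <= 0) {
          throw new Error("invalid order");
        }
      }
    }
    class OrderServiceAfterExtractAndMove {
      submit(order: Order) {
        OrderValidator.validate(order);
        // ...persist the order...
      }
    }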

Across nine widely used LLMs, DeepSeek-V3 topped the chart at 41.58%, with GPT-4o-mini close behind at 39.85%. In a follow-up run on 200 stratified instances, OpenAI Codex (GPT-5.1-Codex) with full repository access, the closest thing to a production agentic setup, handled 82.6% of atomic refactorings but dropped to 39.4% on compound ones.

Look at where the failures cluster. Models do fine on localized, single-file edits. They collapse when the refactoring requires repository-level reasoning: tracking call sites across files, updating signatures consistently, moving a method and rewiring every caller. The SWE-Refactor authors report that most failures come from mismatched edits, where the model performs only part of the requested compound refactoring or produces an alternative transformation that misses the specified operation.
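In code, a mismatched edit on a compound task can look like the following invented sketch: the model was asked to move a fee calculation into a shared class and rewire every caller, but it leaves a stale copy behind and updates only one call site.

    // Requested: move calculateFee into a shared FeeSchedule class and
    // rewire every caller. What the model actually produced:
    export class FeeSchedule {
      static calculateFee(amount: number): number {
        return Math.max(1, amount * 0.03);
      }
    }

    // Stale copy the model forgot to delete.
    function calculateFee(amount: number): number {
      return Math.max(1, amount * 0.03);
    }

    export const wireTransferFee = (amount: number) =>
      FeeSchedule.calculateFee(amount); // rewired, as requested

    export const cardPaymentFee = (amount: number) =>
      calculateFee(amount); // missed call site: future fee changes won't reach it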

Two years apart, two research teams, two languages, two methodologies. Both land in the same place: AI alone gets about 40% of complex refactorings right (Figure 2).


Fig. 2: Success rates of out-of-the-box LLMs on complex refactorings, from two independent studies on JavaScript and Java, respectively (2024, 2026).

How AI Breaks Code

Percentages alone abstract away the gravity of AI failures. The CodeScene team catalogued the recurring patterns from their review of AI-generated refactorings:

  • The model drops a branch, often an if block handling an edge case. When the dropped branch contains input validation, the refactoring becomes a security vulnerability.
  • The model inverts boolean logic: e.g., a && b becomes !(a && b), and the unit tests that don’t cover the specific combination pass anyway.
  • In JavaScript, the model mishandles the “this” binding when extracting an expression into a new function. The extracted code runs, returns nothing meaningful, and the surrounding logic takes the wrong path without raising an error.

None of these failures flag themselves during code review. A senior engineer reading the diff sees a tidier function and approves. The bug surfaces weeks later in production.
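Here is a hypothetical sketch of the third pattern, the “this”-binding failure (the Counter class and click handler below are invented, not taken from the CodeScene dataset):

    class Counter {
      private count = 0;

      increment() {
        this.count += 1;
      }

      // Original: the arrow function keeps "this" bound to the Counter instance.
      attach(button: { onClick: (handler: () => void) => void }) {
        button.onClick(() => this.increment());
      }

      // "Refactored": passing the method reference directly looks tidier, but it
      // detaches "this". When the handler later fires, "this" is no longer the
      // Counter, so the call either throws or increments a count on the wrong
      // object, depending on how the handler is invoked.
      attachRefactored(button: { onClick: (handler: () => void) => void }) {
        button.onClick(this.increment);
      }
    }

The one-line version reads as a cleanup, which is exactly why it tends to survive review.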

The Second-Order Damage: Developer Toil

The cost of AI refactoring errors shows up twice. First in the broken code, then in the team that has to live with it.

SonarSource’s 2026 State of Code developer survey found that 88% of developers had experienced at least one negative impact of AI on technical debt: 53% said AI had produced code that looked correct but was unreliable; 40% said AI had generated unnecessary or duplicative code; and 29% said AI had introduced buggy code outright (Figure 3).


Fig. 3: Developer-reported negative impact of AI on technical debt (via SonarSource).

The survey focused on AI-generated code in general rather than on specific use cases such as refactoring, but it is reasonable to assume that AI makes similar errors when refactoring as it does when writing new code. Of all the error types developers identified, the first is simultaneously the most prevalent and the most insidious: code that looks right but works wrong. Code that looks obviously wrong gets fixed. Code that only appears fine taxes the reviewer who has to catch it.

SonarSource calls this new pattern “the great toil shift.” AI clears certain kinds of drudgery (like debugging poorly documented code) while creating new kinds in return. Developers who use AI heavily report that their toil has moved from writing code to managing technical debt and correcting AI-generated code.

This is not to say that AI brings no benefit to software engineering: the survey pegs the average personal productivity gain at 35%. But the time budget for “toil work” holds steady at 23-25% regardless of how much AI is involved. The shape of the work changes; the total load does not.

If you’re tracking team throughput against defect rates, this is the trap. You can buy speed in the SDLC’s early phases and pay for it in the later ones, and the accounting is hard to see until the tech debt bill arrives.

What Next?

The research catalogs the failures, quantifies the ceiling, and names the second-order tax on developer time - but naming a problem is not solving it. The good news is that getting AI refactoring past the 40% baseline is both possible and documented, with the approaches converging on similar architectural ideas. Part 2 will explore them in more detail.

Frequently Asked Questions

How accurate is AI at code refactoring?

Around 40%. Two independent studies (2024 and 2026, on JavaScript and Java respectively) found that out-of-the-box LLMs produce correct refactorings about 40% of the time on complex tasks.

What kinds of errors does AI make when refactoring?

Examples recorded in research include dropped branches, inverted booleans, and broken “this” bindings. More importantly, though, the output tends to look tidier than the original despite the bugs, prompting engineers to approve it for production.

Does AI reduce developer workload?

No. The data suggests AI mostly shifts the workload without meaningfully reducing it. Overall productivity goes up around 35%, but heavy AI users spend the freed-up time on new work that AI creates: managing technical debt and rewriting AI's output.
