Flipping The Odds. How To Make AI Refactoring Work In Your Favor

May 20, 2026

By Hubert Brychczynski

Artificial Intelligence,
Refactoring,
Software Engineering

AI-assisted refactoring sits in an uncomfortable spot. The productivity case is real; the reliability case much weaker.

Research published between 2024 and early 2026 shows that general-purpose LLMs, pointed at real refactoring tasks without guardrails, produce broken code more often than working code.

Our previous article discussed the evidence: the 40% ceiling that general-purpose LLMs hit on complex refactorings; the specific ways AI breaks code in refactoring; and the second-order cost on developer time as work shifts from writing code to correcting what AI has produced.

This piece offers a way forward. Research demonstrates that it is possible to refactor code with AI and either work around or raise the 40% baseline. Read on for two strategies, the tools that implement them, and the architecture that ties them together.

Two paths forward

The research points at two complementary strategies for getting AI refactoring to a state where a director can defend its deployment (Figure 1).

Fig. 1: Two paths toward more robust refactoring with AI.

Narrow, purpose-built tools

The broad-model approach (point a general LLM at a codebase and ask it to refactor) tops out at roughly 40% success on complex refactorings. However, purpose-built, open-source tools can be used on top of or alongside AI-assisted refactoring to eliminate or flag its errors. These tools have been either trained or engineered for a narrow slice of the refactoring problem, which allows them to outperform off-the-shelf large language models within that slice.

CodeT5

Salesforce’s CodeT5 is an open-source model trained on code-specific tasks. According to a study on AI in code optimization and refactoring, the tool provides a 15% productivity lift when integrated into Salesforce’s internal development workflow, alongside measurable reductions in duplicate code.

While CodeT5 does not try to solve refactoring in general, it aids the task through code completion, summarization, and translation within a bounded scope.

RefactoringMiner

RefactoringMiner, an open-source AST-based tool for Java, works at the opposite end of the spectrum: it does not generate code at all but instead detects refactoring patterns through static analysis and surfaces opportunities.

One financial services firm reportedly achieved a 25% reduction in technical debt after a year of using RefactoringMiner to identify modularization candidates in its trading platform.

RefactoringMiner is also the detector the SWE-Refactor benchmark uses to verify whether LLM-generated refactorings match the intended transformation: the deterministic tool grades the probabilistic one.

Both cases share a similar approach: augment general-purpose AI with a specialized tool rather than throwing it at a task it can’t reliably solve.

Agentic verification layer

The second approach also combines AI-assisted refactoring with an additional tool, but the tool itself is custom-built and trained on case-specific data.

CodeScene’s 2024 whitepaper documents the creation of a fact-checking model, which was trained on a data lake of 100,000 real-world refactoring samples with known-correct ground truth.

Once a general LLM produces candidate refactorings, the fact-checking model evaluates each candidate and assigns it a confidence score in a series of steps, arriving at a final precision score. Candidates below a given threshold get rejected.
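The thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not CodeScene's implementation: the Candidate structure, the score values, and the 0.9 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One refactoring proposed by the generator (hypothetical structure)."""
    candidate_id: str
    confidence: float  # score in [0, 1] assigned by the verifier


def filter_candidates(candidates, threshold):
    """Split candidates into accepted and rejected lists by confidence score."""
    accepted = [c for c in candidates if c.confidence >= threshold]
    rejected = [c for c in candidates if c.confidence < threshold]
    return accepted, rejected


# Illustrative scores only; they are not taken from the whitepaper.
candidates = [
    Candidate("extract-method-1", 0.97),
    Candidate("extract-method-2", 0.41),
    Candidate("simplify-conditional-1", 0.88),
]
accepted, rejected = filter_candidates(candidates, threshold=0.9)
print([c.candidate_id for c in accepted])
```

Raising the threshold trades yield for precision: fewer candidates survive, but a larger share of the survivors is correct.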

The results are telling. Raw GPT-3.5 produces correct refactorings 26-33% of the time depending on the code smell category (with GPT-4 only slightly better at the expense of speed and cost). The verification layer filters out the faulty refactorings to the point where 96-99% of what passes through is correct (Figure 2).

Fig. 2: Refactoring correctness rates before and after adding a verification layer (via: CodeScene)

CodeScene's team makes the mechanism explicit: the LLM's own success rate stays at roughly 30 percent. The verification layer doesn't improve the model; it rejects bad outputs. The share of correct refactorings among those that ship climbs to about 98% because the roughly 70 percent of bad outputs never make it past the filter.
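A quick calculation shows why filtering alone can lift the correctness of shipped code without touching the generator. The sketch below assumes a generator that is right about 30% of the time; the two verifier pass rates are hypothetical values chosen for illustration, not figures reported by CodeScene.

```python
def precision_after_filter(base_rate, pass_rate_correct, pass_rate_faulty):
    """Share of correct refactorings among those the verifier lets through.

    base_rate:         fraction of generator outputs that are correct
    pass_rate_correct: fraction of correct outputs the verifier accepts
    pass_rate_faulty:  fraction of faulty outputs the verifier accepts
    """
    passed_correct = base_rate * pass_rate_correct
    passed_faulty = (1.0 - base_rate) * pass_rate_faulty
    return passed_correct / (passed_correct + passed_faulty)


# Hypothetical verifier: accepts 80% of correct outputs, 0.7% of faulty ones.
precision = precision_after_filter(0.30, 0.80, 0.007)
print(f"{precision:.3f}")
```

Under these assumed rates, about 98% of what passes the filter is correct, even though the generator itself is no better than before. The price is yield: the verifier also discards a fifth of the correct candidates.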

SWE-Refactor found a compatible result through a different approach. Their multi-agent setup pairs a Developer agent that generates refactored code with a Reviewer agent that runs static analysis against the output and feeds critiques back for iteration. This architecture raised GPT-4o-mini’s successful refactoring count from 438 to 579 on the benchmark (an uptick of 32%), outperforming both simple prompting and retrieval-augmented generation.
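The generate-review-iterate loop can be sketched as follows. This is a simplified illustration under stated assumptions, not SWE-Refactor's code: the function names are hypothetical, the developer is a stub standing in for an LLM call, and the reviewer only checks that the candidate compiles.

```python
def refactor_with_review(source, develop, review, max_rounds=3):
    """Iterate generate -> review until the reviewer has no critiques left.

    develop(source, critiques) returns a candidate refactoring.
    review(source, candidate) returns a list of critiques (empty = accept).
    Returns the accepted candidate, or None if the rounds run out.
    """
    critiques = []
    for _ in range(max_rounds):
        candidate = develop(source, critiques)
        critiques = review(source, candidate)
        if not critiques:
            return candidate
    return None


def stub_developer(source, critiques):
    # Stand-in for an LLM call: the first attempt is broken,
    # the retry after receiving critiques is valid.
    if critiques:
        return "def total(xs):\n    return sum(xs)\n"
    return "def total(xs) return sum(xs)"


def syntax_reviewer(source, candidate):
    # Deterministic check: reject anything that does not compile.
    try:
        compile(candidate, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    return []


original = "def total(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
result = refactor_with_review(original, stub_developer, syntax_reviewer)
print(result is not None)
```

A real reviewer would run richer static analysis and tests, but the control flow stays the same: critiques go back to the generator, and nothing is accepted until the reviewer has no objections left.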

Vibe, then verify

These approaches to refactoring with AI have a name in the SonarSource report: “vibe, then verify.” Let AI generate. Do not let it merge. Between the two, interpose a verification layer (rule-based checks like test runs, static analysis, and AST-based refactoring detection, or model-based reviewers trained on ground-truth data) that rejects bad output without needing human review.
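One rule-based check from that list can be sketched with Python's standard ast module. The example below is an assumption-laden illustration rather than a complete gate: the function names are hypothetical, and the only rule enforced is that a candidate must parse and keep every public function's name and positional parameters.

```python
import ast


def public_signatures(source):
    """Map each top-level public function name to its positional parameter names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_")
    }


def verify_refactoring(original, candidate):
    """Return a list of rejection reasons; an empty list means the gate passes."""
    try:
        after = public_signatures(candidate)
    except SyntaxError as exc:
        return [f"candidate does not parse: {exc.msg}"]
    before = public_signatures(original)
    reasons = []
    for name, params in before.items():
        if name not in after:
            reasons.append(f"public function removed: {name}")
        elif after[name] != params:
            reasons.append(f"signature changed: {name}")
    return reasons


original = "def price(amount, rate):\n    return amount * (1 + rate)\n"
good = (
    "def _apply(amount, rate):\n    return amount * (1 + rate)\n\n"
    "def price(amount, rate):\n    return _apply(amount, rate)\n"
)
bad = "def price(amount):\n    return amount\n"

print(verify_refactoring(original, good))
print(verify_refactoring(original, bad))
```

A structural check like this catches broken interfaces but not behavioral drift such as dropped branches or inverted booleans, so in practice it would sit alongside test runs rather than replace them.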

The shift is structural. The generator stays probabilistic. Your team owns the verifier and the final outcome.

The Takeaway

The research is consistent enough to draw firm conclusions. Out-of-the-box LLMs refactor complex code correctly about 40% of the time, and the failure modes (dropped branches, inverted booleans, broken bindings) are the ones that pass code review and surface later as production bugs. The baseline has not moved in two years.

The path past the 40% success ratio runs through verification, not better prompts or larger models. CodeScene’s fact-checked pipeline filters candidates until roughly 98% of the refactorings that pass are correct, on a bounded set of refactoring types. SWE-Refactor’s multi-agent workflow raises compound-refactoring success by about a third over simple prompting. The shape of the answer is the same in both cases: probabilistic generation, deterministic verification, and the architectural discipline to keep those two roles separate.

How it works at Janea Systems

In most cases, a proper verification layer is not something you can buy. Instead, you have to create it, using three key competences that rarely sit together: working knowledge of the domain being refactored; production experience with AI agents; and internal data on where AI helps and where it does not.

Janea Systems has worked across all three.

Refactoring Bing Maps

On the domain side, our team refactored Microsoft Bing Maps' deep learning infrastructure. When we found the TensorFlow implementation of Bing's DeepCAL model running 10x slower than its PyTorch counterpart, we rewrote the training algorithm, restructured the input pipeline, and broke oversized single files into smaller TFRecord files for parallel I/O. The result? TensorFlow execution became 50x faster, training ran 7x faster, and dual-GPU throughput gained 30%. Read the case study.

Strictly speaking, our work straddled the line between refactoring and optimization. Martin Fowler draws a distinction between refactoring and optimization by purpose rather than technique: the two often share the same transformations, but refactoring targets clarity while optimization targets speed. For Bing Maps, we drew on both. The engineering discipline is the same either way: deep familiarity with the existing system, rigorous testing against the original behavior, and the commitment not to break what works.

Building agentic AI for FinTech

On the agentic side, we built a multi-agent system for a U.S. financial services client. The system ended up handling 15,000 to 50,000 cases per day, with role-specific agents for different audiences (account holders, collectors, internal teams) on a shared orchestration layer extensible to a planned portfolio of around 20 assistants.

The architectural answer is, in principle, similar to what the refactoring research describes: separate the probabilistic generator from the deterministic constraints around it. In our case, that meant tight feedback loops with subject matter experts, modular architecture, and observability built into the orchestration layer so variance could be measured and corrected. Read the case study.

Keeping tabs on AI-assisted coding

On the measurement side, we ran three controlled experiments on AI-assisted coding across five engineering domains to establish what works and what doesn’t.

On average, our engineers completed tasks 30% faster with AI. Gains concentrated where prompt-engineering skill and domain expertise were the strongest: 67% faster in front-end engineering, 56% faster in back-end. Machine learning gained 24%; data engineering 10%.

The takeaway from these experiments is not only which engineering domains stand to gain the most from AI augmentation, but also - and perhaps more importantly - what kinds of tasks we can leave to AI versus where we should stay vigilant.

Across all tested domains, AI earned its keep on the same three task types: kickstarting projects with boilerplate, starter templates, and architecture scaffolding; pulling in documentation, API references, and best practices the engineer would otherwise search for; and acting as a sounding board for explaining or brainstorming an approach.

The failure modes clustered just as consistently. Engineers across every domain reported AI suggestions that were wrong, broken, illogical, inscrutable, outdated, non-standard, generic, or overcomplicated.

Back-end engineers noticed a subtler trap: when AI-generated code contained errors, they struggled to trace the root cause because they had not authored or reasoned through the code themselves.

The bottom line: lean on AI when you own the surrounding reasoning and can verify the output quickly; exercise caution the moment the code needs to integrate with an existing system, handle domain-specific data, or touch infrastructure configuration.

The biggest risk remains the same for both AI coding and refactoring: plausible code that an engineer can gloss over too easily - whether due to lack of context, insufficient understanding, or plain human oversight.

Ready to refactor with AI?

Domain experience, agentic architecture, measured AI practice: this is the shape of the engineering team you need if you want to move AI refactoring from the 40% baseline to something your senior engineers will trust with production code.

That team is us. Let’s talk.

Frequently Asked Questions

Pair the LLM with a verification layer that rejects bad output before it merges. The pattern across the research is consistent: probabilistic generation, deterministic verification, architectural separation between the two roles.

Narrow purpose-built tools (CodeT5 for code-specific tasks, RefactoringMiner for AST-based static analysis on Java) and custom verification layers trained on ground-truth refactoring data to score and reject low-confidence candidates.

Trust AI for boilerplate, scaffolding, documentation lookups, and brainstorming. Verify whenever the code integrates with an existing system, handles domain-specific data, or touches infrastructure configuration.