...

The AI Agent Reality Check: Why Scaling Fails and How to Build for Reliability

January 09, 2026

By Hubert Brychczynski

  • Artificial Intelligence

  • Agentic AI

  • AI Engineering

  • Rapid Prototyping

...

The promise of AI agents is that they will autonomously and reliably perform complex tasks with minimal human supervision and a low error rate.

Reality paints a different picture.

For starters, we don't even have an exact definition of an agent, which allows disingenuous providers to repackage old technology as AI solutions. But even for agentic systems in the truest sense of the word, enterprise adoption falters, scaling efforts fail, token costs spiral out of control, and errors compound, often leading to near-catastrophic outcomes.

Agentic systems can and do work, but they require specific conditions and a deliberate engineering approach to deliver real value without putting your production systems at risk.

The present and future of agentic AI

McKinsey's State of AI report, published in November 2025, reveals that 39% of all adopters are still "experimenting" with agents. Across business functions, fewer than 10% of respondents say they have successfully scaled their solutions in any single area. ROI estimates are also modest at best, with the majority of respondents attributing less than 5% of their earnings to AI.

Analysts from Gartner aren't optimistic about the future, either. They predict that 40% of all agentic AI projects will fail by the end of 2027, owing to escalating costs, quality and security issues, vague or inconsistent outcomes, and unrealistic expectations.

That being said, agentic AI isn’t going away, at least according to Gartner. Once the dust settles and leaders realize that agents aren’t a miracle cure but an emerging technology with tangible benefits when applied judiciously on a case-by-case basis, adoption will rise. As a result, Gartner expects agentic AI to automate up to 15% of workplace decision-making by 2028.

Building agents that last

The path to reliable, production-ready, and cost-efficient agentic AI is riddled with pitfalls but navigable. The first step is changing the mindset: agents aren't replacements for human roles, because the technology isn't there yet. This is reflected in research. When asked about the projected impact of AI agents on employment, 43% of respondents in the McKinsey report said they didn't expect any changes, while 13% anticipated increases.

The point of remembering that agents aren't humans isn't idealism but pragmatism. When employers remain skeptical about the effect of AI agents on job displacement, they are implicitly admitting that, deployed at scale, agents are not as reliable or flexible as humans, and sometimes not even as affordable. For example, Google researchers found that although a single agent can occasionally match a human on a sequential task, performance drops by 39 to 70% when more agents try to tackle the same task, quickly depleting the available token budget.

It is this inherent unreliability, rigidity, and costliness of AI agents that needs to be addressed when designing agentic solutions for the enterprise.

Caveat 1: Engineer for security and observability

Stories like the one about an agent that deleted an entire database are not uncommon. For more examples, refer to our article on vibe coding vs. system stability. One might be tempted to question how often such incidents really happen and brush them off as edge cases, "not something that's going to happen to me". We would argue that the potential consequences of these failures are too grave to dismiss them on the grounds of probability. Furthermore, in systems with multiple agents involved, the odds of failure compound and the risk increases.

To ensure maximum security, engineer your agents with least-privilege access, granular insight, and rollback capabilities. Least-privilege guardrails mean that you control what an agent can and cannot do; granular insight allows you to pinpoint the moment in the process when something went wrong; and rollback capabilities (i.e. various forms of backup and recovery systems) enable you to undo the damage caused by an agent's error.
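
As a rough illustration, here is a minimal Python sketch of how those three guardrails might be wired around an agent's tool calls. The `AgentToolGate` class, the action names, and the log format are hypothetical rather than a reference implementation; the point is the pattern: an explicit allowlist, an audit entry for every call, and a snapshot to roll back to.

```python
import copy
import json
import time

class AgentToolGate:
    """Hypothetical wrapper enforcing least privilege, logging every call,
    and keeping snapshots so an agent's actions can be rolled back."""

    def __init__(self, allowed_actions, state):
        self.allowed_actions = set(allowed_actions)  # least-privilege allowlist
        self.state = state                           # data the agent is allowed to modify
        self.audit_log = []                          # granular insight: one entry per call
        self.snapshots = []                          # rollback points

    def execute(self, action, handler, **kwargs):
        if action not in self.allowed_actions:
            self._log(action, kwargs, status="denied")
            raise PermissionError(f"Agent may not perform '{action}'")
        self.snapshots.append(copy.deepcopy(self.state))  # snapshot before any change
        try:
            result = handler(self.state, **kwargs)
            self._log(action, kwargs, status="ok")
            return result
        except Exception as exc:
            self._log(action, kwargs, status=f"error: {exc}")
            self.rollback()  # undo the damage caused by the failed step
            raise

    def rollback(self):
        if self.snapshots:
            self.state = self.snapshots.pop()

    def _log(self, action, kwargs, status):
        self.audit_log.append(
            {"ts": time.time(), "action": action, "args": kwargs, "status": status}
        )


# Usage: the agent may update a record, but nothing else (e.g. no deletes).
gate = AgentToolGate(allowed_actions={"update_record"}, state={"records": {}})
gate.execute(
    "update_record",
    lambda state, key, value: state["records"].update({key: value}),
    key="invoice-42",
    value="paid",
)
print(json.dumps(gate.audit_log, indent=2))  # full trail of what the agent did
```

In production, the snapshots would be backed by real backup and recovery infrastructure, but the shape of the control flow stays the same.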

A good example of “granular insight” is our work for BigFilter, where we developed a rapid prototype of a fact-checking tool with AI and RAG on the backend. The entire fact-checking process was divided into discrete steps, providing users with insights into the tool’s reasoning process.

Caveat 2: Add deterministic logic and human in the loop

The controlled randomness that allows LLMs to mimic human conversation is also one of their greatest flaws when you try to deploy them in production for anything more complex than a conversation. We know by now that hallucinations aren't going away anytime soon (and may be getting worse). But even a 5% hallucination rate (a tall order in and of itself) would compound in complex workflows with every step, translating to a roughly 60% success rate at 10 steps, according to the calculations in an article on HackerNoon (see Figure 1, adapted for reference). Salesforce CTO Muralidhar Krishnaprasad recently remarked that LLMs start ignoring instructions beyond as few as eight directives, an observation that underpins the company's latest decision to dial back its reliance on AI-only systems.

Fig. 1: Error compounding in multi-agent workflows (adapted from a HackerNoon article by Utkarsh Kanwat)
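
The arithmetic behind Figure 1 is easy to reproduce. A minimal sketch, using illustrative per-step success rates rather than measured ones:

```python
# End-to-end success of a sequential workflow is the product of per-step
# success rates: P(workflow) = p ** n for n independent steps.
for p in (0.99, 0.95, 0.90):      # illustrative per-step success rates
    for n in (5, 10, 20):         # workflow lengths
        print(f"per-step {p:.0%}, {n:>2} steps -> {p ** n:.0%} end to end")

# A 95% per-step rate over 10 steps comes out to roughly 60% end to end,
# the figure referenced above.
```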

Salesforce's solution is to equip their agentic systems with "deterministic triggers" such as Flows, Apex, or APIs. If this sounds a lot like coding, that's because it is. Custom code imposes rule-based behavior on AI's inherently probabilistic models, which means that engineering expertise is as valuable as ever for creating robust agents. Another, perhaps more obvious, solution is to keep humans in the loop at mission-critical checkpoints. As the HackerNoon writer put it: "AI handles complexity, humans maintain control, and traditional software engineering handles reliability."
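
Below is a minimal sketch of that hybrid pattern, with hypothetical names and a made-up refund scenario: the LLM only proposes an action, deterministic code validates it against hard rules, and anything mission-critical waits for a human sign-off.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    name: str        # e.g. "issue_refund"; parsed from the LLM's output in practice
    amount: float    # illustrative parameter

# Deterministic layer: hard-coded business rules, no probability involved.
REFUND_LIMIT = 100.0

def validate(action: ProposedAction) -> bool:
    """Rule-based check the LLM cannot talk its way around."""
    return action.name == "issue_refund" and 0 < action.amount <= REFUND_LIMIT

def requires_human(action: ProposedAction) -> bool:
    """Mission-critical checkpoint: larger refunds always go to a person."""
    return action.amount > REFUND_LIMIT * 0.5

def run(action: ProposedAction, human_approves) -> str:
    if not validate(action):
        return "rejected by deterministic rules"
    if requires_human(action) and not human_approves(action):
        return "held for human review"
    return "executed"

# Here the LLM's proposal is hard-coded; a real system would parse it from model output.
print(run(ProposedAction("issue_refund", 80.0), human_approves=lambda a: False))
# -> "held for human review": the model reasons, while rules and a human keep control.
```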

Caveat 3: Build incrementally

Organizations are also increasingly learning that implementing AI agents requires a bespoke rather than a one-size-fits-all approach. This is corroborated by McKinsey research, which identifies several key practices shared by the companies with the largest returns on investment from AI. Apart from humans in the loop, a clearly defined roadmap, leadership alignment, and stakeholder engagement, other crucial factors include iterative solution development and rapid development cycles.

These observations align with our experience applying rapid prototyping principles to AI-driven development. As we pointed out in a series of articles, rapid prototyping is an incremental approach to software design that minimizes the risk of failure in implementations, including those of agentic systems. The key is to build quickly, in weeks and months rather than quarters, and to validate ideas before flawed assumptions can cause real harm.

An example rapid prototyping session can progress along the following stages:

  • high-speed validation in 30 minutes to 2 hours;
  • early prototype delivery and review in 2 to 3 days to support decision-making;
  • discovery-accelerating builds in 3 to 5 days;
  • continued support and iteration across 1–3-month intervals.

In fact, remember the prototype for BigFilter we mentioned above? It was ready for testing after three months.

AI-driven software development makes rapid prototyping even rapid-er, if that's possible (and a word). Equipped with AI coding tools, our teams can deliver results roughly 25 to 70% faster than without them, depending on the domain. We can say this with confidence, because we measured it.

Caveat 4: Mind the costs

Finally, artificial intelligence is expensive, sometimes prohibitively so, and it gets more expensive the more complex the systems you want to build with it. As we discuss in the article on JSPLIT, Janea Systems' response to the problem of prompt bloating in the Model Context Protocol, "LLM computational complexity is quadratic to prompt length". In practice, every new turn in a conversation carries the full history of every turn before it, so doubling the length of a conversation roughly quadruples its cost (Figure 2). This is why, shockingly, some companies are finding that their multi-agent systems incur too much cost to justify their existence.

Fig. 2: Compounding token cost in conversational agents (adapted from a HackerNoon article by Utkarsh Kanwat)
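
To see why conversational agents get expensive, consider a back-of-the-envelope model: every turn re-sends the entire history, so the tokens processed grow roughly with the square of the conversation length. The numbers below are illustrative, not real prices.

```python
# Each turn re-processes the full conversation so far, so total tokens
# (and cost) grow roughly quadratically with the number of turns.
TOKENS_PER_TURN = 500        # illustrative average size of one exchange
COST_PER_1K_TOKENS = 0.01    # illustrative price, not any provider's real rate

def conversation_cost(turns: int) -> float:
    total_tokens = sum(t * TOKENS_PER_TURN for t in range(1, turns + 1))
    return total_tokens / 1000 * COST_PER_1K_TOKENS

for turns in (10, 20, 40):
    print(f"{turns:>3} turns -> ${conversation_cost(turns):.2f}")
# Doubling the conversation length roughly quadruples the total cost.
```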

The solution to this problem boils down, again, to custom, incremental, and intentional design. Don't go big from the get-go, and don't throw AI at everything, even at things a good old-fashioned program could do faster and cheaper. Instead, focus on granular solutions and enlist expert AI engineers to identify the most promising avenues for AI automation in your company. Otherwise, you might fall into the trap described by Anushree Verma, Senior Director Analyst at Gartner, who said: "Many use cases positioned as agentic today don't require agentic implementations." We would venture that a lot of those use cases are causing companies considerable, and unnecessary, costs.

Develop an Efficient AI Roadmap with Us

If you want to build agentic systems that are as secure, reliable, flexible, and cost-effective as present-day technology allows, we are happy to assist. Have an idea in mind? We can help you rapid-prototype it into existence. Not sure where to start? Sign up for our AI Maturity Workshop to develop an actionable AI strategy, purpose-built for your company, in less than 6 weeks.

Frequently Asked Questions

Why do AI agent projects fail to scale?

The most common cause of scaling failure is the "hallucination cascade." While an agent may achieve 95% accuracy in a single-turn demo, that error rate compounds in multi-step workflows. In a 10-step process, a 95% per-step success rate drops to approximately 60% end to end, making the system too unreliable for mission-critical tasks.

How do you build enterprise-grade reliability into AI agents?

Enterprise reliability is built through deterministic guardrails and human supervision. This involves a hybrid architecture where the LLM handles reasoning, hard-coded logic (APIs, Flows, or custom code) handles execution, and a human in the loop double-checks mission-critical decisions. Key security measures include least-privilege access, granular observability, and rollback capabilities.

How should organizations approach AI agent adoption?

Instead of attempting to replace entire human roles, the most successful organizations focus on incremental, bespoke automation. Gartner predicts that by 2028, agents will automate 15% of workplace decision-making. To get there, companies should adopt a rapid prototyping mindset, validating high-impact use cases in 1–3-month intervals. This minimizes the risk of prompt bloating and quadratic cost increases while ensuring the AI is purpose-built for specific business logic.
