The Silent Killers Of AI ROI. Diagnosing MLOps Pipeline Bottlenecks, Part 2

March 31, 2026

By Hubert Brychczynski

  • Artificial Intelligence

  • Machine Learning

  • AI Engineering


In a previous article, we discussed three bottlenecks that hamper AI in production and our insights from solving them. Today, we continue the conversation on AI bottlenecks with a specific focus: accuracy.

Distrust: A Critical AI Bottleneck

A chatbot that takes 60 seconds to respond is an inconvenience; one that fumbles a piece of financial data is a liability. 

Research shows that people gain confidence in automation once it is roughly 75% reliable or better. Certain industries consider even this level too low. In healthcare or finance, for instance, the smallest error can have disastrous consequences.

The True Impacts of Inaccuracy In AI

Less regulated industries may be tempted to dismiss a 10 or 20% error rate in AI as negligible. The trouble is that we can never know in advance where hallucinations will occur, how grave they will be, or what kind of expertise it will take to expose them.

Percentage as a reliability metric serves AI providers more than it does actual users. For example, various benchmarks have placed LLM accuracy at around 80 or 90%. At the same time, when BBC journalists put genAI to the test and asked it to summarize news items, they found serious factual issues in half of the outputs.

In practice, this means that the lower the accuracy, the more effort users must put into verification. Gartner says this additional scrutiny can cancel out all productivity gains from AI.

Generative AI also makes users less likely to spot mistakes. A single prompt can yield a booklet’s worth of plausible-looking text in seconds. Human ownership is minimal. The argument, the logic, the references come from somewhere else. They look coherent, yet unfamiliar. These conditions will challenge even the most attentive reader’s ability to verify the output effectively.

Connect the dots, and you’ll get the full picture: a groundbreaking technology that we know will always make an occasional mistake is overtaking numerous high-stakes industries. Eight times out of ten, it gets everything right; the two times it doesn’t are hard to spot, predict, or explain. If you miss one, things can go sideways quickly.

Real-Life Consequences

AI’s tendency to plant random falsehoods in volumes of plausible output combined with our tendency to overlook them can make a perfect storm. The aftermath? Compromised businesses, alienated users, and slashed revenues.

Case in point: Deloitte. The consulting company suffered a series of scandals in 2025 when reports it produced for clients in Australia, the US (Albany), and Canada were found to contain AI-generated hallucinations. Reputation was damaged, reports retracted, fees partially refunded.

In the summer of the same year, Business Insider reported on a significant decline in traffic to vibe coding platforms. Vercel’s v0 was down by 64%, Lovable by 40%, and Bolt.new by 27%. The reasons for this exodus were a matter of speculation, but reliability could certainly have played a role.

A New Type of Technology

AI marks the first computer-based technology that is non-deterministic by design. Before, software developers could write a program and - all else being equal - expect it to behave consistently under the same conditions. This is no longer true with AI.

The rules of how technology behaves may have changed; our expectations toward it have not. That’s why a high degree of accuracy is absolutely essential for any viable, production-ready AI solution.

Below, we go into the details of what can undermine AI accuracy and how we addressed it in two real-world cases.

Accuracy Bottleneck #1: Agentic AI Accuracy Collapse at Scale

When Adding Capabilities Makes The Agent Less Reliable

Agentic AI promises autonomous systems that can take actions across hundreds of integrated tools and data sources. Model Context Protocol (MCP) provides the standardized infrastructure for this, enabling language models to interact with file systems, databases, APIs, web browsers, email clients, and other tools via specialized servers. When faced with a task, the model browses the servers to pick those best suited for the job. 

But the more versatile we make the system, the less reliable it becomes.

Each MCP server adds tool descriptions to the LLM’s context window. The model performs reliably with 10, 20, or 30 tools. At 100 servers and more, however, the flood of contradictory and irrelevant information becomes overwhelming. Accuracy drops below 40%.
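To see why context fills up so quickly, consider a back-of-the-envelope sketch. The numbers below are illustrative assumptions, not measurements: the point is only that prompt size grows linearly with registered servers before the user’s question even arrives.

```python
# Illustrative only: TOKENS_PER_SERVER is an assumed average for one MCP
# server's tool descriptions, not a measured value.
TOKENS_PER_SERVER = 150


def prompt_tokens(num_servers: int, query_tokens: int = 50) -> int:
    """Estimate prompt size: every registered server's tool descriptions
    land in the context window alongside the user's query."""
    return num_servers * TOKENS_PER_SERVER + query_tokens


for n in (10, 100, 1000):
    print(f"{n:>5} servers -> ~{prompt_tokens(n):,} prompt tokens")
```

At these assumed rates, moving from 10 to 1,000 servers inflates the prompt by roughly 100x, which also foreshadows the token cost gap discussed later.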

The model was supposed to reduce manual work. At 40% accuracy, users must not only verify every action it takes but also fix much of the output.

Here’s how it might play out in practice: an employee asks an agentic system to “schedule a meeting and send the team a summary of last quarter’s sales by region.” The system needs to correctly identify which tools handle calendar operations versus which handle data retrieval. With hundreds of tools available, it might:

  • Select a file compression tool instead of calendar integration
  • Pull data from the wrong database table
  • Route the request to a deprecated API endpoint
  • Mix up tools from different functional domains entirely

Each error compounds. The meeting gets scheduled incorrectly. The sales data is wrong. The team receives misinformation. Employees waste time correcting the mistakes and lose confidence in the system. After a few iterations, they stop using it entirely and go back to manual processes.

This is the accuracy-as-cost problem. The direct cost is wasted compute on failed operations. The indirect cost is user abandonment when the system falls below the reliability threshold and supervision effort exceeds manual effort.

Real-World Case Study: JSPLIT Maintains 70% Accuracy Where Baseline Systems Fail

JSPLIT is our open-source solution to maintaining agentic reliability in MCP-intensive setups. The core insight is that accuracy degradation comes from cognitive overload: the model is trying to evaluate too many options simultaneously. The solution is to reduce the decision space by filtering out as many irrelevant options as possible.

The first step was organizing MCP servers into a hierarchical taxonomy of functional categories, such as “Search and Information Retrieval,” “Memory and Knowledge Management,” or “Data Extraction and Manipulation.” JSPLIT uses this taxonomy as a starting point to identify the most relevant MCP servers for each query. As a result, the LLM receives the original user query along with a tailored list of MCP recommendations. This structured pre-filtering mitigates the accuracy collapse that kills agentic systems at scale.
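The pre-filtering idea can be sketched in a few lines. This is a minimal illustration of taxonomy-based shortlisting, not JSPLIT’s actual implementation: the category names echo the taxonomy above, but the keyword scorer and all server names are invented for the example (a real system would use embeddings or an LLM classifier).

```python
# Toy taxonomy: category -> indicative keywords. Purely illustrative.
TAXONOMY = {
    "search_and_retrieval": ["search", "web", "lookup", "news"],
    "memory_and_knowledge": ["memory", "notes", "knowledge", "recall"],
    "data_extraction": ["scrape", "parse", "csv", "extract"],
}


def categorize(text: str) -> str:
    """Assign text (a query or a server description) to the best-matching
    functional category by keyword overlap. Ties go to the first category."""
    lowered = text.lower()
    scores = {
        cat: sum(kw in lowered for kw in keywords)
        for cat, keywords in TAXONOMY.items()
    }
    return max(scores, key=scores.get)


def shortlist(query: str, servers: dict[str, str], top_k: int = 5) -> list[str]:
    """Pre-filter: only servers in the query's category reach the LLM."""
    target = categorize(query)
    matches = [name for name, desc in servers.items() if categorize(desc) == target]
    return matches[:top_k]


servers = {
    "web-search": "search the web for news and articles",
    "vector-memory": "store and recall long-term memory notes",
    "csv-tools": "parse and extract data from csv files",
}
print(shortlist("find recent news about chip supply", servers))
```

Instead of hundreds of tool descriptions, the model now sees a handful, which is exactly the reduction in decision space the paragraph above describes.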

We validated JSPLIT using a “needle in a haystack” design with approximately 2,000 MCP servers from the Smithery registry. We tested 200 query-server pairs with known ground truth, varying the number of irrelevant “noise” servers from 1 to 1,000. The correct target MCP was embedded in the noise, and we measured whether the system could find and use it correctly.
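A minimal harness for this kind of evaluation might look as follows. The registry, query-server pairs, and the perfect “oracle” selector are all stand-ins for demonstration; in the real study, `select_server` would be the baseline or JSPLIT pipeline being measured.

```python
import random


def run_needle_trial(select_server, query, target, noise_count, registry):
    """Hide one correct server among `noise_count` irrelevant ones and
    check whether the pipeline still picks it."""
    noise = random.sample([s for s in registry if s != target], noise_count)
    haystack = noise + [target]
    random.shuffle(haystack)
    return select_server(query, haystack) == target


def measure_accuracy(select_server, pairs, noise_count, registry):
    """Fraction of query-server pairs resolved correctly at this noise level."""
    hits = sum(
        run_needle_trial(select_server, query, target, noise_count, registry)
        for query, target in pairs
    )
    return hits / len(pairs)


# Toy setup: 50 dummy servers, 2 ground-truth pairs, and a perfect selector.
registry = [f"mcp-server-{i}" for i in range(50)]
pairs = [("book a meeting", "mcp-server-3"), ("query the sales db", "mcp-server-7")]
lookup = dict(pairs)
oracle = lambda query, haystack: lookup[query]  # always correct, for demo only

print(measure_accuracy(oracle, pairs, noise_count=40, registry=registry))
```

Sweeping `noise_count` from 1 to 1,000 and plotting `measure_accuracy` per level reproduces the shape of the experiment described above.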

Figure 1 demonstrates the impact on accuracy. At low tool counts (under 10 servers), JSPLIT matched baseline performance, then fell slightly behind between 10 and 300 servers. It’s in high-density environments - at 1,000 noise MCP servers and beyond - that JSPLIT shows its true power.

Fig. 1: Accuracy gains in high-density MCP environments using JSPLIT


Baseline accuracy collapsed at around 100 servers, dropping to below 40% at 1,000 servers. This represents a system that has crossed from “useful with occasional errors” to “unreliable and requiring constant verification.”

By contrast, JSPLIT maintained a consistent 70% accuracy throughout, even at 1,000-server density. While not perfect, 70% accuracy represents a system users can rely on with reasonable confidence as opposed to one they must manually verify at every step.

JSPLIT also significantly decreases token cost. The difference - as shown in Figure 2 - is not trivial: the baseline cost rises steeply, while JSPLIT keeps it consistently low.

Fig. 2: Cost reduction in agentic MCP setups with JSPLIT


At 1,000 servers and above, the difference translates to a 100x token cost reduction - impressive and valuable, yet secondary to reliability. A system that costs 100x less but works only 40% of the time remains unusable. It’s the combination of cost efficiency and high accuracy that makes AI production-viable.

The Takeaway: Agentic AI reliability doesn’t degrade linearly but collapses catastrophically as a model gets overwhelmed with options. This can’t be solved with better prompts or more powerful models. Instead, structured filtering reduces the model’s decision space, keeping accuracy in a range that doesn’t require blanket verification.

Accuracy Bottleneck #2: SQL Query Generation

When On-the-Fly Computation Generates Wrong Answers

Internal chatbots that query SQL databases face a dual challenge: speed and correctness.

Typically, when an LLM receives a question that requires an SQL query for an answer, it generates SQL syntax, executes the query, validates results, and potentially retries if the query fails or produces suspicious output. This multi-step process introduces latency at every stage. More critically, every stage is also an opportunity for error.

The LLM might misinterpret the schema and query the wrong table. It might generate syntactically valid but semantically incorrect SQL - e.g. join tables incorrectly or apply wrong filters. It might make mistakes in calculations. And even when it generates correct SQL, it might misinterpret the results when formatting them into natural language.

Each error type has different severity. A query that fails syntactically wastes time but doesn’t spread misinformation; the user knows that something went wrong. A query that succeeds but returns wrong data is dangerous: the user receives an authoritative-sounding answer that happens to be factually incorrect - and might inform real-life decisions.

Real-World Case Study: SQL-to-RAG Precomputation Mitigates Query Generation Errors

We worked with a client whose internal chatbot faced both speed and accuracy issues. Response times for some SQL queries exceeded 60 seconds; more problematic were the subtle errors in the output. The chatbot would occasionally apply incorrect filters or return wrong figures, leading users astray. They couldn’t trust it without manual verification, which defeated the automation’s purpose.

Our solution was an automated ingestion pipeline that precomputed answers to frequent queries and stored them as vector embeddings for faster and more reliable retrieval.

The pipeline analyzed chat logs to identify common question patterns, executed the corresponding SQL queries, converted them into natural-language summaries, and then stored those as vector embeddings in a database. When a user asked a question, the chatbot performed semantic similarity search to retrieve relevant precomputed answers from the vector store.
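The retrieval half of that pipeline can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: the “embeddings” are toy bag-of-words vectors, and the precomputed questions and answers are invented; a production system would use a real embedding model and vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Offline: frequent questions answered once via validated SQL, then the
# natural-language summaries are embedded and stored. Contents are invented.
precomputed = {
    "total sales by region last quarter": "Q3 sales: North $1.2M, South $0.9M ...",
    "top five customers by revenue": "Top customers: Acme, Globex, ...",
}
store = [(embed(q), q, answer) for q, answer in precomputed.items()]


def retrieve(user_question: str, threshold: float = 0.3):
    """Online: answer from the vector store instead of generating SQL.
    Below the similarity threshold, return None to fall back to live SQL."""
    qv = embed(user_question)
    sim, _q, answer = max((cosine(qv, v), q, a) for v, q, a in store)
    return answer if sim >= threshold else None


print(retrieve("what were total sales by region last quarter"))
```

The threshold is the key design knob: it decides when the chatbot trusts a cached, pre-validated answer versus escalating to the slower, riskier live SQL path.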

Accuracy improvement came from three mechanisms (Figure 3):

Fig. 3: Key elements for improving AI chatbot accuracy


  1. SQL queries were generated once, validated for correctness, and tagged to facilitate future revisions as the underlying data changed.
  2. Mathematical calculations were performed offline and cached instead of relying on genAI’s limited arithmetic capabilities.
  3. A separate suite ran curated test queries based on actual user scenarios, gauging output accuracy across a spectrum from correctness to relevance, groundedness, and style. Active both in development and later in production, the suite combined LLM and human judges with a double-track scoring approach to provide granular, in-depth insights into accuracy, allowing developers to improve it in a continuous feedback loop.
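The double-track scoring in the third mechanism can be illustrated with a small aggregation pass. All dimension names mirror the ones above, but the judge scores and data shapes are invented for the example; the real suite’s scoring pipeline is not shown here.

```python
DIMENSIONS = ("correctness", "relevance", "groundedness", "style")


def aggregate(results):
    """Average each scoring dimension separately per judging track, so the
    LLM-judge and human-judge tracks can be compared side by side and a
    regression in any single dimension stays visible."""
    summary = {}
    for track in ("llm_judge", "human_judge"):
        summary[track] = {
            dim: round(sum(r[track][dim] for r in results) / len(results), 2)
            for dim in DIMENSIONS
        }
    return summary


# Invented scores for two test queries, each judged on both tracks.
results = [
    {"llm_judge": {"correctness": 0.9, "relevance": 1.0, "groundedness": 0.8, "style": 0.9},
     "human_judge": {"correctness": 1.0, "relevance": 1.0, "groundedness": 0.9, "style": 0.8}},
    {"llm_judge": {"correctness": 0.7, "relevance": 0.9, "groundedness": 0.6, "style": 1.0},
     "human_judge": {"correctness": 0.8, "relevance": 0.9, "groundedness": 0.7, "style": 0.9}},
]
print(aggregate(results))
```

Reporting per-dimension, per-track averages like this is what turns a single accuracy number into the granular feedback loop described above.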

The first tangible benefit was faster response time, which went down from over a minute to a few seconds for the cached queries. The LLM no longer needed to follow convoluted prompt templates to run and interpret SQL; instead, it relied on similarity search.

Costs also decreased, as the chatbot used far fewer tokens to interact with the vector store than it would for SQL generation. This, in turn, eased the load on the SQL database, lowering costs even further.

Average accuracy reached a point where it made sense to release the chatbot to the public. Users could trust the output enough to forgo manual verification. The chatbot crossed the reliability threshold from “an interesting prototype that requires constant checking” to “a useful tool that merits reasonable caution.”

The Takeaway: SQL query generation introduces accuracy failures at multiple stages: schema interpretation, query construction, and calculation. Precomputing frequent queries combined with continuous reliability testing minimizes these failure modes. The result is a trustworthy system that users can rely on without constant verification.

Next Steps: Building Production-Grade AI

Production AI bottlenecks require surgical engineering interventions. Framework optimization needs kernel-level profiling and refactoring. Resource contention needs fail-safe mechanisms like load shedding. Memory constraints need dimension-aware compression. Agentic AI needs structured tool filtering. SQL generation needs precomputation and validation.

These engineering problems require understanding system behavior under production conditions and implementing fixes that address root causes.

There are many reasons ML systems can stall at the infrastructure layer: performance issues, reliability concerns, accuracy degradation, and more. Whatever they are, Janea Systems’ engineers can help diagnose and fix the specific bottlenecks that block your production. Let’s talk engineer-to-engineer about what to improve and how.

Frequently Asked Questions

Why does low AI accuracy undermine ROI?

Low accuracy forces users to verify outputs manually, which wipes out time savings and reduces adoption. Once trust drops, people go back to manual work.

Why does agentic AI accuracy collapse at scale?

As more tools and MCP servers are added, the model has to sort through more irrelevant options. That overload hurts decision-making, lowers accuracy, and makes the system harder to trust.

How do SQL-querying chatbots work, and how can they be made more reliable?

They typically translate a user’s question into SQL, run the query, check the results, and then turn the output into natural language. That process can introduce errors at several stages, so a more reliable approach is to precompute common queries, validate them once, cache calculations, and use similarity search to retrieve trusted answers faster.
