...

Up to 100x Token Cost Reduction In Agentic AI With JSPLIT

November 14, 2025

By Hubert Brychczynski

  • Model Context Protocol,

  • Agentic AI,

  • Artificial Intelligence

...

LLMs are evolving from simple conversational tools into semi-autonomous agentic systems via the Model Context Protocol (MCP). MCP lets language models connect to external tools and data sources through a standardized protocol. By communicating with MCP servers, the model gains capabilities such as accessing files, browsing the web, and managing email.

MCP provides the model with tool descriptions and specifications, which are added to the context for use during prompting. As the tool library grows to hundreds or thousands of servers, prompts balloon, accuracy drops, token costs skyrocket, and the model fails, inundated by irrelevant information.

We built JSPLIT, a taxonomy-driven framework, to solve this (read the research).

JSPLIT organizes MCP servers into hierarchical categories and selects only relevant tools for each query. This can reduce input token costs by up to 100x in high-density environments (e.g., 1,000 MCP servers) while stabilizing accuracy at around 70% throughout.

This article explains the problem and our solution.

The MCP Opportunity: A Blessing and A Curse

LLMs on their own are text-based interfaces that pass for humans in conversations on topics ranging from nuclear fusion to gardening and literary criticism. Impressive. But business-wise, they’re essentially a resource-intensive parlor trick with limited viable applications. Common use cases include virtual assistants, erratic research aides, and other specialized chat interfaces.

Model Context Protocol promises to increase the utility – and profitability – of generative AI by giving it more agency. Hence the term “agentic AI”, which denotes an architecture that enables LLMs to take autonomous action through integration and interaction with external tools. Businesses are taking note: 75% of enterprise leaders see disruptive transformation potential (PwC 2025), and 14% of organizations have already deployed agents in production (Capgemini 2025).

Backed by OpenAI, Microsoft, and Google, MCP eliminates the need for custom integrations by standardizing tool access through dedicated servers using JSON-RPC 2.0. 

When an LLM receives a query, it combs through MCP servers to find the right tool for the task. Each server aids the decision process by adding contextual information (i.e. tool descriptions) to the prompt. While reasonable, it’s also the root of the problem. Ideally, you want AI to handle as many operations as possible; but the more servers the LLM has to check, the heftier the prompt becomes.

This phenomenon is known as prompt bloating. What works with 5 tools becomes unwieldy at 50 and breaks down at 500. Why? LLM self-attention scales quadratically with prompt length: if a prompt merely doubles, attention compute and memory roughly quadruple. Ironically, the mechanism that enables agentic AI also fundamentally constrains it.
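
To make the scaling concrete, here is a toy calculation under the simplifying assumption that attention cost grows with the square of prompt length; the token counts are arbitrary and not measurements of any particular model:

```python
# Toy illustration of quadratic attention scaling: doubling the prompt
# roughly quadruples the attention work. Numbers are arbitrary assumptions,
# not measurements of any specific model.

def relative_attention_cost(prompt_tokens: int, baseline_tokens: int = 2_000) -> float:
    """Attention cost relative to a baseline prompt, assuming O(n^2) scaling."""
    return (prompt_tokens / baseline_tokens) ** 2

for tokens in (2_000, 4_000, 20_000, 100_000):
    print(f"{tokens:>7} tokens -> ~{relative_attention_cost(tokens):.0f}x baseline attention cost")
```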

The Hidden Costs of Prompt Bloating

Prompt bloating compounds across three dimensions as you scale.

Financial Impact

Token costs are calculated on total tokens processed. Suppose an LLM runs through 100 MCP servers. That’s 20,000 tokens injected per query, assuming a conservative number of 200 tokens per description. At production scale (thousands of queries daily), that means you’re burning money just to process tool descriptions – even before the LLM actually gets to the task. 
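
As a back-of-envelope illustration, the snippet below turns that arithmetic into a daily overhead figure; the query volume and per-token price are hypothetical assumptions, not numbers from our study:

```python
# Rough estimate of the token overhead from injected tool descriptions.
# Query volume and per-token price are illustrative assumptions.

servers = 100                       # MCP servers whose descriptions are injected
tokens_per_description = 200        # conservative average, per the example above
queries_per_day = 5_000             # hypothetical production volume
usd_per_million_input_tokens = 2.0  # assumed input-token price

overhead_tokens_per_query = servers * tokens_per_description  # 20,000
daily_overhead_tokens = overhead_tokens_per_query * queries_per_day
daily_overhead_usd = daily_overhead_tokens / 1_000_000 * usd_per_million_input_tokens

print(f"{overhead_tokens_per_query:,} overhead tokens per query")
print(f"${daily_overhead_usd:,.2f} per day spent only on tool descriptions")
```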

Latency Spike

Excessive tool descriptions don’t only increase cost and memory use, but also latency. Response times become unpredictable as the LLM has to sort through a lot of irrelevant information, making production SLAs difficult to maintain.

Accuracy Drop

Most critically, prompt bloating degrades accuracy. Description-heavy prompts expose models to contradictory, obsolete, or irrelevant information, causing “cognitive decline” and a failure to focus on the task. As the tool count grows, accuracy deteriorates. High-density conditions with hundreds of servers involved can bring it down to below 40%.

Until recently, there was no effective workaround. Context summarization doesn't help because tool descriptions are already terse; sliding windows keep only the most recent context, which jeopardizes the model’s ability to select the right tool; and semantic prioritization still requires expensive computation across the full tool set.

This isn't a text management problem. It's a structured selection problem.

Introducing JSPLIT: Taxonomy-Driven Context Management

To address this challenge, the Research and Development team at Janea Systems developed JSPLIT: a taxonomy-driven framework for intelligent MCP server selection. Rather than injecting all tool descriptions, JSPLIT organizes servers into hierarchical categories, analyzes each query's semantic intent, and includes only relevant tools in the context.

The approach is explainable and debuggable. JSPLIT follows a two-phase process: (1) classify the user query into relevant taxonomy categories, (2) map categories to specific MCP servers (Figures 1 and 2). This allows us to trace exactly why a tool was selected or excluded.

The core idea behind JSPLIT is to capture the functional scope of MCP servers by organizing them into broad, function-oriented categories with human-readable descriptions. These include "Search and Information Retrieval," "Memory and Knowledge Management," and "Data Extraction and Manipulation".

The use of natural language in category descriptions facilitates semantic matching between user intent and tool capabilities, shrinking the context to only the most relevant subset of MCP servers. In high-density scenarios, this approach results in leaner prompts, lower costs, and stabler performance.
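
As a rough sketch, a JSPLIT-style taxonomy entry might look like the structure below; the category names follow the article, but the field names, descriptions, and server identifiers are illustrative assumptions rather than the exact schema from the paper:

```python
# Minimal sketch of a hierarchical taxonomy with human-readable descriptions.
# Field names, descriptions, and server IDs are illustrative assumptions.

taxonomy = {
    "search_information_retrieval": {
        "name": "Search and Information Retrieval",
        "description": "Tools that look up facts, documents, or web content.",
        "children": {
            "web_search": {
                "name": "Web Search",
                "description": "General-purpose web and news search services.",
                "children": {},
                "mcp_servers": ["brave-search", "serper-search"],  # hypothetical IDs
            },
        },
        "mcp_servers": [],
    },
    "memory_knowledge_management": {
        "name": "Memory and Knowledge Management",
        "description": "Tools that store and retrieve personal or project knowledge.",
        "children": {},
        "mcp_servers": ["memory-bank"],  # hypothetical ID
    },
}
```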

Our validation demonstrated significant input token cost reduction with consistent accuracy in high-density conditions. More specifically, the gains are proportional to the number of MCPs: 1,000 noise servers translate to a 100x cost reduction.

How JSPLIT Works: Technical Breakdown

JSPLIT consists of three components: the Hierarchical Taxonomy (classification structure), the Taxonomy-MCPResolver (selection logic), and the LLM Call Loop (task resolution).

Phase 1: Taxonomy Selection

When a query enters the system, JSPLIT preprocesses the hierarchical taxonomy to retain only categories with at least one associated MCP currently available. This filtered structure is formatted as a tree and inserted into a classification prompt template, which instructs an LLM to select the three most relevant leaf-level categories. The LLM's output is parsed to extract valid taxonomy identifiers (Figure 1).
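
A minimal sketch of this phase is shown below; it assumes the taxonomy structure from the earlier snippet and a generic `call_llm(prompt) -> str` helper (hypothetical; any chat-completion client would do), not JSPLIT's exact prompt templates:

```python
# Sketch of the taxonomy-selection phase: prune unavailable categories,
# render the tree into a classification prompt, and parse valid identifiers.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def prune(taxonomy: dict, available: set) -> dict:
    """Keep only categories with at least one available MCP server beneath them."""
    kept = {}
    for cat_id, cat in taxonomy.items():
        children = prune(cat.get("children", {}), available)
        servers = [s for s in cat.get("mcp_servers", []) if s in available]
        if children or servers:
            kept[cat_id] = {**cat, "children": children, "mcp_servers": servers}
    return kept

def as_tree(taxonomy: dict, depth: int = 0) -> str:
    """Render the pruned taxonomy as an indented tree for the prompt."""
    lines = []
    for cat_id, cat in taxonomy.items():
        lines.append("  " * depth + f"- {cat_id}: {cat['description']}")
        child_text = as_tree(cat["children"], depth + 1)
        if child_text:
            lines.append(child_text)
    return "\n".join(lines)

def classify(query: str, taxonomy: dict, available: set, call_llm) -> list:
    """Ask the LLM for the three most relevant leaf categories and validate them."""
    pruned = prune(taxonomy, available)
    valid_ids = set()
    def collect(tree):
        for cat_id, cat in tree.items():
            valid_ids.add(cat_id)
            collect(cat["children"])
    collect(pruned)
    prompt = (
        "Select the three leaf categories most relevant to the query below.\n\n"
        f"{as_tree(pruned)}\n\n"
        f"Query: {query}\n"
        "Answer with comma-separated category identifiers only."
    )
    answer = call_llm(prompt)
    return [c.strip() for c in answer.split(",") if c.strip() in valid_ids]
```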

Fig. 1: Taxonomy classification phase.

Phase 2: MCP Selection and Ranking

The resolver retrieves MCPs mapped to the identified category or categories. If only one matches, it's selected directly. If multiple candidates exist, the system generates a ranked-list prompt describing each server in a summarized form and asks the LLM to rank options. The top-k selections are validated and assembled (Figure 2).
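
The resolver logic can be sketched as follows; the prompt wording, server summaries, and `call_llm` helper are assumptions for illustration, not JSPLIT's exact implementation:

```python
# Sketch of MCP selection and ranking: select directly when only one server
# matches, otherwise ask the LLM to rank summarized candidates and validate
# the top-k answer. `call_llm` is a hypothetical chat-completion helper.

def resolve_mcps(categories: list, category_to_servers: dict,
                 server_summaries: dict, call_llm, top_k: int = 3) -> list:
    # Gather every candidate server mapped to the selected categories.
    candidates = sorted({s for c in categories for s in category_to_servers.get(c, [])})
    if len(candidates) <= 1:
        return candidates  # zero or one match: no ranking needed

    listing = "\n".join(
        f"{i + 1}. {server}: {server_summaries.get(server, 'no summary')}"
        for i, server in enumerate(candidates)
    )
    prompt = (
        f"Rank these MCP servers by relevance and return the top {top_k} names, "
        f"comma-separated:\n{listing}"
    )
    ranked = [s.strip() for s in call_llm(prompt).split(",")]
    return [s for s in ranked if s in candidates][:top_k]  # keep only valid picks
```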

Fig. 2: MCP selection phase.

Phase 3: Task Resolution Loop

Selected MCPs and the user query pass to the task-execution LLM. At each iteration, the LLM either answers directly or invokes tools from the selected servers. Tool outputs update the context. The loop continues until an answer is generated or the maximum number of iterations is reached. The final output includes the LLM’s answer (if generated), the list of MCP servers involved, and the tokens used.
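
A condensed sketch of the loop follows; `call_llm_with_tools` stands in for a tool-calling chat completion against the selected servers, and the shape of its reply is an assumption made for illustration:

```python
# Sketch of the task-resolution loop: answer directly or invoke tools,
# feed tool outputs back into the context, and stop at an answer or the
# iteration cap. `call_llm_with_tools` and its reply shape are assumptions.

def run_task(query: str, selected_servers: list, call_llm_with_tools,
             max_iterations: int = 10) -> dict:
    context = [{"role": "user", "content": query}]
    total_tokens = 0
    for _ in range(max_iterations):
        reply = call_llm_with_tools(context, selected_servers)
        total_tokens += reply.get("tokens_used", 0)
        if reply.get("tool_calls"):
            # Append tool results so the next iteration can build on them.
            context.extend(reply["tool_results"])
            continue
        return {"answer": reply["content"],
                "mcp_servers": selected_servers,
                "tokens_used": total_tokens}
    # Iteration cap reached without a final answer.
    return {"answer": None, "mcp_servers": selected_servers, "tokens_used": total_tokens}
```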

Taxonomy v1 and v2

We developed two taxonomy versions. Taxonomy v1 featured eight top-level categories and served as our learning foundation. Taxonomy v2 was expanded to include a total of eleven categories: Search and Information Retrieval, Memory and Knowledge Management, Simulation and Planning, Navigation and Mapping, Data Extraction and Manipulation, System and Device Control, Communication and Interaction, Specialized Domains, Developer Tools, Multi-Domain Orchestration, and Other.

We developed the second taxonomy because early testing showed insufficient performance at higher workloads. To address this, we expanded the number of categories in v2, refined the definitions to ensure classification consistency, and included fallback categories to handle edge cases.

Validation Results: Up to 100x Cost Reduction, Stabilized Accuracy

We validated JSPLIT using a "needle in a haystack" design: the correct target MCP server was embedded among randomly sampled, irrelevant "noise" servers. We tested with approximately 2,000 MCP servers from the Smithery registry and 200 query-server pairs with known ground truth, varying noise from 1 to 1,000 servers.
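
The setup can be approximated with the sampling sketch below; the registry loader and server names are hypothetical placeholders, and the evaluation details are simplified:

```python
# Sketch of the "needle in a haystack" setup: one ground-truth server hidden
# among randomly sampled noise servers at a given density. Server names and
# the registry loader are hypothetical placeholders.

import random

def build_trial(target_server: str, registry: list, noise_count: int,
                seed: int = 0) -> list:
    """Return a shuffled server pool: the target plus `noise_count` noise servers."""
    rng = random.Random(seed)
    noise_pool = [s for s in registry if s != target_server]
    noise = rng.sample(noise_pool, noise_count)
    pool = noise + [target_server]
    rng.shuffle(pool)
    return pool

# Example: vary density for one query-server pair.
# registry = load_smithery_servers()   # hypothetical loader, ~2,000 entries
# for n in (1, 10, 100, 250, 500, 1000):
#     pool = build_trial("weather-mcp", registry, n)   # hypothetical target
```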

Dramatic Token Cost Reduction

Baseline cost exploded from the start, rising steeply with the number of noise servers. JSPLIT kept the cost consistently low, between $0.30 and $0.80 throughout. With 1,000 noise servers, the token cost reduction exceeded 100x compared to the baseline (Figure 3).

Fig. 3: Token cost comparison.

Accuracy Stabilization at Scale

JSPLIT matched baseline performance at low tool counts (<10 servers) and fell slightly behind at 10-250 servers. But as the server pool grew into the hundreds, results diverged dramatically:

  • Baseline accuracy deteriorated to below 40%
  • JSPLIT (Taxonomy v2) stabilized at around 70% accuracy (Fig. 4)
Fig. 4: Accuracy comparison.

This 30-point gap demonstrates that structured pruning is critical for maintaining agent effectiveness in high-complexity environments.

Taxonomy v2 showed comparable performance to v1 below 5 servers but improved thereafter, reinforcing the hypothesis that clearer category definitions reduce classification ambiguity.

Model Selection Trade-offs

We tested JSPLIT’s inner loop with three LLMs: two API models (GPT-4.1-mini, GPT-4.1) and one local (Qwen3-8B-AWQ). The API models achieved comparable accuracy, whereas Qwen3-8B-AWQ showed substantial drops in accuracy (Figure 5).

Fig. 5: Performance of JSPLIT’s inner loop depending on the LLM involved.

For production, we recommend small API models like GPT-4.1-mini to balance cost with reliability.

Limitations and Error Analysis

Our error analysis in high-density conditions (1,000 irrelevant MCPs) revealed patterns worth understanding for deployment decisions.

Classification confusion occurs between overlapping categories. Memory and Knowledge Management is frequently confused with Search and Information Retrieval and with Multi-Domain Orchestration. This most likely reflects an overlap between tools for storing and retrieving personal knowledge and general search services.

The Search category is disproportionately selected, reflecting its semantic similarity to categories centered on data access, content lookup, or general retrieval queries.

The Specialized Domains category suffers misclassification because industry-specific tools (such as those used in finance or healthcare) often reuse common backends, making them hard to distinguish without domain-specific context.

Static initial annotation currently limits classification adaptability in dynamic environments.

Moving Forward: Taxonomy v3

The limitations we have identified above prompted us to start developing Taxonomy v3, which will introduce more independent categories, separate domain as its own classification dimension, and support real-time classification mechanisms.

Ready to Scale Your Agent System?

Prompt bloating compounds with scale. What works at 10 tools fails at 100. JSPLIT provides a proven solution: 100x cost reduction and 70% accuracy stabilization at 1,000-server density. Find out more in our recent paper: JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol.

Work With Janea Systems

If you're building production agentic AI at scale, need deeply technical solutions to complex challenges, or value autonomous senior engineers solving novel problems, let's talk.

Janea Systems specializes in core technologies for software-centric products and internal tooling. We work with engineering leaders who have fast-moving, deeply technical charters and need reliable, surgical engineering resources.

Prompt bloating won't solve itself as your ecosystems grow. With structured, explainable approaches like JSPLIT, we can build agents that scale without sacrificing reliability.

Frequently Asked Questions

What is prompt bloating?

Prompt bloating occurs when a model’s context becomes overloaded with hundreds of tool descriptions from Model Context Protocol (MCP) servers. As the context grows, token costs rise sharply, response latency increases, and model accuracy declines.

What is JSPLIT?

JSPLIT is a taxonomy-driven framework developed by Janea Systems to manage tool selection in agentic AI. It organizes MCP servers into structured categories and narrows the active context to only those relevant to each query.

How does JSPLIT reduce token costs while preserving accuracy?

JSPLIT classifies each query against a hierarchical taxonomy, selects only matching MCP servers, and injects their descriptions into the prompt. This structured pruning cuts token costs by up to 100x at 1,000-server density while keeping accuracy stable around 70%.
