December 23, 2025
By Hubert Brychczynski
Load Shedding,
Database Bottleneck,
SRE,
Chatbot Uptime,
High Traffic Chatbots

Tomorrow's the big day. You have been working on a new service for over a year and it is finally ready to roll out. A lot of resources went into the research, development, and marketing. Now, a flock of potential users eagerly awaits the launch.
Something keeps you awake. What if the system can't handle a sudden surge in traffic? Sure, you can scale, but that's not a viable long-term strategy. There needs to be a better solution once you hit a scaling wall.
Under constrained resources, saying “no” is often the best strategy. When more people knock on your door than you are able to fit in, what should you do? The answer: turn away a few guests and let the others in. In software engineering, this principle is known as "load shedding".
"Load shedding" may sound cryptic. Essentially, though, it's a fancy name for something so ubiquitous that we could come up with a hundred different analogies for it. Can't book a seat on a train? Concert tickets sold out? Won't a restaurant take a reservation? You're being load shed.
When limited capacity meets surplus demand, load shedding fulfils the maximum acceptable number of requests while delaying or bouncing the rest. This approach protects available resources, ensures operational stability, and keeps most customers satisfied.
Importantly, load shedding makes sense as long as the surplus in demand is occasional. If it persists, resources should grow to accommodate the new reality. Trains should run more often with more carriages; concerts should move to bigger halls; restaurants should get extended; and cloud storage should expand. A consistent rise in business justifies the expenses; load shedding keeps your budget in check before that.
As an architectural strategy, load shedding dates back to electrical engineering, where it was used to keep power grids operational during periods of high demand. Today, software engineers implement load shedding as a failsafe mechanism to uphold Service Level Objectives (SLOs) in the face of anomalous traffic spikes.
Load shedding in software engineering applications consists of four main elements. A load balancer or proxy processes requests; a monitoring system tracks performance metrics such as CPU and memory; a threshold detector initiates shedding when throughput limits are exceeded; and, optionally, a request classifier sorts requests into essential and non-essential categories, shedding only the latter (Figure 1).
Fig. 1: Load shedding mechanism explained
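As a rough illustration of how these pieces fit together, here is a minimal Python sketch of a threshold detector and an optional request classifier guarding a handler. The thresholds, the `priority` field, and the `process` stub are illustrative assumptions rather than a prescribed implementation.

```python
import psutil  # third-party library that exposes CPU and memory metrics

# Illustrative thresholds; real values would come from benchmarking.
CPU_LIMIT_PERCENT = 85.0
MEMORY_LIMIT_PERCENT = 90.0

def is_overloaded() -> bool:
    """Threshold detector: compare monitored metrics against the limits."""
    cpu = psutil.cpu_percent(interval=0.1)    # short sampling window
    mem = psutil.virtual_memory().percent     # current memory usage
    return cpu > CPU_LIMIT_PERCENT or mem > MEMORY_LIMIT_PERCENT

def is_essential(request: dict) -> bool:
    """Optional request classifier: keep essential traffic, shed the rest.
    The 'priority' field is a hypothetical request attribute."""
    return request.get("priority") == "high"

def process(request: dict) -> dict:
    """Stand-in for the normal processing path behind the proxy."""
    return {"status": 200, "body": "ok"}

def handle(request: dict) -> dict:
    """Entry point: shed non-essential requests when thresholds are exceeded."""
    if is_overloaded() and not is_essential(request):
        return {"status": 503, "body": "Try again later"}
    return process(request)
```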
Recently, our team implemented load shedding for a domain-specific chatbot. We worried the entire system would crash under production-level traffic if the load exceeded capacity. Early tests confirmed the concern: once requests exceeded a certain threshold, the error rate rose to nearly 100%.
The database was the culprit: too many parallel connections caused query execution times to balloon. Under heavy load, transactions exceeded the three-minute timeout, and users hit a wall. We needed a way to keep serving some users even at maximum capacity. Load shedding, focused squarely on protecting the database bottleneck, became our safety boundary.
Before settling on load shedding, we considered other approaches. Scaling was our first instinct, but budgetary constraints ruled it out as the primary solution. Queuing requests was another option: we could store incoming queries and process them later. That, however, proved impractical in our use case. If the chatbot didn't respond within three minutes, users lost interest and started refreshing or opening new chats, compounding the overload. Returning answers a day later wasn't an option either; in all likelihood, users would never log back in.
Load shedding offered the cleanest solution. Instead of queuing requests until the system buckled, we rejected excess traffic instantly with a polite "try again later" message. This kept the system responsive for users within capacity and prevented the cascade of retries that would have brought everything down.
We started by calibrating the load-shedding threshold using stress tests. Initially, the client expected the chatbot to handle 300 queries per minute without degrading performance. That was our Service Level Objective. We ramped traffic incrementally: from 0 to 10, 20, 40, 80, and finally 150 queries per minute, in fifteen-minute intervals. Without load shedding, the system began failing at approximately 40-50 requests per minute after just three minutes of sustained load. That breaking point confirmed the database as the bottleneck.
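A stripped-down version of such a ramp can be scripted in a few lines. The sketch below assumes a hypothetical `/chat` endpoint and uses the `requests` library; the real tests also tracked latency, database connections, and resource metrics at each step.

```python
import time
import requests  # third-party HTTP client

ENDPOINT = "https://chatbot.example.com/chat"   # hypothetical endpoint
RAMP_QPM = [10, 20, 40, 80, 150]                # queries per minute, as in the ramp above
STEP_SECONDS = 15 * 60                          # fifteen-minute steps

for qpm in RAMP_QPM:
    errors = total = 0
    interval = 60.0 / qpm                       # spacing between requests
    step_end = time.time() + STEP_SECONDS
    while time.time() < step_end:
        total += 1
        try:
            resp = requests.post(ENDPOINT, json={"query": "ping"}, timeout=180)
            if resp.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
        time.sleep(interval)
    print(f"{qpm} qpm -> error rate {errors / total:.1%}")
```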
After scaling database resources, the chatbot reliably handled 150 to 200 queries per minute. We presented the benchmark to the client, who accepted the validated capacity as the new SLO. Load shedding would keep the system operational while the threshold remained adjustable: if traffic patterns changed over time, we would recalibrate.
With capacity established, we needed a mechanism to enforce it. We settled on a semaphore: a counter initialized to the maximum number of parallel database connections, a figure taken directly from our benchmarking results.
The semaphore sits in front of database connection retrieval. Every incoming request must acquire a permit from it. If connections are available, the request is routed to the database and the count decreases by one. Once the transaction completes, the count increases again, freeing capacity for the next request in line (Figure 2).
Fig. 2: Semaphore implementation
If the counter hits zero, meaning the system is already handling its maximum concurrent load, the next request doesn't wait in line. Instead, the system immediately throws an exception and returns an error message such as "Try again later". This instant rejection is the core of load shedding. Rather than letting excess requests pile up and choke the database, we turn them away at the door, preserving responsiveness for those already inside.
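The mechanism can be illustrated with a short Python sketch. This is a simplified, hypothetical version: the connection limit, the `run_query` helper, and the exception type are stand-ins for our actual implementation, but the non-blocking acquire and the immediate rejection follow the same pattern.

```python
import threading

MAX_PARALLEL_CONNECTIONS = 50  # illustrative; the real limit came from benchmarking

# Counts down as requests acquire database slots and back up as they finish.
db_slots = threading.BoundedSemaphore(MAX_PARALLEL_CONNECTIONS)

class OverCapacityError(Exception):
    """Raised when the system is already serving its maximum concurrent load."""

def run_query(query: str) -> str:
    """Stand-in for the real database call."""
    return f"answer to {query!r}"

def handle_query(query: str) -> str:
    # Non-blocking acquire: if no slot is free, shed the request immediately
    # instead of letting it queue behind the database bottleneck.
    if not db_slots.acquire(blocking=False):
        raise OverCapacityError("Try again later")
    try:
        return run_query(query)
    finally:
        db_slots.release()  # free the slot for the next request in line
```

In practice, the exception is caught at the API layer and translated into a response such as HTTP 503, so clients know the rejection is temporary and can simply retry.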
The improvement was dramatic. Before load shedding, overload led to a total collapse with a near-100% error rate. After implementation, the figure dropped below 1%. Excess requests were gracefully turned away with an error message, while accepted requests received timely responses well within the three-minute limit. Users who got through never realized the chatbot was operating near its ceiling. Those who didn't could simply try again a moment later, rather than staring at a frozen screen or a cryptic failure message.
In the end, load shedding transformed a brittle system into a resilient one. All it took was knowing when to say "no".
The chatbot project described above is one example of how Janea Systems approaches AI/ML optimization. Our team has a track record of delivering measurable performance gains across the stack, from model training to inference to cost management.
Here are three more examples:
JECQ is Janea Systems’ open-source compression library for FAISS users. By matching compression level to the statistical relevance of each dimension, JECQ achieves a 6x reduction in memory footprint while maintaining 85% search accuracy. It’s an ideal solution for enterprise RAG systems and edge AI deployments where storage and compute are at a premium.
In agentic AI, “prompt bloating” arises as MCP servers crowd the context window with their tool descriptions, driving up token use. JSPLIT by Janea Systems solves this problem with taxonomy-based filtering: it organizes available tools into a hierarchical structure and parses the user's query to select only the most relevant subset. In benchmarks with hundreds of MCP servers, JSPLIT reduced input token costs by over two orders of magnitude compared to baseline approaches.
For Microsoft's Bing Maps team, we optimized the deep learning pipeline behind geocoding and query annotation. Our engineers refactored TensorFlow and PyTorch implementations, automated error correction workflows, and streamlined batch processing. The outcome: 50x faster TensorFlow execution, 7x faster training runs, 2x higher batch throughput, and a 30% speedup on dual-GPU pipelines.
Ready to optimize your AI infrastructure? Whether you're facing throughput bottlenecks, runaway token costs, or training pipelines that can't keep pace with your roadmap, Janea Systems can help. Get in touch to discuss how we can accelerate your AI/ML operations.
Load shedding is the practice of rejecting excess requests during traffic spikes so your service stays responsive within capacity. Use it when demand occasionally overwhelms the system and you need to protect SLOs without scaling indefinitely.
Set the threshold with stress testing and capacity benchmarking. Increase traffic in steps and identify the point where failures begin. Use the validated “safe” throughput and concurrency as your SLO baseline, then tune the shedding limit to match real production behavior.
A practical approach is to use a semaphore-based concurrency limit before a bottleneck. Once the concurrency limit is reached, the service immediately returns a fast “try again later” response rather than queuing, preventing retry storms and avoiding cascading failures.
Ready to discuss your software engineering needs with our team of experts?