February 12, 2026
By Hubert Brychczynski

Richard Dawkins coined the term "meme" to describe information that propagates through human minds the way genes propagate through populations. In technology, memes often spread faster than our capacity for critical thinking. Terms like "big data," "real-time," or "AI-first" can become contagious and cloud judgment as easily as they inspire it.
This article examines a specific pattern we've observed time and again: organizations deploying expensive, complex data infrastructure for problems that don't require it. The consequences include wasted budget, misplaced engineering resources, and delayed delivery. Face-value assumptions and imprecise language are often the culprits.
Certain phrases can spell trouble in technology meetings. "Big data," "real-time processing," "what if it scales," or "AI-powered" often justify significant investments with unclear return. We should be more probing instead, asking questions like: Do all data need distributed infrastructure? What does "real-time" mean for this use case? Is preparing for remote possibilities worth the engineering cost? Is this complex solution more efficient than simpler alternatives?
Asking these questions, however, might come across as confrontational. After all, it’s easier to approve the budget, check a box, or draw an upward chart. Examining assumptions, on the other hand, might expose poor decision-making and cause a stir. But if a project is in its early stages, the benefits of turning back will usually outweigh the risks.
Need a starting point? Scrutinize your database infrastructure. Chances are you’re splurging on big data for no reason at all.
Most technology leaders can't precisely define when "regular data" becomes "big data." Ask ten engineers and you'll get ten different answers based on volume, velocity, variety, or organizational context.
For the purpose of this article, we'll use a framework proposed by MotherDuck, a serverless cloud data warehouse.
MotherDuck provides both a quantitative and a functional definition of Big Data. Quantitatively speaking, they consider data "big" when it takes up more than 10TB in storage or when scan sizes exceed 1TB. In practical terms, the moment a single machine can no longer process your data reliably tends to signal the transition to a Big Data environment.
Following this logic, you can determine whether you have Big Data by asking a single question: can one machine still process your data reliably?
If the answer is no, congratulations. You belong to an elite five percent of companies that have big data.
Yes, you heard that right. Going by the definitions above, only one in twenty companies deserves the Big Data title.
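If you want a quick sanity check against your own warehouse, a back-of-the-envelope sketch might look like the one below. The 10TB storage and 1TB scan thresholds come from the MotherDuck definition above; the storage and scan figures are placeholders to swap for your own measurements.

```python
# Rough sketch: classify a warehouse as "Big Data" using the thresholds
# discussed above (10 TB total storage, 1 TB typical scan size).
# The measurements below are hypothetical placeholders; plug in your own.

TB = 1024**4  # bytes in a terabyte (binary convention)

STORAGE_THRESHOLD_BYTES = 10 * TB  # MotherDuck's storage threshold
SCAN_THRESHOLD_BYTES = 1 * TB      # MotherDuck's scan-size threshold

# Replace with real numbers, e.g. pulled from your warehouse's system tables.
total_storage_bytes = 1.8 * TB      # everything you store
typical_scan_bytes = 40 * 1024**3   # what a typical heavy query reads (40 GB)

is_big_data = (
    total_storage_bytes > STORAGE_THRESHOLD_BYTES
    or typical_scan_bytes > SCAN_THRESHOLD_BYTES
)

print(f"Storage: {total_storage_bytes / TB:.2f} TB, "
      f"typical scan: {typical_scan_bytes / TB:.3f} TB")
print("Big Data territory" if is_big_data else
      "A single machine can almost certainly handle this")
```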
How do we know?
In 2024, Amazon Science published a paper called "Why TPC is not enough: An analysis of the Amazon Redshift fleet". To support their research, the authors released a massive historical dataset of more than 500 million queries executed on 32 million Redshift tables across a single quarter. MotherDuck analyzed the queries in that dataset to infer database size distributions across the entire fleet. The results are plotted in Figure 1.
Fig. 1: Database size distribution
Taken together with the thresholds above, the distribution means that only about 5% of databases hold Big Data (Figure 2).
Fig. 2: Databases that qualify as Big Data
Now we know that only one in every twenty databases is big. But how often is all that data actually queried?
Based on the Redshift data, only 1% of all queries in Big Data environments run against >10TB of data. What’s worse, that 1% represents the number of queries, not the volume. When we do query a Big Data table, how much of it do we read? The answer: less than 0.5%, on average.
So here’s what's happening, assuming data is "big" when it exceeds 10TB in size: only about one database in twenty crosses that line, only about 1% of queries in those environments touch more than 10TB, and even those queries read less than 0.5% of the table on average.
So when somebody says "we have Big Data", they're usually saying that they store a lot of data they rarely query and barely read.
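To make the arithmetic concrete, here is a toy illustration - not MotherDuck's actual methodology - of how those two ratios fall out of a query log. The log format and numbers are invented; the point is that two simple aggregations tell you how often queries touch genuinely big tables and how much of those tables they actually read.

```python
# Toy illustration of the two ratios discussed above, computed from a
# hypothetical query log. Column names and numbers are made up; the real
# Redshift fleet analysis is far more involved.
import pandas as pd

TB = 1024**4

# Each row: one query, the bytes it scanned, and the size of the table it hit.
log = pd.DataFrame({
    "bytes_scanned":    [2e9, 5e8, 3e10, 60 * 1024**3, 1e9, 8e8],
    "table_size_bytes": [5e10, 5e10, 2e11, 15 * TB,    5e10, 5e10],
})

# Ratio 1: what share of queries run against tables larger than 10 TB?
hits_big_table = log["table_size_bytes"] > 10 * TB
print(f"Queries against >10 TB tables: {hits_big_table.mean():.1%}")

# Ratio 2: when a query does hit a big table, how much of it does it read?
frac_read = (log.loc[hits_big_table, "bytes_scanned"]
             / log.loc[hits_big_table, "table_size_bytes"])
print(f"Average fraction of a big table actually scanned: {frac_read.mean():.2%}")
```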
When companies deploy Big Data infrastructure without Big Data, they waste resources in three different ways. First, there's the direct financial hemorrhage. Cloud providers charge by the hour for distributed computing resources. Run them 24/7 on data that fits in an Excel file, and you're essentially lighting money on fire.
Second, there's the engineering tax. Distributed systems are complex. They require specialized knowledge to be configured, maintained, and debugged. That means hiring expensive talent or training existing engineers on technologies they'll never use to full capacity.
Third, and perhaps most insidious, there's the opportunity cost. "What if it scales?" becomes a religion. Teams spend months architecting for problems they don't have, building space shuttles when all they need is a bike. Meanwhile, competitors who asked "why?" instead of "what if?" ship faster, iterate quicker, and capture more market share. And should they get admitted to the Big Data club down the line, modern AI-driven development can bring their systems up to scale in a few weeks for a couple hundred bucks in token costs.
One of our senior engineers witnessed this pattern repeatedly across multiple organizations. Let's examine two real-world examples that illustrate how easily companies fall into the Big Data trap.
A company was running a data pipeline on AWS that cost approximately $30,000 per month. The architecture was textbook Big Data: AWS Glue streaming jobs moving data around 24 hours a day, 7 days a week. It looked impressive on paper, yet it transferred roughly 11,000 rows of data per day.
To put that in perspective, 11,000 rows of data fits on a floppy disk. Yet the chosen architecture - AWS Glue streaming with real-time processing - was the most expensive option in AWS's data service catalog.
The situation became even more absurd when our engineer investigated further. It turned out the 11,000 daily rows supported office workers using Salesforce during business hours. No one worked at 4 AM on Saturday. There was no business justification for 24/7 operation.
The solution was simple: implement an on/off schedule. Switch off the pipeline at night and on weekends. The company went through with the change, and the results were instant: cost dropped by 60%, saving roughly $18,000 per month. No re-architecture needed. No complex migration. Just common sense applied to an existing system.
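For teams in a similar spot, the fix can be as small as the sketch below: a single function, invoked by two scheduled EventBridge rules, that starts the Glue job on weekday mornings and stops it at night and on weekends. This is a hedged illustration, not the client's actual code; the job name, the event payload, and the schedule are placeholder assumptions.

```python
# Minimal sketch of an on/off schedule for an AWS Glue streaming job,
# assuming this function is invoked by two EventBridge cron rules
# (one carrying {"action": "start"}, one carrying {"action": "stop"}).
# The job name and payload shape are hypothetical placeholders.
import boto3

glue = boto3.client("glue")
JOB_NAME = "salesforce-sync-streaming"  # hypothetical job name


def handler(event, context):
    action = event.get("action")  # "start" or "stop", set on each cron rule

    if action == "start":
        glue.start_job_run(JobName=JOB_NAME)
        return "started"

    if action == "stop":
        # Find any active runs and stop them for the night / weekend.
        runs = glue.get_job_runs(JobName=JOB_NAME)["JobRuns"]
        active = [r["Id"] for r in runs if r["JobRunState"] == "RUNNING"]
        if active:
            glue.batch_stop_job_run(JobName=JOB_NAME, JobRunIds=active)
        return f"stopped {len(active)} run(s)"

    return "no action"
```

Two cron rules - for example, "start at 7 a.m. Monday through Friday" and "stop at 7 p.m." - map directly onto the business-hours usage pattern described above.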
Another client engaged our team to work on their existing data infrastructure. The system was architected with massive parallelism in mind, with hundreds of nodes active at once when the pipelines ran. The scale seemed reasonable until our engineer saw what was actually being processed.
The largest table contained no more than 7 million rows - think a single Excel file. Even accounting for all the tables, the system had historically processed between 50 and 100GB of data in total - a far cry from the 10TB Big Data threshold. The company was throwing massive compute at data that would fit on a pen drive - the computational equivalent of using an atomic bomb to kill mosquitoes.
Our engineer brought this discrepancy to the stakeholders' attention. The situation exposed a classic pattern: architects designed for scale when no scale was even in sight. They asked "what if it scales?" instead of "what do we actually have?" The system could handle massive growth but was significantly over-provisioned for reality.
These cases have something in common. Both teams were making grand plans for the future without really needing to. Even if their dreams eventually came true - and the chances of that are slim - the companies would likely have grown enough along the way to amass the resources needed for scaling.
How do you avoid becoming another cautionary tale? Three principles can protect you from the Big Data delusion.
Before any technology decision, exhaust the "why" questions. Why real-time? Why distributed? Why this particular service? Engineers should spend more time asking "why" than proposing solutions. Stakeholders should demand clear business justifications instead of buzzwords. If you can't articulate why something needs real-time processing or distributed computing, then it probably doesn't.
Never commit to infrastructure before understanding your actual data characteristics and business needs. Measure everything: data volumes, query patterns, growth rates. Use the minimum viable solution that meets your current needs. If your data fits in RAM, you don't need a distributed database. If your queries happen hourly, you don't need real-time processing. Build for what you have, not for what could be in five years.
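Measuring doesn't have to be a project in itself. As one hedged example, if your data happens to live in Postgres, a sketch like the one below lists table sizes so you can compare your real footprint against the thresholds discussed earlier; the connection string is a placeholder, and other warehouses expose equivalent system views.

```python
# Measurement sketch: list table sizes in a Postgres database and compare
# the total footprint against the Big Data thresholds discussed earlier.
# The connection string is a placeholder; other warehouses expose similar
# catalogs (information_schema, Redshift's svv_table_info, and so on).
import psycopg2

QUERY = """
SELECT n.nspname AS schema,
       c.relname AS table,
       pg_total_relation_size(c.oid) AS bytes
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY bytes DESC;
"""

with psycopg2.connect("postgresql://user:pass@localhost:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        rows = cur.fetchall()

total_bytes = sum(b for _, _, b in rows)
print(f"Total footprint: {total_bytes / 1024**4:.3f} TB across {len(rows)} tables")
for schema, table, b in rows[:10]:
    print(f"{schema}.{table}: {b / 1024**3:.1f} GB")
```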
Start with the simplest thing that could possibly work and scale up as you go along. Ground each increment of complexity in factual observation. It will cost you less to expand a simple system than you’d lose by overengineering a monstrosity.
The alternative is a tale as old as time: organizations deploying space shuttle infrastructure to cross the street for cigarettes.
The story of Discord (told in detail in these two articles) demonstrates how to scale responsibly.
In early 2015, Discord built their initial service in under two months using MongoDB - explicitly chosen for rapid iteration, with plans for easy migration when needed.
By November 2015, they hit 100 million messages. The data no longer fit in RAM, and latencies compounded.
The company analyzed usage patterns and migrated to Cassandra with 12 nodes in 2017. Five years later, they hit 177 nodes with trillions of messages, facing hot partitions, expensive maintenance, and cascading latencies.
They tested extensively on smaller clusters, built protective infrastructure, and migrated to ScyllaDB. 177 nodes became 72; performance improved dramatically.
Moral of the story? Iterate based on reality. Discord grew because they only migrated in response to observed behavior. What would happen if they hadn’t?
Wasted cloud bills aren’t the only cost of buzzword-driven development. When companies deploy Big Data infrastructure for small data problems, they burn through three critical resources simultaneously.
Money, obviously. Not only infrastructure costs but also payroll, training, and consulting - all of which divert resources from actual business value.
Time and attention. Distributed systems are complex. Every minute spent troubleshooting a system you don't need is a minute not spent solving problems that matter.
Too often, companies worry about the future while ignoring the present. The question "what if we become Google?" eclipses a real risk: "what if we waste all our capital before we find product-market fit?"
That’s why an outside perspective is so valuable. When everyone around you is building distributed systems, questioning whether you need one feels futile. An external engineering team can ask the uncomfortable questions your internal team might not think to ask.
At Janea Systems, we practice what this article preaches.
When we identified that enterprise RAG systems were burning memory budget unnecessarily, we investigated the problem and built JECQ, an open-source tool that delivers 6x compression with minimal accuracy loss.
When agentic AI systems started hemorrhaging token costs, we ran tests and created JSPLIT, a framework that cuts costs by up to 100x.
Our senior engineers bring the same approach - validate first, build second - to client engagements. We embed with your team to identify where you're over-invested in complexity and where targeted optimization actually delivers ROI.
The business impact: less wasted spend, engineering effort focused on problems that matter, and faster delivery.
Are you getting the ROI you expected from your infrastructure investments? If you're uncertain whether your engineering spend is sized right for your growth trajectory, or wondering when the results will justify the cost, we can help. Contact us to discuss how embedded senior engineers can eliminate waste and redirect resources toward what actually accelerates your business.
Start by forcing every requirement into measurable terms. Define what "real-time" means in minutes or seconds, tie it to a business SLA, and validate it against actual user behavior. Require proof of need before you approve distributed systems, and default to the simplest architecture that meets today’s constraints. Keep asking "why" until the justification is concrete.
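For example, "real-time" might translate into "an event is visible in the dashboard within 15 minutes at the 95th percentile". A sketch like the one below - the timestamps and the 15-minute target are hypothetical - turns that requirement into something you can validate against observed behavior before paying for a streaming stack.

```python
# Hedged sketch: pin "real-time" to a measurable SLA and check it against
# observed behavior. Timestamps and the 15-minute target are placeholders.
import math
from datetime import datetime, timedelta

SLA = timedelta(minutes=15)  # "real-time" defined as: visible within 15 minutes (p95)

# (event created, event visible in dashboard) pairs pulled from real logs.
samples = [
    (datetime(2026, 2, 10, 9, 0),  datetime(2026, 2, 10, 9, 4)),
    (datetime(2026, 2, 10, 9, 30), datetime(2026, 2, 10, 9, 41)),
    (datetime(2026, 2, 10, 10, 0), datetime(2026, 2, 10, 10, 9)),
    (datetime(2026, 2, 10, 11, 0), datetime(2026, 2, 10, 11, 13)),
]

latencies = sorted(seen - created for created, seen in samples)
p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank p95

print(f"p95 end-to-end latency: {p95}, target: {SLA}")
print("The current pipeline meets the SLA" if p95 <= SLA
      else "SLA missed - decide whether the SLA or the pipeline should change")
```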
You need big data tools when a single machine can no longer process your workload reliably, and when your storage and scan sizes truly justify distributed compute. If most queries touch small slices, run periodically, and your total data stays well below multi-terabyte scan patterns, a simpler stack often performs better and costs far less.
Probably, if you run always-on pipelines for business-hours use cases, pay for streaming when batch would meet the SLA, or maintain complex distributed infrastructure for modest data volumes. Look for a mismatch between compute spend and data reality: low row counts, small tables, infrequent queries, and heavy operational overhead are classic signals that the platform is bigger than the problem.
Ready to discuss your software engineering needs with our team of experts?