...

JECQ: Smart, Open-Source Compression for FAISS Users—6x Compression Ratio, 85% Accuracy

July 03, 2025

By Hubert Brychczynski

  • Artificial Intelligence,

  • Edge AI,

  • Vector Compression,

  • Machine Learning,

  • Quantization,

  • Similarity Search,

  • Retrieval Augmented Generation

...

Ever wonder how it takes just milliseconds to search something on Google, despite hundreds of billions of webpages in existence? The answer is Google’s index. By the company’s own admission, that index weighs in at over 100,000,000 gigabytes. That’s roughly 100 petabytes.

Now, imagine if you could shrink that index by a factor of six.

That’s exactly what Janea Systems did for vector embeddings—the index of artificial intelligence.

Read on to learn what vector embeddings are, why compressing them matters, how it’s been done until now, and how Janea Systems’ solution pushes it to a whole new level.

The Data Explosion: From Social Media to Large Language Models

The arrival of Facebook in 2004 marked the beginning of the social media era. Today, Facebook has over 3 billion monthly active users worldwide. In 2016, TikTok introduced the world to short-form video, and now has more than a billion monthly users.

And in late 2022, ChatGPT came along.

Every one of these inventions led to an explosion of data being generated and processed online. With Facebook, it was posts and photos. TikTok flooded the web with billions of 30-second dance videos.

When data starts flowing in by the millions, companies look for ways to cut storage costs with compression. Facebook compresses the photos we upload; TikTok does the same with videos.

What about large language models? Is there anything to compress there?

The answer is yes: vector embeddings.

Vector Embeddings: The Language of Modern AI

Think of vector embeddings as the DNA of meaning inside a language model. When you type something like “Hi, how are you?”, the model converts that phrase into embeddings—a set of vectors that capture how it relates to other phrases. These embeddings help the model process the input and figure out how likely different words are to come next. This is how the model knows that the right response to “Hi, how are you?” is “I’m good, and you?” rather than “That’s not something you’d ask a cucumber.”

The principle behind vector embeddings also underpins a process called “similarity search.” Here, embeddings represent larger units of meaning—like entire documents—powering use cases like retrieval-augmented generation (RAG), recommendation engines, and more.
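
As a toy illustration of that principle (the four-dimensional vectors below are invented; real embedding models produce hundreds or thousands of dimensions), similarity between embeddings can be measured with cosine similarity:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real models produce vectors with
# hundreds or thousands of dimensions.
phrases = {
    "I'm good, and you?": np.array([0.9, 0.1, 0.0, 0.2]),
    "That's not something you'd ask a cucumber": np.array([0.1, 0.8, 0.6, 0.0]),
}
query = np.array([0.8, 0.2, 0.1, 0.3])  # stand-in embedding of "Hi, how are you?"

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The phrase whose embedding is most similar to the query wins.
best = max(phrases, key=lambda p: cosine(query, phrases[p]))
print(best)  # -> "I'm good, and you?"
```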

It should be pretty clear by now that vector embeddings are central not just to how generative AI works, but to a wide range of AI applications across industries.

The Hidden Costs of High-Dimensional Data: Why Vector Compression is Crucial

The problem is that vector embeddings take up space. The faster and more accurate we want an AI system to be, the more vector embeddings it needs, and the more space it takes to store them. But this isn't just a storage-cost problem: the bigger the embeddings, the more bandwidth they consume on the PCIe and memory buses. It's also an issue for edge AI devices, which don't have constant internet access, so their AI models need to run efficiently within the limited space they have onboard.

That's why it makes sense to look for ways to push compression even further, even though embeddings are already being compressed today. Squeezing even another 10% out of the footprint can mean real savings, and a much better user experience for IoT devices running generative AI.

At Janea Systems, we saw this opportunity and built an advanced C++ library based on FAISS.

FAISS—short for Facebook AI Similarity Search—is Meta’s open-source library for fast vector similarity search, offering an 8.5x speedup over earlier solutions. Our library takes it further by optimizing the storage and retrieval of large-scale vector embeddings in FAISS—cutting storage costs and boosting AI performance on IoT and edge devices.
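
For orientation, this is roughly what an uncompressed FAISS index looks like from Python; the random vectors below are stand-ins for real embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                            # embedding dimensionality
xb = np.random.rand(10_000, d).astype("float32")   # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

index = faiss.IndexFlatL2(d)      # exact (uncompressed) L2 search
index.add(xb)                     # store every vector in full precision
distances, ids = index.search(xq, 5)   # 5 nearest neighbours per query
print(ids)
```

An index like this returns exact results but keeps every vector in full 32-bit precision, and that is exactly the footprint the compression techniques below try to shrink.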

The Industry Standard: A Look at Product Quantization (PQ)

Vector embeddings are stored in a specialized data structure called a vector index. The index lets AI systems quickly find and retrieve the vectors closest to a given input (e.g., a user question) and match it with an accurate output.

A major constraint for vector indexes is space. The more vectors you store—and the higher their dimensionality—the more memory or disk you need. This isn’t just a storage problem; it affects whether the index fits in RAM, whether queries run fast, and whether the system can operate on edge devices.

Then there’s the question of accuracy. If you store vectors without compression, you get the most accurate results possible. But the process is slow, resource-intensive, and often impractical at scale. The alternative is to apply compression, which saves space and speeds things up, but sacrifices accuracy.

The most common way to manage this trade-off is a method called Product Quantization (PQ) (Fig. 1).

jecq-fig-1.png

Fig. 1: PQ’s uniform compression across subspaces

PQ works by splitting each vector into equal-sized subspaces. It’s efficient, hardware-friendly, and the standard in vector search systems like FAISS.

But because PQ treats every subspace the same way, allocating it the same number of bits, it’s like compressing every video frame in the same way and to the same size—whether it’s entirely black or full of detail. This approach keeps things simple and efficient but misses the opportunity to increase compression on a case-by-case basis.
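
To see what standard PQ looks like in practice, here is a minimal FAISS sketch; the dimensionality (128), number of subquantizers (16), and bit width (8) are illustrative choices, not a recommended configuration:

```python
import numpy as np
import faiss

d = 128                                              # vector dimensionality
train = np.random.rand(20_000, d).astype("float32")  # stand-in training data

# Split each vector into 16 subspaces, each quantized to 8 bits (1 byte).
m, nbits = 16, 8
pq_index = faiss.IndexPQ(d, m, nbits)
pq_index.train(train)      # learn one codebook per subspace
pq_index.add(train)

print("bytes per vector, uncompressed:", d * 4)                 # 512 (float32)
print("bytes per vector, PQ code:     ", pq_index.pq.code_size) # 16
```

With these illustrative settings, each 512-byte float32 vector is stored as a 16-byte code, a 32x reduction before any index overhead.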

At Janea, we realized that vector dimensions vary in how much information they carry—much like video frames vary in resolution and detail. This means we can adjust the aggressiveness of compression (or, more precisely, quantization) based on how relevant each dimension is, with minimal impact on overall accuracy.

Janea Systems' Solution: JECQ - Intelligent, Dimension-Aware Compression for FAISS

To strike the right balance between memory efficiency and accuracy, engineers at Janea Systems have developed JECQ, a novel, open-source compression algorithm available on GitHub that varies compression by the statistical relevance of each dimension.

In this approach, the distances between quantized values become irregular, reflecting each dimension's complexity.

How does JECQ work?

  1. The algorithm starts by determining the isotropy of each dimension based on the eigenvalues of the covariance matrix. In the future, the analysis will also cover sparsity and information density. 
  2. The algorithm then classifies each dimension into one of three categories: low relevance, medium relevance, and high relevance (a simplified sketch of this classification step follows the list). 
  3. Dimensions with low relevance are discarded, with very little loss in accuracy.
  4. Medium-relevance dimensions are quantized using just one bit, again with minimal impact on accuracy.
  5. High-relevance dimensions undergo the standard product quantization.
  6. Compressed vectors are stored in a custom, compact format accessible via a lightweight API.
  7. The solution is compatible with existing vector databases and ANN frameworks, including FAISS.
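
The sketch below illustrates the classification idea only. It is not the JECQ implementation: the relevance score (each dimension's share of total variance, rather than a full eigenvalue analysis of the covariance matrix) and the two thresholds are simplified, hypothetical stand-ins for what the library actually computes.

```python
import numpy as np

def classify_dimensions(x, low_thresh=0.005, high_thresh=0.03):
    """Toy relevance score: each dimension's share of the total variance,
    read off the diagonal of the covariance matrix (simplified stand-in)."""
    cov = np.cov(x, rowvar=False)               # d x d covariance matrix
    relevance = np.diag(cov) / np.trace(cov)    # normalized per-dimension score

    low    = np.where(relevance < low_thresh)[0]                                  # discarded
    medium = np.where((relevance >= low_thresh) & (relevance < high_thresh))[0]   # 1-bit quantization
    high   = np.where(relevance >= high_thresh)[0]                                # standard PQ
    return low, medium, high

# Example: 10,000 random 64-d vectors with a few dominant dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 64)) * rng.uniform(0.05, 2.0, size=64)
low, medium, high = classify_dimensions(x)
print(f"{len(low)} dropped, {len(medium)} 1-bit, {len(high)} full PQ")
```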

What are the benefits and best use cases for JECQ?

Early tests show a memory footprint six times smaller than with product quantization (PQ), while retaining 84.6% of the search accuracy of uncompressed vectors. Figure 2 compares the memory footprint of an index before quantization, with PQ, and with JECQ.

jecq-figure-2.png

Fig. 2: Memory footprint before quantization, with PQ, and with JECQ
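
To put those ratios in perspective, here is a back-of-the-envelope calculation; the corpus size, dimensionality, and PQ code size below are assumptions for illustration, not the benchmark configuration:

```python
# Illustrative only: an assumed corpus of 1M 768-dimensional float32 embeddings.
n, d = 1_000_000, 768
flat_bytes = n * d * 4        # uncompressed: ~3.1 GB
pq_bytes   = n * 96           # assumed PQ setup: 96 subquantizers x 8 bits = 96 bytes/vector
jecq_bytes = pq_bytes / 6     # the reported ~6x further reduction over PQ

print(f"flat: {flat_bytes / 1e9:.1f} GB, "
      f"PQ: {pq_bytes / 1e6:.0f} MB, "
      f"JECQ: {jecq_bytes / 1e6:.0f} MB")
```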

We expect this will lower cloud and on-prem storage costs for enterprise AI search, enhance Edge AI performance by fitting more embeddings per device for RAG or semantic search, and reduce the storage footprint of historical embeddings.

What Are JECQ’s License and Features?

JECQ is out on GitHub, available under the MIT license. It ships with an optimizer that takes a representative data sample or user-provided data and generates an optimized parameter set. Users can then fine-tune this by adjusting the objective function to balance their preferred accuracy–performance trade-off. 

Beyond Compression: A Track Record of LLM Optimization at Janea Systems

JECQ isn’t our first impactful project in the LLM and generative AI space:

  • For one client, we built a system that automates SQL-to-RAG precomputation in a chatbot, cutting average response time from over a minute to just a few seconds—and improving accuracy. Read the blog article.
  • For social impact company BigFilter, we used off-the-shelf AI components to rapidly develop a fact-checking proof-of-concept, delivered within three months. Read the case study.
  • For Bing Maps, we refactored deep learning pipelines and optimized training, resulting in a 50x speedup in TensorFlow, 7x faster training, and 2x faster batch processing. Get the full story.
  • For a collections software company, we developed a robust ETL data pipeline enabling historical data tracking in a Delta Lake with SCD Type 2. It features Azure Key Vault for security, supports horizontal scaling, and provides a foundation for ML applications including predictive analytics.

Ready to Elevate Your AI Stack?

At Janea Systems, we specialize in solving complex AI and LLM engineering challenges—and we back our solutions with tangible, measurable outcomes.

Whether you’re already using AI or just exploring its potential, we’re ready to help you build smarter, faster systems. Our experience spans finance, customer support, healthcare, geospatial, and edge AI, with a track record of delivering real results across industries.

Let’s talk about how we can help you do more with your data.

FAQ

What is the main trade-off in vector quantization?

The primary trade-off is between the compression ratio and search accuracy. More aggressive compression significantly reduces memory usage and storage costs, but it is a lossy process that can lead to a drop in the accuracy of search results because some of the original vector information is discarded. Techniques like oversampling and rescoring are often used to find a better balance and mitigate this accuracy loss.
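
As an illustration of oversampling and rescoring with FAISS’s Python bindings (the index parameters and oversampling factor are arbitrary): search the compressed index for more candidates than you need, then re-rank them by exact distance against the original vectors.

```python
import numpy as np
import faiss

d, k, oversample = 128, 10, 5
xb = np.random.rand(50_000, d).astype("float32")   # original (uncompressed) vectors
xq = np.random.rand(1, d).astype("float32")        # a single query

pq = faiss.IndexPQ(d, 16, 8)     # lossy, compressed index
pq.train(xb)
pq.add(xb)

# 1) Oversample: fetch k * oversample approximate candidates from the PQ index.
_, cand = pq.search(xq, k * oversample)
cand = cand[0]

# 2) Rescore: recompute exact L2 distances against the original vectors and re-rank.
exact = np.linalg.norm(xb[cand] - xq, axis=1)
top_k = cand[np.argsort(exact)[:k]]
print(top_k)
```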

How can I reduce the memory usage of a FAISS index?

FAISS provides several built-in methods for compression. The most common is Product Quantization (PQ), which can dramatically reduce the memory footprint by splitting vectors into parts and compressing each part separately. For even better performance on large datasets, PQ is often combined with an Inverted File index (IVF), which pre-clusters vectors to narrow down the search space. FAISS also supports other codecs, including Scalar Quantization (SQ) for milder compression and pre-processing transforms like PCA or OPQ.   
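
For reference, these options can be combined through FAISS’s index factory; the concrete parameters below (cluster count, code sizes) are illustrative and should be tuned to your data and dimensionality:

```python
import numpy as np
import faiss

d = 128
train = np.random.rand(100_000, d).astype("float32")  # stand-in training vectors

# IVF pre-clustering + PQ compression: 1024 clusters, 32-byte codes per vector.
index = faiss.index_factory(d, "IVF1024,PQ32")
index.train(train)
index.add(train)
index.nprobe = 16   # clusters visited per query: the speed/recall knob

# Milder compression: 8-bit scalar quantization (4x smaller than float32).
sq_index = faiss.index_factory(d, "SQ8")

# An OPQ rotation in front of PQ often recovers accuracy at the same code size.
opq_index = faiss.index_factory(d, "OPQ32,IVF1024,PQ32")
# (sq_index and opq_index also need .train() and .add() before searching.)
```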

What makes JECQ different from standard Product Quantization?

Standard Product Quantization (PQ) applies uniform compression across a vector, treating every dimension as equally important. JECQ introduces a more intelligent, non-uniform and dimension-aware approach. It first analyzes the statistical relevance of each dimension and then applies a varied compression strategy: low-relevance dimensions are discarded, medium-relevance dimensions are quantized to a single bit, and only the highest-relevance dimensions undergo standard PQ. This allows JECQ to achieve a high compression ratio while preserving the most critical information needed for accuracy.

How much space can I save with JECQ relative to PQ?

In our tests on PubMedQA, a biomedical question-answering dataset, JECQ reduced the memory footprint by a factor of six compared to product quantization (PQ), while maintaining 84.18% search accuracy relative to the uncompressed (flat) index. That means JECQ lets you store the same index in roughly a sixth (about 17%) of the space required by PQ, a memory saving of roughly 83% with minimal loss in retrieval quality.

Is JECQ suitable for edge AI applications?

Yes, JECQ is exceptionally well-suited for edge AI. A primary challenge in deploying AI on edge devices is their severe limitation in memory and storage, which often makes it impossible to run large models locally. By significantly reducing the memory footprint of vector embeddings, JECQ makes it feasible to store and process large, effective vector indexes directly on these resource-constrained devices. This enables powerful on-device applications like private semantic search or Retrieval-Augmented Generation (RAG) without a constant need for a cloud connection.   

What is the license for JECQ?

JECQ is available on GitHub under the MIT license. This is a permissive open-source license, making it free to use and integrate into both academic and commercial projects.
