June 13, 2025
By Hubert Brychczynski
Artificial Intelligence, Software Engineering, Machine Learning, Retrieval Augmented Generation, Chatbots
According to Google, when page load time goes from one second to three seconds, the probability of a bounce increases by 32%.
Now imagine waiting a full minute for a chatbot to reply.
This isn’t uncommon in chatbots that use LLMs to pull answers from internal SQL databases. When the user sends a prompt, the system has to build an SQL query, run it, fetch the data, and format a response.
The process is not only slow; it also poses a number of challenges, which we explore below.
This article shows how to improve both accuracy and speed by precomputing the conversion of SQL data to vector embeddings.
We focus on identifying frequent query types and preparing the responses ahead of time. If an answer can't be found in the vector store, the system dynamically routes the request to a fallback SQL agent.
From SQL to vector store: building rapid-response chatbots with vector embeddings
Typically, when a user asks a chatbot something that SQL data can answer, an LLM agent follows a prompt template to (1) turn the user's question into an SQL query; (2) run the query; (3) rewrite the result in natural language; (4) evaluate the output; (5) return the answer or retry if needed (Figure 1).
Fig. 1: Chatbot interaction with SQL
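To make that loop concrete, here is a minimal Python sketch of the flow in Figure 1. The helper functions are hypothetical stubs standing in for LLM calls and database access, not any particular framework's API:

```python
# Hypothetical stubs standing in for real LLM and database calls.
def generate_sql(question: str) -> str:
    return "SELECT name, calories FROM meals WHERE calories < 500;"

def run_query(sql: str) -> list[tuple]:
    return [("Grilled chicken salad", 420), ("Veggie stir-fry", 380)]

def summarize(question: str, rows: list[tuple]) -> str:
    return "Two meals fit: grilled chicken salad (420 kcal) and veggie stir-fry (380 kcal)."

def looks_correct(question: str, sql: str, answer: str) -> bool:
    return bool(answer)

MAX_RETRIES = 3

def answer_with_live_sql(question: str) -> str:
    """The live-SQL loop from Figure 1: generate, run, rewrite, evaluate, retry."""
    for _ in range(MAX_RETRIES):
        sql = generate_sql(question)              # (1) turn the question into SQL
        rows = run_query(sql)                     # (2) run the query
        answer = summarize(question, rows)        # (3) rewrite the result in natural language
        if looks_correct(question, sql, answer):  # (4) evaluate the output
            return answer                         # (5) return the answer, or loop and retry
    return "Sorry, I couldn't produce a reliable answer."

print(answer_with_live_sql("Which meals are under 500 calories?"))
```

Every pass through that loop means at least two LLM calls plus a database round trip, which is where the latency comes from.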
This can work but is time-consuming and error-prone. If a result doesn’t pass internal checks, the agent keeps trying until satisfied. It might also generate an answer that sounds right, but isn’t, and pass it on to the unsuspecting user.
SQL databases hold discrete units of information, like menu items or prices, without any context. That’s a problem for LLMs, because they need natural-language clues to operate.
For example, imagine a user asks about dairy-free meals under 500 calories. The SQL database may contain all the necessary information for an answer, but it's likely spread across multiple tables. A person could piece it together. An LLM might not.
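For illustration, here is what that question might look like against a small, made-up menu schema (the table and column names are invented for this sketch). Even in this toy case, the query has to reason across three tables before it can filter on calories:

```python
import sqlite3

# Tiny, hypothetical schema: meals, ingredients, and a join table linking them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meals (id INTEGER PRIMARY KEY, name TEXT, calories INTEGER);
CREATE TABLE ingredients (id INTEGER PRIMARY KEY, name TEXT, contains_dairy INTEGER);
CREATE TABLE meal_ingredients (meal_id INTEGER, ingredient_id INTEGER);

INSERT INTO meals VALUES (1, 'Grilled chicken salad', 420), (2, 'Mac and cheese', 610), (3, 'Veggie stir-fry', 380);
INSERT INTO ingredients VALUES (1, 'Chicken', 0), (2, 'Cheddar', 1), (3, 'Broccoli', 0), (4, 'Olive oil', 0);
INSERT INTO meal_ingredients VALUES (1, 1), (1, 4), (2, 2), (3, 3), (3, 4);
""")

# "Dairy-free meals under 500 calories" spans all three tables.
query = """
SELECT m.name, m.calories
FROM meals m
WHERE m.calories < 500
  AND NOT EXISTS (
      SELECT 1
      FROM meal_ingredients mi
      JOIN ingredients i ON i.id = mi.ingredient_id
      WHERE mi.meal_id = m.id AND i.contains_dairy = 1
  );
"""

for name, calories in conn.execute(query):
    print(f"{name}: {calories} kcal")  # Grilled chicken salad, Veggie stir-fry
```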
Most LLM-based chatbots need help to work with SQL. That help comes from long prompt templates explaining what to do. Without these, LLMs wouldn’t be able to translate the user’s intent into correct SQL queries.
The problem is that these lengthy prompt templates slow LLMs down and still fail to prevent mistakes.
What if, instead of interpreting queries on the fly, we could translate them ahead of time into a language LLMs can easily understand and generate?
This is the idea behind precomputation. The goal is to convert common responses to vector embeddings and place them in a vector store that the LLM can easily read when similar questions come up.
LLMs are context-dependent, so it doesn’t make sense to copy the SQL database directly into a vector store. Instead, it's better to precompute answers into rich, narrative-like documents that provide the necessary context, and use vector embeddings to help LLMs find the relevant documents fast. This process is called encoding (Figure 2).
Fig. 2: Encoding
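A minimal sketch of the encoding step, assuming the sentence-transformers library for embeddings (any embedding model would do). The summary text stands in for a document precomputed from SQL results:

```python
from sentence_transformers import SentenceTransformer  # assumed embedding library

# A narrative summary precomputed from SQL results (hard-coded here for brevity).
summary = (
    "Dairy-free meals under 500 calories: grilled chicken salad (420 kcal) "
    "and veggie stir-fry (380 kcal). Both are on the lunch menu."
)

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, 384-dimensional vectors
embedding = model.encode(summary)

print(embedding.shape)  # (384,) - this vector is stored in the vector database alongside the text
```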
When a user asks a question, the chatbot searches the vector store for similar content using various metrics (e.g. cosine similarity, dot product, and others), and pulls out the most relevant documents. This process is called retrieval (Figure 3).
Fig. 3: Retrieval
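Here is a bare-bones sketch of retrieval using cosine similarity over normalized embeddings; in production, the search would run inside the vector store rather than in NumPy:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Precomputed summaries that would normally sit in a vector store.
documents = [
    "Dairy-free meals under 500 calories: grilled chicken salad and veggie stir-fry.",
    "Gluten-free desserts on the menu: fruit salad and dark chocolate mousse.",
    "The cafeteria is open Monday to Friday, 8 am to 4 pm.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

question = "What can I eat if I avoid dairy and want something light?"
query_vector = model.encode(question, normalize_embeddings=True)

# On normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vectors @ query_vector
best = int(np.argmax(scores))
print(f"Best match (score {scores[best]:.2f}): {documents[best]}")
```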
Once the relevant documents have been retrieved, the LLM uses their content to generate the answer to the question. The documents are said to augment the generation - hence the term "Retrieval-Augmented Generation."
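The augmentation itself can be as simple as prepending the retrieved summaries to the prompt. In the sketch below, `call_llm` is a hypothetical placeholder for whichever chat-completion API you use:

```python
def build_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Augment the user's question with the retrieved summaries."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real chat-completion call.
    return "Grilled chicken salad and veggie stir-fry are both dairy-free and under 500 calories."

retrieved = [
    "Dairy-free meals under 500 calories: grilled chicken salad (420 kcal) and veggie stir-fry (380 kcal)."
]
print(call_llm(build_prompt("Which dairy-free meals are under 500 calories?", retrieved)))
```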
Retrieval-augmented generation is significantly faster and more cost-effective than executing a fresh SQL query each time, especially when queries are complex or require large result sets.
But it only works when the data is already encoded into vector embeddings.
To encode SQL queries manually, you'll need to run common queries, prepare summaries, and save them as documents in a vector database like Qdrant or Pinecone. This works, but it takes a lot of time and won't exactly scale. Automation can solve the problem.
To speed up your chatbot responses, you can create an ingestion pipeline (Figure 4). The pipeline will automatically precompute answers to SQL-based questions and store them in a natural-language (read: LLM-friendly) format, so the chatbot will be able to generate replies a lot faster than when building them from scratch.
Think of it like a translation memory. Translators use specialized software to keep records of how they translated entire sentences and paragraphs. For every new project, the software combs the database for similar passages and suggests them to the translator. This saves them a lot of time they would otherwise spend translating the passages word by word, looking up terms in dictionaries, and redoing research they don't remember having already done.
Fig. 4: Ingestion pipeline abstract flow
Here’s a detailed breakdown of the flow:
1. Run the common SQL queries on a schedule or whenever the underlying data changes.
2. Perform any calculations and aggregations offline, so the LLM never has to do the math itself.
3. Transform the results into natural-language summaries.
4. Convert the summaries into vector embeddings.
5. Store the embeddings in a vector database, ready for retrieval when a matching question comes in.
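Below is a condensed sketch of such a pipeline, assuming SQLite as the source database, sentence-transformers for embeddings, and an in-memory Qdrant instance as the vector store (the schema and data are illustrative, and API details may vary by client version):

```python
import sqlite3
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# --- 1. Run a common SQL query against a tiny, made-up schema ---
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meals (name TEXT, category TEXT, calories INTEGER);
INSERT INTO meals VALUES
  ('Grilled chicken salad', 'lunch', 420),
  ('Veggie stir-fry', 'lunch', 380),
  ('Mac and cheese', 'lunch', 610);
""")
rows = conn.execute(
    "SELECT category, COUNT(*), AVG(calories) FROM meals GROUP BY category"
).fetchall()

# --- 2. + 3. Do the math offline and turn each result into a natural-language summary ---
summaries = [
    f"The {category} menu has {count} dishes with an average of {avg:.0f} calories."
    for category, count, avg in rows
]

# --- 4. Embed the summaries ---
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(summaries)

# --- 5. Store them in a vector database (in-memory Qdrant for this sketch) ---
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="sql_summaries",
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="sql_summaries",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
        for i, (vec, text) in enumerate(zip(vectors, summaries))
    ],
)

# At query time, the chatbot searches the collection instead of writing SQL.
question_vec = model.encode("How many lunch options are there?").tolist()
hits = client.search(collection_name="sql_summaries", query_vector=question_vec, limit=1)
print(hits[0].payload["text"])
```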
In our experience, this method can cut chatbot response time from roughly one minute to a few seconds. Accuracy and consistency also improve because the chatbot doesn’t build queries on the fly but pulls the answer from precomputed summaries.
In addition, there are fewer mathematical errors because calculations can be performed offline during precomputation, without relying on the LLM's imperfect arithmetic. This further improves the correctness and reliability of the chatbot's responses.
Finally, the solution saves costs in multiple ways: by reducing the number of tokens processed by the chatbot and by minimizing the load on the SQL database - especially in scenarios where the LLM agent would otherwise need multiple attempts to generate a correct query.
Because it uses vectorization to optimize performance, the solution is well suited for interactive systems with underlying SQL databases: support bots, HR assistants, internal dashboards, and anywhere else that speed and accuracy matter.
Ideally, user questions in this scenario should map to similar SQL queries (possibly with different filters) so they are easy to template. And if the questions require complex calculations, the solution prevents the chatbot from misinterpreting the math and producing incorrect results, because the calculations are done offline.
Structured databases are convenient, but LLMs can’t manipulate them directly. Our ingestion pipeline solves that by turning SQL queries into vector embeddings that the chatbot uses to provide fast and reliable answers. If you're using a chatbot on top of structured data and want to boost speed, cut costs, and improve results – get in touch!
What is a vector embedding?
A vector embedding is a numerical representation of a text summary that allows a chatbot to quickly retrieve semantically similar answers using similarity search.
Why not just run SQL queries at request time?
Running live SQL queries takes time and may involve complex logic that LLMs struggle with. Precomputed vector embeddings are faster and more reliable.
Can the encoding be done manually?
Yes, but it’s time-consuming and hard to scale. Automation ensures consistency, speed, and the ability to update frequently.
What is an ingestion pipeline?
It’s an automated system that fetches SQL data, transforms it into readable summaries, and stores those summaries as vector embeddings in a vector database.
Which vector databases can I use?
Popular choices include Qdrant, Pinecone, and Weaviate.
Ready to discuss your software engineering needs with our team of experts?