It’s time to face the truth about retrieval augmented generation, or RAG: It’s a solution in need of its own solution.
RAG was intended to improve the performance of large language models (LLMs) and reduce hallucinations by letting models go beyond their training data with access to external knowledge bases. However, the real-world limits of traditional RAG systems have become painfully clear.
“To a large extent, RAG is flawed,” said Dinesh Nirmal, the senior vice president of IBM® Software. “Pure RAG is not really giving the optimal results that were expected.”
The RAG challenges routinely confronting users include limits on context windows and aggregation operations, an inability to understand complex relationships, and low-quality outputs associated with suboptimal chunking. Implementing RAG can also present security concerns, such as data leakage.
The good news is that advancements in artificial intelligence tools and strategies are helping compensate for traditional RAG’s flaws, resulting in more accurate generated responses to user queries. Let’s take a closer look at how to improve RAG performance.
Asking an LLM application powered by traditional RAG to perform aggregation operations (such as finding a sum) across an enormous dataset often isn’t merely difficult; it can be effectively impossible. One factor hampering the system’s performance is context window size: LLM context windows generally aren’t scalable enough to process, say, a collection of 100,000 invoices. In addition, the traditional RAG pipeline relies on vector databases, which are designed for similarity searches, not aggregation operations.
“Essentially, it means that a vector database is not enough to handle these cases,” explained IBM Distinguished Engineer Sudheesh Kairali. “The context window is an issue. The other is the inability to handle mathematical operations.”
Enter SQL RAG.
When LLM users seek answers from large datasets, combining retrieval augmented generation with SQL can provide precise results, Kairali explained.
SQL includes built-in aggregation functions, and SQL databases have larger capacity than LLM context windows. If a business ingests its invoice data into an SQL database, it can use an LLM to convert queries—such as “What is the sum of all of last year’s invoices?”—into SQL, query the SQL database through RAG and arrive at the answer.
“You can do a lot of aggregations if you are able to build it,” Kairali said. After the SQL database performs an aggregation, “it only becomes a natural language processing (NLP) exercise for the LLM at that point.”
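As a minimal sketch of this pattern, imagine the LLM’s only job is to translate the user’s question into SQL, while the database handles the math. The table, columns and the llm_to_sql helper below are illustrative assumptions, not a specific product’s API:

```python
# Sketch of the SQL RAG pattern: the LLM translates the question to SQL;
# the database performs the aggregation the LLM cannot.
import sqlite3

def llm_to_sql(question: str) -> str:
    # In a real pipeline, an LLM call would generate this query from the question.
    # Hard-coded here so the sketch runs without any model dependency.
    return "SELECT SUM(amount) FROM invoices WHERE invoice_date >= date('now', '-1 year');"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("2025-03-01", 1200.0), ("2025-06-15", 830.5), ("2023-01-10", 400.0)],
)

question = "What is the sum of all of last year's invoices?"
total = conn.execute(llm_to_sql(question)).fetchone()[0]
# The numeric result goes back to the LLM, which phrases it as a natural-language answer.
print(f"Context handed back to the LLM: total = {total}")
```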
Discerning how different pieces of retrieved information or entities connect with each other is another weakness of traditional RAG. For example, consider the use case of a patient with a complex medical history. Through the traditional RAG retrieval process, an LLM might surface relevant details, such as how many physicians the patient had seen in a year, but struggle to specify which treatments each physician prescribed.
GraphRAG, introduced in 2024 by Microsoft Research, addresses this challenge by processing and identifying relationships through knowledge graphs. GraphRAG organizes information as a network of nodes and edges representing entities and their relationships to each other.
“If a patient has gone to a hospital and the question is, show me all the previous visits he has made—that can be shown not just as a verbiage but as a knowledge representation via graph,” Nirmal explained. “You can look at different, multiple points and see the different doctors that he has visited, the different medicines he has taken, the treatments that he's undergone—all in a single graphical representation.”
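A toy sketch illustrates the kind of graph structure involved. The entities, relation names and use of the networkx library here are assumptions for illustration, not GraphRAG’s actual implementation:

```python
# A small knowledge graph for the patient example: nodes are entities,
# edges are relationships such as "visited" and "prescribed".
import networkx as nx

g = nx.MultiDiGraph()
g.add_edge("Patient A", "Dr. Rao", relation="visited", date="2024-02-10")
g.add_edge("Patient A", "Dr. Chen", relation="visited", date="2024-06-03")
g.add_edge("Dr. Rao", "Metformin", relation="prescribed")
g.add_edge("Dr. Chen", "Physical therapy", relation="prescribed")

# A graph traversal can now answer relationship questions that flat chunks struggle with,
# such as "which treatment did each doctor the patient saw prescribe?"
for doctor in g.successors("Patient A"):
    treatments = list(g.successors(doctor))
    print(doctor, "->", treatments)
```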
GraphRAG, Nirmal noted, does have limitations because the rendering of a graph becomes more difficult as the volume of data increases. Mapping hundreds of thousands of nodes is more challenging than mapping just a few dozen, for instance. “Everything comes with limitations,” Nirmal said, “but the reason that GraphRAG is taking off is because of the limitations of pure RAG itself.”
Chunking is critical for RAG applications. In traditional chunking, documents are broken at fixed points into smaller pieces, each of which is embedded and stored in a vector database. However, this method might cause an LLM application to provide incomplete or inaccurate answers, even when it runs a semantic search algorithm over a domain-specific knowledge base.
“In this process, a lot of times you lose accuracy because you don't know where you’re chunking the data,” Nirmal explained. “Let's say you chunked, or you cut off, in the middle of a table, so when you bring back the table, you bring half of the table. Now you have lost the accuracy of it.”
Fortunately, better chunking strategies through agentic methods can improve information retrieval. This agentic chunking includes strategies such as creating overlapping chunks and dynamically altering chunk sizes based on context in retrieved documents. LLM orchestration frameworks can be helpful for this purpose. For instance, LangChain’s TextSplitters tools can divide up text into small, semantically meaningful chunks. Such strategies help avoid the loss of relevant information when a document is decomposed.
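As one example of the overlapping-chunk approach, the sketch below uses LangChain’s RecursiveCharacterTextSplitter (the module path may differ by LangChain version); the chunk size, overlap and sample text are illustrative assumptions, not recommended settings:

```python
# Overlapping, semantically aware chunking with a LangChain text splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # target characters per chunk
    chunk_overlap=80,    # overlap so sentences and tables aren't severed at a boundary
    separators=["\n\n", "\n", ". ", " "],  # prefer splitting at paragraph and sentence breaks
)

# Stand-in document text so the sketch is self-contained.
document_text = "\n\n".join(
    f"Section {i}: Refund requests must be filed within 30 days of purchase. " * 3
    for i in range(1, 6)
)

chunks = splitter.split_text(document_text)
print(f"{len(chunks)} chunks; first chunk starts: {chunks[0][:80]!r}")
```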
Agentic AI is helpful for chunking, and it can also improve retrieval accuracy in other ways. Consider agentic RAG: an advanced AI framework that integrates RAG pipelines to query both structured data in SQL databases and unstructured data in document repositories, leveraging vector databases for similarity search.
Agentic RAG also enriches each chunk with metadata. This process correlates structured data (the metadata stored in a transactional database) with unstructured data to optimize retrieval accuracy.
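A simple sketch of what that enrichment can look like, with field names and values invented for illustration: each unstructured chunk carries structured fields pulled from a transactional system, so retrieval can filter on them alongside vector similarity.

```python
# Attach structured metadata to each unstructured chunk before indexing.
def enrich(chunk_text: str, record: dict) -> dict:
    return {
        "text": chunk_text,
        "metadata": {
            "customer_id": record["customer_id"],
            "invoice_id": record["invoice_id"],
            "region": record["region"],
        },
    }

chunks = [
    enrich("Payment terms: net 30 ...", {"customer_id": "C-17", "invoice_id": "INV-901", "region": "EMEA"}),
    enrich("Payment terms: net 60 ...", {"customer_id": "C-42", "invoice_id": "INV-355", "region": "APAC"}),
]

# At query time, the agent combines a structured metadata filter with semantic search,
# instead of relying on vector similarity alone.
emea_only = [c for c in chunks if c["metadata"]["region"] == "EMEA"]
print(len(emea_only), "chunk(s) pass the structured filter")
```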
“If we can really take the power of a vector database along with the transactional or SQL aspect of it and bring those two together,” Nirmal said, “we can really bring the accuracy and performance way up.”
Data leakage is a known problem with AI systems in general, and LLMs that use RAG are no exception. Without the right measures in place, an LLM might give low-level users information that they’re not authorized to access, from personally identifiable information (PII) to sensitive financial data.
“This is a reality with RAG,” Kairali said. “When you start with proof-of-concept, everybody is happy. But then when you want to push it to production and you want to make sure it’s production-grade, you start understanding that there’s a data protection issue.”
Addressing the issue means preserving access control lists (ACLs) and other governance policies when unstructured data is ingested into multiple databases. “When the query is coming and data is retrieved, it’s important to make sure that the ACLs and governance policies are being honored,” Kairali said. “It’s basically an engineering problem.”
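One way to picture that engineering problem is a retrieval-time filter that honors the ACLs inherited from the source system before any chunk reaches the prompt. The group names and fields below are illustrative assumptions, not a particular product’s mechanism:

```python
# Enforce inherited ACLs on retrieved chunks before they are passed to the LLM.
retrieved_chunks = [
    {"text": "Q3 board financials ...", "allowed_groups": {"finance", "executives"}},
    {"text": "Public product FAQ ...",  "allowed_groups": {"all_employees"}},
]

def authorized(chunk: dict, user_groups: set) -> bool:
    # Honor the ACL carried over from the document source system at ingestion time.
    return bool(chunk["allowed_groups"] & user_groups)

user_groups = {"all_employees"}  # the querying user's group memberships
safe_context = [c["text"] for c in retrieved_chunks if authorized(c, user_groups)]
print(safe_context)  # only the public FAQ reaches the prompt
```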
Solving this engineering problem can be made easier with the right data platforms, such as governed, open source-enabled data lakehouses. For instance, IBM’s watsonx.data, a hybrid, open data lakehouse, ensures that access controls are inherited from document source systems when data is retrieved. It also provides annotation for PII to prevent sensitive information from being shared.
As LLMs and other generative AI become more deeply ingrained in everyday workflows, improving RAG helps enterprises unlock greater value from their enterprise data. The right enterprise-level tools and strategies “enable higher performance and accuracy so that data becomes manageable and valuable,” Nirmal said. “That’s what every customer is looking for."