Retrieval-augmented generation (RAG)

You can use foundation models in IBM watsonx.ai to generate factually accurate output that is grounded in information in a knowledge base by applying the retrieval-augmented generation pattern.

 

This video provides a visual method to learn the concepts and tasks in this documentation.

Video chapters

[ 0:08 ] Scenario description
[ 0:27 ] Overview of pattern
[ 1:03 ] Knowledge base
[ 1:22 ] Search component
[ 1:41 ] Prompt augmented with context
[ 2:13 ] Generating output
[ 2:31 ] Full solution
[ 2:55 ] Considerations for search
[ 3:58 ] Considerations for prompt text
[ 5:01 ] Considerations for explainability

 

Providing context in your prompt improves accuracy

Foundation models can generate output that is factually inaccurate for various reasons. One way to improve the accuracy of generated output is to provide the necessary facts as context in your prompt text.

Example

The following prompt includes context to establish some facts:

Aisha recently painted the kitchen yellow, which is her favorite color.

Aisha's favorite color is 

Without the context at the beginning of the prompt, no foundation model can reliably generate the correct completion of the sentence at the end of the prompt. The only exception would be if Aisha were a famous person whose favorite color is mentioned in many online articles that are included in common pretraining data sets.

If you prompt a model with text that includes fact-filled context, then the output the model generates is more likely to be accurate. For more details, see Generating factually accurate output.
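
As a minimal sketch of this idea, the following Python snippet assembles the example prompt by placing the fact-filled context before the text that the model is asked to complete. The build_prompt helper is a hypothetical name used only for illustration; it is not part of watsonx.ai.

```python
# Hypothetical helper (not a watsonx.ai API): put known facts in front of
# the text that the foundation model is asked to complete.
def build_prompt(context: str, incomplete_text: str) -> str:
    """Return prompt text with the context placed before the text to complete."""
    return f"{context.strip()}\n\n{incomplete_text}"


context = "Aisha recently painted the kitchen yellow, which is her favorite color."
prompt = build_prompt(context, "Aisha's favorite color is ")

# The resulting text is what you would submit to a foundation model.
print(prompt)
```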

 

The retrieval-augmented generation pattern

You can scale out the technique of including context in your prompts by using information in a knowledge base.

The following diagram illustrates the retrieval-augmented generation pattern. Although the diagram shows a question-answering example, the same workflow supports other use cases.

Diagram that shows adding search results to the input for retrieval-augmented generation

The retrieval-augmented generation pattern involves the following steps:

  1. Search in your knowledge base for content that is related to a user's question.
  2. Pull the most relevant search results into your prompt as context and add an instruction, such as “Answer the following question by using only information from the following passages.”
  3. Only if the foundation model that you're using is not instruction-tuned: Add a few examples that demonstrate the expected input and output format.
  4. Send the combined prompt text (instruction + search results + question) to the foundation model.
  5. The foundation model uses contextual information from the prompt to generate a factual answer.
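
The steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the watsonx.ai implementation: the search function is a toy keyword matcher that stands in for a real retriever, the "Passage" and "Question" labels are an arbitrary prompt layout, and generate is a placeholder for whatever call you use to send prompt text to a foundation model.

```python
import re
from typing import Callable, List, Optional, Tuple

INSTRUCTION = (
    "Answer the following question by using only information "
    "from the following passages."
)


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(knowledge_base: List[str], question: str, top_k: int = 2) -> List[str]:
    """Step 1 (toy retriever): rank passages by words shared with the question."""
    question_words = _words(question)
    scored = [(len(question_words & _words(p)), p) for p in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:top_k] if score > 0]


def build_prompt(
    passages: List[str],
    question: str,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Steps 2 and 3: instruction + search results (+ optional examples) + question."""
    parts = [INSTRUCTION, ""]
    parts += [f"Passage {i}: {p}" for i, p in enumerate(passages, start=1)]
    parts.append("")
    # Few-shot examples are only needed for models that are not instruction-tuned.
    for example_question, example_answer in examples or []:
        parts += [f"Question: {example_question}", f"Answer: {example_answer}", ""]
    parts += [f"Question: {question}", "Answer:"]
    return "\n".join(parts)


def answer_question(
    knowledge_base: List[str], question: str, generate: Callable[[str], str]
) -> str:
    """Steps 4 and 5: send the combined prompt to the model and return its output."""
    passages = search(knowledge_base, question)
    return generate(build_prompt(passages, question))


knowledge_base = [
    "Aisha recently painted the kitchen yellow, which is her favorite color.",
    "The office is closed on public holidays.",
]
question = "What is the favorite color of Aisha?"

# Placeholder generator that returns the prompt unchanged, so you can inspect it.
print(answer_question(knowledge_base, question, generate=lambda prompt: prompt))
```

Passing the generator in as a function keeps the sketch independent of any particular model or client library. In practice you would replace the lambda with a call to the foundation model of your choice.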

The origin of retrieval-augmented generation

The term retrieval-augmented generation (RAG) was introduced in this paper: Retrieval-augmented generation for knowledge-intensive NLP tasks.

“We build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.”

In that paper, the term RAG models refers to a specific implementation of a retriever (a specific query encoder and vector-based document search index) and a generator (a specific pre-trained, generative language model). However, the basic search-and-generate approach can be generalized to use different retriever components and foundation models.

Knowledge base

The knowledge base can be any collection of information-containing artifacts, such as:

  • Process information in internal company wiki pages
  • Files in GitHub (in any format: Markdown, plain text, JSON, code)
  • Messages in a collaboration tool
  • Topics in product documentation, which can include long text blocks
  • Text passages in a database that supports structured query language (SQL) queries, such as Db2
  • A document store with a collection of files, such as legal contracts that are stored as PDF files
  • Customer support tickets in a content management system

Retriever

The retriever can be any combination of search and content tools that reliably returns relevant content from the knowledge base, including search tools like IBM Watson Discovery or search and content APIs like those provided by GitHub.

Vector databases are also effective retrievers. A vector database stores not only the data, but also a vector embedding of the data, which is a numerical representation of the data that captures its semantic meaning. At query time, a vector embedding of the query text is used to find relevant matches.
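
The following Python sketch illustrates the store-and-query mechanics of a vector database under simplifying assumptions. The ToyVectorStore class and the toy_embed function are hypothetical. In particular, toy_embed produces word-count vectors, which do not capture semantic meaning the way a real embedding model does; it only stands in for one so that the example runs without external services.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps each text together with its vector, as a vector database does."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Store the data and a vector embedding of the data side by side.
        self._rows.append((text, toy_embed(text)))

    def query(self, text: str, top_k: int = 1) -> List[Tuple[float, str]]:
        # At query time, embed the query text and rank stored vectors against it.
        query_vector = toy_embed(text)
        scored = [(cosine_similarity(query_vector, vec), doc) for doc, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("A vector database stores data together with a vector embedding of the data.")
store.add("Customer support tickets are kept in a content management system.")

for score, doc in store.query("How are vector embeddings stored?"):
    print(f"{score:.3f}  {doc}")
```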

IBM watsonx.ai does not include a vector database, but you can use the foundation models in watsonx.ai with any vector database on the market. The example notebooks illustrate the steps for connecting to popular vector databases, such as Chroma and Elasticsearch.

To help you implement a RAG pattern in which the retriever uses vectorized text, watsonx.ai offers an embedding API and embedding models that you can use to convert sentences and passages into vectors. For more information about this type of RAG implementation, see Using vectorized text with retrieval-augmented generation tasks.

Generator

The generator component can use any model in watsonx.ai. Choose the model that best suits your use case, your prompt format, and the content that you pull in for context.

Sample project

Import a sample project with notebooks and other assets that implement a question and answer solution by using retrieval-augmented generation. The project shows you how to do the following things:

  • Use HTML, PDF, DOC, or PPT files as the knowledge base and an Elasticsearch vector index as the retriever. (You must create the Elasticsearch service instance separately.)
  • Write a Python function that queries the vector index to search for information related to a question, and then inferences a foundation model and checks the generated answer for hallucinated content.
  • Use prompt templates that help you format effective prompts for the IBM granite-7b-lab and Meta Llama 3.1 foundation models.
  • Follow the pattern efficiently with RAG utilities from the watsonx.ai Python library.
  • Implement the next phase of a RAG implementation by including functions for collecting and analyzing user feedback about generated answers.

Try the Q&A with RAG Accelerator sample project.

Note: If you cannot create the sample project, try replacing the description field text.
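
To give a rough idea of what checking a generated answer for hallucinated content can involve, the following Python sketch measures how many of the words in an answer also appear in the retrieved passages. This is an assumption-laden heuristic for illustration only: it is not the method that the sample project uses, and the function names and the 0.6 threshold are hypothetical.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_ratio(answer: str, passages: List[str]) -> float:
    """Fraction of distinct answer words that also occur in the passages."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    passage_words = set().union(*(_words(p) for p in passages))
    return len(answer_words & passage_words) / len(answer_words)


def looks_unsupported(answer: str, passages: List[str], threshold: float = 0.6) -> bool:
    """Flag an answer when too few of its words are backed by the passages."""
    return support_ratio(answer, passages) < threshold


passages = ["Aisha recently painted the kitchen yellow, which is her favorite color."]

for answer in ["Aisha's favorite color is yellow", "Blue is the color of the sea"]:
    print(answer, "->", "flag for review" if looks_unsupported(answer, passages) else "ok")
```

A word-overlap score like this is crude. It cannot detect an answer that reuses the right words to state something false, so treat it as a starting point rather than a reliable check.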

Examples

The following examples demonstrate how to apply the retrieval-augmented generation pattern.

Retrieval-augmented generation examples

Example | Description | Link
Simple introduction | Uses a small knowledge base and a simple search component to demonstrate the basic pattern. | Introduction to retrieval-augmented generation
Simple introduction with Discovery | This sample notebook uses short articles in IBM Watson Discovery as a knowledge base and the Discovery API to perform search queries. | Simple introduction to retrieval-augmented generation with watsonx.ai and Discovery
Real-world example | The watsonx.ai documentation has a search-and-answer feature that can answer basic what-is questions by using the topics in the documentation as a knowledge base. | Answering watsonx.ai questions by using a foundation model
Example with LangChain | Contains the steps and code to demonstrate support of retrieval-augmented generation with LangChain in watsonx.ai. It introduces commands for data retrieval, knowledge base building and querying, and model testing. | Use watsonx and LangChain to answer questions by using RAG
Example with LangChain and an Elasticsearch vector database | Demonstrates how to use LangChain to apply an embedding model to documents in an Elasticsearch vector database. The notebook then indexes and uses the data store to generate answers to incoming questions. | Use watsonx, Elasticsearch, and LangChain to answer questions (RAG)
Example with the Elasticsearch Python library | Demonstrates how to use the Elasticsearch Python library to apply an embedding model to documents in an Elasticsearch vector database. The notebook then indexes and uses the data store to generate answers to incoming questions. | Use watsonx, and Elasticsearch Python library to answer questions (RAG)
Example with LangChain and a SingleStoreDB database | Shows you how to apply retrieval-augmented generation to large language models in watsonx by using the SingleStoreDB database. | RAG with SingleStoreDB and watsonx

