Overview

Large Language Models (LLMs) are often surprisingly knowledgeable about a wide range of topics, but they are limited to the data they were trained on. This means that clients looking to use LLMs with private or proprietary business information cannot use LLMs 'out of the box' to answer questions, generate correspondence, or the like.

Retrieval augmented generation (RAG) is an architectural pattern that enables foundation models to produce factually correct outputs for specialized or proprietary topics that were not part of the model's training data. By augmenting users' questions and prompts with relevant data retrieved from external data sources, RAG gives the model 'new' (to the model) facts and details on which to base its response.

Conceptual Architecture

The conceptual architecture of a RAG solution showing the major components and the flow of interactions between them to respond to a user query.

Generative AI architecture patterns

The RAG pattern, shown in the diagram below, is made up of two parts: data embedding during build time, and user prompting (or returning search results) during runtime.

An AI Engineer prepares the client data (for example, procedure manuals, product documentation, or help desk tickets) during Data Preprocessing. Client data is transformed and/or enriched to make it suitable for model augmentation. Transformations might include simple format conversions such as converting PDF documents to text, or more complex transformations such as translating complex table structures into if-then type statements. Enrichment may include expanding common abbreviations, adding metadata such as currency information, and other additions to improve the relevancy of search results.
An embedding model is used to convert the source data into a series of vectors that represent the words in the client data. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. The client data is embedded as passages (called chunks), such as sub-sections or paragraphs, to make it easier to find information.
The generated embeddings are stored in a vector database. Any data source that supports 'fuzzy' queries that return results based on likely relevance, for example watsonx Discovery, can be used in the RAG architecture but the most common implementation uses a vector database such as Milvus, FAISS, or Chroma.
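
The build-time steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product works: toy_embed is a hypothetical stand-in for a real embedding model (it only hashes words into buckets), the Python list stands in for a vector database, and the sample passages are made up.

```python
import hashlib
import math

DIM = 64  # hypothetical vector size; real embedding models use far more dimensions


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        if not word:
            continue
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks: list[str]) -> list[tuple[list[float], str]]:
    """Build-time step: keep each chunk next to its vector (a toy 'vector store')."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]


# Made-up sample passages, for illustration only.
index = build_index([
    "The MaxSavers account is a savings account.",
    "Daily withdrawal limits are set per account type.",
])
```

Storing the original text next to each vector matters because the runtime flow returns passages, not vectors, to the application.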

The system is now ready for use by end-users.
End-users interact with a GenAI enabled application and enter a query.
The GenAI application receives the query and performs a search on the vector database to obtain the K pieces of information that most closely match the user's query (we call this the top K). For example, if the user's query is "What is the daily withdrawal limit on the MaxSavers account", the search may return passages such as "The MaxSavers account is…", "Daily withdrawal limits are…", and "…account limits…".
The top passages, along with a prompt curated for the specific application, are sent to the LLM.
The LLM returns a human-like response based on the user's query, prompt, and context information, which is then presented to the end-user.
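
The runtime steps above can be sketched as follows. This is an illustrative sketch only: embed is a toy word-count stand-in for an embedding model, the passages are made up, and the prompt wording in build_prompt is a hypothetical template rather than a recommended one. The call to the LLM itself is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved passages and the user query into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up passages, for illustration only.
passages = [
    "The MaxSavers account pays interest monthly.",
    "Daily withdrawal limits are set for each account.",
    "Branch opening hours vary by location.",
]
query = "What is the daily withdrawal limit on the MaxSavers account"
context = top_k(query, passages, k=2)
prompt = build_prompt(query, context)
```

The resulting prompt string is what the application would send to the LLM; the unrelated passage about branch hours is not among the top K and so never reaches the model.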

IBM Product Architecture

An illustration of how IBM watsonx Discovery, IBM watsonx Assistant, and the SaaS version of watsonx.ai realize the RAG solution architecture.

The mapping of the IBM Watson and watsonx family of products to the RAG pattern is shown in the diagram above.

watsonx Discovery implements the pre-processing, embedding generation, and relevancy storage and retrieval functions of the pattern. For certain types of solutions, watsonx Discovery can also be used as the front-end generative AI application for users. Beyond simply replacing a vector database, watsonx Discovery offers out-of-the-box NLP enrichments including entity extraction, sentiment analysis, emotion analysis, keyword extractions, category classification, concept tagging and others.

For chat solutions, watsonx Assistant provides the user interface and also conversational capabilities such as remembering the subject of previous queries. For example, if a user asks "Tell me about the Toast-o-matic" and then "How much is it?" watsonx Assistant knows that "it" in the last query refers to the toaster in the first.

Finally, watsonx.ai provides a selection of large language models clients can choose from in a cloud hosting environment. With watsonx.ai, clients can train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with ease and build AI applications in a fraction of the time with a fraction of the data.

On-premises / private deployments

Some clients do not have watsonx.ai available in their local region, or may have security concerns or regulatory requirements that prevent them from using the watsonx.ai SaaS solution. For these clients, we offer watsonx.ai as a set of containerized services that can be deployed on Red Hat OpenShift running within the clients' data centers, or within a virtual private cloud (VPC) within a cloud-service provider's infrastructure.

An illustration of how IBM watsonx Discovery, IBM watsonx Assistant, and watsonx.ai can be deployed on-premises to realize the RAG solution architecture.

Multiple language support

The majority of LLMs are trained on English language dominant text that contains a small percentage of text in other, often Western European, languages. Applications requiring multi-lingual or localized language support can implement pre- and post-query translation steps that translate inputs to the pre-processed documents' 'base' language (English, for example) and translate the model outputs to the target language (e.g. Spanish). This approach is shown in the diagram below.

A walkthrough of the RAG solution architecture illustrating the component interactions and flows that support multiple languages.

This approach alters the base RAG pattern as follows (setting the embedding generation steps aside):

A user enters a query in a language that differs from the dominant language of the pre-processed documentation. For example, the query is in Spanish and the documentation base is predominantly English.
The generative AI application prompts a large language model to translate the user query to the language of the documentation base. In our example, from Spanish to English.
The translated query is used to retrieve the top K passages of information that are most relevant to the user's query.
The translated query and the retrieved context are sent to the LLM to generate a response.
The generative AI application again uses a large language model to translate the generated response to the user's target language. In our example, from English to Spanish.
The translated response, in Spanish, is presented to the end-user.
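
The flow above can be sketched as a thin wrapper around the base RAG steps. This is a sketch under stated assumptions: the translate, retrieve, and generate callables are hypothetical placeholders for calls to a translation-capable LLM, the vector search, and the answering LLM, and the stubs below only tag the text so the order of operations is visible.

```python
from typing import Callable


def answer_in_user_language(
    query: str,
    user_lang: str,
    base_lang: str,
    translate: Callable[[str, str, str], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Translate in, run the base RAG steps, translate out."""
    base_query = translate(query, user_lang, base_lang)   # e.g. Spanish -> English
    passages = retrieve(base_query)                        # top K in the base language
    base_answer = generate(base_query, passages)           # response in the base language
    return translate(base_answer, base_lang, user_lang)    # e.g. English -> Spanish


# Hypothetical stubs that record calls instead of contacting real services.
calls: list[tuple[str, str]] = []


def fake_translate(text: str, source: str, target: str) -> str:
    calls.append((source, target))
    return f"[{source}->{target}] {text}"


def fake_retrieve(query: str) -> list[str]:
    return ["(passage about withdrawal limits)"]


def fake_generate(query: str, passages: list[str]) -> str:
    return f"answer to ({query}) from {len(passages)} passage(s)"


result = answer_in_user_language(
    "¿Cuál es el límite diario de retiro?", "es", "en",
    fake_translate, fake_retrieve, fake_generate,
)
```

Passing the three steps in as functions keeps the translation wrapper independent of whichever model or data store a solution actually uses.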

Experience suggests that this approach can achieve 80% or higher accuracy in non-base-language results, depending on the context and the types of queries being submitted. Emerging multi-language models, trained on larger percentages of different languages, are expected to achieve even higher accuracy.

Use Cases

RAG is a candidate solution for any business scenario where there is a large body of documentation and business rules that a user must consult to provide authoritative answers. It is also a strong solution for infusing LLM-based chatbots with proprietary or domain-specific knowledge and preventing hallucinations.

Candidate uses include:

Insurance underwriting and claims adjudication. RAG has many potential applications within the insurance industry. Underwriters and brokers require deep knowledge of thousands of pages of documentation covering the terms and conditions of hundreds of insurance products. Similarly, claims adjudicators may be required to have deep knowledge of the same documentation, as well as contracts with overrides and additional terms specific to individual clients. The RAG architecture pattern can serve as the architecture 'backbone' of solutions to assist underwriters, brokers, and adjusters with querying product and contract documentation to better respond to client queries and improve process productivity.
Call center agent support. Call center agents require deep knowledge of potentially hundreds of products and services, as well as commonly occurring product issues and their resolution. The RAG pattern is a strong architecture foundation on which to create solutions to assist agents in quickly finding answers to client requests.
Customer chatbots. RAG is a strong enabler for creating customer-facing chatbots to answer questions. Combining the natural language abilities of Large Language Models and the enterprise-specific responses of RAG can deliver a compelling, conversational customer experience. Note that RAG on its own only delivers question-and-answer capabilities; it does not have the ability to 'transact', i.e. interact with enterprise systems to pull information or update records. Additional components must be added to detect user intent and to interact with enterprise systems.
Support / helpdesk. Like call center agents, IT operations and support personnel require deep knowledge of the configuration of complex systems deployments along with knowledge of common and previously seen issues and their resolution. The RAG pattern is a strong architecture foundation on which to create solutions to assist support personnel with quickly finding relevant answers to reported problems and observed issues.

Architecture Decisions and Considerations

Many factors go into choosing a model that will work well for your project.

The model's license may restrict how it can be used. For example, a model's license may prevent it from being used as part of a commercial application.

The data set used to train the model has a direct impact on how well the model works for a specific application and significantly affects the risk that the model may generate nonsensical, offensive, or simply unwanted responses. Similarly, models trained on copyrighted or private data may open their users to legal liability. IBM provides full training data transparency and indemnification from legal claims arising from its models.

The size of the model (how many parameters it has) and the size of its context window (how long a passage of text the model can accept) affect model performance, resource requirements, and throughput. While it's tempting to go with a "bigger is better" philosophy and choose a 20 billion parameter model, the resource requirements and improvement (if any) in accuracy may not justify it. Recent studies have shown that smaller models can significantly outperform larger ones for some solutions.

Any fine-tuning applied to a model can affect its suitability for a task. For example, IBM offers two versions of the Granite model: one tuned for general chat applications, and another tuned to follow instructions.

Other considerations when choosing a model include:

Selection of model parameters, e.g. the model temperature, to balance the creation of human-like text and factual responses. Setting the model temperature to a low value will generate consistent but potentially uninteresting or overly terse responses, while setting the temperature to a high value will introduce more variety into the responses but will add unpredictability in the response length and content.
Selection and implementation of model guardrails to guard against ineffective or offensive results.
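
The effect of the temperature parameter can be illustrated with the standard temperature-scaled softmax, which is how temperature is commonly applied when sampling the next token. The logits below are hypothetical scores for three candidate tokens, chosen only for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # nearly all probability on the top token
high = softmax_with_temperature(logits, 2.0)  # probability spread across the tokens
```

At a temperature of 0.2 the top-scoring token is chosen almost every time, which gives consistent output; at 2.0 the other tokens are sampled often, which gives more varied output.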

The choice of model depends on the application, type of data and language support requirements. Embedding models may have to be extended to accurately encode and search on industry or client specific terms or acronyms.

Vector databases are only one option for implementing the embedding data store. Watson Discovery provides additional tools and functionality that can improve the performance and accuracy of a RAG solution, and some 'traditional' databases provide vector storage and searching, and/or similarity searching, that will support a RAG solution.

There are also numerous options for vector databases. Simple in-memory databases that are embedded directly in GenAI applications provide excellent run-time performance but may not scale well to large data sets, and may introduce significant operational challenges to keep current, or to scale to multi-server configurations. Other databases that use a central server architecture are easier to operate and scale but may not meet the performance needs of the specific solution.

There are a number of methods available to integrate the retrieval and generation models. Retrieving the top K passages and using them to augment the user query is simple and expedient but it can lack the nuance necessary to answer complex questions. Simple searching by keyword may also yield satisfactory results.

More complex solutions may use an LLM to generate multiple queries from the user's original query and use them to retrieve a larger set of passages. Additional logic may be added to further sort and select the retrieved passages with the highest relevancy.
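
One possible form of the sorting logic is sketched below using reciprocal rank fusion, a common way to merge several ranked lists. This is one option among many, not a method the text prescribes. The passage identifiers and the three rankings (one per generated query) are hypothetical, and k = 60 is the conventional default for this technique.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each passage scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by passage id for a stable order.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical retrieval results for three LLM-generated variants of one user query.
rankings = [
    ["p1", "p2", "p3"],
    ["p2", "p4", "p1"],
    ["p2", "p1", "p5"],
]
fused = reciprocal_rank_fusion(rankings)
top_passages = fused[:3]
```

Passages that rank well across several query variants rise to the top, while passages returned by only one variant fall toward the end of the merged list.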

Preprocessing the data before feeding it into the RAG system is an important step to ensure that the input data is in a suitable format for the model. Simple methods involve breaking input data into fixed-size chunks with overlaps, e.g. the last 10 characters of a chunk are the same as the first 10 characters of the next one, but this can miss nuances in the input data.
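
Fixed-size chunking with overlap can be sketched as follows. The chunk size of 40 characters is an arbitrary value chosen for the illustration; the 10-character overlap matches the example in the text.

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 100-character input so the boundaries are easy to check.
text = "".join(chr(ord("a") + i % 26) for i in range(100))
chunks = chunk_text(text, size=40, overlap=10)
```

The overlap means a sentence that straddles a chunk boundary still appears, at least partly, in both neighbouring chunks, at the cost of storing some text twice.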

More advanced pre-processing could manipulate the input text to remove common word endings, e.g. stopper, stopping, and stopped all become stop; remove uninformative 'stop' words such as the, as, is, and the like; and apply other techniques. These can substantially improve the relevancy of retrieved information but add complexity to both the data embedding and user prompting phases.
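
Both techniques can be illustrated with a deliberately crude sketch. The suffix list, the stop-word set, and the naive_stem rule are simplifications invented for this example; a production solution would more likely use an established stemmer or lemmatizer from an NLP library.

```python
STOP_WORDS = {"the", "as", "is", "a", "an", "and", "of", "to"}  # tiny illustrative set
SUFFIXES = ("ing", "ed", "er")  # only the endings used in the example


def naive_stem(word: str) -> str:
    """Strip one known suffix, then collapse a doubled final consonant."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # "stopp" -> "stop"
            return stem
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and stop words, and stem what remains."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


tokens = normalize("The machine stopped and the stopper is stopping the flow.")
```

The same normalization has to be applied to the documents at build time and to the user query at runtime, which is where the added complexity in both phases comes from.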

Even more advanced techniques may operate on full sentences, to keep as much of the meaning as possible in the text.
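
A sentence-based approach might look like the sketch below, which groups whole sentences into chunks up to a size limit. The regular expression is a naive sentence splitter (it breaks after ., ! or ? followed by whitespace) and the 30-character limit is chosen only to keep the example small; both are assumptions for illustration.

```python
import re


def sentence_chunks(text: str, max_chars: int = 80) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it would overflow
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = sentence_chunks(
    "Alpha one. Beta two is here. Gamma three! Delta four?", max_chars=30
)
```

No sentence is ever cut in half; a single sentence longer than the limit simply becomes its own oversized chunk.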

Evaluating the performance of a RAG system can be challenging due to the complex nature of the task. Common evaluation metrics include perplexity, fluency, relevance, and coherence, as well as the BLEU and ROUGE metrics. It's important to choose metrics that align with the specific goals of the task and the desired outcomes.
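
As a rough illustration of how an overlap metric works, the sketch below computes a simplified ROUGE-1 style score (unigram precision, recall, and F1) between a generated answer and a reference answer. It splits on whitespace only and skips the stemming and other options of full ROUGE implementations; the two sentences are made up.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up generated answer and reference answer.
scores = rouge1(
    "the daily limit is 500 dollars",
    "the daily withdrawal limit is 500 dollars",
)
```

Word-overlap scores of this kind reward matching wording rather than factual correctness, which is one reason to pair them with relevance-oriented measures.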

RAG requires plain text, and the choice of conversion methods has a big impact on data quality. For example, when converting PDF files, consider how tables, images, and other metadata elements are handled.

Generating a human-like response from an LLM requires substantial computing resources and can often take several seconds depending on the size of the model, the complexity of the user query, and the amount of augmented information passed to the model. Solutions that need to service large groups of users or require quick response times may need to implement a mechanism to cache model responses to frequently occurring queries.
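
A response cache of the kind described above can be sketched as follows. This is a minimal in-memory illustration: ResponseCache and its key normalization are hypothetical, slow_llm is a stub standing in for the full retrieve-and-generate call, and a real deployment would also need expiry so that cached answers do not outlive changes to the underlying documents.

```python
from typing import Callable


class ResponseCache:
    """Cache model responses keyed by a normalized form of the query."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._store: dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(query: str) -> str:
        # Ignore case, extra whitespace, and trailing punctuation.
        return " ".join(query.lower().split()).rstrip("?!. ")

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = self._generate(query)  # the slow, expensive path
        self._store[key] = response
        return response


llm_calls: list[str] = []


def slow_llm(query: str) -> str:
    """Stub for the expensive model call; it just counts invocations."""
    llm_calls.append(query)
    return f"response #{len(llm_calls)}"


cache = ResponseCache(slow_llm)
first = cache.answer("What is the daily withdrawal limit?")
second = cache.answer("  what is the DAILY withdrawal limit ")
```

Exact-match keys only catch near-identical wording; matching paraphrased queries would need a similarity search over cached queries, which is a separate design decision.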

Embedding proprietary, potentially confidential, and potentially personally identifiable information into LLM prompts is core and necessary to the RAG pattern. Organizations using hosted model platforms must be aware of provider policies such as prompt data retention and usage policies (e.g. does the provider capture prompt data and use it for model re-training?) and controls to prevent prompt data from 'leaking' to other users, and must balance these against their own information security policies and controls.

While transmission of some proprietary information is unavoidable, organizations can limit their exposure by including only document or URL references to the most sensitive information in the processed data. For example, rather than embedding a price discounting table into the RAG data, include only a description of the table and a reference or link to an internal document or website in the content.

Simple Transport Layer Security (TLS) on inter-zone communications may be enough to satisfy data security requirements, but architects may need to consider providing additional protection by adding components to encrypt and decrypt prompts and responses before passing them across the zone boundary.

The type of connection between the deployment zones has impacts on several non-functional requirements. Using a virtual private network (VPN) connection over the public Internet is a low-cost option but it may not fully allay security concerns, and may not be able to meet the solution's response time or throughput requirements. A private network connection to the model hosting environment comes at a much higher cost but offers significantly better security and provides architects with the ability to control for network latency and bandwidth.

Resources

Quick start: Prompt a foundation model with the retrieval-augmented generation pattern

You can use foundation models in IBM watsonx.ai to generate factually accurate output grounded in information in a knowledge base by applying the retrieval-augmented generation pattern.

IBM Generative AI Architecture

The complete IBM Generative AI Architecture is available in IBM IT Architect Assistant (IIAA), an architecture development and management tool. Using IIAA, architects can elaborate and customize the architecture to create their own generative AI solutions.