The RAG architecture, shown in the diagram above, can be partitioned into two sections: a data preparation and ingestion pipeline that runs before the system goes live, and a query-time pipeline that retrieves relevant context and generates answers for end-users.
The typical path taken to create a RAG solution is as follows:
An AI Engineer prepares the client data (for example, procedure manuals, product documentation, or help desk tickets) during Data Preprocessing. Client data is transformed and/or enriched to make it suitable for model augmentation. Transformations might include simple format conversions, such as converting PDF documents to text, or more complex transformations, such as translating complex table structures into if-then statements. Enrichment may include expanding common abbreviations, adding metadata such as currency information, and other additions that improve the relevancy of search results.
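As a minimal sketch of this step, the snippet below extracts text from a PDF and expands a few abbreviations. It assumes the pypdf library is installed; the abbreviation map and file name are hypothetical illustrations, not part of any specific solution:

```python
from pypdf import PdfReader

# Hypothetical abbreviation map used for enrichment; a real solution
# would curate this from the client's domain vocabulary.
ABBREVIATIONS = {"w/d": "withdrawal", "acct": "account", "min.": "minimum"}

def pdf_to_text(path: str) -> str:
    """Convert a PDF document to plain text, page by page."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def expand_abbreviations(text: str) -> str:
    """Enrich text by expanding abbreviations to improve search relevancy."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text

raw_text = pdf_to_text("product_documentation.pdf")  # hypothetical file
clean_text = expand_abbreviations(raw_text)
```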
The enriched information from the documents is broken down into smaller segments called chunks to manage the text more efficiently. This can be done using a number of chunking strategies, each of which has its own way of separating and grouping the information.
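One common strategy, shown here only as an illustration, is fixed-size chunking with overlap; the sizes are arbitrary choices, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly,
    so sentences cut at a boundary still appear intact in a neighbor."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text(clean_text)
```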
The chunks are then ingested into a database, where they are stored and accessed while the solution is in use. Different retrieval strategies require different types of databases, and some even require multiple types. A basic RAG solution, however, uses only a vector database. In this case an embedding model converts the chunks into a series of vectors, which are then stored in the Vector DB.
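Continuing the sketch, the ingestion step might look like the following, assuming the sentence-transformers library and using a plain NumPy array as a stand-in for a real vector database; the model name is one common choice, not a requirement of the architecture:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embedding model; "all-MiniLM-L6-v2" is one popular lightweight option.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Embed each chunk into a vector. A production system would write these
# to a dedicated vector database rather than an in-memory array.
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)
vector_db = np.asarray(chunk_vectors)  # shape: (num_chunks, embedding_dim)
```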
The system is now ready for use by end-users.
End-users interact with a GenAI-enabled application and enter a query.
The GenAI application receives the query and executes the employed retrieval strategy to obtain the top K passages (where K is simply a placeholder for a number) that the strategy deems to most closely match the user's query. For example, if the user's query is 'What is the daily withdrawal limit on the MaxSavers account,' the search may return passages such as 'The MaxSavers account is…,' 'Daily withdrawal limits are…,' and '…account limits…'.
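The simplest retrieval strategy, continuing the sketch above, is a cosine-similarity search over the stored vectors; K=3 here is an arbitrary illustration:

```python
def retrieve_top_k(query: str, k: int = 3) -> list[str]:
    """Embed the query and return the k chunks with the highest
    cosine similarity (a dot product, since vectors are normalized)."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vector_db @ query_vector
    top_indices = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_indices]

passages = retrieve_top_k(
    "What is the daily withdrawal limit on the MaxSavers account?"
)
```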
The original query, the top passages, and an instruction prompt, curated for the specific application, are sent to the LLM.
The LLM then follows the instruction prompt to generate and format an answer to the original user query using the retrieved information.
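A sketch of these last two steps, assembling the prompt and calling an LLM, is shown below. The OpenAI client is used only as an example provider, and the instruction wording and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

query = "What is the daily withdrawal limit on the MaxSavers account?"
context = "\n\n".join(passages)

# A simple application-specific instruction prompt; real systems tune this.
instructions = (
    "Answer the user's question using only the provided passages. "
    "If the passages do not contain the answer, say so."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```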