This section covers tools and considerations for the storage and retrieval of the RAG document base. We treat the two topics jointly because the storage technology constrains the retrieval mechanisms available to query the stored data.
IBM Watsonx Discovery, powered by IBM Cloud Databases for Elasticsearch, is a complete enterprise solution designed to support and enhance search and retrieval-augmented generation (RAG) use cases. With robust capabilities for vector search, semantic search, federated search, and more, Watsonx Discovery provides a rich set of tools for building contextual, AI-driven applications. It integrates seamlessly with IBM’s conversational AI platform Watsonx Assistant and orchestration layers like Watsonx Orchestrate to deliver advanced business solutions.
Watsonx Discovery leverages IBM Cloud Databases for Elasticsearch, offering a powerful and flexible backend for managing your enterprise search needs:
Monitoring and Health Checks: With Elasticsearch, developers gain access to comprehensive monitoring and health-check tools to verify that embeddings, metadata, chunk sizes, and document structures are well tuned. The managed service console handles management and scaling, so teams can focus on application development rather than operational tasks like backups, logging, monitoring, and setup.
Advanced Search Capabilities: Watsonx Discovery supports a variety of advanced search algorithms, including sparse and dense semantic search, metadata-filtered search, lexical search, boolean search (AND/OR/NOT/SLOP), prefix search, fuzzy search, and hybrid search. These capabilities are critical for challenging RAG use cases involving complex queries and documents; a sketch of a hybrid query appears after this list. Implement the advanced search types outlined in this Notebook to optimize your deployment.
Integration with LLM Frameworks: Elasticsearch’s compatibility with popular LLM orchestration frameworks, such as LangChain and LlamaIndex, makes it ideal for integrating with Watsonx Discovery to power RAG solutions.
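To make the hybrid search capability concrete, the sketch below combines a lexical (BM25) match with an approximate kNN vector search in a single request using the official Elasticsearch Python client. The endpoint, credentials, index name, field names, and the embed() helper are illustrative assumptions about a particular deployment, not product defaults.

from elasticsearch import Elasticsearch

# Connect to the IBM Cloud Databases for Elasticsearch deployment
# (endpoint and credentials below are placeholders).
es = Elasticsearch(
    "https://<your-deployment-endpoint>:31682",
    basic_auth=("admin", "<password>"),
)

question = "How do I request parental leave?"
query_vector = embed(question)  # assumed helper that returns the dense embedding

# Hybrid search: a BM25 lexical leg plus approximate kNN over the
# "chunk_embedding" field; Elasticsearch blends the two score sets.
response = es.search(
    index="hr-policies",                        # assumed index name
    query={"match": {"chunk_text": question}},  # lexical leg (assumed text field)
    knn={
        "field": "chunk_embedding",             # assumed dense_vector field
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 50,
    },
    size=5,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["chunk_text"][:80])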
Watsonx Discovery is a versatile solution that supports a broad range of search methodologies:
The ELSER (Elastic Learned Sparse EncodeR) model is a leading prebuilt sparse retrieval model used for semantic search, enabling contextual search without the need for custom embedding models or vector databases.
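As a rough illustration, the sketch below issues a text_expansion query against tokens produced by an ELSER ingest pipeline. The index name, the ml.tokens field, and the .elser_model_2 model ID follow common Elastic conventions but should be treated as assumptions about your deployment.

from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://<your-deployment-endpoint>:31682",
    basic_auth=("admin", "<password>"),
)

# Semantic (sparse) retrieval with ELSER: the query text is expanded into
# weighted tokens by the model and matched against tokens generated at
# ingest time (assumed here to be stored in "ml.tokens").
response = es.search(
    index="hr-policies",                        # assumed index name
    query={
        "text_expansion": {
            "ml.tokens": {                      # assumed ELSER token field
                "model_id": ".elser_model_2",   # assumed deployed model ID
                "model_text": "How do I request parental leave?",
            }
        }
    },
    size=5,
)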
Watsonx Discovery supports federated search across existing deployments, reducing the need to store and maintain duplicate data. This enables enterprise-wide search and retrieval across multiple data silos.
Watsonx Discovery simplifies content extraction and indexing through a rich suite of data ingest tools:
For more information, explore the following ElasticSearch resources to accelerate your Watsonx Discovery development:
Watsonx Discovery integrates with Kibana and Enterprise Search to provide scalable, AI-driven information retrieval and visualization, tailored to large enterprise environments:
Kibana for Data Visualization: Create custom dashboards and visualizations based on data ingested by Watsonx Discovery. It offers real-time analytics on key metrics like entity extraction, sentiment analysis, and document classification. This helps track trends, patterns, and anomalies in unstructured data, streamlining the analysis process and improving decision-making.
Enterprise Search for Scalable Retrieval: Leverage NLP and semantic search to enable fast, accurate retrieval across structured and unstructured data sources. Customize search functionality to index and retrieve from multiple data sources, including documents, emails, and databases, to create domain-specific search applications that integrate directly into business workflows.
Watsonx Discovery provides native support for role-based and attribute-based access control to secure content access and protect data privacy. This ensures that sensitive information is accessible only to authorized users, safeguarding enterprise data.
Watsonx Discovery can be coupled with Watsonx Orchestrate to enhance automation workflows through context-aware data retrieval. When Orchestrate automates a business process that requires insights from unstructured data, it leverages Discovery to retrieve and analyze relevant content, enabling it to perform intelligent tasks such as:
IBM watsonx.data is a powerful data platform built on an open lakehouse architecture, combining the strengths of both data warehouses and data lakes. It offers a single point of entry to access all your data through a shared and open metadata layer, making it an ideal solution for organizations seeking to streamline their data management. With support for open data formats, integrated vectorized embedding capabilities, and a generative AI-powered conversational interface for data insights, watsonx.data enhances real-time analytics and AI use cases. The platform integrates seamlessly with existing databases and tools while offering flexible deployment options, including cloud and on-premises configurations.
A key component of IBM watsonx.data is watsonx.data Milvus, an open-source vector database specifically designed to store, manage, and transfer high-dimensional vector data such as vector embeddings. This highly configurable vector database is optimized for indexing and retrieving vectors efficiently, making it ideal for use cases that require vector similarity searches in AI applications. Watsonx.data Milvus is particularly beneficial for customers leveraging watsonx.ai who need seamless integration with vector database capabilities, as well as for those interested in implementing open-source frameworks like LangChain, often used in Retrieval-Augmented Generation (RAG) patterns.
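As a rough illustration of that integration, the sketch below indexes a couple of document chunks into a Milvus collection through LangChain and runs a similarity search. The embedding model, collection name, and connection arguments are placeholders; a watsonx.data Milvus instance supplies its own host, port, credentials, and TLS settings through its service credentials.

from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

# Example chunks to index (normally produced by a document loader and text splitter).
docs = [
    Document(page_content="Employees may request up to 12 weeks of parental leave.",
             metadata={"state": "NY"}),
    Document(page_content="Time-off requests must be submitted two weeks in advance.",
             metadata={"state": "CA"}),
]

# Any embedding model supported by LangChain will do; this one is an assumption.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Connection arguments are placeholders; real watsonx.data Milvus deployments
# provide host, port, user, password, and TLS settings in the service credentials.
vector_store = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="hr_policies",
    connection_args={"host": "<milvus-host>", "port": "<milvus-port>",
                     "secure": True, "user": "<user>", "password": "<password>"},
)

results = vector_store.similarity_search("How much parental leave is available?", k=2)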
IBM watsonx.data is available in three deployment options to suit diverse needs:
For organizations looking to leverage vector databases, watsonx.data supports integration with open-source Milvus, enabling advanced capabilities such as vector similarity search, which is essential for real-time AI applications.
To start using IBM watsonx.data, choose the deployment option that best fits your requirements:
For SaaS deployment: provision an instance through the provider catalog.
For on-premises deployment: follow the installation steps in the documentation referenced below.
IBM provides extensive documentation to help you get started with both the core watsonx.data platform and the Milvus vector database:
By leveraging watsonx.data’s open architecture, integrated vector database capabilities, and advanced AI features, organizations can optimize data engineering processes, improve real-time analytics, and unlock new insights through AI-powered capabilities. This unified platform empowers teams to work with structured and unstructured data in innovative ways, bringing together traditional analytics and state-of-the-art vector similarity search under a single solution.
IBM Watson Discovery is a tool designed to retrieve and analyze unstructured data. One of its main features is keyword search, which enables users to search through large datasets using specific terms or phrases. Unlike traditional search engines that rely on literal text matching, Watson Discovery uses the context behind the keywords to deliver more relevant results.
Another important capability of Watson Discovery is entity extraction: Watson Discovery can automatically extract key pieces of information such as people, locations, organizations, and dates from unstructured text. This process structures the data so that it is digestible and can be incorporated directly into RAG responses.
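For reference, a minimal natural-language query against a Watson Discovery project using the ibm-watson Python SDK might look like the sketch below; the API key, service URL, and project ID are placeholders, and the enrichments returned depend on how the project is configured.

from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials, taken from the Watson Discovery service instance.
authenticator = IAMAuthenticator("<api-key>")
discovery = DiscoveryV2(version="2023-03-31", authenticator=authenticator)
discovery.set_service_url("<service-url>")

# Natural-language query; entity enrichments appear in each result's
# "enriched_text" field when entity extraction is enabled for the project.
response = discovery.query(
    project_id="<project-id>",
    natural_language_query="Which contracts mention Acme Corporation?",
    count=5,
).get_result()

for result in response.get("results", []):
    print(result.get("document_id"))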
Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and passed to the LLM when doing the generation step.
LangChain supports many different retrieval algorithms. Beyond basic methods such as semantic search, it layers additional algorithms on top of a base retriever to increase performance, such as the multi-query retrieval technique sketched below.
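One example of such a layered algorithm is multi-query retrieval, in which an LLM rewrites the user's question into several variants and the merged results are returned. The sketch below uses LangChain's MultiQueryRetriever over the Milvus vector store built in the earlier sketch; the choice of chat model is a placeholder, and any LangChain-compatible retriever and LLM could be substituted.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI  # placeholder LLM; any LangChain chat model works

# "vector_store" is assumed to be an existing LangChain vector store,
# for example the Milvus store built in the earlier sketch.
base_retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# The LLM generates multiple phrasings of the question; results from each
# phrasing are de-duplicated and merged before being passed to generation.
retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

docs = retriever.invoke("How much notice is required for a time-off request?")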
Milvus is an open-source vector database built for similarity search over embeddings extracted from unstructured documents, and it is often used alongside Watsonx Discovery for certain retrieval scenarios. It adapts to various data formats and index types, making it well suited to large-scale processing workflows. For an AI engineer, it offers a scalable solution that reduces the manual effort of building retrieval pipelines while maintaining precision and efficiency.
Milvus is primarily a vector database designed for similarity search across large datasets, excelling in particular at high-dimensional vectors. Watsonx Discovery, on the other hand, specializes in document understanding and processing, with a particular emphasis on conversational and semantic search capabilities.
Milvus offers advanced functionality such as sparse-vector search, bulk vector search, filtered search, and hybrid search. It also features a distributed architecture that allows scaling both out/in and up/down, making it highly scalable for large datasets.
Milvus can handle large-scale vector data and supports high-performance vector similarity searches, essential for RAG solutions that retrieve information from large datasets.
Milvus integrates seamlessly with watsonx.data, simplifying data management for AI models and applications. This enables scalable RAG use cases across large sets of governed data.
Milvus supports advanced features like metadata filtering, enhancing search result relevance and improving RAG system effectiveness.
A basic vector store works well for simple RAG solutions, but it quickly breaks down when there are multiple, similar pieces of information in the document (similar is not the same as relevant), when the information needed to answer a query is spread across multiple documents, or when it is distributed across multiple sections of a document. A few examples demonstrate the problem.
Example 1
A business operates offices in multiple states, each with its own HR policies. The organization uses a common document format for HR information across all its offices. If the company embedded its HR policies in a vector database, there would be multiple chunks for, say, requesting time off, each with different supporting text, making hallucinations highly likely.
Example 2
A services contract has a glossary of defined terms that are capitalized throughout the contract to denote a reference to the definition, e.g., the Lessor, the Evaluation Period, etc. A naive approach that parsed and chunked the contract from beginning to end would result in interactions like the following:
Query: How much time is Acme allowed before they will be billed after the services are complete?
Response: The Purchaser is allowed 10 days grace after the Evaluation Period before they will be billed for the services.
Helpful, but guaranteed to create follow-on queries (who is the Purchaser? how long is the Evaluation Period?) and to require some quick addition to arrive at a more complete answer.
Example 3
A state government publishes amendments to its legislation as 'edits' to existing laws. For example, "Section 12.2 of the Narcotics Control Act is amended to exclude Cannabis and Cannabis-related products". How will a vector database determine that this snippet of text should take precedence over all other chunks dealing with the topic of Cannabis?
A document hierarchy organizes document chunks into categories that group related documents together; similar to a table of contents or a set of folders on a computer. Using a document hierarchy enables a RAG solution to 'narrow down' the subject of a user query so that it can retrieve only the most relevant documents.
Applying this to the multi-state HR document example above, a useful document hierarchy would categorize the documents by state, enabling the RAG solution to retrieve only the document chunks relevant to a specific state. As shown in the figure below, a possible hierarchy could organize documents first by country, then by the applicable state or province, then by policy area or topic. Of course, this means the solution will need either to infer the relevant state from the user's listed state of residence or to prompt the user to provide the state of interest.
Document hierarchies can be implemented in hybrid vector stores (vector stores, such as Milvus, whose schemas support numeric and categorical fields alongside vectors) by using category and sub-category fields as part of the search criteria, as shown in the Milvus schema definition below.
from pymilvus import CollectionSchema, DataType, FieldSchema

# Primary key for each chunk; Milvus generates the ID automatically.
chunk_id = FieldSchema(
    name="chunk_id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=True,
)

# Categorical field used to narrow retrieval to a single state.
state_id = FieldSchema(
    name="state_id",
    dtype=DataType.INT64,
)

# Embedding vector for the chunk (dim=2 is illustrative; use your embedding model's dimension).
chunk = FieldSchema(
    name="chunk",
    dtype=DataType.FLOAT_VECTOR,
    dim=2,
)

schema = CollectionSchema(
    fields=[chunk_id, state_id, chunk],
    description="HR documents by state",
)
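A minimal usage sketch for this schema follows: it creates the collection, inserts two toy chunks tagged with different states, builds an index, and restricts a similarity search to one state with a metadata filter. The endpoint, the state codes, the index parameters, and the 2-dimensional vectors are illustrative only.

from pymilvus import Collection, connections

connections.connect(host="<milvus-host>", port="<milvus-port>")  # placeholder endpoint

# Create the collection from the schema defined above.
collection = Collection(name="hr_documents_by_state", schema=schema)

# Column-based insert of the non-auto fields (state_id, chunk); the toy
# 2-dimensional vectors match the dim declared above.
collection.insert([
    [36, 6],                              # state_id values (e.g., New York, California)
    [[0.12, 0.88], [0.95, 0.05]],         # chunk embeddings
])

# Build an index and load the collection before searching.
collection.create_index(
    field_name="chunk",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Similarity search restricted to one state's documents via the state_id filter.
results = collection.search(
    data=[[0.10, 0.90]],
    anns_field="chunk",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=3,
    expr="state_id == 36",
)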
Knowledge graphs capture the relationships between entities in a document or across documents. Unlike a similarity search in a vector database, a knowledge graph can consistently and accurately retrieve relevant relationships and content, which can significantly reduce the occurrence of hallucinations. When paired with vector database similarity searches, graph databases can enable RAG solutions to piece together related content from within a document and/or across a document set. The image below shows a potential knowledge graph for a services contract.
While the best knowledge graph for a RAG solution is human-crafted and maintained, using a large language model (LLM) to parse the entities and relations in a document and to create a knowledge graph yields surprisingly good results.
Neo4j is a popular open-source (GPLv3) graph database. This tutorial takes the reader through configuring and coding a knowledge graph-enhanced RAG solution with Neo4j and LangChain.
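As a rough sketch of the LLM-driven approach described above (not the tutorial's own code), the snippet below uses LangChain's experimental LLMGraphTransformer to extract entities and relationships from a contract clause and writes the resulting graph to Neo4j. The chat model and the Neo4j connection details are placeholders.

from langchain_core.documents import Document
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI  # placeholder LLM; any capable chat model works

contract_text = (
    "The Lessor shall invoice the Purchaser within 10 days of the end of "
    "the Evaluation Period for services rendered."
)

# The LLM extracts (node, relationship, node) triples from the text.
transformer = LLMGraphTransformer(llm=ChatOpenAI(model="gpt-4o", temperature=0))
graph_documents = transformer.convert_to_graph_documents(
    [Document(page_content=contract_text)]
)

# Persist the extracted graph to a Neo4j instance (placeholder credentials).
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="<password>")
graph.add_graph_documents(graph_documents)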