Storage and Retrieval

Overview

This section covers tools and considerations for the storage and retrieval of the RAG document base. We deal with the two topics jointly since the storage technology limits the retrieval mechanisms available to query and retrieve stored data.

IBM Tools

IBM Watsonx Discovery: Comprehensive Enterprise Search & RAG Solution

IBM Watsonx Discovery, powered by IBM Cloud Databases for Elasticsearch, is a complete enterprise solution designed to support and enhance search and retrieval-augmented generation (RAG) use cases. With robust capabilities for vector search, semantic search, federated search, and more, Watsonx Discovery provides a rich set of tools for building contextual, AI-driven applications. It integrates seamlessly with IBM’s conversational AI platform Watsonx Assistant and orchestration layers like Watsonx Orchestrate to deliver advanced business solutions.

Why Use Watsonx Discovery with Elasticsearch?

Watsonx Discovery leverages IBM Cloud Databases for Elasticsearch, offering a powerful and flexible backend for managing your enterprise search needs:

  • Monitoring and Health Checks: With Elasticsearch, developers gain access to comprehensive monitoring and health-check tools to ensure that the embeddings, metadata, chunk size, and document structures are optimized. The Elasticsearch webpage enables seamless management and scaling, so teams can focus on application development rather than operational tasks like backups, logging, monitoring, and setup.

  • Advanced Search Capabilities: Watsonx Discovery supports a variety of advanced search algorithms, such as sparse and dense semantic search, metadata-filtered search, lexical search, boolean search (AND/OR/NOT/SLOP), prefix search, fuzzy search, and hybrid search. These capabilities are critical for challenging RAG use cases involving complex queries and documents. Implement the advanced search types outlined in this Notebook to optimize your deployment.

  • Integration with LLM Frameworks: Elasticsearch’s compatibility with popular LLM orchestration frameworks, such as LangChain and LlamaIndex, makes it ideal for integrating with Watsonx Discovery to power RAG solutions.
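
As a minimal sketch of how several of the search types above can be combined in one Elasticsearch request, the snippet below builds a search body with a fuzzy lexical match, a metadata filter, and a dense kNN clause as a plain Python dictionary. The field names (`text`, `state`, `embedding`), the filter value, and the toy three-dimensional query vector are assumptions for illustration; a real deployment would use its own index mapping and the output of an embedding model.

```python
import json


def build_search_body(question, query_vector, state, top_k=5):
    """Build an Elasticsearch search body mixing lexical, filtered, and kNN search.

    Field names ("text", "state", "embedding") are hypothetical.
    """
    metadata_filter = {"term": {"state": state}}
    return {
        "size": top_k,
        # Lexical part: fuzzy full-text match restricted by a metadata filter.
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": {"query": question, "fuzziness": "AUTO"}}}
                ],
                "filter": [metadata_filter],
            }
        },
        # Dense part: approximate kNN over the embedding field, same filter.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": 10 * top_k,
            "filter": metadata_filter,
        },
    }


body = build_search_body("requesting time off", [0.1, 0.2, 0.3], state="NY")
print(json.dumps(body, indent=2))
```

The resulting dictionary is intended as the request body of an index's `_search` endpoint. The ratio of `num_candidates` to `k` is a tuning choice assumed here, not a recommendation from the source.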

Search Methods Supported by Watsonx Discovery

Watsonx Discovery is a versatile solution that supports a broad range of search methodologies:

  • Keyword Search
  • Semantic Search
  • Vector Search
  • Hybrid Search (Reciprocal Rank Fusion)
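
To make the hybrid entry concrete, here is a small, self-contained sketch of Reciprocal Rank Fusion: each document receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The function name, the document IDs, and the constant `k = 60` are assumptions for illustration; this shows the general formula, not the internals of any particular search engine.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a keyword search
vector_hits = ["doc_c", "doc_a", "doc_d"]    # e.g. from a vector search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because only ranks are used, the two result lists do not need comparable score scales, which is the main reason this fusion method suits mixing keyword and vector results.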

ELSER (Elastic Learned Sparse EncodeR) is a prebuilt sparse retrieval model from Elastic used for semantic search, enabling contextual search without the need for a custom vector database.

Federated Search

Watsonx Discovery supports federated search across existing deployments, reducing the need to store and maintain duplicate data. This enables enterprise-wide search and retrieval across multiple data silos.

Rich Data Ingest and Indexing Capabilities

Watsonx Discovery simplifies content extraction and indexing through a rich suite of data ingest tools:

  • Content Extraction: Supports a variety of file formats, including DOCX, PDF, and PPTX, making it easy to discover and index business-critical information.
  • Third-Party Integration: Connect to a wide range of data sources using prebuilt connectors, allowing for seamless integration into your enterprise data ecosystem.
  • Data Transformation and Enrichment: Implement powerful ingest pipelines to transform and enrich data, ensuring it is ready for high-quality retrieval and insights.

For more information, explore the following Elasticsearch resources to accelerate your Watsonx Discovery development:

Enhanced Visualization and Enterprise Search

Watsonx Discovery integrates with Kibana and Enterprise Search to provide scalable, AI-driven information retrieval and visualization, tailored to large enterprise environments:

  • Kibana for Data Visualization: Create custom dashboards and visualizations based on data ingested by Watsonx Discovery. It offers real-time analytics on key metrics like entity extraction, sentiment analysis, and document classification. This helps track trends, patterns, and anomalies in unstructured data, streamlining the analysis process and improving decision-making.

  • Enterprise Search for Scalable Retrieval: Leverage NLP and semantic search to enable fast, accurate retrieval across structured and unstructured data sources. Customize search functionality to index and retrieve from multiple data sources, including documents, emails, and databases, to create domain-specific search applications that integrate directly into business workflows.

Privacy and Security

Watsonx Discovery provides native support for role-based and attribute-based access control to secure content access and protect data privacy. This ensures that sensitive information is accessible only to authorized users, safeguarding enterprise data.

Use Cases: Watsonx Discovery and Orchestrate

Watsonx Discovery can be coupled with Watsonx Orchestrate to enhance automation workflows through context-aware data retrieval. When Orchestrate automates a business process that requires insights from unstructured data, it leverages Discovery to retrieve and analyze relevant content, enabling it to perform intelligent tasks such as:

  • Responding to queries with accurate, data-backed answers.
  • Executing workflows that depend on real-time data retrieval and insights.

IBM watsonx.data: A Modern Data Store with Integrated Vector Database Capabilities

IBM watsonx.data is a powerful data platform built on an open lakehouse architecture, combining the strengths of both data warehouses and data lakes. It offers a single point of entry to access all your data through a shared and open metadata layer, making it an ideal solution for organizations seeking to streamline their data management. With support for open data formats, integrated vectorized embedding capabilities, and a generative AI-powered conversational interface for data insights, watsonx.data enhances real-time analytics and AI use cases. The platform integrates seamlessly with existing databases and tools while offering flexible deployment options, including cloud and on-premises configurations.

A key component of IBM watsonx.data is watsonx.data Milvus, an open-source vector database specifically designed to store, manage, and transfer high-dimensional vector data such as vector embeddings. This highly configurable vector database is optimized for indexing and retrieving vectors efficiently, making it ideal for use cases that require vector similarity searches in AI applications. Watsonx.data Milvus is particularly beneficial for customers leveraging watsonx.ai who need seamless integration with vector database capabilities, as well as for those interested in implementing open-source frameworks like LangChain, often used in Retrieval-Augmented Generation (RAG) patterns.
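
The idea of a vector similarity search can be sketched in a few lines of plain Python. The example below scans every stored vector and ranks them by cosine similarity to the query. The document names and three-dimensional vectors are made up for illustration, and the exhaustive scan is only a teaching device: a vector database such as Milvus would normally answer the same kind of query through an index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; real ones come from an embedding model.
embeddings = {
    "vacation_policy": [0.9, 0.1, 0.0],
    "expense_policy": [0.1, 0.9, 0.0],
    "sick_leave_policy": [0.8, 0.2, 0.1],
}
nearest = top_k([1.0, 0.0, 0.0], embeddings, k=2)
print(nearest)
```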

Deployment Options

IBM watsonx.data is available in three deployment options to suit diverse needs:

  • Fully Managed SaaS: Available on IBM Cloud and AWS, providing a hands-off approach to setup and scaling.
  • Self-Managed Containerized Software: Ideal for on-premises environments, leveraging IBM Cloud Pak for Data and Red Hat OpenShift.
  • Small Footprint Developer Version: Suitable for development environments using single VMs or laptops, providing the flexibility to test and prototype locally.

For organizations looking to leverage vector databases, watsonx.data supports integration with open-source Milvus, enabling advanced capabilities such as vector similarity search, which is essential for real-time AI applications.

Getting Started

To start using IBM watsonx.data, choose the deployment option that best fits your requirements:

  • For IBM Cloud Deployment:
      • Visit the IBM Cloud catalog page for watsonx.data.
      • Select a plan that aligns with your needs.
      • Provision an instance.

  • For On-Premises Deployment:
      • Obtain activation keys for IBM watsonx.data within IBM Cloud Pak for Data.
      • Get Red Hat OpenShift entitlements.
      • Follow the deployment guide to configure the software.

Documentation and Support

IBM provides extensive documentation to help you get started with both the core watsonx.data platform and the Milvus vector database:

By leveraging watsonx.data’s open architecture, integrated vector database capabilities, and advanced AI features, organizations can optimize data engineering processes, improve real-time analytics, and unlock new insights through AI-powered capabilities. This unified platform empowers teams to work with structured and unstructured data in innovative ways, bringing together traditional analytics and state-of-the-art vector similarity search under a single solution.

Watson Discovery

IBM Watson Discovery is a tool designed to retrieve and analyze unstructured data. Its main feature is keyword search, which enables users to search through large datasets using specific terms or phrases. Unlike traditional search engines that tend to rely on text matching, Watson Discovery uses the context behind the keywords to deliver more relevant results.

Another important capability of Watson Discovery is entity extraction: it can automatically extract key pieces of information such as people, locations, organizations, and dates from unstructured text. This process structures the data so that it is digestible and can be incorporated directly into RAG responses.

Open Source Tools

LangChain

Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and passed to the LLM when doing the generation step.

LangChain supports many different retrieval algorithms, from basic methods such as semantic search to a collection of algorithms layered on top of them to increase performance. These include:

  • Parent Document Retriever: This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
  • Self Query Retriever: User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the semantic part of a query from other metadata filters present in the query.
  • Ensemble Retriever: Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
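
The parent-document idea from the list above can be illustrated without any framework. The sketch below is not LangChain's API: the function names, the sample policies, and the word-overlap score (a stand-in for embedding similarity) are all assumptions chosen to keep the example self-contained. It matches the query against small child chunks but returns the full parent document as context.

```python
def split_into_chunks(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query, chunk):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


# Hypothetical parent documents.
parents = {
    "hr_policy": "Employees request time off through the HR portal. "
                 "Requests need manager approval within five days.",
    "it_policy": "Laptops are refreshed every three years. "
                 "Lost devices must be reported to the help desk.",
}

# Index small child chunks, each remembering its parent document.
child_index = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split_into_chunks(text, size=8)
]


def retrieve_parent(query):
    """Match against child chunks, then return the whole parent document."""
    _, parent_id = max(child_index, key=lambda item: overlap_score(query, item[0]))
    return parents[parent_id]


context = retrieve_parent("Who gives approval for time off requests?")
print(context)
```

Here the best-matching child chunk is the sentence about the HR portal, yet the returned context also contains the sentence about manager approval, which is the benefit of looking up small chunks while returning the larger parent.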

Milvus

Milvus is an open-source vector database for storing, indexing, and searching high-dimensional vector embeddings, and it is often used alongside Watsonx Discovery for certain retrieval scenarios. Its configurable indexing and deployment options make it suitable for large-scale processing workflows. For an AI engineer, it offers a scalable retrieval backend that reduces manual effort while maintaining precision and efficiency.

watsonx Discovery vs Milvus

Milvus is primarily a vector database designed for similarity search across large datasets, and it excels at handling high-dimensional vectors. watsonx Discovery, on the other hand, specializes in document understanding and processing, with a particular emphasis on conversational and semantic search.

Functionality

Milvus offers advanced functionalities like sparse vector, bulk-vector, filtered search, and hybrid search capabilities. It also features a distributed architecture that allows for both scaling out/in and up/down, making it highly scalable for large datasets.

Key Advantages

Scaling

Milvus can handle large-scale vector data and supports high-performance vector similarity searches, essential for RAG solutions that retrieve information from large datasets.

Data Integration

Milvus integrates seamlessly with watsonx.data, simplifying data management for AI models and applications. This enables scalable RAG use cases across large sets of governed data.

Advanced Features

Milvus supports advanced features like metadata filtering, enhancing search result relevance and improving RAG system effectiveness.

Resources

Milvus RAG Tutorials

Using LangChain Retrievers

Open Source Retriever with LlamaIndex

Tips, Techniques, and Recommendations

Dealing with Duplication, Precedence, and Distribution

A basic vector store works well for simple RAG solutions, but it quickly breaks down in three situations: when the document base contains multiple similar pieces of information (similar is not the same as relevant), when the information needed to answer a query is spread across multiple documents, or when it is distributed across multiple sections of a single document. A few examples illustrate the problem.

Example 1

A business operates offices in multiple states, each with its own HR policies. The organization uses a common document format for HR information across all its offices. If the company embedded its HR policies in a vector database, there would be multiple chunks for, say, requesting time off, each with different supporting text, making hallucinations highly likely.

Example 2

A services contract has a glossary of defined terms that are capitalized throughout the contract to denote a reference to the definition, e.g., the Lessor, the Evaluation Period, etc. Naively parsing and chunking the contract from beginning to end would result in interactions like the following:

Query: How much time is Acme allowed before they will be billed after the services are complete?

Response: The Purchaser is allowed 10 days grace after the Evaluation Period before they will be billed for the services.

The response is helpful, but it is guaranteed to prompt follow-on queries (is Acme the Purchaser, and how long is the Evaluation Period?) and some quick addition before the user arrives at a complete answer.

Example 3

A state government publishes amendments to its legislation as 'edits' to existing laws. For example, "Section 12.2 of the Narcotics Control Act is amended to exclude Cannabis and Cannabis-related products". How will a vector database determine that this snippet of text should take precedence over all other chunks dealing with the topic of Cannabis?

Document Hierarchies

A document hierarchy organizes document chunks into categories that group related documents together, similar to a table of contents or a set of folders on a computer. Using a document hierarchy enables a RAG solution to 'narrow down' the subject of a user query so that it can retrieve only the most relevant documents.

Applying this to the multi-state HR document example above, a useful document hierarchy would categorize the documents by state, enabling the RAG solution to retrieve only the document chunks relevant to a specific state. As shown in the figure below, a possible hierarchy could organize documents first by country, then by the applicable state / province, then by policy area or topic. Of course, this means the solution will need to determine the relevant state, either by inferring it from the user's listed state of residence or by prompting the user to provide the state of interest.
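
The narrowing step described above can be sketched in plain Python. Everything here is hypothetical: the `Chunk` fields, the sample texts, and the `state_of_residence` profile key are assumptions used only to show the flow of resolving the state first and then walking the hierarchy from country to state to topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    country: str
    state: str
    topic: str
    text: str


CHUNKS = [
    Chunk("US", "NY", "time_off", "New York employees request time off in the HR portal."),
    Chunk("US", "CA", "time_off", "California employees email their manager to request time off."),
    Chunk("US", "CA", "expenses", "California expense reports are due monthly."),
]


def resolve_state(user_profile, stated_state=None):
    """Prefer a state the user named explicitly, else fall back to their profile."""
    state = stated_state or user_profile.get("state_of_residence")
    if state is None:
        # Signal to the caller that the user must be prompted for a state.
        raise ValueError("Ask the user which state they are interested in.")
    return state


def candidates(country, state, topic=None):
    """Walk the hierarchy: country, then state, then (optionally) topic."""
    return [
        c for c in CHUNKS
        if c.country == country and c.state == state
        and (topic is None or c.topic == topic)
    ]


state = resolve_state({"state_of_residence": "CA"})
hits = candidates("US", state, topic="time_off")
print([c.text for c in hits])
```

Only the chunks that survive this filter would then be passed to similarity search, so the near-duplicate time-off text from other states never competes for relevance.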

Implementation

Hybrid Vector Stores

Document hierarchies can be implemented in hybrid vector stores, that is, vector stores such as Milvus whose schemas support numeric and categorical data types alongside vectors. Category and sub-category fields then become part of the search criteria, as shown in the Milvus schema definition below.

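from pymilvus import CollectionSchema, DataType, FieldSchema

# Primary key; Milvus generates the values automatically.
chunk_id = FieldSchema(
  name="chunk_id",
  dtype=DataType.INT64,
  is_primary=True,
  auto_id=True,
)
# Category field used to restrict a search to one state.
state_id = FieldSchema(
  name="state_id",
  dtype=DataType.INT64,
)
# Embedding of the chunk text; dim=2 is a toy value and must match the embedding model.
chunk = FieldSchema(
  name="chunk",
  dtype=DataType.FLOAT_VECTOR,
  dim=2,
)
schema = CollectionSchema(
  fields=[chunk_id, state_id, chunk],
  description="HR documents by state",
)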
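
To show how the category field enters the search criteria, the sketch below assembles the keyword arguments for a filtered similarity search against the schema above. It only builds the arguments, so it runs without a Milvus server. The metric type, the `nprobe` value, the state code `36`, and the helper name `build_search_kwargs` are assumptions for illustration; the parameter names follow pymilvus `Collection.search`.

```python
def build_search_kwargs(query_vector, state_id, limit=3):
    """Assemble keyword arguments for a filtered pymilvus Collection.search call."""
    return {
        "data": [query_vector],   # one query vector per search request
        "anns_field": "chunk",    # vector field from the schema
        # Assumed index settings; they must match the index built on the field.
        "param": {"metric_type": "L2", "params": {"nprobe": 10}},
        "limit": limit,
        # Scalar filter that restricts the search to one state's chunks.
        "expr": f"state_id == {int(state_id)}",
        "output_fields": ["state_id"],
    }


kwargs = build_search_kwargs([0.12, 0.87], state_id=36)
print(kwargs["expr"])
# With a loaded collection: results = collection.search(**kwargs)
```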

Knowledge Graphs

Knowledge graphs capture the relationships between entities in a document or across documents. Unlike a similarity search in a vector database, a knowledge graph can consistently and accurately retrieve relevant relationships and content, which can significantly reduce the occurrence of hallucinations. When paired with vector database similarity searches, graph databases can enable RAG solutions to piece together related content from within a document and/or across a document set. The image below shows a potential knowledge graph for a services contract.

While the best knowledge graph for a RAG solution is human-crafted and maintained, using a large language model (LLM) to parse the entities and relations in a document and to create a knowledge graph yields surprisingly good results.
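
As a rough illustration of why graph traversal helps with the services contract example, the sketch below stores a handful of (subject, relation, object) triples and follows them to answer the billing question in one step. The relation names and the 30-day length of the Evaluation Period are hypothetical; only the 10-day grace period comes from the example response earlier in this section.

```python
# (subject, relation, object) triples; relation names are illustrative.
TRIPLES = [
    ("Acme", "DEFINED_AS", "the Purchaser"),
    ("the Purchaser", "HAS_GRACE_DAYS", 10),
    ("the Purchaser", "GRACE_STARTS_AFTER", "the Evaluation Period"),
    ("the Evaluation Period", "LASTS_DAYS", 30),  # hypothetical value
]


def objects(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def days_until_billing(party):
    """Resolve the defined term for a party, then add up the linked periods."""
    role = objects(party, "DEFINED_AS")[0]
    grace_days = objects(role, "HAS_GRACE_DAYS")[0]
    period = objects(role, "GRACE_STARTS_AFTER")[0]
    return objects(period, "LASTS_DAYS")[0] + grace_days


total = days_until_billing("Acme")
print(f"Acme is billed {total} days after the services are complete.")
```

A graph database would express the same traversal as a query over nodes and relationships; the list of tuples here simply keeps the sketch dependency-free.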

Implementation

Neo4j

Neo4j is a popular open source (GPLv3) graph database. This tutorial takes the reader through configuring and coding a knowledge graph-enhanced RAG solution with Neo4j and LangChain.

Contributors

Rohan Singh, Luke Major, Chris Kirby, Sacha Mongrain

Updated: November 15, 2024