Vector index settings

To create a vector index, you must select a vector data store that is compatible with your grounding documents and specify settings to control how your documents are broken into smaller segments before they are sent to the embedding model.

When you create a vector index to generate and retrieve embeddings from a vector data store, you choose the following options and settings:

Types of vector stores

You can use one of the following vector data stores to store your grounding documents:

  • In memory: A Chroma database vector index that is associated with your project and provides temporary vector storage.

    Note: The in-memory vector index asset is created for you automatically; you don't need to set up the vector store.
  • Elasticsearch: An external third-party vector index that you set up and connect to your project.
  • watsonx.data Milvus: An external vector index that you set up and connect to your project.

To use an external vector store, you must set up a connection to the data store before you create the vector index. For more information, see Setting up an Elasticsearch vector store and Setting up a watsonx.data Milvus vector store.

Choosing a vector store

To determine the appropriate vector store for your use case, consider the following factors:

  • The file types of your grounding documents. The supported file types differ by vector store.

  • The embedding models that you can use to vectorize documents that you add to the index. The supported models differ by vector store.

  • The number of grounding documents you want to be able to search from your foundation model prompts.

    When you connect to a third-party vector store, you can choose to do one of the following tasks:

    • Add files to vectorize and store in a new vector index or collection in the vector store.
    • Use vectorized data from an existing index or collection in the vector store.

    The number of files that you can add to the vector store at the time that you create the vector index is limited. You can upload up to 10 documents at a time to an in-memory vector store.

    If you want to vectorize more documents, such as a set of PDF files that is larger than 50 MB, use a third-party vector store. With a third-party vector store, you can create a collection or index with more documents directly from the data store first. Then, you can connect to the existing collection or index when you create a vector index asset to associate with your prompt.

    Caution: Do not add more than 10 files in a single upload when you create a vector index in Prompt Lab.

Mapping external vector store schema fields to a vector index asset

For connected vector stores, you can map fields from the existing index or collection in the external vector store to new fields that are defined in the vector index asset in watsonx.ai. The mapping provides a consistent way to extract data and capture details about the document, such as the original file name and page number, from different types of vector stores.

Table 1. Vector store schema fields

| New vector index field name | Field from connected vector store |
|---|---|
| Vector query | Required for Elasticsearch indexes only. Field where the query text is specified that is used to search the Elasticsearch index, such as ml or vector. |
| Document name | Field that identifies the source file. You can choose a field that captures the file name, such as metadata.source, or the document title, such as metadata.title. |
| Text | Field that contains the bulk of the page content, such as body or text. |
| Page number | Field that identifies the page number, such as metadata.page_number. |
| Document URL | Field that contains the URL for the document, such as metadata.document_url. |
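
As an illustration of how such a mapping can be applied, the following Python sketch normalizes one record from a connected store into the vector index field names. It is not the watsonx.ai implementation. The dictionary keys, the helper names get_path and normalize, and the sample record are hypothetical; the source field paths are the examples from Table 1.

```python
# Hypothetical mapping: vector index field name -> field path in the connected store.
FIELD_MAPPING = {
    "document_name": "metadata.source",
    "text": "text",
    "page_number": "metadata.page_number",
    "document_url": "metadata.document_url",
}


def get_path(record, dotted_path):
    """Follow a dotted path such as 'metadata.source' through nested dicts."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # the field is missing in this record
        current = current[key]
    return current


def normalize(record, mapping=FIELD_MAPPING):
    """Return the record expressed with the vector index field names."""
    return {name: get_path(record, path) for name, path in mapping.items()}


# Sample record, invented for illustration only.
hit = {"text": "Quarterly results", "metadata": {"source": "report.pdf", "page_number": 3}}
print(normalize(hit))
```

Fields that are absent from a record, such as the document URL in the sample, are returned as None so that every store yields the same set of keys.
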

Attention: To use a connected folder asset that uses a Cloud Object Storage (COS) connection, make sure that you meet the following requirements:

  • The COS connection must have a bucket specified.
  • The COS connection must use HMAC credentials (resource instance ID, API key, access key, secret key) as authentication.

Grounding document file types

When you add grounding documents to a vector index, you can upload files or connect to a data asset that contains files.

The following table lists the supported file types and maximum file sizes that you can add when you create a new vector index. The supported file types differ by vector store.

File types are listed in the first column. The maximum total file size that is allowed by default for each file type is listed in the remaining columns.

Table 2. Supported file types for grounding documents in different vector stores

| File type | In-memory store maximum total file size | Elasticsearch maximum total file size | Milvus maximum total file size |
|---|---|---|---|
| CSV | Not supported | 50 MB | 50 MB |
| DOCX | 50 MB | 500 MB | 500 MB |
| HTML | Not supported | 50 MB | 50 MB |
| JSON | Not supported | 50 MB | 50 MB |
| PDF | 50 MB | 500 MB | 500 MB |
| TXT | 5 MB | 50 MB | 50 MB |
| XLSX | Not supported | 50 MB | 50 MB |
| XML | Not supported | 50 MB | 50 MB |
| YAML | Not supported | 50 MB | 50 MB |
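
The limits in the table above can be checked before an upload. The following Python sketch is one possible way to do that, not part of watsonx.ai. The function name check_upload, the store keys, and the assumption that 1 MB equals 1,048,576 bytes are illustrative only; the limits themselves are copied from the table.

```python
import os
from collections import defaultdict

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes

# Maximum total size in MB per file type, copied from the table above.
# A file type that is missing from a store's entry is "Not supported" there.
_EXTERNAL = {"csv": 50, "docx": 500, "html": 50, "json": 50, "pdf": 500,
             "txt": 50, "xlsx": 50, "xml": 50, "yaml": 50}
MAX_TOTAL_MB = {
    "in_memory": {"docx": 50, "pdf": 50, "txt": 5},
    "elasticsearch": dict(_EXTERNAL),
    "milvus": dict(_EXTERNAL),
}


def check_upload(store, files):
    """Return a list of problems for (file_name, size_in_bytes) pairs."""
    limits = MAX_TOTAL_MB[store]
    totals = defaultdict(int)
    problems = []
    for name, size in files:
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if ext not in limits:
            problems.append(f"{name}: file type not supported by {store}")
        else:
            totals[ext] += size  # limits apply to the total per file type
    for ext, total in totals.items():
        if total > limits[ext] * MB:
            problems.append(f"{ext.upper()} files exceed {limits[ext]} MB in total")
    return problems


print(check_upload("in_memory", [("notes.txt", 6 * MB), ("data.csv", 10)]))
```

An empty list means that the set of files fits within the documented limits for the chosen store.
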

Vectorization settings

When you upload grounding documents, an embedding model is used to calculate vectors that numerically represent the document text.

You can configure the following settings to control how documents are broken into smaller segments, or chunks, before they are sent to the embedding model of your choice:

Supported embedding models

You can use embedding models provided in watsonx.ai with in-memory and Milvus vector data stores. For details, see Embedding model details.

You can use ELSER (Elastic Learned Sparse EncodeR) embedding models with the Elasticsearch vector data store. For details, see ELSER – Elastic Learned Sparse EncodeR.

Text chunk size

Set the chunk size parameter to configure the number of characters to include per document segment.

Define a segment size that is smaller than the maximum number of input tokens allowed by the model. If you break the document into larger segments, some document text might be omitted because after the maximum token size limit is met, any extra characters in the segment are ignored by the embedding model.

The chunk size is specified in characters. The number of characters per token varies by embedding model, but one token is equal to approximately 2-3 characters.
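
The rule of thumb above can be turned into a quick estimate. The following Python sketch, with the hypothetical helper name chunk_size_range, converts a model's maximum input tokens into a range of chunk sizes in characters. It assumes only the 2-3 characters per token figure stated above; the real ratio depends on the embedding model and the text.

```python
def chunk_size_range(max_input_tokens, min_chars_per_token=2, max_chars_per_token=3):
    """Return (conservative, optimistic) chunk sizes in characters."""
    return (max_input_tokens * min_chars_per_token,
            max_input_tokens * max_chars_per_token)


low, high = chunk_size_range(512)
print(low, high)  # prints: 1024 1536
```

A chunk size at or below the lower bound is the most conservative choice. Values close to the upper bound increase the risk that the end of a segment is ignored by the embedding model.
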

Table 3. Embedding model chunk sizes

| Embedding model | Maximum input tokens | Approximate chunk size (characters) |
|---|---|---|
| all-MiniLM-L6-v2 | 256 | 700 |
| all-MiniLM-L12-v2 | 256 | 700 |
| ELSER | 512 | 1400 |
| granite-embedding-107m-multilingual | 512 | 1400 |
| granite-embedding-278m-multilingual | 512 | 1400 |
| multilingual-e5-large | 512 | 1400 |
| slate-30m-english-rtrvr | 512 | 1400 |
| slate-125m-english-rtrvr | 512 | 1400 |

Text chunk overlap

Set the chunk overlap parameter to configure the number of characters that are repeated at the boundary between two consecutive document segments.

Repeating text creates a buffer between document segments that helps to capture complete sentences and prevents text from being missed altogether.
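
To show how chunk size and chunk overlap interact, the following Python sketch splits text into fixed-size character segments in which each segment repeats the last characters of the previous one. It is a simplified illustration, not the splitter that watsonx.ai uses; the function name chunk_text is hypothetical, and real splitters often also respect sentence or paragraph boundaries.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into segments of chunk_size characters.

    Consecutive segments share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this segment already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

In the output, the last character of each segment is repeated as the first character of the next one, which is the buffer that the overlap setting creates.
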

Split PDF pages

When the split PDF parameter is enabled, a PDF file is broken into one segment per page, and the source page number is included in the answer. The page numbers that are shown are PDF viewer page numbers.

Note: This option is available only when you add a PDF file.

Search settings

You can adjust query settings to improve responses returned from a search of the contents in the vector index asset.

Restriction: You cannot adjust search results and result reranking settings for a vector index asset created with the API.

You can use the following settings to control the number and types of search results returned by the vector index:

Top K

Set the Top K parameter to configure the number of results to sample from a vector index search. The sampled results are used as contextual input to the foundation model.

A lower Top K value limits the context to the results that are most similar to the question. A higher Top K value provides more information for a foundation model to use to generate a response, but it also increases the token count of the model input.

By default, the top three search results are included.
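
The following Python sketch illustrates the idea behind Top K retrieval. It assumes cosine similarity as the ranking measure and a small in-memory list of (text, vector) pairs. The similarity measure that is actually used depends on the vector store, and the helper names cosine and top_k, as well as the sample vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the k (score, text) pairs that are most similar to the query."""
    scored = [(cosine(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index with made-up two-dimensional vectors.
index = [("a", [1, 0]), ("b", [0, 1]), ("c", [1, 1]), ("d", [-1, 0])]
print([text for _, text in top_k([1, 0], index)])  # prints: ['a', 'c', 'b']
```

The default of k=3 mirrors the default described above. A larger k returns more segments and therefore adds more tokens to the model input.
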

Supported reranker models

Select a reranker model to prioritize search results that are more likely to answer the question. For details about the reranker models provided in watsonx.ai, see Reranker model details.

Top N

Set the Top N parameter to configure the number of results from the top K vector index search results that the reranker model should rerank.
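
The relationship between Top K and Top N can be sketched as follows. The example takes search results that are already ordered by the vector index and reorders only the first Top N of them with a scoring function. It is a rough illustration under stated assumptions: the word-overlap scorer stands in for a real reranker model, the function names are hypothetical, and keeping the remaining results in their original order is an assumption of this sketch rather than documented behavior.

```python
def overlap_score(query, passage):
    """Toy stand-in for a reranker model: count the words shared with the query."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def rerank_top_n(query, results, top_n, score_fn=overlap_score):
    """Rerank the first top_n results; leave the rest in their original order."""
    head = results[:top_n]
    tail = results[top_n:]  # assumption: results that are not reranked are kept as is
    reranked = sorted(head, key=lambda passage: score_fn(query, passage), reverse=True)
    return reranked + tail


top_k_results = [
    "billing overview",
    "how to reset your password",
    "password policy",
    "contact us",
]
print(rerank_top_n("reset password", top_k_results, top_n=3))
```

With top_n=3, only the first three results are rescored, so the fourth result stays in last place regardless of its content.
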