Embeddings


Overview

Embedding models are transformer-based neural networks that transform chunks of documents (i.e., passages of text) into a numeric representation, or vector. Content with similar meaning or semantics is mapped to nearby points in the latent space.

The vectorization of language enables AI-powered applications such as 'chatting with documents' or semantic search, rather than traditional keyword (lexical) search.
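To make this concrete, the sketch below compares toy embedding vectors with cosine similarity, the standard measure of closeness in the latent space. The vectors and example sentences are made up for illustration; real models produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (real models produce 384-1024 dimensions).
query   = [0.9, 0.1, 0.2]   # e.g. "How do I reset my password?"
passage = [0.8, 0.2, 0.1]   # a semantically similar passage
other   = [0.1, 0.9, 0.8]   # an unrelated passage

print(cosine_similarity(query, passage))  # high, close to 1.0
print(cosine_similarity(query, other))    # noticeably lower
```

Semantic search ranks passages by this score, so the related passage surfaces even when it shares no keywords with the query.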

Considerations

Performance vs. Cost Tradeoff

The embedding model you choose will significantly impact the retrieval accuracy, latency, and computational cost of your RAG system. Your choice of embedding model will largely be influenced by its size, which depends on two characteristics: the embedding dimension and the number of model parameters.

A larger embedding model typically enhances retrieval performance but at the cost of increased latency, storage, and computational (financial) cost. Conversely, a smaller embedding model usually offers reduced retrieval performance but occupies less memory, requires less compute power, and is faster at runtime. Choose an embedding model that balances performance requirements with available resources.

Certain embedding models are language-specific (e.g., Spanish embedding models built for Spanish clients) and domain-specific (e.g., a model trained on oncology terminology to enable RAG over medical files).

IBM Solutions

Embeddings Models Available on watsonx

We recommend using embedding models deployed on watsonx: the IBM-developed Slate models or the third-party models listed below. Please read through this documentation for details about each model. For more information about billing classes, see the watsonx billing plans.

IBM Slate Models

| Model name | API model_id | Billing class | Maximum input tokens | Number of dimensions | More information |
| --- | --- | --- | --- | --- | --- |
| slate-125m-english-rtrvr | ibm/slate-125m-english-rtrvr | Class C1 | 512 | 768 | Model card |
| slate-30m-english-rtrvr | ibm/slate-30m-english-rtrvr | Class C1 | 512 | 384 | Model card |

slate-125m-english-rtrvr

The slate-125m-english-rtrvr foundation model, provided by IBM, generates embeddings for various inputs such as queries, passages, or documents. Its training objective is to maximize cosine similarity between a query and a passage: the model produces two sentence embeddings, one representing the question and one representing the passage, which can then be compared via cosine similarity.

Usage: Two to three times slower but performs slightly better than the slate-30m-english-rtrvr model.
Supported languages: English

slate-30m-english-rtrvr

The slate-30m-english-rtrvr foundation model is a distilled version of slate-125m-english-rtrvr, both of which are provided by IBM. The slate-30m-english-rtrvr embedding model is trained to maximize the cosine similarity between two text inputs so that the resulting embeddings can later be compared by similarity.

Usage: Two to three times faster but has slightly lower performance scores than the slate-125m-english-rtrvr model.
Supported languages: English

Third Party Embedding Models available with watsonx

| Model name | API model_id | Provider | Billing class | Maximum input tokens | Number of dimensions | More information |
| --- | --- | --- | --- | --- | --- | --- |
| all-minilm-l12-v2 | sentence-transformers/all-minilm-l12-v2 | Open source natural language processing (NLP) and computer vision (CV) community | Class C1 | 256 | 384 | Model card |
| multilingual-e5-large | intfloat/multilingual-e5-large | Microsoft | Class C1 | 512 | 1024 | Model card, Research paper |

all-minilm-l12-v2

The all-minilm-l12-v2 embedding model is built by the open source natural language processing (NLP) and computer vision (CV) community and provided by Hugging Face.

Supported Languages: English

multilingual-e5-large

Usage: For use cases where you want to generate text embeddings for text in a language other than English.

Supported natural languages: Up to 100 languages. See the model card for details.

For more information regarding supported embedding models, see the watsonx documentation.

Quickstart with watsonx embeddings Python SDK

1. Install the ibm-watsonx-ai Python library:
pip install -U ibm-watsonx-ai
2. Use the watsonx embeddings API and the available embedding models to generate text embeddings.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
from ibm_watsonx_ai.foundation_models import Embeddings

# Authenticate against your watsonx.ai instance; replace the placeholders
# with your own endpoint URL, API key, and project ID.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<your IBM Cloud API key>",
)
project_id = "<your project ID>"

# Set TRUNCATE_INPUT_TOKENS to a value equal to or less than the maximum
# input tokens allowed for the embedding model you are using. If you don't
# specify this value and the input has more tokens than the model can
# process, an error is generated.
embed_params = {
    EmbedParams.TRUNCATE_INPUT_TOKENS: 128,
    EmbedParams.RETURN_OPTIONS: {
        'input_text': True
    }
}

embedding = Embeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG,
    credentials=credentials,
    params=embed_params,
    project_id=project_id,
)

q = [
    "A foundation model is a large scale generative AI model that can be adapted to a wide range of downstream tasks.",
    "Generative AI is a class of AI algorithms that can produce various types of content including text, source code, imagery, audio, and synthetic data."
]

embedding_vectors = embedding.embed_documents(texts=q)

print(embedding_vectors)
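Once the SDK returns embedding vectors, retrieval reduces to a nearest-neighbor search over them. The sketch below ranks passages by cosine similarity to a query using stand-in vectors; in practice these would come from calls like embedding.embed_documents(...) above (the specific vectors here are invented for illustration).

```python
import numpy as np

def rank_passages(query_vec, passage_vecs):
    # Rank passage indices by cosine similarity to the query, highest first.
    q = np.asarray(query_vec, dtype=float)
    P = np.asarray(passage_vecs, dtype=float)
    sims = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

# Stand-in vectors; real ones come from the embedding model.
query_vec = [0.9, 0.1, 0.3]
passage_vecs = [
    [0.1, 0.9, 0.2],  # passage 0: unrelated to the query
    [0.8, 0.2, 0.3],  # passage 1: semantically close to the query
]

ranked = rank_passages(query_vec, passage_vecs)
print(ranked)  # passage 1 ranks first
```

A vector database performs the same ranking at scale with approximate nearest-neighbor indexes instead of a brute-force scan.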

Sample Notebook

Use the watsonx Granite Model Series, embeddings, Chroma, and LangChain to answer questions (RAG)

Contributors

Luke Major, Dean Sacoransky

Updated: November 15, 2024