Embeddings


Overview

Embedding models are transformer-based neural networks that transform chunks of documents (i.e., passages of text) into a numeric representation, or vector. Content with similar meaning or semantics is mapped to nearby points in the latent space.

The vectorization of language enables AI-powered applications such as 'chatting with documents' or semantic search, rather than traditional keyword (lexical) search.
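To make this concrete, the sketch below compares toy embedding vectors with cosine similarity, the standard measure of closeness in the latent space. The vectors and example sentences are made up for illustration; real models produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (real models produce 384-1024 dimensions).
query   = [0.9, 0.1, 0.2]   # e.g. "How do I reset my password?"
passage = [0.8, 0.2, 0.1]   # a semantically similar passage
other   = [0.1, 0.9, 0.8]   # an unrelated passage

print(cosine_similarity(query, passage))  # high, close to 1.0
print(cosine_similarity(query, other))    # noticeably lower
```

Semantic search ranks passages by this score, so the related passage surfaces even when it shares no keywords with the query.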

Considerations

Performance vs. Cost Tradeoff

The embedding model you choose will significantly impact the retrieval accuracy, latency, and computational cost of your RAG system. Your choice of embedding model will largely be influenced by its size, which depends on two characteristics: the embedding dimension and the number of model parameters.

A larger embedding model typically enhances retrieval performance but at the cost of increased latency, storage, and computational (financial) cost. Conversely, a smaller embedding model usually offers reduced retrieval performance but occupies less memory, requires less compute power, and is faster at runtime. Choose an embedding model that balances performance requirements with available resources.

Certain embedding models are language-specific (e.g., Spanish embedding models built for Spanish clients) and domain-specific (e.g., a model trained on oncology terminology to enable RAG over medical files).

IBM Solutions

Embeddings Models Available on watsonx

We recommend using embedding models deployed on watsonx: the IBM-developed Slate models or the third-party models listed below. Please read through this documentation for details about each model. For more information about billing classes, see the watsonx billing plans.

IBM Slate Models

| Model name | API model_id | Billing class | Maximum input tokens | Number of dimensions | More information |
| --- | --- | --- | --- | --- | --- |
| slate-125m-english-rtrvr | ibm/slate-125m-english-rtrvr | Class C1 | 512 | 768 | Model card |
| slate-30m-english-rtrvr | ibm/slate-30m-english-rtrvr | Class C1 | 512 | 384 | Model card |

slate-125m-english-rtrvr

The slate-125m-english-rtrvr foundation model, provided by IBM, generates embeddings for various inputs such as queries, passages, or documents. Its training objective is to maximize cosine similarity between a query and a passage: the model produces two sentence embeddings, one representing the question and one representing the passage, which can then be compared via cosine similarity.

Usage: Two to three times slower but performs slightly better than the slate-30m-english-rtrvr model.
Supported languages: English

slate-30m-english-rtrvr

The slate-30m-english-rtrvr foundation model is a distilled version of slate-125m-english-rtrvr, both of which are provided by IBM. The slate-30m-english-rtrvr embedding model is trained to maximize the cosine similarity between two text inputs so that the resulting embeddings can later be compared by similarity.

Usage: Two to three times faster but has slightly lower performance scores than the slate-125m-english-rtrvr model.
Supported languages: English

Third Party Embedding Models available with watsonx

| Model name | API model_id | Provider | Billing class | Maximum input tokens | Number of dimensions | More information |
| --- | --- | --- | --- | --- | --- | --- |
| all-minilm-l12-v2 | sentence-transformers/all-minilm-l12-v2 | Open source natural language processing (NLP) and computer vision (CV) community | Class C1 | 256 | 384 | Model card |
| multilingual-e5-large | intfloat/multilingual-e5-large | Microsoft | Class C1 | 512 | 1024 | Model card, Research paper |

all-minilm-l12-v2

The all-minilm-l12-v2 embedding model is built by the open source natural language processing (NLP) and computer vision (CV) community and provided by Hugging Face.

Supported Languages: English

multilingual-e5-large

Usage: For use cases where you want to generate text embeddings for text in a language other than English.

Supported natural languages: Up to 100 languages. See the model card for details.

For more information regarding supported embedding models, see the watsonx documentation.

Quickstart with watsonx embeddings Python SDK

1. Install the ibm-watsonx-ai Python library:
pip install -U ibm-watsonx-ai
2. Use the watsonx embeddings API and the available embedding models to generate text embeddings.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
from ibm_watsonx_ai.foundation_models import Embeddings

# Authenticate against your watsonx.ai instance; replace the placeholders
# with your own endpoint URL, API key, and project ID.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<your IBM Cloud API key>",
)
project_id = "<your project ID>"

# Set TRUNCATE_INPUT_TOKENS to a value equal to or less than the maximum
# input tokens allowed for the embedding model you are using. If you don't
# specify this value and the input has more tokens than the model can
# process, an error is generated.
embed_params = {
    EmbedParams.TRUNCATE_INPUT_TOKENS: 128,
    EmbedParams.RETURN_OPTIONS: {
        'input_text': True
    }
}

embedding = Embeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG,
    credentials=credentials,
    params=embed_params,
    project_id=project_id,
)

q = [
    "A foundation model is a large scale generative AI model that can be adapted to a wide range of downstream tasks.",
    "Generative AI is a class of AI algorithms that can produce various types of content including text, source code, imagery, audio, and synthetic data."
]

embedding_vectors = embedding.embed_documents(texts=q)

print(embedding_vectors)
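Once the SDK returns embedding vectors, retrieval reduces to a nearest-neighbor search over them. The sketch below ranks passages by cosine similarity to a query using stand-in vectors; in practice these would come from calls like embedding.embed_documents(...) above (the specific vectors here are invented for illustration).

```python
import numpy as np

def rank_passages(query_vec, passage_vecs):
    # Rank passage indices by cosine similarity to the query, highest first.
    q = np.asarray(query_vec, dtype=float)
    P = np.asarray(passage_vecs, dtype=float)
    sims = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

# Stand-in vectors; real ones come from the embedding model.
query_vec = [0.9, 0.1, 0.3]
passage_vecs = [
    [0.1, 0.9, 0.2],  # passage 0: unrelated to the query
    [0.8, 0.2, 0.3],  # passage 1: semantically close to the query
]

ranked = rank_passages(query_vec, passage_vecs)
print(ranked)  # passage 1 ranks first
```

A vector database performs the same ranking at scale with approximate nearest-neighbor indexes instead of a brute-force scan.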

Sample Notebook

Use the watsonx Granite Model Series, embeddings, Chroma, and LangChain to answer questions (RAG)

Contributors

Luke Major, Dean Sacoransky

Updated: November 15, 2024