Embedding models are transformer-based neural networks that transform chunks of documents (i.e., passages of text) into a numeric representation, or vector. Content with similar meaning or semantics is mapped to nearby representations in the latent space.
The vectorization of language enables AI-powered applications such as 'chatting with documents' or semantic search, rather than traditional keyword (lexical) search.
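To make "similar meaning maps to nearby vectors" concrete, here is a toy sketch with made-up three-dimensional vectors (real models produce hundreds of dimensions). It compares vectors with cosine similarity, the standard closeness measure for embeddings:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (illustrative only; real model outputs differ).
# The two pet-related sentences get nearby vectors; the finance sentence does not.
vec_cat = np.array([0.9, 0.1, 0.0])   # "A cat sat on the mat."
vec_dog = np.array([0.8, 0.2, 0.1])   # "A dog lay on the rug."
vec_tax = np.array([0.0, 0.1, 0.95])  # "File your taxes by April."

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vec_cat, vec_dog))  # high (~0.98): similar meaning
print(cosine_similarity(vec_cat, vec_tax))  # low  (~0.01): unrelated meaning
```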
The embedding model you choose will significantly impact the retrieval accuracy, latency, and computational cost of your RAG system. Your choice of embedding model will largely be influenced by its size, which depends on two characteristics: the embedding dimension and the number of model parameters.
A larger embedding model typically enhances retrieval performance but at the cost of increased latency, storage, and computational (financial) cost. Conversely, a smaller embedding model usually offers reduced retrieval performance but occupies less memory, requires less compute power, and is faster at runtime. Choose an embedding model that balances performance requirements with available resources.
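To see how the embedding dimension drives storage cost, here is a back-of-the-envelope sketch (assuming 4-byte float32 values per dimension, a common vector-store default, and ignoring index overhead):

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage for an index, ignoring metadata and index overhead."""
    return num_chunks * dimensions * bytes_per_value

# 1 million chunks embedded with slate-30m (384 dims) vs. slate-125m (768 dims):
for dims in (384, 768):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims} dims -> {gib:.2f} GiB of raw vectors")
# 384 dims -> 1.43 GiB; 768 dims -> 2.86 GiB (before index overhead)
```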
Certain embedding models are language-specific (e.g., Spanish embedding models built for Spanish clients) or domain-specific (e.g., a model trained on oncology terminology to enable RAG over medical files).
We recommend using the embedding models deployed in watsonx: the IBM-developed Slate models or the third-party models listed below. Read through this documentation for details about each model. For more information about billing classes, see watsonx billing plans.
IBM Slate Models
| Model name | API model_id | Billing class | Maximum input tokens | Number of dimensions | More information |
|---|---|---|---|---|---|
| slate-125m-english-rtrvr | ibm/slate-125m-english-rtrvr | Class C1 | 512 | 768 | Model card |
| slate-30m-english-rtrvr | ibm/slate-30m-english-rtrvr | Class C1 | 512 | 384 | Model card |
The slate-125m-english-rtrvr foundation model, provided by IBM, generates embeddings for various inputs such as queries, passages, or documents. Its training objective is to maximize the cosine similarity between a query and a passage: the model produces two sentence embeddings, one representing the question and one representing the passage, which can then be compared through cosine similarity.
Usage: Two to three times slower than the slate-30m-english-rtrvr model, but with slightly better retrieval performance.
Supported languages: English
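The comparison step described above is the core of retrieval: score every passage embedding against the query embedding and keep the closest ones. A minimal sketch with placeholder vectors (in practice these would come from the slate model via the API shown later in this section):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in a real system these come from the embedding model.
query_vector = np.array([0.7, 0.2, 0.1])
passage_vectors = {
    "Slate models are trained by IBM.":       np.array([0.6, 0.3, 0.1]),
    "Cosine similarity compares directions.": np.array([0.1, 0.8, 0.2]),
    "Foundation models adapt to many tasks.": np.array([0.7, 0.1, 0.2]),
}

# Rank passages by cosine similarity to the query, highest first.
ranked = sorted(
    passage_vectors.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
for passage, vector in ranked:
    print(f"{cosine_similarity(query_vector, vector):.3f}  {passage}")
```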
The slate-30m-english-rtrvr foundation model is a distilled version of slate-125m-english-rtrvr, both provided by IBM. The slate-30m-english-rtrvr embedding model is trained to maximize the cosine similarity between two text inputs so that the resulting embeddings can later be compared by similarity.
Usage: Two to three times faster than the slate-125m-english-rtrvr model, with slightly lower performance scores.
Supported languages: English
Third-Party Embedding Models Available with watsonx
| Model name | API model_id | Provider | Billing class | Maximum input tokens | Number of dimensions | More information |
|---|---|---|---|---|---|---|
| all-minilm-l12-v2 | sentence-transformers/all-minilm-l12-v2 | Open source natural language processing (NLP) and computer vision (CV) community | Class C1 | 256 | 384 | Model card |
| multilingual-e5-large | intfloat/multilingual-e5-large | Microsoft | Class C1 | 512 | 1024 | Model card, Research paper |
The all-minilm-l12-v2 embedding model is built by the open-source natural language processing (NLP) and computer vision (CV) community and provided by Hugging Face.
Supported languages: English
The multilingual-e5-large embedding model is provided by Microsoft.
Usage: For use cases where you want to generate text embeddings for text in a language other than English.
Supported natural languages: Up to 100 languages. See the model card for details.
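As a minimal sketch of calling this model through the watsonx Embeddings API set up later in this section: per the e5 model card, inputs should be prefixed with "query: " or "passage: ". The credentials and project_id placeholders are assumed to be defined as in the setup code below.

```python
from ibm_watsonx_ai.foundation_models import Embeddings

# `credentials` and `project_id` are placeholders; define them as in the setup below.
multilingual_embedding = Embeddings(
    model_id="intfloat/multilingual-e5-large",
    credentials=credentials,
    project_id=project_id,
)

# Per the e5 model card, prefix inputs with "query: " or "passage: ".
texts = [
    "query: ¿Qué es un modelo fundacional?",                    # Spanish query
    "passage: Ein Foundation Model ist ein großes KI-Modell.",  # German passage
]
vectors = multilingual_embedding.embed_documents(texts=texts)
print(len(vectors), len(vectors[0]))  # 2 vectors, 1024 dimensions each
```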
For more information regarding supported embedding models, see the watsonx documentation.
Install ibm-watsonx-ai Python library
pip install -U ibm-watsonx-ai
Use the watsonx embeddings API and the available embedding models to generate text embeddings.
```python
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes
from ibm_watsonx_ai.foundation_models import Embeddings

# Set TRUNCATE_INPUT_TOKENS to a value that is equal to or less than the maximum
# allowed tokens for the embedding model that you are using. If you don't specify
# this value and the input has more tokens than the model can process, an error
# is generated.
embed_params = {
    EmbedParams.TRUNCATE_INPUT_TOKENS: 128,
    EmbedParams.RETURN_OPTIONS: {'input_text': True},
}

# `credentials` and `project_id` must be defined earlier in your environment
# (for example, with your watsonx API key, endpoint URL, and project ID).
embedding = Embeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG,
    credentials=credentials,
    params=embed_params,
    project_id=project_id,
    space_id=None,
    verify=False,
)

q = [
    "A foundation model is a large-scale generative AI model that can be adapted to a wide range of downstream tasks.",
    "Generative AI is a class of AI algorithms that can produce various types of content including text, source code, imagery, audio, and synthetic data.",
]

embedding_vectors = embedding.embed_documents(texts=q)
print(embedding_vectors)
```
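As a follow-up, the Embeddings class also provides an embed_query method for a single string, which you can use to score a new query against the document vectors returned above. A sketch using cosine similarity (numpy is assumed to be installed):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed a query with the same model, then score it against the document vectors.
query_vector = embedding.embed_query(text="What can generative AI produce?")
scores = [cosine_similarity(query_vector, doc_vector) for doc_vector in embedding_vectors]

# The second passage (about generative AI) should score higher for this query.
best = int(np.argmax(scores))
print(scores, "-> best match:", q[best])
```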
Use the watsonx Granite Model Series, embeddings, Chroma, and LangChain to answer questions (RAG)
Updated: November 15, 2024