The Granite Embedding model collection consists of embedding models that generate high-quality text embeddings and a reranker model that improves the relevance and quality of search results or recommendations. The embedding models output vector representations (embeddings) of textual inputs such as queries, passages, and documents, capturing the semantic meaning of the input text. The primary use cases for these embeddings are semantic search and retrieval-augmented generation (RAG) applications.

The Granite Embedding Reranker model is optional but useful for further improving the relevance and quality of search results or recommendations. After the initial retrieval of items based on their embeddings, the reranker refines the ranking by considering additional factors and more complex criteria.

Built on a foundation of carefully curated, permissively licensed public datasets, the Granite Embedding models achieve state-of-the-art results in their respective weight classes. On the MTEB Leaderboard, granite-embedding-97m-multilingual-r2 ranks #1 among multilingual embedding models under 100M parameters, and granite-embedding-311m-multilingual-r2 ranks #2 among multilingual embedding models under 500M parameters.

Granite Embedding models are released under the Apache 2.0 license, making them freely available for both research and commercial purposes, with full transparency into their training data.

Granite Embedding Paper
The Granite Embedding 311m multilingual model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications across multiple languages. It supports 200+ languages, with enhanced support for 52 languages.

For efficient inference, these models support Flash Attention 2. Installing it is optional but can lead to faster encoding:
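A typical install looks like the following; this is a sketch that assumes a CUDA-capable GPU and a matching PyTorch build (the `--no-build-isolation` flag follows the flash-attn project's own install instructions):

```shell
# Optional dependency: Flash Attention 2 (compiles CUDA kernels, so it needs a GPU toolchain)
pip install flash-attn --no-build-isolation
```

If Flash Attention 2 is not installed, the model still runs with the default attention implementation.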
The model is compatible with the Sentence Transformers library and is easy to use. First, install the library:
```shell
pip install sentence_transformers
```
The model can then be used to encode pairs of text and find the similarity between their representations:
```python
from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"

# Load the Sentence Transformer model
model = SentenceTransformer(model_path)

input_queries = [
    'What is the tallest mountain in Japan?',           # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?',  # German query
    'ドイツの首都はどこですか?',                          # Japanese query
]
input_passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese passage
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991.",  # English passage
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland. Die Stadt ist mit rund 3,7 Millionen Einwohnern die bevölkerungsreichste Kommune Deutschlands.",  # German passage
]

# Cross-lingual retrieval: each query should score highest with its matching passage in a different language
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# calculate cosine similarity: expect high scores on the diagonal (EN→JA, DE→EN, JA→DE)
print(util.cos_sim(query_embeddings, passage_embeddings))
# output: tensor([[0.9393, 0.6899, 0.7627],
#                 [0.6780, 0.9598, 0.7062],
#                 [0.7818, 0.7342, 0.9172]])
```
This model supports Matryoshka Representation Learning (MRL), which allows you to truncate embeddings to smaller dimensions (e.g., 512, 256, 128) with graceful performance degradation. This is useful for reducing storage and memory requirements.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Full 768-dimensional embeddings
full_embeddings = model.encode(["example text"])
print(full_embeddings.shape)  # (1, 768)

# Truncated to 256 dimensions: still effective for many retrieval tasks
truncated_embeddings = model.encode(["example text"], truncate_dim=256)
print(truncated_embeddings.shape)  # (1, 256)
```
This is a simple example of how to use the granite-embedding-311m-multilingual-r2 model with the Transformers library and PyTorch. For a complete retrieval workflow including passage encoding and cosine similarity, see the Sentence Transformers example above.

First, install the required libraries:
```shell
pip install transformers torch
```
The model can then be used to encode text:
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    'What is the tallest mountain in Japan?',           # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?',  # German query
    'ドイツの首都はどこですか?',                          # Japanese query
]

# tokenize inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# encode queries
with torch.no_grad():
    model_output = model(**tokenized_queries)
    # Perform pooling. granite-embedding-311m-multilingual-r2 uses CLS Pooling
    query_embeddings = model_output[0][:, 0]

# normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, dim=1)
```
Pre-converted ONNX and OpenVINO models are released alongside the PyTorch weights for production deployment. These can be loaded through the Optimum library:
The ONNX model is compatible with any ONNX Runtime backend (CPU, CUDA, TensorRT, DirectML). The OpenVINO model is optimized for Intel hardware including CPUs and integrated GPUs.
The model can be served as an embedding endpoint using vLLM:
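A minimal serving sketch, assuming a recent vLLM release whose `serve` command supports embedding models via `--task embed` (flag name taken from the vLLM documentation); the server exposes an OpenAI-compatible `/v1/embeddings` endpoint on port 8000 by default:

```shell
# start an OpenAI-compatible embedding server
vllm serve ibm-granite/granite-embedding-311m-multilingual-r2 --task embed

# then, from another terminal, request embeddings over HTTP:
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "ibm-granite/granite-embedding-311m-multilingual-r2", "input": ["What is the tallest mountain in Japan?"]}'
```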
The embedding models also integrate with LangChain and Milvus for end-to-end retrieval workflows:

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
import os, tempfile, wget

# load the embedding model
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")

# setup the vectordb
db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    index_params={"index_type": "AUTOINDEX"},
)

# load example corpus file
filename = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'
if not os.path.isfile(filename):
    wget.download(url, out=filename)

loader = TextLoader(filename)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# add processed documents to the vectordb
vector_db.add_documents(texts)

# search the vectordb with the query
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
print(docs[0].page_content)
```