Granite Embedding

Text embedding and reranker models that produce vector representations for semantic search, retrieval-augmented generation, and recommendation ranking.

Model Collection

View the full Granite Embedding collection on Hugging Face

Run locally with Ollama

Download and run Granite Embedding with Ollama

Replicate

Deploy Granite Embedding on Replicate

Overview

The Granite Embedding model collection consists of embedding models that generate high-quality text embeddings and a reranker model that improves the relevance and quality of search results or recommendations. The embedding models output vector representations (embeddings) of textual inputs such as queries, passages, and documents, capturing the semantic meaning of the input text. The primary use cases for these embeddings are semantic search and retrieval-augmented generation (RAG) applications.

The Granite Embedding Reranker model is optional, but useful to further improve the relevance and quality of search results or recommendations. After the initial retrieval of items based on their embeddings, the reranker refines the ranking by considering additional factors and more complex criteria.
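
The two-stage flow described above (retrieve by embedding similarity, then rerank the shortlist) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Granite reranker itself: the three-dimensional vectors are toy values, and `toy_overlap_score` is a hypothetical stand-in scorer that only counts shared words. In practice the second stage would call a reranker model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_then_rerank(query_vec, query_text, corpus, rerank_score, top_k=2):
    # Stage 1: shortlist the top_k documents by embedding similarity.
    candidates = sorted(
        corpus, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True
    )[:top_k]
    # Stage 2: reorder only the shortlist using the (more expensive) scorer.
    return sorted(
        candidates, key=lambda doc: rerank_score(query_text, doc["text"]), reverse=True
    )


def toy_overlap_score(query, passage):
    # Hypothetical stand-in for a reranker: fraction of query words in the passage.
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words)


# Toy corpus with made-up 3-dimensional "embeddings".
corpus = [
    {"text": "Berlin is the capital of Germany", "vec": [0.9, 0.1, 0.0]},
    {"text": "Germany borders nine countries", "vec": [0.95, 0.05, 0.0]},
    {"text": "Mount Fuji is the tallest mountain in Japan", "vec": [0.0, 0.2, 0.9]},
]

results = retrieve_then_rerank([1.0, 0.0, 0.0], "capital of Germany", corpus, toy_overlap_score)
print([doc["text"] for doc in results])
# ['Berlin is the capital of Germany', 'Germany borders nine countries']
```

In this toy data the embedding stage ranks the "borders" passage slightly higher, and the second stage moves the passage that actually answers the query to the top. The reranker only sees the shortlist, so its cost scales with `top_k` rather than with the corpus size.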

Built on a foundation of carefully curated, permissively licensed public datasets, the Granite Embedding models achieve state-of-the-art results in their respective weight classes. On the MTEB Leaderboard, granite-embedding-97m-multilingual-r2 ranks #1 for multilingual embedding models under 100M parameters, and granite-embedding-311m-multilingual-r2 ranks #2 for multilingual embedding models under 500M parameters.

Granite Embedding models are released under the Apache 2.0 license, making them freely available for both research and commercial purposes, with full transparency into their training data.

Granite Embedding Paper

Getting Started

Intended Use

The Granite Embedding 311m multilingual model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications across multiple languages. It supports 200+ languages with enhanced support for 52 languages.

For efficient inference, these models support Flash Attention 2. Installing it is optional but can lead to faster encoding:

pip install flash_attn

Usage with Sentence Transformers

The model is compatible with the Sentence Transformers library.

First, install the library:

pip install sentence_transformers

The model can then be used to encode pairs of text and find the similarity between their representations:

from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path)

input_queries = [
    'What is the tallest mountain in Japan?',          # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?', # German query
    'ドイツの首都はどこですか？',                            # Japanese query
    ]

input_passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese passage
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991.",  # English passage
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland. Die Stadt ist mit rund 3,7 Millionen Einwohnern die bevölkerungsreichste Kommune Deutschlands.",  # German passage
    ]

# Cross-lingual retrieval: each query should score highest with its matching passage in a different language
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# calculate cosine similarity — expect high scores on the diagonal (EN→JA, DE→EN, JA→DE)
print(util.cos_sim(query_embeddings, passage_embeddings))
# output: tensor([[0.9393, 0.6899, 0.7627],
#                 [0.6780, 0.9598, 0.7062],
#                 [0.7818, 0.7342, 0.9172]])
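
To turn a similarity matrix like the one printed above into retrieval decisions, one option is to pick the highest-scoring passage for each query. The sketch below copies the scores from the example output into plain Python lists; the variable names are illustrative only.

```python
# Scores copied from the example output above: rows are queries, columns are passages.
scores = [
    [0.9393, 0.6899, 0.7627],
    [0.6780, 0.9598, 0.7062],
    [0.7818, 0.7342, 0.9172],
]

# Index of the best-matching passage for each query (argmax over each row).
best_passage = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best_passage)  # [0, 1, 2]

# Gap between the best and second-best score per query, as a rough confidence signal.
margins = [sorted(row)[-1] - sorted(row)[-2] for row in scores]
print([round(margin, 4) for margin in margins])
```

Each query selects the passage on the diagonal, which is the cross-lingual match the example is designed to show.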

Matryoshka Representation Learning

This model supports Matryoshka Representation Learning (MRL), which allows you to truncate embeddings to smaller dimensions (e.g., 512, 256, 128) with graceful performance degradation. This is useful for reducing storage and memory requirements.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Full 768-dimensional embeddings
full_embeddings = model.encode(["example text"])
print(full_embeddings.shape)  # (1, 768)

# Truncated to 256 dimensions — still effective for many retrieval tasks
truncated_embeddings = model.encode(["example text"], truncate_dim=256)
print(truncated_embeddings.shape)  # (1, 256)
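
Conceptually, Matryoshka truncation keeps the leading dimensions of an embedding. The sketch below shows that idea with plain Python lists, plus the storage arithmetic behind the memory savings. It rests on assumptions: the 4-dimensional vector is a toy value, the helper names are hypothetical, the renormalization step is a common choice for keeping dot products equal to cosine similarity rather than something the source specifies, and the 4 bytes per value assumes float32 storage.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for an index of `num_vectors` embeddings (float32 assumed)."""
    return num_vectors * dim * bytes_per_value


full = [0.5, 0.5, 0.5, 0.5]  # toy 4-dimensional unit vector
short = truncate_and_normalize(full, 2)
print(short)

print(storage_bytes(1_000_000, 768))  # 3072000000
print(storage_bytes(1_000_000, 256))  # 1024000000
```

Under these assumptions, one million vectors take about 3.07 GB at 768 dimensions and about 1.02 GB at 256 dimensions, a threefold reduction before any index overhead.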

Usage with Hugging Face Transformers

This is a simple example of how to use the granite-embedding-311m-multilingual-r2 model with the Transformers library and PyTorch. For a complete retrieval workflow including passage encoding and cosine similarity, see the Sentence Transformers example above.

First, install the required libraries:

pip install transformers torch

The model can then be used to encode text:

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    'What is the tallest mountain in Japan?',          # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?', # German query
    'ドイツの首都はどこですか？',                            # Japanese query
    ]

# tokenize inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# encode queries
with torch.no_grad():
    model_output = model(**tokenized_queries)
    # Perform pooling. granite-embedding-311m-multilingual-r2 uses CLS Pooling
    query_embeddings = model_output[0][:, 0]

# normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, dim=1)
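
The pooling and normalization steps above can be traced on a tiny hand-made example. The sketch below uses nested Python lists in place of tensors, with made-up values and hypothetical helper names; it mirrors what `model_output[0][:, 0]` followed by L2 normalization does, assuming the hidden states have shape [batch][sequence][hidden].

```python
import math


def cls_pool(last_hidden_state):
    """Take the first token's hidden state from every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy hidden states: 2 sequences, 3 tokens each, hidden size 2.
hidden = [
    [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]],
    [[0.0, 5.0], [2.0, 2.0], [9.0, 9.0]],
]

pooled = [l2_normalize(vec) for vec in cls_pool(hidden)]
print(pooled)  # [[0.6, 0.8], [0.0, 1.0]]

# For unit-length vectors, the dot product equals the cosine similarity.
similarity = sum(a * b for a, b in zip(pooled[0], pooled[1]))
print(similarity)  # 0.8
```

Only the first token of each sequence contributes to the embedding; the remaining token vectors are ignored by this pooling strategy.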

Optimized Inference and Deployment

ONNX and OpenVINO

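Pre-converted ONNX and OpenVINO models are released alongside the PyTorch weights for production deployment. These can be loaded through the Optimum library.

The command below installs the packages used by the LangChain and Milvus retrieval example further down this page: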

pip install langchain_huggingface sentence_transformers \
    langchain_milvus 'pymilvus[milvus_lite]' \
    langchain_community \
    langchain_text_splitters \
    wget

The ONNX model is compatible with any ONNX Runtime backend (CPU, CUDA, TensorRT, DirectML). The OpenVINO model is optimized for Intel hardware including CPUs and integrated GPUs.

vLLM

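The model can be served as an embedding endpoint using vLLM.

The following example builds a small retrieval pipeline with LangChain and a Milvus Lite vector store. It loads ibm-granite/granite-embedding-30m-english through HuggingFaceEmbeddings, indexes a sample text file, and runs a similarity search: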

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
import os, tempfile, wget

# load the embedding model
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")

# setup the vectordb
db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db = Milvus(
  embedding_function=embeddings_model,
  connection_args={"uri": db_file},
  auto_id=True,
  index_params={"index_type": "AUTOINDEX"},
  )

# load example corpus file
filename = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
  wget.download(url, out=filename)

loader = TextLoader(filename)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# add processed documents to the vectordb
vector_db.add_documents(texts)

# search the vectordb with the query
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
print(docs[0].page_content)
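
For the vLLM serving path mentioned in this section, a client can request embeddings over HTTP. The sketch below is a hedged illustration using only the Python standard library. It assumes a vLLM server has already been started separately for the model and exposes its OpenAI-compatible `/v1/embeddings` route at `http://localhost:8000`; the exact launch command and flags are not given in this document and depend on the vLLM version. The helper names are hypothetical.

```python
import json
import urllib.request


def build_embedding_request(texts, model, base_url="http://localhost:8000"):
    """Build a POST request for an OpenAI-compatible embeddings endpoint."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body):
    """Return the embedding vectors from a response body, in input order."""
    items = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda item: item["index"])]


request = build_embedding_request(
    ["What is the tallest mountain in Japan?"],
    "ibm-granite/granite-embedding-311m-multilingual-r2",
)
print(request.full_url)  # http://localhost:8000/v1/embeddings

# Sending the request requires a running server (assumed, not started here):
# with urllib.request.urlopen(request) as response:
#     vectors = parse_embedding_response(response.read())
```

Building the request separately from sending it keeps the payload easy to inspect and lets the same helper target a different host by changing `base_url`.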
