Documentation Index

Fetch the complete documentation index at: https://wwwpoc.ibm.com/llms.txt

Use this file to discover all available pages before exploring further.

Model Collection

View the full Granite Speech collection on Hugging Face

Speech Demo

Try Granite Speech in action

WebGPU Demo

Run Granite Speech in your browser

Replicate

Deploy Granite Speech on Replicate

Overview

The Granite Speech 4.1 model family provides compact and efficient speech-language models for multilingual automatic speech recognition (ASR) and automatic speech translation (AST), supporting English, French, German, Spanish, Portuguese, and Japanese. All models are trained on 174,000 hours of audio from public corpora and tailored synthetic datasets.

Model Variants

The Granite Speech 4.1 suite includes three specialized variants:
  • granite-speech-4.1-2b: Balanced ASR and AST capabilities with improved punctuation and capitalization across all supported languages
  • granite-speech-4.1-2b-plus: Speech-to-text model with speaker-attributed ASR, timestamps, and keyword-prompted ASR for enhanced recognition of names, acronyms, and technical jargon
  • granite-speech-4.1-2b-nar: Non-autoregressive variant (NLE architecture) optimized for fast and accurate ASR with significantly lower latency
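
The variants live in the Granite Speech collection on Hugging Face, and switching between them in the transformers example below is mostly a matter of swapping the model ID. A brief sketch (the -plus repository ID is an assumption based on the base model's naming; the non-autoregressive -nar variant may need a different loading path, so consult its model card):

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Assumed repository IDs following the base model's naming convention
model_name = "ibm-granite/granite-speech-4.1-2b"        # balanced ASR + AST
# model_name = "ibm-granite/granite-speech-4.1-2b-plus" # speaker-attributed ASR, timestamps, keyword prompting

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)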

Performance

Granite Speech 4.1 models deliver industry-leading performance on the OpenASR Leaderboard: granite-speech-4.1-2b ranks #1 for accuracy, while granite-speech-4.1-2b-nar places #3 with exceptional speed through its non-autoregressive architecture. Granite Speech is released under the Apache 2.0 license, making it freely available for both research and commercial purposes, with full transparency into its training data.

Granite Speech Paper

Getting Started

Granite Speech models are supported natively in transformers>=4.52.1. Below is a simple example of how to use the granite-speech-4.1-2b model.

Usage with Transformers

First, make sure to install a recent version of transformers:
pip install transformers torchaudio peft soundfile
Then run the code:
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-4.1-2b"
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
)

# Load audio
audio_path = hf_hub_download(repo_id=model_name, filename="multilingual_sample.wav")
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16kHz

# Create text prompt
user_prompt = "<|audio|>transcribe the speech with proper punctuation and capitalization."
chat = [
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Run the processor + model
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
model_outputs = model.generate(
    **model_inputs, max_new_tokens=200, do_sample=False, num_beams=1
)

# Transformers includes the input IDs in the response
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = model_outputs[0, num_input_tokens:].unsqueeze(0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0]}")

Usage with vLLM

First, make sure to install vLLM:
pip install vllm

Offline mode

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_id = "ibm-granite/granite-speech-4.1-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

model = LLM(
    model=model_id,
    max_model_len=2048, # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

Online mode

First, launch the vLLM server:
vllm serve ibm-granite/granite-speech-4.1-2b \
    --api-key token-abc123 \
    --max-model-len 2048
Then, use the OpenAI-compatible API:
import base64

import requests
from openai import OpenAI

from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "ibm-granite/granite-speech-4.1-2b"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url

# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio retrieved from a remote url to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return result

audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)

question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    model=model_name,
)

print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")