The Granite Speech 4.1 model family provides compact and efficient speech-language models for multilingual automatic speech recognition (ASR) and automatic speech translation (AST), supporting English, French, German, Spanish, Portuguese, and Japanese. All models are trained on 174,000 hours of audio from public corpora and tailored synthetic datasets.
The Granite Speech 4.1 suite includes three specialized variants:
granite-speech-4.1-2b: Balanced ASR and AST capabilities with improved punctuation and capitalization across all supported languages
granite-speech-4.1-2b-plus: Speech-to-text model with speaker-attributed ASR, timestamps, and keyword-prompted ASR for enhanced recognition of names, acronyms, and technical jargon
granite-speech-4.1-2b-nar: Non-autoregressive ASR model optimized for fast inference
Granite Speech 4.1 models deliver industry-leading performance on the OpenASR Leaderboard: granite-speech-4.1-2b ranks #1 for accuracy, while granite-speech-4.1-2b-nar places #3 with exceptional speed through its non-autoregressive architecture.

Granite Speech is released under the Apache 2.0 license, making it freely available for both research and commercial purposes, with full transparency into its training data.

Granite Speech Paper
The following example transcribes an audio file with granite-speech-4.1-2b served through vLLM's OpenAI-compatible API server:
import base64

import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "ibm-granite/granite-speech-4.1-2b"

# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url


def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio file retrieved from a remote URL to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return result


# Use base64-encoded audio in the payload
audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)
question = "can you transcribe the speech into a written format?"

chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    model=model_name,
)

print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
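If the audio lives on disk rather than at a remote URL, the same request payload works with a locally encoded file. A minimal sketch of building the base64 data URL from a local file, assuming the file's format matches the MIME subtype you pass (the helper names here are illustrative, not part of vLLM or the OpenAI client):

```python
import base64


def encode_audio_base64_from_file(path: str) -> str:
    """Encode a local audio file to base64 format."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_audio_data_url(path: str, mime: str = "audio/wav") -> str:
    """Build the data URL used in the audio_url field of the request payload.

    The MIME subtype should match the actual file format; any format
    supported by librosa on the server side should work.
    """
    return f"data:{mime};base64,{encode_audio_base64_from_file(path)}"
```

The resulting string can be dropped directly into the `"url"` field of the `audio_url` content part shown above.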