IBM Granite 3.3: Speech recognition, refined reasoning, and RAG LoRAs

16 April 2025

 

Author

Kate Soule

Director, Technical Product Management, Granite

IBM

Dave Bergmann

Senior Writer, AI Models

IBM

Here's the key info, at a glance:

  • We’re releasing Granite Speech 3.3 8B, a new speech-to-text (STT) model that excels in automatic speech recognition (ASR) and automatic speech translation (AST).
  • The new audio model is built on top of Granite 3.3 8B Instruct, the latest update to our workhorse enterprise large language model (LLM). Alongside enhanced reasoning capabilities, the Granite 3.3 Instruct models now offer fill-in-the-middle (FIM) capabilities in addition to standard next-token prediction.
  • To enhance existing Granite-driven applications, we’re also releasing a suite of retrieval augmented generation (RAG)-focused LoRA adapters for Granite 3.2. Feedback will inform development of LoRA adapters for Granite 3.3 Instruct, which will be released shortly, as well as for future generations of Granite LLMs.
  • Alongside these conventional adapters, IBM Research has also developed a series of activated LoRAs (aLoRAs), an experimental new kind of low-rank adaptation (LoRA) that cuts inference costs and memory requirements while enabling seamless switching between adapters.
  • As always, all Granite models and tools are released open source under a standard Apache 2.0 license.
  • All Granite 3.3 models and associated tools are available on Hugging Face. Granite 3.3 Instruct is also available on IBM watsonx.ai, as well as through platform partners including LMStudio, Ollama and Replicate.


Today’s launch represents another expansion of IBM Granite’s multimodal footprint. Headlined by Granite Speech 8B, our first official speech-to-text model, Granite 3.3 marks the beginning of our explorations into audio capabilities. Alongside the recent addition of vision and reasoning capabilities, IBM continues to grow the versatility of the Granite series across the enterprise use cases that customers and the open source community need most.

Joining Granite Speech 3.3 8B are Granite 3.3 8B Instruct, the large language model (LLM) that serves as its foundation, and its smaller (2B) counterpart. The text models’ more sophisticated reasoning process and the addition of fill-in-the-middle (FIM) capabilities open up a wider array of use cases, particularly in the coding domain.

We’re also releasing an updated and expanded series of performance-enhancing (and primarily RAG-focused) LoRA adapters for the previously released Granite 3.2 8B Instruct model through Granite Experiments, an IBM Research playground for testing open source ideas. Further LoRA innovations, including a suite of adapters for Granite 3.3 Instruct, will be launched in the coming weeks.

Granite Speech 3.3 8B: Accurate, efficient transcription and translation

Granite Speech 3.3 8B is a compact and cost-efficient audio-in (and text-in), text-out STT model, intended for use in enterprise applications that process speech inputs and optimized for automatic speech recognition (ASR) and automatic speech translation (AST).

On transcription tasks, Granite Speech 3.3 consistently delivers greater accuracy than leading open and closed model competitors in testing across several prominent public datasets.

The model also provides automated translation from English to a diverse array of languages, including French, Spanish, Italian, German, Portuguese, Japanese and Mandarin. In IBM testing of AST performance, Granite Speech 3.3 8B kept pace with leading proprietary models such as OpenAI’s GPT-4o and Google’s Gemini 2.0 Flash on Granite-supported languages in the CoVoST dataset. More information on translation performance is available in the model's Hugging Face model card.

Architecture and design

Architecturally speaking, Granite Speech 3.3 consists of:

  • A speech encoder, comprising 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on ASR-focused datasets.
  • A speech projector—in this case, a 2-layer query transformer (Q-former)—that projects audio embeddings to a space where they can be interpreted by an LLM.
  • An LLM—namely, Granite 3.3 8B Instruct with 128K context length.
  • LoRA adapters, applied to the LLM’s query and value projection matrices when audio data is present.

In contrast to directly integrated models that combine speech and text in a single pass, Granite Speech 3.3 uses a two-pass design. For instance, asking the model questions about an audio file requires an initial call to transcribe the audio and a second prompt to query the model about that transcribed text. If a prompt contains the “<audio>” token and a corresponding .wav file, Granite Speech will engage the audio encoder, projector and LoRA adapter. If not, the model will simply run in text mode using Granite 3.3 8B Instruct.

This two-pass approach ensures that Granite Speech 3.3 8B’s performance on text queries mirrors that of its underlying LLM (Granite 3.3 8B Instruct), avoiding the degradation on text-based performance typical of many multimodal models. Provided access to an inference platform configured to properly serve both text and speech models, developers can essentially understand Granite Speech 3.3 8B as a version of Granite 3.3 8B Instruct with added audio-in capabilities.
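To make the two-pass pattern concrete, here is a minimal sketch using the Hugging Face transformers library. The model id, the <audio> placeholder string and the processor call signature are assumptions based on the description above; consult the Granite Speech 3.3 8B model card on Hugging Face for the exact prompt template and usage.

```python
# Minimal two-pass sketch. Model id, placeholder token and processor signature
# are assumptions; the Hugging Face model card documents the exact usage.
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-3.3-8b"   # assumed Hugging Face model id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Pass 1: the <audio> placeholder plus a .wav file engages the speech encoder,
# projector and audio LoRA; the model returns a transcript.
wav, sr = torchaudio.load("meeting.wav")         # 16 kHz mono audio assumed
chat = [{"role": "user", "content": "<audio>Transcribe this recording into written text."}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, wav, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=512)
transcript = processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Pass 2: with no <audio> token, the model runs in text mode as Granite 3.3 8B
# Instruct and can answer questions about the transcript.
chat = [{"role": "user", "content": f"Summarize the key decisions in this transcript:\n{transcript}"}]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor.tokenizer(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```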

Unlike conventional Whisper-based ASR models, Granite Speech 3.3 can accept inputs of arbitrary length—in testing, the model was comfortably able to process a 20-minute audio file on an H100 80GB GPU—rather than being fixed to a 30-second window. In Whisper-based models, audio files exceeding that maximum must be cut into 30-second chunks, which often introduces inaccuracies near the moments where those cuts are imposed. As a general rule, the fewer artificial cuts you need to make, the less inaccuracy you introduce.

While Granite Speech 3.3 can ostensibly ingest rather long audio inputs, it’s worth noting that the model has not yet been fine-tuned on long audio data. To maintain consistent accuracy, we suggest a limit of 1 minute for each discrete unit of audio input.
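One practical way to stay within that suggestion is to split longer recordings into roughly one-minute segments before transcription. The sketch below uses torchaudio for a naive fixed-length split; in practice, splitting on silences rather than at arbitrary offsets reduces the kind of mid-utterance cuts discussed above. The file name is illustrative.

```python
# Split a long recording into <=60-second chunks before transcription.
import torchaudio

def split_audio(path: str, chunk_seconds: int = 60):
    wav, sr = torchaudio.load(path)              # wav shape: (channels, samples)
    step = chunk_seconds * sr
    return [wav[:, start:start + step] for start in range(0, wav.shape[1], step)]

chunks = split_audio("long_interview.wav")       # transcribe each chunk, then join the transcripts
```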

Avenues for improvement

Granite Speech 3.3 represents only the opening salvo of IBM’s exploration into audio capabilities for the Granite series. Ongoing research into enhancing Granite Speech for future releases—particularly in Granite 4—includes:

  • Multilingual encoding: Presently, Granite Speech 3.3’s audio encoder is English only. An important next step for Granite Speech involves audio encoders that are multilingual and sensitive to paralinguistic phenomena, allowing us to enable true multilingual inputs.
  • Refined data recipes: Future training regimens will incorporate more and higher quality training data, with synthetic data generation for targeted use cases playing an important role. We’re also experimenting with additional fine-tuning and data balancing steps.
  • Earlier modality fusion: We’re exploring implementation of a more unified structure that incorporates audio features at all training stages of future Granite models.
  • Emotion detection: Future Granite Speech models will support speech emotion recognition (SER) capabilities through training our acoustic encoder to be more sensitive to non-lexical audio events.

Granite 3.3 Instruct: FIM and enhanced reasoning

The latest versions of our text-only instruction-tuned models, Granite 3.3 8B Instruct and Granite 3.3 2B Instruct, add fill-in-the-middle (FIM) capabilities and continue to refine the thinking capabilities introduced in Granite 3.2.

We’re also releasing their base model counterparts—Granite 3.3 8B Base and Granite 3.3 2B Base, which now supersede their predecessors from Granite 3.1—to provide developers with access to our FIM-capable models for their own fine-tuning endeavors.

Filling in the middle

Autoregressive LLMs—the LLMs typically used for text generation—are fundamentally designed to move forward, from left to right. They’re trained through self-supervised learning to iteratively predict the next token in a sequence, based on information from the preceding tokens, until the sequence is deemed complete. While that design lends itself to an impressive variety of generative tasks, it inherently falls short on a different kind of task: predicting the correct tokens based on the tokens that come before and after. In other words, conventional autoregressive LLMs cannot “fill in the middle.”

Adapting autoregressive models for infilling requires a redesign of training tasks that essentially “tricks” the LLM into predicting tokens in the middle using its intrinsic left-to-right prediction ability. This generally requires dividing a sample passage into prefix (the preceding tokens), suffix (the tokens that come after) and middle (the tokens to be predicted by infilling), then rearranging the passage such that the model is provided both prefix and suffix before being asked to predict the middle tokens. Granite 3.3 uses specialized tokens to enable the model to generate content conditioned on both prefix and suffix, as sketched below.
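The sketch below assembles a FIM prompt for a code completion. The control tokens shown (<fim_prefix>, <fim_suffix>, <fim_middle>) follow a common fill-in-the-middle convention and are illustrative assumptions, as is the model id; the Granite 3.3 model cards list the exact special tokens defined by the tokenizer.

```python
# Illustrative FIM prompt: the model sees both prefix and suffix, then predicts
# the missing middle. Token names and model id are assumptions; check the
# Granite 3.3 model card for the exact special tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"   # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prefix = 'def mean(values):\n    """Return the arithmetic mean of values."""\n'
suffix = "\n    return total / len(values)\n"

# Rearranged sample: prefix and suffix are provided up front; the model infills the middle.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```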

While FIM has a wide array of use cases, it’s particularly applicable to coding tasks, from code repair and error correction to refactoring to quickly generating boilerplate code and enabling the insertion of function arguments or docstrings.

Enhanced reasoning

Our focus for Granite 3.2 was enriching the Instruct models’ reasoning abilities through Thought Preference Optimization (TPO) to improve their ability to follow complex instructions without sacrificing general performance. Our focus for Granite 3.3 Instruct was to preserve those gains while also enriching the models’ performance on complex mathematical reasoning.

Built on top of an updated Granite 3.3 base model and fine-tuned through multi-stage reinforcement learning using TPO and Group Relative Policy Optimization (GRPO), both Granite 3.3 Instruct models demonstrated significant improvement on the highly technical benchmarks conventionally associated with “reasoning” capabilities.

Granite 3.3 8B’s performance on the MATH500 benchmark puts it comfortably ahead of Anthropic’s Claude 3.5 Haiku (64.2%) and Meta’s Llama 3.1 8B Instruct (44.4%), roughly in line with the 24B-parameter Mistral Small 3 (70.6%), and barely behind Claude 3.5 Sonnet (72.4%) and OpenAI’s GPT-4o Mini (72.6%).1

As with the Granite 3.2 Instruct models, “thinking” can be easily toggled on and off, allowing developers to prioritize enhanced chain-of-thought (CoT) reasoning when they need it and prioritize cost-efficiency and low latency when they don’t.
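With the Hugging Face transformers library, that toggle can be expressed as a chat-template flag. The thinking keyword and model id below are assumptions based on the Granite model card documentation; verify the exact flag name there.

```python
# Toggling extended reasoning. The thinking flag and model id are assumptions;
# the Granite model card documents the exact chat-template controls.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"   # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

chat = [{"role": "user", "content": "A train covers 120 km in 1.5 hours. What is its average speed?"}]

# thinking=True requests chain-of-thought reasoning; omit it (or pass False)
# for a direct, lower-latency answer.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, thinking=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```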

Refining RAG through LoRA adapters

To enhance existing Granite-based applications and inform development of the next generation of performance-enhancing LoRA adapters, IBM is also releasing a collection of 5 (mostly) RAG-specific LoRA adapters for Granite 3.2 8B Instruct through Granite Experiments, an IBM Research playground for testing open source ideas. Each of these LoRA adapters leverages the model’s intrinsic knowledge to enable a specific task, such as rewriting retrieval queries or detecting hallucinations.

IBM Research developed these “conventional” LoRA adapters alongside counterparts for each that use a new kind of low-rank adaptation that we call activated LoRAs (aLoRAs). Swapping between standard LoRA adapters often slows down performance because the model must recompute the context of the ongoing conversation using the new adapter. But unlike standard LoRAs, IBM’s aLoRAs simply reuse the existing key-value (KV) cache, avoiding the need to recompute the context (or “prefill”). Activated LoRAs match the generation quality of standard LoRAs while providing significant runtime and compute advantages. Source code to run the aLoRAs is available here.
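For the conventional adapters, attaching one to the base model follows the standard peft pattern sketched below. The adapter repository name is a hypothetical placeholder (the released adapter ids are listed on the Granite Experiments Hugging Face pages), and the aLoRA variants instead use the separate runtime code linked above.

```python
# Attaching a conventional RAG LoRA to Granite 3.2 8B Instruct with peft.
# The adapter id below is a hypothetical placeholder; see the Granite
# Experiments Hugging Face pages for the released adapter repositories.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "ibm-granite/granite-3.2-8b-instruct"                      # assumed model id
adapter_id = "ibm-granite/granite-rag-hallucination-detection-lora"  # hypothetical adapter id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)   # base weights stay frozen; LoRA weights are applied on top
```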

RAG Hallucination Detection
Even with RAG, an LLM can sometimes hallucinate. When equipped with the RAG Hallucination Detection LoRA, the model will provide a “faithfulness score” between 0–1 (in increments of 0.1), reflecting how closely its output reflects the information contained within the retrieved documents. A lower faithfulness score indicates higher hallucination risk. The model will output “unanswerable” when the question cannot be answered with information from available sources.

RAG Query Rewrite
Retrieval engines return significantly better results in response to standalone queries that contain all relevant information than they do in response to queries that require context from earlier in the conversation to be actionable. With the Query Rewrite LoRA equipped, the model will automatically rewrite any non-standalone user query into a fully self-contained query. For instance, consider this exchange:

User: “Who’s the CEO of Apple?”
Model: “Tim Cook is the Chief Executive Officer of Apple Inc.”
User: “What about Microsoft?”

The model will pass the user’s first query as is, but rewrite the second query as, “Who is the CEO of Microsoft?”. In testing, this rewriting increased the relevance of model responses by as much as 21 percentage points.

Though it was designed with RAG in mind, Query Rewrite doesn’t require the presence of RAG documents: it can also be used to rewrite user queries for other use cases, such as tool calls.

RAG Citation Generation
When equipped with the RAG Citation Generation LoRA, the model will generate a citation for each sentence of its output (if that sentence was informed by any external sources). Each sentence-level citation not only notes any source(s) referenced, but also contains a set of sentences from the cited source(s) that support the model’s corresponding output sentence.

RAG Answerability Prediction
When equipped with the RAG Answerability Prediction LoRA, the model will determine whether or not the user’s query can be answered using the information available in connected documents. This binary classification—“answerable” or “unanswerable”—can be used to, among other things, filter out unanswerable questions (reducing hallucinations) or prompt the model to re-query the retriever in a different way.

Uncertainty Prediction
For each model output, the Uncertainty LoRA—born from the MIT-IBM Watson AI Lab’s AI model calibration research—enables the model to generate a quantized “certainty score” ranging from 0 to 9 (representing 5% to 95% certainty, respectively). The score essentially reflects the extent to which the model’s response is supported by information contained within its training data.
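Assuming the quantized score maps linearly onto that range (each step worth 10 percentage points, which is our reading of the 0-to-9 / 5%-to-95% description), converting it to a percentage is a one-liner:

```python
# Convert a quantized certainty score (0-9) to a percentage, assuming a linear
# mapping: 0 -> 5%, 9 -> 95%.
def certainty_percent(score: int) -> int:
    return 5 + 10 * score

print(certainty_percent(0), certainty_percent(9))   # 5 95
```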

Combining RAG LoRAs

Whereas traditional RAG entails a single inference—a straightforward prompt grounded in specific context—yielding a single model output, we propose the use of these LoRAs in workflows that leverage multiple LoRA adapters across multiple inferences en route to a final model response.

For instance, you can first implement Query Rewrite to (when necessary) quickly rewrite initial prompts for optimal retriever accuracy. Once the model’s retrieval-augmented response has been generated using the rewritten prompt, you might then implement RAG Hallucination Detection to verify an appropriate level of faithfulness to the information in the retrieved documents. If the faithfulness score falls beneath an acceptable threshold, your workflow could direct the model to resample the response until the faithfulness score exceeds that threshold. Once hallucinations are no longer detected, you could then engage RAG Citation Generation for the final response provided to the user.
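Sketched as code, such a workflow might look like the following. The helper functions (rewrite_query, retrieve, generate_answer, faithfulness_score, add_citations) are hypothetical stand-ins for calls to the respective LoRA adapters and your retrieval engine; they are not part of any released Granite API, and the threshold is an arbitrary example value.

```python
# Hypothetical orchestration of the RAG LoRAs described above. Every helper
# here is a placeholder for your own adapter and retriever calls.
def answer_with_rag(query: str, history: list, threshold: float = 0.8, max_tries: int = 3) -> str:
    standalone = rewrite_query(query, history)        # RAG Query Rewrite LoRA
    documents = retrieve(standalone)                  # your retrieval engine

    for _ in range(max_tries):
        answer = generate_answer(standalone, documents)              # grounded generation
        if faithfulness_score(answer, documents) >= threshold:       # RAG Hallucination Detection LoRA
            return add_citations(answer, documents)                  # RAG Citation Generation LoRA

    return "Unable to produce a sufficiently grounded answer."
```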

This is essentially the RAG equivalent of scaling test-time compute, scaffolding multiple inferences to improve and enrich the model’s final output. We’re excited to see how the open source community will implement and experiment with these new LoRA adapters. More information on the RAG LoRAs and their impact on model performance is available in the accompanying technical paper.

What's next for IBM Granite?

IBM Research is actively training Granite 4.0, a new generation of models that represent a major evolution of the Granite architecture and demonstrate promising gains in speed, context length and capacity. Though specific details will not be announced until later in Q2, clients, partners and developers can count on IBM maintaining its commitment to small, practical models that can be run at low cost and latency.

Getting started with Granite 3.3

The new Granite 3.3 Instruct models are live on IBM watsonx.ai, our integrated, end-to-end studio for enterprise AI development. You can try Granite 3.3 Instruct 8B—and easily experiment with toggling “thinking” on and off—on the Granite Playground.

Granite Speech 3.3 8B, along with all of the new Granite 3.3 models and LoRA adapters, is available on Hugging Face. Select Instruct models are also available through platform partners including (in alphabetical order) LMStudio, Ollama and Replicate, with more to come in the near future.

A number of guides and recipes for working with Granite models are available in Granite docs and the Granite Snack Cookbook on GitHub, where developers can get started by exploring our array of useful demos, recipes and tutorials.

Explore the new IBM Granite 3.3 models→
 


Footnotes

1"MATH 500 Benchmark," Vals AI, last updated 24 March 2025
