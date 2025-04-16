To enhance existing Granite-based applications and inform development of the next generation of performance-enhancing LoRA adapters, IBM is also releasing a collection of 5 (mostly) RAG-specific LoRA adapters for Granite 3.2 8B Instruct through Granite Experiments, an IBM Research playground for testing open source ideas. Each of these LoRA adapters leverages the model’s intrinsic knowledge to enable a specific task, such as rewriting retrieval queries or detecting hallucinations.

IBM Research developed these “conventional” LoRA adapters alongside counterparts for each that use a new kind of low-rank adaptation that we call activated LoRAs (aLoRAs). Swapping between standard LoRA adapters often slows down performance because the model must recompute the context of the ongoing conversation using the new adapter. Unlike standard LoRAs, IBM’s aLoRAs simply reuse the existing key-value (KV) cache, avoiding the need to recompute the context (or “prefill”) again. Activated LoRAs match the generation quality of standard LoRAs while providing significant runtime and compute advantages. Source code to run the aLoRAs is available here.
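
To make the prefill saving concrete, here is a toy cost model in Python. It is a sketch under stated assumptions, not IBM's implementation: it assumes that swapping in a standard LoRA invalidates the cached keys and values for the whole conversation, while an aLoRA only has to process the tokens that arrive after it is activated. The function names and token counts are hypothetical.

```python
def prefill_tokens_standard_lora(context_tokens: int, new_tokens: int) -> int:
    """Tokens to process when an adapter swap invalidates the KV cache.

    Assumption: the whole conversation is recomputed under the new adapter,
    followed by the newly added tokens.
    """
    return context_tokens + new_tokens


def prefill_tokens_alora(context_tokens: int, new_tokens: int) -> int:
    """Tokens to process when the existing KV cache is reused as is.

    Assumption: only the tokens added after activation need processing,
    so context_tokens does not contribute to the cost.
    """
    return new_tokens
```

With a 4,000-token conversation and 20 newly added tokens, the first function returns 4,020 and the second returns 20. Real savings depend on the serving stack and hardware.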

RAG Hallucination Detection

Even with RAG, an LLM can sometimes hallucinate. When equipped with the RAG Hallucination Detection LoRA, the model will provide a “faithfulness score” from 0 to 1 (in increments of 0.1), indicating how closely its output reflects the information contained within the retrieved documents. A lower faithfulness score indicates a higher hallucination risk. The model will output “unanswerable” when the question cannot be answered with information from available sources.
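
An application might consume that output along the following lines. This is a minimal sketch: it assumes the adapter's raw output arrives as a string holding either a decimal score or the word “unanswerable”, which the text above does not specify, and the 0.5 review threshold is a hypothetical choice, not an IBM recommendation.

```python
from typing import Optional

UNANSWERABLE = "unanswerable"


def parse_faithfulness(raw: str) -> Optional[float]:
    """Return the score as a float, or None if the output is 'unanswerable'.

    Assumption: the raw output is a bare decimal string or 'unanswerable'.
    """
    text = raw.strip().lower()
    if text == UNANSWERABLE:
        return None
    score = float(text)  # raises ValueError on unexpected output
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def needs_review(score: Optional[float], threshold: float = 0.5) -> bool:
    """Flag unanswerable or low-faithfulness responses.

    The threshold is a hypothetical default; tune it per application.
    """
    return score is None or score < threshold
```

Treating “unanswerable” as its own case, rather than as a score of 0, keeps “no supporting information” distinct from “poorly supported answer” downstream.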

RAG Query Rewrite

Retrieval engines return significantly better results in response to standalone queries that contain all relevant information than they do in response to queries that require context from earlier in the conversation to be actionable. With the Query Rewrite LoRA equipped, the model will automatically rewrite any non-standalone user query into a fully self-contained query. For instance, consider this exchange:

User: “Who’s the CEO of Apple?”
Model: “Tim Cook is the Chief Executive Officer of Apple Inc.”
User: “What about Microsoft?”

The model will pass the user’s first query through as is, but rewrite the second query as “Who is the CEO of Microsoft?” In testing, this rewriting increased the relevance of model responses by as much as 21 percentage points.

Though it was designed with RAG in mind, Query Rewrite doesn’t require the presence of RAG documents: it can also be used to rewrite user queries for other use cases, such as tool calls.

RAG Citation Generation

When equipped with the RAG Citation Generation LoRA, the model will generate a citation for each sentence of its output (if that sentence was informed by any external sources). Each sentence-level citation not only notes any source(s) referenced, but also contains a set of sentences from the cited source(s) that support the model’s corresponding output sentence.
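
One way to picture the sentence-level structure described above is as a pair of small Python data classes. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect the adapter's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    source_id: str                   # hypothetical identifier of the cited document
    supporting_sentences: List[str]  # sentences from that source backing the claim


@dataclass
class CitedSentence:
    text: str                        # one sentence of the model's output
    citations: List[Citation] = field(default_factory=list)

    @property
    def is_grounded(self) -> bool:
        """True if at least one external source informed this sentence."""
        return len(self.citations) > 0
```

A sentence with an empty citation list corresponds to the case where no external source informed it.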

RAG Answerability Prediction

When equipped with the RAG Answerability Prediction LoRA, the model will determine whether or not the user’s query can be answered using the information available in connected documents. This binary classification—“answerable” or “unanswerable”—can be used to, among other things, filter out unanswerable questions (reducing hallucinations) or prompt the model to re-query the retriever in a different way.
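
The filtering and re-query pattern can be sketched as a small control loop. This is a sketch under assumptions: `retrieve`, `classify`, `generate`, and `reformulate` are hypothetical callables standing in for the retriever, the Answerability Prediction adapter, the base model, and a query reformulation step, and the retry limit is arbitrary.

```python
from typing import Callable, List, Optional


def answer_or_abstain(
    query: str,
    retrieve: Callable[[str], List[str]],
    classify: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    reformulate: Callable[[str], str],
    max_attempts: int = 2,
) -> Optional[str]:
    """Answer only when the classifier labels the query 'answerable'.

    Otherwise re-query the retriever with a reformulated search string,
    and return None (abstain) once max_attempts retrievals have failed.
    """
    search = query
    for _ in range(max_attempts):
        docs = retrieve(search)
        if classify(query, docs) == "answerable":
            return generate(query, docs)
        search = reformulate(search)  # try the retriever in a different way
    return None
```

Returning `None` rather than a generated answer is what filters out the unanswerable case; the caller decides how to phrase the refusal.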

Uncertainty Prediction

For each model output, the Uncertainty LoRA—born from the MIT-IBM Watson AI Lab’s AI model calibration research—enables the model to generate a quantized “certainty score” ranging from 0 to 9 (representing 5% to 95% certainty, respectively). The score essentially reflects the extent to which the model’s response is supported by information contained within its training data.
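
As a worked example of that scale, the snippet below converts a quantized score to a percentage. It assumes the ten levels are evenly spaced between the stated endpoints (0 maps to 5%, 9 maps to 95%), which gives steps of 10 percentage points; the text above states only the endpoints, and the function name is hypothetical.

```python
def certainty_percent(score: int) -> int:
    """Map a quantized certainty score (0-9) to a percentage.

    Assumption: levels are evenly spaced, so score k means 5 + 10*k percent.
    """
    if not 0 <= score <= 9:
        raise ValueError(f"score must be between 0 and 9, got {score}")
    return 5 + 10 * score
```

Under this assumption, a score of 4 corresponds to 45% certainty and a score of 7 to 75%.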