IBM Granite

Granite 3.2

Table of contents

  1. Overview
  2. Model cards
  3. Run locally with Ollama
  4. Chat template
    1. Basic chat template
    2. Advanced features
      1. Originality in RAG tasks
      2. Control the response length
      3. Thinking capability
  5. Granite 3.2 examples
    1. Basic inference examples
      1. Summarization
      2. Text classification
      3. Text extraction
      4. Text translation
    2. Advanced inference examples
      1. RAG with annotations
      2. Length annotation
      3. Reasoning
      4. Function-calling
      5. Long-context inference
  6. Backward compatibility
  7. Building prompts with transformers library
  8. Inference Tips

Overview

Granite 3.2 models incorporate enhanced reasoning capabilities and advanced features for RAG tasks. Finetuned from Granite 3.1 instruction models, they build on a foundation of open-source instruction datasets with permissive licenses and internally curated synthetic datasets designed for long-context problem-solving. These models maintain key strengths of their predecessors, including support for a 128K context length, exceptional performance in RAG and function-calling tasks, and competitive results across general, enterprise, and safety benchmarks. Released as open source, Granite 3.2 models are available under the Apache 2.0 license.

Model cards

Run locally with Ollama

Learn more about Granite 3.2 on Ollama.

Granite 3.2 highlights

Granite 3.2 models, like their predecessors, support a long context length and excel in retrieval-augmented generation (RAG) and function-calling tasks. Additionally, they incorporate advanced features to control the response length and the originality of the text generated for RAG responses. Furthermore, Granite 3.2 models come with enhanced reasoning capabilities.

Chat template

To obtain the best performance from Granite 3.2 instruction models, we recommend using our official chat template.

Basic chat template

Below, we show a basic example of the chat template for Granite 3.2 models.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 18, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is the largest ocean on Earth?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>The largest ocean on Earth is the Pacific Ocean. It covers an area of about 63,800,000 square miles (165,250,000 square kilometers), which is more than all of Earth's land areas combined.<|end_of_text|>

The elements of this chat template serve the following purpose:

  • <|start_of_role|>, <|end_of_role|>, and <|end_of_text|> are special tokens for prompt control.
    • <|start_of_role|> indicates the start of a turn.
    • <|end_of_role|> indicates the end of the control token, and the start of textual content.
    • <|end_of_text|> indicates the end of the turn.
  • system, user, assistant, documents, tools and tool_response are generic turn types that can be used to build prompts.
    • system is useful for adding a static system prompt, which is optional. Depending on the text-generation task, we recommend using the following system prompts with Granite 3.2 models:
      • system prompt for standard single and multiturn conversations
      Knowledge Cutoff Date: April 2024.
      Today's Date: February 18, 2025.
      You are Granite, developed by IBM. You are a helpful AI assistant.
      • system prompt for RAG generation tasks (without annotations)
      Knowledge Cutoff Date: April 2024.
      Today's Date: February 18, 2025.
      You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.
      • system prompt for Reasoning tasks
      Knowledge Cutoff Date: April 2024.
      Today's Date: February 18, 2025.
      You are Granite, developed by IBM. You are a helpful AI assistant.
      Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query.

      Important: using custom system prompts is possible, but we recommend it only for advanced users. Managing unexpected results requires expert-level knowledge of the chat template that Granite 3.2 models were trained with.

    • Use user for queries from the user or an external agent.
    • Use assistant for generations from the model itself.
    • Use documents in RAG tasks to provide all available grounding documents to the model.
    • Use tools in function-calling tasks to provide the list of tools to the model.
    • Use tool_response in function-calling tasks to feed back to the model the external system execution of a tool call.

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
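
The snippet above stops after loading the model and tokenizer; a sketch of the remaining steps is shown below. It reuses model, tokenizer, and device from that snippet, and the generation settings are assumptions, so the wording of the output may vary slightly.

chat = [
    {"role": "user", "content": "What is the largest ocean on Earth?"},
]
# Render the basic chat template and append the generation prompt for the assistant turn
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
# max_new_tokens is an assumed value; adjust it to your task
output = model.generate(**input_tokens, max_new_tokens=512)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))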

Advanced features

Our chat template supports advanced features to regulate response length, set the tone of RAG outputs, and encourage the model to engage in deeper reasoning. In this section, we demonstrate how to use our chat template to guide the model’s responses for each scenario.

Originality in RAG tasks

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>
<|start_of_role|>documents<|end_of_role|>Document 0
Dynamic Data: Automatically generate and refine graphs to best fit your domain and ontology needs.
Document 1
RAG, retrieval-augmented generation, is a technique that grants generative artificial intelligence models information retrieval capabilities.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is RAG?<|end_of_text|>

Important considerations:

  • Use the optional originality annotation to moderate the response style of the assistant.
  • The supported values are extractive and abstractive. The extractive mode guides the model to generate responses using nearly direct sentences from the source text, while the abstractive mode paraphrases the source text to formulate an answer.
  • The annotation will only influence the immediate next generation of the assistant role.
  • To activate this feature you must use the system prompt for RAG tasks. Otherwise, this feature will not be activated.
  • For RAG tasks in general:
    • Use the documents turn type to provide the list of documents the model should consider. Inside the list of documents, a document header signals the beginning of each document, and a line break signals its end.
    • The assistant returns a default message if the question cannot be answered with the documents provided.

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
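
The snippet above stops after loading the model. A minimal sketch of the remaining steps follows; it reuses model, tokenizer, and device, passes the documents from the prompt above, and sets the originality annotation through the documents and controls kwargs described in the "Building prompts with transformers library" section. The exact document-dictionary format is an assumption.

documents = [
    {"text": "Dynamic Data: Automatically generate and refine graphs to best fit your domain and ontology needs."},
    {"text": "RAG, retrieval-augmented generation, is a technique that grants generative artificial intelligence models information retrieval capabilities."},
]
chat = [{"role": "user", "content": "What is RAG?"}]
# `documents` activates the RAG system prompt; `controls` adds the optional originality annotation
prompt = tokenizer.apply_chat_template(
    chat,
    documents=documents,
    controls={"originality": "extractive"},
    tokenize=False,
    add_generation_prompt=True,
)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=512)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))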

Control the response length

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 20, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Give me a list of wildflowers from Colorado<|end_of_text|>
<|start_of_role|>assistant {"length": "short"}<|end_of_role|>1. Columbine (Aquilegia coerulea)
2. Indian Paintbrush (Castilleja spp.)
3. Larkspur (Delphinium spp.)
4. Purple Coneflower (Echinacea purpurea)
5. Rocky Mountain Penstemon (Penstemon strictus)

Important considerations:

  • This optional length annotation can be added to the assistant role to control the response’s length.
  • The annotation will only influence the immediate next generation of the assistant role.
  • The supported values are short and long. A short value guides the model to generate a shorter response, while a long value encourages a longer, more detailed response.

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
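
Continuing from the loader above, the remaining steps could look like the following sketch. The length annotation is passed through the controls kwarg described in the "Building prompts with transformers library" section; the generation settings are assumptions.

chat = [{"role": "user", "content": "Give me a list of wildflowers from Colorado"}]
# `controls` adds the optional length annotation to the assistant turn about to be generated
prompt = tokenizer.apply_chat_template(
    chat,
    controls={"length": "short"},
    tokenize=False,
    add_generation_prompt=True,
)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=256)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))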

Thinking capability

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.
Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>A farmer has 10 cows, 5 chickens, and 2 horses. If we count all the animals' legs together, how many legs are there in total?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Here is my thought process:
To solve this, we need to calculate the total number of legs for each type of animal and then sum them up.
- Cows have 4 legs each.

Important considerations:

  • To activate this feature, you must use the system prompt for reasoning tasks.

Use the following code example to reproduce the previous output.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

model_path = "ibm-granite/granite-3.2-8b-instruct"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
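
A sketch of the remaining steps, continuing from the snippet above. thinking=True activates the reasoning system prompt, as described in the "Building prompts with transformers library" section; the seed and generation settings are assumptions.

chat = [{
    "role": "user",
    "content": "A farmer has 10 cows, 5 chickens, and 2 horses. If we count all the animals' legs together, how many legs are there in total?",
}]
# `thinking=True` activates the reasoning system prompt shown above
prompt = tokenizer.apply_chat_template(chat, thinking=True, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
set_seed(42)  # assumed seed, for repeatable runs
output = model.generate(**input_tokens, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))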

Granite 3.2 examples

In this section, we share additional examples that demonstrate all the capabilities of Granite 3.2 models.

Basic inference examples

Summarization

This example demonstrates how to summarize an interview transcript.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>
<|start_of_role|>documents<|end_of_role|>Document 0
Speaker 1 (00:00):
Simone, congratulations.
Simone (00:02):
Thank you.

Text classification

This example demonstrates a classification task for movie reviews. The user query includes classification examples to improve the model’s response accuracy.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Classify the sentiment of the movie reviews as positive or negative.
Your response should only include the answer. Do not provide any further explanation.
Here are some examples, complete the last one:
Review:

Text extraction

This example demonstrates how to extract certain information from a set of documents with a similar structural pattern.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Extract the Line Of Credit Facility Maximum Borrowing Capacity from the 10K sentences.
Your response should only include the answer. Do not provide any further explanation.
Here are some examples, complete the last one:
10K Sentence:

Text translation

Granite 3.2 models also support tasks in multiple languages. This is a basic example of how to use the models to translate text from English to Spanish.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
Please translate the following list of greetings from English to Spanish.
Morning!, how are things?, hello, it's good to see you, what's up?
Please only output the translation, and nothing else.

Granite 3.2 models support text translation as well as other multilingual tasks in English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese (Simplified).

Advanced inference examples

Retrieval augmented generation (RAG) with annotations

The following example combines annotations to guide the model to generate a short, extractive response.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 24, 2025.
You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>
<|start_of_role|>documents<|end_of_role|>Document 0
Bridget Jones: The Edge of Reason (2004) - Bridget is currently living a happy life with her lawyer boyfriend Mark Darcy, however not only does she start to become threatened and jealous of Mark's new young intern, she is angered by the fact Mark is a Conservative voter. With so many issues already at hand, things get worse for Bridget as her ex-lover, Daniel Cleaver, re-enters her life; the only help she has are her friends and her reliable diary.,
Document 1
Bridget Jones's Baby (2016) - Bridget Jones is struggling with her current state of life, including her break up with her love Mark Darcy. As she pushes forward and works hard to find fulfilment in her life seems to do wonders until she meets a dashing and handsome American named Jack Quant. Things from then on go great, until she discovers that she is pregnant but the biggest twist of all, she does not know if Mark or Jack is the father of her child.

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
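
The loader above can be completed as follows. This is a sketch that reuses model, tokenizer, and device, and assumes the documents/controls kwargs and document format described elsewhere in this guide; the user question is not shown in the truncated prompt above, so the one used here is only illustrative.

documents = [
    {"text": "Bridget Jones: The Edge of Reason (2004) - Bridget is currently living a happy life with her lawyer boyfriend Mark Darcy, ..."},  # full text in the prompt above
    {"text": "Bridget Jones's Baby (2016) - Bridget Jones is struggling with her current state of life, including her break up with her love Mark Darcy. ..."},  # full text in the prompt above
]
# Illustrative question; the original example's question is not shown in the guide.
chat = [{"role": "user", "content": "What is Bridget Jones's Baby about?"}]
# Combine both annotations to request a short, extractive answer grounded in the documents
prompt = tokenizer.apply_chat_template(
    chat,
    documents=documents,
    controls={"length": "short", "originality": "extractive"},
    tokenize=False,
    add_generation_prompt=True,
)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=256)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))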

Length annotation

This example highlights the contrast between using the short and long values. In this multi-turn interaction, we ask the model the same prompt twice. The first response, annotated with {"length": "short"}, was stored in the dialogue history. Then, we prompt the model using the same instruction, but this time with the {"length": "long"} annotation, as sketched in the code after the prompt below.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Give me a list of wildflowers from Colorado<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>1. Columbine (Aquilegia coerulea)
2. Indian Paintbrush (Castilleja spp.)
3. Larkspur (Delphinium spp.)
4. Purple Coneflower (Echinacea purpurea)
5. Rocky Mountain Penstemon (Penstemon strictus)
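
The guide does not include a code listing for this example; the sketch below shows one way to reproduce the two-pass interaction with the controls kwarg, reusing the model, tokenizer, and device loaded in the earlier snippets (generation settings are assumptions).

question = "Give me a list of wildflowers from Colorado"
chat = [{"role": "user", "content": question}]

# First pass: request a short answer
prompt = tokenizer.apply_chat_template(chat, controls={"length": "short"}, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
short_answer = tokenizer.decode(
    model.generate(**input_tokens, max_new_tokens=256)[0, input_tokens["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

# Second pass: keep the short answer in the dialogue history and ask again for a long answer
chat += [{"role": "assistant", "content": short_answer}, {"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(chat, controls={"length": "long"}, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))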

Reasoning

This example demonstrates the application of Granite 3.2 models’ thinking capabilities to a logical fallacy.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.
Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>If all roses are flowers and some flowers fade quickly, does it mean all roses fade quickly? Why or why not?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Here is my thought process:
To answer this, we need to understand the logical structure of the statements given:
1. All roses are flowers. (Universal statement)

Function-calling

This example demonstrates Granite 3.2 models’ ability to perform tool calls. Consider the following points when building prompts for function-calling tasks:

  • Use the system prompt for function-calling tasks to obtain the best performance out of Granite 3.2 models.
  • Use tools turn type to provide the list of tools to the model.
  • Tools are provided to the model as a list of dictionaries, with each dictionary representing a tool. While the model has been trained to understand various dictionary structures for tools, we recommend building your prompts as shown in this example. In this approach, each tool is represented by a dictionary that includes the tool’s name, description, and required arguments.
  • Assistant responses that contain function calls should be prepended with a <tool_call> token.
  • Use tool_response turn type to feed back to the model the external system execution of a tool call.
  • Backward compatibility notice: this chat template is fully compatible with Granite 3.1 models. For 3.0 models, however, the turn type for the list of tools should be available_tools, and there is no specific token prepended to tool calls.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant with access to the following tools. When a tool is required to answer the user's query, respond with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.<|end_of_text|>
<|start_of_role|>tools<|end_of_role|>[
{
"name": "get_current_weather",
"description": "Get the current weather",
"arguments": {
"location": {

Use the following code example to nearly reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
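
Continuing from the loader above, a sketch of the remaining steps. The tool definition follows the structure shown in the prompt above; the details of the location argument are truncated there, so the fields used here are assumptions.

tools = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "arguments": {
            # The argument details are truncated in the prompt above; these fields are illustrative.
            "location": {"description": "The city and state, e.g. San Francisco, CA", "type": "string"},
        },
    },
]
chat = [{"role": "user", "content": "What's the current weather in New York?"}]
# `tools` activates the function-calling system prompt and renders the tools turn
prompt = tokenizer.apply_chat_template(chat, tools=tools, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=256)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))
# Expected shape of the response: <tool_call>[{"name": "get_current_weather", "arguments": {"location": "New York"}}]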

After obtaining a tool call from the assistant and executing it via an external system, you could feed the execution’s result back to the model by updating the chat history as follows:

chat = [
    {"role": "user", "content": "What's the current weather in New York?"},
    {"role": "assistant", "content": "<tool_call>[{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"New York\"}}]"},
    # This result is obtained by executing the tool call via an external system.
    {"role": "tool_response", "content": "{\"temp\": 20.5, \"unit\": \"C\"}"},
]
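
To obtain the final natural-language answer, the updated chat history can then be rendered and passed back to the model, for example as in the sketch below, which reuses the model, tokenizer, tools, and chat list from the snippets above.

prompt = tokenizer.apply_chat_template(chat, tools=tools, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=256)
# The model now answers in natural language using the tool_response content
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))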

Long-context inference

A longer context allows Large Language Models to handle more detailed and complex inputs. This capability, for instance, is crucial for tasks that require using long documents, many documents, or understanding extensive dialogues. The following example shows a generation task that benefits from long-context capability due to a long list of documents. Note that we have summarized the list of documents for demonstration purposes.

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: February 19, 2025.
You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>
<|start_of_role|>documents<|end_of_role|>Document 0
Bridget Jones is a binge drinking and chain smoking thirty-something British woman trying to keep her love life in order while also dealing with her job as a publisher. When she attends a Christmas party with her parents, they try to set her up with their neighbours' son, Mark. After being snubbed by Mark, she starts to fall for her boss Daniel, a handsome man who begins to send her suggestive e-mails that leads to a dinner date. Daniel reveals that he and Mark attended college together, in that time Mark had an affair with his fiancée. Bridget decides to get a new job as a TV presenter after finding Daniel being frisky with a colleague. At a dinner party, she runs into Mark who expresses his affection for her, Daniel claims he wants Bridget back, the two fight over her and Bridget must make a decision who she wants to be with.
Document 1
Bridget is currently living a happy life with her lawyer boyfriend Mark Darcy, however not only does she start to become threatened and jealous of Mark's new young intern, she is angered by the fact Mark is a Conservative voter. With so many issues already at hand, things get worse for Bridget as her ex-lover, Daniel Cleaver, re-enters her life; the only help she has are her friends and her reliable diary.,

Backward compatibility

Aside from the newly added advanced features, such as reasoning and response-length control, prompts previously designed for Granite 3.1 should be compatible with Granite 3.2 models. However, we recommend using this prompting guide to create new prompts for Granite 3.2 models that take full advantage of the new features these models bring on board.

Building prompts with transformers library

The apply_chat_template function from the transformers library automatically applies the basic chat template structure of Granite 3.2 models to your prompts. To build prompts that integrate the advanced features via apply_chat_template, you must use the appropriate kwargs:

  • Use documents to pass the list of documents for RAG prompts. This will automatically activate the RAG system prompt.
  • Use tools to pass the list of tools for function-calling tasks. This will automatically activate the function-calling system prompt.
  • Use thinking to make the model reason about its response. This will automatically activate the reasoning system prompt.
  • Use controls to pass a JSON object indicating which optional annotations should be activated. The respective system prompt extension (if needed) will also be activated automatically.

We strongly recommend that developers use the apply_chat_template function to construct their prompts, as it enhances the development experience and minimizes the risk of inference errors caused by manually crafting chat templates. However, you can also load a chat template that incorporates Granite 3.2’s advanced features without using this function.

In the following example, we load the chat template for RAG inference with annotations that moderate length and originality. The list of documents and the desired annotations have been previously rendered, and the final model prompt is saved in the input_text variable. You can take this example as a baseline for building wrappers around libraries that do not support these kwargs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
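
The rest of the original example is not shown above. As a sketch of what it could look like, the prompt below is written out by hand following the chat-template format documented in this guide; the document, user question, and annotation values are assumptions (the document is taken from the long-context example, abbreviated), so the exact output may differ from the one quoted below.

# Hand-rendered prompt following the chat template described in this guide.
doc0 = (
    "Bridget Jones is a binge drinking and chain smoking thirty-something British woman trying to keep "
    "her love life in order while also dealing with her job as a publisher. ..."  # abbreviated; full text in the long-context example
)
input_text = (
    "<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\n"
    "Today's Date: February 24, 2025.\n"
    "You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts "
    "in the provided documents. If the information needed to answer the question is not available in the documents, "
    "inform the user that the question cannot be answered based on the available data.<|end_of_text|>\n"
    f"<|start_of_role|>documents<|end_of_role|>Document 0\n{doc0}<|end_of_text|>\n"
    "<|start_of_role|>user<|end_of_role|>What is the first Bridget Jones movie about?<|end_of_text|>\n"  # assumed question
    '<|start_of_role|>assistant {"length": "short", "originality": "abstractive"}<|end_of_role|>'  # assumed annotation values
)
input_tokens = tokenizer(input_text, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=256)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))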

The execution of this script returns the assistant’s output below:

Bridget Jones's Diary (2001) follows Bridget Jones, a thirty-something British woman, as she navigates her love life and career while dealing with her vices like binge drinking and smoking. She becomes involved with her boss Daniel and Mark, who later reveals he had an affair with Daniel's fiancée. Bridget must choose between the two men.

Inference Tips

  • Verify the Max New Tokens setting. If you see a model response stopping mid-sentence, the max new tokens setting is likely set too low. This is particularly critical in long-context tasks.
  • Avoid pronouns in follow-up questions. For example, do not use “Can you edit it?”, instead use “Can you edit the table?”.
  • Reduce explanation length. If the explanations provided by the model are too long, update the instruction to make it clear there should be no additional explanations.
  • Fix run-on sentences/responses. If the model is generating run-on sentences/responses, try reducing the max tokens output parameter or adding line breaks or spaces as stop token(s). This should lead the model to stop generating the output after generating these tokens.
  • For prompts that include in-context examples, consider using the example labels as a stop token so the model stops generation after providing the answer (see the sketch below).
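
For instance, assuming a recent transformers release that supports the stop_strings option of generate (the tokenizer must also be passed in that case), the last tip could be applied to the movie-review classification example as in the sketch below, where input_tokens holds the classification prompt built as in the earlier snippets.

# Stop generation as soon as the model starts a new in-context example label ("Review:").
output = model.generate(
    **input_tokens,
    max_new_tokens=64,
    stop_strings=["Review:"],
    tokenizer=tokenizer,
)
print(tokenizer.decode(output[0, input_tokens["input_ids"].shape[1]:], skip_special_tokens=True))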