IBM Granite

Granite 3.3

Table of contents

  1. Overview
  2. Model cards
  3. Run locally with Ollama
  4. Granite 3.3 highlights
  5. Basic chat template example
  6. Granite 3.3 inference examples
    1. Basic examples
    2. Advanced examples
  7. Backward compatibility
  8. Building prompts with transformers library
  9. Inference Tips

Overview

Granite 3.3 models feature enhanced reasoning capabilities and support for Fill-in-the-Middle (FIM) code completion. They are built on a foundation of open-source instruction datasets with permissive licenses, alongside internally curated synthetic datasets tailored for long-context problem-solving. These models preserve the key strengths of previous Granite versions, including support for a 128K context length, strong performance in retrieval-augmented generation (RAG) and function calling, and controls for response length and originality. Granite 3.3 also delivers competitive results across general, enterprise, and safety benchmarks. Released as open source, the models are available under the Apache 2.0 license.

Model cards

Run locally with Ollama

Learn more about Granite 3.3 on Ollama.

Granite 3.3 highlights

Granite 3.3 models feature extended context length and deliver strong performance across tasks such as retrieval-augmented generation (RAG), function calling, and reasoning. They retain key capabilities from earlier versions—such as response length control and originality. Additionally, Granite 3.3 introduces fill-in-the-middle (FIM) support for code completion and improves the clarity of model reasoning by separating intermediate thoughts from final answers.

Basic chat template example

Below, we show a basic example of the Granite 3.3 chat template.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 16, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What is the largest ocean on Earth?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>The largest ocean on Earth is the Pacific Ocean. It covers an area of about 63,800,000 square miles (165,200,000 square kilometers).<|end_of_text|>

Use the following code example to reproduce this output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

# build the prompt with the chat template shown above and generate the answer
chat = [{"role": "user", "content": "What is the largest ocean on Earth?"}]
input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))

Granite 3.3 inference examples

Basic examples

In this section, we provide basic examples for a variety of inference tasks.

Summarization

This example demonstrates how to summarize an interview transcript.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Summarize a fragment of an interview transcript. In this interview, an NBC reporter interviews Simone Biles about her participation in the Paris 2024 Olympic Games.
Your response should only include the answer. Do not provide any further explanation.
Speaker 1 (00:00):
Simone, congratulations.
Simone (00:02):
Thank you.

Text classification

This example demonstrates a classification task for movie reviews. The user query includes classification examples to improve the model’s response accuracy.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Classify the sentiment of the movie reviews as positive or negative.
Your response should only include the answer. Do not provide any further explanation.
Here are some examples, complete the last one:
Review:
Oh, where do I even begin? Barbie 2023 is a tour de force that has left me utterly captivated, enchanted, and spellbound. Every moment of this cinematic marvel was nothing short of pure excellence, deserving nothing less than a perfect 10 out of 10 rating!
Sentiment:

Text extraction

This example demonstrates how to extract certain information from a set of documents with a similar structural pattern.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Extract the Line Of Credit Facility Maximum Borrowing Capacity from the 10K sentences.
Your response should only include the answer. Do not provide any further explanation.
Here are some examples, complete the last one:
10K Sentence:
The credit agreement also provides that up to $500 million in commitments may be used for letters of credit.
Line Of Credit Facility Maximum Borrowing Capacity:

Text translation

Granite 3.3 models also support tasks in multiple languages. This is a basic example of how to use the models to translate text from English to Spanish.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Please translate the following list of greetings from English to Spanish.
Morning!, how are things?, hello, it's good to see you, what's up?
Please only output the translation, and nothing else.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>¡Buenos días!, ¿cómo estás?, hola, bueno es verte, ¿qué tal?<|end_of_text|>

Granite 3.3 models support text translation as well as other multilingual tasks in English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese (Simplified).

Advanced examples

In this section, we demonstrate how to use the chat template to leverage the advanced features supported by Granite 3.3 models.

Control the response length

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Give me a list of wildflowers from Colorado<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Here are some wildflowers native to Colorado:
1. Indian Paintbrush (Castilleja spp.)
2. Blue Columbine (Aquilegia coerulea)
3. Prairie Coneshell (Clysia ambigula)
4. Rocky Mountain Iris (Iris missouriensis)

Important considerations:

  • The length annotation can be added to the assistant response to control the response’s length.
  • The annotation will only influence the immediate next generation of the assistant role.
  • The supported values are short and long. A short value guides the model to generate a concise, direct response, while a long value encourages a more elaborate, detailed response.

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
chat = [{"role": "user", "content": "Give me a list of wildflowers from Colorado"}]
# controls adds the length annotation to the assistant turn (see "Building prompts with transformers library")
input_ids = tokenizer.apply_chat_template(chat, controls={"length": "short"}, add_generation_prompt=True, return_tensors="pt").to(device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))

Thinking capability

This example demonstrates the application of Granite 3.3 models’ thinking capabilities to a math problem.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.
Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts between <think></think> and write your response between <response></response> for each user query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>A farmer has 10 cows, 5 chickens, and 2 horses. If we count all the animals' legs together, how many legs are there in total?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|><think>To solve this problem, we need to calculate the total number of legs for all the animals combined. We'll do this by considering each type of animal separately and then summing up the total number of legs.
1. **Cows**: Each cow has 4 legs. The farmer has 10 cows. Therefore, the total number of legs from cows is:
\[
10 \text{ cows} \times 4 \text{ legs/cow} = 40 \text{ legs}

The following example demonstrates the application of Granite 3.3 models’ thinking capabilities to a logical fallacy.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.
Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts between <think></think> and write your response between <response></response> for each user query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>If all roses are flowers and some flowers fade quickly, does it mean all roses fade quickly? Why or why not?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|><think>This question presents a classic example of a syllogism, a form of logical argument where a conclusion is drawn from two given or assumed propositions (premises). Here, the premises are:
1. All roses are flowers.
2. Some flowers fade quickly.

Important considerations:

  • To activate this feature you must use the system prompt for reasoning tasks.
  • Note that thoughts appear between the tags <think> and </think>, and the final response between <response> and </response>.
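
If you need to post-process these outputs, the sketch below separates the intermediate thoughts from the final answer. It is a minimal example that assumes the tags appear exactly as described above.

import re

def split_reasoning(output_text: str) -> tuple[str, str]:
    # extract the text between <think></think> and <response></response>
    thoughts = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    response = re.search(r"<response>(.*?)</response>", output_text, re.DOTALL)
    return (thoughts.group(1).strip() if thoughts else "",
            response.group(1).strip() if response else output_text.strip())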

Use the following code example to reproduce the outputs for both reasoning tasks. To reproduce the second task, simply update the user prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
model_path = "ibm-granite/granite-3.3-8b-instruct"
device = "cuda"
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
set_seed(42)
chat = [{"role": "user", "content": "A farmer has 10 cows, 5 chickens, and 2 horses. If we count all the animals' legs together, how many legs are there in total?"}]
# thinking=True activates the reasoning system prompt shown above
input_ids = tokenizer.apply_chat_template(chat, thinking=True, add_generation_prompt=True, return_tensors="pt").to(device)
output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))

Function-calling

This example demonstrates the Granite 3.3 models' ability to perform tool calls.

<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
Today's Date: April 12, 2025. You are Granite, developed by IBM. You are a helpful assistant with access to the following tools. When a tool is required to answer the user's query, respond only with <|tool_call|> followed by a JSON list of tools used. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.<|end_of_text|>
<|start_of_role|>available_tools<|end_of_role|>[
{
"name": "get_current_weather",
"description": "Get the current weather",
"arguments": {
"location": {
"description": "The city and state, e.g. San Francisco, CA"

Consider the following points to build prompts for function-calling tasks:

  • Use the system prompt for function-calling tasks to obtain the best performance out of Granite 3.3 models.
  • Use available_tools turn type to provide the list of tools to the model.
  • Tools are provided to the model as a list of dictionaries, with each dictionary representing a tool. While the model has been trained to understand various dictionary structures for tools, we recommend building your prompts as shown in this example. In this approach, each tool is represented by a dictionary that includes the tool’s name, description, and required arguments.
  • Assistant responses containing function calls should be prepended with the <|tool_call|> token.
  • Use tool_response turn type to feed back to the model the external system execution of a tool call.

Use the following code example to nearly reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tools = [{"name": "get_current_weather", "description": "Get the current weather",
          "arguments": {"location": {"description": "The city and state, e.g. San Francisco, CA"}}}]
chat = [{"role": "user", "content": "What's the current weather in New York?"}]
# tools adds the available_tools turn and the function-calling system prompt
input_ids = tokenizer.apply_chat_template(chat, tools=tools, add_generation_prompt=True, return_tensors="pt").to(device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))

After obtaining a tool call from the assistant and executing it via an external system, you could feed the execution’s result back to the model by updating the chat history as follows:

chat = [
    {"role": "user", "content": "What's the current weather in New York?"},
    {"role": "assistant", "content": "<|tool_call|>[{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"New York\"}}]"},
    {"role": "tool_response", "content": "{\"temp\": 20.5, \"unit\": \"C\"}"},  # result obtained by executing the tool call via an external system
]
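
Putting these pieces together, the sketch below parses the tool call from the assistant's text, executes it, and appends the result as a tool_response turn. The get_current_weather implementation and its return value are hypothetical stand-ins for your external system.

import json

def get_current_weather(location: str) -> dict:
    # hypothetical stand-in for the external system that actually executes the tool
    return {"temp": 20.5, "unit": "C"}

TOOLS = {"get_current_weather": get_current_weather}

chat = [{"role": "user", "content": "What's the current weather in New York?"}]
assistant_text = "<|tool_call|>[{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"New York\"}}]"
chat.append({"role": "assistant", "content": assistant_text})

# parse the tool call(s), execute each one, and append the results as tool_response turns
for call in json.loads(assistant_text.removeprefix("<|tool_call|>")):
    result = TOOLS[call["name"]](**call["arguments"])
    chat.append({"role": "tool_response", "content": json.dumps(result)})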

FIM for a coding task

This example demonstrates how to use Granite 3.3 models for a FIM code completion task.

<fim_prefix>def print_hello_world():<fim_suffix><fim_middle>
print("Hello, World!")

Consider the following points to build prompts for FIM coding tasks:

  • Place the tag <fim_prefix> before the code that precedes the missing part.
  • Place the tag <fim_suffix> before the code that follows the missing part.
  • End your prompt with the tag <fim_middle> to indicate to the model that something is missing in the code snippet.
  • Completion of basic programming concepts (e.g., function, method, conditionals, loops) is covered for various programming languages (e.g., python, c/c++, go, java).
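
As a small illustration of this layout, the helper below assembles a FIM prompt from the code before and after the gap. The snippet being completed is a hypothetical example.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # assemble a FIM prompt following the tag layout described above
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# hypothetical snippet with a missing loop body for the model to fill in
prompt = build_fim_prompt(
    prefix="def count_positive(nums):\n    total = 0\n    for n in nums:\n",
    suffix="\n    return total",
)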

Use the following code example to reproduce the previous output.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
# FIM prompts are plain text, so the chat template is not applied here
prompt = "<fim_prefix>def print_hello_world():<fim_suffix><fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Backward compatibility

Prompt templates designed for basic inference tasks should remain compatible with Granite 3.3 models. However, more complex templates may require a migration process. To take full advantage of the new features introduced in Granite 3.3, please refer to this prompt engineering guide for creating new templates or migrating existing ones.

Some points to consider while migrating templates are:

  • Tool/function calling
    • The turn type used to provide the list of tools varies by version. For Granite 3.3, we recommend using available_tools, but tools will also work to maintain compatibility with previous versions.
    • Additionally, the token that precedes a tool call in the model’s response is <|tool_call|> for Granite 3.3 models, and <tool_call> for Granite 3.2 and 3.1 models. Account for both variants when parsing tool call outputs from Granite models (see the sketch after this list).
  • RAG tasks: In Granite 3.3, each document must be provided in a separate turn. This differs from previous versions, where documents could be passed as a list within a single turn.
  • Reasoning: Granite 3.3 models display reasoning outputs with intermediate thoughts clearly separated from the final answer, a capability not supported in Granite 3.2.
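
The following sketch illustrates version-tolerant parsing of tool calls, assuming the assistant response starts with one of the two tokens mentioned above.

import json

def parse_tool_calls(assistant_text: str) -> list:
    # handle both the Granite 3.3 token and the Granite 3.1/3.2 token
    text = assistant_text.strip()
    for token in ("<|tool_call|>", "<tool_call>"):
        if text.startswith(token):
            return json.loads(text[len(token):])
    return []  # no tool call in this turn

calls = parse_tool_calls("<|tool_call|>[{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"New York\"}}]")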

Building prompts with transformers library

The apply_chat_template function from the transformers library automatically applies the basic chat template structure of Granite 3.3 models to your prompts. To build prompts that integrate advanced features via apply_chat_template, you must use the appropriate kwargs.

  • Use documents to pass the list of documents for RAG prompts. This will automatically activate the RAG system prompt.
  • Use tools to pass the list of tools for function-calling tasks. This will automatically activate the function-calling system prompt.
  • Use thinking to make the model reason about its response. This will automatically activate the reasoning system prompt.
  • Use controls to pass a JSON object indicating which optional annotations should be activated. The respective system prompt extension (if needed) will also be activated automatically (a short example follows this list).
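
For instance, a RAG prompt with a length annotation could be rendered roughly as follows. This is a minimal sketch: the document contents are illustrative, and the exact keys expected in each document dictionary (title and text here) are an assumption that may need adjusting to your version of the chat template.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
docs = [
    # illustrative document; replace with your retrieved passages
    {"title": "RAG overview", "text": "Retrieval-Augmented Generation (RAG) augments a model with retrieved context."},
]
chat = [{"role": "user", "content": "What does RAG stand for?"}]
prompt = tokenizer.apply_chat_template(
    chat,
    documents=docs,                # activates the RAG system prompt
    controls={"length": "short"},  # optional annotation (see the controls bullet above)
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)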

We strongly recommend that developers use the apply_chat_template function to construct their prompts, as it enhances the development experience and minimizes the risk of inference errors caused by manually crafting chat templates. However, you can also load a chat template that incorporates Granite 3.3’s advanced features without using this function.

In the following example, we load the template for RAG inference with annotations to moderate length and originality. The list of documents and the desired annotations have been previously rendered, and the final model prompt is saved in the input_text variable. You can take this example as a baseline to build wrappers for libraries that do not support kwargs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_path = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# input_text holds the fully rendered prompt (system prompt, document turns, user query,
# and the length/originality annotations), prepared beforehand as described above
inputs = tokenizer(input_text, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The execution of this script returns the assistant’s output below:

RAG stands for Retrieval-Augmented Generation. It's a method that enhances generative AI models by providing them with information retrieval skills, allowing them to access and utilize external data or context during the generation process.

Inference Tips

  • Verify the Max New Tokens setting. If you see a model response stopping mid-sentence, the max new tokens setting is likely set too low. This is particularly critical in long-context tasks.
  • Avoid pronouns in follow-up questions. For example, do not use “Can you edit it?”, instead use “Can you edit the table?”.
  • Reduce explanation length. If the explanations provided by the model are too long, update the instruction to make it clear there should be no additional explanations.
  • Fix run-on sentences/responses. If the model generates run-on sentences or responses, try reducing the max new tokens parameter or adding line breaks or spaces as stop token(s), so the model stops generating once it produces those tokens.
  • For prompts that include in-context examples, consider using the example labels as a stop token so the model stops generation after providing the answer.
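
For the last two tips, one way to add stop sequences with the transformers library is the stop_strings argument of generate, available in recent transformers versions. A minimal sketch, reusing the model, tokenizer, and input_ids from the earlier examples and with illustrative stop strings:

# stop generation at a blank line or at the next in-context example label
output = model.generate(
    input_ids,
    max_new_tokens=100,
    stop_strings=["\n\n", "Review:"],  # adjust to match your prompt's labels
    tokenizer=tokenizer,               # required when stop_strings is used
)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))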