Documentation Index
Fetch the complete documentation index at: https://wwwpoc.ibm.com/llms.txt
Use this file to discover all available pages before exploring further.
Granite 4.0 highlights
- Simplified and standardized chat template design for general, tool-use, and RAG tasks
- Enhanced tool-calling, RAG, and FIM capabilities.
- Native support for JSON outputs.
Chat template
To maximize the performance of Granite 4.0 instruct models, we recommend constructing prompts using libraries such astransformers, which automatically apply the models’ official chat template. If users choose to author prompts directly in Jinja, they should carefully adhere to the template’s design conventions.
Chat template design
The elements of this chat template serve the following purpose:-
<|start_of_role|>,<|end_of_role|>, and<|end_of_text|>are the tags for role and prompt control.<|start_of_role|>indicates the start of a role.<|end_of_role|>indicates the end of the role.<|end_of_text|>indicates the end of a single message block. Each message block should be terminated with<|end_of_text|>to ensure proper message segmentation.
-
user,assistant, andsystemare the roles the chat template supports.-
userrole identifies queries from the user or an external agent. -
assistantrole identifies model generations. -
systemrole identifies system messages. This role, in particular, encompasses several design decisions, which we discuss in detail below.- Granite 4.0 models chat template do specify a default system prompt; however, users may define a different one. This is an example of a user-defined system prompt:
- Granite 4.0 models chat template also supports multiple system turns within a single conversation. The chat template will automatically format them, as demonstrated in the following example.
- Granite 4.0 chat template automatically includes the list of tools as part of the system prompt when tools are provided.
- Granite 4.0 chat template automatically includes the list of documents as part of the system prompt when documents are provided.
- Granite 4.0 chat template also supports combining tool-use and RAG capabilities through an appropriately designed system prompt. Note that order matters, the chat template lists tools first followed by documents. Moreover, if a user-defined system prompt is provided, it will appear first, followed by the tools section and then the documents section.
-
-
About tool calling
- Granite 4.0 chat template automatically lists tools between
<tools>and</tools>tags as part of thesystemmessage when a list of tools is provided. - Granite 4.0 chat template automatically returns tool calls between
<tool_call>and</tool_call>tags within the assistant turn. Example:
- Moreover, Granite 4.0 chat template converts
toolrole content into auserrole in which tool responses appear between<tool_response>and</tool_response>tags. Example:
- Granite 4.0 chat template automatically lists tools between
-
About RAG
- The chat template lists documents as part of the
systemturn between<documents>and</documents>tags.
- The chat template lists documents as part of the
-
About FIM
- The tags supported for fill-in-the-middle (FIM) code completion are:
<|fim_prefix|>,<|fim_middle|>, and<|fim_suffix|>. Make sure to use the correct FIM tags when using Granite 4.0 models for FIM code completions.
- The tags supported for fill-in-the-middle (FIM) code completion are:
- Granite 4.0 models chat template uses reverse iteration over messages to identify the last user query index.
- All outputs should be terminated with
<|end_of_text|>to clearly segment messages.
Basic chat template example
Below, we show a basic example of Granite 4.0 models chat template.Granite 4.0 inference examples
Basic examples
In this section, we provide basic examples for a variety of inference tasks.Summarization
This example demonstrates how to summarize an interview transcript.Text classification
This example demonstrates a classification task for movie reviews. The user query includes classification examples to improve the model’s response accuracy.Text extraction
This example demonstrates how to extract certain information from a set of documents with a similar structural pattern.Text translation
Granite 4.0 models also support tasks in multiple languages. This is a basic example of how to use the models to translate text from English to Spanish.Temporal reasoning
This is an example of a basic temporal reasoning question.Logical reasoning
This is an example of a basic logical reasoning question.Building prompts with transformers library
Theapply_chat_template function from the transformers library automatically applies the basic chat template structure of Granite 4.0 models to your prompts. To build prompts that integrate advanced features via the apply_chat_template you must use the appropriate kwargs.
- Use
toolsto provide the list of tools in tool-calling tasks. This automatically activates the tool-calling system prompt and formats the tool list. - Use
documentsto provide the list of documents for RAG prompts. This automatically activates the RAG system prompt and formats the document list.
apply_chat_template function to construct their prompts, as it enhances the development experience and minimizes the risk of inference errors caused by manually crafting chat templates in Jinja. However, you can also load a chat template that incorporates Granite 4.0’s advanced features without using the apply_chat_template function.
In the following example, we load the template for RAG inference. Note that the final model prompt is saved in input_text variable. You can take this example as baseline to build wrappers for libraries that do no support kwargs.
Inference Tips
- Verify the Max New Tokens setting. If you see a model response stopping mid-sentence, the max new tokens setting is likely set too low. This is particularly critical in long-context tasks.
- Avoid pronouns in follow-up questions. For example, do not use “Can you edit it?”, instead use “Can you edit the table?”.
- Reduce explanation length. If the explanations provided by the model are too long, update the instruction to make it clear there should be no additional explanations.
- Fix run-on sentences/responses. If the model is generating run-on sentences/responses, try reducing the max tokens output parameter or adding line breaks or spaces as stop token(s). This should lead the model to stop generating the output after generating these tokens.
- For prompts that include in-context examples, consider using the example labels as a stop token so the model stops generation after providing the answer.