What is Granite Guardian?

Author

Lead AI Advocate

The excitement around large language models (LLMs) in many industries stems from their anticipated advantages such as the ability to automate activities, advanced natural language processing and broader generative capabilities. However, adopting LLMs into practice comes with various technical challenges, including identifying and managing the risks involved such as bias, hallucinations, potential data leakage, computational costs and the need for strong interpretability and explainability.¹

This explainer aims to bridges the gap between the high hopes like factual accuracy, zero hallucinations, task completion rates and the practical challenges of leveraging LLMs in real-world applications.

Three frontiers of risk definition

The three risk frontiers (interaction, intrinsic and systemic) serve as a framework of organization for understanding and responding to the multifaceted risks presented by large language models.²

1. Interaction risks concern the human-model interface that include:

i.   Prompt injection:  An opportunity for a user to manipulate a model’s behavior by inserting  external instructions into its input.
ii.  Groundedness: When a model generates output that is factually incorrect, logically inconsistent or unrelated to the input prompt.

2. Intrinsic risks arise from the models themselves and include risks such as:

i.  Hallucinations: Generated content that is completely fabricated by the models
ii. Bias: Language outcomes that are systemically influenced or privacy leaks (when identifying personal data is provided in the model’s outputs).

3. Systemic risks that occur beyond the model. These risks include:

i. Misinformation: LLMs’ fabricated content can spread misinformation, influencing public opinion and decision-making.
ii. Data Privacy Concerns (PII): LLMs trained on large datasets could leak sensitive personal data through outputs or user queries, creating privacy concerns.³

Recognizing these risk frontiers leads to the ability to develop mitigation strategies that can be designed to modestly target the risk concerns in question. This approach will lead to responsible and beneficial deployment experiments of LLMs. Doing so will improve the reliability or security of extended AI solutions as well as ensure that those solutions are in accordance with societal values or expectations.

LLM risks multiply

Risks in large language models do not happen in a vacuum; they cascade.

Bias in training data is the beginning. It can lead to unfair decisions that can, in part, result in discriminatory hiring practices or biased lending. Depending on the scenario, such behaviors can draw the attention of regulatory bodies, which in turn can lead to significant financial and reputational loss.
Hallucination risks where generated text is fluent but incorrect outputs stem from shaky grounding. For example, a user query asks, what is the capital of Australia. The model confidently responds that Sydney is the capital of Australia, which is false. At scale, the hallucinations create misinformation and when maliciously exploited, disinformation depletes trust in AI-supported tools and systems.
Prompt injection enables attackers to bypass developer instructions, manipulate the system’s data access and compromise its security controls.
Finally, human factors increase these dangers. When users over rely on AI outputs, automation bias can become a sociotechnical problem, that is, issues can arise from the interaction between people and technology. Also, systems can fail in critical decision-making processes in various industries including healthcare, finance and the law.

Risk detection as a strategy

Risk detection should be considered a strategy with system-level applicability and not a reactive patch applied once the system has been deployed. Effective risk management in LLMs requires integrating risk-awareness into the design philosophical principles, training and operational workflows of the model.

Strategize risk detection with Granite Guardian

The IBM Granite Guardian model provides a risk-aware framework for deploying LLMs, placing risk detection directly in the lifecycle of the model as opposed to a subsequent consideration. Through fine-tuned instruct-level models and structured annotations, it ensures that outcomes are grounded and contextually relevant while using measures to reduce hallucinations, social bias, unethical behavior or inappropriate content. Optimized to minimize latency and model size, it works with open-source models to support efficient, secure and scalable AI operations.⁴

- Risk assessment: Conducts comprehensive risk assessments on both input prompts and generated responses, identifying potential dangers such as unsafe content, biased language or factual inaccuracies.

- Risk quantification and confidence expression: Once the risk is identified,  the model assigns a risk score and a confidence level. The score indicates the seriousness or likelihood of identified risks, while the confidence level reflects the certainty of the model’s predictions.⁵^, ⁶

Lets look at a simple use case to understand this strategy:

Scenario:
You created a travel planner that assists users in arranging and planning their travel itineraries. It is designed to provide tailored recommendations, information and advice based on user questions that are framed as prompts.

Guardian in action

User request: User submits a request for travel as a prompt.
Request screening: Granite Guardian screens the request or prompt for sensitive information, unsafe requests and harmful content.
Planner’s response: The travel planner creates recommendations and itineraries.
Output review: Granite Guardian reviews for privacy leaks, accuracy or biased suggestions, context relevance.
Compliance and safety: Ensures that the response adheres to privacy, safety and ethical guidelines.

Stepwise guide on how to implement Granite Guardian in real application: Use case

Diagram illustrating a risk detection workflow for large language models

For both, prompt and the model’s response, the Guardian model identifies the type of potential risks, and assign a risk score that indicates severity.

Interpreting the risk score

Low risk score range (0–0.3): This score shows that the prompt or the model response is safe, appropriate and compliant.
Medium risk score between (0.4–0.7): Content might include nuanced or sensitive material and need to be reviewed before approval.
High risk score (0.8–1.0): Content presents a high impact risk when the risk score is in the range of (0.8–1.0) and should be flagged or restricted.⁷^, ⁸

Conclusion

IBM Granite Guardian offers a notable step toward developing trustworthy and robust AI systems. Intended for enterprise-based and AI agent-oriented use case development, Granite Guardian is built to assist AI applications, retrieval-augmented generation (RAG) workflows with intelligence, safety and compliance in the model pipeline.

Harnessing synthetic data, transformers and new chat templates, Granite Guardian raises the bar for AI model assessment and governance. It fills the gap between alacrity and accountability by ensuring the text produced for user input provides a model that is not simply intelligent, but also ethical and transparent for AI enterprise applications.

This comprehensive framework includes red-teaming, risk frameworks, adaptable guardrails and quality benchmarks. These benchmarks produce reliable identification of unethical or inappropriate behaviors and attempts to jailbreak, while also monitoring for sensitive or sexual content. This helps ensure that even before uses with AI technologies are deployed, the applications themselves are safe, secure and independently approved as reliable.

To summarize, Granite Guardian operationalizes responsible AI, from ethos to execution and provides a foundation for the future of safe AI.

References

1.  Dong, X., Chen, Y., Zhang, X., Li, X., & Zhou, Y. (2024). Safeguarding large language models: A survey. arXiv preprint arXiv:2406.02622. https://arxiv.org/abs/2406.02622

2 .Weidinger, L., et.al. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359. https://arxiv.org/abs/2112.04359

3. Bagehorn, F., Brimijoin, K., Daly, E. M et.al. (2025). AI Risk Atlas: Taxonomy and Tooling for Navigating AI Risks and Resources. arXiv. https://arxiv.org/abs/2503.05780

4. Padhi, I., Nagireddy, M., et.al, (2024, December 16). Granite Guardian 3.0 (arXiv:2412.07724v2 [cs.CL]). arXiv, https://arxiv.org/abs/2412.07724 

5.  IBM Granite Guardian (Repository: ibmgranite/graniteguardian/). IBM Research. (Updated June 25, 2025). Python API, Apache 2.0 license. 

6.   https://github.com/ibm-granite/granite-guardian/tree/main/cookbooks 

7. https://huggingface.co/collections/ibm-granite/granite-guardian-models-66db06b1202a56cf7b079562

8. IBM watsonx.governance (2025), Govern AI models for trust and transparency. Retrieved November 10, 2025, from https://www.ibm.com/products/watsonx-governance.

Start realizing ROI: A practical guide to agentic AI

Learn how to scale agentic AI for measurable ROI across your enterprise. This playbook outlines the top barriers that limit impact, how to effectively measure ROI and a practical framework to drive successful, enterprise-wide adoption.