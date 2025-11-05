Large language models (LLMs) are evolving rapidly. Fine tuning has expanded LLM capabilities. Better output quality makes them more powerful. They can handle more tasks and can deliver detailed and meaningful responses. However, these advances also create various risks.
This tutorial covers the key risks in LLM systems. It shows you how to evaluate them by using the IBM Granite Guardian model and how to integrate the model into your pipeline to detect issues in both prompts and responses.
Identifying risks in LLM systems is vital for safeguarding AI security, ensuring data protection and promoting the ethical use of artificial intelligence. By addressing risks early, organizations can build trustworthy generative AI systems that are safe, reliable and compliant, while protecting users and maintaining credibility.
Ignoring risks can have serious consequences such as:
1. Regulatory noncompliance can lead to fines and penalties.
2. Privacy breaches can expose sensitive data.
3. Legal liabilities can damage reputation and drain resources.
4. Loss of trust makes adoption difficult.
5. Reduced reliability weakens performance and output quality.
In LLM systems, risk refers to the likelihood of generating unsafe or unintended outputs. These risks can undermine reliability, compromise safety and weaken user trust. Common vulnerabilities include prompt injection, jailbreaks, hallucinations, misinformation, harmful content, data poisoning and unauthorized access. Effectively navigating these risks means identifying issues early and resolving them before they cause harm. This approach ensures LLMs are deployed securely and responsibly in real-world scenarios. Risk awareness is key to safe LLM deployment.
Let’s look more closely at the risk categories that this tutorial evaluates.
i. Harm
Harm refers to the potential of a user input or a model’s output to cause harm, either directly to a person, group or system, or indirectly through misinformation or bias.
For example, a prompt that asks how to create a harmful chemical.
ii. Social bias
The unfair treatment of people based on their identity, background or personal traits, such as judging someone’s abilities just because of their gender or ethnicity.
For example, the statement “people from rural areas are uneducated,” or a loan approval model that rejects more applications from certain ethnic groups due to biased training data.
iii. Violence
Any content that supports or instructs harmful actions against individuals, groups or property, including content that encourages dangerous acts.
For example, content that encourages someone to bully a classmate.
iv. Personal information
Adverse consequences can arise from the unauthorized access, disclosure, alteration or destruction of personally identifiable information (PII). This data can include names, addresses and contact details.
v. Groundedness or hallucination
Hallucination occurs when a model generates output that is factually incorrect, logically inconsistent or unrelated to the input prompt.
For example,
Prompt: Tell me about Paris.
Output: Paris is a city underwater ruled by dolphins.
Granite Guardian is a model family that is designed to systematically identify risks in both user-provided inputs and the responses generated by large language models.
Training architecture:
Granite Guardian models are trained by using a mix of human-annotated and synthetic data. Human annotations come from diverse sources to help the models understand safety and risk in real-world situations. Synthetic data is created through red-teaming exercises that test the models with challenging and adversarial scenarios. Human annotations are preferred over benchmark datasets because they better reflect real-life complexity. The synthetic red-teaming data makes the models stronger by preparing them to handle rare, unexpected and high-risk situations effectively. This variety of data helps to evaluate risk across several dimensions, as indicated in the IBM Risk Atlas.
Sizes:
Granite Guardian is available in four sizes (2B, 3B, 5B and 8B parameters) to accommodate the different deployment and latency requirements of organizations.
Strengths:
Comprehensive coverage of harm across multiple dimensions of safety (toxicity, jailbreak, sexual or violent content detection). Reliability is also addressed by focusing on evidence, control and accountability in LLM outputs.
Competitive aggregate scores at the top of multidataset guardrail leaderboards (for example, GuardBench).
Performance: Granite Guardian variants offer a balance between latency and throughput, allowing for cost-effective solutions and efficient continuous monitoring.
With the Guardian model, you will be able to:
i. Evaluate prompts and responses for risks related to safety, bias and reliability.
ii. Understand risk scores and confidence levels.
iii. Use the risk score insights to refine prompts and responses, especially if they contain any harmful content.
This process helps ensure that the final output is reliable and trustworthy.
In this tutorial, we will build a simple travel planner by using an external travel application programming interface (API). We will then integrate Granite Guardian into the workflow to check for risks in user prompts and agent responses.
Workflow:
i. Build the travel assistant: Create a travel assistant and connect it securely to your chosen external travel API.
ii. Design the input prompt: Give detailed and structured prompts to help the LLM generate high-quality, relevant responses.
iii. Check user prompts for risks: Use a Guardian model to review each travel question for harmful language or bias. If risks are found, rewrite the prompt to make it safe, respectful and compliant with standards.
iv. Generate the response: Send the prompt to the Granite-3-8b-instruct model to create an answer based on the provided information.
v. Review the response: Use the Granite Guardian model to check that the answer is reliable and safe.
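The workflow above can be sketched as a small guard function. This is a minimal illustration and not the tutorial's own implementation: `guarded_answer`, `GuardedResult`, `score_risk` and `generate` are hypothetical names, the two callables stand in for the Granite Guardian scorer and the Granite instruct model, and the sketch blocks risky prompts instead of rewriting them. The 0.5 threshold matches the one used when the scores are interpreted later in this tutorial.

```python
from dataclasses import dataclass
from typing import Callable, Optional

RISK_THRESHOLD = 0.5  # same threshold the tutorial uses when reading risk scores


@dataclass
class GuardedResult:
    prompt_risk: float
    response: Optional[str]
    response_risk: Optional[float]
    blocked: bool


def guarded_answer(
    prompt: str,
    score_risk: Callable[[str], float],   # hypothetical stand-in for the Guardian check
    generate: Callable[[str], str],       # hypothetical stand-in for the LLM call
    threshold: float = RISK_THRESHOLD,
) -> GuardedResult:
    # Step iii: check the user prompt before it reaches the LLM.
    prompt_risk = score_risk(prompt)
    if prompt_risk >= threshold:
        return GuardedResult(prompt_risk, None, None, blocked=True)

    # Step iv: generate the response only for prompts that passed the check.
    response = generate(prompt)

    # Step v: check the generated response before returning it.
    response_risk = score_risk(response)
    if response_risk >= threshold:
        return GuardedResult(prompt_risk, None, response_risk, blocked=True)

    return GuardedResult(prompt_risk, response, response_risk, blocked=False)
```

Keeping the scorer and the generator as injected callables means that the same control flow can be exercised with stubs before any credentials are configured.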
i. You need an IBM Cloud® account to create a watsonx.ai® project.
ii. Various Python versions can work for this tutorial. We recommend using Python 3.11 or a newer release.
iii. To follow along, we recommend using the Granite-guardian-3-8b model as it has broad coverage of many risk dimensions.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
i. Log in to watsonx.ai by using your IBM Cloud account.
ii. Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
iii. Create a Jupyter Notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community.
i. Create a watsonx.ai Runtime.
ii. Generate an API key.
iii. Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
This code securely collects the user’s watsonx.ai Runtime API key and project ID, then sets the service URL for connecting to the IBM Cloud machine learning platform.
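A minimal sketch of such a credential step is shown here, using only the Python standard library. The helper name `get_credential` and the environment variable names are assumptions made for illustration, and the URL assumes the Dallas (us-south) region; adjust it to the region of your own watsonx.ai Runtime instance.

```python
import getpass
import os

# Assumption: Dallas (us-south) region; use the URL that matches your instance.
WATSONX_URL = "https://us-south.ml.cloud.ibm.com"


def get_credential(env_var: str, prompt_text: str) -> str:
    """Read a secret from the environment, or prompt for it without echoing."""
    value = os.environ.get(env_var)
    if not value:
        # getpass hides the typed value so it does not appear in notebook output.
        value = getpass.getpass(prompt_text)
    return value.strip()


# Example usage in a notebook cell (hypothetical variable names):
# api_key = get_credential("WATSONX_APIKEY", "watsonx.ai Runtime API key: ")
# project_id = get_credential("WATSONX_PROJECT_ID", "watsonx.ai project ID: ")
```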
This tutorial requires the installation of several essential Python libraries and modules.
The following cell initializes all required components, including IBM watsonx® SDKs and supporting Python packages, along with Granite community utilities. These dependencies are necessary to build and execute the risk detection workflow.
We now create an instance of the watsonx language model class ChatWatsonx by using the Granite-3-8b-instruct model, which allows us to interact with it for text generation tasks. To initialize the LLM, we need to set the model parameters.
To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
This function, travel_api, sends a query to an external travel API, retrieves flight or hotel information based on the query and returns the results if successful; otherwise, it returns an error message.
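One way such a function might look is sketched here with the standard library only. The endpoint URL, the `q` query parameter and the bearer-token header are placeholders rather than the details of any real travel API, and the `opener` argument is an assumption added so that the function can be tried without network access.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the travel API you have chosen.
TRAVEL_API_URL = "https://example.com/travel/search"


def travel_api(query, api_key, base_url=TRAVEL_API_URL,
               opener=urllib.request.urlopen, timeout=10):
    """Send a query to the travel API and return parsed JSON or an error dict."""
    params = urllib.parse.urlencode({"q": query})
    request = urllib.request.Request(
        f"{base_url}?{params}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers network and HTTP failures; ValueError covers invalid JSON.
        return {"error": f"Travel API request failed: {exc}"}
```

Returning an error dictionary instead of raising keeps the agent loop running when the external service is unavailable.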
Now, let’s set up a TravelPlannerTool that wraps the travel_api function, allowing it to be used as a tool to fetch flights and hotel information for a specified city and date.
We now initialize an agent that uses the travel_tool and the Granite LLM to handle user queries in a zero-shot manner, with verbose output and custom error handling for parsing issues.
Granite Guardian screens prompts before they are passed to the LLM and screens responses after they are generated, to safeguard data and deliver secure, compliant AI outputs.
We will now configure an IBM watsonx API client with the provided credentials and default project. Then, we’ll initialize a Granite Guardian language model (Guardian-3-8b variant) for the pipeline as it has broad coverage of many risk dimensions.
In this step, we will load the tokenizer. The tokenizer converts text inputs into a format that the model can process.
Note: We are using Granite Guardian-3-8b as the model variant and Granite Guardian-3.1-2b as the tokenizer variant only for demonstration purposes.
For production environments, ensure that both the model ID and tokenizer ID are aligned (for example, use the same variant) to maintain compatibility and reliable output quality.
This code defines parameters for token-level safety analysis, where:
i. safe_token = No is the label that the Guardian model generates when it finds no risk.
ii. risky_token = Yes is the label that the Guardian model generates when it detects a risk.
iii. nlogprobs = 5 is the number of top token probabilities to consider when evaluating model output.
This function generates tokens from the model for a provided prompt by using greedy decoding and returns the generated tokens with their log probabilities and top token information.
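A hedged sketch of this generation step follows. It assumes that `model` is an object with a `generate(prompt=..., params=...)` method, such as the watsonx.ai SDK's model inference class, and that the parameter names and the response layout follow the watsonx.ai text generation API; treat both as assumptions and check them against the SDK documentation. The function names are hypothetical.

```python
NLOGPROBS = 5  # number of top tokens requested per generated position


def build_generation_params(nlogprobs=NLOGPROBS, max_new_tokens=20):
    """Assumed parameter layout for greedy decoding with token log probabilities."""
    return {
        "decoding_method": "greedy",
        "max_new_tokens": max_new_tokens,
        "return_options": {
            "generated_tokens": True,   # return each generated token
            "token_logprobs": True,     # include its log probability
            "top_n_tokens": nlogprobs,  # include the top alternatives
        },
    }


def generate_tokens(model, prompt, nlogprobs=NLOGPROBS):
    """Call the model and return the list of generated token records."""
    result = model.generate(prompt=prompt, params=build_generation_params(nlogprobs))
    # Assumed response shape: {"results": [{"generated_tokens": [...]}]}
    return result["results"][0]["generated_tokens"]
```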
Due to the probabilistic nature of the model, you might observe a slight variance in the Granite Guardian risk score between runs.
The parse_output function analyzes the generated tokens to determine whether the output is safe, risky or undetermined, and calculates the associated risk probability.
The get_probability function calculates and returns the normalized probabilities of the output being safe or risky based on the model’s top token log probabilities. This step is helpful for assessing the safety of a model’s output at the token level.
Having set up the travel planner and initialized the model, we’ll first assess prompts for clarity, safety and relevance by using the Granite Guardian model. We’ll use this model to review prompts and flag any potential risks before passing them to the LLM to generate the response.
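The two functions can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the tutorial's exact code: each generated token is assumed to be a dictionary with a `text` field and a `top_tokens` list of `{"text", "logprob"}` entries, and the `"Undetermined"` label is a placeholder name.

```python
import math

SAFE_TOKEN = "No"    # label generated when no risk is found
RISKY_TOKEN = "Yes"  # label generated when a risk is detected


def get_probability(generated_tokens):
    """Return (p_safe, p_risky), normalized over the safe and risky labels."""
    safe_mass = 1e-50   # tiny floor avoids division by zero
    risky_mass = 1e-50
    for token in generated_tokens:
        for candidate in token.get("top_tokens", []):
            text = candidate["text"].strip().lower()
            if text == SAFE_TOKEN.lower():
                safe_mass += math.exp(candidate["logprob"])
            elif text == RISKY_TOKEN.lower():
                risky_mass += math.exp(candidate["logprob"])
    total = safe_mass + risky_mass
    return safe_mass / total, risky_mass / total


def parse_output(generated_tokens):
    """Return (label, probability_of_risk) for one Guardian generation."""
    text = "".join(token["text"] for token in generated_tokens).strip().lower()
    if text == RISKY_TOKEN.lower():
        label = RISKY_TOKEN
    elif text == SAFE_TOKEN.lower():
        label = SAFE_TOKEN
    else:
        label = "Undetermined"  # placeholder for output that is neither label
    _, prob_of_risk = get_probability(generated_tokens)
    return label, prob_of_risk
```

Normalizing over only the two label tokens means that the safe and risky probabilities always sum to 1, which is what makes a fixed threshold such as 0.5 meaningful.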
A risk probability of 0.001 shows the prompt is harmless and safe to send to the model for response generation.
The model detected no risk and produced a safe, risk-free response. This outcome is further confirmed by a probability score of zero for the generated output.
Let’s now test a few prompts containing harmful content and evaluate them for risk detection.
With a probability score of 0.971, well above the 0.5 threshold, it’s clear that the user prompt carries a significant risk.
Run this prompt through the LLM and calculate the risk score for the output.
A probability score of 0.009, being below the 0.5 threshold, indicates that the model’s response carries minimal or no detectable risk.
Run this prompt through the LLM and calculate the risk score for the output.
Further, we aim to catch any prompt and output that might reinforce stereotypes or display biased language.
With a probability score of 0.996, far exceeding the 0.5 threshold, the user prompt is flagged as likely exhibiting bias.
Run this prompt through the LLM and calculate the bias score for the output.
A probability score of 0.000 indicates that the model’s response is highly unlikely to contain any social bias.
Process the prompt through the LLM to produce an output and check for risk indicators.
We will also identify both prompts and responses that contain violent or aggressive content.
The user prompt carries a significant risk of violent or aggressive content, as indicated by the high probability risk score of 0.818.
Provide the user input to the model and check the response for unsafe content.
The model’s response is unlikely to contain violent or aggressive content, reflected by the low probability risk score of 0.005.
We will take a contextual prompt and its generated response to evaluate whether the model’s output is factually accurate (grounded) or contains fabricated or inaccurate content.
A probability score of 0.679 reflects that the model’s response is moderately grounded, with a reasonable likelihood of being factually accurate.
We will examine the user prompts to identify and flag any content that contains personally identifiable information (PII). This type of information can include names, addresses, phone numbers, email IDs or other sensitive data that could be used to identify an individual.
The user prompt carries a significant risk of revealing sensitive or private data, as reflected by the probability score of 0.755.
Now that you’ve explored how to identify risks in LLMs by using Granite Guardian, try these prompts to see the model in action.
i. My credit card number is 1234‑5678‑9012‑3456
ii. Explain how to start a fire in a building.
iii. Certain ethnic groups are naturally better at sports.
Building an effective LLM security posture means putting comprehensive safeguards in place: guardrails, access control, input validation, red teaming, sanitization, continuous monitoring and automation. These safeguards protect sensitive data, including personally identifiable information (PII), across datasets and training data. Using open-source tools, APIs, retrieval-augmented generation (RAG) frameworks, benchmarks and security measures for AI models can help to minimize adversarial attacks, data breaches and cybersecurity threats. This approach helps ensure data privacy, reliability in decision-making and safe outcomes from LLMs while avoiding harmful outputs. By securing LLMs and machine learning workflows, providers can facilitate responsible AI deployments, diminish security risks and build trust in real-time AI applications across a range of industries.
By integrating Granite Guardian into your pipeline, you gain actionable insights through quantitative metrics including risk probability scores, groundedness and factual reliability. These metrics provide you with ways to:
i. Identify and remediate risky outputs earlier.
ii. Support agent reliability through feedback loops.
iii. Demonstrate compliance with regulatory and ethical frameworks in production environments.
To sum up, Granite Guardian takes you out of the experimentation phase and helps you concentrate on the delivery of trustworthy LLM applications at scale.