IBM’s answer to governing AI Agents: Automation and Evaluation with watsonx.governance

4 March 2025

Authors

Heather Gentile

Director of watsonx.governance Product Management

IBM Data and AI Software

Jordan Byrd

Senior Product Marketing Manager, watsonx.governance

IBM

Manish Bhide

Distinguished Engineer and CTO, watsonx.governance

IBM

Agentic AI is transforming IT landscapes globally, but most organizations still face uncertainty over how to use AI agents safely and effectively. This is due to the complexity of developing and managing these agents, ensuring compliance and governance, and mitigating risks associated with models, users and data sets.

The potential for agents is immense, which is why Gartner predicts that by 2028, one-third of gen AI interactions will use action models and autonomous agents. The risks of generative AI and machine learning can already be significant, especially for certain use cases. Add in AI agents, and those risks are further amplified.

We are excited to announce that a tech preview of new agentic evaluation capabilities will be available the week of March 3. These metrics can help organizations track agents more closely, confirming that they are acting appropriately and detecting early warning signs when they are not.

Here are the new RAG and agentic AI evaluation metrics you'll find in watsonx.governance; a rough scoring sketch follows the list:

  • Context relevance: Measures how well the data retrieved by the model aligns with the question specified in the prompt. Scores range from 0 to 1. Higher scores indicate that the context is more relevant to the question in the prompt.
  • Faithfulness: Indicates how accurately and reliably the generated response reflects the information contained in the retrieved documents or context. It measures the extent to which the generative model stays true to the content it has retrieved, without introducing errors, hallucinations (i.e., generating information not supported by the retrieved context), or misleading details that aren’t present in the source material. Scores range from 0 to 1. Higher scores indicate that the output is more grounded and less hallucinated.
  • Answer similarity: Measures how closely the generated answer aligns with a reference answer to gauge the quality of your model's performance. Scores range from 0 to 1. Higher scores indicate that the answer is more closely aligned with the reference answer.
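To make these definitions concrete, here is a minimal, illustrative sketch of how 0-to-1 scores in this spirit could be computed with off-the-shelf sentence embeddings. This is not the watsonx.governance implementation; the function names and the use of the sentence-transformers library are assumptions for illustration, and the real metrics are more sophisticated than a cosine-similarity proxy.

```python
# Illustrative sketch only: approximates the three RAG metrics with
# sentence-embedding cosine similarity. Function names are hypothetical;
# the actual watsonx.governance implementations differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def _sim(a, b) -> float:
    # Cosine similarity, clamped to the 0..1 range the metrics use.
    emb_a = model.encode(a, convert_to_tensor=True)
    emb_b = model.encode(b, convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb_a, emb_b).max()))

def context_relevance(question: str, contexts: list[str]) -> float:
    # How well the retrieved passages align with the question in the prompt.
    return _sim(question, contexts)

def faithfulness(answer: str, contexts: list[str]) -> float:
    # How grounded the generated answer is in the retrieved context.
    return _sim(answer, contexts)

def answer_similarity(answer: str, reference: str) -> float:
    # How closely the generated answer matches a reference answer.
    return _sim(answer, reference)
```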

Why governance is required for AI agents 

Agents have unsupervised autonomy and can take actions that are at times harmful to organizations or their customers; in some cases, those actions may be irreversible. With so many capabilities, data sources and decision points, even tracking and tracing the many steps an agent took to reach a conclusion and take a recommended action can be daunting.

These actions can also influence the underlying data and introduce bias, which in some cases could create infinite feedback loops. Like other forms of generative AI, agents can also hallucinate and confidently choose the wrong tool or take an impractical or unwise action. Controlling what the agent can interact with, and who can interact with the agent, also becomes challenging from an identity management perspective.

The scope and scale of managing, governing and securing agents makes an ad hoc or manual approach infeasible. Even safely experimenting with agents to learn as you scale requires a robust AI governance solution.

Read on to learn more about the benefits of using watsonx.governance, including its ability to track the end-to-end AI lifecycle, aid compliance with internal policies and external regulations, and improve transparency and explainability for tracked models. By the end, you'll understand how watsonx.governance can help you confidently build, deploy, manage and govern AI agents.

Lifecycle governance of AI agents

Developing, deploying and managing agentic AI follows the same lifecycle as other AI, starting with the use case, but requires additional capabilities to fully track the metadata for every stage of agentic systems. Managing risk, compliance and security are also key to agentic governance. Watsonx.governance automates many of these processes so you can scale agentic AI in your organization. 

We created a short demo to highlight how watsonx.governance can be used for agentic AI lifecycle governance. The clip shows how watsonx.governance lets you create an AI use case describing the business goals for the AI agent. In this example, we created Automated investment assistant as our hypothetical use case. From the use case, you can associate the related AI agents: we link an existing agent, Portfolio Rebalancer, to the use case and add an entry for a new agent, Fund Withdrawal Agent. The agents must follow the organization's governed workflow, which includes an initial risk assessment to identify potential risks early in the process. Once deployed, you can monitor agent performance and behavior using watsonx.governance's runtime monitoring features.

While the demo above shows it's possible to govern agentic AI in watsonx.governance today, we are working on out-of-the-box, enhanced functionality to do the same, which will be released later this year.

When organizations are exploring agentic AI across various use cases, experiment tracking can help assess how different agent variants are performing, informing developers and leaders which one to push forward to production. Traceability can also help agentic app developers debug their applications by providing a complete lineage of the agent's decisions at each step of the user interaction and agent processing.
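As a rough sketch of what such lineage could look like, the record below captures one step of an agent's processing so a governance tool can reconstruct the full decision trail. The field names are hypothetical illustrations, not a watsonx.governance schema.

```python
# Hypothetical sketch of a per-step trace record an agentic app might
# emit for debugging and lineage. Field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentTraceStep:
    run_id: str        # groups all steps of one user interaction
    step: int          # ordinal position in the agent's processing
    agent: str         # e.g. "Portfolio Rebalancer"
    tool_called: str   # tool or sub-agent the orchestrator selected
    tool_input: dict   # arguments passed to the tool
    tool_output: str   # raw result returned to the agent
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```

Replaying a run's steps in order then gives developers the complete lineage of decisions behind each action.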

Agentic systems evaluation

While metrics have always been important for governing AI, they are even more important for governing agents. Later this year, watsonx.governance will support additional specialized metrics for agentic systems throughout the model lifecycle and agent interaction. The context relevance, faithfulness and answer similarity metrics discussed earlier paint a clearer picture of an agent's ability to answer the right question, in the right way, with the right result. We are also working on additional specialized agentic AI metrics to monitor and improve agent performance.

Query translation faithfulness metrics can confirm whether an agent properly understood a user question or hallucinated. For example, if a user asks "how much discount do I receive as a gold level customer?" and the agent's translated query is FindDiscount(type=silver), that query would score poorly.
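A deliberately naive sketch of such a check appears below: it flags translated-query parameters that never appear in the user's question. The function name and logic are hypothetical illustrations, not the watsonx.governance metric.

```python
# Illustrative sketch: flag translated-query parameters that do not
# appear anywhere in the user's question. Hypothetical, not the
# watsonx.governance implementation.
def query_translation_faithful(question: str, query_params: dict) -> bool:
    text = question.lower()
    return all(str(value).lower() in text for value in query_params.values())

# The example above: "gold" appears in the question, "silver" does not,
# so FindDiscount(type=silver) is flagged as unfaithful.
q = "How much discount do I receive as a gold level customer?"
print(query_translation_faithful(q, {"type": "silver"}))  # False
print(query_translation_faithful(q, {"type": "gold"}))    # True
```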

System drift metrics help track whether agents are still operating and inferring as intended at launch, or whether they have evolved significantly over time and potentially drifted toward being unsafe or unproductive. watsonx.governance will also measure tool selection quality, which assesses whether the orchestrator selected the proper tool or agent for each user query.
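One simple way to reason about drift is to compare a metric's recent distribution against a baseline captured at launch. The sketch below does this with a two-sample Kolmogorov-Smirnov test; the test choice, threshold and data are assumptions for illustration, not how watsonx.governance computes drift.

```python
# Illustrative sketch: detect drift by comparing a metric's recent
# scores against a baseline captured at launch. Threshold and test
# choice are assumptions, not the watsonx.governance method.
from scipy.stats import ks_2samp

def has_drifted(baseline_scores, recent_scores, p_threshold=0.01) -> bool:
    # A small p-value means the samples likely come from different
    # distributions, i.e. behavior has shifted since launch.
    statistic, p_value = ks_2samp(baseline_scores, recent_scores)
    return p_value < p_threshold

baseline = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91]
recent   = [0.72, 0.68, 0.75, 0.70, 0.66, 0.74, 0.69, 0.71]
print(has_drifted(baseline, recent))  # True: scores have degraded
```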

Additional watsonx.governance agentic improvements 

Agentic AI will continue to be a focus for us throughout the year, and we will be launching risk management and regulatory compliance capabilities for agentic systems. Building on the current guardrails and red-teaming capabilities within watsonx.governance, we will also deliver enhanced agentic system guardrails, multi-turn conversation guardrails and agentic red teaming.

If you'd like your organization to explore and scale agentic AI effectively and responsibly, you need an end-to-end AI governance solution like watsonx.governance. Try it out for yourself or set up time to discuss with an IBM expert today.

Try watsonx.governance today

Learn about IBM’s AI Governance Services