Unlocking the power of Agentic AI with new watsonx.governance capabilities


18 June 2025

Authors

Siddhi Shreekar Gowaikar

Product Manager

Andrea Colmenares

AI Campaign, Product Marketing

Sahiba Pahwa

Product Marketing, watsonx.governance

IBM

Agentic AI is a transformative force, with Gartner predicting that one third of gen-AI interactions will use action models and autonomous agents by 2028.

But unsupervised AI agents can operate with significant autonomy and power, exposing organizations to numerous unpredictable risks that may result in harmful and irreversible impact to both businesses and customers. Their complex decision-making processes, influenced by data, can create biases, complicate traceability, and introduce security concerns. Hallucinations and incorrect choices further compound these challenges.

To combat these challenges, in March, we announced the tech preview of our agentic AI governance capabilities. Building on this momentum, we’re rolling out additional new features as part of watsonx.governance.

Streamlined inventory of tools for Agents

The Governed Agentic Catalog is a comprehensive resource for managing and selecting AI tools, agents and workflows, designed to streamline tool/agent selection and promote reuse across users and use cases. This centralized repository helps teams maintain consistency and efficiency by consolidating a wide range of tools, each performing specific tasks essential for designing and building agentic systems. These agentic tools encompass various functionalities, such as data retrieval and external connections.

The catalog's key features include:

  1. Enable tool lineage mapping: Users will be able to trace tools back to their respective use cases; this feature will be available in later releases. The catalog also offers search by use case type or domain, allowing users to quickly locate relevant tools and expedite project initiation.
  2. Facilitate tool comparison in a single view: Users can filter tools based on their type, and each tool card provides a clear description along with quality metrics. The catalog facilitates easy side-by-side comparisons of different tools, empowering users to make informed decisions.
  3. Ensure tool effectiveness and reliability: As part of a later release, users can view ratings from other community members to gauge tool effectiveness and reliability.

By leveraging the Governed Agentic Catalog, teams can manage tool sprawl, ensure proper tool utilization and maintain consistency across departments. This comprehensive approach to tool management ultimately accelerates progress and fosters a collaborative environment for agentic system development.

Accelerate Agentic AI performance evaluation

The growing prevalence of AI agents introduces significant complexities, such as the challenge of evaluating the performance, reliability, safety and ethical behavior of these autonomous AI agents.

Agentic AI evaluation best practices can reduce exposure to various predictable and unknown risks. However, effective performance tracking can be a challenge for organizations and developers, because evaluating agents means observing not just outputs but also behaviors, decisions and intentions. With watsonx.governance, organizations can assess agent performance using:

  • Evaluation metrics with benchmarks: Helps assess agent competence overall and at various tasks.
  • Root cause analysis: Identifies the underlying reasons for poor performance (for example, a lack of unbiased data) by tracking decision chains, not just the final output, to inform improvements.
  • Human feedback or red teaming: Allows SMEs to observe and verify the agent's actions (human in the loop) and test agents for susceptibilities.

Beginning in March, watsonx.governance introduced new capabilities to support additional specialized metrics, and the new RAG agentic AI evaluation metrics are now available. The comprehensive set of performance metrics includes HAP (hate, abuse and profanity), PII (personally identifiable information), prompt injection, context relevance, faithfulness, answer similarity, answer relevance, hit rate, average precision, reciprocal rank and unsuccessful requests, among others, to ensure a thorough assessment of a system's effectiveness. These metrics help confirm that agents act appropriately and help detect warning signs, so that the necessary guardrails can be added to steer agentic behavior toward the desired outcome.
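To make the retrieval-oriented metrics in that list concrete, here is a minimal sketch of hit rate, reciprocal rank and average precision for a single query, using their standard textbook definitions. It is not the watsonx.governance implementation; the function names and document IDs are illustrative assumptions, and average precision is normalized here by the number of relevant items actually retrieved (some definitions divide by the total number of relevant items instead).

```python
from typing import Sequence, Set


def hit_rate(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Return 1.0 if at least one retrieved item is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Return 1 / rank of the first relevant item (ranks start at 1), or 0.0."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Average precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Illustrative data: ranked retrieval results and the ground-truth relevant set.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

print(hit_rate(retrieved_ids, relevant_ids))
print(reciprocal_rank(retrieved_ids, relevant_ids))
print(average_precision(retrieved_ids, relevant_ids))
```

In practice these per-query values are usually averaged over an evaluation set of queries.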

These metrics will be available by adding a simple Python decorator to the tool node in a LangGraph application. Adding this decorator results in the metric being computed as a byproduct of running the node in the agentic application. The computed metric can then be used within the application to make flow decisions. For example, if the context fetched from the vector database is not relevant to the user query, the application can skip generating an answer and try a web search to fetch the right context instead. These evaluators are easy to use and efficient, and they include both open-source metrics and IBM advanced metrics. They therefore provide a wide range of evaluation capabilities and suit various use cases and task types.
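The decorator pattern described above can be sketched in plain Python. This is an illustration of the idea only, not the watsonx.governance API: the decorator name `context_relevance`, the state keys, the toy token-overlap score and the 0.5 threshold are all assumptions made for this example.

```python
import functools
import re
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Stand-in for documents that a vector database would return.
DOCS = ["Quarterly revenue grew due to cloud subscriptions."]


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_relevance(node: Callable[[State], State]) -> Callable[[State], State]:
    """Hypothetical decorator: run the node, then attach a toy relevance score.

    The score is the fraction of query tokens that also appear in the context.
    A real evaluator would use a far more robust measure.
    """
    @functools.wraps(node)
    def wrapper(state: State) -> State:
        result = node(state)
        query = _tokens(result["query"])
        context = _tokens(" ".join(result.get("context", [])))
        score = len(query & context) / len(query) if query else 0.0
        return {**result, "context_relevance": score}
    return wrapper


@context_relevance
def retrieve(state: State) -> State:
    """Tool node: fetch context for the query (fixed documents in this sketch)."""
    return {**state, "context": list(DOCS)}


def route(state: State, threshold: float = 0.5) -> str:
    """Flow decision: generate an answer only if the context looks relevant."""
    return "generate" if state["context_relevance"] >= threshold else "web_search"


print(route(retrieve({"query": "cloud revenue"})))
print(route(retrieve({"query": "weather in Paris"})))
```

In a LangGraph application, a routing function like `route` would typically be wired in with `add_conditional_edges`, so that the graph branches to a generation node or a web-search node based on the computed metric.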

Fast-track your Agentic experimentation 

Experimentation tracking is crucial in governing an AI agent because it provides a comprehensive record of all changes, iterations and improvements made during the development process. This includes modifications to algorithms, data inputs, hyperparameters and other critical aspects.

Agentic app development is an iterative process. Developers build an agentic AI app, test it, fine-tune it when necessary and build a new version for improved output, and the cycle continues for further optimization. Watsonx.governance will automatically support tracking various experiments and comparing them using Evaluation Studio:

  • Faster agentic development: Evaluate multiple agents in a single instance, saving developers time when evaluating agents built on any third-party platform.
  • Enhanced decision-making and selection processes: Visualize and compare agents simultaneously to support better-informed selection.
  • Increased operational efficiency: Eliminates the need for manual reviews, streamlining workflows and reducing potential human error.

Watsonx.governance accelerates the iteration and development process by enabling quick comparisons of Agentic AI applications. This functionality is not limited to AI apps built on our watsonx platform; it also extends support to third-party platforms, offering versatility.
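As a rough illustration of what comparing tracked experiments involves, the sketch below ranks several agent versions by one recorded metric. It does not use Evaluation Studio or any watsonx.governance API; the `ExperimentRun` class, the run names and the metric values are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperimentRun:
    """One tracked iteration of an agentic application (hypothetical record)."""
    name: str
    metrics: Dict[str, float]  # assumes higher is better for every metric


def compare_runs(runs: List[ExperimentRun], metric: str) -> List[ExperimentRun]:
    """Return the runs that report the metric, sorted best first."""
    with_metric = [run for run in runs if metric in run.metrics]
    return sorted(with_metric, key=lambda run: run.metrics[metric], reverse=True)


# Made-up values, for illustration only.
runs = [
    ExperimentRun("agent-v1", {"faithfulness": 0.71, "answer_relevance": 0.80}),
    ExperimentRun("agent-v2", {"faithfulness": 0.84, "answer_relevance": 0.78}),
    ExperimentRun("agent-v3", {"faithfulness": 0.79}),
]

for run in compare_runs(runs, "faithfulness"):
    print(run.name, run.metrics["faithfulness"])
```

Runs that never recorded the chosen metric are left out of the ranking rather than treated as zero, which avoids penalizing an experiment for a metric it was not evaluated on.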

Monitor agentic AI applications in production in real time

Monitoring metrics can help track agent performance, detect issues such as performance degradation, data drift and model bias in production, and guide improvements. Without proper evaluation, it becomes difficult to trust, control, calibrate or fine-tune AI agents for improved accuracy, which increases the risk of unintended outcomes.

In scenarios where agentic AI is deployed in production, ongoing surveillance becomes imperative to address issues like agentic hallucination, response time, model drift and bias. Deploying agentic AI applications with continuous production monitoring is critical for maintaining system reliability and trust. Real-time surveillance enables MLOps and AgentOps teams to track model and agent behavior, performance drift, and unexpected outputs, allowing for immediate intervention when deviations occur. This operational readiness ensures that autonomous systems remain aligned with intended goals and safety constraints.

In upcoming releases, watsonx.governance will offer continuous oversight of agentic applications, triggering alerts when any of the specified metrics exceeds its predefined limit. This feature supports proactive management and timely intervention to maintain optimal AI performance.
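The threshold-based alerting described here can be pictured with a small sketch. It is an assumption-laden illustration, not the product's monitoring feature: the function name `check_thresholds`, the metric names and the limits are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (lower limit, upper limit) for a metric; None means no limit on that side.
Bounds = Tuple[Optional[float], Optional[float]]


def check_thresholds(metrics: Dict[str, float],
                     thresholds: Dict[str, Bounds]) -> List[str]:
    """Return one alert message per metric that falls outside its limits."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this monitoring window
        if low is not None and value < low:
            alerts.append(f"{name}={value} is below the lower limit {low}")
        if high is not None and value > high:
            alerts.append(f"{name}={value} is above the upper limit {high}")
    return alerts


# Hypothetical limits and one window of production measurements.
thresholds = {"faithfulness": (0.7, None), "latency_seconds": (None, 2.0)}
window = {"faithfulness": 0.62, "latency_seconds": 1.4}

for alert in check_thresholds(window, thresholds):
    print(alert)
```

Quality metrics usually need a lower limit and cost or latency metrics an upper limit, which is why each metric carries both bounds in this sketch.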

Proactively assess risk 

Similar to other swiftly evolving technologies, AI agents introduce possible risks, obstacles and societal consequences. Some new risks introduced by AI agents include data bias, redundant actions, function-calling hallucinations, sharing confidential information and attacks on an AI agent’s external resources. Beyond these, agentic AI intensifies existing risks, challenges, and societal effects.

The IBM Risk Atlas provides a list of risks inherent to data and AI and is being updated to reflect agentic risks and threats. 

AI governance across the lifecycle

AI governance is needed across the AI lifecycle, from use case creation, development and validation to monitoring in production. At every stage, there are risks and pitfalls that, if not properly managed, can cause present or future issues. For example, when you create a new use case, watsonx.governance provides a risk assessment that helps you identify which risks your use case is prone to, so you can incorporate the necessary risk management techniques. Similarly, during development of an agentic application, you need to measure and evaluate the performance of each tool or node in the application to make improvements in future iterations.

Watsonx.governance provides a library of over 50 metrics that can be added as decorators to your application to measure its performance. Without governance, you cannot scale or build trust in your AI.

Try watsonx.governance today

Effective governance and security are indispensable. As companies grow and adopt AI at scale, a robust AI governance structure becomes essential to ensure safe experimentation and to manage the complexities of widespread AI adoption efficiently.

Try watsonx.governance to explore these new feature releases and several other enhancements built to help enterprises unlock the true potential of AI and transform their AI governance experience.

Try watsonx.governance for free today

Learn how to work with modern day AI governance tools

Learn more about watsonx.governance capabilities
