Agentic AI is a transformative force, with Gartner predicting that one third of gen-AI interactions will use action models and autonomous agents by 2028.
But unsupervised AI agents can operate with significant autonomy and power, exposing organizations to unpredictable risks that can cause harmful, even irreversible, impacts on businesses and customers. Their complex decision-making processes, shaped by the data they consume, can introduce bias, complicate traceability and raise security concerns. Hallucinations and incorrect choices compound these challenges further.
To address these challenges, we announced a tech preview of our agentic AI governance capabilities in March. Building on this momentum, we’re rolling out additional new features as part of watsonx.governance.
The Governed Agentic Catalog is a comprehensive resource for managing and selecting AI tools, agents and workflows, designed to streamline tool and agent selection and promote reuse across users and use cases. This centralized repository helps teams maintain consistency and efficiency by consolidating a wide range of tools, each performing a specific task essential to designing and building agentic systems. These tools span functionality such as data retrieval and connections to external systems.
By leveraging the Governed Agentic Catalog, teams can manage tool sprawl, ensure proper tool utilization and maintain consistency across departments. This comprehensive approach to tool management accelerates progress and fosters a collaborative environment for agentic system development.
The growing prevalence of AI agents introduces significant complexities, chief among them evaluating the performance, reliability, safety and ethical behavior of these autonomous agents.
Agentic AI evaluation best practices can reduce exposure to risks both predictable and unknown. Effective performance tracking remains a challenge for organizations and developers, however, because agents demand observing not just outputs but also behaviors, decisions and intentions. With watsonx.governance, organizations can assess agent performance using specialized evaluation metrics.
Beginning in March, watsonx.governance introduced new capabilities to support these specialized metrics, and the new agentic RAG evaluation metrics are now available. This comprehensive set includes HAP, PII, prompt injection, context relevance, faithfulness, answer similarity, answer relevance, hit rate, average precision, reciprocal rank and unsuccessful requests, among others, enabling a thorough assessment of a system's effectiveness. These metrics help confirm that agents act appropriately and surface warning signs early, adding the guardrails needed to steer agentic behavior toward desired outcomes.
These metrics are made available by adding a simple Python decorator to a tool node in a LangGraph application. The decorator computes the metric as a byproduct of running the node, and the computed value can then be used within the application to make flow decisions. For example, if the context fetched from the vector database is not relevant to the user query, the application can skip answer generation and instead try a web search to fetch better context, as in the sketch below. These evaluators are not just easy to use but also efficient, and they include both open-source metrics and IBM advanced metrics, covering a wide range of evaluation capabilities, use cases and task types.
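To make this concrete, here is a minimal sketch of metric-driven flow control in a LangGraph application. The `with_context_relevance` decorator is a hypothetical stand-in (this post does not show the actual watsonx.governance decorator name or import path), and the retrieval, search and scoring functions are stubs; the LangGraph calls themselves (`StateGraph`, `add_conditional_edges`, `compile`) are the library's real API.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    query: str
    context: str
    relevance: float
    answer: str

def score_relevance(query: str, context: str) -> float:
    # Stand-in scorer: a real deployment would compute the governance
    # context-relevance metric here.
    return 1.0 if query.split()[0].lower() in context.lower() else 0.0

def with_context_relevance(fn):
    # Hypothetical decorator standing in for the watsonx.governance one:
    # the metric is computed as a byproduct of running the node.
    def wrapper(state: RAGState) -> RAGState:
        out = fn(state)
        out["relevance"] = score_relevance(out["query"], out["context"])
        return out
    return wrapper

@with_context_relevance
def retrieve(state: RAGState) -> RAGState:
    state["context"] = "stub vector-db passage about governance"  # stand-in retrieval
    return state

@with_context_relevance
def web_search(state: RAGState) -> RAGState:
    state["context"] = "stub web result for: " + state["query"]  # stand-in search
    return state

def generate(state: RAGState) -> RAGState:
    state["answer"] = "Answer grounded in: " + state["context"]
    return state

def route_on_relevance(state: RAGState) -> str:
    # Flow decision driven by the computed metric: generate only when the
    # fetched context is relevant enough, otherwise fall back to web search.
    return "generate" if state["relevance"] >= 0.5 else "web_search"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("web_search", web_search)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_conditional_edges("retrieve", route_on_relevance,
                            {"generate": "generate", "web_search": "web_search"})
graph.add_edge("web_search", "generate")
graph.add_edge("generate", END)
app = graph.compile()

result = app.invoke({"query": "governance metrics", "context": "", "relevance": 0.0, "answer": ""})
print(result["answer"])
```

The same pattern extends to any of the metrics above: compute the score as a byproduct of the node, then branch on it.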
Experimentation tracking is crucial in governing an AI agent because it provides a comprehensive record of all changes, iterations and improvements made during the development process. This includes modifications to algorithms, data inputs, hyperparameters and other critical aspects.
Agentic app development is an iterative process: developers build an agentic AI app, test it, fine-tune where necessary and build a new version for improved output, continuing the cycle for further optimization. Watsonx.governance automatically tracks these experiments and supports comparing them in Evaluation Studio.
Watsonx.governance accelerates iteration and development by enabling quick comparisons of agentic AI applications. This functionality is not limited to AI apps built on our watsonx platform; it also supports applications built on third-party platforms.
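For illustration, here is a minimal sketch of what such a side-by-side comparison amounts to, using made-up run records and metric values; Evaluation Studio provides this comparison natively, without hand-rolled code.

```python
# Hypothetical experiment records for two versions of an agentic app;
# the metric names and values are illustrative only.
runs = [
    {"version": "v1", "faithfulness": 0.72, "answer_relevance": 0.68, "hit_rate": 0.61},
    {"version": "v2", "faithfulness": 0.81, "answer_relevance": 0.74, "hit_rate": 0.66},
]

# Print a side-by-side view, then pick the strongest run on a chosen metric.
for run in runs:
    metrics = ", ".join(f"{k}={v}" for k, v in run.items() if k != "version")
    print(f"{run['version']}: {metrics}")

best = max(runs, key=lambda r: r["faithfulness"])
print(f"Best on faithfulness: {best['version']}")
```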
Monitoring metrics can help track agent performance, detect issues like performance degradation, data drift and model bias in production, and guide improvements. Without proper evaluation, it becomes difficult to trust, control or calibrate and fine-tune AI agents for improved accuracy, increasing the risk of unintended outcomes.
When agentic AI is deployed in production, ongoing surveillance becomes imperative to address issues like agentic hallucination, response time, model drift and bias. Continuous production monitoring is critical for maintaining system reliability and trust: real-time visibility lets MLOps and AgentOps teams track model and agent behavior, performance drift and unexpected outputs, and intervene immediately when deviations occur. This operational readiness ensures that autonomous systems remain aligned with their intended goals and safety constraints.
In upcoming releases, IBM's watsonx.governance will offer continuous oversight of agentic applications, raising alerts when any specified metric breaches its predefined limit. This feature enables proactive management and timely intervention to maintain optimal AI performance.
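As a rough illustration of what threshold-based alerting amounts to, here is a sketch with hypothetical metric names and limits; the upcoming release will surface such alerts natively rather than requiring hand-rolled checks.

```python
# Hypothetical per-metric limits; for quality metrics like faithfulness,
# a breach means the score falls below its limit.
THRESHOLDS = {"faithfulness": 0.7, "context_relevance": 0.6, "answer_relevance": 0.7}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    # Return an alert message for every metric that breached its limit.
    return [
        f"ALERT: {name}={value:.2f} breached limit {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value < THRESHOLDS[name]
    ]

for alert in check_alerts({"faithfulness": 0.55, "context_relevance": 0.9}):
    print(alert)
```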
Like other swiftly evolving technologies, AI agents introduce possible risks, obstacles and societal consequences. Some new risks introduced by AI agents include data bias, redundant actions, function-calling hallucinations, sharing confidential information and attacks on an AI agent’s external resources. Beyond these, agentic AI intensifies existing risks, challenges and societal effects.
The IBM Risk Atlas provides a list of risks inherent to data and AI and is being updated to reflect agentic risks and threats.
AI governance is needed across the AI lifecycle, from use case creation, development and validation to monitoring in production. Every stage carries risks and pitfalls which, if not properly managed, can cause issues now or down the line. For example, while creating a new use case, watsonx.governance provides a risk assessment that helps you identify which risks your use case is prone to, so you can incorporate the necessary risk management techniques. Similarly, during development of an agentic application, you need to measure and evaluate the performance of each tool or node in the application to make improvements in future iterations.
Watsonx.governance provides a library of over 50 metrics that can be added as decorators to your application to measure its performance. Without governance, you cannot scale or build trust in your AI.
Effective governance and security are indispensable: as companies grow and adopt AI at scale, a robust AI governance structure is what ensures safe experimentation and keeps the complexities of widespread AI adoption manageable.
Try watsonx.governance to explore these new feature releases and several other enhancements built to help enterprises unlock the true potential of AI and transform their AI governance experience.
Try watsonx.governance for free today