Agentic AI evaluation
Agentic AI evaluation is a module in the ibm-watsonx-gov Python SDK that computes quantitative metrics to measure the performance of agentic AI tools. You can use the module to automate and accelerate evaluation tasks, helping you streamline your workflows and manage regulatory compliance risks for your use case.
Types of evaluations in Agentic AI
The agentic AI evaluation module supports two complementary approaches to evaluating agents: in-the-loop evaluations and offline evaluations. Choosing the right approach depends on whether you need to control execution decisions at run time or validate agent quality during development.
In-the-loop evaluations
In-the-loop evaluations are performed during the execution of the agent. The evaluation result can be used immediately as a decision point to control the agent’s logic.
For example, in an agentic RAG (retrieval-augmented generation) workflow, you might compute a context relevance score for the data retrieved from a vector database and branch on the result, as shown in the sketch after this list:
- If the relevance score is high, the agent continues and generates an answer.
- If the relevance score is low, the agent can be configured to stop execution and return a message indicating that it cannot answer the question.
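The following minimal sketch illustrates this decision point. The retrieve and generate_answer helpers and the relevance threshold are hypothetical placeholders; in practice, the in-the-loop evaluators described later in this topic supply the context relevance score.
RELEVANCE_THRESHOLD = 0.5  # illustrative cutoff; tune it for your use case

def retrieve(question: str):
    # Hypothetical retrieval step: replace with your vector database lookup and
    # a context relevance evaluation of the retrieved passages.
    passages = ["Concept drift occurs when data distributions change over time."]
    relevance_score = 0.8
    return passages, relevance_score

def generate_answer(question: str, passages: list) -> str:
    # Hypothetical generation step: replace with your LLM call.
    return f"Answer generated from {len(passages)} retrieved passage(s)."

def answer_or_refuse(question: str) -> str:
    passages, relevance_score = retrieve(question)
    if relevance_score < RELEVANCE_THRESHOLD:
        # Low relevance: stop instead of generating a potentially misleading answer.
        return "I cannot answer this question with the information available."
    return generate_answer(question, passages)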
In-the-loop evaluations help you:
- Prevent the agent from producing low-quality or misleading responses.
- Build more trusted and high-quality AI agents that adapt dynamically to the inputs they process.
Offline evaluations
Offline evaluations are performed outside of runtime execution, typically during development and testing. These evaluations help you measure and refine the quality of an agent or its tools before deployment.
For example, you can:
- Evaluate a single tool used by an agent to measure accuracy or reliability.
- Evaluate the entire agent flow after it completes, using test data or scenarios.
Offline evaluations provide insights that help developers iteratively improve agent design and performance before putting the agent into production.
Evaluators
The agentic AI evaluation module provides the following evaluators to measure performance for agentic RAG use cases:
- MetricGroup evaluators
- evaluator.evaluate_retrieval_quality
- An evaluation decorator that computes retrieval quality metrics on an agentic tool. Metrics include context relevance, retrieval precision, average precision, hit rate, reciprocal rank, and NDCG.
- evaluator.evaluate_answer_quality
- An evaluation decorator that computes answer quality metrics on an agentic tool. Metrics include answer relevance, faithfulness, answer similarity, and unsuccessful requests.
- evaluator.evaluate_content_safety
- An evaluation decorator that computes content safety metrics on an agentic tool. Metrics include HAP, PII, harm, social bias, profanity, sexual content, unethical behavior, violence, harm engagement, evasiveness, and jailbreak.
- evaluator.evaluate_readability
- An evaluation decorator that computes readability metrics on an agentic tool. Readability metrics include Text Reading Ease and Text Grade Level.
- Metric evaluators
- evaluator.evaluate_context_relevance
- Computes the context relevance metric for your content retrieval tool.
- evaluator.evaluate_faithfulness
- Computes the faithfulness metric for your answer generation tool. This metric does not require ground truth.
- evaluator.evaluate_answer_similarity
- Computes the answer similarity metric for your answer generation tool. This metric requires ground truth for computation.
- evaluator.evaluate_retrieval_precision
- An evaluation decorator that computes the retrieval precision metric on an agentic tool. This metric uses context relevance values as a prerequisite.
- evaluator.evaluate_average_precision
- An evaluation decorator that computes the average precision metric on an agentic tool. This metric uses context relevance values as a prerequisite.
- evaluator.evaluate_hit_rate
- An evaluation decorator that computes the hit rate metric on an agentic tool. This metric uses context relevance values as a prerequisite.
- evaluator.evaluate_reciprocal_rank
- An evaluation decorator that computes the reciprocal rank metric on an agentic tool. This metric uses context relevance values as a prerequisite.
- evaluator.evaluate_ndcg
- An evaluation decorator that computes the NDCG (Normalized Discounted Cumulative Gain) metric on an agentic tool. This metric uses context relevance values as a prerequisite.
- evaluator.evaluate_answer_relevance
- An evaluation decorator that computes the answer relevance metric on an agentic tool.
- evaluator.evaluate_unsuccessful_requests
- An evaluation decorator that computes the unsuccessful requests metric on an agentic tool.
- evaluator.evaluate_tool_call_syntactic_accuracy
- An evaluation decorator that computes the tool call syntactic accuracy metric on an agentic tool.
- evaluator.evaluate_hap
- An evaluation decorator that computes the HAP metric on an agentic tool.
- evaluator.evaluate_pii
- An evaluation decorator that computes the PII metric on an agentic tool.
- evaluator.evaluate_harm
- An evaluation decorator that computes harm risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_social_bias
- An evaluation decorator that computes social bias risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_profanity
- An evaluation decorator that computes profanity risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_sexual_content
- An evaluation decorator that computes sexual content risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_unethical_behavior
- An evaluation decorator that computes unethical behavior risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_violence
- An evaluation decorator that computes violence risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_harm_engagement
- An evaluation decorator that computes harm engagement risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_evasiveness
- An evaluation decorator that computes evasiveness risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_jailbreak
- An evaluation decorator that computes jailbreak risk on an agentic tool via Granite Guardian.
- evaluator.evaluate_text_reading_ease
- The Text Reading Ease metric measures how readable the text is. It is computed using the flesch_reading_ease method.
- evaluator.evaluate_text_grade_level
- The Text Grade Level metric measures the approximate U.S. grade level of a text. It is computed using the flesch_kincaid_grade method.
- evaluator.evaluate_prompt_safety_risk
- The Prompt Safety Risk metric evaluates how likely an AI is to respond with harmful, unsafe, or inappropriate content. Available only in Dallas (us-south) and Frankfurt (eu-de) regions.
- evaluator.evaluate_topic_relevance
- The Topic Relevance metric evaluates how closely the input content aligns with the topic specified by the system_prompt. Available only in Dallas (us-south) and Frankfurt (eu-de) regions.
To use the agentic AI evaluation module, you must install the ibm-watsonx-gov Python SDK with the agentic extra:
pip install "ibm-watsonx-gov[agentic]"
Examples
You can evaluate agentic AI tools with the agentic AI evaluation module as shown in the following examples:
Set up the state
The ibm-watsonx-gov Python SDK provides a Pydantic-based state class that you can extend:
from ibm_watsonx_gov.entities.state import EvaluationState

class AppState(EvaluationState):
    pass
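If your graph nodes need to share additional values, such as retrieved context, you can declare them as fields on the subclass instead of leaving it empty. The following is a minimal sketch, assuming a hypothetical web_context field like the one referenced in the advanced configuration later in this topic:
from typing import Optional

from ibm_watsonx_gov.entities.state import EvaluationState

class AppState(EvaluationState):
    # Hypothetical extra field: retrieved text that downstream nodes and evaluators can read.
    web_context: Optional[str] = None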
Set up the evaluator
To evaluate agentic AI applications, you must instantiate the AgenticEvaluator class to define the evaluators that compute different metrics:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator
evaluator = AgenticEvaluator()
You can also configure an advanced version of the evaluator that defines interaction-level and node-level metrics:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator
from ibm_watsonx_gov.config import AgenticAIConfiguration
from ibm_watsonx_gov.entities.agentic_app import (AgenticApp, MetricsConfiguration, Node)
from ibm_watsonx_gov.metrics import AnswerRelevanceMetric, ContextRelevanceMetric
from ibm_watsonx_gov.entities.enums import MetricGroup

# Define the metrics to be computed at the agentic app (interaction) level in the metrics_configuration
# parameter of AgenticApp; these metrics use the agent input and output fields.
# Node-level metrics to be computed after the graph invocation are specified in the nodes parameter of AgenticApp.
retrieval_quality_config_web_search_node = {
    "input_fields": ["input_text"],
    "context_fields": ["web_context"]
}

nodes = [Node(name="Web \nSearch \nNode",
              metrics_configurations=[MetricsConfiguration(configuration=AgenticAIConfiguration(**retrieval_quality_config_web_search_node),
                                                           metrics=[ContextRelevanceMetric()])])]

agent_app = AgenticApp(name="Rag agent",
                       metrics_configuration=MetricsConfiguration(metrics=[AnswerRelevanceMetric()],
                                                                  metric_groups=[MetricGroup.CONTENT_SAFETY]),
                       nodes=nodes)

evaluator = AgenticEvaluator(agentic_app=agent_app)
Add your evaluators
Compute the context relevance metric by defining the retrieval_node tool and decorating it with the evaluate_context_relevance evaluator:
from langchain_core.runnables import RunnableConfig

@evaluator.evaluate_context_relevance
def retrieval_node(state: AppState, config: RunnableConfig):
    # do something
    pass
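You can apply a metric group evaluator in the same way to compute a full set of related metrics with one decorator. The following sketch, which uses the same hypothetical tool, applies evaluate_retrieval_quality to compute context relevance, retrieval precision, average precision, hit rate, reciprocal rank, and NDCG:
@evaluator.evaluate_retrieval_quality
def retrieval_node(state: AppState, config: RunnableConfig):
    # do something
    pass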
You can also stack evaluators to compute multiple metrics with a tool. The following example shows the generate_node tool decorated with the evaluate_faithfulness and evaluate_answer_similarity evaluators to compute answer quality metrics:
@evaluator.evaluate_faithfulness
@evaluator.evaluate_answer_similarity
def generate_node(state: AppState, config: RunnableConfig):
    # do something
    pass
Make an invocation
When you invoke an application for a row of data, an interaction_id key is added to the inputs to track individual rows and associate metrics with each row:
evaluator.start_run()
result = rag_app.invoke({"input_text": "What is concept drift?", "ground_truth": "Concept drift occurs when the statistical properties of the target variable change over time, causing a machine learning model’s predictions to become less accurate."})
evaluator.end_run()
eval_result = evaluator.get_result()
eval_result.to_df()
The invocation generates a result as shown in the following example:
| interaction_id | Generation Node.answer_similarity | Generation Node.faithfulness | Generation Node.latency | Retrieval Node.context_relevance | Retrieval Node.latency | interaction.cost | interaction.duration | interaction.input_token_count | interaction.output_token_count |
|---|---|---|---|---|---|---|---|---|---|
| eb1167b367a9c3787c65c1e582e2e662 | 0.924013 | 0.300423 | 3.801389 | 0.182579 | 1.652945 | 0.000163 | 5.575077 | 608 | 121 |
Invoke the graph on multiple rows
To run a batch invocation, you can define a dataframe with questions and the ground truths for those questions:
import pandas as pd
question_bank_df = pd.read_csv("https://raw.githubusercontent.com/IBM/ibm-watsonx-gov/refs/heads/samples/notebooks/data/agentic/medium_question_bank.csv")
question_bank_df["interaction_id"] = question_bank_df.index.astype(str)
evaluator.start_run()
result = rag_app.batch(inputs=question_bank_df.to_dict("records"))
evaluator.end_run()
eval_result = evaluator.get_result()
eval_result.to_df()
The dataframe index is used as the interaction_id to uniquely identify each row.
The invocation generates a result as shown in the following example:
| interaction_id | Generation Node.answer_similarity | Generation Node.faithfulness | Generation Node.latency | Retrieval Node.context_relevance | Retrieval Node.latency | interaction.cost | interaction.duration | interaction.input_token_count | interaction.output_token_count |
|---|---|---|---|---|---|---|---|---|---|
| 12f175ffae3b16ec9a27d85888c132ad | 0.914762 | 0.762620 | 1.483254 | 0.434709 | 1.639955 | 0.000131 | 3.147790 | 701 | 44 |
| 31d0b6640589f8779b0252440950fd13 | 0.356361 | 0.584075 | 4.864134 | 0.525792 | 1.353179 | 0.000258 | 6.243586 | 623 | 276 |
| 6d16ee18552116dd2ba4b180cb69ca38 | 0.896585 | 0.889639 | 3.266545 | 0.707973 | 1.686493 | 0.000203 | 4.983225 | 670 | 172 |
| 7aaf0e891fb797fab7d6467b2f5a522a | 0.774119 | 0.735871 | 3.533067 | 0.715336 | 1.849011 | 0.000187 | 5.404923 | 608 | 161 |
| a25b59fd92e8e269d12ecbc40b9475b1 | 0.857428 | 0.875609 | 6.110012 | 0.763275 | 1.374762 | 0.000154 | 7.512924 | 502 | 133 |
| ade9b2b4efdd35f80fa34266ccfdba9b | 0.891241 | 0.786779 | 3.674506 | 0.669930 | 1.050648 | 0.000177 | 4.750497 | 642 | 137 |
| d480865f9b38fe803042e325a28f5ab0 | 0.935062 | 0.267500 | 3.108228 | 0.182579 | 1.640975 | 0.000163 | 4.776831 | 608 | 121 |
| d576d4155ec17dbe176ea1b164264cd5 | 0.861390 | 0.893529 | 2.277618 | 0.838808 | 4.941034 | 0.000144 | 7.247118 | 636 | 83 |
| d5fdb76a19fbeb1d9edfa3da6cf55b15 | 0.661731 | 0.684596 | 2.075541 | 0.680110 | 1.632314 | 0.000128 | 3.730348 | 633 | 57 |
| daf66c5f2577bffac87a746319c16a0d | 0.890937 | 0.808881 | 2.250932 | 0.706106 | 1.515383 | 0.000141 | 3.797323 | 608 | 86 |
For more information, see the sample notebook.