An AI agent is a software system that can autonomously carry out tasks on behalf of a user or another system by designing its own workflow and using external tooling as required.
Agents go well beyond simple language processing and understanding. They are capable of decision making, problem solving, interacting with the environment and acting in pursuit of goals.
AI agents are now being incorporated into a variety of enterprise solutions, from IT automation and software engineering to conversational interfaces and code generation. Driven by large language models (LLMs), they can comprehend complex directions, decompose them into steps, interact with external resources and determine when to deploy particular tools or services to help complete a task.
Agent evaluation is an important procedure when creating and deploying autonomous AI systems because it measures how well an agent performs the tasks assigned, makes decisions and interacts with users or environments. This way we can ensure that agents operate reliably, efficiently and ethically in their intended use cases.
Key reasons for agent evaluation include measuring task performance, validating the quality of the agent's decisions and confirming reliable, efficient and ethical behavior in its intended use cases.
Assessing an AI agent's performance relies on metrics that fall into three broad classes: accuracy, response time (speed) and cost of resources used. Accuracy describes how well the agent gives correct and relevant responses and how reliably it completes its intended functions. Response time measures how quickly the agent processes input and produces output; minimizing latency is especially important in interactive and real-time applications. Cost measures the computational resources the agent consumes, such as token usage, calls to an application programming interface (API) or system time. Together, these metrics provide guidelines for improving system performance and limiting operational costs.
Key metrics such as correctness, helpfulness and coherence fall under accuracy, while response time (latency) covers metrics such as throughput, average latency and timeout delay. Cost metrics include token usage, compute time, API call count and memory consumption.
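To make these classes concrete, here is a minimal sketch of how a single model call could be instrumented for response time and cost. The generate_fn callable and the whitespace-based token counts are illustrative stand-ins; the accuracy-class scores come later from the LLM-as-a-judge step.

```python
import time


def measure_call(generate_fn, prompt: str) -> dict:
    """Time a single model call and record rough cost figures.

    generate_fn is any callable that returns generated text; the token counts
    here are a crude whitespace approximation, purely for illustration.
    """
    start = time.perf_counter()
    text = generate_fn(prompt)
    latency_seconds = time.perf_counter() - start

    return {
        "latency_seconds": round(latency_seconds, 3),  # response-time metric
        "prompt_tokens": len(prompt.split()),          # cost metric (approximate)
        "output_tokens": len(text.split()),            # cost metric (approximate)
        "text": text,                                  # judged later for accuracy
    }
```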
In this tutorial we will explore the key metrics of correctness, helpfulness and coherence that fall under accuracy.
You will develop a travel agent and evaluate its performance by using an "LLM-as-a-judge."
You need an IBM® Cloud® account to create a watsonx.ai® project.
You also need Python version 3.12.7.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai by using your IBM Cloud account.
Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more IBM Granite® tutorials, check out the IBM Granite Community.
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an application programming interface (API) key.
Associate the watsonx.ai Runtime service instance with the project that you created in watsonx.ai.
We need a few libraries and modules for this tutorial. Make sure to import the following ones; if any are not installed, a quick pip installation resolves the problem.
Note, this tutorial was built by using Python 3.12.7.
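The exact package list depends on the notebook, but a typical setup looks like the sketch below; the ibm-watsonx-ai SDK is the main dependency, and anything missing can be installed with pip first.

```python
# Install the packages if they are not already available (uncomment in a notebook)
# %pip install ibm-watsonx-ai requests

import os
import getpass

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
```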
To set our credentials, we need the WATSONX_APIKEY and WATSONX_PROJECT_ID you generated in step 1. We will also set the URL serving as the API endpoint. Your API endpoint can differ depending on your geographical location.
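A minimal way to wire these up, assuming the key and project ID are supplied interactively or through environment variables (the Dallas endpoint below is only one example region):

```python
import os
import getpass

from ibm_watsonx_ai import Credentials

WATSONX_APIKEY = os.environ.get("WATSONX_APIKEY") or getpass.getpass("watsonx API key: ")
WATSONX_PROJECT_ID = os.environ.get("WATSONX_PROJECT_ID") or input("watsonx project ID: ")

# The URL depends on the region where your watsonx.ai instance is provisioned
WATSONX_URL = "https://us-south.ml.cloud.ibm.com"

credentials = Credentials(url=WATSONX_URL, api_key=WATSONX_APIKEY)
```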
We will use the Granite 3 8B Instruct model for this tutorial. To initialize the LLM, we need to set the model parameters. To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
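As a sketch, initializing the model through the ibm-watsonx-ai SDK might look like the following; the decoding parameters shown are illustrative defaults, not the tutorial's exact settings.

```python
from ibm_watsonx_ai.foundation_models import ModelInference

# Illustrative generation parameters; adjust the token limits for your use case
parameters = {
    "decoding_method": "greedy",
    "min_new_tokens": 1,
    "max_new_tokens": 500,
}

llm = ModelInference(
    model_id="ibm/granite-3-8b-instruct",
    credentials=credentials,        # from the previous step
    project_id=WATSONX_PROJECT_ID,
    params=parameters,
)
```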
Let's build a travel explorer buddy that helps users with trip planning and travel research.
We will create a simple travel assistant application that retrieves airline and hotel information in response to user inquiries by connecting to an external travel API. To integrate this capability with the AI agent for dynamic travel planning, we write a straightforward function that makes the API queries and wrap it in a tool.
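The external service details are specific to the notebook, but the pattern is straightforward: a plain Python function that queries the API, wrapped as a tool the agent can call. The endpoint, parameters and LangChain-style @tool wrapper below are assumptions for illustration.

```python
import requests
from langchain_core.tools import tool

TRAVEL_API_BASE = "https://api.example-travel.com/v1"  # hypothetical endpoint


def fetch_travel_info(city: str, category: str) -> dict:
    """Query the external travel API for flight or hotel data in a given city."""
    response = requests.get(
        f"{TRAVEL_API_BASE}/{category}",
        params={"city": city},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


@tool
def travel_lookup(city: str, category: str = "hotels") -> str:
    """Return airline or hotel options for a city. category is 'flights' or 'hotels'."""
    try:
        return str(fetch_travel_info(city, category))
    except requests.RequestException as err:
        return f"Travel API request failed: {err}"
```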
Finally, we run the evaluation and print the final evaluation score. To evaluate the trip planner on three distinct criteria (correctness, helpfulness and coherence), we develop a structured evaluation prompt for an evaluator LLM.
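One way to frame the judge, sketched below with a simple 1-to-10 rubric and an averaged final score; the prompt wording, the JSON output format and the generate_text call are assumptions rather than the notebook's exact code.

```python
import json

EVALUATION_PROMPT = """You are an impartial evaluator. Score the assistant's answer to the
user's travel question on three criteria, each from 1 (poor) to 10 (excellent):
- correctness: is the information accurate and relevant to the question?
- helpfulness: does the answer actually help the user plan the trip?
- coherence: is the answer well organized and easy to follow?

Question: {question}
Answer: {answer}

Respond with JSON only, for example:
{{"correctness": 8, "helpfulness": 7, "coherence": 9, "comments": "..."}}"""


def evaluate_response(judge_llm, question: str, answer: str) -> dict:
    """Ask the evaluator LLM to grade the travel planner's answer on the three criteria."""
    prompt = EVALUATION_PROMPT.format(question=question, answer=answer)
    raw = judge_llm.generate_text(prompt=prompt)  # assumes a watsonx.ai ModelInference judge
    scores = json.loads(raw)                      # assumes the judge returns clean JSON
    scores["final_score"] = round(
        (scores["correctness"] + scores["helpfulness"] + scores["coherence"]) / 3, 2
    )
    return scores
```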
The output shows both qualitative and quantitative assessment of the travel planner generated by using three criteria—correctness, helpfulness and coherence.
Let's break down what each score and metric means in the context of the agent's output:
When evaluating an agent’s ability to truly meet user needs, criteria such as coherence, helpfulness and accuracy play a central role. Regardless of whether you're working with OpenAI, IBM Granite or other LLM-as-a-service models, it's crucial to rely on structured evaluation methods—such as evaluation datasets, benchmarks, annotations and ground truth—to thoroughly test final outputs. In practical use cases like chatbots or RAG-based customer support, open source frameworks like LangGraph are invaluable. They support scalable automation, dependable routing and enable rapid iteration cycles. These technologies also make it easier to power generative AI systems, debug behaviors and optimize and configure complex workflows. By carefully defining test cases and keeping an eye on observability metrics like computation cost, price and latency, teams can consistently improve system performance. Ultimately, applying a reliable and repeatable evaluation approach brings rigor to machine learning systems and strengthens their trustworthiness over time.