Date: 2025-07-10
An AI agent refers to a software system capable of autonomously carrying out tasks on behalf of a user or another system, by developing its own workflow and utilizing external tooling as required.
Agents go well beyond simple language processing and understanding. They can make decisions, solve problems, interact with their environment, and act in pursuit of goals.
AI agents are now being incorporated into a variety of enterprise solutions, from IT automation and software engineering to conversational interfaces and code generation. Driven by large language models (LLMs), they can comprehend complex instructions, decompose them into steps, interact with external resources, and determine when to use particular tools or services to accomplish a task.
Agent evaluation is an important step in creating and deploying autonomous AI systems because it measures how well an agent performs its assigned tasks, makes decisions, and interacts with users or environments. Evaluation helps ensure that agents operate reliably, efficiently, and ethically in their intended use cases.
Key reasons for agent evaluation include:
- Functional Verification: Verifies the agent's behaviors and actions under given conditions, as well as the completion of its objectives within defined constraints.
- Design Optimization: Identifies the shortcomings and inefficiencies in the agent's reasoning, planning, or tool use, allowing us to iteratively improve the agent's architecture and flow.
- Robustness: Evaluates the agent's ability to handle edge cases, adversarial inputs, or sub-optimal conditions, which can improve fault tolerance and resiliency.
- Performance and Resource Metrics: Tracks latency, throughput, token consumption, memory, and other system metrics so that we can assess runtime efficiency and minimize operational costs.
- User Interaction Quality: Measures the clarity, helpfulness, coherence, and relevance of the agent's responses as an indicator of user satisfaction or conversational effectiveness.
- Goal Completion Analysis: Uses success criteria or task-specific benchmarks to assess how reliably and accurately the agent completes its goals.
- Ethical and Safety Considerations: The outputs of the agent can be evaluated for fairness, bias, potential harm, and adherence to any safety procedures.
An AI agent's performance is assessed with metrics grouped into three broad classes: accuracy, response time (speed), and resource cost. Accuracy describes how well the agent gives correct and relevant responses and how well it completes its intended functions. Response time measures how quickly the agent processes input and produces output; minimizing latency is especially important in interactive and real-time applications. Cost measures the computational resources the agent consumes, such as tokens, API calls, or system time. Together, these metrics guide improvements to system performance and help limit operational costs.
Accuracy covers key metrics such as correctness, helpfulness, and coherence. Response time (latency) covers throughput, average latency, and timeout delay. Cost covers token usage, compute time, API call count, and memory consumption.
In this tutorial, we will explore the key metrics that fall under accuracy: correctness, helpfulness, and coherence.
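To make the response time and cost classes concrete, here is a minimal sketch of how latency, throughput, call count, and token usage could be recorded around any agent call. The names `CallRecord`, `timed_call`, `summarize`, and `echo_agent` are illustrative assumptions, not part of this tutorial's notebook, and the whitespace-based token count is only a crude stand-in for a real tokenizer.

```python
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallRecord:
    latency_s: float   # wall-clock time for one agent call
    total_tokens: int  # rough token count (prompt + response)


def approx_tokens(text: str) -> int:
    # Crude approximation: count whitespace-separated words.
    return len(text.split())


def timed_call(agent_fn, prompt: str) -> tuple[str, CallRecord]:
    # Time a single call and record its approximate token usage.
    start = time.perf_counter()
    response = agent_fn(prompt)
    latency = time.perf_counter() - start
    tokens = approx_tokens(prompt) + approx_tokens(response)
    return response, CallRecord(latency, tokens)


def summarize(records: list[CallRecord]) -> dict:
    # Aggregate per-call records into latency and cost metrics.
    total_time = sum(r.latency_s for r in records)
    return {
        "api_call_count": len(records),
        "average_latency_s": mean(r.latency_s for r in records),
        "throughput_calls_per_s": (
            len(records) / total_time if total_time > 0 else float("inf")
        ),
        "total_tokens": sum(r.total_tokens for r in records),
    }


def echo_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real agent invocation.
    return f"Echo: {prompt}"


records = [timed_call(echo_agent, p)[1] for p in ["Plan a trip", "Find a hotel in Goa"]]
summary = summarize(records)
print(summary)
```

In a real setup, `echo_agent` would be replaced by the actual agent call, and token counts would more reliably come from the usage metadata that the model provider returns.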
- Correctness: Correctness assesses whether the agent’s responses are factually accurate and follow logically from the input prompt or task. This is often the most fundamental measure, particularly in fields such as healthcare, legal advice, or technical support.
- Helpfulness: Helpfulness assesses how useful or actionable the agent’s response is for the user’s intent. Even if a response is factually correct, it may not be helpful if it does not address a solution or next steps.
- Coherence: Coherence concerns flow, both logical and narrative. It refers to whether the agent “makes sense” from start to finish, and it is particularly important in multi-turn interactions and in interactions where reasoning spans multiple steps.
You will develop a travel agent and evaluate its performance using an LLM-as-a-judge.
You need an IBM Cloud account to create a watsonx.ai project.
You also need Python version 3.12.7.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud® account.
Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community.
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an application programming interface (API) key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
We need a few libraries and modules for this tutorial. Make sure to import the following ones and if they're not installed, a quick pip installation resolves the problem.
Note, this tutorial was built using Python 3.12.7.
To set our credentials, we need the WATSONX_APIKEY and WATSONX_PROJECT_ID you generated in step 1. We will also set the URL serving as the API endpoint. Your API endpoint may differ depending on your geographical location.
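One way to handle this step is to read the credentials from environment variables and fail early if any are missing. The sketch below is an assumption about how you might structure it, not the tutorial's original cell: the helper name `load_watsonx_credentials`, the optional `WATSONX_URL` variable, and the default us-south endpoint are illustrative, so substitute the endpoint for your own region.

```python
import os
from typing import Mapping


def load_watsonx_credentials(env: Mapping[str, str] = os.environ) -> dict:
    # Both values come from the IBM Cloud setup steps earlier in the tutorial.
    required = ("WATSONX_APIKEY", "WATSONX_PROJECT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "apikey": env["WATSONX_APIKEY"],
        "project_id": env["WATSONX_PROJECT_ID"],
        # Assumed default endpoint; use the URL that matches your region.
        "url": env.get("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
    }


# Demo with placeholder values instead of real secrets.
demo_env = {"WATSONX_APIKEY": "my-api-key", "WATSONX_PROJECT_ID": "my-project-id"}
credentials = load_watsonx_credentials(demo_env)
print(sorted(credentials))
```

Keeping secrets in the environment rather than in the notebook avoids committing them by accident when the notebook is shared.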
We will use the Granite 3-8B Instruct model for this tutorial. To initialize the LLM, we need to set the model parameters. To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
Let's build a Travel Explorer Buddy that helps users with trip planning and travel research.
We will create a simple travel assistant application that retrieves airline and hotel information in response to user inquiries by connecting to an external travel API. To make this capability available to an AI agent for dynamic travel planning, we will write a straightforward function that makes the API queries and then wrap it in a tool.
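As an illustration of what setting the model parameters can look like, the sketch below builds a plain parameter dictionary with basic validation of the token limits. The model identifier string, the helper name `build_generation_params`, and the default values are assumptions for illustration; check the watsonx.ai documentation for the identifiers and parameter names supported in your environment before passing them to a client.

```python
# Assumed model identifier; confirm the exact ID available in your project.
MODEL_ID = "ibm/granite-3-8b-instruct"


def build_generation_params(
    min_new_tokens: int = 1,
    max_new_tokens: int = 500,
    temperature: float = 0.0,
) -> dict:
    # Reject inconsistent token limits before they reach the model.
    if not 0 <= min_new_tokens <= max_new_tokens:
        raise ValueError("min_new_tokens must be between 0 and max_new_tokens")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        # Greedy decoding gives repeatable output, which helps evaluation.
        "decoding_method": "greedy" if temperature == 0 else "sample",
        "min_new_tokens": min_new_tokens,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }


params = build_generation_params(max_new_tokens=300)
print(MODEL_ID, params)
```

A temperature of zero is a common choice when the same prompts will be scored repeatedly, since it reduces run-to-run variation in the agent's answers.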
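The pattern of writing a function that queries an API and wrapping it in a tool can be sketched without any agent framework. Everything here is hypothetical: `fake_travel_api` stands in for the external travel API with canned data, and the small `Tool` dataclass only mimics the name, description, and callable that agent frameworks typically expect from a tool.

```python
import json
from dataclasses import dataclass
from typing import Callable


def fake_travel_api(kind: str, city: str) -> list[dict]:
    # Hypothetical stand-in for an external travel API; returns canned data.
    catalog = {
        ("hotel", "jaipur"): [{"name": "Example Palace Hotel", "price_usd": 120}],
        ("flight", "jaipur"): [{"airline": "Example Air", "price_usd": 210}],
    }
    return catalog.get((kind, city.lower()), [])


@dataclass
class Tool:
    name: str
    description: str  # the agent reads this to decide when to call the tool
    func: Callable[[str], str]

    def run(self, tool_input: str) -> str:
        return self.func(tool_input)


def travel_lookup(tool_input: str) -> str:
    # Expects input such as "hotel: Jaipur" and returns a JSON string.
    kind, _, city = tool_input.partition(":")
    results = fake_travel_api(kind.strip().lower(), city.strip())
    return json.dumps({"query": tool_input, "results": results})


travel_tool = Tool(
    name="travel_lookup",
    description="Look up hotels or flights. Input format: '<hotel|flight>: <city>'.",
    func=travel_lookup,
)

print(travel_tool.run("hotel: Jaipur"))
```

Returning a JSON string keeps the tool output easy for an LLM to read back, and the description doubles as the instruction the agent uses to format its tool input.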
Finally, we will run the evaluation and print the final evaluation score. To evaluate the trip planner on three distinct criteria (Correctness, Helpfulness, and Coherence), we develop a structured evaluation prompt for an evaluator LLM.
The output shows both a qualitative and a quantitative assessment of the generated travel plan against the three criteria: Correctness, Helpfulness, and Coherence.
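A structured evaluation prompt for an LLM-as-a-judge might look like the following sketch. The prompt wording, the 1 to 5 scale encoded in the parser, and the helper names `build_eval_prompt` and `parse_scores` are assumptions for illustration, and the judge reply is a hard-coded example rather than real model output.

```python
import json
import re

CRITERIA = ("Correctness", "Helpfulness", "Coherence")


def build_eval_prompt(user_query: str, agent_response: str) -> str:
    # Ask the judge for one integer score per criterion, returned as JSON.
    criteria_lines = "\n".join(f"- {name}: integer score from 1 to 5" for name in CRITERIA)
    return (
        "You are an impartial evaluator of an AI travel assistant.\n"
        f"User query:\n{user_query}\n\n"
        f"Assistant response:\n{agent_response}\n\n"
        "Rate the response on each criterion:\n"
        f"{criteria_lines}\n\n"
        "Return only a JSON object, for example: "
        '{"Correctness": 5, "Helpfulness": 4, "Coherence": 5}'
    )


def parse_scores(judge_output: str) -> dict:
    # Pull the JSON object out of the reply, tolerating surrounding text.
    match = re.search(r"\{.*\}", judge_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge output")
    raw = json.loads(match.group(0))
    scores = {}
    for name in CRITERIA:
        value = int(raw[name])
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    return scores


prompt = build_eval_prompt("Plan a winter trip to India.", "Day 1: Arrive in Jaipur ...")
# Hard-coded example reply; a real run would send `prompt` to the evaluator LLM.
example_reply = 'Here is my evaluation: {"Correctness": 5, "Helpfulness": 4, "Coherence": 5}'
scores = parse_scores(example_reply)
print(scores)
```

Requesting JSON and validating the range makes the judge's output machine-checkable, so a malformed or out-of-scale reply fails loudly instead of silently skewing the results.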
Let's break down what each score and metric means in the context of the agent's output:
Correctness tells us how factually accurate and logically sound the response is. In the above example, the factual content is correct, hence the correctness score is 5 out of 5.
Helpfulness shows how useful and pertinent the response is to the user's needs. A score of 5 out of 5 in this scenario means the AI travel plan is very useful and thoughtfully designed for someone visiting the best places in India during winter for the first time.
Coherence reflects how logically organized and easy to follow the response is. A score of 5 out of 5 means the travel plan is well structured and easy to read from start to finish.
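The three per-criterion scores can be combined into a single final score. The sketch below uses a simple unweighted average, which is an assumption rather than the tutorial's stated method; the helper name `final_score` and the review threshold of 4 are likewise illustrative. The input values are the scores reported above.

```python
from statistics import mean


def final_score(scores: dict, max_score: int = 5, review_below: int = 4) -> dict:
    # Unweighted average across criteria, plus a list of weak criteria.
    average = mean(scores.values())
    return {
        "average": round(average, 2),
        "normalized": round(average / max_score, 2),
        "needs_review": [name for name, value in scores.items() if value < review_below],
    }


# Scores reported for the travel plan in this tutorial.
result = final_score({"Correctness": 5, "Helpfulness": 5, "Coherence": 5})
print(result)
```

If one criterion matters more for your use case, such as correctness in a regulated domain, a weighted average would be a reasonable alternative to the unweighted mean.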
When evaluating an agent’s ability to truly meet user needs, criteria like correctness, helpfulness, and coherence play a central role. Regardless of whether you're working with OpenAI, IBM's Granite, or other LLM-as-a-service models, it's crucial to rely on structured evaluation methods (such as evaluation datasets, benchmarks, annotations, and ground truth) to thoroughly test final outputs. In practical use cases like chatbots or RAG-based customer support, open source frameworks like LangGraph are valuable because they support scalable automation, dependable routing, and rapid iteration cycles. They also make it easier to power generative AI systems, debug behaviors, and optimize and configure complex workflows. By carefully defining test cases and monitoring observability metrics such as compute cost, price, and latency, teams can consistently improve system performance. Ultimately, applying a reliable and repeatable evaluation approach brings rigor to machine learning systems and strengthens their trustworthiness over time.