Testing and evaluating a draft agent
Evaluate how your draft agent responds to real user utterances before it goes live. Use the Test option to upload test cases, run targeted or full evaluations, and review detailed results.
Testing helps confirm that recent changes to tools, collaborators, or knowledge produce the expected responses. You can iterate faster by running only relevant cases for small updates and full evaluations when validating end-to-end behavior.
Evaluating the agent before deployment helps you fine-tune its behavior so that it aligns with business goals and delivers consistent, measurable outcomes.
Before you begin
Add the tools, collaborators, and knowledge your agent needs before testing. Because tests run against your current agent configuration, preparing these inputs ensures that each prompt is evaluated in a realistic scenario. This setup helps simulate actual user interactions and improves the accuracy of your evaluation results.
Make sure that your test cases reflect realistic scenarios to effectively assess how the agent performs when deployed.
Accessing the test option
To validate your agent's responses before deployment, you can access the test option from the agent configuration page:
- Go to the agent configuration page.
- Click the overflow menu next to Deploy.
- Select Test.
The "Manage test cases and evaluations" page opens, where you can upload test cases, run targeted or full evaluations, and review detailed results.
Prepare test cases
Before running evaluations, prepare a .csv file (maximum size: 5 MB) containing test cases for your agent. To get started:
- Click Upload tests > Download .CSV template to download a sample file.
- Each row in the .csv file must include one Prompt (the user utterance) and one Answer (the expected agent response).
This structure helps ensure that your test cases are formatted correctly and reflect realistic interaction scenarios.
Sample test cases
Use the following format in your .csv file:
Prompt,Answer
"What is the capital of France?","Paris"
"List three healthcare providers.","Provider A, Provider B, Provider C"
The uploader validates the .csv format and supports files up to 5 MB. Upload one file at a time to keep evaluations focused and traceable.
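If you maintain test cases programmatically, a short script can produce a file in this format. The following Python sketch is illustrative: the file name and rows are placeholders, and only the Prompt and Answer columns and the 5 MB limit come from the requirements described above.

```python
import csv
import os

# Placeholder test cases; replace with prompts and expected answers for your agent.
test_cases = [
    ("What is the capital of France?", "Paris"),
    ("List three healthcare providers.", "Provider A, Provider B, Provider C"),
]

# Write the header row and one Prompt/Answer pair per row, as in the template.
with open("agent_test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Prompt", "Answer"])
    writer.writerows(test_cases)

# The uploader accepts files up to 5 MB, so confirm the size before uploading.
size_mb = os.path.getsize("agent_test_cases.csv") / (1024 * 1024)
print(f"{size_mb:.2f} MB - {'within the limit' if size_mb <= 5 else 'too large to upload'}")
```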
Uploading tests
After preparing your .csv file, upload your test cases by using the following steps:
- Click Upload tests on the test management page.
- Click or drag your .csv file into the upload box.
- Click Upload to confirm.
- During the upload, the system validates the file format and checks that the file is within the 5 MB size limit. Any previously uploaded test cases are deselected automatically; only the newly uploaded test cases are selected for evaluation.
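The uploader performs these checks for you, but you can also run a quick local pre-check before uploading. The sketch below assumes a Python 3.9+ environment and checks only the documented constraints: a Prompt,Answer header, a value in each column, and the 5 MB size limit.

```python
import csv
import os

MAX_SIZE_BYTES = 5 * 1024 * 1024  # the uploader rejects files larger than 5 MB

def check_test_file(path: str) -> list[str]:
    """Return any problems found in a test-case .csv before you upload it."""
    problems = []
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        problems.append("File exceeds the 5 MB upload limit.")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["Prompt", "Answer"]:
            problems.append("Header row must be exactly: Prompt,Answer")
        for line_number, row in enumerate(reader, start=2):
            if not row.get("Prompt") or not row.get("Answer"):
                problems.append(f"Row {line_number} is missing a Prompt or an Answer.")
    return problems

issues = check_test_file("agent_test_cases.csv")
print("Ready to upload." if not issues else "\n".join(issues))
```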
Managing test cases
After uploading your .csv file, you can view and manage the test cases that you imported. The Test cases table displays all available prompts and provides options to run, delete, or deselect them individually or in bulk.
Use the following actions to manage your test cases:
| Action | Description |
|---|---|
| Run | Run only the selected test cases. |
| Delete | Remove the selected test cases if you no longer need them. |
| Cancel | Deselect all selected prompts. |
After you click Cancel, you can use Run all to select and run all available test cases, including newly added and previously uploaded ones.
Viewing and organizing test cases
The Test cases table includes features to help you efficiently view, organize, and navigate large sets of test cases.
Table columns:
| Column | Description |
|---|---|
| Prompt | Displays the questions from your .csv file. |
| Date created | Shows when the .csv file was uploaded. |
| Last run | Indicates the last time that the test case was run. |
Table features:
| Feature | Description |
|---|---|
| Search | Quickly locate specific test cases, useful when working with large datasets. (Note: Search is only available when no test cases are selected.) |
| Sort | Click column headers to sort by Prompt, Date created, or Last run. |
| Pagination | View up to 5 prompts per page for better navigation. |
These features help streamline the review process, making it easier to manage extensive test sets and focus on the most relevant cases.
Running a test evaluation
After your test cases are uploaded and selected, click Run to start the evaluation. You can choose to run targeted test cases or a full evaluation, depending on your testing goals. The results appear in the Evaluations table after the process is complete.
- While an evaluation is in progress, the Test cases table is temporarily disabled to prevent conflicting changes. It is re-enabled after the evaluation completes.
- Evaluations might take up to 10 minutes, depending on the number of test cases and overall system load.
- Use targeted evaluations for quick checks after small updates, and full evaluations to validate end-to-end behavior.
Reviewing evaluation results
Each evaluation appears as a row in the Evaluations table and includes key details to help you track and analyze test outcomes:
- Date evaluated: Indicates when the evaluation was initiated.
- Evaluation status: Displays the current state of the evaluation: in progress, completed, or error.
- Number of tests: Shows how many prompts were included in the evaluation.
- Run by: Identifies the user who initiated the evaluation.
- Trash icon: Deletes all evaluations. Use the checkbox to delete evaluations individually.
- Download: Export your evaluation report in .csv format for further analysis or record keeping.
You can download reports to review results offline, which helps you track performance over time and identify areas for improvement.
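For example, you can summarize a downloaded report with a few lines of Python. The file name and the "Status" column used here are assumptions; adjust them to match the headers in your exported .csv file.

```python
import csv
from collections import Counter

# "evaluation_report.csv" and the "Status" column are assumed names;
# change them to match your downloaded report.
with open("evaluation_report.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

status_counts = Counter(row.get("Status", "unknown") for row in rows)
print(f"Prompts evaluated: {len(rows)}")
for status, count in status_counts.most_common():
    print(f"  {status}: {count}")
```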
Analyzing evaluation metrics
To analyze evaluation metrics, select an evaluation by clicking the timestamp under Date evaluated in the Evaluations table. This action opens a detailed dashboard view that shows individual test results:
- Pass and Fail status for each prompt:
  - Pass: The agent successfully processed the prompt and returned an output.
  - Fail: The prompt encountered an error during execution.
- Prompts that need attention:
  - Prompts flagged under Answer quality, Tool call, or Message completion can indicate areas where the agent’s configuration needs refinement.
Use focused rerun to validate specific changes, helping you iterate faster and improve accuracy without rerunning the entire test set.
Answer quality
Answer quality metrics assess the degree to which the agent’s responses align with user expectations. Evaluation is based on the following criteria:
- Faithfulness: Measures how accurately the output reflects and stays grounded in the provided context or source information.
- Relevance: Assesses how relevant the answer is to the user’s question.
- Correctness: Measures how closely the generated text matches the reference answer, based on the ground truth provided in the test .csv file.
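As a rough illustration of grading against ground truth, the sketch below scores a generated answer by token overlap with the reference Answer from the test .csv file. This is a deliberately simplified stand-in, not the scoring method that the evaluation actually uses.

```python
import re

def naive_correctness(generated: str, reference: str) -> float:
    """Fraction of reference-answer tokens that also appear in the generated answer.
    A simplified stand-in for a correctness score, for illustration only."""
    generated_tokens = set(re.findall(r"[a-z0-9]+", generated.lower()))
    reference_tokens = set(re.findall(r"[a-z0-9]+", reference.lower()))
    if not reference_tokens:
        return 0.0
    return len(generated_tokens & reference_tokens) / len(reference_tokens)

print(naive_correctness("The capital of France is Paris.", "Paris"))  # 1.0
print(naive_correctness("I am not sure.", "Paris"))                   # 0.0
```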
Tool quality
The evaluation considers both the set of tools available to the agent and the specific tool calls that were run. The two metrics used to assess tool calls are as follows:
- Accuracy: This metric performs syntactic validation of tool calls. It checks whether the tools were called with the correct parameters and structure.
- Relevance: This metric assesses whether the selected tool addresses the user's immediate request, based on the context of the conversation. It evaluates the alignment between the tool call and user intent by comparing it against all available tools. A large language model (LLM) conducts the assessment.
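To illustrate the kind of syntactic check that the Accuracy metric describes, the sketch below validates a tool call against a simple parameter schema. The tool name, schema format, and example calls are hypothetical; the actual evaluation logic may differ.

```python
# Hypothetical tool schemas: each entry lists required and optional parameter names.
TOOL_SCHEMAS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def validate_tool_call(name: str, arguments: dict) -> list[str]:
    """Return structural problems with a tool call, or an empty list if it looks valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"Unknown tool: {name}"]
    problems = []
    missing = schema["required"] - arguments.keys()
    unexpected = arguments.keys() - schema["required"] - schema["optional"]
    if missing:
        problems.append(f"Missing required parameters: {sorted(missing)}")
    if unexpected:
        problems.append(f"Unexpected parameters: {sorted(unexpected)}")
    return problems

print(validate_tool_call("get_weather", {"city": "Paris"}))      # []
print(validate_tool_call("get_weather", {"location": "Paris"}))  # missing 'city', unexpected 'location'
```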
Message completion
This metric evaluates how successfully the agent completes messages at run time. It helps you assess the reliability and stability of your agent’s responses.
- Success: The number of messages that ran successfully without any exceptions.
- Failed: The number of messages that failed during execution and returned an error.
You can also download the evaluation report in .csv format for deeper analysis.
What to do next
After reviewing evaluation results and identifying areas for improvement, refine tools, update knowledge, and adjust configurations as needed. Use focused rerun to validate specific changes and continue iterating to enhance agent performance. Regular analysis helps your agent become more accurate, reliable, and aligned with business goals over time.