Evaluate with CSV files

Upload CSV files with test cases to evaluate your draft agent before deployment.

Important:

This evaluation experience will be deprecated soon. Use the new evaluation experience to create and manage test cases.

Overview

The CSV upload evaluation experience lets you test your agent by uploading test cases in CSV format. You can run targeted or full evaluations and review detailed results to identify issues before deployment.

Key capabilities:

  • Upload CSV files with prompts and expected answers

  • Run targeted or full evaluations

  • Review pass/fail status and quality metrics

  • Download evaluation reports

Before you begin

Add the tools, collaborators, and knowledge your agent needs. Evaluations use your agent's current configuration, so preparing these inputs ensures realistic testing.

Important:
  • Agent evaluation is not available in isolated tenants in IBM Cloud.

  • When full redaction is enabled, draft evaluation is disabled. The test experience depends on trace data, and without accessible trace details, evaluations cannot run reliably. For more information on trace details, see Monitoring agents.

  • AI-generated responses from your agent can vary. Validate the responses before you act on them.

Accessing the test option

To access the test option:

  1. Go to the agent configuration page.

  2. Click Test agent > Legacy experience.

The Manage test cases and evaluations page opens.

Preparing test cases

Create a CSV file (maximum size: 5 MB) that contains test cases for your agent.

To prepare test cases:

  1. Click Upload tests > Download .CSV template to download a sample file.

  2. For each test case, add:

    • One Prompt that represents a realistic user utterance

    • One Answer that defines the expected agent response

Example CSV format:

Prompt,Answer
"What is the capital of France?","Paris"
"List three healthcare providers.","Provider A, Provider B, Provider C"

Uploading test cases

To upload test cases:

  1. Click Upload tests on the test management page.

  2. Click or drag your CSV file into the upload box.

  3. Click Upload to confirm.

The system validates the file format and checks that it meets the 5 MB size limit. If you uploaded files earlier, the system keeps them available but automatically deselects them. Only the newly uploaded test cases remain selected for evaluation.
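The service performs these checks server-side. A local pre-check along the same lines can catch problems before you upload; the function name and the exact header requirement are assumptions, while the .csv format and 5 MB limit come from this page:

```python
import csv
import os

MAX_BYTES = 5 * 1024 * 1024  # the documented 5 MB upload limit

def precheck(path):
    """Illustrative pre-upload check: extension, size limit, header row."""
    errors = []
    if not path.lower().endswith(".csv"):
        errors.append("not a .csv file")
    if os.path.getsize(path) > MAX_BYTES:
        errors.append("file exceeds 5 MB")
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    if header[:2] != ["Prompt", "Answer"]:
        errors.append("missing Prompt,Answer header")
    return errors
```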

Note:

Upload one file at a time to keep each evaluation focused and traceable.

Managing test cases

After you upload your CSV file, you can view and manage the test cases in the Test cases table.

Available actions:

  • Run: Run only the selected test cases

  • Delete: Remove the selected test cases

  • Cancel: Deselect all selected prompts

  • Run all: Select and run all available test cases

Table features:

  • Search: Quickly locate specific test cases (available only when no test cases are selected)

  • Sort: Click column headers to sort by Prompt, Date created, or Last run

  • Pagination: View up to 5 prompts per page

Running evaluations

After you upload and select your test cases, click Run to start the evaluation.

Note:
  • While an evaluation is in progress, the Test cases table remains temporarily disabled. The system re-enables it after the evaluation finishes.

  • Your evaluation may take up to 10 minutes, depending on the number of test cases and overall system load.

Reviewing evaluation results

Each evaluation appears as a row in the Evaluations table.

  • Date evaluated: When you initiated the evaluation

  • Evaluation status: In progress, Completed, or Error

  • Number of tests: How many prompts you included in the evaluation

  • Run by: User who started the evaluation

  • Download: Export your evaluation report in CSV format

Analyzing evaluation metrics

To analyze evaluation metrics, select an evaluation by clicking the timestamp under Date evaluated. This opens a detailed dashboard that shows individual test results grouped into categories:

  • Pass: The agent successfully processed the prompt and returned an output

  • Fail: The prompt encountered an error during execution

  • Prompts that need attention: Prompts flagged under Answer quality, Tool call, or Message completion; these indicate areas where the agent's configuration may need refinement

Answer quality

Answer quality metrics evaluate how closely the agent's responses match user expectations:

  • Faithfulness: Measures how accurately the output reflects and stays grounded in the provided context or source information (default threshold: 0.70)

  • Relevance: Assesses how relevant the answer is to the user's question (default threshold: 0.70)

  • Correctness: Measures how closely the generated output matches the reference answer based on the ground truth in your CSV file (default threshold: 0.70)
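The default thresholds act as simple cutoffs: a prompt is flagged when any metric score falls below its threshold. A sketch of that logic; only the threshold values come from this page, and the score dictionary is illustrative:

```python
# Default thresholds for the answer-quality metrics described above.
THRESHOLDS = {"faithfulness": 0.70, "relevance": 0.70, "correctness": 0.70}

def needs_attention(scores):
    """Return the metric names whose score falls below its default threshold."""
    return [m for m, cutoff in THRESHOLDS.items() if scores.get(m, 0.0) < cutoff]

# Illustrative scores for one test case: relevance is below 0.70, so it is flagged.
flagged = needs_attention({"faithfulness": 0.91, "relevance": 0.65, "correctness": 0.88})
```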

Tool quality

Tool quality metrics evaluate both the tools available to the agent and the specific tool calls it executes:

  • Accuracy: Validates the syntax of tool calls, including parameter structure and correctness

  • Relevance: Assesses whether the selected tool addresses the user's request based on conversation context (default threshold: 0.80)

Message completion

Message completion evaluates how reliably the agent completes messages at run time:

  • Success: Number of messages that complete successfully without exceptions

  • Failed: Number of messages that fail during execution and return an error
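Message completion is a straightforward tally over run-time results. If you export the evaluation report, the counts could be recomputed along these lines; the per-message records and the status field name are assumptions for illustration:

```python
from collections import Counter

# Hypothetical per-message results; the "status" field name is an assumption.
messages = [
    {"prompt": "What is the capital of France?", "status": "success"},
    {"prompt": "List three healthcare providers.", "status": "success"},
    {"prompt": "Summarize my open tickets.", "status": "failed"},
]

counts = Counter(m["status"] for m in messages)
success, failed = counts["success"], counts["failed"]
```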

What to do next

After you review evaluation results and identify improvement areas, refine tools, update knowledge, and adjust configurations. Regular analysis helps you build an agent that becomes more accurate, reliable, and aligned with your business goals over time.

Consider migrating to the new evaluation experience for improved test case creation and management.