Evaluate with simulated users

Test and validate agent behavior before deployment. Record conversations from preview chat as test cases, run evaluations in which an LLM-based user simulator drives multi-turn conversations against your agent to reproduce realistic user behavior, and analyze comprehensive metrics to identify issues and improve performance.

Important:

Agent evaluation is not available in isolated tenants in IBM Cloud.
When full redaction is enabled, draft evaluation is disabled. The test experience depends on trace data, and without accessible trace details, evaluations cannot run reliably. For more information, see Monitoring agents.
AI-generated responses from your agent can vary. Validate the responses before processing.

Evaluation workflow

Use the following steps to evaluate and improve a draft agent:

Recording test cases from preview chat: Save conversations from preview chat as test cases.
Running evaluations: Evaluate all test cases to verify agent behavior.
Reviewing evaluation results: Check success rates and detailed metrics.
Iterating and improving: Refine tools, update knowledge, and rerun evaluations.

Recording test cases from preview chat

Before you begin testing, add the tools, collaborators, and knowledge your agent needs. Because evaluations use your agent's current configuration, preparing these inputs ensures that each test case is evaluated in a realistic scenario.

Record conversations from preview chat as test cases to quickly build your evaluation suite. The system automatically extracts conversation details, reducing manual test creation effort.

Multi-turn conversations are supported. If your agent asks follow-up questions, continue the conversation; all turns are captured in the test case.

Recording a test case

To record a test case:

Open the preview chat for your agent.
Have a conversation that represents your test scenario. If the agent asks follow-up questions, continue the conversation; all turns are captured.
Click Save as test or the Save icon in the preview chat.
Review the auto-extracted information:
- Name: Test case identifier, typically the first user message.
- Response summary: Auto-generated description of the conversation flow and expected behavior.
- Test condition: Tool calls: Tools called during the conversation. Each tool call can have multiple parameters, and you can configure match types for each parameter individually:
  - Exact: Parameter value must match precisely
  - Fuzzy: Parameter value can differ if it is numerically, date-wise, or semantically similar (for example, "March 30, 2024" and "30/03/2024" are treated as a match)
  - Ignored: Parameter value is not evaluated during test execution
Optional: Click Advanced options to customize additional fields:
- Conversation context: Description of the test scenario and what the agent should accomplish.
- Starting phrase: The initial user input that begins the conversation.
- Keywords: Specific words that must appear in the agent's responses for the test to pass.
Click Save.

A confirmation message appears with the View tests link above the preview chat. Click View tests to open the Test agent page and manage your test cases.

Figure 1. Record test cases from conversations
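
The three parameter match types from the recording steps can be illustrated with a short Python sketch. This is not the product's implementation: the names MatchType and parameter_matches, the list of date formats, and the case-insensitive fallback that stands in for semantic similarity are all assumptions made for illustration, and parameter values are assumed to be strings.

```python
from datetime import datetime
from enum import Enum


class MatchType(Enum):
    """Hypothetical names for the three match types described above."""
    EXACT = "exact"
    FUZZY = "fuzzy"
    IGNORED = "ignored"


# Assumed set of date formats that the fuzzy comparison tries, in order.
_DATE_FORMATS = ("%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d")


def _as_date(value):
    """Return a date if the string matches a known format, else None."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def _as_number(value):
    """Return a float if the string is numeric, else None."""
    try:
        return float(value)
    except ValueError:
        return None


def parameter_matches(expected, actual, match_type):
    """Compare one expected parameter value with the value the agent used."""
    if match_type is MatchType.IGNORED:
        return True  # the value is not evaluated at all
    if match_type is MatchType.EXACT:
        return expected == actual  # must match precisely
    # FUZZY: compare as dates, then as numbers, then as normalized text.
    expected_date, actual_date = _as_date(expected), _as_date(actual)
    if expected_date is not None and actual_date is not None:
        return expected_date == actual_date
    expected_num, actual_num = _as_number(expected), _as_number(actual)
    if expected_num is not None and actual_num is not None:
        return expected_num == actual_num
    # Stand-in for semantic similarity, which this sketch does not model.
    return expected.strip().lower() == actual.strip().lower()
```

With these assumptions, parameter_matches("March 30, 2024", "30/03/2024", MatchType.FUZZY) returns True, while the same pair fails under MatchType.EXACT. That difference is the usual reason to switch a date or numeric parameter from Exact to Fuzzy.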

Accessing the Test agent page

After recording test cases from preview chat, access the Test agent page to manage your tests, run evaluations, and review results.

To access the Test agent page:

Click the View tests link that appears above the preview chat after saving a test case, or
Click Test agent above the preview chat.

Managing test cases

After you create test cases, you can view and manage them in the Tests section of the Test agent page. Each test case displays the test name, a preview of the expected response, last run time, and last modified information.

Sorting test cases

You can sort test cases by clicking the Sort by dropdown and selecting Recently added or Recently updated.

Managing individual test cases

For every test case, the options menu offers the following choices:

Table 1. Options
Option	Description
Run test	Run only this specific test case
Edit test	Modify test case details such as name, context, starting phrase, and keywords
Modify JSON	Edit the test case configuration in JSON format for advanced customization
Delete	Remove the test case

Creating additional test cases

To create additional test cases, click Create test +.

Running evaluations

After you create test cases, run evaluations where an LLM-based user simulator plays the role of a real user so you can measure how your agent behaves in realistic interactions.

To evaluate all test cases, go to the Test agent page and click Evaluate all. To run a single test case, click the Options icon next to the test case and select Run test.

Monitor progress in the Evaluation results section. Each test case runs, and the results appear in the Evaluation results table.

Understanding pass or fail criteria

A test case passes when all configured conditions are met:

Tool calls: All expected tools are called with correct parameters (based on match type configuration)
Keywords: All specified keywords appear in the agent's response
Semantic match: The response is semantically similar to the expected answer

A test case fails if any configured condition is not met. See Analyzing evaluation metrics to identify which specific conditions failed.
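
The pass or fail rule above is a logical AND over the configured conditions. The following Python sketch shows that rule only. The ConditionResults class and evaluate function are hypothetical names, and treating a test with no configured conditions as a pass is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConditionResults:
    """Outcome of each condition; None means it is not configured."""
    tool_calls: Optional[bool] = None
    keywords: Optional[bool] = None
    semantic_match: Optional[bool] = None


def evaluate(results):
    """Return (passed, failed_conditions) for one test case."""
    # Only conditions that are configured take part in the decision.
    configured = {name: ok for name, ok in vars(results).items() if ok is not None}
    failed = [name for name, ok in configured.items() if not ok]
    # The test passes only when no configured condition failed.
    return (not failed, failed)
```

For example, evaluate(ConditionResults(tool_calls=True, keywords=False)) returns (False, ["keywords"]): the semantic match is not configured and is ignored, and the single failed condition is enough to fail the test.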

Troubleshooting test failures

If a test fails unexpectedly:

Tool call mismatches: Check if parameters need Fuzzy matching instead of Exact. Use the View traces option to see actual tool calls made.
Keyword mismatches: Verify keywords are spelled correctly and appear in the expected response format.
Semantic mismatches: Review the response summary to ensure the expected behavior is clearly defined.
Configuration changes: If you modified tools or knowledge, rerun evaluations to test against the current agent configuration.

Note:

While an evaluation is in progress, the Test cases section remains temporarily disabled. The system re-enables it after the evaluation finishes.
Your evaluation may take up to 10 minutes, depending on the number of test cases and overall system load.

Reviewing evaluation results

Each evaluation appears as a row in the Evaluation results section, with key details to help you track and analyze your test outcomes. You can search for specific evaluations using the search box.

Table 2. Evaluation results
Property	Description
Date	When the evaluation started
Success rate	Percentage of tests that passed
Successful tests	Number of tests that passed
Total tests	Total number of tests run
Run by	User who started the evaluation
Download	Export your evaluation report in CSV format

You can select an evaluation result, then click the Download button to export the report in CSV format or click the Delete button to remove it.
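
The Success rate column relates the Successful tests and Total tests columns. As a minimal sketch, assuming the rate is the plain percentage of passed tests and that an evaluation with no tests reports 0 (the function name and the zero-test behavior are assumptions):

```python
def success_rate(successful_tests, total_tests):
    """Return the percentage of tests that passed, or 0.0 when no tests ran."""
    if not 0 <= successful_tests <= total_tests:
        raise ValueError("successful_tests must be between 0 and total_tests")
    if total_tests == 0:
        return 0.0
    return 100.0 * successful_tests / total_tests
```

For example, an evaluation in which 7 of 10 tests passed gives success_rate(7, 10), which is 70.0.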

To view detailed results:

Click the date link in the Evaluation results table.
Review the comprehensive metrics dashboard showing overall performance and individual test results.

The detailed results page displays the agent name and description at the top, followed by overall metrics cards and individual test results. Each test result row includes an overflow menu with options to Run test or View traces for debugging.

Tip:

Use View traces to see the exact tool calls, parameters, and responses generated during test execution. This is essential for debugging tool call mismatches.

Analyzing evaluation metrics

To analyze evaluation metrics, select an evaluation by clicking the timestamp under Date in the Evaluation results table.

Performance metrics

Performance metrics for the selected test case:

Table 3. Performance metrics
Metric	Description
Orchestrate Agent Routing F1	F1 score for agent routing. Range: 0-1.
Total Steps	Number of conversation turns.
LLM Steps	Number of language model invocations.
Average Agent Response Time (s)	Response time in seconds.
Total Tool Calls	Number of tool invocations.
Tool Call Recall	Completeness score. Range: 0-1.
Tool Call Precision	Accuracy score. Range: 0-1.
Expected Tool Calls	Number of tool calls expected.
Correct Tool Calls	Number of correctly executed tool calls.
Missed Tool Calls	Number of expected tool calls not executed.
Tool Calls with Incorrect Parameters	Number of tool calls with wrong parameters.
Text Match	Semantic similarity score between the agent's response and expected response. Range: 0-1.
Journey Completion	Whether the conversation journey was completed.
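
The table does not state how Tool Call Recall and Tool Call Precision are calculated. The sketch below assumes the standard definitions: recall is correct tool calls divided by expected tool calls, and precision is correct tool calls divided by total tool calls made. The function names are hypothetical, and the product might compute these scores differently.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 instead of failing when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def tool_call_scores(expected_calls, total_calls, correct_calls):
    """Compute recall and precision from the tool call counts in Table 3."""
    # Recall: how many of the expected calls the agent got right.
    recall = safe_ratio(correct_calls, expected_calls)
    # Precision: how many of the calls the agent made were right.
    precision = safe_ratio(correct_calls, total_calls)
    return {"recall": recall, "precision": precision}
```

Under these assumptions, a test case with 4 expected tool calls, 5 total tool calls, and 3 correct tool calls has a recall of 0.75 and a precision of 0.6. Low recall points to missed tool calls, while low precision points to extra or incorrect calls.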

Pass or fail indicators

Pass or fail indicators for the selected test case:

Table 4. Pass or fail indicators
Metric	Description
Tool Match Success	Whether correct tools were selected (Yes/No).
Keyword Match	Whether expected keywords appear in response (Yes/No).
Journey Success	Whether the agent completed the expected conversation flow (Yes/No).
Semantic Match	Whether response is semantically similar to expected (Yes/No).

Accessing the Legacy experience

The previous evaluation interface (CSV upload) is available for 30 days after the feature release. To access it, go to the Test agent page and click Legacy experience in the top right.

For more information, see Evaluate with CSV files.

What to do next

After you review evaluation results and identify improvement areas, take action by refining tools, updating knowledge, and adjusting configurations.