Evaluate with simulated users
Test and validate agent behavior before deployment. Record conversations from preview chat as test cases, run evaluations in which an LLM-based user simulator drives multi-turn conversations against your agent to reproduce realistic user behavior, and analyze comprehensive metrics to identify issues and improve performance.
-
Agent evaluation is not available in isolated tenants in IBM Cloud.
-
When full redaction is enabled, draft evaluation is disabled. The test experience depends on trace data, and without accessible trace details, evaluations cannot run reliably. For more information, see Monitoring agents.
-
AI-generated responses from your agent can vary. Validate the responses before processing.
Evaluation workflow
Use the following steps to evaluate and improve a draft agent:
-
Recording test cases from preview chat: Save conversations from preview chat as test cases.
-
Running evaluations: Evaluate all test cases to verify agent behavior.
-
Reviewing evaluation results: Check success rates and detailed metrics.
-
Iterate and improve: Refine tools, update knowledge, and rerun evaluations.
Recording test cases from preview chat
Before you begin testing, add the tools, collaborators, and knowledge your agent needs. Because evaluations use your agent's current configuration, preparing these inputs ensures that each test case is evaluated in a realistic scenario.
Record conversations from preview chat as test cases to quickly build your evaluation suite. The system automatically extracts conversation details, reducing manual test creation effort.
Multi-turn conversations are supported. If your agent asks follow-up questions, continue the conversation. All turns are captured in the test case.
Recording a test case
To record a test case:
-
Open the preview chat for your agent.
-
Have a conversation that represents your test scenario. If the agent asks follow-up questions, continue the conversation. All turns are captured.
-
Click Save as test or the Save icon in the preview chat.
-
Review the auto-extracted information:
-
Name: Test case identifier, typically the first user message.
-
Response summary: Auto-generated description of the conversation flow and expected behavior.
-
Test condition: Tool calls: Tools called during the conversation. Each tool call can have multiple parameters, and you can configure match types for each parameter individually:
-
Exact: Parameter value must match precisely
-
Fuzzy: Parameter value can differ from the expected value as long as it is numerically, date-wise, or semantically similar (for example, "March 30, 2024" matches "30/03/2024")
-
Ignored: Parameter value is not evaluated during test execution
-
Optional: Click Advanced options to customize additional fields:
-
Conversation context: Description of the test scenario and what the agent should accomplish.
-
Starting phrase: The initial user input that begins the conversation.
-
Keywords: Specific words that must appear in the agent's responses for the test to pass.
-
Click Save.
A confirmation message appears with the View tests link above the preview chat. Click View tests to open the Test agent page and manage your test cases.
Figure 1. Record test cases from conversations
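The three parameter match types described above can be illustrated with a minimal sketch. This is not the product's matching logic. The function name `param_matches`, the list of date formats, and the numeric tolerance are assumptions made for illustration, and the semantic part of Fuzzy matching is replaced here by a simple case-insensitive comparison.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical date formats; the real evaluator may accept others.
DATE_FORMATS = ("%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d")


def _as_date(value: str) -> Optional[date]:
    """Return the parsed date, or None if no assumed format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def _as_number(value: str) -> Optional[float]:
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def param_matches(expected: str, actual: str, match_type: str) -> bool:
    """Compare one tool-call parameter under the given match type."""
    if match_type == "ignored":
        return True  # the parameter is not evaluated
    if match_type == "exact":
        return expected == actual  # must match precisely
    if match_type == "fuzzy":
        exp_num, act_num = _as_number(expected), _as_number(actual)
        if exp_num is not None and act_num is not None:
            return abs(exp_num - act_num) < 1e-9  # assumed tolerance
        exp_date, act_date = _as_date(expected), _as_date(actual)
        if exp_date is not None and act_date is not None:
            return exp_date == act_date
        # Stand-in for semantic similarity, which would need a model.
        return expected.strip().casefold() == actual.strip().casefold()
    raise ValueError(f"unknown match type: {match_type}")
```

With this sketch, `param_matches("March 30, 2024", "30/03/2024", "fuzzy")` is true while the same pair under `"exact"` is false, which is the situation where switching a parameter from Exact to Fuzzy can resolve an unexpected failure.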
Accessing the Test agent page
After recording test cases from preview chat, access the Test agent page to manage your tests, run evaluations, and review results.
To access the Test agent page:
-
Click the View tests link that appears above the preview chat after saving a test case, or
-
Click Test agent above the preview chat.
Managing test cases
After you create test cases, you can view and manage them in the Tests section of the Test agent page. Each test case displays the test name, a preview of the expected response, last run time, and last modified information.
Sorting test cases
You can sort test cases by clicking the Sort by dropdown and selecting Recently added or Recently updated.
Managing individual test cases
For every test case, the options menu offers the following choices:
| Option | Description |
|---|---|
| Run test | Run only this specific test case |
| Edit test | Modify test case details such as name, context, starting phrase, and keywords |
| Modify JSON | Edit the test case configuration in JSON format for advanced customization |
| Delete | Remove the test case |
Creating additional test cases
To create additional test cases, click Create test +.
Running evaluations
After you create test cases, run evaluations where an LLM-based user simulator plays the role of a real user so you can measure how your agent behaves in realistic interactions.
To evaluate all test cases, go to the Test agent page and click Evaluate all. To run a single test case, click the Options icon next to the test case and select Run test.
Monitor progress in the Evaluation results section. As each test case runs, its results appear in the Evaluation results table.
Understanding pass or fail criteria
A test case passes when all configured conditions are met:
-
Tool calls: All expected tools are called with correct parameters (based on match type configuration)
-
Keywords: All specified keywords appear in the agent's response
-
Semantic match: The response is semantically similar to the expected answer
A test case fails if any configured condition is not met. See Analyzing evaluation metrics to identify which specific conditions failed.
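The pass or fail rule can be sketched as a conjunction over the configured conditions. The following is an illustration only. The function names, the condition keys, and the case-insensitive keyword check are assumptions, and a value of `None` is used here to mark a condition that was not configured.

```python
from typing import Dict, List, Optional, Sequence


def keywords_present(keywords: Sequence[str], response: str) -> bool:
    """True if every keyword appears in the response (case-insensitive, by assumption)."""
    text = response.casefold()
    return all(keyword.casefold() in text for keyword in keywords)


def failed_conditions(results: Dict[str, Optional[bool]]) -> List[str]:
    """Names of configured conditions that were not met; None means not configured."""
    return [name for name, outcome in results.items() if outcome is False]


def case_passes(results: Dict[str, Optional[bool]]) -> bool:
    """A test case passes only when no configured condition failed."""
    return not failed_conditions(results)


# Hypothetical outcome for one test case.
results = {
    "tool_calls": True,
    "keywords": keywords_present(["refund", "order"], "Your refund for order 1042 was issued."),
    "semantic_match": None,  # not configured for this test case
}
```

Here `case_passes(results)` is true. Changing any configured value to `False` makes the test case fail, and `failed_conditions` then names the condition to investigate.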
Troubleshooting test failures
If a test fails unexpectedly:
-
Tool call mismatches: Check if parameters need Fuzzy matching instead of Exact. Use the View traces option to see actual tool calls made.
-
Keyword mismatches: Verify keywords are spelled correctly and appear in the expected response format.
-
Semantic mismatches: Review the response summary to ensure the expected behavior is clearly defined.
-
Configuration changes: If you modified tools or knowledge, rerun evaluations to test against the current agent configuration.
-
While an evaluation is in progress, the Test cases section remains temporarily disabled. The system re-enables it after the evaluation finishes.
-
Your evaluation may take up to 10 minutes, depending on the number of test cases and overall system load.
Reviewing evaluation results
Each evaluation appears as a row in the Evaluation results section, with key details that help you track and analyze your test outcomes. To find a specific evaluation, use the search box.
| Property | Description |
|---|---|
| Date | When the evaluation started |
| Success rate | Percentage of tests that passed |
| Successful tests | Number of tests that passed |
| Total tests | Total number of tests run |
| Run by | User who started the evaluation |
| Download | Export your evaluation report in CSV format |
You can select an evaluation result, then click the Download button to export the report in CSV format or click the Delete button to remove it.
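The Success rate column relates to the Successful tests and Total tests columns as a simple percentage. A minimal sketch follows. The function name is hypothetical, and returning 0 for an evaluation with no tests is an assumption rather than documented behavior.

```python
def success_rate(successful_tests: int, total_tests: int) -> float:
    """Percentage of tests that passed in one evaluation."""
    if total_tests <= 0:
        return 0.0  # assumed behavior when nothing was run
    return 100.0 * successful_tests / total_tests
```

For example, an evaluation in which 7 of 10 test cases passed has a success rate of 70 percent.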
To view detailed results:
-
Click the date link in the Evaluation results table.
-
Review the comprehensive metrics dashboard showing overall performance and individual test results.
The detailed results page displays the agent name and description at the top, followed by overall metrics cards and individual test results. Each test result row includes an overflow menu with options to Run test or View traces for debugging.
Use View traces to see the exact tool calls, parameters, and responses generated during test execution. This is essential for debugging tool call mismatches.
Analyzing evaluation metrics
To analyze evaluation metrics, select an evaluation by clicking its timestamp under Date in the Evaluation results table.
Performance metrics
Performance metrics for the selected test case:
| Metric | Description |
|---|---|
| Orchestrate Agent Routing F1 | F1 score for agent routing. Range: 0-1. |
| Total Steps | Number of conversation turns. |
| LLM Steps | Number of language model invocations. |
| Average Agent Response Time (s) | Response time in seconds. |
| Total Tool Calls | Number of tool invocations. |
| Tool Call Recall | Completeness score. Range: 0-1. |
| Tool Call Precision | Accuracy score. Range: 0-1. |
| Expected Tool Calls | Number of tool calls expected. |
| Correct Tool Calls | Number of correctly executed tool calls. |
| Missed Tool Calls | Number of expected tool calls not executed. |
| Tool Calls with Incorrect Parameters | Number of tool calls with wrong parameters. |
| Text Match | Semantic similarity score between the agent's response and expected response. Range: 0-1. |
| Journey Completion | Whether the conversation journey was completed. |
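To make the recall and precision rows more concrete, the sketch below applies the standard definitions to the tool call counts. It assumes that recall is Correct Tool Calls divided by Expected Tool Calls and that precision is Correct Tool Calls divided by Total Tool Calls. The page does not state the exact formulas the product uses, so treat these as assumptions. The `f1` helper shows only the usual harmonic-mean formula, not how routing precision and recall are derived.

```python
from dataclasses import dataclass


@dataclass
class ToolCallCounts:
    """Hypothetical container for counts from one test result."""
    expected: int  # Expected Tool Calls
    made: int      # Total Tool Calls
    correct: int   # Correct Tool Calls


def recall(counts: ToolCallCounts) -> float:
    """Share of expected tool calls that were made correctly (assumed formula)."""
    return counts.correct / counts.expected if counts.expected else 0.0


def precision(counts: ToolCallCounts) -> float:
    """Share of the agent's tool calls that were correct (assumed formula)."""
    return counts.correct / counts.made if counts.made else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


counts = ToolCallCounts(expected=4, made=5, correct=3)
```

With 4 expected calls, 5 calls made, and 3 correct, recall is 0.75 and precision is 0.6 under these assumptions. Low recall points toward missed tool calls, and low precision points toward extra or incorrectly parameterized calls.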
Pass or fail indicators
Pass or fail indicators for the selected test case:
| Metric | Description |
|---|---|
| Tool Match Success | Whether correct tools were selected (Yes/No). |
| Keyword Match | Whether expected keywords appear in response (Yes/No). |
| Journey Success | Whether the agent completed the expected conversation flow (Yes/No). |
| Semantic Match | Whether response is semantically similar to expected (Yes/No). |
Accessing the Legacy experience
The previous evaluation interface (CSV upload) is available for 30 days after the feature release. To access it, go to the Test agent page and click Legacy experience in the top right.
For more information, see Evaluate with CSV files.
What to do next
After you review evaluation results and identify improvement areas, take action by refining tools, updating knowledge, and adjusting configurations.