Testing and evaluating a draft agent
Evaluate how your draft agent responds to real user utterances before it goes live. Use the Test option to upload test cases, run targeted or full evaluations, and review detailed results.
Testing helps confirm that recent changes to tools, collaborators, or knowledge produce the expected responses. You can iterate faster by running only relevant cases for small updates and full evaluations when validating end-to-end behavior.
Evaluating the agent before deployment helps you fine-tune its behavior so that it aligns with business goals and delivers consistent, measurable outcomes.
Before you begin
Add the tools, collaborators, and knowledge your agent needs before testing. Because tests run against your current agent configuration, preparing these inputs ensures that each prompt is evaluated in a realistic scenario. This setup helps simulate actual user interactions and improves the accuracy of your evaluation results.
Make sure that your test cases reflect realistic scenarios to effectively assess how the agent performs when deployed.
Accessing the test option
To validate your agent's responses before deployment, you can access the test option from the agent configuration page:
- Go to the agent configuration page.
- Click the overflow menu next to Deploy.
- Select Test.
The "Manage test cases and evaluations" page opens, where you can upload test cases, run targeted or full evaluations, and review detailed results.
Prepare test cases
Before running evaluations, prepare a .csv file (maximum size: 5 MB) containing test cases for your agent. To get started:
- Click Upload tests > Download .CSV template to download a sample file.
- Each row in the .csv file must include one Prompt (the user utterance) and one Answer (the expected agent response).
This structure helps ensure that your test cases are formatted correctly and reflect realistic interaction scenarios.
Sample test cases
Use the following format in your .csv file:
Prompt,Answer
"What is the capital of France?","Paris"
"List three healthcare providers.","Provider A, Provider B, Provider C"
The uploader validates the .csv format and supports files up to 5 MB. Upload one file at a time to keep evaluations focused and traceable.
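If you maintain test cases programmatically, a short script can produce a file in this format. The following Python sketch is illustrative: the file name and rows are placeholders, and only the Prompt and Answer columns and the 5 MB limit come from the requirements described above.

```python
import csv
import os

# Placeholder test cases; replace with prompts and expected answers for your agent.
test_cases = [
    ("What is the capital of France?", "Paris"),
    ("List three healthcare providers.", "Provider A, Provider B, Provider C"),
]

# Write the header row and one Prompt/Answer pair per row, as in the template.
with open("agent_test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Prompt", "Answer"])
    writer.writerows(test_cases)

# The uploader accepts files up to 5 MB, so confirm the size before uploading.
size_mb = os.path.getsize("agent_test_cases.csv") / (1024 * 1024)
print(f"{size_mb:.2f} MB - {'within the limit' if size_mb <= 5 else 'too large to upload'}")
```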
Uploading tests
After preparing your .csv file, upload your test cases by using the following steps:
- Click Upload tests on the test management page.
- Click or drag your .csv file into the upload box.
- Click Upload to confirm.
- During the upload, the system validates the file format and checks that the file is within the 5 MB size limit. Any previously uploaded test cases are deselected automatically; only the newly uploaded test cases are selected for evaluation.
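The uploader performs these checks for you, but you can also run a quick local pre-check before uploading. The sketch below assumes a Python 3.9+ environment and checks only the documented constraints: a Prompt,Answer header, a value in each column, and the 5 MB size limit.

```python
import csv
import os

MAX_SIZE_BYTES = 5 * 1024 * 1024  # the uploader rejects files larger than 5 MB

def check_test_file(path: str) -> list[str]:
    """Return any problems found in a test-case .csv before you upload it."""
    problems = []
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        problems.append("File exceeds the 5 MB upload limit.")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["Prompt", "Answer"]:
            problems.append("Header row must be exactly: Prompt,Answer")
        for line_number, row in enumerate(reader, start=2):
            if not row.get("Prompt") or not row.get("Answer"):
                problems.append(f"Row {line_number} is missing a Prompt or an Answer.")
    return problems

issues = check_test_file("agent_test_cases.csv")
print("Ready to upload." if not issues else "\n".join(issues))
```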
Managing test cases
After uploading your .csv file, you can view and manage the test cases that you imported. The Test cases table displays all available prompts and provides options to run, delete, or deselect them individually or in bulk.
Use the following actions to manage your test cases:
| Action | Description |
|---|---|
| Run | Run only the selected test cases. |
| Delete | Remove the selected test cases if you no longer need them. |
| Cancel | Deselect all selected prompts. |
After you click Cancel, you can use Run all to select and run all available test cases, including newly added and previously uploaded ones.
Viewing and organizing test cases
The Test cases table includes features to help you efficiently view, organize, and navigate large sets of test cases.
Table columns:
| Column | Description |
|---|---|
| Prompt | Displays the questions from your .csv file. |
| Date created | Shows when the .csv file was uploaded. |
| Last run | Indicates the last time that the test case was run. |
Table features:
| Feature | Description |
|---|---|
| Search | Quickly locate specific test cases, useful when working with large datasets. (Note: Search is only available when no test cases are selected.) |
| Sort | Click column headers to sort by Prompt, Date created, or Last run. |
| Pagination | View up to 5 prompts per page for better navigation. |
These features help streamline the review process, making it easier to manage extensive test sets and focus on the most relevant cases.
Running a test evaluation
After your test cases are uploaded and selected, click Run to start the evaluation. You can choose to run targeted test cases or a full evaluation, depending on your testing goals. The results appear in the Evaluations table after the process is complete.
- While an evaluation is in progress, the Test cases table is temporarily disabled to prevent conflicting changes. It is re-enabled after the evaluation completes.
- Evaluations might take up to 10 minutes, depending on the number of test cases and overall system load.
- Use targeted evaluations for quick checks after small updates, and full evaluations to validate end-to-end behavior.
Reviewing evaluation results
Each evaluation appears as a row in the Evaluations table and includes key details to help you track and analyze test outcomes:
- Date evaluated: Indicates when the evaluation was initiated.
- Evaluation status: Displays the current state of the evaluation: in progress, completed, or error.
- Number of tests: Shows how many prompts were included in the evaluation.
- Run by: Identifies the user who initiated the evaluation.
- Trash icon: Deletes all evaluations. Use the checkbox to delete evaluations individually.
- Download: Export your evaluation report in .csv format for further analysis or record keeping.
You can download reports to review results offline, which helps you track performance over time and identify areas for improvement.
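For example, you can summarize a downloaded report with a few lines of Python. The file name and the "Status" column used here are assumptions; adjust them to match the headers in your exported .csv file.

```python
import csv
from collections import Counter

# "evaluation_report.csv" and the "Status" column are assumed names;
# change them to match your downloaded report.
with open("evaluation_report.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

status_counts = Counter(row.get("Status", "unknown") for row in rows)
print(f"Prompts evaluated: {len(rows)}")
for status, count in status_counts.most_common():
    print(f"  {status}: {count}")
```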
Analyzing evaluation metrics
To analyze evaluation metrics, select an evaluation by clicking the timestamp under Date evaluated in the Evaluations table. This action opens a detailed dashboard view that shows individual test results:
- Pass and Fail status for each prompt:
  - Pass: The agent successfully processed the prompt and returned an output.
  - Fail: The prompt encountered an error during execution.
- Prompts that need attention:
  - Prompts flagged under Answer quality, Tool call, or Message completion can indicate areas where the agent’s configuration needs refinement.
Use focused rerun to validate specific changes, helping you iterate faster and improve accuracy without rerunning the entire test set.
Answer quality
Answer quality metrics assess the degree to which the agent’s responses align with user expectations. Evaluation is based on the following criteria:
- Faithfulness: Measures how accurately the output reflects and stays grounded in the provided context or source information.
- Relevance: Assesses how relevant the answer is to the user’s question.
- Correctness: Measures how closely the generated text matches the reference answer, based on the ground truth provided in the test .csv file.
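As a rough illustration of grading against ground truth, the sketch below scores a generated answer by token overlap with the reference Answer from the test .csv file. This is a deliberately simplified stand-in, not the scoring method that the evaluation actually uses.

```python
import re

def naive_correctness(generated: str, reference: str) -> float:
    """Fraction of reference-answer tokens that also appear in the generated answer.
    A simplified stand-in for a correctness score, for illustration only."""
    generated_tokens = set(re.findall(r"[a-z0-9]+", generated.lower()))
    reference_tokens = set(re.findall(r"[a-z0-9]+", reference.lower()))
    if not reference_tokens:
        return 0.0
    return len(generated_tokens & reference_tokens) / len(reference_tokens)

print(naive_correctness("The capital of France is Paris.", "Paris"))  # 1.0
print(naive_correctness("I am not sure.", "Paris"))                   # 0.0
```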
Tool quality
The evaluation considers both the set of tools available to the agent and the specific tool calls that were run. The two metrics used to assess tool calls are as follows:
- Accuracy: This metric performs syntactic validation of tool calls. It checks whether the tools were called with the correct parameters and structure.
- Relevance: This metric assesses whether the selected tool addresses the user's immediate request, based on the context of the conversation. It evaluates the alignment between the tool call and user intent by comparing it against all available tools. A large language model (LLM) conducts the assessment.
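To illustrate the kind of syntactic check that the Accuracy metric describes, the sketch below validates a tool call against a simple parameter schema. The tool name, schema format, and example calls are hypothetical; the actual evaluation logic may differ.

```python
# Hypothetical tool schemas: each entry lists required and optional parameter names.
TOOL_SCHEMAS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def validate_tool_call(name: str, arguments: dict) -> list[str]:
    """Return structural problems with a tool call, or an empty list if it looks valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"Unknown tool: {name}"]
    problems = []
    missing = schema["required"] - arguments.keys()
    unexpected = arguments.keys() - schema["required"] - schema["optional"]
    if missing:
        problems.append(f"Missing required parameters: {sorted(missing)}")
    if unexpected:
        problems.append(f"Unexpected parameters: {sorted(unexpected)}")
    return problems

print(validate_tool_call("get_weather", {"city": "Paris"}))      # []
print(validate_tool_call("get_weather", {"location": "Paris"}))  # missing 'city', unexpected 'location'
```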
Message completion
This metric evaluates how successfully the agent completes messages at run time. It helps you assess the reliability and stability of your agent’s responses.
- Success: The number of messages that ran successfully without any exceptions.
- Failed: The number of messages that failed during execution and returned an error.
You can also download the evaluation report in .csv format for deeper analysis.
What to do next
After reviewing evaluation results and identifying areas for improvement, refine tools, update knowledge, and adjust configurations as needed. Use focused rerun to validate specific changes and continue iterating to enhance agent performance. Regular analysis helps your agent become more accurate, reliable, and aligned with business goals over time.