AI evaluation in Instana
Assess and monitor the quality of your LLM outputs with comprehensive evaluation capabilities.
What is AI evaluation?
AI evaluation is a systematic approach to assess the quality, accuracy, and reliability of outputs generated by AI models. As organizations increasingly deploy LLM-powered applications, it becomes critical to measure and monitor their performance to ensure they meet business requirements and quality standards.
AI evaluation typically involves:
- Quality assessment: Measuring accuracy, relevance, and completeness of model responses
- Consistency checking: Ensuring logical coherence and clarity in outputs
- Performance monitoring: Tracking evaluation metrics to identify issues
How Instana enables AI evaluation
Instana's GenAI Observability platform provides a comprehensive AI evaluation framework that allows you to:
- Define custom evaluators: Create evaluation criteria tailored to your specific use cases
- Leverage pre-built evaluators: Use industry-standard evaluation metrics out of the box
- Monitor results: Track pass/fail rates, scores, and detailed evaluation outcomes
- Apply generative AI evaluation: Use advanced AI models to evaluate your LLM outputs objectively
The evaluation system integrates seamlessly with Instana's trace collection, allowing you to evaluate real production data and gain insights into your GenAI application's performance.
Prerequisites
Before you use AI evaluation features, complete the following setup:
Enable AI gateway with gen AI evaluation capability: The evaluation feature requires an AI gateway configured with the "LLM-as-a-judge" capability. This gateway uses WatsonX AI models to perform automated evaluations.
Required WatsonX credentials:
- Project ID: Your WatsonX project identifier
- API key: Authentication key for WatsonX API access
- URL: WatsonX service endpoint URL
Available models:
Instana supports three AI models for evaluation:
- openai/gpt-oss-120b (default)
- Additional model options available in the AI Gateway configuration
Configure AI gateway
- On the Instana UI, go to AI Gateway.
- Select LLM Gateways.
- Create a new gateway or edit an existing one.
- Enter your WatsonX credentials:
  - Project ID
  - API Key
  - URL
- Select the capability: GenAI evaluation.
- Choose your preferred AI model (or use the default).
- Save the configuration.
- Enable the gateway by toggling the enable switch.
After you enable the gateway, it becomes available for use in evaluations. The play button that runs evaluations is active only while the gateway is enabled.
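The readiness rule above can be pictured with a minimal Python sketch. The GatewayConfig class and its field names are hypothetical: they mirror the form fields described in this section and are not an Instana API or configuration schema. The check that all credentials are filled in is an added assumption; this document states only that the gateway must be enabled.

```python
from dataclasses import dataclass


@dataclass
class GatewayConfig:
    # Hypothetical field names that mirror the form fields described above;
    # this is not an actual Instana API or configuration schema.
    project_id: str
    api_key: str
    url: str
    capability: str
    model: str = "openai/gpt-oss-120b"  # default model named in this document
    enabled: bool = False


def missing_fields(cfg: GatewayConfig) -> list[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = ("project_id", "api_key", "url", "capability")
    return [name for name in required if not getattr(cfg, name).strip()]


def can_run_evaluations(cfg: GatewayConfig) -> bool:
    """Assumed rule: the gateway must be enabled and fully filled in."""
    return cfg.enabled and not missing_fields(cfg)
```

A configuration with all fields set but enabled left at False would return False here, which matches the inactive play button described above.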
Getting started with AI evaluation
Requesting deployment
Initially, a Request deployment button appears in the Evaluation tab. To request deployment of the Evaluation feature, open a case and provide your Tenant Unit name. Once the deployment request is completed, the Evaluation feature will be available for use.
Accessing the evaluation interface
- Open the Instana UI
- Select GenAI Observability from the sidebar
- Click on the Evaluations tab in the navigation bar
An expanded banner displays with an overview of the evaluations feature and a 3-step setup guide.
Main features and workflow
Create evaluators
Start by defining the evaluation criteria that are used to assess your LLM outputs.
Step 1: Navigate to the Evaluators tab
Click on the Evaluators tab to view and create evaluators.
Step 2: Choose evaluator type
Pre-built evaluators:
Instana provides five ready-to-use evaluators:
- Accuracy: Measures correctness of the output against expected results
- Relevance: Assesses how relevant the output is to the input query
- Completeness: Evaluates whether the output fully addresses the query
- Logical consistency: Checks for logical coherence in the response
- Clarity: Measures how clear and understandable the output is
Custom evaluator:
For specialized evaluation needs, create a custom evaluator:
- Select the Custom option
- Provide the following details:
  - Name: A descriptive name for your evaluator
  - Evaluation criteria: Define your custom evaluation logic
    - Use the provided placeholder as a reference
    - Write clear, specific criteria describing what constitutes a good output
  - Threshold: Set the minimum score (0-1) for a passing evaluation
    - Scores greater than or equal to this threshold are marked as "passed"
    - Scores below this threshold are marked as "failed"
- Save the evaluator
Your custom evaluator is now available for use in evaluation definitions.
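The threshold rule can be expressed as a short Python sketch. The function name is hypothetical and the code only illustrates the comparison described above; it is not how Instana implements the check. It assumes scores fall in the same 0-1 range as the threshold.

```python
def evaluate_against_threshold(score: float, threshold: float) -> str:
    """Return "passed" if score >= threshold, otherwise "failed"."""
    # Assumption: both the score and the threshold lie in the range 0-1.
    for name, value in (("score", score), ("threshold", threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return "passed" if score >= threshold else "failed"
```

A score exactly equal to the threshold counts as "passed", so evaluate_against_threshold(0.5, 0.5) returns "passed".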
Define evaluations
Create evaluation definitions that specify which traces to evaluate and which evaluators to use.
Step 1: Navigate to the Evaluation definitions tab
Click on the Evaluation definitions tab.
Step 2: Create a new evaluation definition
- Click Create definition
- Select traces to evaluate:
  - Use the UI trace selector
  - Apply a time range (for example, last 24 hours, last 7 days, or a custom range) to narrow down traces
  - Preview the selected traces before proceeding
- Select evaluators:
  - Choose from the list of available evaluators created in the Evaluators tab
  - Select one or more evaluators based on your evaluation needs
  - Each evaluator runs independently on the selected traces
- Provide evaluation details:
  - Name: Give your evaluation a descriptive name
  - Description: Explain the purpose and scope of this evaluation (optional)
- Save the evaluation definition
Your evaluation is now ready to be executed.
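To illustrate how a time range narrows the set of traces, here is a minimal Python sketch. In Instana the selection happens in the UI trace selector; the trace dictionaries, their "id" and "timestamp" keys, and the select_traces function are all hypothetical stand-ins.

```python
from datetime import datetime, timedelta, timezone


def select_traces(traces, window, now=None):
    """Keep traces whose timestamp falls within `window` before `now`."""
    now = now or datetime.now(timezone.utc)
    start = now - window
    return [t for t in traces if start <= t["timestamp"] <= now]


# Made-up traces with a fixed reference time so the result is reproducible.
now = datetime(2025, 1, 8, 12, 0, tzinfo=timezone.utc)
traces = [
    {"id": "t1", "timestamp": now - timedelta(hours=2)},
    {"id": "t2", "timestamp": now - timedelta(days=3)},
    {"id": "t3", "timestamp": now - timedelta(days=10)},
]
last_24h = select_traces(traces, timedelta(hours=24), now=now)
last_7d = select_traces(traces, timedelta(days=7), now=now)
print([t["id"] for t in last_24h])  # ['t1']
print([t["id"] for t in last_7d])   # ['t1', 't2']
```

A narrower window yields fewer traces, which also shortens the evaluation run.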
Step 3: Execute and view evaluation results
Monitor and analyze the results of your evaluation runs.
Execute an evaluation
- From the Evaluation definitions tab, view the list of available evaluations
- Click the play button in the Actions column for the evaluation that you want to run
- The evaluation process begins and runs in the background
View results in the Evaluation runs tab
Navigate to the Evaluation runs tab (default view) to see:
Run-level summary:
- Run start time: When the evaluation began
- Evaluation name: Name of the evaluation definition
- Evaluators: List of evaluators used in this run
- Number of traces: Total traces evaluated
- Duration: Time taken to complete the evaluation
- Run status: COMPLETED (all evaluators completed successfully), FAILED (at least one evaluator failed), or IN-PROGRESS (evaluators still running)
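The run status rules can be sketched in Python as follows. This is an illustration only: it reduces each evaluator to one of three states (COMPLETED, FAILED, IN-PROGRESS), and it assumes that FAILED takes precedence when some evaluators have failed while others are still running. This document does not specify that precedence.

```python
def run_status(evaluator_states: list[str]) -> str:
    """Derive a run-level status from simplified per-evaluator states."""
    if not evaluator_states:
        raise ValueError("a run needs at least one evaluator")
    # Assumption: one failed evaluator marks the whole run as FAILED,
    # even if other evaluators are still running.
    if any(state == "FAILED" for state in evaluator_states):
        return "FAILED"
    if all(state == "COMPLETED" for state in evaluator_states):
        return "COMPLETED"
    return "IN-PROGRESS"
```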
Evaluator-level details:
For each evaluator in the run, see:
- Evaluator name: The metric name (same as evaluator name)
- Score: Mean score across all evaluated traces
- Pass rate: Percentage of traces that passed the threshold
- Duration: Time taken by this evaluator
- Pass/fail count: Number of traces that passed vs. failed
- Status: PENDING, IN-PROGRESS, COMPLETED, INCOMPLETE, PASSED, FAILED, or UNDEFINED
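The evaluator-level figures relate to each other as shown in the following Python sketch. It computes the mean score, the pass and fail counts, and the pass rate from a list of per-trace scores. The summarize function, the dictionary keys, and the sample scores are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def summarize(scores: list[float], threshold: float) -> dict:
    """Compute mean score, pass/fail counts, and pass rate for one evaluator."""
    if not scores:
        raise ValueError("at least one trace score is required")
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "score": mean(scores),                      # mean across all traces
        "passed": passed,                           # traces at or above threshold
        "failed": len(scores) - passed,             # traces below threshold
        "pass_rate": 100.0 * passed / len(scores),  # percentage of passing traces
    }


# Made-up scores for four traces, evaluated against a threshold of 0.5.
summary = summarize([0.25, 0.5, 0.75, 1.0], threshold=0.5)
print(summary)
# {'score': 0.625, 'passed': 3, 'failed': 1, 'pass_rate': 75.0}
```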
Best practices
1. Start with pre-built evaluators
Begin with the standard evaluators (Accuracy, Relevance, Completeness, Logical consistency, Clarity) to establish baseline metrics before creating custom evaluators.
2. Set appropriate thresholds
- Start with moderate thresholds (for example, 0.5) and adjust based on results
- Different use cases may require different thresholds
- Monitor pass rates and adjust thresholds to match your quality standards
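One way to reason about threshold adjustments is to compare the pass rate that several candidate thresholds would produce on the same set of scores. The Python sketch below does this with made-up scores; the pass_rates function is a hypothetical helper and not part of Instana.

```python
def pass_rates(scores, thresholds):
    """Map each candidate threshold to the fraction of scores that would pass."""
    if not scores:
        raise ValueError("at least one score is required")
    return {t: sum(s >= t for s in scores) / len(scores) for t in thresholds}


scores = [0.2, 0.4, 0.5, 0.6, 0.9]  # made-up scores, for illustration only
for threshold, rate in pass_rates(scores, [0.3, 0.5, 0.7]).items():
    print(f"threshold={threshold:.1f} pass_rate={rate:.0%}")
# threshold=0.3 pass_rate=80%
# threshold=0.5 pass_rate=60%
# threshold=0.7 pass_rate=20%
```

A sharp drop in pass rate between two neighboring thresholds suggests that many scores cluster in that range, which is worth checking before you settle on a value.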
3. Combine evaluators for comprehensive assessment
You can select multiple evaluators for a single evaluation to assess different quality dimensions simultaneously. Each evaluator runs independently and produces its own score and pass/fail result. For example:
- Accuracy and Completeness for factual Q&A systems
- Relevance and Clarity for customer support chatbots
- All five pre-built evaluators for critical production systems
4. Evaluate representative samples
- Select traces that represent typical user interactions
- Use time-based filters to evaluate recent production data
5. Custom evaluator guidelines
When creating custom evaluators:
- Write clear, specific evaluation criteria
- Follow the placeholder text in the evaluator description
- Test with sample data before production use
- Document the purpose and expected behavior
Troubleshooting
Evaluation run fails to start
- Check AI gateway configuration: Ensure WatsonX credentials are valid and the gateway is enabled
- Verify trace and evaluator selection: Evaluations can only be created when both traces and evaluators are selected
Low pass rates
- Review threshold settings: Threshold may be too high for your use case
- Examine failed traces: Look for patterns in why traces are failing
- Validate evaluation criteria: Ensure criteria match your actual requirements
Slow evaluation runs
- Reduce trace count: Evaluate smaller batches for faster results
- Reduce evaluator count: Use fewer evaluators to speed up the evaluation process
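If you keep a list of trace identifiers outside Instana, splitting it into smaller batches could look like the Python sketch below. The batches helper and the identifiers are hypothetical. In the Instana UI you would achieve the same effect by selecting fewer traces, for example with a narrower time range.

```python
from typing import Iterator, Sequence


def batches(trace_ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` trace identifiers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(trace_ids), size):
        yield trace_ids[start:start + size]


# Five made-up identifiers split into batches of two.
for batch in batches(["a", "b", "c", "d", "e"], 2):
    print(batch)
```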
Missing results
- Wait for completion: Evaluations run asynchronously; check the run status in the Evaluation runs tab
- Review error messages: Check the error field in the run status response
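Because runs are asynchronous, a client typically polls until the run leaves the IN-PROGRESS state. The Python sketch below shows that generic pattern. It does not call any Instana API: fetch_status is a hypothetical function that you would supply, and the "status" and "error" keys are assumed names based on the run status and error field mentioned above.

```python
import time
from typing import Callable


def wait_for_run(fetch_status: Callable[[], dict], interval: float = 5.0,
                 max_checks: int = 60) -> dict:
    """Poll until the run leaves IN-PROGRESS, then return the last status dict."""
    for _ in range(max_checks):
        status = fetch_status()  # hypothetical caller-supplied lookup
        if status.get("status") != "IN-PROGRESS":
            if status.get("error"):
                print(f"Run reported an error: {status['error']}")
            return status
        time.sleep(interval)
    raise TimeoutError("run did not finish within the allotted checks")
```

Capping the number of checks prevents the loop from waiting indefinitely if a run never reports a final status.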
Summary
Instana's AI evaluation feature provides a powerful framework for assessing and monitoring the quality of your gen AI applications. By combining pre-built evaluators, custom evaluation criteria, and automated gen AI evaluation capabilities, you can:
- Ensure consistent quality in production LLM outputs
- Identify and address quality issues proactively
- Make data-driven decisions about model selection and configuration
Start by setting up your AI gateway, creating evaluators that match your quality standards, defining evaluations for your critical traces, and monitoring results to maintain high-quality gen AI applications.