Metrics computation with the Python SDK
The Python SDK is a Python library that you can use to work directly with the Watson OpenScale or watsonx.governance service. You can use the Python SDK to bind your machine learning engine, configure a logging database, and select and monitor deployments.
For model evaluations, you can use the Python SDK to run metric and algorithm computations in a notebook runtime environment or offload them as Spark jobs against IBM Analytics Engine.
To learn how to calculate metrics and algorithms with the Python SDK, you can use sample notebooks.
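For orientation, the following is a minimal sketch of creating an SDK client in a notebook. The package imports, the IAMAuthenticator usage, and the APIClient arguments shown here are assumptions based on the publicly documented SDK; check the SDK reference for the exact signatures that your version supports.

```python
# Minimal sketch, assuming the ibm-watson-openscale package and an IBM Cloud IAM API key.
# The constructor arguments and the service URL are assumptions; verify them against the SDK reference.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

authenticator = IAMAuthenticator(apikey="YOUR_API_KEY")        # placeholder credentials
client = APIClient(
    authenticator=authenticator,
    service_url="https://api.aiopenscale.cloud.ibm.com",       # assumed service endpoint
)
print(client.version)                                          # confirm that the client can reach the service
```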
Traditional model evaluation metrics
With Python SDK version 3.0.14 or later, you can calculate the following fairness metrics and explanation algorithms:
FairScore transformer
You can use the FairScore transformer as a post-processing bias mitigation technique. This technique transforms the probability estimates, or scores, of probabilistic binary classification models with respect to fairness goals. To use the FairScore transformer in Watson OpenScale, you must first train it.
Individual fairness post-processor
The individual fairness post-processor is a post-processing transformer algorithm that transforms individual scores to achieve individual fairness. You can use it with the Python SDK to support multi-class text classification. You must train this algorithm before you can use it to transform model outputs.
Input reduction
You can use the input reduction algorithm to calculate the minimum set of features that you must specify to keep model predictions consistent. The algorithm excludes the features that do not affect model predictions.
Likelihood compensation
Likelihood compensation (LC) is a framework for explaining deviations of a black box model's predictions from the ground truth. With test data and the predict function of a black box model, LC can identify anomalies in the test data and explain what caused a sample to become an anomaly. The LC explanation is provided as deltas which, when added to the original test data (the anomaly), converge the model's prediction to the ground truth. LC provides local explanations and is supported only for regression models.
Local Interpretable Model-Agnostic Explanations (LIME)
LIME identifies which features are most important for a specific data point by analyzing up to 5000 other close-by data points. In an ideal setting, the features with high importance in LIME are the features that are most important for that specific data point.
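As an outside-the-SDK illustration of the technique, the open-source lime package produces a local explanation for a single tabular data point by sampling perturbed neighbors; the data set and model below are placeholders.

```python
# Illustrative only: a local LIME explanation with the open-source `lime` package,
# not the Watson OpenScale implementation. The model and data are placeholders.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Explain one data point by sampling perturbed points around it
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=5, num_samples=5000
)
print(explanation.as_list())   # (feature, weight) pairs for the most important features
```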
Mean individual disparity
You can use the mean individual disparity to verify whether your model generates similar predictions or scores for similar samples. This metric calculates the difference in probability estimates of multi-class classification models for similar samples.
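The intuition can be sketched with plain NumPy: for pairs of samples that are considered similar, average the gap between the model's probability estimates. The pairing used below is a placeholder, not the SDK's exact formulation.

```python
# Illustrative sketch of the idea behind mean individual disparity: average the
# difference in predicted class probabilities across pairs of similar samples.
# The pairing of samples is a placeholder, not the SDK's formulation.
import numpy as np

def mean_individual_disparity(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """probs_a[i] and probs_b[i] are class-probability vectors for a pair of similar samples."""
    return float(np.mean(np.abs(probs_a - probs_b).sum(axis=1)))

# Probability estimates for three pairs of similar samples (three classes)
probs_a = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.8, 0.1]])
probs_b = np.array([[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.2, 0.7, 0.1]])
print(mean_individual_disparity(probs_a, probs_b))   # 0 means identical predictions for similar pairs
```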
Multidimensional subset scanning
You can use the multidimensional subset scanning algorithm as a general bias scan method. This method detects and identifies which subgroups of features have statistically significant predictive bias for a probabilistic binary classifier. This algorithm helps you decide which features are the protected attributes and which values of these features are the privileged group for monitor evaluations.
Performance measures
You can use the following performance measure metrics to evaluate models with a confusion matrix that is calculated from ground truth data and model predictions on sample data (a short sketch follows this list):
- average_odds_difference
- average_abs_odds_difference
- error_rate_difference
- error_rate_ratio
- false_negative_rate_difference
- false_negative_rate_ratio
- false_positive_rate_difference
- false_positive_rate_ratio
- false_discovery_rate_difference
- false_discovery_rate_ratio
- false_omission_rate_difference
- false_omission_rate_ratio
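As a hedged illustration, two of these measures can be derived from per-group confusion matrices with scikit-learn; the group labels, ground truth, and predictions below are made-up sample data.

```python
# Illustrative sketch: derive false_positive_rate_difference and error_rate_ratio
# from per-group confusion matrices with scikit-learn. All data below is made up.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
group  = np.array(["priv"] * 5 + ["unpriv"] * 5)

def rates(y_t, y_p):
    tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    err = (fp + fn) / (tn + fp + fn + tp)
    return fpr, err

fpr_priv, err_priv = rates(y_true[group == "priv"], y_pred[group == "priv"])
fpr_unpriv, err_unpriv = rates(y_true[group == "unpriv"], y_pred[group == "unpriv"])

print("false_positive_rate_difference:", fpr_unpriv - fpr_priv)
print("error_rate_ratio:", err_unpriv / err_priv if err_priv else float("nan"))
```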
Protected attributes extraction
The protected attribute extraction algorithm transforms text data sets to structured data sets. The algorithm tokenizes the text data, compares the data to patterns that you specify, and extracts the protected attribute from the text to create structured data. You can use this structured data to detect bias against the protected attribute with a Watson OpenScale bias detection algorithm. The protected attribute extraction algorithm only supports gender as a protected attribute.
Protected attributes perturbation
The protected attribute perturbation algorithm generates counterfactual statements by identifying protected attribute patterns in text data sets. It also tokenizes the text and perturbs the keywords in the text data to generate statements. You can use the original and perturbed data sets to detect bias against the protected attribute with a Watson OpenScale bias detection algorithm. The protected attribute perturbation algorithm only supports gender as a protected attribute.
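A rough sense of both algorithms can be conveyed with plain Python string handling; the gendered-term patterns and the swap table below are illustrative only, not the SDK's internal patterns.

```python
# Illustrative sketch only: extract a gender attribute from text and perturb gendered
# keywords to build a counterfactual statement. The term lists are placeholders,
# not the SDK's internal patterns, and the naive swap ignores grammatical ambiguity
# (for example, "her" can map to either "him" or "his").
import re

FEMALE = {"she", "her", "hers", "woman"}
MALE = {"he", "him", "his", "man"}
SWAP = {"she": "he", "her": "him", "hers": "his", "woman": "man",
        "he": "she", "him": "her", "his": "her", "man": "woman"}

def extract_gender(text: str) -> str:
    tokens = re.findall(r"[a-z']+", text.lower())
    if any(t in FEMALE for t in tokens):
        return "female"
    if any(t in MALE for t in tokens):
        return "male"
    return "unknown"

def perturb_gender(text: str) -> str:
    # Swap gendered keywords to generate a counterfactual statement
    return re.sub(r"\b[a-zA-Z']+\b",
                  lambda m: SWAP.get(m.group(0).lower(), m.group(0)), text)

comment = "She said her loan application was rejected."
print(extract_gender(comment))   # "female" -> becomes a structured column for bias detection
print(perturb_gender(comment))   # naive counterfactual of the original statement
```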
Protodash explainer
The protodash explainer identifies input data from a reference set that need explanations. This method minimizes the maximum mean discrepancy (MMD) between the reference datapoints and a number of instances that are selected from the training data. To help you better understand your model predictions, the training data instances mimic a similar distribution as the reference datapoints.
Shapley Additive explainer (SHAP)
SHAP is a game-theoretic approach that explains the output of machine learning models. It connects optimal credit allocation with local explanations by using Shapley values and their related extensions.
SHAP assigns each model feature an importance value for a particular prediction, which is called a Shapley value. The Shapley value is the average marginal contribution of a feature value across all possible groups of features. The SHAP values of the input features sum to the difference between the baseline or expected model output and the current model output for the prediction that is being explained. The baseline model output can be based on the summary of the training data or any subset of data for which explanations must be generated.
The Shapley values of a set of transactions can be combined to get global explanations that provide an overview of which features of a model are most important.
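For illustration outside the SDK, the open-source shap package computes these values for a tree model as follows; the regression data set and model are placeholders.

```python
# Illustrative only: Shapley values with the open-source `shap` package, not the
# Watson OpenScale implementation. The regression model and data are placeholders.
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)                 # baseline comes from the training data
shap_values = explainer.shap_values(X.iloc[:100])     # shape: (rows, features)

# Local explanation: per-feature contributions for the first prediction;
# together with the baseline they add up to the model output for that row.
print(dict(zip(X.columns, shap_values[0].round(2))))

# Global view: mean absolute contribution of each feature across the sample
print(dict(zip(X.columns, np.abs(shap_values).mean(axis=0).round(2))))
```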
Smoothed empirical differential (SED)
SED is a fairness metric that you can use to describe fairness for your model predictions. SED quantifies the differential in the probability of favorable and unfavorable outcomes between intersecting groups that are divided by features. All intersecting groups are considered equal, so there are no unprivileged or privileged groups. The calculation produces a SED value that is the minimum ratio of Dirichlet-smoothed probabilities of favorable and unfavorable outcomes between intersecting groups in the data set. The value is in the range 0-1, excluding 0 and 1, and a larger value indicates a better outcome.
Statistical parity difference
Statistical parity difference is a fairness metric that you can use to describe fairness for your model predictions. It is the difference between the ratio of favorable outcomes in the unprivileged group and the ratio of favorable outcomes in the privileged group. This metric can be computed from either the input data set or the data set output from a classifier (the predicted data set). A value of 0 implies that both groups receive equal benefit. A value less than 0 implies higher benefit for the privileged group. A value greater than 0 implies higher benefit for the unprivileged group.
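The calculation itself reduces to a difference of favorable-outcome rates; a minimal sketch with pandas follows, in which the column names, group labels, and favorable outcome value are illustrative.

```python
# Minimal sketch of statistical parity difference with pandas.
# Column names, group labels, and the favorable outcome value are illustrative.
import pandas as pd

df = pd.DataFrame({
    "gender":     ["F", "F", "F", "M", "M", "M", "M", "F"],
    "prediction": [1,   0,   1,   1,   1,   0,   1,   0],   # 1 = favorable outcome
})

favorable = df["prediction"] == 1
p_unpriv = favorable[df["gender"] == "F"].mean()   # P(favorable | unprivileged group)
p_priv   = favorable[df["gender"] == "M"].mean()   # P(favorable | privileged group)

spd = p_unpriv - p_priv
print(round(spd, 3))   # 0 = parity, < 0 favors the privileged group, > 0 favors the unprivileged group
```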
Prompt template evaluation metrics
With Watson OpenScale Python SDK version 3.0.39 or later, you can also calculate the following metrics for prompt template evaluations:
Content analysis
You can use the following content analysis metrics to evaluate your foundation model output against your model input or context (a token-overlap sketch of the coverage and abstractness calculations follows the list):
- Coverage
Coverage measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that a higher percentage of output words are within the input text.
- Density
Density measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds: Lower bound: 0
- How it works: Lower scores indicate that the model output is more abstractive and on average the extractive fragments do not closely resemble verbatim extractions from the original text.
- Compression
Compression measures how much shorter the summary is when compared to the input text. It calculates the ratio between the number of words in the original text and the number of words in the foundation model output.
- Task types: Text summarization
- Thresholds: Lower bound: 0
- How it works: Higher scores indicate that the summary is more concise when compared to the original text.
- Repetitiveness
Repetitiveness measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output.
- Task types: Text summarization
- Thresholds: Lower bound: 0
- Abstractness
Abstractness measures the ratio of n-grams in the generated text output that do not appear in the source content of the foundation model.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate high abstractness in the generated text output.
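The word-overlap idea behind coverage and abstractness can be sketched with plain Python. The whitespace tokenization and unigram counting below are simplifications; the SDK's exact tokenization and n-gram settings may differ.

```python
# Simplified sketch of the overlap idea behind the coverage and abstractness metrics.
# Whitespace tokenization and unigrams are simplifications of the SDK's calculation.
def coverage(source: str, output: str) -> float:
    src = set(source.lower().split())
    out = output.lower().split()
    return sum(w in src for w in out) / len(out) if out else 0.0

def abstractness(source: str, output: str, n: int = 1) -> float:
    def ngrams(words, size):
        return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    return sum(g not in src for g in out) / len(out) if out else 0.0

source = "the quarterly report shows revenue grew five percent year over year"
summary = "revenue grew five percent according to the report"
print(round(coverage(source, summary), 2))      # share of summary words that appear in the source
print(round(abstractness(source, summary), 2))  # share of summary n-grams that do not appear in the source
```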
Keyword inclusion
Keyword inclusion measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth. It calculates the precision, recall, and f1 scores by using keywords in the model output and in the ground truth.
- Task types:
- Text summarization
- Question answering
- Retrieval Augmented Generation (RAG)
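Treating the extracted keywords as sets, the precision, recall, and F1 scores follow the usual definitions; in the sketch below the keyword sets are placeholders, because the SDK derives the keywords from the texts itself.

```python
# Sketch of the precision/recall/F1 step of keyword inclusion. The keyword sets are
# placeholders; in practice they are extracted (nouns and pronouns) from the texts.
def keyword_inclusion(pred_keywords: set, truth_keywords: set) -> dict:
    overlap = pred_keywords & truth_keywords
    precision = len(overlap) / len(pred_keywords) if pred_keywords else 0.0
    recall = len(overlap) / len(truth_keywords) if truth_keywords else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

pred = {"contract", "supplier", "penalty", "delivery"}
truth = {"contract", "supplier", "deadline", "delivery"}
print(keyword_inclusion(pred, truth))   # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```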
Question robustness
Question robustness detects English-language spelling errors in the model input questions. It calculates the percentage of incorrect questions that are sent to the model. To generate more accurate results, you can specify a list of keywords to exclude from the calculation, such as terms that do not follow English spelling conventions.
- Task types:
- Question answering
- Retrieval Augmented Generation (RAG)
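The percentage calculation can be sketched with the open-source pyspellchecker package, which is an assumption here; the SDK's own spelling detection may differ. The exclusion list holds domain terms that should not count as errors.

```python
# Illustrative sketch with the open-source pyspellchecker package; the SDK's own
# spelling detection may differ. The exclusion list holds domain terms to ignore.
import re
from spellchecker import SpellChecker

def question_robustness(questions: list, exclude: set) -> float:
    spell = SpellChecker()
    incorrect = 0
    for question in questions:
        words = [w for w in re.findall(r"[a-zA-Z']+", question.lower()) if w not in exclude]
        if spell.unknown(words):          # any word that is not in the dictionary
            incorrect += 1
    return incorrect / len(questions) if questions else 0.0

questions = ["What is the intrest rate?", "How do I reset my watsonx password?"]
print(question_robustness(questions, exclude={"watsonx"}))   # 0.5 -> one of two questions is misspelled
```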
Adversarial robustness
Adversarial robustness measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks. It calculates the number of times the model refuses to provide responses to the attack vectors across different categories of jailbreak and prompt injection attacks. Then, it divides this sum by the number of attack vectors for each category to calculate a robustness score. For more information, see Computing Adversarial robustness and Prompt Leakage Risk using IBM watsonx.governance.
- Task types:
- Classification
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that the prompt template is robust and less vulnerable to adversarial attacks. To compute the metric, evaluations use a keyword detector that includes a list of phrases that indicate refusals from the model to provide responses to attacks. The model responses are compared to the list of phrases to calculate the metric score. These scores represent a lower bound of the actual model robustness: if the model does not explicitly refuse to respond to an attack, the score treats the prompt template as not robust. A simplified sketch of this keyword-based scoring follows the list of attack categories.
- Attack categories:
- Basic: Basic attacks use direct prompts to generate unwanted responses for models that are not trained to protect against any attacks.
- Intermediate: Intermediate attacks use natural language to pre-condition foundation models to follow instructions.
- Advanced: Advanced attacks require knowledge of model encoding or access to internal resources.
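The simplified sketch below illustrates the keyword-based scoring: count refusals per attack category and divide by the number of attack vectors in that category. The refusal phrases and model responses are illustrative, not the evaluation's actual lists.

```python
# Simplified sketch of the keyword-based robustness scoring: count refusals per attack
# category and divide by the number of attack vectors in that category. The refusal
# phrases and the model responses are illustrative, not the evaluation's actual lists.
REFUSAL_PHRASES = ["i cannot", "i can't", "i am unable to", "i'm unable to", "i won't"]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def robustness_scores(responses_by_category: dict) -> dict:
    return {
        category: sum(is_refusal(r) for r in responses) / len(responses)
        for category, responses in responses_by_category.items()
    }

model_responses = {
    "basic":        ["I cannot help with that request.", "Sure, here is the restricted data..."],
    "intermediate": ["I'm unable to ignore my instructions."],
}
print(robustness_scores(model_responses))   # {'basic': 0.5, 'intermediate': 1.0}
```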
Prompt leakage risk
Prompt leakage risk measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and the original prompt template. The metric calculates a weighted average of similarity scores that are computed on a set of predefined attack vectors. The weighted average is calculated with a rank value between 1 and 4, where rank 4 represents the prompt attack vector that is easiest for attackers to exploit. For more information, see Computing Adversarial robustness and Prompt Leakage Risk using IBM watsonx.governance. A sketch of the weighted-average step follows this metric's details.
- Task types:
- Classification
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: A value of 0 indicates that the prompt template is robust against leakage attacks. A value of 1 indicates that the prompt template is vulnerable to prompt leaking attacks. If the score is closer to 1, you can try possible steps to mitigate attacks, such as including additional prompt instructions or using runtime detectors.
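The sketch below illustrates only the weighted-average step; using the attack vector's rank directly as the weight and difflib for text similarity are assumptions made for illustration.

```python
# Sketch of the weighted-average step of prompt leakage risk. Using the rank (1-4) as
# the weight and difflib's ratio as the similarity measure are assumptions for
# illustration only; the evaluation's own similarity scoring may differ.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def prompt_leakage_risk(original_template: str, leaked_by_rank: dict) -> float:
    weighted = sum(rank * similarity(original_template, leaked)
                   for rank, leaked in leaked_by_rank.items())
    return weighted / sum(leaked_by_rank)      # denominator: sum of the rank weights

template = "Summarize the following contract clause in plain language: {clause}"
leaked = {
    1: "I summarize contract clauses.",                                         # hard attack, little leakage
    4: "Summarize the following contract clause in plain language: {clause}",   # easy attack, full leakage
}
print(round(prompt_leakage_risk(template, leaked), 2))   # closer to 1 means higher leakage risk
```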
Retrieval quality
You can use the retrieval quality metrics to measure how well the retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with fine-tuned models or with LLM-as-a-judge models. LLM-as-a-judge models are LLMs that you can use to evaluate the performance of other models.
To calculate the metrics with LLM-as-a-judge models, you must create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following retrieval quality metrics (a sketch of the ranking formulas follows the list):
- Context relevance
Context relevance measures how relevant the context that your model retrieves is to the question that is specified in the prompt. When multiple context variables exist, the context relevance scores are generated only when the metric is calculated with fine-tuned models.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that the context is more relevant to the question in the prompt.
- Retrieval precision
Retrieval precision measures the proportion of relevant contexts out of the total number of contexts that are retrieved.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all of the retrieved contexts are relevant. A value of 0 indicates that none of the retrieved contexts are relevant. If the score is trending upwards, the retrieved contexts are relevant to the question. If the score is trending downwards, the retrieved contexts are not relevant to the question.
- Average precision
Average precision evaluates whether all of the relevant contexts are ranked higher by calculating the mean of the precision scores of the relevant contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all the relevant contexts are ranked higher. A value of 0 indicates that none of the retrieved contexts are relevant. If the score is trending upwards, the relevant contexts are ranked higher. If the score is trending downwards, the relevant contexts are ranked lower.
- Reciprocal rank
Reciprocal rank is the reciprocal of the rank of the first relevant context.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the first relevant context is at the first position. A value of 0 indicates that none of the relevant contexts are retrieved. If the score is trending upwards, the first relevant context is ranked higher. If the score is trending downwards, the first relevant context is ranked lower.
- Hit rate
Hit rate measures whether there is at least one relevant context among the retrieved contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that there is at least one relevant context. A value of 0 indicates that no relevant context is in the retrieved contexts. If the score is trending upwards, at least one relevant context is in the retrieved context. If the score is trending downwards, no relevant contexts are retrieved.
- Normalized Discounted Cumulative Gain
Normalized Discounted Cumulative Gain (NDCG) measures the ranking quality of the retrieved contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the retrieved contexts are ranked in the correct order. If the score is trending upwards, the ranking of the retrieved contexts is correct. If the score is trending downwards, the ranking of the retrieved contexts is incorrect.
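Given binary relevance labels for the retrieved contexts in ranked order, most of these metrics reduce to short formulas. The sketch below uses plain Python; in practice the relevance judgments come from the fine-tuned or LLM-as-a-judge models.

```python
# Sketch of the ranking formulas behind several retrieval quality metrics, given binary
# relevance labels for the retrieved contexts in ranked order. In practice the relevance
# judgments come from the fine-tuned or LLM-as-a-judge models.
import math

relevance = [0, 1, 1, 0, 1]   # 1 = relevant context, listed in retrieval order

def retrieval_precision(rel):
    return sum(rel) / len(rel)

def hit_rate(rel):
    return 1.0 if any(rel) else 0.0

def reciprocal_rank(rel):
    for position, r in enumerate(rel, start=1):
        if r:
            return 1.0 / position
    return 0.0

def average_precision(rel):
    hits, total = 0, 0.0
    for position, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / position      # precision at each relevant position
    return total / hits if hits else 0.0

def ndcg(rel):
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rel, start=1))
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(sorted(rel, reverse=True), start=1))
    return dcg / idcg if idcg else 0.0

print(retrieval_precision(relevance), hit_rate(relevance), reciprocal_rank(relevance),
      round(average_precision(relevance), 3), round(ndcg(relevance), 3))
```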
Answer quality
You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with fine-tuned models or with LLM-as-a-judge models. LLM-as-a-judge models are LLMs that you can use to evaluate the performance of other models.
To calculate the metrics with LLM-as-a-judge models, you must create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following answer quality metrics:
- Faithfulness
Faithfulness measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output. The attributions are provided when the metric is calculated with fine-tuned models only.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the output is more grounded and less hallucinated.
- Answer relevance
Answer relevance measures how relevant the answer in the model output is to the question in the model input.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the model provides relevant answers to the question.
- Answer similarity
Answer similarity measures how similar the answer or generated text is to the ground truth or reference answer to determine the quality of your model performance. The answer similarity metric is supported for configuration with LLM-as-a-judge models only.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the answer is more similar to the reference output.
- Unsuccessful requests
Unsuccessful requests measures the ratio of questions that are answered unsuccessfully out of the total number of questions. The unsuccessful requests metric is not calculated with fine-tuned or LLM-as-a-judge models.
- Task types:
- Retrieval Augmented Generation (RAG)
- Question answering
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the model cannot provide answers to the questions.
Content validation
Content validation metrics use string-based functions to analyze and validate generated LLM output text. The input must contain a list of generated text from your LLM to generate content validation metrics.
If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the ratio of successful content validations to the total number of validations and also calculate validation results for the specified record_id values. A sketch of a few of these checks follows the list of metrics.
You can calculate the following content validation metrics:
- Length less than
The length less than metric measures whether the length of each row in the prediction is less than a specified maximum value.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row length in the prediction is less than the specified value. A value of 0 indicates that the row length is not less than the specified value.
- Length greater than
The length greater than metric measures whether the length of each row in the prediction is greater than a specified minimum value.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row length in the prediction is greater than the specified value. A value of 0 indicates that the row length is not greater than the specified value.
- Contains email
The contains email metric measures whether each row in the prediction contains emails.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain emails. A value of 0 indicates that the rows do not contain emails.
- Is email
The is email metric measures whether the rows in the prediction contain valid emails.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid emails. A value of 0 indicates that the rows do not contain valid emails.
- Contains_JSON
The contains_JSON metric measures whether the rows in the prediction contain JSON syntax.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain JSON syntax. A value of 0 indicates that the rows do not contain JSON syntax.
- Is JSON
The is JSON metric measures whether the rows in the prediction contain valid JSON syntax.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid JSON syntax. A value of 0 indicates that the rows do not contain valid JSON syntax.
- Contains link
The contains link metric measures whether the rows in the prediction contain any links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows in the prediction contain links. A value of 0 indicates that the rows in the prediction do not contain links.
- No invalid links
The no invalid links metric measures whether the rows in the prediction have no invalid links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows have no invalid links. A value of 0 indicates that the rows in the prediction do have invalid links.
- Contains valid link
The contains valid link metric measures whether the rows in the prediction contain valid links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid links. A value of 0 indicates that the rows do not contain valid links.
- Starts with
The starts with metric measures whether the rows in the prediction start with the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows start with the specified substring. A value of 0 indicates that the rows do not start with the specified substring.
- Ends with
The ends with metric measures whether the rows in the prediction end with the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that rows end with the specified substring. A value of 0 indicates that the rows do not end with the specified substring.
- Equals to
The equals to metric measures whether the rows in the prediction are equal to the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row is equal to the specified substring. A value of 0 indicates that the row is not equal to the specified substring.
- Contains all
The contains all metric measures whether the rows in the prediction contain all of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all of the specified keywords are found in the rows. A value of 0 indicates that the specified keywords are not found in the rows.
- Contains none
The contains none metric measures whether the rows in the prediction do not contain any of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain none of the specified keywords. A value of 0 indicates that the rows contain at least one of the specified keywords.
- Contains any
The contains any metric measures whether the rows in the prediction contain any of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain at least one of the specified keywords. A value of 0 indicates that the rows do not contain any of the specified keywords.
- Regex
The regex metric measures whether the rows in the prediction contain a match for the specified regular expression.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain a match for the specified regular expression. A value of 0 indicates that the rows do not contain a match for the specified regular expression.
- Contains string
The contains string metric measures whether each row in the prediction contains the specified string.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain the specified string. A value of 0 indicates that the rows do not contain the specified string.
- Fuzzy match
The fuzzy match metric measures whether the prediction fuzzy matches the specified keyword.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the prediction fuzzy matches the keyword. A value of 0 indicates that the prediction does not fuzzy match the keyword.
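Several of these checks reduce to short string functions. The sketch below shows the ratio-of-successful-validations idea for three of them; the email regex and the JSON check are simplified, and the SDK's own patterns may differ.

```python
# Sketch of the ratio-of-successful-validations idea for three content validation checks
# (length less than, contains email, is JSON). The email regex and JSON check are
# simplified; the SDK's own patterns may differ.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def length_less_than(text: str, max_len: int) -> bool:
    return len(text) < max_len

def contains_email(text: str) -> bool:
    return bool(EMAIL_RE.search(text))

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def validation_ratio(predictions: list, check) -> float:
    # Ratio of rows that pass the check out of the total number of rows
    return sum(check(p) for p in predictions) / len(predictions) if predictions else 0.0

outputs = ['{"status": "approved"}', "Contact us at help@example.com", "Plain text answer"]
print(validation_ratio(outputs, is_json))                              # share of rows that are valid JSON
print(validation_ratio(outputs, contains_email))                       # share of rows that contain an email
print(validation_ratio(outputs, lambda t: length_less_than(t, 40)))    # share of rows shorter than 40 characters
```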