Metrics computation with the Python SDK
The Python SDK is a Python library that you can use to work directly with the Watson OpenScale or watsonx.governance service. You can use the Python SDK to bind your machine learning engine, configure a logging database, and select and monitor deployments.
For model evaluations, you can use the Python SDK to run metric and algorithm computations in a notebook runtime environment or offload them as Spark jobs against IBM Analytics Engine.
To learn how to calculate metrics and algorithms with the Python SDK, you can use sample notebooks.
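For orientation, the following is a minimal sketch of creating an SDK client in a notebook. The package imports, the IAMAuthenticator usage, and the APIClient arguments shown here are assumptions based on the publicly documented SDK; check the SDK reference for the exact signatures that your version supports.

```python
# Minimal sketch, assuming the ibm-watson-openscale package and an IBM Cloud IAM API key.
# The constructor arguments and the service URL are assumptions; verify them against the SDK reference.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient

authenticator = IAMAuthenticator(apikey="YOUR_API_KEY")        # placeholder credentials
client = APIClient(
    authenticator=authenticator,
    service_url="https://api.aiopenscale.cloud.ibm.com",       # assumed service endpoint
)
print(client.version)                                          # confirm that the client can reach the service
```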
Traditional model evaluation metrics
With Python SDK version 3.0.14 or later, you can calculate the following fairness metrics and explanation algorithms:
FairScore transformer
You can use the FairScore transformer as a post-processing bias mitigation technique. This technique transforms the probability estimates, or scores, of probabilistic binary classification models with respect to fairness goals. To use the FairScore transformer in Watson OpenScale, you must first train it.
Individual fairness post-processor
The individual fairness post-processor is a post-processing transformer algorithm that transforms individual scores to achieve individual fairness. You can use it with the Python SDK to support multi-class text classification. You must train this algorithm before you can use it to transform model outputs.
Input reduction
You can use the input reduction algorithm to calculate the minimum set of features that you must specify to keep model predictions consistent. The algorithm excludes the features that do not affect model predictions.
Likelihood compensation
Likelihood compensation (LC) is a framework for explaining deviations of a black box model's predictions from the ground truth. With test data and the predict function of a black box model, LC can identify anomalies in the test data and explain what caused a sample to become an anomaly. The LC explanation is provided as deltas which, when added to the original test data (the anomaly), converge the model's prediction to the ground truth. LC provides local explanations and is supported only for regression models.
Local Interpretable Model-Agnostic Explanations (LIME)
LIME identifies which features are most important for a specific data point by analyzing up to 5000 other close-by data points. In an ideal setting, the features with high importance in LIME are the features that are most important for that specific data point.
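As an outside-the-SDK illustration of the technique, the open-source lime package produces a local explanation for a single tabular data point by sampling perturbed neighbors; the data set and model below are placeholders.

```python
# Illustrative only: a local LIME explanation with the open-source `lime` package,
# not the Watson OpenScale implementation. The model and data are placeholders.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Explain one data point by sampling perturbed points around it
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=5, num_samples=5000
)
print(explanation.as_list())   # (feature, weight) pairs for the most important features
```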
Mean individual disparity
You can use the mean individual disparity to verify whether your model generates similar predictions or scores for similar samples. This metric calculates the difference in probability estimates of multi-class classification models for similar samples.
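The intuition can be sketched with plain NumPy: for pairs of samples that are considered similar, average the gap between the model's probability estimates. The pairing used below is a placeholder, not the SDK's exact formulation.

```python
# Illustrative sketch of the idea behind mean individual disparity: average the
# difference in predicted class probabilities across pairs of similar samples.
# The pairing of samples is a placeholder, not the SDK's formulation.
import numpy as np

def mean_individual_disparity(probs_a: np.ndarray, probs_b: np.ndarray) -> float:
    """probs_a[i] and probs_b[i] are class-probability vectors for a pair of similar samples."""
    return float(np.mean(np.abs(probs_a - probs_b).sum(axis=1)))

# Probability estimates for three pairs of similar samples (three classes)
probs_a = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.1, 0.8, 0.1]])
probs_b = np.array([[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.2, 0.7, 0.1]])
print(mean_individual_disparity(probs_a, probs_b))   # 0 means identical predictions for similar pairs
```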
Multidimensional subset scanning
You can use the multidimensional subset scanning algorithm as a general bias scan method. This method detects and identifies which subgroups of features have statistically significant predictive bias for a probabilistic binary classifier. This algorithm helps you decide which features are the protected attributes and which values of these features are the privileged group for monitor evaluations.
Performance measures
You can use the following performance measure metrics to evaluate models with a confusion matrix that is calculated from ground truth data and model predictions on sample data (a short sketch follows this list):
- average_odds_difference
- average_abs_odds_difference
- error_rate_difference
- error_rate_ratio
- false_negative_rate_difference
- false_negative_rate_ratio
- false_positive_rate_difference
- false_positive_rate_ratio
- false_discovery_rate_difference
- false_discovery_rate_ratio
- false_omission_rate_difference
- false_omission_rate_ratio
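As a hedged illustration, two of these measures can be derived from per-group confusion matrices with scikit-learn; the group labels, ground truth, and predictions below are made-up sample data.

```python
# Illustrative sketch: derive false_positive_rate_difference and error_rate_ratio
# from per-group confusion matrices with scikit-learn. All data below is made up.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
group  = np.array(["priv"] * 5 + ["unpriv"] * 5)

def rates(y_t, y_p):
    tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    err = (fp + fn) / (tn + fp + fn + tp)
    return fpr, err

fpr_priv, err_priv = rates(y_true[group == "priv"], y_pred[group == "priv"])
fpr_unpriv, err_unpriv = rates(y_true[group == "unpriv"], y_pred[group == "unpriv"])

print("false_positive_rate_difference:", fpr_unpriv - fpr_priv)
print("error_rate_ratio:", err_unpriv / err_priv if err_priv else float("nan"))
```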
Protected attributes extraction
The protected attribute extraction algorithm transforms text data sets to structured data sets. The algorithm tokenizes the text data, compares the data to patterns that you specify, and extracts the protected attribute from the text to create structured data. You can use this structured data to detect bias against the protected attribute with a Watson OpenScale bias detection algorithm. The protected attribute extraction algorithm only supports gender as a protected attribute.
Protected attributes perturbation
The protected attribute perturbation algorithm generates counterfactual statements by identifying protected attribute patterns in text data sets. It also tokenizes the text and perturbs the keywords in the text data to generate statements. You can use the original and perturbed data sets to detect bias against the protected attribute with a Watson OpenScale bias detection algorithm. The protected attribute perturbation algorithm only supports gender as a protected attribute.
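A rough sense of both algorithms can be conveyed with plain Python string handling; the gendered-term patterns and the swap table below are illustrative only, not the SDK's internal patterns.

```python
# Illustrative sketch only: extract a gender attribute from text and perturb gendered
# keywords to build a counterfactual statement. The term lists are placeholders,
# not the SDK's internal patterns, and the naive swap ignores grammatical ambiguity
# (for example, "her" can map to either "him" or "his").
import re

FEMALE = {"she", "her", "hers", "woman"}
MALE = {"he", "him", "his", "man"}
SWAP = {"she": "he", "her": "him", "hers": "his", "woman": "man",
        "he": "she", "him": "her", "his": "her", "man": "woman"}

def extract_gender(text: str) -> str:
    tokens = re.findall(r"[a-z']+", text.lower())
    if any(t in FEMALE for t in tokens):
        return "female"
    if any(t in MALE for t in tokens):
        return "male"
    return "unknown"

def perturb_gender(text: str) -> str:
    # Swap gendered keywords to generate a counterfactual statement
    return re.sub(r"\b[a-zA-Z']+\b",
                  lambda m: SWAP.get(m.group(0).lower(), m.group(0)), text)

comment = "She said her loan application was rejected."
print(extract_gender(comment))   # "female" -> becomes a structured column for bias detection
print(perturb_gender(comment))   # naive counterfactual of the original statement
```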
Protodash explainer
The protodash explainer identifies input data from a reference set that need explanations. This method minimizes the maximum mean discrepancy (MMD) between the reference datapoints and a number of instances that are selected from the training data. To help you better understand your model predictions, the training data instances mimic a similar distribution as the reference datapoints.
Shapley Additive explainer (SHAP)
SHAP is a game-theoretic approach that explains the output of machine learning models. It connects optimal credit allocation with local explanations by using Shapley values and their related extensions.
SHAP assigns each model feature an importance value for a particular prediction, which is called a Shapley value. The Shapley value is the average marginal contribution of a feature value across all possible groups of features. The SHAP values of the input features sum to the difference between the baseline or expected model output and the current model output for the prediction that is being explained. The baseline model output can be based on the summary of the training data or any subset of data for which explanations must be generated.
The Shapley values of a set of transactions can be combined to get global explanations that provide an overview of which features of a model are most important.
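For illustration outside the SDK, the open-source shap package computes these values for a tree model as follows; the regression data set and model are placeholders.

```python
# Illustrative only: Shapley values with the open-source `shap` package, not the
# Watson OpenScale implementation. The regression model and data are placeholders.
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)                 # baseline comes from the training data
shap_values = explainer.shap_values(X.iloc[:100])     # shape: (rows, features)

# Local explanation: per-feature contributions for the first prediction;
# together with the baseline they add up to the model output for that row.
print(dict(zip(X.columns, shap_values[0].round(2))))

# Global view: mean absolute contribution of each feature across the sample
print(dict(zip(X.columns, np.abs(shap_values).mean(axis=0).round(2))))
```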
Smoothed empirical differential (SED)
SED is a fairness metric that you can use to describe fairness for your model predictions. SED quantifies the differential in the probability of favorable and unfavorable outcomes between intersecting groups that are divided by features. All intersecting groups are considered equal, so there are no unprivileged or privileged groups. The calculation produces a SED value that is the minimum ratio of Dirichlet-smoothed probabilities of favorable and unfavorable outcomes between intersecting groups in the data set. The value is in the range 0-1, excluding 0 and 1, and a larger value indicates a better outcome.
Statistical parity difference
Statistical parity difference is a fairness metric that you can use to describe fairness for your model predictions. It is the difference between the ratio of favorable outcomes in the unprivileged group and the ratio of favorable outcomes in the privileged group. This metric can be computed from either the input data set or the data set output from a classifier (the predicted data set). A value of 0 implies that both groups receive equal benefit. A value less than 0 implies higher benefit for the privileged group. A value greater than 0 implies higher benefit for the unprivileged group.
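The calculation itself reduces to a difference of favorable-outcome rates; a minimal sketch with pandas follows, in which the column names, group labels, and favorable outcome value are illustrative.

```python
# Minimal sketch of statistical parity difference with pandas.
# Column names, group labels, and the favorable outcome value are illustrative.
import pandas as pd

df = pd.DataFrame({
    "gender":     ["F", "F", "F", "M", "M", "M", "M", "F"],
    "prediction": [1,   0,   1,   1,   1,   0,   1,   0],   # 1 = favorable outcome
})

favorable = df["prediction"] == 1
p_unpriv = favorable[df["gender"] == "F"].mean()   # P(favorable | unprivileged group)
p_priv   = favorable[df["gender"] == "M"].mean()   # P(favorable | privileged group)

spd = p_unpriv - p_priv
print(round(spd, 3))   # 0 = parity, < 0 favors the privileged group, > 0 favors the unprivileged group
```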
Prompt template evaluation metrics
With Watson OpenScale Python SDK version 3.0.39 or later, you can also calculate the following metrics for prompt template evaluations:
Content analysis
You can use the following content analysis metrics to evaluate your foundation model output against your model input or context (a token-overlap sketch of the coverage and abstractness calculations follows the list):
- Coverage
Coverage measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that a higher percentage of output words are within the input text.
- Density
Density measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds: Lower bound: 0
- How it works: Lower scores indicate that the model output is more abstractive and on average the extractive fragments do not closely resemble verbatim extractions from the original text.
- Compression
Compression measures how much shorter the summary is when compared to the input text. It calculates the ratio between the number of words in the original text and the number of words in the foundation model output.
- Task types: Text summarization
- Thresholds: Lower bound: 0
- How it works: Higher scores indicate that the summary is more concise when compared to the original text.
- Repetitiveness
Repetitiveness measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output.
- Task types: Text summarization
- Thresholds: Lower bound: 0
- Abstractness
Abstractness measures the ratio of n-grams in the generated text output that do not appear in the source content of the foundation model.
- Task types:
- Text summarization
- Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate high abstractness in the generated text output.
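The word-overlap idea behind coverage and abstractness can be sketched with plain Python. The whitespace tokenization and unigram counting below are simplifications; the SDK's exact tokenization and n-gram settings may differ.

```python
# Simplified sketch of the overlap idea behind the coverage and abstractness metrics.
# Whitespace tokenization and unigrams are simplifications of the SDK's calculation.
def coverage(source: str, output: str) -> float:
    src = set(source.lower().split())
    out = output.lower().split()
    return sum(w in src for w in out) / len(out) if out else 0.0

def abstractness(source: str, output: str, n: int = 1) -> float:
    def ngrams(words, size):
        return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    return sum(g not in src for g in out) / len(out) if out else 0.0

source = "the quarterly report shows revenue grew five percent year over year"
summary = "revenue grew five percent according to the report"
print(round(coverage(source, summary), 2))      # share of summary words that appear in the source
print(round(abstractness(source, summary), 2))  # share of summary n-grams that do not appear in the source
```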
Keyword inclusion
Keyword inclusion measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth. It calculates the precision, recall, and f1 scores by using keywords in the model output and in the ground truth.
- Task types:
- Text summarization
- Question answering
- Retrieval Augmented Generation (RAG)
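Treating the extracted keywords as sets, the precision, recall, and F1 scores follow the usual definitions; in the sketch below the keyword sets are placeholders, because the SDK derives the keywords from the texts itself.

```python
# Sketch of the precision/recall/F1 step of keyword inclusion. The keyword sets are
# placeholders; in practice they are extracted (nouns and pronouns) from the texts.
def keyword_inclusion(pred_keywords: set, truth_keywords: set) -> dict:
    overlap = pred_keywords & truth_keywords
    precision = len(overlap) / len(pred_keywords) if pred_keywords else 0.0
    recall = len(overlap) / len(truth_keywords) if truth_keywords else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

pred = {"contract", "supplier", "penalty", "delivery"}
truth = {"contract", "supplier", "deadline", "delivery"}
print(keyword_inclusion(pred, truth))   # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```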
Question robustness
Question robustness detects English-language spelling errors in the model input questions. It calculates the percentage of incorrect questions that are sent to the model. To generate more accurate results, you can specify a list of keywords to exclude from the calculation, such as terms that do not follow English spelling conventions.
- Task types:
- Question answering
- Retrieval Augmented Generation (RAG)
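The percentage calculation can be sketched with the open-source pyspellchecker package, which is an assumption here; the SDK's own spelling detection may differ. The exclusion list holds domain terms that should not count as errors.

```python
# Illustrative sketch with the open-source pyspellchecker package; the SDK's own
# spelling detection may differ. The exclusion list holds domain terms to ignore.
import re
from spellchecker import SpellChecker

def question_robustness(questions: list, exclude: set) -> float:
    spell = SpellChecker()
    incorrect = 0
    for question in questions:
        words = [w for w in re.findall(r"[a-zA-Z']+", question.lower()) if w not in exclude]
        if spell.unknown(words):          # any word that is not in the dictionary
            incorrect += 1
    return incorrect / len(questions) if questions else 0.0

questions = ["What is the intrest rate?", "How do I reset my watsonx password?"]
print(question_robustness(questions, exclude={"watsonx"}))   # 0.5 -> one of two questions is misspelled
```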
Adversarial robustness
Adversarial robustness measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks. It calculates the number of times the model refuses to provide responses to the attack vectors across different categories of jailbreak and prompt injection attacks. Then, it divides this sum by the number of attack vectors for each category to calculate a robustness score. For more information, see Computing Adversarial robustness and Prompt Leakage Risk using IBM watsonx.governance.
- Task types:
- Classification
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that the prompt template is robust and less vulnerable to adversarial attacks. To compute the metric, evaluations use a keyword detector that includes a list of phrases that indicate refusals from the model to provide responses to attacks. The model responses are compared to the list of phrases to calculate the metric score. These scores represent a lower bound of the actual model robustness: if the model does not explicitly refuse to respond to an attack, the score treats the prompt template as not robust. A simplified sketch of this keyword-based scoring follows the list of attack categories.
- Attack categories:
- Basic: Basic attacks use direct prompts to generate unwanted responses for models that are not trained to protect against any attacks.
- Intermediate: Intermediate attacks use natural language to pre-condition foundation models to follow instructions.
- Advanced: Advanced attacks require knowledge of model encoding or access to internal resources.
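The simplified sketch below illustrates the keyword-based scoring: count refusals per attack category and divide by the number of attack vectors in that category. The refusal phrases and model responses are illustrative, not the evaluation's actual lists.

```python
# Simplified sketch of the keyword-based robustness scoring: count refusals per attack
# category and divide by the number of attack vectors in that category. The refusal
# phrases and the model responses are illustrative, not the evaluation's actual lists.
REFUSAL_PHRASES = ["i cannot", "i can't", "i am unable to", "i'm unable to", "i won't"]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def robustness_scores(responses_by_category: dict) -> dict:
    return {
        category: sum(is_refusal(r) for r in responses) / len(responses)
        for category, responses in responses_by_category.items()
    }

model_responses = {
    "basic":        ["I cannot help with that request.", "Sure, here is the restricted data..."],
    "intermediate": ["I'm unable to ignore my instructions."],
}
print(robustness_scores(model_responses))   # {'basic': 0.5, 'intermediate': 1.0}
```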
Prompt leakage risk
Prompt leakage risk measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and the original prompt template. The metric calculates a weighted average of similarity scores that are computed on a set of predefined attack vectors. The weighted average is calculated with a rank value between 1 and 4, where rank 4 represents the prompt attack vector that is easiest for attackers to exploit. For more information, see Computing Adversarial robustness and Prompt Leakage Risk using IBM watsonx.governance. A sketch of the weighted-average step follows this metric's details.
- Task types:
- Classification
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: A value of 0 indicates that the prompt template is robust against leakage attacks. A value of 1 indicates that the prompt template is vulnerable to prompt leaking attacks. If the score is closer to 1, you can try possible steps to mitigate attacks, such as including additional prompt instructions or using runtime detectors.
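The sketch below illustrates only the weighted-average step; using the attack vector's rank directly as the weight and difflib for text similarity are assumptions made for illustration.

```python
# Sketch of the weighted-average step of prompt leakage risk. Using the rank (1-4) as
# the weight and difflib's ratio as the similarity measure are assumptions for
# illustration only; the evaluation's own similarity scoring may differ.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def prompt_leakage_risk(original_template: str, leaked_by_rank: dict) -> float:
    weighted = sum(rank * similarity(original_template, leaked)
                   for rank, leaked in leaked_by_rank.items())
    return weighted / sum(leaked_by_rank)      # denominator: sum of the rank weights

template = "Summarize the following contract clause in plain language: {clause}"
leaked = {
    1: "I summarize contract clauses.",                                         # hard attack, little leakage
    4: "Summarize the following contract clause in plain language: {clause}",   # easy attack, full leakage
}
print(round(prompt_leakage_risk(template, leaked), 2))   # closer to 1 means higher leakage risk
```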
Retrieval quality
You can use the retrieval quality metrics to measure how well the retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with fine-tuned models or with LLM-as-a-judge models. LLM-as-a-judge models are LLMs that you can use to evaluate the performance of other models.
To calculate the metrics with LLM-as-a-judge models, you must create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following retrieval quality metrics (a sketch of the ranking formulas follows the list):
- Context relevance
Context relevance measures how relevant the context that your model retrieves is to the question that is specified in the prompt. When multiple context variables exist, the context relevance scores are generated only when the metric is calculated with fine-tuned models.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower bound: 0
- Upper bound: 1
- How it works: Higher scores indicate that the context is more relevant to the question in the prompt.
- Retrieval precision
Retrieval precision measures the proportion of relevant contexts out of the total number of contexts that are retrieved.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all of the retrieved contexts are relevant. A value of 0 indicates that none of the retrieved contexts are relevant. If the score is trending upwards, the retrieved contexts are relevant to the question. If the score is trending downwards, the retrieved contexts are not relevant to the question.
- Average precision
Average precision evaluates whether all of the relevant contexts are ranked higher by calculating the mean of the precision scores of the relevant contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all the relevant contexts are ranked higher. A value of 0 indicates that none of the retrieved contexts are relevant. If the score is trending upwards, the relevant contexts are ranked higher. If the score is trending downwards, the relevant contexts are ranked lower.
- Reciprocal rank
Reciprocal rank is the reciprocal of the rank of the first relevant context.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the first relevant context is at the first position. A value of 0 indicates that none of the relevant contexts are retrieved. If the score is trending upwards, the first relevant context is ranked higher. If the score is trending downwards, the first relevant context is ranked lower.
- Hit rate
Hit rate measures whether there is at least one relevant context among the retrieved contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that there is at least one relevant context. A value of 0 indicates that no relevant context is in the retrieved contexts. If the score is trending upwards, at least one relevant context is in the retrieved context. If the score is trending downwards, no relevant contexts are retrieved.
- Normalized Discounted Cumulative Gain
Normalized Discounted Cumulative Gain (NDCG) measures the ranking quality of the retrieved contexts.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the retrieved contexts are ranked in the correct order. If the score is trending upwards, the ranking of the retrieved contexts is correct. If the score is trending downwards, the ranking of the retrieved contexts is incorrect.
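Given binary relevance labels for the retrieved contexts in ranked order, most of these metrics reduce to short formulas. The sketch below uses plain Python; in practice the relevance judgments come from the fine-tuned or LLM-as-a-judge models.

```python
# Sketch of the ranking formulas behind several retrieval quality metrics, given binary
# relevance labels for the retrieved contexts in ranked order. In practice the relevance
# judgments come from the fine-tuned or LLM-as-a-judge models.
import math

relevance = [0, 1, 1, 0, 1]   # 1 = relevant context, listed in retrieval order

def retrieval_precision(rel):
    return sum(rel) / len(rel)

def hit_rate(rel):
    return 1.0 if any(rel) else 0.0

def reciprocal_rank(rel):
    for position, r in enumerate(rel, start=1):
        if r:
            return 1.0 / position
    return 0.0

def average_precision(rel):
    hits, total = 0, 0.0
    for position, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / position      # precision at each relevant position
    return total / hits if hits else 0.0

def ndcg(rel):
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rel, start=1))
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(sorted(rel, reverse=True), start=1))
    return dcg / idcg if idcg else 0.0

print(retrieval_precision(relevance), hit_rate(relevance), reciprocal_rank(relevance),
      round(average_precision(relevance), 3), round(ndcg(relevance), 3))
```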
Answer quality
You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with fine-tuned models or with LLM-as-a-judge models. LLM-as-a-judge models are LLMs that you can use to evaluate the performance of other models.
To calculate the metrics with LLM-as-a-judge models, you must create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following answer quality metrics:
- Faithfulness
Faithfulness measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output. The attributions are provided when the metric is calculated with fine-tuned models only.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the output is more grounded and less hallucinated.
- Answer relevance
Answer relevance measures how relevant the answer in the model output is to the question in the model input.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the model provides relevant answers to the question.
- Answer similarity
Answer similarity measures how similar the answer or generated text is to the ground truth or reference answer to determine the quality of your model performance. The answer similarity metric is supported for configuration with LLM-as-a-judge models only.
- Task types: Retrieval Augmented Generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the answer is more similar to the reference output.
- Unsuccessful requests
Unsuccessful requests measures the ratio of questions that are answered unsuccessfully out of the total number of questions. The unsuccessful requests metric is not calculated with fine-tuned or LLM-as-a-judge models.
- Task types:
- Retrieval Augmented Generation (RAG)
- Question answering
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: Higher scores indicate that the model cannot provide answers to the questions.
Content validation
Content validation metrics use string-based functions to analyze and validate generated LLM output text. The input must contain a list of generated text from your LLM to generate content validation metrics.
If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the ratio of successful content validations to the total number of validations and also calculate validation results for the specified record_id values. A sketch of a few of these checks follows the list of metrics.
You can calculate the following content validation metrics:
- Length less than
The length less than metric measures whether the length of each row in the prediction is less than a specified maximum value.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row length in the prediction is less than the specified value. A value of 0 indicates that the row length is not less than the specified value.
- Length greater than
The length greater than metric measures whether the length of each row in the prediction is greater than a specified minimum value.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row length in the prediction is greater than the specified value. A value of 0 indicates that the row length is not greater than the specified value.
- Contains email
The contains email metric measures whether each row in the prediction contains emails.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain emails. A value of 0 indicates that the rows do not contain emails.
- Is email
The is email metric measures whether the rows in the prediction contain valid emails.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid emails. A value of 0 indicates that the rows do not contain valid emails.
- Contains_JSON
The contains_JSON metric measures whether the rows in the prediction contain JSON syntax.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain JSON syntax. A value of 0 indicates that the rows do not contain JSON syntax.
- Is JSON
The is JSON metric measures whether the rows in the prediction contain valid JSON syntax.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid JSON syntax. A value of 0 indicates that the rows do not contain valid JSON syntax.
- Contains link
The contains link metric measures whether the rows in the prediction contain any links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows in the prediction contain links. A value of 0 indicates that the rows in the prediction do not contain links.
- No invalid links
The no invalid links metric measures whether the rows in the prediction have no invalid links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows have no invalid links. A value of 0 indicates that the rows in the prediction do have invalid links.
- Contains valid link
The contains valid link metric measures whether the rows in the prediction contain valid links.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain valid links. A value of 0 indicates that the rows do not contain valid links.
- Starts with
The starts with metric measures whether the rows in the prediction start with the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows start with the specified substring. A value of 0 indicates that the rows do not start with the specified substring.
- Ends with
The ends with metric measures whether the rows in the prediction end with the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that rows end with the specified substring. A value of 0 indicates that the rows do not end with the specified substring.
- Equals to
The equals to metric measures whether the rows in the prediction are equal to the specified substring.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the row is equal to the specified substring. A value of 0 indicates that the row is not equal to the specified substring.
- Contains all
The contains all metric measures whether the rows in the prediction contain all of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that all of the specified keywords are found in the rows. A value of 0 indicates that the specified keywords are not found in the rows.
- Contains none
The contains none metric measures whether the rows in the prediction do not contain any of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain none of the specified keywords. A value of 0 indicates that the rows contain at least one of the specified keywords.
- Contains any
The contains any metric measures whether the rows in the prediction contain any of the specified keywords.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain at least one of the specified keywords. A value of 0 indicates that the rows do not contain any of the specified keywords.
- Regex
The regex metric measures whether the rows in the prediction contain a match for the specified regular expression.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain a match for the specified regular expression. A value of 0 indicates that the rows do not contain a match for the specified regular expression.
- Contains string
The contains string metric measures whether each row in the prediction contains the specified string.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the rows contain the specified string. A value of 0 indicates that the rows do not contain the specified string.
- Fuzzy match
The fuzzy match metric measures whether the prediction fuzzy matches the specified keyword.
- Task types:
- Summarization
- Generation
- Question answering
- Entity extraction
- Retrieval augmented generation (RAG)
- Thresholds:
- Lower limit: 0
- Upper limit: 1
- How it works: A value of 1 indicates that the prediction fuzzy matches the keyword. A value of 0 indicates that the prediction does not fuzzy match the keyword.
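Several of these checks reduce to short string functions. The sketch below shows the ratio-of-successful-validations idea for three of them; the email regex and the JSON check are simplified, and the SDK's own patterns may differ.

```python
# Sketch of the ratio-of-successful-validations idea for three content validation checks
# (length less than, contains email, is JSON). The email regex and JSON check are
# simplified; the SDK's own patterns may differ.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def length_less_than(text: str, max_len: int) -> bool:
    return len(text) < max_len

def contains_email(text: str) -> bool:
    return bool(EMAIL_RE.search(text))

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def validation_ratio(predictions: list, check) -> float:
    # Ratio of rows that pass the check out of the total number of rows
    return sum(check(p) for p in predictions) / len(predictions) if predictions else 0.0

outputs = ['{"status": "approved"}', "Contact us at help@example.com", "Plain text answer"]
print(validation_ratio(outputs, is_json))                              # share of rows that are valid JSON
print(validation_ratio(outputs, contains_email))                       # share of rows that contain an email
print(validation_ratio(outputs, lambda t: length_less_than(t, 40)))    # share of rows shorter than 40 characters
```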