Creating custom evaluations and metrics

To create custom evaluations, select a set of custom metrics to quantitatively track your model deployment and business application. You can define these custom metrics and use them alongside metrics that are generated by other types of evaluations.

You can use one of the following methods to manage custom evaluations and metrics:

Managing custom metrics with the Python SDK

To manage custom metrics with the Python SDK, you must perform the following tasks:

  1. Register the custom monitor with metrics definition.
  2. Enable the custom monitor.
  3. Store metric values.

An advanced tutorial notebook, which is referenced in the following sections as the working sample notebook, shows how to complete these tasks.

You can disable and re-enable custom monitoring at any time. You can remove the custom monitor if you do not need it anymore.

For more information, see the Python SDK documentation.

Register the custom monitor with metrics definition

Before you can use custom metrics, you must register the custom monitor, which is the processor that tracks the metrics. You also must define the metrics themselves.

  1. Define a helper method, such as get_definition(monitor_name), to check whether the monitor is already registered.
  2. Use MonitorMetricRequest objects to define the metrics, which require name and thresholds values.
  3. Use MonitorTagRequest objects to define metadata tags.

The following code is from the working sample notebook that was previously mentioned:

def get_definition(monitor_name):
    monitor_definitions = wos_client.monitor_definitions.list().result.monitor_definitions

    for definition in monitor_definitions:
        if monitor_name == definition.entity.name:
            return definition

    return None


monitor_name = 'my model performance'
metrics = [MonitorMetricRequest(name='sensitivity',
                                thresholds=[MetricThreshold(type=MetricThresholdTypes.LOWER_LIMIT, default=0.8)]),
           MonitorMetricRequest(name='specificity',
                                thresholds=[MetricThreshold(type=MetricThresholdTypes.LOWER_LIMIT, default=0.75)])]
tags = [MonitorTagRequest(name='region', description='customer geographical region')]

existing_definition = get_definition(monitor_name)

if existing_definition is None:
    custom_monitor_details = wos_client.monitor_definitions.add(name=monitor_name, metrics=metrics, tags=tags, background_mode=False).result
else:
    custom_monitor_details = existing_definition

To check how you're doing, run the wos_client.monitor_definitions.list() command to see whether your newly created monitor and metrics are configured properly.

You can also get the monitor ID by running the following command:

custom_monitor_id = custom_monitor_details.metadata.id

print(custom_monitor_id)

For a more detailed look, run the following command:

custom_monitor_details = wos_client.monitor_definitions.get(monitor_definition_id=custom_monitor_id).result
print('Monitor definition details:', custom_monitor_details)

Enable the custom monitor

Next, you must enable the custom monitor for your subscription. This step activates the monitor and sets the thresholds.

  1. Use the Target object to identify the subscription that the monitor applies to.
  2. Use MetricThresholdOverride objects to set the metric lower_limit values. Supply the metric_id value as one of the parameters. If you don't remember the metric IDs, you can always use the custom_monitor_details command to get the details as shown in the previous example.

The following code is from the working sample notebook that was previously mentioned:

target = Target(
        target_type=TargetTypes.SUBSCRIPTION,
        target_id=subscription_id
    )

thresholds = [MetricThresholdOverride(metric_id='sensitivity', type=MetricThresholdTypes.LOWER_LIMIT, value=0.9)]

# Pass the threshold overrides so that the new lower limit takes effect.
custom_monitor_instance_details = wos_client.monitor_instances.create(
            data_mart_id=data_mart_id,
            background_mode=False,
            monitor_definition_id=custom_monitor_id,
            target=target,
            thresholds=thresholds
).result

To check on your configuration details, run the wos_client.monitor_instances.get(monitor_instance_id=custom_monitor_instance_details.metadata.id) command.

Store metric values

You must store, or save, your custom metrics in the region where your service instance exists.

  1. Use a MonitorMeasurementRequest object to specify the metric values that you are storing.
  2. Use the wos_client.monitor_instances.measurements.add method to commit the metrics.

The following code is from the working sample notebook that was previously mentioned:

from datetime import datetime, timezone
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import MonitorMeasurementRequest

# The monitor instance ID comes from the instance that was created earlier.
custom_monitor_instance_id = custom_monitor_instance_details.metadata.id
custom_monitoring_run_id = "11122223333111abc"
measurement_request = [MonitorMeasurementRequest(timestamp=datetime.now(timezone.utc),
                                                 metrics=[{"specificity": 0.78, "sensitivity": 0.67, "region": "us-south"}],
                                                 run_id=custom_monitoring_run_id)]
print(measurement_request[0])

published_measurement_response = wos_client.monitor_instances.measurements.add(
    monitor_instance_id=custom_monitor_instance_id,
    monitor_measurement_request=measurement_request).result
published_measurement_id = published_measurement_response[0]["measurement_id"]
print(published_measurement_response)

To retrieve the stored measurement, run the following command:

published_measurement = wos_client.monitor_instances.measurements.get(monitor_instance_id=custom_monitor_instance_id, measurement_id=published_measurement_id).result
print(published_measurement)
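The sensitivity and specificity values that are stored in the measurement request must be computed from your model's results before they are published. The following sketch, which is illustrative and not part of the sample notebook, derives both values from binary labels and predictions:

```python
# Illustrative helper: compute the sensitivity and specificity values
# that a measurement request stores, from binary labels and predictions.
# The label arrays below are sample data, not output from the notebook.

def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return sensitivity, specificity

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print({"sensitivity": sens, "specificity": spec})
```

The resulting values are what you would place in the metrics dictionary of the MonitorMeasurementRequest.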

Managing custom metrics with the custom metrics provider and Python SDK

To manage custom metrics with the custom metrics provider and the Python SDK, you must perform the following tasks:

  1. Register the custom monitor with metrics definition.

  2. Implement the metric endpoint.

  3. Create the custom metrics provider.

  4. Enable the custom monitor.

An advanced tutorial notebook shows how to complete these tasks.

For more information, see the Python SDK documentation.

Implement the metric endpoint

To generate a custom metric that includes record-level metrics, implement the custom metric and record-level metrics logic in the Custom Metrics Provider (Python Function or REST API).

Record-level metrics are metrics that are computed per row on the feedback or payload logging dataset, or on any other dataset that is used in a custom monitor. You can include record-level metrics in custom monitor implementations by computing the metrics in the custom metrics provider (Python function or REST API), saving them to a custom dataset, and viewing them in the Custom Monitors page in the UI.

For example, suppose that you want to compute the HAP metric on user inputs and outputs for each record (for security or compliance tracking). You can do this by implementing a record-level metric. When users view the evaluation results, they can click a timestamp to see the user input and user output records.
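A record-level check of this kind can be sketched as follows. The hap_score function and its keyword list are hypothetical stand-ins for a real HAP detection model:

```python
# Hypothetical record-level HAP metric: score each input/output record.
# A production implementation would call a real HAP detection model;
# the keyword list here is only a stand-in for illustration.
FLAGGED_TERMS = {"hateful", "abusive", "profane"}

def hap_score(text):
    """Return 1.0 if the text contains a flagged term, else 0.0."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & FLAGGED_TERMS else 0.0

records = [
    {"input": "What is my account balance?", "output": "Your balance is $40."},
    {"input": "You are abusive!", "output": "I cannot help with that."},
]

# One metric row per record, ready to save to the custom dataset.
record_metrics = [
    {"input_hap": hap_score(r["input"]), "output_hap": hap_score(r["output"])}
    for r in records
]
print(record_metrics)
```

Each row of record_metrics is what would be written to the custom dataset, so that clicking a timestamp in the UI can show the per-record values.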

To implement the custom metric and record-level metrics logic in the Custom Metrics Provider (Python Function or REST API), do the following steps:

  1. Create a custom dataset.

    If you want to store record-level metrics, create a custom dataset after you define the custom monitor and its instance. The dataset can be created using the OpenScale SDK.

    Tip: In the simplified flow from the SDK, this step is not required. The flow automatically creates the custom dataset.
    custom_dataset_info = wos_client.custom_monitor.create_custom_dataset(
                    data_mart_id='data_mart_id_here',
                    subscription_id='subscription_id_here',
                    custom_monitor_id='custom_monitor_id_here'
                )
    
  2. Implement the logic in the Custom Metrics Provider (Python Function or REST API).

    • OpenScale sends inputs such as data_mart_id, subscription_id, and custom_monitor_id when it invokes the custom metrics provider.

    • Read the data from the feedback table, payload logging table, or any other dataset by using the input payload variables.

    • Compute the required record-level metrics for each row and save the computed record-level metrics to the custom dataset.

    • Compute the custom metrics and publish the aggregated metrics to the measurements API.

    • Update the monitor run status to Finished. A full working example is provided in the simplified and detailed notebooks.
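The steps above can be sketched as a single provider function. The names here (fetch_feedback_rows, provider) and the stubbed data access are illustrative assumptions, not OpenScale APIs; in a real provider, the stubs would be replaced by calls to read the dataset, save record-level metrics, and publish measurements:

```python
# Sketch of a custom metrics provider body. Data access and publishing
# are replaced by stubs; only the control flow is illustrated.

def fetch_feedback_rows(data_mart_id, subscription_id):
    # Stub: a real provider reads the feedback or payload logging table here.
    return [{"label": 1, "prediction": 1}, {"label": 0, "prediction": 1}]

def provider(payload):
    # OpenScale supplies these identifiers in the input payload.
    rows = fetch_feedback_rows(payload["data_mart_id"], payload["subscription_id"])

    # Record-level metrics, one entry per row (saved to the custom dataset).
    record_metrics = [
        {"correct": 1.0 if r["label"] == r["prediction"] else 0.0} for r in rows
    ]

    # Aggregated metric (published to the measurements API).
    accuracy = sum(m["correct"] for m in record_metrics) / len(record_metrics)

    # Final step: report the run as finished with the aggregate values.
    return {"status": "finished", "metrics": {"accuracy": accuracy},
            "record_metrics": record_metrics}

result = provider({"data_mart_id": "dm-1", "subscription_id": "sub-1",
                   "custom_monitor_id": "mon-1"})
print(result["metrics"])
```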

Create the custom metrics provider

Define the custom metric provider endpoint details with the authentication information. OpenScale generates the token and invokes the REST endpoint with the token at run time.

The following code is from the working sample notebook:

wos_client.integrated_systems.add(name="Custom Metrics Provider",
    description="Custom Metrics Provider", type="custom_metrics_provider",
    credentials={
        "auth_type": "bearer",
        "token_info": {
           "url": IAM_URL,
           "headers": {"Content-type": "application/x-www-form-urlencoded"},
           "payload": "grant_type=urn:ibm:params:oauth:grant-type:apikey&response_type=cloud_iam&apikey=<api_key>",
           "method": "POST"
        }
    },
    connection={
        "display_name": "Custom Metrics Provider",
        "endpoint": rest_endpoint_url
    }
).result
Tip: The simplified flow from the SDK automatically registers the custom monitor, enables the custom monitor, and creates the integrated system provider. All you need to do is implement the metric endpoint.

Managing custom metrics with the custom metrics provider and user interface

Do the following steps:

  1. Add metric groups.
  2. Implement the metric endpoint.
  3. Add the metric endpoint.
  4. Configure custom monitors.

Add metric groups

  1. On the home page, click Configure, and then click Metric groups.
  2. Click Add metric group.
  3. To configure a metric group by using a JSON file, click Import from file. Upload a JSON file and click Import.
  4. To configure a metric group manually, click Configure new group.
    1. Type a name for the metric group, and click Apply. The name must be 48 characters or fewer.
    2. Click the Edit icon on the Model types to support tile and select one or more model types that your evaluation supports. Click Next.
    3. To specify an evaluation schedule, click the toggle and specify the interval. Click Next.
    4. Specify the details for the input parameters. For each input parameter, enter the details and then click Add. The parameter name that you specify must match the parameter name that is specified in the metric API. If a parameter is required to configure your custom monitor, select the Required parameter checkbox.
    5. Click Save.

Add metric endpoints

  1. On the home page, click Configure, and then click Metric endpoints.

  2. Click Add metric endpoint.

  3. Specify a name and a description for the metric endpoint.

  4. Click the Edit icon on the Connection tile and specify the connection details. Click Next.

  5. Select the metric groups that you want to associate with the metric endpoint and click Save.

    If you enable batch support when you specify a custom watsonx.ai Runtime endpoint URL, you can add the following input parameters to a metric group:

Input parameters

    Parameter                     Datatype   Value
    custom_metrics_provider_type  String     wml_batch
    space_id                      String     Your space ID
    deployment_id                 String     watsonx.ai Runtime deployment ID of your custom metric endpoint
    hardware_spec_id              String     watsonx.ai Runtime hardware spec ID of your custom metric endpoint
    custom_metrics_wait_time      Int        Wait time in seconds (for example, 60)

Configure custom monitors

  1. On the home page, click Insights Dashboard.
  2. On a model deployment tile, select Configure monitors.
  3. In the Evaluations section, select the name of the metric group that you added.
  4. Select the Edit icon on the Metric endpoint tile.
  5. Select a metric endpoint and click Next. If you don't want to use a metric endpoint, select None.
  6. Use the toggles to specify the metrics that you want to use to evaluate the model and provide threshold values. Click Next.
  7. Specify values for the input parameters. If you selected JSON as the data type for the metric group, add the JSON data. Click Next.

You can now evaluate models with a custom monitor.

Accessing and visualizing custom metrics

Visualization of your custom metrics appears on the Insights Dashboard.

The RAG Quality monitor page shows a time series graph that displays metrics, including HAP and PII.

If you configured record-level metrics, click a time stamp in the chart to see the records.

A new window shows a table of records. The columns show the metric values.

If you're monitoring an LLM, you can also click a row to see the transaction details.

A new window shows the transaction details, including the input and output.

Learn more

Reviewing evaluation results