Keep your AI applications on track with AI runtime accuracy monitoring

Learn how IBM Watson OpenScale enables AI accuracy monitoring and closes the feedback loop for your machine learning models:

  • Why is maintaining model accuracy a challenge?
  • What is AI runtime accuracy monitoring, and how does it work?
  • How can we configure AI accuracy monitoring in IBM Watson OpenScale?

When we talk about measuring the accuracy of an AI machine learning model, we need to consider two distinct parts of the model’s lifecycle: the build stage and the deployment stage.

In the build stage, data scientists train an AI model against a training data set, and then evaluate its performance by comparing the model’s output against the labeled test data to calculate appropriate accuracy metrics.

The type of metric depends on the type of model—for a regression algorithm, you would use the coefficient of determination (R²); for a binary classification algorithm, the area under an ROC curve; and for a multi-class classification algorithm, the number of times any class was predicted correctly, normalized by the number of data points.
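
To make these three metrics concrete, here is a minimal plain-Python sketch of each, run on small toy arrays (the values are illustrative only, not drawn from any real model):

```python
# Sketch of the three build-stage metrics described above, on toy data.

def r2_score(y_true, y_pred):
    """Coefficient of determination (R^2) for a regression model."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def roc_auc(y_true, scores):
    """Area under the ROC curve for a binary classifier: the probability
    that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_pred):
    """Correct predictions normalized by the number of data points
    (the multi-class metric described above)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(r2_score([3.0, 2.5, 4.0, 5.1], [2.8, 2.7, 3.9, 5.0]))  # ~0.975
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))          # 0.75
print(accuracy(["a", "b", "c"], ["a", "b", "b"]))            # ~0.667
```

In practice a library such as scikit-learn provides equivalent functions; the point here is simply what each metric measures for its model type.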

These measurements help the data science team judge whether a model is ready for production; if so, it can be integrated into business applications and move into the deployment stage.

Deployment creates a blind spot for accuracy monitoring

The deployment stage is when AI accuracy becomes truly important, because it’s the point when the business starts using the output of the model to make decisions. Ironically, it’s also the point at which accuracy becomes much more difficult to measure. Instead of having a neat set of labeled training and testing data to compare against, the model is handling unlabeled real-world data at scale, and in many cases there is no easy way to tell whether its accuracy is degrading over time.

If the AI model does start to falter, and the business doesn’t realize that its recommendations are no longer reliable, it can have a negative impact on decision-making and potentially damage the business. A bank might find that it has rejected customers’ loan applications unnecessarily, for example; or a manufacturer might suffer downtime on its production line because its predictive maintenance model fails to anticipate a breakdown in one of its machines.

To mitigate these risks, data science teams will periodically retrain and redeploy AI models. However, the choice of when to retrain is typically ad hoc, and often comes too late. As a result, the data science team may spend valuable time retraining models that are still performing well, or may leave invalid models in production for too long, allowing them to make poor predictions that impact business decision-making.

Achieve AI runtime accuracy monitoring with IBM Watson OpenScale

IBM Watson OpenScale offers a better approach by using its built-in payload and feedback databases to enable AI accuracy monitoring during runtime.

The payload database captures the real-time data that each model is scoring. A sample of this payload data can be assigned to human reviewers (who could be subject-matter experts from the relevant area of the business, or external crowdsourcing specialists such as Figure Eight) for manual labeling.

Once these experts have labeled the data, it is automatically added to the feedback database. Watson OpenScale uses the feedback database to calculate a new accuracy metric for the model and displays it in the dashboard. From the dashboard, data scientists can then click through to examine the specific transactions that generated the inaccurate results, and decide whether retraining is justified. This evaluation runs continuously as long as sufficient data is available in the feedback database, providing an up-to-date view of model accuracy at runtime.
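
The loop described above can be sketched in a few lines of plain Python. All of the names below (`payload_db`, `feedback_db`, the helper functions) are illustrative assumptions for the sketch, not the Watson OpenScale API:

```python
# Sketch of the runtime feedback loop: sample scored payload records,
# collect human labels into a feedback store, and recompute accuracy.
import random

payload_db = [  # records the model scored at runtime (no label yet)
    {"id": i, "features": {"x": i}, "prediction": i % 2} for i in range(100)
]
feedback_db = []

def sample_for_review(payload, k=10, seed=0):
    """Pick a sample of payload records to send to human reviewers."""
    return random.Random(seed).sample(payload, k)

def record_feedback(record, human_label):
    """A reviewer supplies the true label; the record joins the feedback set."""
    feedback_db.append({**record, "label": human_label})

def runtime_accuracy(feedback, min_sample=10):
    """Recompute accuracy once enough labeled feedback has accumulated."""
    if len(feedback) < min_sample:
        return None  # not enough labeled data to evaluate yet
    correct = sum(r["prediction"] == r["label"] for r in feedback)
    return correct / len(feedback)

for rec in sample_for_review(payload_db):
    record_feedback(rec, human_label=rec["id"] % 2)  # stand-in for a reviewer

print(runtime_accuracy(feedback_db))  # 1.0 on this toy data
```

The key idea is that labeled feedback arrives incrementally at runtime, so the accuracy metric is refreshed continuously rather than only at build time.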

By generating real-time alerts if the accuracy score falls below an acceptable threshold, Watson OpenScale helps to ensure that data scientists and operations personnel are immediately notified as the model starts to degrade—reducing the risk of unknowingly using an unreliable model to make decisions.

At the same time, the solution helps to unburden the data science team and empower line-of-business subject-matter experts to play a more active role in AI model monitoring and management. These subject-matter experts know the data and the intended business outcomes, so they are often the best people to provide feedback on the model’s predictions. If real feedback data is not available, they are also in the best position to assess whether it is appropriate to source labeled data from alternate sources, such as a crowdsourcing vendor. As a result, by giving these experts greater control of the process, Watson OpenScale can help businesses troubleshoot problems faster and improve results.

How to set up AI accuracy monitoring in Watson OpenScale

Let’s take a look at how to set up an Accuracy Monitor in Watson OpenScale. First, navigate to the “What is Accuracy?” page, and click Next.

If your model was created using Apache Spark, you need to select a Spark instance to act as an engine for re-evaluating and retraining the model. You can either select an existing Spark instance or create a new one if necessary.

On the next screen, you specify the type of the model: binary classification, multi-class classification, or regression. This enables Watson OpenScale to calculate the appropriate accuracy metrics for the model type.

Next, you set the accuracy threshold as a percentage between 0 and 100. Once the Accuracy Monitor is up and running, it will automatically notify the model owner when the accuracy score falls below this threshold.
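
The threshold check itself is conceptually simple; a minimal sketch might look like the following, where the `notify` callback is a placeholder assumption standing in for whatever alerting channel is configured:

```python
# Sketch of threshold-based alerting: compare the latest runtime accuracy
# to the configured threshold and notify the model owner on a breach.
def check_accuracy(current_accuracy, threshold, notify=print):
    """Return True (and send a notification) if accuracy is below threshold."""
    if current_accuracy < threshold:
        notify(f"ALERT: accuracy {current_accuracy:.0%} "
               f"is below threshold {threshold:.0%}")
        return True
    return False

check_accuracy(0.82, 0.90)  # triggers an alert
check_accuracy(0.95, 0.90)  # no alert
```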

Finally, you set a minimum sample size for the evaluation data set in the feedback database. This will prevent the Accuracy Monitor from running the evaluation and updating the accuracy score until the selected number of records have been added to the feedback database. Setting an appropriate sample size is important, because a small sample could potentially skew the results.
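
A quick simulation illustrates why small samples skew the measurement: evaluating the same model (with a true accuracy of 90% in this simulated example) on small feedback samples produces wildly varying scores, while larger samples converge. The numbers below are simulated, not from Watson OpenScale:

```python
# Simulation: the spread of measured accuracy shrinks as sample size grows.
import random

def simulated_accuracy(true_accuracy, n, rng):
    """Measure accuracy from n feedback records, each correct with
    probability true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(n))
    return correct / n

rng = random.Random(42)
spreads = {}
for n in (10, 100, 1000):
    runs = [simulated_accuracy(0.9, n, rng) for _ in range(1000)]
    spreads[n] = max(runs) - min(runs)
    print(f"n={n:4d}: measured accuracy varied over a range of {spreads[n]:.2f}")
```

With only 10 feedback records, a 90%-accurate model can easily score anywhere from 60% to 100%, which is why a too-small minimum sample size can trigger spurious alerts or mask real degradation.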

When you click Save, the Accuracy Monitor will be provisioned and will start to run. To see the results, simply navigate to the Insights tab on the dashboard, where each model will now appear as a separate tile, displaying its current accuracy score and a warning if the score is lower than the threshold.

For more detail, you can click on one of the tiles to view charts and visualizations, showing how accuracy and other key metrics such as fairness and performance are changing over time.

This is just a brief glimpse of how IBM Watson OpenScale can transform the day-to-day management and monitoring of machine learning models and make it easier to run AI-infused applications safely and at scale.

Discover IBM Watson OpenScale for yourself.