A guide to monitoring machine learning

What is machine learning monitoring?

Machine learning model monitoring, or “ML model monitoring,” is a set of tools, practices and mechanisms to track whether deployed ML models still exhibit acceptable model performance. Increasingly, many firms view ML model monitoring as an integral part of the ML lifecycle.

The real world is different from a test environment. For one thing, testing tends to be stable while the world is always in flux. A machine learning model that tested well in its validation phase might perform less well in a real production environment as time goes on. Things change.

The problem underlying such “model drift” might come from any angle. Input data might change, or data quality might dip. Data sources can break, or pipelines can slow down.

In other words: A complex set of model inputs determines model outputs. To track it all, organizations develop machine learning monitoring practices.

How do mature organizations go about monitoring machine learning—and governing AI more broadly? Best practices are still emerging in this nascent field, but five principles can serve to guide organizations looking to develop their own programs.

1. Map metrics to actual business risk

With many monitoring tools, there is a risk of performing a kind of “observability theater.” Perhaps 99 out of 100 metrics on the dashboard look good, lulling an organization into a false sense of security. But an early strategic planning meeting might have identified that single outlying metric as the one that mattered most.

Not all metrics are created equal, and not all use cases carry the same weight; the cost of a bad model prediction can vary wildly between them.

A useful monitoring system has a set of organization-specific decisions and thresholds built around the numbers on the dashboard.

  • Which performance metrics matter most? 

  • What is the precise level at which a stakeholder needs to be alerted? 

  • What triggers troubleshooting, and who monitors notifications (and how much of this process can be automated)? 

  • When is it necessary to retrain a model altogether? And when would a more targeted fix, such as a quick debugging pass in Python, serve as a workable hotfix?
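Decisions like these can be encoded directly in the monitoring system. The sketch below shows one minimal way to do that; the metric names, thresholds, actions and owners are illustrative assumptions, not recommendations.

```python
# Hypothetical sketch: a per-metric alert policy mapping monitoring numbers to
# organization-specific actions. All names and thresholds are assumptions.

ALERT_POLICY = {
    "fraud_recall":   {"min": 0.92, "action": "page_on_call", "owner": "risk-team"},
    "latency_p99_ms": {"max": 250,  "action": "open_ticket",  "owner": "platform"},
    "null_rate":      {"max": 0.05, "action": "open_ticket",  "owner": "data-eng"},
}

def evaluate(metrics: dict) -> list:
    """Return the alerts triggered by the latest metric snapshot."""
    alerts = []
    for name, value in metrics.items():
        policy = ALERT_POLICY.get(name)
        if policy is None:
            continue  # metric exists on the dashboard but has no alert policy
        breached = (("min" in policy and value < policy["min"]) or
                    ("max" in policy and value > policy["max"]))
        if breached:
            alerts.append((name, value, policy["action"], policy["owner"]))
    return alerts

# Recall dipped below its floor and the null rate exceeded its cap.
print(evaluate({"fraud_recall": 0.90, "latency_p99_ms": 180, "null_rate": 0.07}))
```

The point of a structure like this is that thresholds and escalation paths become explicit, reviewable artifacts rather than tribal knowledge.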

Some organizations answer these questions on their own; others seek to bring in outside help.

2. Establish baselines for normal behavior

Ultimately, one can’t be sure that a production model is drifting if one hasn’t first defined what “normal” looks like.

What is the expected degree of model accuracy? What is acceptable latency and throughput? Rigorously benchmarking a model's starting quality pays off once live production data starts coming in. Without such baselines and benchmarks, troubleshooting might not be possible, precisely because “trouble” hasn’t been clearly defined.
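Capturing a baseline can be as simple as summarizing validation-phase behavior and persisting it alongside the model. The following sketch illustrates the idea; the specific fields and numbers are illustrative assumptions.

```python
# Hypothetical sketch: summarizing validation results into a baseline that
# production metrics can later be compared against. Values are illustrative.
import json
import statistics

def build_baseline(y_true, y_pred, latencies_ms):
    """Summarize validation behavior so production drift has a reference point."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "latency_p50_ms": statistics.median(latencies_ms),
        "latency_max_ms": max(latencies_ms),
    }

baseline = build_baseline(
    y_true=[1, 0, 1, 1, 0], y_pred=[1, 0, 1, 0, 0],
    latencies_ms=[12.0, 15.0, 11.0, 30.0, 14.0],
)
print(json.dumps(baseline, indent=2))  # persist alongside the model artifact
```

Versioning this summary with the model artifact itself makes “normal” a concrete, queryable record rather than a memory.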

3. Monitor input data directly

The model behavior is drifting. But is it necessarily the model’s fault? Regressions that look like bad algorithms might ultimately trace back to the source: data.

So-called “data drift” happens when production data no longer resembles the dataset used as training data. Perhaps outliers that occur in the real world simply weren’t present during training.

(While “data drift” is a change in the distribution of the data itself, the related “concept drift” refers to a change in the relationship between input data and the target variable. “Prediction drift” refers to changes in model outputs. IBM has a separate, thorough guide to drift detection.)

Ultimately, the best model monitoring strategies employ techniques to monitor data distribution as a crucial component of model performance. One common method is the Kolmogorov-Smirnov test, which helps detect whether the distribution of real-time production data differs significantly from the model training distribution.

Many ML platforms and libraries, from managed services such as Amazon SageMaker to open-source options such as Evidently AI, offer tools to probe data in these and other ways. IBM’s watsonx.governance solution also includes tools to detect drift in production data.

4. Measure real-world outcomes 

There is no better way to evaluate a model than to check its predictions against the ground truth. Any robust monitoring system will build in a practice of checking, perhaps manually, whether a model’s predictions indeed align with what is known to be true.

The problem is that ground truth is sometimes difficult or impossible to come by. A credit decision can take time to validate. A macro-economic prediction can take months to confirm or prove wrong. Under such uncertainty, a mature monitoring workflow uses proxy metrics that correlate to ground truth.

What a proxy metric looks like is bespoke to each model or use case. Metrics can range from changes in feature values to unusual subsets of traffic. In mature machine learning monitoring operations, data science teams spend time theorizing about which data signals can serve as proxies.
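One widely used proxy when labels are delayed is a shift in the model's own score distribution, often measured with the Population Stability Index (PSI). The sketch below is a minimal implementation; the score distributions, bin count and the conventional 0.2 "investigate" threshold are all illustrative assumptions.

```python
# Hypothetical sketch: Population Stability Index (PSI) over prediction scores,
# a common ground-truth proxy when labels arrive late. Values are illustrative.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a baseline score distribution and a recent one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin proportions to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline_scores = rng.beta(2, 5, size=10000)  # score shape seen at validation
recent_scores = rng.beta(3, 4, size=10000)    # production scores, shifted
value = psi(baseline_scores, recent_scores)
print(f"PSI={value:.3f}  ({'investigate' if value > 0.2 else 'stable'})")
```

A rising PSI does not prove accuracy has dropped, but it flags that the model is scoring a different population than it was validated on, which is often the earliest available warning.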

5. Monitor the surrounding production ecosystem

No model is an island. Models interface with pipelines, APIs, dashboards and any number of human workflows. Just as doctors often need to order multiple tests to perform a differential diagnosis, AI practitioners need to keep an open mind about what part of a model’s ecosystem might be diminishing functionality. 

For this reason, mature machine learning monitoring often eyes more than performance metrics alone. Ideally, it should track latency, throughput, ingest reliability and overall data quality.

Data scientists will inevitably have leading roles in ML monitoring teams, but by no means should theirs be the only voices. Practitioners often will need to use a holistic operational lens, drawing from fields like MLOps. As a first step, it might be worthwhile to collectively map the ecosystem around an AI model, creating a visualization of all key dependencies.
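Instrumenting the serving path itself is one practical starting point: wrap the prediction call so that latency and basic input-quality signals are recorded on every request. The sketch below is a hypothetical illustration; the function names and fields are assumptions.

```python
# Hypothetical sketch: wrapping a prediction call so each request records its
# latency and flags missing inputs. Names and structure are assumptions.
import time
from collections import deque

latency_log = deque(maxlen=1000)  # rolling window of recent call latencies

def monitored_predict(model_fn, features: dict):
    """Call the model, recording latency and flagging missing input fields."""
    missing = [k for k, v in features.items() if v is None]
    start = time.perf_counter()
    prediction = model_fn(features)
    elapsed_ms = (time.perf_counter() - start) * 1000
    latency_log.append(elapsed_ms)
    return {"prediction": prediction,
            "latency_ms": elapsed_ms,
            "missing_fields": missing}

# Toy stand-in for a deployed model.
result = monitored_predict(lambda f: 0.73, {"age": 41, "income": None})
print(result["missing_fields"])  # ['income']
```

In production, the rolling latency window and missing-field counts would feed the same dashboards and alert thresholds as the model-quality metrics, tying the ecosystem view back to the rest of the monitoring program.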

Author

David Zax

Staff Writer

IBM Think
