Machine learning model monitoring, or “ML model monitoring,” is a set of tools, practices and mechanisms to track whether deployed ML models still exhibit acceptable model performance. Increasingly, many firms view ML model monitoring as an integral part of the ML lifecycle.
The real world is different from a test environment. For one thing, testing tends to be stable while the world is always in flux. A machine learning model that tested well in its validation phase might perform less well in a real production environment as time goes on. Things change.
The problem underlying such “model drift” might come from any angle. Input data might change, or data quality might dip. Data sources can break, or pipelines can slow down.
In other words: A complex set of model inputs determines model outputs. To track it all, organizations develop machine learning monitoring practices.
How do mature organizations go about monitoring machine learning—and governing AI more broadly? Best practices are still emerging in this nascent field, but five principles can serve to guide organizations looking to develop their own programs.
With many monitoring tools, there is a risk of performing a kind of “observability theater.” Perhaps 99 out of 100 metrics on the dashboard look good, lulling an organization into a false sense of security. But an early strategic planning meeting might have identified that single outlying metric as the one that mattered most.
Not all metrics are created equal, and not all use cases matter equally: the cost of a bad model prediction can vary wildly from one to the next.
A useful monitoring system has a set of organization-specific decisions and thresholds built around the numbers on the dashboard.
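Such organization-specific decisions can be made concrete as explicit thresholds in code rather than implicit judgments about a dashboard. The sketch below is illustrative only; the metric names and limits are assumptions standing in for whatever a given organization's planning process identifies as mattering most.

```python
# Sketch: organization-specific thresholds around dashboard numbers.
# The metric names and limits below are illustrative assumptions.
THRESHOLDS = {
    "accuracy":   {"min": 0.92},   # the metric the planning meeting flagged
    "latency_ms": {"max": 250.0},
    "null_rate":  {"max": 0.05},
}

def evaluate(metrics: dict) -> list:
    """Return alerts for any metric outside its agreed threshold."""
    alerts = []
    for name, limits in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: missing from dashboard feed")
            continue
        if "min" in limits and value < limits["min"]:
            alerts.append(f"{name}: {value} below minimum {limits['min']}")
        if "max" in limits and value > limits["max"]:
            alerts.append(f"{name}: {value} above maximum {limits['max']}")
    return alerts
```

Note that a missing metric is itself treated as an alert: 99 healthy numbers cannot mask the one that has silently stopped reporting.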
Some organizations answer these questions on their own; others seek to bring in outside help.
Ultimately, one can’t be sure that a production model is drifting if one hasn’t first defined what “normal” looks like.
What is the expected degree of model accuracy? What is acceptable latency and throughput? Rigorously benchmarking the model's starting quality will pay off once live production data starts coming in. Without such baselines and benchmarks, troubleshooting might not be possible, precisely because "trouble" hasn't been clearly defined.
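One lightweight way to make "normal" concrete is to capture baseline numbers at validation time and persist them next to the model artifact. This is a minimal sketch, assuming classification labels and per-request latencies as the relevant inputs; the specific fields are illustrative.

```python
import json
import statistics

def capture_baseline(y_true, y_pred, latencies_ms):
    """Record validation-time baselines so 'trouble' is defined in advance.

    Inputs are assumptions for illustration: true labels, model
    predictions and per-request latencies from the validation run.
    """
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "latency_p50_ms": statistics.median(latencies_ms),
        "latency_max_ms": max(latencies_ms),
    }

# Persist alongside the model artifact for later comparison, e.g.:
# with open("baseline.json", "w") as f:
#     json.dump(capture_baseline(y_true, y_pred, latencies_ms), f)
```

Once live metrics arrive, comparing them against this file turns "the model seems slower" into "p50 latency is 40% above its validated baseline."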
The model behavior is drifting. But is it necessarily the model’s fault? Regressions that look like bad algorithms might ultimately trace back to the source: data.
So-called “data drift” happens when production data no longer resembles the dataset used as training data. Perhaps outliers that occur in the real world simply weren’t present during training.
(While “data drift” is a change in the distribution of the data itself, the related “concept drift” refers to a change in the relationship between input data and the target variable. “Prediction drift” refers to changes in model outputs. IBM has a separate, thorough guide to drift detection.)
Ultimately, the best model monitoring strategies employ techniques to monitor data distribution as a crucial component of model performance. One common method is the Kolmogorov-Smirnov test, which helps detect whether the distribution of real-time production data differs significantly from the model training distribution.
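As a rough sketch of how the Kolmogorov-Smirnov test is applied in practice, the two-sample version compares a feature's training distribution against a window of production values; the synthetic data and the p-value cutoff below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins: a training-time feature and a shifted live feature.
rng = np.random.default_rng(seed=0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=5000)  # mean has drifted

# Two-sample KS test: does the live distribution match training?
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # alert threshold is an illustrative choice
    print(f"Possible data drift (KS statistic={stat:.3f}, p={p_value:.3g})")
```

With large production windows, even tiny distributional differences become statistically significant, so teams often pair the p-value with a minimum effect size (the KS statistic itself) before alerting.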
Most managed ML platforms, such as Amazon SageMaker, and open-source tools such as Evidently AI have features to probe data in these and other ways. IBM's watsonx.governance solution also has tools to detect drift in production data.
There is no better way to evaluate a model than to check its predictions against the ground truth. Any robust monitoring system will build in a practice of checking, perhaps manually, whether a model’s predictions indeed align with what is known to be true.
The problem is that ground truth is sometimes difficult or impossible to come by. A credit decision can take time to validate. A macro-economic prediction can take months to confirm or prove wrong. Under such uncertainty, a mature monitoring workflow uses proxy metrics that correlate to ground truth.
What a proxy metric looks like is bespoke to each model or use case. Metrics can range from changes in feature values to unusual subsets of traffic. In mature machine learning monitoring operations, data science teams spend time theorizing on data types that can serve as proxies.
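As one concrete (and hypothetical) example of such a proxy, a team might track the share of live feature values falling outside the range observed during training. No labels are required, so the signal is available immediately, long before delayed ground truth arrives.

```python
def out_of_range_share(train_min, train_max, live_values):
    """Hypothetical proxy metric: fraction of live feature values that
    fall outside the range seen during training. A rising share can
    flag drift well before ground-truth labels become available."""
    if not live_values:
        return 0.0
    outside = sum(1 for v in live_values if v < train_min or v > train_max)
    return outside / len(live_values)
```

Whether out-of-range inputs actually correlate with bad predictions is a question each team must answer empirically for its own model, which is exactly the theorizing described above.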
No model is an island. Models interface with pipelines, APIs, dashboards and any number of human workflows. Just as doctors often need to order multiple tests to perform a differential diagnosis, AI practitioners need to keep an open mind about what part of a model’s ecosystem might be diminishing functionality.
For this reason, mature machine learning monitoring often eyes more than performance metrics alone. Ideally, it should track latency, throughput, ingest reliability and overall data quality issues.
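One simple pattern for capturing these operational signals is to wrap the prediction call itself, so latency and input completeness are logged on every request. This is a minimal sketch; `model_fn`, the feature dictionary and the log structure are all illustrative assumptions.

```python
import time

def monitored_predict(model_fn, features: dict, log: list):
    """Wrap a prediction call to record operational signals alongside
    the output: request latency and input completeness.

    model_fn and the feature schema are assumptions for illustration.
    """
    # Input-quality signal: share of missing feature values.
    null_rate = sum(v is None for v in features.values()) / max(len(features), 1)
    start = time.perf_counter()
    prediction = model_fn(features)
    latency_ms = (time.perf_counter() - start) * 1000.0
    log.append({"latency_ms": latency_ms, "null_rate": null_rate})
    return prediction
```

In production this log would feed a metrics backend rather than an in-memory list, but the principle is the same: operational health is recorded in the same breath as every prediction.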
Data scientists will inevitably have leading roles in ML monitoring teams, but by no means should theirs be the only voices. Practitioners often will need to use a holistic operational lens, drawing from fields like MLOps. As a first step, it might be worthwhile to collectively map the ecosystem around an AI model, creating a visualization of all key dependencies.