Machine learning model monitoring, or “ML model monitoring,” is a set of tools, practices and mechanisms to track whether deployed ML models still exhibit acceptable model performance. Increasingly, many firms view ML model monitoring as an integral part of the ML lifecycle.
The real world is different from a test environment. For one thing, testing tends to be stable while the world is always in flux. A machine learning model that tested well in its validation phase might perform less well in a real production environment as time goes on. Things change.
The problem underlying such “model drift” might come from any angle. Input data might change, or data quality might dip. Data sources can break, or pipelines can slow down.
In other words: A complex set of model inputs determines model outputs. To track it all, organizations develop machine learning monitoring practices.
How do mature organizations go about monitoring machine learning—and governing AI more broadly? Best practices are still emerging in this nascent field, but five principles can serve to guide organizations looking to develop their own programs.
With many monitoring tools, there is a risk of performing a kind of “observability theater.” Perhaps 99 out of 100 metrics on the dashboard look good, lulling an organization into a false sense of security. But an early strategic planning meeting might have identified that single outlying metric as the one that mattered most.
Not all metrics are created equal, and not all use cases matter equally: the cost of a bad model prediction can vary wildly from one to the next.
A useful monitoring system has a set of organization-specific decisions and thresholds built around the numbers on the dashboard.
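Such organization-specific decisions can be made concrete as explicit thresholds in code rather than implicit judgments about a dashboard. The sketch below is illustrative only; the metric names and limits are assumptions standing in for whatever a given organization's planning process identifies as mattering most.

```python
# Sketch: organization-specific thresholds around dashboard numbers.
# The metric names and limits below are illustrative assumptions.
THRESHOLDS = {
    "accuracy":   {"min": 0.92},   # the metric the planning meeting flagged
    "latency_ms": {"max": 250.0},
    "null_rate":  {"max": 0.05},
}

def evaluate(metrics: dict) -> list:
    """Return alerts for any metric outside its agreed threshold."""
    alerts = []
    for name, limits in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: missing from dashboard feed")
            continue
        if "min" in limits and value < limits["min"]:
            alerts.append(f"{name}: {value} below minimum {limits['min']}")
        if "max" in limits and value > limits["max"]:
            alerts.append(f"{name}: {value} above maximum {limits['max']}")
    return alerts
```

Note that a missing metric is itself treated as an alert: 99 healthy numbers cannot mask the one that has silently stopped reporting.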
Some organizations answer these questions on their own; others seek to bring in outside help.
Ultimately, one can’t be sure that a production model is drifting if one hasn’t first defined what “normal” looks like.
What is the expected degree of model accuracy? What is acceptable latency and throughput? Rigorously benchmarking the model's starting quality will pay off once live production data starts coming in. Without such baselines and benchmarks, troubleshooting might not be possible, precisely because "trouble" hasn't been clearly defined.
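One lightweight way to make "normal" concrete is to capture baseline numbers at validation time and persist them next to the model artifact. This is a minimal sketch, assuming classification labels and per-request latencies as the relevant inputs; the specific fields are illustrative.

```python
import json
import statistics

def capture_baseline(y_true, y_pred, latencies_ms):
    """Record validation-time baselines so 'trouble' is defined in advance.

    Inputs are assumptions for illustration: true labels, model
    predictions and per-request latencies from the validation run.
    """
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "latency_p50_ms": statistics.median(latencies_ms),
        "latency_max_ms": max(latencies_ms),
    }

# Persist alongside the model artifact for later comparison, e.g.:
# with open("baseline.json", "w") as f:
#     json.dump(capture_baseline(y_true, y_pred, latencies_ms), f)
```

Once live metrics arrive, comparing them against this file turns "the model seems slower" into "p50 latency is 40% above its validated baseline."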
The model behavior is drifting. But is it necessarily the model’s fault? Regressions that look like bad algorithms might ultimately trace back to the source: data.
So-called “data drift” happens when production data no longer resembles the dataset used as training data. Perhaps outliers that occur in the real world simply weren’t present during training.
(While “data drift” is a change in the distribution of the data itself, the related “concept drift” refers to a change in the relationship between input data and the target variable. “Prediction drift” refers to changes in model outputs. IBM has a separate, thorough guide to drift detection.)
Ultimately, the best model monitoring strategies employ techniques to monitor data distribution as a crucial component of model performance. One common method is the Kolmogorov-Smirnov test, which helps detect whether the distribution of real-time production data differs significantly from the model training distribution.
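As a rough sketch of how the Kolmogorov-Smirnov test is applied in practice, the two-sample version compares a feature's training distribution against a window of production values; the synthetic data and the p-value cutoff below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins: a training-time feature and a shifted live feature.
rng = np.random.default_rng(seed=0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=5000)  # mean has drifted

# Two-sample KS test: does the live distribution match training?
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # alert threshold is an illustrative choice
    print(f"Possible data drift (KS statistic={stat:.3f}, p={p_value:.3g})")
```

With large production windows, even tiny distributional differences become statistically significant, so teams often pair the p-value with a minimum effect size (the KS statistic itself) before alerting.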
Most managed ML platforms, such as Amazon SageMaker, and open-source tools such as Evidently AI have features to probe data in these and other ways. IBM's watsonx.governance solution also has tools to detect drift in production data.
There is no better way to evaluate a model than to check its predictions against the ground truth. Any robust monitoring system will build in a practice of checking, perhaps manually, whether a model’s predictions indeed align with what is known to be true.
The problem is that ground truth is sometimes difficult or impossible to come by. A credit decision can take time to validate. A macro-economic prediction can take months to confirm or prove wrong. Under such uncertainty, a mature monitoring workflow uses proxy metrics that correlate to ground truth.
What a proxy metric looks like is bespoke to each model or use case. Metrics can range from changes in feature values to unusual subsets of traffic. In mature machine learning monitoring operations, data science teams spend time theorizing on data types that can serve as proxies.
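As one concrete (and hypothetical) example of such a proxy, a team might track the share of live feature values falling outside the range observed during training. No labels are required, so the signal is available immediately, long before delayed ground truth arrives.

```python
def out_of_range_share(train_min, train_max, live_values):
    """Hypothetical proxy metric: fraction of live feature values that
    fall outside the range seen during training. A rising share can
    flag drift well before ground-truth labels become available."""
    if not live_values:
        return 0.0
    outside = sum(1 for v in live_values if v < train_min or v > train_max)
    return outside / len(live_values)
```

Whether out-of-range inputs actually correlate with bad predictions is a question each team must answer empirically for its own model, which is exactly the theorizing described above.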
No model is an island. Models interface with pipelines, APIs, dashboards and any number of human workflows. Just as doctors often need to order multiple tests to perform a differential diagnosis, AI practitioners need to keep an open mind about what part of a model’s ecosystem might be diminishing functionality.
For this reason, mature machine learning monitoring often eyes more than performance metrics alone. Ideally, it should track latency, throughput, ingest reliability and overall data quality issues.
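One simple pattern for capturing these operational signals is to wrap the prediction call itself, so latency and input completeness are logged on every request. This is a minimal sketch; `model_fn`, the feature dictionary and the log structure are all illustrative assumptions.

```python
import time

def monitored_predict(model_fn, features: dict, log: list):
    """Wrap a prediction call to record operational signals alongside
    the output: request latency and input completeness.

    model_fn and the feature schema are assumptions for illustration.
    """
    # Input-quality signal: share of missing feature values.
    null_rate = sum(v is None for v in features.values()) / max(len(features), 1)
    start = time.perf_counter()
    prediction = model_fn(features)
    latency_ms = (time.perf_counter() - start) * 1000.0
    log.append({"latency_ms": latency_ms, "null_rate": null_rate})
    return prediction
```

In production this log would feed a metrics backend rather than an in-memory list, but the principle is the same: operational health is recorded in the same breath as every prediction.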
Data scientists will inevitably have leading roles in ML monitoring teams, but by no means should theirs be the only voices. Practitioners often will need to use a holistic operational lens, drawing from fields like MLOps. As a first step, it might be worthwhile to collectively map the ecosystem around an AI model, creating a visualization of all key dependencies.