The AI Model Lifecycle isn’t complete when a model is deployed to production; you need to continually monitor and manage the model after deployment.
This includes tracking several performance and business KPIs (key performance indicators). Continuous monitoring and management of deployed artificial intelligence (AI) models are critical for business leaders to trust the predictions. Analysts have reported that lack of trust in AI models is one of the main factors inhibiting AI adoption in enterprises.
Several AI models have delivered excellent results but operate as black boxes, where it’s not possible to understand the reasoning behind their predictions. Additionally, some AI models are biased against one or more features or a class of customers because the data used to train the model wasn’t a representative sample. The lack of explainability and the potential concerns with fairness (or bias) expose an enterprise to significant reputational and financial risk if it adopts such AI models in its business processes and customer interactions.
In addition to the concerns of fairness and explainability, AI models suffer from “model drift,” which means that the performance of the model starts degrading over time because the production (or runtime) data no longer resembles the original training data used to train the model. Model drift can effectively render the model useless, which triggers an urgent need to retrain and update the model to maintain the value it delivers.
By monitoring deployed models in production, an AIOps team can detect model degradation or data drift early and trigger model retraining accordingly. There are two common approaches to model retraining:
- Direct data retraining: In this scenario, data scientists execute the tasks of retraining the model by leveraging recent data that better represents production data.
- Full lifecycle retraining: In this scenario, the model retraining executes across all stages of the lifecycle, requiring independent validation before the new model gets deployed to production.
Depending on the industry and use case, one approach may be better suited than the other for model retraining. For example, a marketing model for predicting what assets or offers to promote may be retrained using the direct data retraining approach, while a loan approval model will likely be required to go through the full lifecycle retraining approach.
It is worth noting that in addition to model degradation and data drift, other triggers can initiate model retraining tasks, such as implementing improved algorithms, applying the AI model to new business use cases, or adhering to new regulations specific to the industry and/or the business use case.
IBM Watson OpenScale and IBM Cloud Pak® for Data
AIOps teams and business users monitor AI models deployed in production with IBM Watson OpenScale, an integrated offering in IBM Cloud Pak for Data. Watson OpenScale provides the final piece of the puzzle to help organizations get AI projects out of development and into production.
IBM Watson OpenScale includes a powerful operations console that makes it easier for business users to track and measure AI outcomes. This allows business users to correlate outcomes to their organization’s KPIs and improve models to account for changing business situations. These analytics capabilities can also be easily integrated with many common business reporting tools to provide insights to a wider audience. The solution augments the AI environment with instrumentation, payload logging, and monitoring services that provide deep insights, end-to-end auditability, and fine-grained control. IBM Watson OpenScale consists of multiple configurable monitors, including Quality (or Accuracy), Fairness, Explainability, and Drift.
Quality (or accuracy)
The quality monitor (or accuracy monitor) reports how well the AI model is predicting outcomes, and it does this by comparing the model predictions to ground truth data (labeled data).
OpenScale provides several quality measurements (Figure 1) suited to different types of AI models:
- Binary classification models: area under the ROC curve (AUC), precision, recall, and F1-measure
- Multiclass classification models: weighted true positive rate, weighted recall, weighted precision, and weighted F1-measure
- Regression models: mean absolute error (MAE), mean squared error (MSE), R-squared, and root mean squared error (RMSE)
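For readers who want to see what these quality measurements compute, here is a small illustration using scikit-learn on toy values; the labels and scores below are invented, not OpenScale output:

```python
# Binary classification quality metrics, computed with scikit-learn on
# toy data for illustration (OpenScale computes these against ground truth
# data collected for the deployed model).
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                       # ground truth (labeled data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                       # model's hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3]     # predicted probabilities

print("AUC:      ", roc_auc_score(y_true, y_score))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```

A quality monitor compares metrics like these against a configured threshold; a violation is the signal that retraining is needed.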
When Watson OpenScale detects a quality problem, such as an accuracy threshold violation, a new version of the model must be trained to fix it.
Fairness
AI models in production need to make fair decisions and can’t be biased in their recommendations, or they risk exposing the organization to potential legal, financial, and reputational damage. Using fairness monitors, OpenScale is configured to identify “favorable” and “unfavorable” outcomes in “reference” and “monitored” populations. Typically, the reference group represents the majority group and the monitored group represents the minority group (the group the AI model could exhibit bias against).
A score is calculated by comparing the probability of favorable outcomes for the monitored (or minority) group with the probability of favorable outcomes for the reference (or majority) group. The Watson OpenScale algorithm computes bias hourly, using the last N records present in the payload logging table; the value of N is specified when configuring the Fairness monitor.
The algorithm perturbs these last N records to generate additional data. The perturbation changes the value of the fairness attribute from reference to monitored (or vice versa) and sends the perturbed data to the model to evaluate its response. The algorithm examines the last N records in the payload table together with the model’s responses on the perturbed data to decide whether the model could exhibit bias towards the monitored group. A model is deemed biased if, across the combined dataset (original and perturbed records), the percentage of favorable outcomes for the monitored group is less than the percentage of favorable outcomes for the reference group by some threshold value. This threshold is specified when configuring the Fairness monitor.
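The core comparison can be sketched in a few lines. This is a minimal illustration of the idea, not OpenScale's actual implementation; the fairness attribute (`gender`), the outcomes, and the threshold are all invented for the example:

```python
# Sketch of the fairness check: compare the favorable-outcome rate of the
# monitored group against the reference group, flagging potential bias when
# the gap exceeds a configured threshold. All records here are toy data.

def favorable_rate(records, group, favorable="approved"):
    """Fraction of favorable outcomes among records for one group."""
    group_records = [r for r in records if r["gender"] == group]
    return sum(r["outcome"] == favorable for r in group_records) / len(group_records)

# Toy payload-log records; "gender" plays the role of the fairness attribute.
records = [
    {"gender": "male",   "outcome": "approved"},
    {"gender": "male",   "outcome": "approved"},
    {"gender": "male",   "outcome": "approved"},
    {"gender": "male",   "outcome": "denied"},
    {"gender": "female", "outcome": "approved"},
    {"gender": "female", "outcome": "denied"},
    {"gender": "female", "outcome": "denied"},
    {"gender": "female", "outcome": "denied"},
]

reference = favorable_rate(records, "male")     # majority group
monitored = favorable_rate(records, "female")   # minority group
threshold = 0.2  # set when configuring the Fairness monitor

print("Favorable rate (reference):", reference)
print("Favorable rate (monitored):", monitored)
print("Potential bias detected:", (reference - monitored) > threshold)
```

In the real monitor, the same comparison runs over the original and perturbed records combined, so that flipping the fairness attribute tests whether the model's response changes.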
AIOps teams and business users can easily detect potential bias in a deployed AI model by reviewing OpenScale’s fairness dashboard, as shown in Figure 2. If bias is detected, business leaders and model builders can act swiftly to update the model and mitigate bias in production deployments. OpenScale also offers a debiased model endpoint that is trained to improve the fairness of the AI model. Organizations can either update their models to reduce bias or embed the debiased model from OpenScale directly in their production applications.
Explainability
Business users embedding AI models in their applications leverage Watson OpenScale’s explainability feature to better understand which factors contributed to an AI outcome for a specific transaction. Being able to explain a decision is critical for meeting regulatory demands and customer expectations around transparency. For example, if a customer is denied a loan and that decision is partly due to an AI model prediction, the business needs to deliver a clear explanation of the decision to the customer.
Watson OpenScale supports multiple algorithms for explaining transactions, including open source algorithms such as LIME (Local Interpretable Model-Agnostic Explanations) and IBM technology such as contrastive explanations. OpenScale does so by applying thousands of perturbations around the specific data points associated with a transaction to identify which features most significantly impacted the model’s prediction.
Local Interpretable Model-Agnostic Explanations (LIME) is an open source Python library that Watson OpenScale uses to analyze the input and output values of a model and create human-understandable interpretations of it. Both LIME and contrastive explanation are valuable tools for making sense of a model, but they offer different perspectives. Contrastive explanations reveal how much feature values would need to change to either flip the prediction or still produce the same prediction. The factors that need the maximum change are considered more important in this type of explanation; in other words, the features with the highest importance in contrastive explanations are those to which the model is least sensitive. LIME, on the other hand, reveals which features are most important for a specific data point. The 5,000 perturbations typically done for analysis are very close to the data point, and, in an ideal setting, the features with high importance in LIME are those that matter most for that specific data point.
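The LIME idea itself is compact enough to sketch without the library: perturb points near the transaction of interest, query the model, and fit a simple weighted surrogate whose coefficients rank the features. This is a simplified illustration of the technique, not the `lime` package's actual code; the black-box model and feature layout are assumptions:

```python
# Minimal LIME-style local explanation: perturb around one data point,
# weight samples by proximity, and fit an interpretable linear surrogate.
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_importance(predict_fn, x, n_samples=5000, scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Generate perturbations close to the data point of interest.
    samples = x + rng.normal(scale=scale, size=(n_samples, x.size))
    preds = predict_fn(samples)
    # 2. Weight samples by proximity to x (closer samples matter more).
    dists = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * scale ** 2))
    # 3. Fit an interpretable surrogate; its coefficients rank the features.
    surrogate = Ridge().fit(samples, preds, sample_weight=weights)
    return surrogate.coef_

# Toy "black box" whose prediction depends only on the first feature.
predict = lambda X: (X[:, 0] > 0).astype(float)
x = np.array([0.05, 1.0, -2.0])

print(lime_like_importance(predict, x))  # first coefficient dominates
```

Because the surrogate is fit only on points near `x`, the resulting importances are local to that one transaction, which is exactly the perspective the dashboard's LIME panel shows.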
Figure 3 shows one example of OpenScale’s explainability results for one specific transaction, with the contrastive explanation annotated with the red rectangle and the LIME explanation (shown at the bottom of the dashboard) highlighting the features that contributed to the model’s prediction.
Drift
Over time, the importance and impact of certain features in a model change. This affects the associated applications and resulting business outcomes. AIOps teams and business users monitor AI models and detect drift by reviewing Watson OpenScale’s drift dashboard (Figure 4), which captures a potential drop in accuracy and/or a drop in data consistency:
- Drop in accuracy: Estimates the drop in accuracy of the model at runtime. Model accuracy drops if there is an increase in transactions that are similar to those that the model did not evaluate correctly in the training data. This type of drift is calculated for structured binary and multi-class classification models only.
- Drop in data consistency: Estimates the drop in consistency of the data at runtime as compared to the characteristics of the data at training time.
As data changes, the ability of the AI model to make accurate predictions may deteriorate. Drift magnitude is the extent of this degradation of predictive performance over time. When the AIOps team or business users identify model drift from OpenScale’s drift dashboard, they request that model builders take corrective action and update or retrain the model.
Watson OpenScale analyzes all transactions to find the ones that contribute to accuracy drift. It then groups those transactions into clusters based on the similarity of each feature’s contribution to the drift in accuracy. Within each cluster, Watson OpenScale also estimates the features that played a major role in the drift and classifies their impact as large, some, or small.
Watson OpenScale analyzes each transaction to estimate whether the model prediction is accurate. If the prediction is estimated to be inaccurate, the transaction is marked as drifted. The estimated accuracy is then calculated as the ratio of non-drifted transactions to the total number of transactions analyzed. The base accuracy is the accuracy of the model on the test data. Watson OpenScale calculates the extent of the drift in accuracy as the difference between base accuracy and estimated accuracy.
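The arithmetic above is simple enough to state directly. A small sketch, with invented numbers for illustration:

```python
# Drift in accuracy = base accuracy (on test data) minus estimated accuracy
# (fraction of runtime transactions NOT flagged as drifted).

def drift_in_accuracy(base_accuracy, drifted_flags):
    # drifted_flags: one boolean per analyzed transaction, True when the
    # transaction was marked as drifted (prediction estimated inaccurate).
    estimated_accuracy = drifted_flags.count(False) / len(drifted_flags)
    return base_accuracy - estimated_accuracy

# E.g., a model that scored 0.90 on test data, with 2 of 10 runtime
# transactions flagged as drifted, shows a drift in accuracy of about 0.10.
flags = [False] * 8 + [True] * 2
print(drift_in_accuracy(0.90, flags))
```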
Application KPI performance
In addition to monitoring AI models, Watson OpenScale also includes capabilities for monitoring business processes and applications. This is achieved by correlating AI metrics and measures with business application key performance indicators (KPIs).
Each business event (or transaction) is the result of a business process that can include multiple scorings from different AI models. Having business events in the Watson OpenScale system means that business data can be sliced by time or by clustering and linked to the corresponding AI scoring payloads. Watson OpenScale then measures KPIs on the business payload and AI metrics on the scoring payload and correlates those metrics together.
The correlation results between business KPIs and AI model metrics are especially beneficial in certain scenarios, such as the following:
- Use the correlation to decide where to invest and which problem — such as drift, fairness, or quality — leads to the highest loss from a business perspective.
- Where there is no correlation between KPIs and AI metrics, that itself is cause for further analysis and might raise questions about how the AI is used and whether there is a gap in the process.
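To make the correlation idea concrete, here is an illustrative sketch that correlates a business KPI with an AI metric across aligned time windows. The metric names and values are assumptions for the example, not OpenScale output, and a plain Pearson correlation stands in for whatever statistic the platform applies:

```python
# Correlate a business KPI with an AI model metric over aligned time windows.
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Weekly values, aligned by time window (toy data).
claim_approval_rate = [0.82, 0.80, 0.75, 0.71, 0.68]    # business KPI
drift_magnitude = [0.02, 0.05, 0.11, 0.16, 0.21]        # AI model metric

r = pearson(claim_approval_rate, drift_magnitude)
print(f"Pearson r = {r:.2f}")
```

A strong negative correlation like this one suggests the drift problem is the one causing the largest business loss and is worth prioritizing; a near-zero result would instead prompt the further analysis described above.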
Model risk management
Management of model risk is critical to meet regulatory requirements and to protect institutions from operational and reputational risk. Model risk arises when a mathematical model used to predict and measure quantitative information performs inadequately, leading to adverse outcomes and significant operational losses for the institution.
There are several factors that can contribute to model risk, including data issues, incorrect model design, coding and technical errors, inherent uncertainty, and several others. Existing model risk management practices are not optimized for AI models based on machine learning and deep learning techniques, which require a different approach for testing and validation.
Watson OpenScale offers a model risk management solution by monitoring AI models for critical metrics, such as drift, fairness, and explainability, as described earlier. Furthermore, integrating Watson OpenScale with OpenPages, IBM’s GRC (Governance, Risk, and Compliance) offering, delivers an end-to-end model governance solution.
Get started with IBM Watson OpenScale
While IBM Watson OpenScale integrates seamlessly with IBM tools for building and running AI models, such as IBM Watson Studio and IBM Watson Machine Learning, it has been designed as an open platform that easily operates with model deployments from other vendors, such as Amazon SageMaker, Azure Machine Learning, and more.
Regardless of the existing investments in model design, training, and evaluation tools, IBM Watson OpenScale offers value by closing the gaps between the data science team, IT team, and business process owners. Above all, it provides a unique set of monitoring and management tools that help build trust and implement control and governance structures around AI investments.
Check out the whole blog series
Other blog entries detail the following phases in AI Model Lifecycle Management:
- AI Model Lifecycle Management: Collect Phase
- AI Model Lifecycle Management: Organize Phase
- AI Model Lifecycle Management: Build Phase
- AI Model Lifecycle Management: Deploy Phase
- AI Model Lifecycle Management: Monitor Phase (Technical Perspective)
- AI Model Lifecycle Management: Monitor Phase (Customer Perspective)
This post will be updated with links as they become available. For a deeper dive into the subject, see our white paper.