Published: 8 October 2024
Contributors: Amanda McGrath, Alexandra Jonker
AI interpretability helps people better understand and explain the decision-making processes that power artificial intelligence (AI) models.
AI models use a complex web of data inputs, algorithms, logic, data science and other processes to return insights. The more complex the model, the more difficult it can be for humans to understand the steps that led to its insights—even if those humans are the ones who designed and built it. An interpretable model is one whose decisions can be easily understood by users.
The use of AI is expanding. Systems that use large language models (LLMs) are becoming routine parts of daily life, from smart home devices to credit card fraud detection to the broad use of ChatGPT and other generative AI tools. As highly complex models (including deep-learning algorithms and neural networks) become more common, AI interpretability becomes more important.
Additionally, AI systems and machine-learning algorithms are increasingly prevalent in healthcare, finance and other industries that involve critical or life-altering decisions. With such high stakes, the public needs to be able to trust that the outcomes are fair and reliable. That trust depends on understanding how AI systems arrive at their predictions and make their decisions.
White-box AI models have inputs and logic that are easy to see and understand. For example, basic decision trees, which show a clear flow between each step, are not difficult for the average person to decipher. White-box models tend to use simpler, more linear decision-making processes that are easy to interpret, but they can deliver lower accuracy or less compelling insights and applications.
Black-box AI models are more complicated and offer less transparency into their inner workings. The user generally doesn’t know how the model reaches its results. These more complex models tend to be more accurate and precise. But because they are difficult or impossible to understand, they come with concerns about their reliability, fairness, biases and other ethical issues. Making black-box models more interpretable is one way to build trust in their use.
AI interpretability focuses on understanding the inner workings of an AI model, while AI explainability aims to provide reasons for the model's outputs.
Interpretability is about transparency, allowing users to comprehend the model's architecture, the features it uses and how it combines them to deliver predictions. An interpretable model's decision-making processes are easily understood by humans. Greater interpretability requires greater disclosure of the model's internal operations.
Explainability is about verification, or providing justifications for the model's outputs, often after it makes its predictions. Explainable AI (XAI) is used to identify the factors that led to the results. Various explainability methods can present a model's complex processes and underlying data science in natural language that a human being can understand.
AI interpretability helps to debug models, detect biases, ensure compliance with regulations and build trust with users. It allows developers and users to see how their models affect people and businesses and to develop them responsibly.
Interpretability is important for several reasons:
Without interpretability, users are left in the dark. This lack of accountability can erode public trust in the technology. When stakeholders fully understand how a model makes its decisions, they are more likely to accept its outputs. Model interpretability allows for transparency and clarity, which makes users feel comfortable relying on it in real-world applications such as medical diagnoses or financial decisions.
Biases within training data can be amplified by AI models. The resulting discriminatory outcomes not only perpetuate societal inequalities but also expose organizations to legal and reputational risks. Interpretable AI systems can help detect if a model is making biased decisions based on protected characteristics, such as race, age or gender. Interpretability allows model developers to identify and mitigate discriminatory patterns, helping ensure fairer outcomes.
Interpretable machine learning allows the creators of ML algorithms and ML models to identify and fix errors. No machine learning model is 100% accurate from the start. Without understanding the AI's reasoning, debugging is an inefficient and risky process. By understanding how the ML model works, developers and data scientists can pinpoint the sources of incorrect predictions and optimize the model's performance, which in turn increases its overall reliability.
Some regulations, such as the Equal Credit Opportunity Act (ECOA) in the United States or the General Data Protection Regulation (GDPR) in the European Union, require that decisions made by automated systems be transparent and explainable. And a growing number of AI-specific regulations, including the European Union's AI Act, are setting standards for AI development and use. Interpretable AI models can provide clear explanations for their decisions, helping to meet these regulatory requirements. Interpretability can also help with auditing issues, liability and data privacy protections.
Without interpretability, developers and researchers might struggle to translate AI insights into actionable results or advance the technology with changes. Interpretability makes it easier to transfer knowledge about a model’s underpinnings and decisions among stakeholders and to use its knowledge to inform other model development.
Stanford University researcher Nigam Shah identifies three main types of interpretability: engineers’ interpretability, causal interpretability and trust-inducing interpretability.1
This type focuses on how the AI model reached its output. It involves understanding the model's internal workings and is relevant to developers and researchers who need to debug or improve the model.
This type focuses on why the model produced its output. It involves identifying the factors that have the greatest influence on the model's predictions and how changes in these factors affect the outcomes.
This type focuses on providing explanations that build trust in the model's outputs. It involves presenting the model's decision-making process in a way that is understandable and relatable to users, even if they do not have technical expertise.
Several characteristics influence the interpretability of AI models:
Intrinsic interpretability refers to models that are inherently interpretable, such as decision trees and linear regression models. Their simple structures are easy to understand. Post-hoc interpretability, in contrast, involves applying interpretation methods to pre-trained models to explain their behavior. Post-hoc interpretation is best suited to more complex or black-box models.
Local interpretability focuses on explaining individual predictions and helps show why the model reached a particular result. Global interpretability aims to understand the model's behavior across the entire dataset, showing its overall patterns and trends.
Model-specific interpretability methods use a model's internal structure to provide explanations. Model-agnostic methods work with any type of model.
Various methods can establish interpretability in AI models.
Some models are simple enough for intrinsic interpretation. These inherently interpretable models rely on straightforward structures such as decision trees, rule-based systems and linear regressions. Humans can easily understand the decision-making patterns and processes of such models.
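To make this concrete, here is a minimal sketch of an intrinsically interpretable model: a shallow decision tree whose learned rules can be printed and read directly. The use of scikit-learn, the iris dataset and the depth limit are illustrative assumptions, not tools prescribed by this article.

```python
# A small, inherently interpretable model: a shallow decision tree whose
# learned rules can be printed and read as plain if/else statements.
# Assumes scikit-learn is installed; the dataset and depth are illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)

# Limiting depth keeps the rule set small enough for a person to follow.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the full decision logic as human-readable rules.
print(export_text(tree, feature_names=list(X.columns)))
```

The printed output is the model's complete decision logic, which is the sense in which such models are interpretable by design.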
More complex models require post-hoc interpretation, in which interpretation methods are applied to pre-trained models to explain the model’s output. Some common post-hoc interpretation methods include:
LIME helps to explain a model's predictions by focusing on a single prediction at a time. It uses feature attribution to determine how much a particular characteristic (such as shape, color or another data point) influences the model's output. Starting from a specific prediction, LIME generates many similar instances by slightly tweaking the feature values, observes how the complex model responds to these "perturbed" instances, and then fits a simpler, interpretable model to those results. In short, LIME provides a simplified, local explanation for how the complex model behaves around that prediction.
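The sketch below shows what this can look like in practice, using the third-party lime package with a scikit-learn classifier. The specific libraries, dataset and black-box model are assumptions made for demonstration, not part of the method's definition.

```python
# A minimal sketch of a local, post-hoc LIME explanation for one prediction.
# Assumes the third-party `lime` package and scikit-learn are installed;
# the dataset and black-box model are illustrative.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification",
)

# LIME perturbs this one instance, queries the black-box model on the
# perturbed copies, and fits a simple local surrogate to the results.
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=4
)
print(explanation.as_list())  # per-feature contributions for this prediction
```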
SHAP is a cooperative game theory-style approach to interpretability that considers all possible combinations of features and how they affect the prediction. It assigns a value (called a Shapley value) to each feature based on how much it contributes to the prediction in different scenarios. SHAP can work with any machine learning system, and it offers both local explanations for individual predictions and global explanations for the model as a whole. However, because of its computational complexity, SHAP can be a slower and more expensive method.
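As a rough illustration, the snippet below uses the third-party shap package with a scikit-learn regressor (an assumed setup chosen for demonstration) to produce one local and one global explanation.

```python
# A minimal sketch of SHAP attributions, local and global. Assumes the
# third-party `shap` package and scikit-learn; the dataset and model
# are illustrative.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# shap.Explainer selects a suitable algorithm (a fast tree explainer here).
explainer = shap.Explainer(model)
shap_values = explainer(X)

shap.plots.waterfall(shap_values[0])  # local: one prediction's attributions
shap.plots.beeswarm(shap_values)      # global: feature impact across the data
```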
PDPs show how a feature affects the model's predictions, on average, across the dataset. They help visualize the relationship between a feature and the model's output by averaging out the effects of all other features. This method is useful for interpreting a small number of features or when stakeholders want to focus on a specific subset of features.
ICE plots show how much a predicted outcome depends on a specific feature. They are similar to PDPs but show the relationship between a feature and the model's output for individual instances, rather than averaging across the dataset. They can complement PDPs by providing a more detailed view of the model's behavior, such as highlighting variability and revealing interactions between features at the instance level. They are also useful when researchers or stakeholders want to identify outliers or unusual patterns in the model's operations.
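Both kinds of plot can be drawn with scikit-learn's built-in inspection API. The sketch below, an illustrative setup rather than one referenced in this article, overlays per-instance ICE curves on the partial dependence average for a single feature.

```python
# A minimal sketch of PDP and ICE curves for one feature, using
# scikit-learn's inspection API. The dataset, model and feature choice
# are illustrative.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# kind="both" draws one thin ICE curve per instance plus the PDP average,
# making instance-level variability and outliers visible.
PartialDependenceDisplay.from_estimator(model, X, features=["bmi"], kind="both")
plt.show()
```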
AI interpretability is important in any industry that uses AI models to make decisions that impact individuals or society. Some industries in which AI interpretability is relevant include:
Medical professionals use artificial intelligence for diagnosis, treatment recommendations and research. Interpretability can help doctors and patients trust and understand an AI model's decisions and identify bias or errors in its reasoning.
Finance professionals can use AI to detect fraud, quantify risk, assign credit scores and make recommendations for investments. Interpretability is essential to regulatory compliance and auditing in the finance and banking industry. And understanding a model’s decision-making process for activities such as loan approvals and risk management can help prevent biased results.
The criminal justice sector can use AI to analyze crime scenes, DNA and forensic evidence and local or national crime patterns. Users might also turn to AI to offer sentencing recommendations and perform other routine judicial operations. Interpretability is critical to ensuring fairness, accuracy and accountability.
Some human resources departments use AI for resume screening and candidate evaluation. Interpretability is one way to prevent discrimination in the initial hiring process.
The insurance industry uses artificial intelligence for assessing risk, processing insurance claims and setting pricing. Interpretability can help customers understand their premiums and insurers justify their decisions.
As more marketing, sales and customer service functions rely on AI-powered chatbots, interpretability can offer important safeguards. Understanding why a chatbot makes a recommendation or decision builds trust in the AI system and helps improve or personalize its offerings.
Interpretability comes with some challenges and limitations.
Often there is a trade-off between model performance and interpretability. Simpler or white-box models are more interpretable but might have lower accuracy compared to complex black-box models such as deep neural networks.
Interpretability also suffers from a lack of standardization. Different methods can provide different explanations for the same model, making it difficult to compare and validate them without formal frameworks. And interpretability is often subjective: an explanation that is easy for one user to understand might not be sufficient for another.
Some experts say that interpretability is not necessary in some cases, or can be counterproductive in others. If the model is private or has no significant impact, or the problem is already the subject of much accepted study, greater interpretability could be redundant or unnecessary. In some cases, greater interpretability might present safety concerns, as more transparency could allow bad actors to exploit a system or allow users to game the system in a way that undermines its efficacy.
1. Miller, Katharine. "Should AI models be explainable? That depends." Stanford Institute for Human-Centered Artificial Intelligence, March 2021.