Anomaly detection, or outlier detection, is the identification of observations, events or data points that deviate from what is usual, standard or expected, making them inconsistent with the rest of a data set.
Anomaly detection has a long history in the field of statistics, where analysts and scientists would study charts looking for any elements that appeared abnormal. Today, anomaly detection leverages artificial intelligence (AI) and machine learning (ML) to automatically identify unexpected changes in a data set’s normal behavior.
Anomalous data can signal critical incidents happening under the hood, such as an infrastructure failure, a breaking change from an upstream source or security threats. Anomalies can also highlight opportunities for architectural optimization or improving marketing strategies.
Anomaly detection has a range of use cases across various industries. For example, it is used in finance for fraud detection, in manufacturing to identify defects or equipment malfunctions, in cybersecurity to detect unusual network activity and in healthcare to identify abnormal patient conditions.
Outlier detection can be challenging because anomalies are often rare, and the characteristics of normal behavior can be complex and dynamic. From a business perspective, identifying actual anomalies rather than false positives or data noise is essential.
Data anomalies can have a significant impact in the field of data science, leading to incorrect or misleading conclusions. For example, a single outlier can significantly skew the mean of a data set, making it an inaccurate representation of the data. Additionally, data anomalies can impact the performance of machine learning algorithms, as they can cause the model to fit the noise rather than the underlying pattern in the data.
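As a simple illustration of that skew, the short Python snippet below (with made-up order counts) shows how adding one extreme value shifts the mean dramatically while the median barely moves.

```python
from statistics import mean, median

# Hypothetical daily order counts; values are illustrative only
orders = [102, 98, 105, 97, 101, 99, 103]
print(mean(orders), median(orders))        # roughly 100.7 and 101

# One erroneous entry (for example, a data-entry mistake) is added
orders_with_outlier = orders + [10_000]

# The mean jumps dramatically while the median barely moves,
# showing why a single outlier can make the mean unrepresentative
print(mean(orders_with_outlier), median(orders_with_outlier))
```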
Identifying and handling data anomalies is crucial for several reasons:
Improved data quality: Identifying and handling data anomalies can significantly improve data quality, which is essential for accurate and reliable data analysis. By addressing data anomalies, analysts can reduce noise and errors in the data set, ensuring that the data is more representative of the true underlying patterns.
Enhanced decision making: Data-driven decision making relies on accurate and reliable data analysis to inform decisions. By identifying and handling data anomalies, analysts can ensure that their findings are more trustworthy, leading to better-informed decisions and improved outcomes.
Optimized machine learning performance: Data anomalies can significantly impact the performance of machine learning algorithms, as they can cause the model to fit the noise rather than the underlying pattern in the data. By identifying and handling data anomalies, analysts can optimize the performance of their machine learning models, ensuring that they provide accurate and reliable predictions.
An anomaly detection system can uncover two general types of anomalies: unintentional and intentional.
Unintentional anomalies are data points that deviate from the norm due to errors or noise in the data collection process. These errors can be either systematic or random, originating from issues like faulty sensors or human error during data entry. Unintentional anomalies can distort the data set, making it challenging to derive accurate insights.
Intentional anomalies are data points that deviate from the norm due to specific actions or events. These anomalies can provide valuable insights into the data set, as they may highlight unique occurrences or trends. For example, a sudden spike in sales during a holiday season could be considered an intentional anomaly, as it deviates from the typical sales pattern but is expected due to a real-world event.
In business data, three main time-series data anomalies exist: point anomalies, contextual anomalies and collective anomalies.
Point anomalies, also known as global outliers, are individual data points that exist far outside the rest of the data set. They can be either intentional or unintentional and may result from errors, noise or unique occurrences. An example of a point anomaly is a bank account withdrawal that is significantly larger than any of the user’s previous withdrawals.
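As a rough sketch of how such a point anomaly might be flagged, the following Python snippet applies a robust, median-based modified z-score to hypothetical withdrawal amounts; the 3.5 cutoff is a common rule of thumb, not a fixed standard.

```python
import statistics

# Hypothetical withdrawal history for one account (amounts are illustrative)
withdrawals = [40, 60, 55, 80, 45, 70, 65, 50, 75, 5000]

med = statistics.median(withdrawals)
# Median absolute deviation (MAD): a robust measure of spread that a
# single extreme value cannot inflate the way it inflates the mean
mad = statistics.median(abs(x - med) for x in withdrawals)

# Modified z-score; values above roughly 3.5 are commonly treated as
# point anomalies, though the threshold here is an assumption
flagged = [x for x in withdrawals if 0.6745 * abs(x - med) / mad > 3.5]
print(flagged)  # [5000]
```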
Contextual anomalies are data points that deviate from the norm within a specific context. These anomalies are not necessarily outliers when considered in isolation but become anomalous when viewed within their specific context.
For example, consider home energy usage. If there is a sudden increase in energy consumption at midday when no family members are typically home, the anomaly would be contextual. This data point might not be an outlier when compared to energy usage in the morning or evening (when people are usually home), but it is anomalous in the context of the time of day it occurs.
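A minimal sketch of this idea, assuming a hypothetical log of hourly energy readings: each new reading is compared only against past readings from the same hour of day, so a midday spike stands out even though the same value would be normal in the evening.

```python
from collections import defaultdict
import statistics

# Hypothetical history of (hour_of_day, kWh) readings for one household
history = [
    (2, 0.3), (2, 0.4), (2, 0.3),     # night: low usage
    (12, 0.5), (12, 0.6), (12, 0.4),  # midday: nobody home, low usage
    (19, 3.1), (19, 2.8), (19, 3.3),  # evening: family home, high usage
]

# Build a per-hour baseline from the history
by_hour = defaultdict(list)
for hour, kwh in history:
    by_hour[hour].append(kwh)

def is_contextual_anomaly(hour, kwh, k=3.0):
    readings = by_hour[hour]
    mu = statistics.mean(readings)
    sigma = statistics.stdev(readings)
    return abs(kwh - mu) > k * sigma

# 3.0 kWh is unremarkable at 19:00 but anomalous at 12:00
print(is_contextual_anomaly(19, 3.0))  # False
print(is_contextual_anomaly(12, 3.0))  # True
```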
Collective anomalies involve a set of data instances that together deviate from the norm, even though individual instances may appear normal. An example of this type of anomaly would be a network traffic data set that shows a sudden surge in traffic from multiple IP addresses at the same time.
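One hedged way to surface such a surge is to aggregate traffic across all sources per time window and compare each window's total against a historical baseline, as in the illustrative Python sketch below (all counts are made up).

```python
import statistics

# Hypothetical requests-per-minute totals summed across all source IPs.
# Each individual IP sends a plausible volume, but in the last window
# many of them surge at the same time.
requests_per_minute = [120, 130, 115, 125, 118, 122, 127, 119, 620]

baseline = requests_per_minute[:-1]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

# Flag minutes whose aggregate traffic is far above the baseline;
# the 4-sigma threshold is an illustrative assumption
collective_anomalies = [
    (minute, total) for minute, total in enumerate(requests_per_minute)
    if total > mu + 4 * sigma
]
print(collective_anomalies)  # [(8, 620)]
```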
Detecting data anomalies is a critical aspect of data analysis, ensuring that findings are accurate and reliable. Several methods can be used to build an anomaly detection system.
Visualization is a powerful tool for detecting data anomalies, as it allows data scientists to quickly identify potential outliers and patterns in the data. By plotting the data using charts and graphs, analysts can visually inspect the data set for any unusual data points or trends.
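As a small illustration, the following Python sketch (assuming Matplotlib is installed, with made-up sales figures) draws a box plot and a scatter plot in which a single extreme value is immediately visible.

```python
import matplotlib.pyplot as plt

# Hypothetical daily sales figures with one unusually large value
sales = [200, 210, 195, 205, 220, 215, 198, 202, 900]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A box plot makes the outlier visible as a point beyond the whiskers
ax1.boxplot(sales)
ax1.set_title("Box plot")

# A simple scatter of the values over time shows the spike in context
ax2.scatter(range(len(sales)), sales)
ax2.set_title("Values over time")

plt.tight_layout()
plt.show()
```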
Statistical tests can be used by data scientists to detect data anomalies by comparing the observed data with the expected distribution or pattern.
For example, the Grubbs test can be used to identify outliers in a data set by comparing each data point to the mean and standard deviation of the data. Similarly, the Kolmogorov-Smirnov test can be used to determine whether a data set follows a specific distribution, such as a normal distribution.
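As a hedged illustration, the Python sketch below (assuming NumPy and SciPy are available, with made-up measurement values) computes the two-sided Grubbs statistic by hand and runs SciPy's Kolmogorov-Smirnov test against a normal distribution fitted to the sample; because the distribution parameters are estimated from the same data, the KS p-value is only approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements with one suspicious value
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 14.5])

# --- Grubbs test (two-sided) for a single outlier ---
n = len(data)
mean, std = data.mean(), data.std(ddof=1)
g = np.max(np.abs(data - mean)) / std          # Grubbs statistic

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
print("outlier detected:", g > g_crit)

# --- Kolmogorov-Smirnov test against a fitted normal distribution ---
statistic, p_value = stats.kstest(data, "norm", args=(mean, std))
print("p-value:", p_value)  # a small p-value suggests the data deviate
                            # from the assumed normal distribution
```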
Machine learning algorithms can be used to detect data anomalies by learning the underlying pattern in the data and then identifying any deviations from that pattern. Common ML anomaly detection algorithms include clustering methods such as k-means and DBSCAN, isolation forest, local outlier factor (LOF), one-class support vector machines (SVMs) and autoencoder neural networks.
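For instance, a minimal sketch using scikit-learn's isolation forest implementation on synthetic two-dimensional data might look like the following; the contamination value is an assumption about the fraction of anomalies, not a learned quantity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: a dense cluster of normal points plus a few outliers
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies in the data
model = IsolationForest(contamination=0.03, random_state=42)
labels = model.fit_predict(X)   # -1 = anomaly, 1 = normal

print("predicted anomalies:", np.sum(labels == -1))
```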
An anomaly detection algorithm can learn to identify patterns and detect anomalous data using various machine learning training techniques. The amount of labeled data, if any, in a data team’s training data set determines which of the main anomaly detection techniques they will use—unsupervised, supervised or semi-supervised.
With unsupervised anomaly detection techniques, data engineers train a model by providing it with unlabeled data sets that it uses to discover patterns or abnormalities on its own. Although these techniques are by far the most commonly used because they apply to a wider range of scenarios, they require massive data sets and computing power. Unsupervised machine learning is most often found in deep learning scenarios, which rely on artificial neural networks.
Supervised anomaly detection techniques use an algorithm that is trained on a labeled data set that includes both normal and anomalous instances. Because labeled training data is generally unavailable and the classes are inherently unbalanced, these anomaly detection techniques are rarely used.
Semi-supervised techniques combine the strengths of both unsupervised and supervised anomaly detection. By providing an algorithm with a portion of labeled data, data engineers can partially train it. They then use the partially trained algorithm to label a larger data set autonomously, a process referred to as "pseudo-labeling." Assuming they prove reliable, these newly labeled data points are combined with the original data set to fine-tune the algorithm.
Finding the right combination of supervised and unsupervised machine learning is vital to machine learning automation. Ideally, the vast majority of data classifications would be done without human interaction in an unsupervised manner. That said, data engineers should still be able to feed algorithms with training data that will help create business-as-usual baselines. A semi-supervised approach allows for scaling anomaly detection with the flexibility to make manual rules regarding specific anomalies.
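A minimal sketch of the pseudo-labeling loop described above, using scikit-learn and synthetic data: the classifier choice, the 0.9 confidence threshold and the 100-record labeled subset are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic, imbalanced data standing in for "normal vs. anomalous" records
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Pretend only a small portion of the data is labeled
X_labeled, y_labeled = X[:100], y[:100]
X_unlabeled = X[100:]

# 1. Partially train on the small labeled set
model = RandomForestClassifier(random_state=0)
model.fit(X_labeled, y_labeled)

# 2. Pseudo-label the unlabeled data, keeping only confident predictions
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.9            # threshold is an assumption
pseudo_labels = model.classes_[proba.argmax(axis=1)][confident]

# 3. Combine original and pseudo-labeled data, then retrain (fine-tune)
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model.fit(X_combined, y_combined)
```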
Anomaly detection models are used extensively in the banking, insurance and stock trading industries to identify fraudulent activities in real time, such as unauthorized transactions, money laundering, credit card fraud, bogus tax return claims and abnormal trading patterns.
Intrusion detection systems (IDSs) and other cybersecurity technologies use anomaly detection to help identify unusual or suspicious user activities or network traffic patterns, indicating potential security threats or attacks like malware infections or unauthorized access.
Anomaly detection algorithms are often employed together with computer vision to identify defects in products or packaging by analyzing high-res camera footage, sensor data and production metrics.
Anomaly detection can be used to monitor the performance of IT systems and keep operations running smoothly by identifying unusual patterns in server logs and matching them against known fault patterns to predict potential issues or failures.
By identifying irregularities in data from Internet of Things (IoT) sensors and operation technology (OT) devices, anomaly detection can help predict equipment failures or maintenance needs in industries like aviation, energy and transportation. When used to monitor energy consumption patterns and identify anomalies in usage, anomaly detection can lead to more efficient energy management and early detection of equipment failures.
Merchants use anomaly detection models to identify unusual patterns in customer behavior, which can help with fraud detection, predicting customer churn and improving marketing strategies. In e-commerce, anomaly detection is applied to identify fake reviews, account takeovers, abnormal purchasing behavior and other indicators of fraud or cybercrime.