Outlier analysis
Use Outlier analysis to identify and correct data points in an exploration that deviate significantly from the norm, potentially indicating errors or anomalies. The analysis also includes AI-generated insights for any outliers that were detected and corrected in the view, helping you to investigate anomalies and improve the quality of your data.
To perform an outlier analysis, click an exploration in a book and select the rows that you want to analyze. Then, click AI operations in the toolbar and select Outlier analysis.
You can select up to 25 rows to run Outlier analysis, but the feature considers only the first 120 columns in the exploration.
By default, the resulting analysis displays the data in a Scatter chart, making it easy to see any outliers. Each available algorithm is evaluated by using the Silhouette score, and the best-ranked algorithm is then used to run the analysis. However, you can select a different algorithm and regenerate the analysis. You can also switch the chart type from a Scatter chart to a Line chart.
An AI-generated summary of the outlier analysis is created automatically when you run the feature. If you have Outlier correction enabled, the summary includes insights for the outliers that were corrected.
Algorithms used in outlier analysis
- Tukey's method
- Tukey's method identifies outliers by calculating the interquartile range (IQR) and defining lower and upper bounds. Data points outside these bounds are considered outliers.
- Z-Score method
- The Z-Score method is a statistical technique that standardizes data points to determine how many standard deviations they are from the mean. A data point whose absolute Z-score exceeds a specified threshold (commonly 3) is considered an outlier.
- One-Class Support Vector Machine (SVM)
- One-Class SVM is an unsupervised machine learning algorithm that learns a decision boundary around the normal instances of the data points. Data points outside this boundary are classified as outliers.
- Isolation Forest
- Isolation Forest builds an ensemble of decision trees to isolate data points. Outliers are detected as points that require fewer splits to isolate than the majority of data points.
- Local Outlier Factor (LOF)
- The LOF algorithm measures the local density deviation of a given data point with respect to its neighbors. Data points with a substantially lower density than their neighbors are classified as outliers.
- DBSCAN
- DBSCAN is a density-based clustering algorithm that groups data points based on their density. Data points in low-density regions that do not belong to any cluster are considered outliers.
- Ensemble Voting method
- The Ensemble Voting method combines the results of the various outlier detection algorithms to improve the robustness of the detection process. By using a voting mechanism, this method selects the most frequently identified outliers across all methods, enhancing overall accuracy. This approach is effective when different algorithms yield varying results, as it uses the strengths of each method.
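The two statistical methods and the voting idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the product's implementation: the function names, the fence multiplier `k=1.5`, the use of the population standard deviation, and the strict-majority voting rule are all assumptions chosen for the example.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumption)
    if stdev == 0:
        return [False] * len(values)   # all values identical: nothing stands out
    return [abs((v - mean) / stdev) > threshold for v in values]

def majority_vote(*flag_lists):
    """Keep a point as an outlier only if more than half of the methods flag it.

    The strict-majority rule is a hypothetical choice for this sketch.
    """
    needed = len(flag_lists) / 2
    return [sum(flags) > needed for flags in zip(*flag_lists)]

values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
tukey_flags = tukey_outliers(values)
z_flags = zscore_outliers(values)
consensus = majority_vote(tukey_flags, z_flags)
print([v for v, flagged in zip(values, consensus) if flagged])  # prints [95]
```

With only two methods voting, a strict majority means that both must agree. Adding more detectors to the vote relaxes that requirement.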
The effectiveness of each outlier detection method is evaluated with the Silhouette score. This score measures how similar a data point is to its own cluster compared to other clusters, providing a way to assess the quality of clustering and the separation of outliers. The score ranges from -1 to 1, and a higher Silhouette score indicates better-defined clusters.
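To make the Silhouette score concrete, the following sketch computes it for one-dimensional values, treating the inliers and the outliers as two clusters. How the feature applies the score internally is not documented here, so the two-cluster framing, the function name, and the sample data are assumptions for illustration.

```python
def silhouette_score(values, labels):
    """Mean silhouette coefficient for 1-D values with one cluster label per value."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a single-member cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(v - o) for o in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(abs(v - o) for o in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

values = [1, 2, 3, 100, 101]
good = silhouette_score(values, [0, 0, 0, 1, 1])  # label 1 marks the outliers
bad = silhouette_score(values, [0, 1, 0, 1, 0])   # an arbitrary, poor labeling
print(good > bad)  # prints True
```

A detector whose inlier and outlier labels separate the data cleanly scores close to 1, which is why a ranking by this score favors the labeling that best isolates the extreme values.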
For datasets with fewer than 50 data points, only Tukey's and Z-Score methods are applied. More complex algorithms, such as One-Class SVM, Isolation Forest, LOF, and DBSCAN, require a larger dataset to provide reliable results and accurately capture data distributions.
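The size rule can be expressed as a small selection function. This is a hypothetical sketch: the function name and list constants are invented for illustration, and the assumption that every listed algorithm becomes a candidate at 50 or more data points is inferred from the text rather than confirmed. The Ensemble Voting method is left out because the text does not say how the rule applies to it.

```python
STATISTICAL_METHODS = ["Tukey's method", "Z-Score method"]
COMPLEX_METHODS = ["One-Class SVM", "Isolation Forest", "LOF", "DBSCAN"]

def candidate_algorithms(n_points, min_points_for_complex=50):
    """Return the algorithms considered for a dataset with n_points data points."""
    if n_points < min_points_for_complex:
        # Small datasets: only the two statistical methods are applied.
        return list(STATISTICAL_METHODS)
    # Larger datasets: assumed to also qualify for the more complex algorithms.
    return STATISTICAL_METHODS + COMPLEX_METHODS

print(candidate_algorithms(30))
print(candidate_algorithms(200))
```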
AI-generated summary
The Outlier analysis feature uses granite-3-8b-instruct from the Granite family of IBM foundation models to generate the summary. Granite models are trained on data that is:
- Sourced from quality data sets in domains such as finance (SEC Filings), law (Free Law), technology (Stack Exchange), science (arXiv, DeepMind Mathematics), literature (Project Gutenberg (PG-19)), and more.
- Compliant with rigorous IBM data clearance and governance standards.
- Scrubbed of hate, abuse, and profanity, data duplication, and blocklisted URLs, among other things.
Tips on working with AI
When you work with Outlier analysis, here are a few things that you can do that might lead the AI model to generate a better analysis:
- Ensure that the values in the exploration data are numeric.
- Remove spacers from rows or columns.
- Remove duplicate values.