Outlier analysis

Use Outlier analysis to identify and correct data points in an exploration that deviate significantly from the norm, potentially indicating errors or anomalies. The analysis also includes AI-generated insights for any outliers that were detected and corrected in the view, helping you to investigate anomalies and improve the quality of your data.

The initial release of the Outlier analysis feature is available only in Planning Analytics Workspace Cloud and only in English.
Important: The Outlier analysis feature is part of the Planning Analytics AI assistant add-on, which requires a separate license purchase; it is not included in a standard Planning Analytics Workspace license. To purchase a license, contact your IBM® sales representative or go directly to your IBM account.

To perform an outlier analysis, click an exploration in a book and select the rows that you want to analyze. Then, click AI operations in the toolbar and select Outlier analysis.

You can select up to 25 rows to run Outlier analysis, but the feature considers only the first 120 columns in the exploration.

The Outlier analysis option is available under AI operations in the toolbar.

By default, the resulting analysis displays the data as a Scatter chart, making it easy to see any outliers. Each candidate algorithm is evaluated by using the Silhouette score, and the best-ranked algorithm is then used to run the analysis. However, you can select a different algorithm and regenerate the analysis. You can also switch the chart type from a Scatter chart to a Line chart.

To correct the outliers, enable Outlier correction. The visualization is adjusted to use a “normalized” data point for each outlier.
Important: Outlier correction does not modify the actual data point that was identified as an outlier in your Planning Analytics database.

An AI-generated summary of the outlier analysis is created automatically when you run the feature. If Outlier correction is enabled, the summary includes insights for the outliers that were corrected.

Algorithms used in outlier analysis

Tukey's method
Tukey's method identifies outliers by calculating the interquartile range (IQR) and defining lower and upper bounds. Data points outside these bounds are considered outliers.
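For illustration only, the following Python sketch shows how Tukey's fences can be computed. This is not the Planning Analytics implementation, and the conventional 1.5 × IQR multiplier is an assumption.

    import numpy as np

    def tukey_outliers(values, k=1.5):
        # Compute the interquartile range and the lower/upper Tukey fences.
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Flag any point that falls outside the fences.
        return [v < lower or v > upper for v in values]

    print(tukey_outliers([10, 12, 11, 13, 12, 95]))  # only 95 is flagged
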
Z-Score method
The Z-Score method is a statistical technique that standardizes data points to determine how many standard deviations they are from the mean. A Z-score greater than a specified threshold (commonly 3) indicates an outlier.
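A minimal Python sketch of the same idea follows. The threshold value is an assumption for illustration; the feature's actual threshold is not documented here, and with very small samples a lower threshold is often needed because a single extreme value inflates the standard deviation.

    import numpy as np

    def zscore_outliers(values, threshold=3.0):
        values = np.asarray(values, dtype=float)
        # Standardize each point: how many standard deviations is it from the mean?
        z = (values - values.mean()) / values.std()
        return np.abs(z) > threshold

    print(zscore_outliers([10, 12, 11, 13, 12, 95], threshold=2.0))  # 95 exceeds 2 standard deviations
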
One-Class Support Vector Machine (SVM)
One-Class SVM is an unsupervised machine learning algorithm that learns a decision boundary around the normal instances of the data. Data points outside this boundary are classified as outliers.
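For illustration, a minimal scikit-learn sketch; the nu, kernel, and gamma parameters here are assumptions, not the values that the feature uses.

    import numpy as np
    from sklearn.svm import OneClassSVM

    X = np.array([[10], [12], [11], [13], [12], [95]])  # toy one-column data
    # fit_predict returns 1 for points inside the learned boundary and -1 for outliers.
    labels = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit_predict(X)
    print(labels)
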
Isolation Forest
Isolation Forest builds an ensemble of decision trees to isolate data points. Outliers are detected as points that require fewer splits to isolate than the majority of data points.
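For illustration, a minimal scikit-learn sketch; the default parameters are used here and are not necessarily what the feature applies.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    X = np.array([[10], [12], [11], [13], [12], [95]])
    # Points that can be isolated with few random splits are scored as outliers (-1).
    labels = IsolationForest(random_state=0).fit_predict(X)
    print(labels)
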
Local Outlier Factor (LOF)
The LOF algorithm measures the local density deviation of a given data point with respect to its neighbors. Data points with a substantially lower density than their neighbors are classified as outliers.
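A minimal scikit-learn sketch for illustration; the small neighborhood size suits the toy data and is an assumption, not the feature's setting.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.array([[10], [12], [11], [13], [12], [95]])
    # Points whose local density is much lower than that of their neighbors get -1.
    labels = LocalOutlierFactor(n_neighbors=3).fit_predict(X)
    print(labels)
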
DBSCAN
DBSCAN is a density-based clustering algorithm that groups data points based on their density. Data points in low-density regions that do not belong to any cluster are considered outliers.
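A minimal scikit-learn sketch for illustration; the eps and min_samples values are assumptions chosen for the toy data.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[10], [12], [11], [13], [12], [95]])
    # Points in low-density regions that join no cluster are labelled -1 (noise).
    labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
    print(labels)  # the isolated point 95 is labelled -1
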
Ensemble Voting method
The Ensemble Voting method combines the results of the various outlier detection algorithms to improve the robustness of the detection process. By using a voting mechanism, this method selects the most frequently identified outliers across all methods, enhancing overall accuracy. This approach is effective when different algorithms yield varying results, as it uses the strengths of each method.
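For illustration, a simple majority-vote sketch in Python; the actual voting rule that the feature applies is not documented here, so the two-vote threshold is an assumption.

    import numpy as np

    def ensemble_vote(flag_sets, min_votes=2):
        # Count, for each data point, how many detectors flagged it as an outlier.
        votes = np.sum(np.array(flag_sets, dtype=int), axis=0)
        return votes >= min_votes

    # Boolean flags from three hypothetical detectors for six data points.
    tukey  = [False, False, False, False, False, True]
    zscore = [False, False, False, False, False, True]
    lof    = [False, True,  False, False, False, False]
    print(ensemble_vote([tukey, zscore, lof]))  # only the last point reaches two votes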

The effectiveness of each outlier detection method is evaluated with the Silhouette score. This score measures how similar an object is to its own cluster compared to other clusters, providing a way to assess the quality of clustering and the separation of outliers. A higher Silhouette score indicates better-defined clusters.
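For illustration, the Silhouette score can be computed with scikit-learn on the inlier/outlier labels that a detector produces; this sketch is not the feature's implementation.

    import numpy as np
    from sklearn.metrics import silhouette_score

    X = np.array([[10], [12], [11], [13], [12], [95]])
    labels = np.array([0, 0, 0, 0, 0, 1])  # 0 = inlier cluster, 1 = detected outlier
    # Scores close to 1 indicate well-separated groups; scores near 0 indicate overlap.
    print(silhouette_score(X, labels))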

For datasets with fewer than 50 data points, only Tukey's and Z-Score methods are applied. More complex algorithms, such as One-Class SVM, Isolation Forest, LOF, and DBSCAN, require a larger dataset to provide reliable results and accurately capture data distributions.

AI-generated summary

The Outlier analysis feature uses granite-3-8b-instruct from the Granite family of IBM foundation models to generate the summary.

The Granite models are decoder-only models that can efficiently predict and generate language. These models were built with trusted data that has the following characteristics:
  • Sourced from quality data sets in domains such as finance (SEC Filings), law (Free Law), technology (Stack Exchange), science (arXiv, DeepMind Mathematics), literature (Project Gutenberg (PG-19)), and more.
  • Compliant with rigorous IBM data clearance and governance standards.
  • Scrubbed of hate, abuse, and profanity; duplicate data; and blocklisted URLs, among other things.
Note: IBM is committed to building AI that is open, trusted, targeted, and empowering. For more information about contractual protections that are related to IBM indemnification, see the IBM Client Relationship Agreement and IBM watsonx.ai service description.

Tips on working with AI

While working with Outlier analysis, here are a few things that you can do that might help the AI model generate a better analysis:

  • Ensure that the values in the exploration data are numeric.
  • Remove spacers from rows or columns.
  • Remove duplicate values.