Data preparation for categorical fields

A field is treated as categorical whenever its usage property is set to attribute or identifier.

Overview

The main information that is extracted from categorical fields is the observed frequency of each unique category value. Appropriate analytic methods are applied to categorical fields, but their accuracy and performance can be adversely affected when the number of different categories becomes large. The main data preparation step is therefore to merge categories when their number becomes large.

Algorithms

The basic algorithm that is used is merging categories. Categories are sorted by their frequency in descending order, and the categories beyond the default number are merged into a single category. Missing values are treated as a single separate category. IBM® Cognos Analytics handles missing values in categorical fields in a similar way as in numeric fields. Categorical fields are treated as nominal. No intrinsic order is assumed among categories.
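The merging step described above can be illustrated with a minimal Python sketch. It is not the product's implementation. The function name merge_by_frequency, the labels "<missing>" and "<other>", and the use of None to represent a missing value are all assumptions made for illustration.

```python
from collections import Counter

MISSING = "<missing>"  # hypothetical label for the missing-value category
MERGED = "<other>"     # hypothetical label for the merged category


def merge_by_frequency(values, max_categories):
    """Keep the max_categories most frequent categories; merge the rest."""
    # Assumption: None marks a missing value. All missing values are
    # counted together as one separate category.
    labels = [MISSING if v is None else v for v in values]
    counts = Counter(labels)
    # most_common sorts categories by frequency in descending order.
    kept = {category for category, _ in counts.most_common(max_categories)}
    # Categories beyond the limit are replaced by the single merged label.
    return [label if label in kept else MERGED for label in labels]
```

For example, calling merge_by_frequency on a field with max_categories=2 keeps the two most frequent categories, one of which may be the missing-value category, and relabels every other row as "<other>".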

Details

Certain field exclusion criteria apply to categorical fields. A categorical field is excluded from further analysis if it has only a single value, or if the number of unique, non-merged categories exceeds 50% of the number of valid data rows.

Otherwise, the categories of the field are merged. The default number of non-merged categories is 49, and the rest of the categories are merged into a single extra category. All categories with a row count smaller than 3 are also merged. A categorical field is also excluded if the percentage of valid data rows that correspond to the merged category exceeds 25%.
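The exclusion and merging rules above can be sketched in Python as follows. This is a hedged illustration, not the product's implementation. The function name prepare_categorical and its parameter names are hypothetical. The sketch also assumes that a valid data row is a row with a non-missing value (represented here as anything other than None), and that unique categories are counted before merging. For brevity, it leaves out the missing-value category.

```python
from collections import Counter


def prepare_categorical(values, max_categories=49, min_count=3,
                        max_unique_ratio=0.5, max_merged_ratio=0.25):
    """Return ("excluded", []) or ("kept", list of non-merged categories)."""
    # Assumption: valid rows are the rows whose value is not None.
    valid = [v for v in values if v is not None]
    counts = Counter(valid)

    # Exclude fields with a single value (or no valid value at all), and
    # fields whose unique categories exceed 50% of the valid rows.
    if len(counts) <= 1 or len(counts) > max_unique_ratio * len(valid):
        return "excluded", []

    # Keep at most 49 of the most frequent categories, and only those
    # with a row count of at least 3; everything else is merged.
    kept = [c for c, n in counts.most_common(max_categories) if n >= min_count]

    # Exclude the field if the merged category covers more than 25%
    # of the valid rows.
    merged_rows = len(valid) - sum(counts[c] for c in kept)
    if merged_rows > max_merged_ratio * len(valid):
        return "excluded", []

    return "kept", kept
```

The thresholds are exposed as parameters only so that the defaults stated above (49 categories, a minimum row count of 3, 50%, and 25%) are visible in one place.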

Missing values are treated as a separate category and considered in the merging step as such.