Data preparation for target fields

Specification of the target field is required for key drivers and decision tree visualizations.

Overview

Always specify the target field and at least one additional input field. Models are trained on the supplied target values to detect predictive relationships and, ultimately, to predict target values from the input field values. Data preparation for the target field differs from the data preparation for the other fields. Rows with a missing target value are not used for building models, but the remaining target information is preserved and sometimes adjusted to obtain unbiased models.

Algorithms

The main data preparation step for target fields is the removal of all data rows with a missing target value. This happens before any other data preparation step. Although it ensures that only reliable information is used for model building, the number of removed rows can be substantial, and the resulting model might have a limited scope in such cases. Numeric target fields are not binned, but extreme outliers are handled so that they do not adversely affect the models that are built later. Categorical target fields are treated much like other categorical fields. The only difference is that a categorical target contains no missing values, because the rows with missing target values have already been removed.
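The row removal step can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name drop_missing_target, the row-as-dictionary layout, and the choice to treat both None and NaN as missing are assumptions made for illustration.

```python
import math


def drop_missing_target(rows, target):
    """Return (kept_rows, removed_count) after dropping rows whose
    target value is missing.

    Assumption: a value counts as missing when it is None or a float NaN.
    Missing values in non-target fields do not cause a row to be dropped.
    """
    kept = []
    for row in rows:
        value = row.get(target)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.append(row)
    return kept, len(rows) - len(kept)


# Hypothetical sample data: "income" is the target field.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": 51, "income": None},          # dropped: target is missing
    {"age": None, "income": 61000.0},     # kept: only an input is missing
    {"age": 29, "income": float("nan")},  # dropped: target is NaN
]

kept, removed = drop_missing_target(rows, "income")
print(f"kept {len(kept)} rows, removed {removed}")
```

Reporting the removed count alongside the kept rows makes it easy to see when a substantial share of the data was excluded.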

Details

Extreme outliers are detected based on lower and upper boundaries. The upper boundary is constructed from an upper percentile chosen so that only 2.5% of the target values are greater than it (that is, the 97.5th percentile). The difference between this upper percentile and the median is multiplied by 2.5 and added to the median to obtain the upper boundary. The lower boundary is obtained in a similar way, working from a lower percentile and subtracting from the median. Target values that fall beyond the computed boundaries are replaced by the corresponding boundary value in all subsequent analyses.
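The boundary calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the lower boundary is assumed to mirror the upper one by using the 2.5th percentile, percentiles are computed with linear interpolation, and the function names percentile, target_boundaries, and clip_target are hypothetical.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted values, linear interpolation.

    Assumption: the interpolation method is chosen for illustration only.
    """
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def target_boundaries(values, tail=2.5, factor=2.5):
    """Return (lower, upper) boundaries for a numeric target.

    upper = median + factor * (upper percentile - median)
    lower = median - factor * (median - lower percentile)   # assumed mirror
    """
    ordered = sorted(values)
    median = statistics.median(ordered)
    upper_pct = percentile(ordered, 100.0 - tail)
    lower_pct = percentile(ordered, tail)
    upper = median + factor * (upper_pct - median)
    lower = median - factor * (median - lower_pct)
    return lower, upper


def clip_target(values):
    """Replace values beyond the boundaries with the boundary value."""
    lower, upper = target_boundaries(values)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample: the values 1 to 100 plus one extreme value.
values = list(range(1, 101)) + [1000]
lower, upper = target_boundaries(values)
clipped = clip_target(values)
print(lower, upper, clipped[-1])
```

In this sample the median is 51 and the 97.5th percentile is 98.5, so the upper boundary is 51 + 2.5 * 47.5 = 169.75. The value 1000 is replaced by 169.75, and the values 1 to 100 are left unchanged.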