Preparing inputs and targets

Because data is rarely in a perfect state for processing, you may want to adjust some of the settings before running an analysis. For example, you might remove outliers, specify how to handle missing values, or adjust the measurement level of fields.

Note: If you change the values on this panel, the Objectives tab is automatically updated to select the Custom analysis option.

Prepare the input and target fields for modeling. Toggles all settings on the panel either on or off.

Adjust Type and Improve Data Quality. You can specify several data transformations separately for the inputs and the target, because you may not want to change the values of the target. For example, a prediction of income in dollars is more meaningful than a prediction measured in log(dollars). In addition, if the target has missing values, there is no predictive gain from filling them, whereas filling missing values in inputs may enable some algorithms to process information that would otherwise be lost.

Additional settings for these transformations, such as the outlier cutoff value, are common to both the target and inputs.

You can select the following settings for either, or both, inputs and target:

  • Adjust the type of numeric fields. Select this to determine if numeric fields with a measurement level of Ordinal can be converted to Continuous, or vice versa. You can specify the minimum and maximum threshold values to control the conversion.
  • Reorder nominal fields. Select this to sort nominal (set) fields into order, from smallest to largest category.
  • Replace outlier values in continuous fields. Specify whether to replace outliers; use this in conjunction with the Method for replacing outliers options below.
  • Continuous fields: replace missing values with mean. Select this to replace missing values of continuous (range) features.
  • Nominal fields: replace missing values with mode. Select this to replace missing values of nominal (set) features.
  • Ordinal fields: replace missing values with median. Select this to replace missing values of ordinal (ordered set) features.
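
The three missing-value options above pair each measurement level with one statistic. The following Python sketch illustrates that pairing only; it is not the product's implementation. The function name fill_missing, the level labels, and the use of None for a missing value are assumptions made for this example. It also assumes the low median for ordinal fields so that the fill value is always an observed category.

```python
from statistics import mean, median_low, mode

def fill_missing(values, level):
    """Replace None entries with the statistic matched to the measurement level.

    Hypothetical helper for illustration: continuous -> mean,
    nominal -> mode, ordinal -> median (low median, an assumption).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a statistic from; return the data unchanged.
        return list(values)
    if level == "continuous":
        fill = mean(observed)
    elif level == "nominal":
        fill = mode(observed)
    elif level == "ordinal":
        fill = median_low(observed)
    else:
        raise ValueError(f"unknown measurement level: {level}")
    return [fill if v is None else v for v in values]
```

For example, fill_missing([1.0, None, 3.0], "continuous") fills the gap with the mean of the observed values.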

Maximum number of values for ordinal fields. Specify the threshold for redefining ordinal (ordered set) fields as continuous (range). The default is 10; therefore, if an ordinal field has more than 10 categories it is redefined as continuous (range).

Minimum number of values for continuous fields. Specify the threshold for redefining scale or continuous (range) fields as ordinal (ordered set). The default is 5; therefore, if a continuous field has fewer than 5 values it is redefined as ordinal (ordered set).
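
The two thresholds can be read as a simple rule on the number of distinct values in a field. The sketch below is a minimal illustration under that reading; the function name, the level labels, and the choice to ignore missing values (None) when counting are assumptions, not documented behavior.

```python
def adjust_measurement_level(values, level,
                             max_ordinal_values=10,
                             min_continuous_values=5):
    """Return the measurement level after applying the two thresholds.

    Hypothetical helper: counts distinct non-missing values and compares
    the count with the defaults described above (10 and 5).
    """
    distinct = len({v for v in values if v is not None})
    if level == "ordinal" and distinct > max_ordinal_values:
        # More categories than the maximum: treat the field as continuous.
        return "continuous"
    if level == "continuous" and distinct < min_continuous_values:
        # Fewer values than the minimum: treat the field as ordinal.
        return "ordinal"
    return level
```

With the defaults, an ordinal field with 11 distinct values becomes continuous, and a continuous field with 3 distinct values becomes ordinal.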

Outlier cutoff value. Specify the outlier cutoff criterion, measured in standard deviations; the default is 3.

Method for replacing outliers. Select whether outliers are replaced by trimming (coercing) them to the cutoff value, or deleted and set to missing values. Any outliers set to missing values follow the missing value handling settings selected above.
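
The cutoff and the two replacement methods can be sketched as follows. This example assumes that an outlier is any value farther than the cutoff, in sample standard deviations, from the field mean; the exact statistic the product uses is not stated here. The function name and the method labels "trim" and "missing" are hypothetical.

```python
from statistics import mean, stdev

def replace_outliers(values, cutoff=3.0, method="trim"):
    """Trim outliers to the cutoff bounds or set them to missing (None).

    Assumption: bounds are mean +/- cutoff * sample standard deviation.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        # A standard deviation needs at least two observed values.
        return list(values)
    mu, sigma = mean(observed), stdev(observed)
    lower, upper = mu - cutoff * sigma, mu + cutoff * sigma
    result = []
    for v in values:
        if v is None or lower <= v <= upper:
            result.append(v)              # not an outlier; keep as-is
        elif method == "trim":
            result.append(upper if v > upper else lower)
        elif method == "missing":
            result.append(None)           # handled later by missing-value settings
        else:
            raise ValueError(f"unknown method: {method}")
    return result
```

Values set to None by the "missing" method would then be picked up by whichever missing-value replacement is selected for the field.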

Put all continuous input fields on a common scale. To normalize continuous input fields, select this check box and choose the normalization method. The default is z-score transformation, where you can specify the Final mean, which has a default of 0, and the Final standard deviation, which has a default of 1. Alternatively, you can choose to use Min/max transformation and specify the minimum and maximum values, which have default values of 0 and 100 respectively.

This option is especially useful when you select Perform feature construction on the Construct & Select Features panel.
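
The two normalization methods can be illustrated with a short Python sketch. The function and parameter names are assumptions chosen to mirror the settings above, and the use of the population standard deviation in the z-score is also an assumption.

```python
from statistics import mean, pstdev

def zscore(values, final_mean=0.0, final_sd=1.0):
    """Rescale values so they have the requested mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant field has no spread; map every value to the final mean.
        return [final_mean for _ in values]
    return [final_mean + final_sd * (v - mu) / sigma for v in values]

def min_max(values, new_min=0.0, new_max=100.0):
    """Rescale values linearly onto the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant field cannot be stretched; map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (new_max - new_min) * (v - lo) / (hi - lo) for v in values]
```

With the defaults, min_max([2.0, 4.0, 6.0]) maps the smallest value to 0, the largest to 100, and the middle value to 50.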

Rescale a continuous target with a Box-Cox transformation. To normalize a continuous (scale or range) target field, select this check box. The Box-Cox transformation has default values of 0 for the Final mean and 1 for the Final standard deviation.

Note: If you choose to normalize the target, the units of the target are transformed. In this case you may need to generate a Derive node to apply an inverse transformation that turns the transformed values back into a recognizable format for further processing. See the topic Generating a Derive node for more information.
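
To show why an inverse transformation is needed, the following sketch applies the standard one-parameter Box-Cox formula and its inverse to a single value. The function names are hypothetical, and the parameter lam is supplied by the caller; how the product chooses it, and any subsequent rescaling to the Final mean and Final standard deviation, are not covered here.

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires a positive value")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def inverse_box_cox(y, lam):
    """Undo box_cox: map a transformed value back to the original units."""
    return math.exp(y) if lam == 0 else (lam * y + 1) ** (1 / lam)
```

A prediction made on the transformed scale, such as box_cox(55000.0, 0.5), only becomes an income in dollars again after inverse_box_cox is applied with the same lam.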