Feature Selection Options
The Options tab allows you to specify the default settings for selecting or excluding input fields in the model nugget. You can then add the model to a stream to select a subset of fields for use in subsequent model-building efforts. Alternatively, you can override these settings by selecting or deselecting additional fields in the model browser after generating the model. However, the default settings make it possible to apply the model nugget without further changes, which may be particularly useful for scripting purposes.
See the topic Feature Selection Model Results for more information.
The following options are available:
All fields ranked. Selects fields based on their ranking as important, marginal, or unimportant. You can edit the label for each ranking as well as the cutoff values used to assign fields to one rank or another.
Top number of fields. Selects the top n fields based on importance.
Importance greater than. Selects all fields with importance greater than the specified value.
The target field is always preserved regardless of the selection.
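The three selection modes can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names, the mode strings, the sample field names and importance values, and the cutoffs of 0.90 and 0.95 are all assumptions chosen for illustration.

```python
def rank_label(value, marginal_cutoff=0.90, important_cutoff=0.95):
    """Assign a rank label from an importance value (cutoffs are assumed)."""
    if value > important_cutoff:
        return "important"
    if value >= marginal_cutoff:
        return "marginal"
    return "unimportant"


def select_fields(importance, mode, target, keep_ranks=("important",),
                  top_n=10, min_importance=0.95):
    """Return the selected input fields followed by the target field."""
    if mode == "ranked":
        # All fields ranked: keep fields whose rank label is in keep_ranks.
        chosen = [f for f, v in importance.items() if rank_label(v) in keep_ranks]
    elif mode == "top_n":
        # Top number of fields: keep the n fields with the highest importance.
        ordered = sorted(importance, key=importance.get, reverse=True)
        chosen = ordered[:top_n]
    elif mode == "greater_than":
        # Importance greater than: keep fields above the specified value.
        chosen = [f for f, v in importance.items() if v > min_importance]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The target is preserved regardless of the selection.
    return chosen + [target]


# Hypothetical importance values for four input fields.
scores = {"age": 0.99, "income": 0.97, "region": 0.93, "shoe_size": 0.40}
print(select_fields(scores, "ranked", "churn"))
print(select_fields(scores, "top_n", "churn", top_n=1))
print(select_fields(scores, "greater_than", "churn", min_importance=0.5))
```

Whichever mode is chosen, the target is appended last, so it survives the selection.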
Importance Ranking Options
All categorical. When all inputs and the target are categorical, importance can be ranked based on any of four measures:
- Pearson chi-square. Tests for independence of the target and the input without indicating the strength or direction of any existing relationship.
- Likelihood-ratio chi-square. Similar to Pearson's chi-square in that it also tests for target-input independence.
- Cramer's V. A measure of association based on Pearson's chi-square statistic. Values range from 0, which indicates no association, to 1, which indicates perfect association.
- Lambda. A measure of association reflecting the proportional reduction in error when the input field is used to predict the target value. A value of 1 indicates that the input field perfectly predicts the target, while a value of 0 means the input provides no useful information about the target.
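As a rough illustration of how the four measures differ, the following Python sketch computes each one from a contingency table using the standard textbook formulas. It is not the product's implementation, and it computes only the statistics, not the p-values. The function name and the table layout (rows are input categories, columns are target categories) are assumptions. It also assumes that every row and column has at least one record and that the target has more than one category.

```python
import math


def association_measures(table):
    """table[i][j] = number of records with input category i and target category j."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    pearson = 0.0
    likelihood_ratio = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if input and target were independent.
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells contribute nothing to the sum
                likelihood_ratio += 2.0 * observed * math.log(observed / expected)

    # Cramer's V rescales Pearson's chi-square to the range 0 to 1.
    k = min(len(table), len(table[0]))
    cramers_v = math.sqrt(pearson / (n * (k - 1)))

    # Lambda: proportional reduction in prediction errors when the modal
    # target category is guessed within each input category rather than overall.
    errors_without_input = n - max(col_totals)
    errors_with_input = n - sum(max(row) for row in table)
    lam = (errors_without_input - errors_with_input) / errors_without_input

    return {
        "pearson_chi_square": pearson,
        "likelihood_ratio_chi_square": likelihood_ratio,
        "cramers_v": cramers_v,
        "lambda": lam,
    }


perfect = association_measures([[10, 0], [0, 10]])   # input determines the target
unrelated = association_measures([[5, 5], [5, 5]])   # input carries no information
print(perfect)
print(unrelated)
```

In the first table Cramer's V and lambda both reach 1, and in the second both are 0, matching the endpoints described above.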
Some categorical. When some, but not all, inputs are categorical and the target is also categorical, importance can be ranked based on either the Pearson or likelihood-ratio chi-square. (Cramer's V and lambda are not available unless all inputs are categorical.)
Categorical versus continuous. When ranking a categorical input against a continuous target or vice versa (one or the other is categorical but not both), the F statistic is used.
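The text does not say how the F statistic is computed. Assuming it is the usual one-way ANOVA F, with the categorical field defining the groups and the continuous field supplying the values, a minimal Python sketch looks like this. The function name and sample data are hypothetical.

```python
def f_statistic(values, groups):
    """One-way ANOVA F: between-group variance divided by within-group variance."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)

    n = len(values)
    k = len(by_group)
    grand_mean = sum(values) / n

    ss_between = 0.0
    ss_within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    # Mean squares use k - 1 and n - k degrees of freedom respectively.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: a continuous field measured in two categories.
f_value = f_statistic([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"])
print(f_value)
```

The sketch assumes at least two groups and some variation within the groups; otherwise a denominator is zero.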
Both continuous. When ranking a continuous input against a continuous target, the t statistic based on the correlation coefficient is used.
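A t statistic based on the correlation coefficient is conventionally computed as t = r * sqrt((n - 2) / (1 - r^2)), where r is the Pearson correlation and n is the number of records. The Python sketch below assumes that standard form. The function name and sample data are hypothetical.

```python
import math


def correlation_t(x, y):
    """Return the Pearson correlation r and its t statistic with n - 2 degrees of freedom."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    # Undefined when |r| is exactly 1 (perfect linear relationship).
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t


# Hypothetical input and target values.
r, t = correlation_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r, t)
```

A larger absolute value of t indicates stronger evidence of a linear relationship between the input and the target.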