Overview (SELECTPRED command)

The SELECTPRED procedure evaluates a large number of predictor variables and selects a smaller subset for use in predictive modeling procedures. This procedure uses a univariable method that considers each predictor in isolation, as opposed to the multivariable method of selecting predictors that is used by the NAIVEBAYES procedure. The SELECTPRED procedure supports both categorical and scale dependent variables and accepts very large sets of predictors. SELECTPRED is useful when rapid computing is required.

Options

Methods. The SELECTPRED procedure includes a screening step as well as the univariable predictor selection method.

Missing values. Cases with missing values for the dependent variable or for all predictors are excluded. The SELECTPRED procedure has an option for treating user-missing values of categorical variables as valid. User-missing values of scale variables are always treated as invalid.

Output. SELECTPRED displays pivot table output by default but offers an option for suppressing most such output. The procedure optionally saves lists of the selected categorical and scale predictors (not a model) in global macro variables. These global macro variables can be used to represent the subset of selected predictors in subsequent procedures.

Basic specification

The basic specification is the SELECTPRED command followed by a dependent variable.

By default, SELECTPRED determines the measurement level of the dependent variable based on its dictionary setting. All other variables in the dataset — except the weight variable if it is defined — are treated as predictors, with the dictionary setting of each predictor determining its measurement level. SELECTPRED performs the screening step and then selects a subset of predictors by using a univariable method. User-missing values are excluded, and default pivot table output is displayed.

Syntax rules

All subcommands are optional.
Only a single instance of each subcommand is allowed.
An error occurs if a keyword is specified more than once within a subcommand.
Parentheses, equals signs, and slashes that are shown in the syntax chart are required.
The command name, subcommand names, and keywords must be spelled in full.
Empty subcommands are not honored.

Operations

The SELECTPRED procedure begins by excluding the following types of cases and predictors.

Cases with a missing value for the dependent variable.
Cases with missing values for all predictors.
Predictors with missing values for all cases.
Predictors with the same value for all cases.
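
The four exclusions above can be illustrated with a short Python sketch. This is not the procedure's actual implementation. The function name, the list-of-dictionaries data layout, the use of None for missing values, and the order in which cases and predictors are checked are all assumptions made for the example.

```python
def initial_exclusions(cases, dependent, predictors):
    """Apply the four unconditional exclusions to a small in-memory dataset.

    cases is a list of dicts (one per case); None marks a missing value.
    Assumption: cases are filtered first, then predictors are checked
    against the remaining cases.
    """
    # Drop cases with a missing dependent value or with every predictor missing.
    kept_cases = [
        c for c in cases
        if c.get(dependent) is not None
        and any(c.get(p) is not None for p in predictors)
    ]

    kept_predictors = []
    for p in predictors:
        observed = {c[p] for c in kept_cases if c.get(p) is not None}
        # No distinct values: missing for all cases. One distinct value: constant.
        if len(observed) > 1:
            kept_predictors.append(p)
    return kept_cases, kept_predictors


cases = [
    {"y": 1, "a": 1, "b": None, "c": 5},
    {"y": None, "a": 2, "b": None, "c": 5},   # missing dependent value
    {"y": 0, "a": None, "b": None, "c": None},  # all predictors missing
    {"y": 0, "a": 3, "b": None, "c": 5},
]
kept_cases, kept_predictors = initial_exclusions(cases, "y", ["a", "b", "c"])
```

In this example, two cases survive and only predictor "a" is kept: "b" is missing for all cases and "c" has the same value for all cases.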

A further screening step can be used to exclude the following types of predictors:

Predictors with a large percentage of missing values.
Categorical predictors with a large percentage of cases representing a single category.
Categorical predictors with a large percentage of categories containing one case.
Scale predictors with a small coefficient of variation (standard deviation divided by mean).
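
As a rough illustration, the four screening measures could be computed as follows in Python. The function names are hypothetical, None stands for a missing value, and the choice of the sample standard deviation and of non-missing cases as the denominator for the categorical measures are assumptions. The cutoffs that define "large" and "small" are not stated in this section, so the sketch leaves the comparison to the caller.

```python
import statistics
from collections import Counter


def pct_missing(values):
    """Percentage of cases with a missing value (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


def pct_modal_category(values):
    """Percentage of non-missing cases that fall in the most frequent category."""
    counts = Counter(v for v in values if v is not None)
    return 100.0 * max(counts.values()) / sum(counts.values())


def pct_singleton_categories(values):
    """Percentage of categories that contain exactly one case."""
    counts = Counter(v for v in values if v is not None)
    return 100.0 * sum(n == 1 for n in counts.values()) / len(counts)


def coefficient_of_variation(values):
    """Standard deviation divided by mean; None when it cannot be computed."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return None
    mean = statistics.fmean(observed)
    if mean == 0:
        return None
    # Assumption: sample (n - 1) standard deviation.
    return statistics.stdev(observed) / mean
```

A predictor would then be screened out when, for example, pct_missing exceeds a user-chosen percentage or coefficient_of_variation falls below a user-chosen minimum.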

The univariable predictor selection method then ranks each remaining predictor based on its association with the dependent variable and selects the top subset of predictors to use in subsequent modeling procedures. The ranking criterion that is used for any given predictor depends on the measurement level of the dependent variable as well as the measurement level of the predictor.

Note: Measurement level can affect the results. If any variables (fields) have an unknown measurement level, a data pass is performed to determine the measurement level before the analysis begins. For information on the determination criteria, see SET SCALEMIN.

Categorical dependent variable.

For categorical predictors, the ranking criterion can be the Pearson chi-square p-value, likelihood ratio chi-square p-value, Cramér's V, or Lambda.
For scale predictors, the F p-value from a one-way ANOVA is always used as the criterion.
For mixed predictors, the categorical predictors use the Pearson chi-square p-value or the likelihood ratio chi-square p-value, and the scale predictors use the F p-value from a one-way ANOVA.
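
To make one of the categorical criteria concrete, the following Python sketch computes the Pearson chi-square statistic and Cramér's V for a contingency table of a categorical predictor (rows) against a categorical dependent variable (columns). It uses the standard textbook formulas and is not taken from the procedure itself. The function names are hypothetical, and the sketch assumes that every row and column total is greater than zero.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence; assumed to be nonzero.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, columns) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi_square(table) / (n * k))
```

For a perfectly associated table such as [[10, 0], [0, 10]], cramers_v returns 1.0. For a table with no association such as [[5, 5], [5, 5]], it returns 0.0. Predictors with larger values of Cramér's V are more strongly associated with the dependent variable.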

Scale dependent variable.

For categorical predictors, the F p-value from a one-way ANOVA is always used as the criterion.
For scale predictors, the p-value for a Pearson correlation coefficient is always used.
For mixed predictors, both types of p-values are used, depending on the predictor type.
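
The rules for both kinds of dependent variable can be summarized as a small lookup. The Python sketch below is only a restatement of the lists above. The function name, the string labels, and the default categorical criterion are assumptions made for the example, not part of the SELECTPRED syntax.

```python
P_VALUE_CHOICES = {
    "Pearson chi-square p-value",
    "likelihood ratio chi-square p-value",
}
ALL_CHOICES = P_VALUE_CHOICES | {"Cramér's V", "Lambda"}
LEVELS = {"categorical", "scale"}


def ranking_criteria(dependent_level, predictor_levels,
                     categorical_choice="Pearson chi-square p-value"):
    """Map each predictor name to the criterion used to rank it.

    predictor_levels maps predictor name -> "categorical" or "scale".
    """
    levels = set(predictor_levels.values())
    if dependent_level not in LEVELS or not levels <= LEVELS:
        raise ValueError("measurement levels must be 'categorical' or 'scale'")

    if dependent_level == "scale":
        return {
            name: ("F p-value (one-way ANOVA)" if level == "categorical"
                   else "Pearson correlation p-value")
            for name, level in predictor_levels.items()
        }

    # Categorical dependent variable.
    if categorical_choice not in ALL_CHOICES:
        raise ValueError("unknown categorical criterion")
    if levels == LEVELS and categorical_choice not in P_VALUE_CHOICES:
        # Mixed predictors: only the two chi-square p-values are listed.
        raise ValueError("mixed predictors require a p-value criterion")
    return {
        name: (categorical_choice if level == "categorical"
               else "F p-value (one-way ANOVA)")
        for name, level in predictor_levels.items()
    }
```

In the mixed case every predictor is ranked by a p-value, so categorical and scale predictors are ranked on a common scale.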

Frequency weight

If a WEIGHT variable is specified, its values are used as frequency weights by the SELECTPRED procedure.

Cases with missing weights or weights that are less than 0.5 are not used in the analyses.
The weight values are rounded to the nearest whole numbers before use. For example, 0.5 is rounded to 1, and 2.4 is rounded to 2.
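
The two weight rules can be sketched in Python as follows. The function name is hypothetical, and rounding every half upward (so that 2.5 becomes 3) is an assumption that generalizes the 0.5 example. Python's built-in round() is avoided because it rounds halves to the nearest even number, which would turn 0.5 into 0.

```python
import math


def effective_weight(weight):
    """Return the whole-number frequency weight, or None if the case is unused."""
    # Cases with missing weights or weights below 0.5 are not used.
    if weight is None or weight < 0.5:
        return None
    # Round to the nearest whole number, with halves rounded up (assumption).
    return math.floor(weight + 0.5)
```

With this sketch, effective_weight(0.5) returns 1, effective_weight(2.4) returns 2, and effective_weight(0.4) returns None.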

Limitations

SPLIT FILE settings are ignored by the SELECTPRED procedure.