BUILD_OPTIONS Subcommand (LINEAR command)
The BUILD_OPTIONS subcommand specifies the criteria used to build the model.
OBJECTIVE. The objective determines which of the following model types is built.
- STANDARD. Create a standard model. The method builds a single model to predict the target using the predictors. Generally speaking, standard models are easier to interpret and can be faster to score than boosted, bagged, or large dataset ensembles.
- BOOSTING. Enhance model accuracy (boosting). The method builds an ensemble model using boosting, which generates a sequence of models to obtain more accurate predictions. Ensembles can take longer to build and to score than a standard model.
- BAGGING. Enhance model stability (bagging). The method builds an ensemble model using bagging (bootstrap aggregating), which generates multiple models to obtain more reliable predictions. Ensembles can take longer to build and to score than a standard model.
- LARGE. Create a model for very large datasets. The method builds an ensemble model by splitting the dataset into separate data blocks. Choose this option if your dataset is too large to build any of the models above, or for incremental model building. This option can take less time to build, but can take longer to score than a standard model.
USE_AUTO_DATA_PREPARATION = TRUE | FALSE. This option allows the procedure to transform the target and predictors in order to maximize the predictive power of the model. The original versions of transformed fields are excluded from the model. By default, the following automatic data preparation actions are performed. Note that this option is ignored and no automatic data preparation is performed if OBJECTIVE=LARGE.
- Date and Time handling. Each date predictor is transformed into a new continuous predictor containing the elapsed time since a reference date (1970-01-01). Each time predictor is transformed into a new continuous predictor containing the time elapsed since a reference time (00:00:00).
- Adjust measurement level. Continuous predictors with fewer than 5 distinct values are recast as ordinal predictors. Ordinal predictors with more than 10 distinct values are recast as continuous predictors.
- Outlier handling. Values of continuous predictors that lie beyond a cutoff value (3 standard deviations from the mean) are set to the cutoff value.
- Missing value handling. Missing values of nominal predictors are replaced with the mode. Missing values of ordinal predictors are replaced with the median. Missing values of continuous predictors are replaced with the mean.
- Supervised merging. This makes a more parsimonious model by reducing the number of fields to be processed in association with the target. Similar categories are identified based upon the relationship between the input and the target. Categories that are not significantly different (that is, having a p-value greater than 0.1) are merged. If all categories are merged into one, the original and derived versions of the field are excluded from the model because they have no value as a predictor.
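Three of the preparation rules listed above (measurement-level adjustment, outlier handling, and missing value handling) can be illustrated with a minimal Python sketch. This is not the procedure's implementation: the function names are hypothetical, missing values are represented as None, the sample standard deviation is assumed for the cutoff, and the order in which the procedure applies the steps is not specified here.

```python
from statistics import mean, median, mode, stdev


def adjust_measurement_level(values, level):
    """Recast the measurement level using the distinct-value thresholds above."""
    distinct = len({v for v in values if v is not None})
    if level == "continuous" and distinct < 5:
        return "ordinal"
    if level == "ordinal" and distinct > 10:
        return "continuous"
    return level


def clip_outliers(values, n_sd=3.0):
    """Set values beyond mean +/- n_sd standard deviations to the cutoff."""
    present = [v for v in values if v is not None]
    center = mean(present)
    spread = stdev(present)  # assumption: sample standard deviation
    low, high = center - n_sd * spread, center + n_sd * spread
    return [v if v is None else min(max(v, low), high) for v in values]


def fill_missing(values, level):
    """Replace None with the mode (nominal), median (ordinal), or mean (continuous)."""
    present = [v for v in values if v is not None]
    if level == "nominal":
        fill = mode(present)
    elif level == "ordinal":
        fill = median(present)
    else:
        fill = mean(present)
    return [fill if v is None else v for v in values]
```

For example, fill_missing(["a", "b", "a", None], "nominal") replaces the missing entry with "a", the most frequent category.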
CONFIDENCE_LEVEL. This is the level of confidence used to compute interval estimates of the model coefficients in the Coefficients view. Specify a value greater than 0 and less than 100. The default is 95.
MODEL_SELECTION. The model selection method determines how predictors are entered into the model.
- FORWARDSTEPWISE. This starts with no effects in the model and adds and removes effects one step at a time until no more can be added or removed according to the stepwise criteria. This is the default.
- BESTSUBSETS. This checks "all possible" models, or at least a larger subset of the possible models than forward stepwise, to choose the best according to the best subsets criterion. The model with the greatest value of the criterion is chosen as the best model. Note that best subsets selection is more computationally intensive than forward stepwise selection. When best subsets is performed in conjunction with boosting, bagging, or very large datasets, it can take considerably longer to build than a standard model built using forward stepwise selection.
- NONE. Enters all available predictors into the model.
CRITERIA_FORWARD_STEPWISE. This is the statistic used to determine whether an effect should be added to or removed from the model when forward stepwise selection is used. If MODEL_SELECTION = FORWARDSTEPWISE is not specified, this keyword is ignored.
- AICC. Information Criterion (AICC) is based on the likelihood of the data given the model, and is adjusted to penalize overly complex models.
- F. The F statistic criterion is based on a statistical test of the improvement in model error.
- ADJUSTEDRSQUARED. Adjusted R-squared is based on the fit of the data, and is adjusted to penalize overly complex models.
- ASE. The average squared error (ASE) is an overfit prevention criterion based on the fit of the overfit prevention set. The overfit prevention set is a random subsample of approximately 30% of the original dataset that is not used to train the model.
If any criterion other than the F statistic is chosen, then at each step the effect that corresponds to the optimal change in the criterion is added to the model (the greatest increase for Adjusted R-squared, decrease for AICC and ASE). Any effects in the model that correspond to a decrease (increase for AICC and ASE) in the criterion are removed.
If the F statistic is chosen as the criterion, then at each step the effect that has the smallest p-value less than the specified PROBABILITY_ENTRY threshold is added to the model. Any effects in the model with a p-value greater than the specified PROBABILITY_REMOVAL threshold are removed.
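The F statistic rule just described can be sketched as a single decision step in Python. This is an illustration only: the function name and the dictionary inputs (effect name mapped to p-value) are hypothetical, and computing the p-values from the F tests is left out.

```python
def f_criterion_step(candidate_pvalues, model_pvalues,
                     probability_entry=0.05, probability_removal=0.10):
    """Return (effect to add or None, sorted list of effects to remove)."""
    # Candidates qualify only if their p-value is below the entry threshold.
    eligible = {e: p for e, p in candidate_pvalues.items() if p < probability_entry}
    # Among qualifying candidates, the smallest p-value wins.
    to_add = min(eligible, key=eligible.get) if eligible else None
    # Effects already in the model are dropped if their p-value exceeds the removal threshold.
    to_remove = sorted(e for e, p in model_pvalues.items() if p > probability_removal)
    return to_add, to_remove


# Hypothetical p-values for effects outside and inside the current model.
step = f_criterion_step(
    {"age": 0.03, "income": 0.01, "region": 0.20},
    {"tenure": 0.15, "plan": 0.02},
)
```

Here "income" is selected for entry because it has the smallest qualifying p-value, and "tenure" is flagged for removal because 0.15 exceeds the default removal threshold of 0.10.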
PROBABILITY_ENTRY. When forward stepwise selection is used with the F statistic as the criterion, this is the threshold for entering effects into the model. Specify a number greater than 0 and less than 1. The default is 0.05.
- The value of PROBABILITY_ENTRY must be less than the value of PROBABILITY_REMOVAL.
PROBABILITY_REMOVAL. When forward stepwise selection is used with the F statistic as the criterion, this is the threshold for removing effects from the model. Specify a number greater than 0 and less than 1. The default is 0.10.
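The constraints on the two thresholds can be checked before a job is submitted. The following Python sketch uses a hypothetical helper name and simply restates the rules above; how the procedure itself reports an invalid combination is not covered here.

```python
def validate_thresholds(probability_entry=0.05, probability_removal=0.10):
    """Raise ValueError if the thresholds violate the documented constraints."""
    for name, value in (("PROBABILITY_ENTRY", probability_entry),
                        ("PROBABILITY_REMOVAL", probability_removal)):
        # Each threshold must be strictly between 0 and 1.
        if not 0 < value < 1:
            raise ValueError(f"{name} must be greater than 0 and less than 1")
    # The entry threshold must be strictly smaller than the removal threshold.
    if not probability_entry < probability_removal:
        raise ValueError("PROBABILITY_ENTRY must be less than PROBABILITY_REMOVAL")
    return True
```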
MAX_EFFECTS = number. Customize the maximum number of effects in the final model when forward stepwise selection is used. By default, all available effects can be entered into the model. Alternatively, if the stepwise algorithm ends a step with the specified maximum number of effects, the algorithm stops with the current set of effects.
MAX_STEPS = number. Customize the maximum number of steps when forward stepwise selection is used. The stepwise algorithm stops after a certain number of steps. By default, this is 3 times the number of available effects. Alternatively, specify a positive integer maximum number of steps.
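The two stopping limits can be sketched as follows. The helper names are hypothetical and the check is a simplified reading of the text above: a missing MAX_EFFECTS means no cap on effects, and a missing MAX_STEPS falls back to 3 times the number of available effects.

```python
def default_max_steps(n_available_effects):
    """Default step limit: 3 times the number of available effects."""
    return 3 * n_available_effects


def should_stop(steps_completed, n_effects_in_model, n_available_effects,
                max_effects=None, max_steps=None):
    """Return True if either limit has been reached at the end of a step."""
    if max_steps is None:
        max_steps = default_max_steps(n_available_effects)
    # With no MAX_EFFECTS given, the effect count never triggers a stop.
    hit_effect_cap = max_effects is not None and n_effects_in_model >= max_effects
    return hit_effect_cap or steps_completed >= max_steps
```

With 4 available effects and no explicit limits, the sketch stops after 12 steps.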
CRITERIA_BEST_SUBSETS. This is the statistic used to choose the "best" model when best subsets selection is used. If MODEL_SELECTION = BESTSUBSETS is not specified, this keyword is ignored.
- AICC. Information Criterion (AICC) is based on the likelihood of the data given the model, and is adjusted to penalize overly complex models.
- ADJUSTEDRSQUARED. Adjusted R-squared is based on the fit of the data, and is adjusted to penalize overly complex models.
- ASE. Overfit Prevention Criterion (ASE) is based on the fit of the overfit prevention set. The overfit prevention set is a random subsample of approximately 30% of the original dataset that is not used to train the model.
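To make two of these criteria concrete, the sketch below computes the average squared error and the textbook adjusted R-squared in Python. The function names are hypothetical, and the adjusted R-squared formula shown is the standard one with n_predictors excluding the intercept; the exact definitions used by the procedure may differ in detail.

```python
def average_squared_error(actual, predicted):
    """Mean of squared residuals, e.g. evaluated on the overfit prevention set."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def adjusted_r_squared(actual, predicted, n_predictors):
    """Textbook adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean_y) ** 2 for a in actual)                # total sum of squares
    r_squared = 1 - sse / sst
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)
```

A lower average squared error indicates a better fit on the held-out records, while a higher adjusted R-squared indicates a better fit after the penalty for model size.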
REPLICATE_RESULTS = TRUE** | FALSE. Setting a random seed allows you to replicate analyses. The random number generator is used to choose which records are in the overfit prevention set.
SEED = 54752075** | number. When REPLICATE_RESULTS=TRUE, this is the value of the random seed. Specify an integer. The default is 54752075.
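The effect of a fixed seed on the overfit prevention set can be illustrated with a short Python sketch. The function name is hypothetical, and Python's random number generator is a stand-in: the procedure uses its own generator and sampling scheme, so the records chosen here will not match those chosen by the procedure for the same seed.

```python
import random


def split_overfit_prevention_set(n_records, seed=54752075, holdout_fraction=0.30):
    """Return (training indices, holdout indices) for a seeded random split."""
    rng = random.Random(seed)  # a private generator so the split depends only on the seed
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_holdout = round(n_records * holdout_fraction)
    holdout = sorted(indices[:n_holdout])
    training = sorted(indices[n_holdout:])
    return training, holdout


# Two runs with the same seed select the same records.
first = split_overfit_prevention_set(100)
second = split_overfit_prevention_set(100)
```

Because both calls use the same seed, first and second contain identical training and holdout indices, which is what makes an analysis repeatable.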