Overview (LINEAR_RIDGE extension command)

LINEAR_RIDGE is an extension command that uses the Python sklearn.linear_model.Ridge class to estimate linear ridge regression models. In addition to fitting a model with a specified value of the alpha regularization parameter, LINEAR_RIDGE can display a ridge trace plot of coefficient values for a range of alpha values, or help you choose the alpha value through k-fold crossvalidation over a specified grid of values. If a single model is fitted or alpha is selected through crossvalidation, the final model can be applied to held-out data created by partitioning the input data, which gives a valid estimate of the model's out-of-sample performance.

Options

Mode

In Fit mode, a single model is fitted to the training data, using a default or user-specified value for alpha. In Trace mode, three plots for the training data are displayed: a ridge trace plot of regression coefficients, a plot of R², and a plot of mean squared error (MSE), all plotted vs. alpha, varying alpha over the specified values. In Crossvalidation mode, a grid search with crossvalidation to evaluate models is performed, and the best alpha is chosen based on validation subsets of the training data.
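The quantities behind the three Trace mode plots can be sketched directly with sklearn. This is a minimal illustration and not the extension's own code: the data are synthetic, the alpha values are arbitrary, and the numbers are printed rather than plotted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic training data (hypothetical; stands in for the active dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Arbitrary grid of alpha values for the trace.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

coefs, r2s, mses = [], [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    pred = model.predict(X)              # predictions on the training data
    coefs.append(model.coef_.copy())     # one point per coefficient on the trace
    r2s.append(r2_score(y, pred))
    mses.append(mean_squared_error(y, pred))

for alpha, coef, r2, mse in zip(alphas, coefs, r2s, mses):
    print(f"alpha={alpha:<7g} coef={np.round(coef, 3)} R2={r2:.4f} MSE={mse:.4f}")
```

As alpha grows, the coefficients shrink toward zero, training R² falls, and training MSE rises. Because all three are computed on the training data, they describe the trade-off but do not by themselves identify a best alpha.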

In either Fit or Crossvalidation mode, the single final model can be applied to a held-out partition of the data not used in earlier steps to obtain a valid estimate of out-of-sample performance of the model.

Alpha

Alpha specifies one or more values of the regularization parameter. In Fit mode, a single value is used. In Trace and Crossvalidation modes, a grid of values is specified as unique values and/or a range of values determined by minimum, maximum, and increment values. The Metric for ranges of values can be either linear or base-10 logarithmic. Plots for Trace or Crossvalidation modes are shown using the specified metric for the horizontal axis.
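One way to picture how a grid combines individual values with a minimum/maximum/increment range is the sketch below. The function alpha_grid and its arguments are hypothetical, and two assumptions are made that the text above does not confirm: under the logarithmic metric the range is read as exponents of 10, and individually listed values are taken literally under either metric.

```python
import math

def alpha_grid(values=(), start=None, stop=None, step=None, metric="linear"):
    """Combine individual alpha values with an optional start/stop/step range.

    Hypothetical helper. With metric="lg10" the range is ASSUMED to be given
    as exponents of 10; individually listed values are used as they are.
    """
    grid = [float(v) for v in values]
    if start is not None:
        if stop is None or step is None or step <= 0 or stop < start:
            raise ValueError("a range needs start <= stop and a positive step")
        # Count steps with a small tolerance to avoid floating-point drift.
        n_steps = math.floor((stop - start) / step + 1e-9)
        for i in range(n_steps + 1):
            point = start + i * step
            grid.append(10.0 ** point if metric == "lg10" else float(point))
    # Round to 12 significant digits so near-duplicates collapse, then sort.
    return sorted({float(f"{a:.12g}") for a in grid})

print(alpha_grid(start=0.1, stop=0.5, step=0.1))
print(alpha_grid(values=[0.5], start=0, stop=2, step=1, metric="lg10"))
```

A logarithmic grid is often convenient for ridge regression because useful alpha values can span several orders of magnitude.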

Criteria

You can specify whether to include an intercept in the model. By default an intercept is included. Note that the dependent variable is not standardized by the extension, and the intercept is not penalized during estimation. The intercept can be penalized by including a constant variable as a predictor and suppressing the intercept, but this is not recommended.

You can specify whether to standardize predictor values. The default is to standardize, and this is recommended. Standardization is applied to each indicator or dummy variable created to represent a level of a categorical factor variable, as well as each scale covariate.
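The effect of standardizing predictors, including indicator variables, can be illustrated with an sklearn pipeline. This is a sketch under stated assumptions, not the extension's implementation: the data are synthetic, one indicator column is created per factor level (the extension's actual coding is not described here), and StandardScaler is used as a stand-in for whatever standardization the extension applies.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 120
covariate = rng.normal(loc=50.0, scale=10.0, size=n)   # scale covariate
factor = rng.integers(0, 3, size=n)                    # hypothetical 3-level factor

# Assumption: one 0/1 indicator column per factor level.
dummies = (factor[:, None] == np.arange(3)).astype(float)
X = np.column_stack([covariate, dummies])
y = 0.2 * covariate + np.array([0.0, 1.0, -1.0])[factor] + rng.normal(size=n)

# Standardize every predictor column, then fit ridge with an unpenalized intercept.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0, fit_intercept=True))
model.fit(X, y)

X_std = model.named_steps["standardscaler"].transform(X)
ridge = model.named_steps["ridge"]
print("column means:", np.round(X_std.mean(axis=0), 6))
print("column SDs:  ", np.round(X_std.std(axis=0), 6))
print("intercept:", ridge.intercept_, " mean of y:", y.mean())
```

After standardization every column, indicator or covariate, has mean 0 and unit standard deviation, so the penalty treats them on a common scale. Because the intercept is not penalized and the predictors are centered, the fitted intercept equals the mean of the dependent variable whatever the value of alpha.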

You can specify the maximum time, in minutes, allowed for the analyses to complete. In Crossvalidation mode, you can specify the number of splits or folds in the crossvalidation, and set the sklearn random_state parameter so that results can be replicated.

Partition

Input data can be partitioned into a training set and a holdout or test set. Model fitting, or alpha selection followed by model fitting, is performed using the training data, and the final model is then applied to the holdout or test data to estimate out-of-sample performance.
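The idea of fitting on a training partition and evaluating on a holdout partition can be sketched as follows. The 70/30 split, the random_state value, and the synthetic data are all illustrative assumptions; how the extension itself forms partitions is not shown here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(size=200)

# Assumed 70% training / 30% holdout partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training partition only.
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Compare in-sample and out-of-sample performance.
scores = {}
for name, X_part, y_part in [("training", X_train, y_train),
                             ("holdout", X_test, y_test)]:
    pred = model.predict(X_part)
    scores[name] = (r2_score(y_part, pred), mean_squared_error(y_part, pred))
    print(f"{name:<9} R2={scores[name][0]:.4f} MSE={scores[name][1]:.4f}")
```

The holdout figures are the ones to read as estimates of out-of-sample performance, since those cases played no part in fitting the model.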

Print

If crossvalidation is performed, you can specify whether to display information about only the best model, basic information about all models compared, or complete information on all splits or folds for all models.

Plot

You can specify plotting of observed values vs. predictions and/or residuals vs. predictions, based on the selected or specified model. With crossvalidation, you can also plot the average mean squared error (MSE) and/or mean R² over splits or folds for validation data as a function of the alpha parameter.

Save

You can specify saving of predictions and/or residuals for the specified or selected model.

Basic specification

The basic specification is a numeric dependent variable and at least one categorical factor or numeric predictor/covariate.

Syntax rules

The dependent variable and covariate(s) must be numeric.
Factors may be string or numeric.
Subcommands may be specified in any order. Only one instance of each subcommand is allowed.
Other than numeric values on the ALPHA subcommand, all keyword and value specifications allow only a single instance or value.

Operations

This extension accepts split variables from SPLIT FILE and weights using the WEIGHT command. If PARTITION is used in conjunction with SPLIT FILE, partitions are formed within each split.
If no MODE subcommand is specified, a single model is fitted. If no ALPHA subcommand is specified, a default value of 1 for the ALPHA regularization parameter is used.
If MODE is FIT, whether specified explicitly or implied by the absence of a MODE subcommand, and ALPHA is specified, only a single value of ALPHA is allowed.
If MODE is TRACE, plots of regression coefficients, mean squared error (MSE) and R² for the training data vs. specified values of alpha are provided. PARTITION is honored, but no results for held-out test data are provided, since no final model results from this mode.
If MODE is CROSSVALID, alpha selection is performed using crossvalidation, based on the best average R² over the validation folds. The NFOLDS keyword can be used to change the default value of five splits or folds for crossvalidation. Plots of mean R² and/or MSE over validation folds can be requested on the PLOT subcommand.
If MODE is CROSSVALID, the PRINT subcommand can be used to show basic information about only the model with the chosen value of alpha (BEST), basic information about all models compared (COMPARE), or complete information on all splits or folds for all models (VERBOSE). BEST is the default.
Plots of observed values and/or residuals vs. predicted values may be specified on the PLOT subcommand in FIT and CROSSVALID modes.
Predicted values and residuals may be saved via the SAVE subcommand in FIT and CROSSVALID modes.
If MODE is FIT or CROSSVALID and PARTITION is in effect, the single or final model fitted is applied to the held-out test data to estimate out-of-sample performance.
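
The CROSSVALID workflow described above (grid search over alpha, selection on the best average R² across validation folds, then evaluation on held-out data) can be approximated in sklearn as in the sketch below. It is an illustration under assumptions, not the extension's code: the data, alpha grid, and partition are synthetic, and shuffled KFold splitting with a fixed random_state is assumed as the way folds are formed.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and an assumed 70/30 training/holdout partition.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
# Five folds mirrors the documented default; shuffling is an assumption.
folds = KFold(n_splits=5, shuffle=True, random_state=123)

search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": alphas},
    scoring={"r2": "r2", "mse": "neg_mean_squared_error"},
    refit="r2",          # choose alpha by mean validation R², then refit
    cv=folds,
)
search.fit(X_train, y_train)

# Summary of all models compared: mean validation R² and MSE per alpha.
results = search.cv_results_
for alpha, r2, neg_mse in zip(results["param_ridge__alpha"],
                              results["mean_test_r2"],
                              results["mean_test_mse"]):
    print(f"alpha={alpha:<7g} mean R2={r2:.4f} mean MSE={-neg_mse:.4f}")

best_alpha = search.best_params_["ridge__alpha"]
holdout_r2 = r2_score(y_test, search.predict(X_test))
print(f"best alpha={best_alpha}  holdout R2={holdout_r2:.4f}")
```

The per-alpha summary loosely corresponds to what COMPARE reports, and the final line to the BEST output followed by the holdout evaluation. Fixing random_state makes the fold assignment, and therefore the selected alpha, reproducible across runs.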