Overview (LINEAR_ELASTIC_NET extension command)

LINEAR_ELASTIC_NET is an extension command that uses the Python sklearn.linear_model.ElasticNet class to estimate penalized linear regression models involving a mixture of L1 norm (Lasso) and L2 norm (Ridge) penalties. In addition to fitting a model with specified values of the L1 penalty ratio and the alpha regularization parameter, LINEAR_ELASTIC_NET can display a trace plot of coefficient values over a range of alpha values for a given ratio, or facilitate the choice of hyperparameter values via k-fold crossvalidation on specified grids of values. If a single model is fitted, or ratio and/or alpha selection via crossvalidation is performed, the final model can be applied to held-out data created by a partition of the input data to obtain a valid estimate of the out-of-sample performance of the model.
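
For orientation, the way the ratio and alpha enter the penalty can be written out directly. The sketch below follows the objective documented for sklearn.linear_model.ElasticNet. The function name elastic_net_objective and its arguments are hypothetical and not part of the extension, and the intercept is omitted for brevity.

```python
def elastic_net_objective(X, y, coef, alpha=1.0, l1_ratio=0.5):
    """Penalized least-squares objective in the form documented for ElasticNet.

    X is a list of predictor rows, y a list of dependent-variable values,
    and coef a list of regression coefficients (no intercept in this sketch).
    """
    n = len(y)
    # Residual sum of squares over all cases
    rss = sum((yi - sum(c * x for c, x in zip(coef, row))) ** 2
              for row, yi in zip(X, y))
    l1 = sum(abs(c) for c in coef)   # L1 norm (Lasso part)
    l2 = sum(c * c for c in coef)    # squared L2 norm (Ridge part)
    return (rss / (2 * n)
            + alpha * l1_ratio * l1
            + 0.5 * alpha * (1 - l1_ratio) * l2)
```

With l1_ratio equal to 1 the Ridge term drops out and the penalty is a pure Lasso penalty. Larger alpha values increase the weight of the whole penalty relative to the fit term.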

Options

Mode
In Fit mode, a single model is fitted to the training data, using default or user-specified values for the L1 ratio and alpha. In Trace mode, three plots for the training data are displayed: a trace plot of regression coefficients, a plot of R2, and a plot of mean squared error (MSE), all plotted vs. alpha for the specified ratio, varying alpha over the specified values. In Crossvalidation mode, a grid search with crossvalidation to evaluate models is performed, and the best ratio and alpha values are chosen based on validation subsets of the training data.

In either Fit or Crossvalidation mode, the single final model can be applied to a held-out partition of the data not used in earlier steps to obtain a valid estimate of the out-of-sample performance of the model.

Ratio
Ratio specifies one or more values of the L1 penalty ratio parameter, which may range from .01 to 1. In Fit and Trace modes, a single value is used. In Crossvalidation mode, a grid of values is specified as unique values and/or a range of values determined by minimum, maximum, and increment values.
Alpha
Alpha specifies one or more values of the regularization parameter. In Fit mode, a single value is used. In Trace and Crossvalidation modes, a grid of values is specified as unique values and/or a range of values determined by minimum, maximum, and increment values. The Metric for ranges of values can be either linear or base-10 logarithmic. Plots for Trace or Crossvalidation modes are shown using the specified metric for the horizontal axis.
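
As a rough illustration of how a range specification expands into a grid under each metric, consider the sketch below. The function value_grid is hypothetical. In particular, treating the minimum, maximum, and increment as base-10 exponents under the logarithmic metric is an assumption of this sketch, not documented behavior of the extension.

```python
def value_grid(minimum, maximum, increment, metric="linear"):
    """Expand a minimum/maximum/increment range into a list of grid values.

    With metric="log10" the three arguments are treated as base-10
    exponents (an assumption made for this sketch).
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    # Number of whole steps, with a small tolerance for floating-point error
    steps = int((maximum - minimum) / increment + 1e-9)
    points = [minimum + i * increment for i in range(steps + 1)]
    if metric == "log10":
        return [10.0 ** p for p in points]
    return points
```

For example, value_grid(0.1, 0.5, 0.1) yields five evenly spaced values, while value_grid(-3, 0, 1, metric="log10") yields four values spaced one decade apart, from 0.001 to 1.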
Criteria
You can specify whether to include an intercept in the model. By default an intercept is included. Note that the dependent variable is not standardized by the extension, and the intercept is not penalized during estimation. The intercept can be penalized by including a constant variable as a predictor and suppressing the intercept, but this is not recommended.

You can specify whether to standardize predictor values. The default is to standardize, and this is recommended. Standardization is applied to each indicator or dummy variable created to represent a level of a categorical factor variable, as well as each scale covariate.
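
The following sketch shows what column-wise standardization does to a 0/1 indicator column and a scale covariate alike. The function standardize_columns is hypothetical, and the use of the population standard deviation (as sklearn's StandardScaler does) is an assumption; the divisor used by the extension is not stated here.

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Center each column at 0 and scale it to unit standard deviation.

    rows is a list of cases, each a list of predictor values; indicator
    (dummy) columns are treated exactly like scale covariates.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population SD (assumption)
    return [
        # A constant column has SD 0; map it to 0.0 to avoid dividing by zero
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]
```

Putting all predictors on a common scale matters because the L1 and L2 penalties act on coefficient size, which otherwise depends on each predictor's units.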

You can specify the maximum time in minutes to allow for the analyses to complete. For Crossvalidation mode, you can specify the number of splits or folds in the crossvalidation, and set the sklearn random_state parameter in order to allow replication of results.
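
To see why a fixed random seed makes crossvalidation results replicable, consider this minimal standard-library sketch of shuffled k-fold splitting. The function kfold_indices is hypothetical. It mimics the role of the sklearn random_state parameter but does not reproduce the exact folds that sklearn would generate.

```python
import random

def kfold_indices(n_cases, n_folds=5, random_state=None):
    """Yield (train, validation) index lists for shuffled k-fold splitting."""
    indices = list(range(n_cases))
    # A seeded generator makes the shuffle, and therefore the folds, repeatable
    random.Random(random_state).shuffle(indices)
    # Deal the shuffled cases into n_folds groups of near-equal size
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation
```

Calling the function twice with the same random_state value produces identical folds, so fold-by-fold results can be replicated. Leaving random_state as None gives a different shuffle on each run.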

Partition
Input data can be partitioned into a training set and a holdout or test set. Model fitting, or ratio and/or alpha selection followed by model fitting, is performed using the training data, and the final model is then applied to the holdout or test data to estimate out-of-sample performance.
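
As a sketch of how out-of-sample performance might be summarized once predictions for the holdout cases are available, the function below computes MSE and R2. The name holdout_metrics is hypothetical, and the R2 definition (1 minus the ratio of residual to total sum of squares, as in sklearn's r2_score) is an assumption about what the extension reports.

```python
def holdout_metrics(observed, predicted):
    """Return (mse, r2) for predictions made on held-out cases."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Squared prediction errors on the holdout cases
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Variation of the holdout observations around their own mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot  # undefined if all observed values are equal
    return mse, r2
```

Because the holdout cases play no part in fitting or hyperparameter selection, these values are not inflated by tuning to the training data. Holdout R2 can be negative when the model predicts worse than the holdout mean.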
Print
If crossvalidation is performed, you can specify whether to display information about only the best model, basic information about all models compared, or complete information on all splits or folds for all models.
Plot
You can specify plotting of observed values vs. predictions and/or residuals vs. predictions, based on the selected or specified model. With crossvalidation, you can also plot the average mean squared error (MSE) and/or mean R2 over splits or folds for validation data as a function of the alpha parameter for either a single specified ratio value or the selected best ratio value.
Save
You can specify saving of predictions and/or residuals for the specified or selected model.

Basic specification

The basic specification is a numeric dependent variable and at least one categorical factor or numeric predictor/covariate.

Syntax rules

  • The dependent variable and covariate(s) must be numeric.
  • Factors may be string or numeric.
  • Subcommands may be specified in any order. Only one instance of each subcommand is allowed.
  • Other than numeric values on the RATIO and ALPHA subcommands, all keyword and value specifications allow only a single instance or value.

Operations

  • This extension accepts split variables from SPLIT FILE and weights using the WEIGHT command. If PARTITION is used in conjunction with SPLIT FILE, partitions are formed within each split.
  • If no MODE subcommand is specified, a single model is fitted. If no RATIO subcommand is specified, a default value of .5 for the L1 penalty ratio is used. If no ALPHA subcommand is specified, a default value of 1 for the alpha regularization parameter is used.
  • If MODE is FIT, whether specified explicitly or implied by the absence of a MODE subcommand, only single values of RATIO and ALPHA are allowed when those subcommands are specified.
  • If MODE is TRACE, plots of regression coefficients, mean squared error (MSE) and R2 for the training data vs. specified values of alpha for the specified ratio are provided. PARTITION is honored, but no results for held-out test data are provided, since no final model results from this mode.
  • If MODE is CROSSVALID, ratio and/or alpha selection is performed using crossvalidation, based on the best average R2 over the validation folds. The NFOLDS keyword can be used to change the default value of five splits or folds for crossvalidation. Plots of mean R2 and/or MSE over validation folds can be requested on the PLOT subcommand. These plots show R2 or MSE vs. alpha values for either a single specified ratio value or the identified best ratio value.
  • If MODE is CROSSVALID, the PRINT subcommand can be used to show basic information about only the model with the chosen values of ratio and alpha (BEST), basic information about all models compared (COMPARE), or complete information on all splits or folds for all models (VERBOSE). BEST is the default.
  • Plots of observed values and/or residuals vs. predicted values may be specified on the PLOT subcommand in FIT and CROSSVALID modes.
  • Predicted values and residuals may be saved via the SAVE subcommand in FIT and CROSSVALID modes.
  • If MODE is FIT or CROSSVALID and PARTITION is in effect, the single or final model fitted is applied to the held-out test data to estimate out-of-sample performance.
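
The selection rule described for CROSSVALID mode, choosing the ratio and alpha pair with the best average R2 over validation folds, can be sketched as follows. The function select_best and the layout of its input are hypothetical. How the extension resolves ties is not documented here, so the first-encountered rule in this sketch is an assumption.

```python
from statistics import fmean

def select_best(fold_r2):
    """Pick the (ratio, alpha) pair with the highest mean R2 over folds.

    fold_r2 maps (ratio, alpha) tuples to lists of per-fold validation R2
    values, one list entry per split or fold.
    """
    mean_r2 = {params: fmean(scores) for params, scores in fold_r2.items()}
    # max() returns the first pair encountered among ties (assumption)
    best = max(mean_r2, key=mean_r2.get)
    return best, mean_r2[best]
```

The per-pair means computed here correspond to what the COMPARE level of output summarizes, and the per-fold lists to what VERBOSE reports. The returned pair corresponds to the model reported under BEST.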