Overview (KRR command)

KRR is an extension command that uses the Python sklearn.kernel_ridge.KernelRidge class to estimate kernel ridge regression models. Kernel ridge regression is a nonparametric regression method that can model both linear and nonlinear relationships between predictor variables and outcomes. Results can be highly sensitive to the choice of model hyperparameters. KRR facilitates the choice of hyperparameter values through k-fold cross-validation on specified grids of values, using the sklearn.model_selection.GridSearchCV class.
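As background, the following is a minimal Python sketch of the two sklearn classes the command wraps; the data, variable names, and parameter values are illustrative assumptions, not part of the command itself.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                     # predictors
    y = X @ [1.0, -0.5, 2.0] + rng.normal(size=100)   # outcome

    # A single kernel ridge model with fixed hyperparameters ...
    model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(X, y)

    # ... or a cross-validated grid search over hyperparameter values.
    grid = GridSearchCV(
        KernelRidge(),
        param_grid={"alpha": [0.1, 1.0], "gamma": [0.01, 0.1],
                    "kernel": ["rbf"]},
        cv=5,
    )
    grid.fit(X, y)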

Options

Kernel
Any or all of the eight available kernels can be specified. The chosen kernel determines which hyperparameters are active. Hyperparameters include ALPHA for ridge regularization, which is common to all kernels, plus up to three additional kernel-specific hyperparameters.
When multiple KERNEL subcommands are specified, or more than one value is given for any parameter, a grid search with cross-validation is performed, and the best-fitting model, as evaluated on held-out data, is selected.
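In terms of the underlying sklearn classes, such a grid search evaluates every combination within each kernel's active hyperparameters. A sketch, with arbitrary illustrative values:

    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV

    # One dict per kernel; each dict lists only the hyperparameters
    # that are active for that kernel.
    param_grid = [
        {"kernel": ["linear"], "alpha": [0.1, 1.0, 10.0]},
        {"kernel": ["polynomial"], "alpha": [0.1, 1.0],
         "degree": [2, 3], "coef0": [0.0, 1.0]},
        {"kernel": ["rbf"], "alpha": [0.1, 1.0], "gamma": [0.01, 0.1, 1.0]},
    ]
    grid = GridSearchCV(KernelRidge(), param_grid, cv=5)  # cv = number of folds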
Cross-validation folds
The number of splits or folds in the cross-validation can be specified when a grid search is performed.
Criteria
The maximum number of minutes to allow for the analyses to complete can be specified.
Print
When a grid search is performed, the display can include basic information about only the best model, basic information about all compared models, or complete information on all splits or folds for all models.
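In sklearn terms, the displayed information corresponds roughly to attributes of the fitted grid-search object (grid) from the earlier sketches:

    # Best model only: its kernel, hyperparameters, and mean CV score.
    print(grid.best_params_)
    print(grid.best_score_)

    # All compared models, with per-split and mean test scores.
    import pandas as pd
    results = pd.DataFrame(grid.cv_results_)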
Plot
Plots of observed values versus predictions and of residuals versus predictions can be requested.
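A sketch of the equivalent plots in matplotlib, reusing the fitted model from the first sketch:

    import matplotlib.pyplot as plt

    pred = model.predict(X)
    resid = y - pred

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    ax1.scatter(pred, y)                    # observed vs. predicted
    ax1.set(xlabel="Predicted", ylabel="Observed")
    ax2.scatter(pred, resid)                # residuals vs. predicted
    ax2.axhline(0, linestyle="--")
    ax2.set(xlabel="Predicted", ylabel="Residual")
    plt.show()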
Save
Predictions, residuals, and dual coefficients (for a single specified model) can be saved.
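The saved quantities correspond to the following sklearn expressions (a sketch, reusing the earlier objects):

    pred = model.predict(X)   # predictions
    resid = y - pred          # residuals

    # Dual coefficients are defined per fitted model; after a grid
    # search, they would come from grid.best_estimator_.dual_coef_.
    dual = model.dual_coef_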

Basic specification

The basic specification is a numeric dependent variable and at least one numeric predictor or covariate.

Syntax rules

  • The dependent variable and covariates must be numeric.
  • Subcommands can be specified in any order.

Operations

  • The extension supports split-file processing with SPLIT FILE and case weighting with the WEIGHT command.
  • When the KERNEL subcommand is not specified, or is specified without keywords, a linear kernel with the default value of 1 for the ALPHA regularization parameter is used.
  • When a single KERNEL subcommand with a single value for each relevant parameter is specified, a single model that uses the specified kernel and parameter settings is fitted.
  • When a single model is specified, weights are fully applied throughout the analysis, including the evaluation and scoring of results.
  • When a single KERNEL subcommand with multiple values for at least one relevant parameter is specified, or when multiple KERNEL subcommands are specified, model selection is performed using a grid search over the combinations of specified kernels and parameter values. Model quality is evaluated based on cross-validation results.
  • When weights are included, they are used to create fitted values in all analyses. However, the cross-validation evaluations used for model selection are not weighted, due to current limitations in the score method of the sklearn.model_selection.GridSearchCV class.
  • When model selection is specified, the CROSSVALID subcommand's NFOLDS keyword can be used to change the default value of five splits or folds for cross-validation.
  • When model selection is specified, the PRINT subcommand can be used to show basic information about only the best model, basic information about all compared models, or complete information on all splits or folds for all models.
  • Plots of observed and residuals versus predicted values can be specified on the PLOT subcommand.
  • Predicted values and residuals can be saved through the SAVE subcommand. When only a single model is fitted, dual coefficients can also be saved.
  • The ADDITIVE_CHI2 and CHI2 kernels require all independent variable values to be non-negative. Negative input values result in errors.
  • Although negative values are generally allowed for parameter settings, a negative value for the GAMMA parameter of the CHI2, LAPLACIAN, or RBF kernels flips the sign of the exponent in the kernel function, so products of GAMMA and the distance calculations can overflow the exponential function and result in errors. Both caveats are illustrated in the sketch after this list.
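Both caveats can be demonstrated directly with the underlying sklearn kernel functions; the following is an illustrative sketch, not part of the command:

    import numpy as np
    from sklearn.metrics.pairwise import chi2_kernel, laplacian_kernel

    X = np.array([[0.5, 1.2], [2.0, 0.1]])

    # CHI2 and ADDITIVE_CHI2 require non-negative inputs; sklearn
    # raises a ValueError when negative values are present.
    assert (X >= 0).all()
    K = chi2_kernel(X, gamma=1.0)   # exp(-gamma * chi-squared distance)

    # Exponential kernels compute exp(-gamma * distance). A negative
    # gamma makes the exponent positive, so large distances overflow
    # np.exp and yield inf entries, which break the downstream fit.
    K_bad = laplacian_kernel(X, gamma=-500.0)
    print(np.isinf(K_bad).any())    # True: off-diagonal entries overflow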