Overview (REGRESSION command)

REGRESSION calculates multiple regression equations and associated statistics and plots. REGRESSION also calculates collinearity diagnostics, predicted values, residuals, measures of fit and influence, and several statistics based on these measures.

Options

Input and Output Control Subcommands. DESCRIPTIVES requests descriptive statistics on the variables in the analysis. SELECT estimates the model based on a subset of cases. REGWGT specifies a weight variable for estimating weighted least-squares models. MISSING specifies the treatment of cases with missing values. MATRIX reads and writes matrix data files.
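
As a rough illustration of the input and output control subcommands, the following sketch requests descriptive statistics, restricts estimation to a subset of cases, and uses pairwise deletion of missing values. The variable names (salary, educ, prevexp, jobcat) are hypothetical placeholders for variables in the active dataset.

```spss
* Hypothetical variables: salary, educ, prevexp, jobcat.
REGRESSION
  /DESCRIPTIVES=MEAN STDDEV CORR
  /SELECT=jobcat EQ 1
  /MISSING=PAIRWISE
  /DEPENDENT=salary
  /METHOD=ENTER educ prevexp.
```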

Equation-Control Subcommands. These optional subcommands control the calculation and display of statistics for each equation. STATISTICS controls the statistics displayed for the equation(s) and the independent variable(s), CRITERIA specifies the criteria used by the variable selection method, and ORIGIN specifies whether regression is through the origin.
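
A minimal sketch of the equation-control subcommands, assuming hypothetical variables salary, educ, prevexp, and jobtime. STATISTICS asks for coefficients, fit, and collinearity output; CRITERIA sets the entry and removal probabilities used by the stepwise method; NOORIGIN keeps the constant in the equation.

```spss
* Hypothetical variables: salary, educ, prevexp, jobtime.
REGRESSION
  /STATISTICS=COEFF R ANOVA CI(95) COLLIN TOL
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT=salary
  /METHOD=STEPWISE educ prevexp jobtime.
```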

Analysis of Residuals, Fit, and Influence. REGRESSION creates temporary variables containing predicted values, residuals, measures of fit and influence, and several statistics based on these measures. These temporary variables can be analyzed within REGRESSION in Casewise Diagnostics tables (CASEWISE subcommand), scatterplots (SCATTERPLOT subcommand), histograms and normal probability plots (RESIDUALS subcommand), and partial regression plots (PARTIALPLOT subcommand). Any of the residuals subcommands can be specified to obtain descriptive statistics for the predicted values, residuals, and their standardized versions. Any of the temporary variables can be added to the active dataset with the SAVE subcommand.
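
The residuals subcommands named above might be combined as in the following sketch. The variable names are hypothetical, and the particular plots and saved variables are only one possible selection.

```spss
* Hypothetical variables: salary, educ, prevexp, jobtime.
REGRESSION
  /DEPENDENT=salary
  /METHOD=ENTER educ prevexp jobtime
  /RESIDUALS=HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /CASEWISE=PLOT(ZRESID) OUTLIERS(3)
  /SCATTERPLOT=(*ZRESID,*ZPRED)
  /PARTIALPLOT=ALL
  /SAVE=PRED RESID COOK LEVER.
```

In the SCATTERPLOT specification, the asterisk marks a name as a temporary variable rather than a variable in the active dataset.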

Templates. You can specify a template, using the TEMPLATE subcommand, to override the default chart attribute settings on your system.

Basic Specification

The basic specification is DEPENDENT, which initiates the equation(s) and defines at least one dependent variable, followed by METHOD, which specifies the method for selecting independent variables.

By default, all variables named on DEPENDENT and METHOD are used in the analysis.
The default display for each equation includes a Model Summary table showing R², an ANOVA table, a Coefficients table displaying related statistics for variables in the equation, and an Excluded Variables table displaying related statistics for variables not yet in the equation.
By default, all cases in the active dataset with valid values for all selected variables are used to compute the correlation matrix on which the regression equations are based. The default equations include a constant (intercept).
All residuals analysis subcommands are optional. Most have defaults that can be requested by including the subcommand without any further specifications. These defaults are described in the section for each subcommand.
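
The basic specification can be sketched as follows, assuming hypothetical variables salary (dependent) and educ, prevexp, and jobtime (independent). Only DEPENDENT and METHOD are given, so every default described above applies.

```spss
* Hypothetical variables: salary, educ, prevexp, jobtime.
REGRESSION
  /DEPENDENT=salary
  /METHOD=ENTER educ prevexp jobtime.
```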

Subcommand Order

The standard subcommand order for REGRESSION is

REGRESSION MATRIX=...
   /VARIABLES=...
   /DESCRIPTIVES=...
   /SELECT=...
   /MISSING=...
   /REGWGT=...

             --Equation Block--
   /STATISTICS=...
   /CRITERIA=...
   /ORIGIN
   /DEPENDENT=...

             --Method Block(s)--
   /METHOD=...
   [/METHOD=...]

             --Residuals Block--
   /RESIDUALS=...
   /SAVE=...
   /CASEWISE=...
   /SCATTERPLOT=...
   /PARTIALPLOT=...
   /OUTFILE=...

When used, MATRIX must be specified first.
Subcommands listed before the equation block must be specified before any subcommands within the block.
Only one equation block is allowed per REGRESSION command.
An equation block can contain multiple METHOD subcommands. These methods are applied, one after the other, to the estimation of the equation for that block.
The STATISTICS, CRITERIA, and ORIGIN/NOORIGIN subcommands must precede the DEPENDENT subcommand.
The residuals subcommands RESIDUALS, CASEWISE, SCATTERPLOT, and PARTIALPLOT follow the last METHOD subcommand of any equation for which residuals analysis is requested. Statistics are based on this final equation.
Residuals subcommands can be specified in any order. All residuals subcommands must follow the DEPENDENT and METHOD subcommands.
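
The ordering rules above can be illustrated with a sketch that uses two METHOD subcommands in one equation block, followed by residuals subcommands. Variable names are hypothetical; the second METHOD adds variables to the equation produced by the first, and the residuals output is based on the final equation.

```spss
* Hypothetical variables: salary, educ, prevexp, jobtime.
REGRESSION
  /VARIABLES=salary educ prevexp jobtime
  /DESCRIPTIVES=DEFAULTS
  /MISSING=LISTWISE
  /STATISTICS=DEFAULTS CHANGE
  /DEPENDENT=salary
  /METHOD=ENTER educ
  /METHOD=ENTER prevexp jobtime
  /RESIDUALS=DEFAULTS
  /CASEWISE=DEFAULTS.
```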

Syntax Rules

VARIABLES can be specified only once. If omitted, VARIABLES defaults to COLLECT.
The DEPENDENT subcommand can be specified only once and must be followed immediately by one or more METHOD subcommands.
CRITERIA, STATISTICS, and ORIGIN must be specified before DEPENDENT and METHOD. If any of these subcommands is specified more than once, only the last specification is in effect for all subsequent equations.
More than one variable can be specified on the DEPENDENT subcommand. An equation is estimated for each.
If no variables are specified on METHOD, all variables named on VARIABLES but not on DEPENDENT are considered for selection.
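
As a sketch of the last two rules, the following command names two dependent variables and gives no variable list on METHOD. The variable names are hypothetical. One equation is estimated for salary and one for salbegin, and in each case educ, prevexp, and jobtime are the variables considered for entry.

```spss
* Hypothetical variables: salary, salbegin, educ, prevexp, jobtime.
REGRESSION
  /VARIABLES=salary salbegin educ prevexp jobtime
  /DEPENDENT=salary salbegin
  /METHOD=ENTER.
```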

Operations

This procedure uses the multithreaded options specified by SET THREADS and SET MCACHE.

REGRESSION calculates a correlation matrix that includes all variables named on VARIABLES. All equations requested on the REGRESSION command are calculated from the same correlation matrix.
The MISSING, DESCRIPTIVES, and SELECT subcommands control the calculation of the correlation matrix and associated displays.
If multiple METHOD subcommands are specified, they operate in sequence on the equations defined by the preceding DEPENDENT subcommand.
Only independent variables that pass the tolerance criterion are candidates for entry into the equation. See the topic CRITERIA Subcommand (REGRESSION command) for more information.
The temporary variables PRED (unstandardized predicted value), ZPRED (standardized predicted value), RESID (unstandardized residual), and ZRESID (standardized residual) are calculated and descriptive statistics are displayed whenever any residuals subcommand is specified. If any of the other temporary variables are referred to on the command, they are also calculated.
Predicted values and statistics based on predicted values are calculated for every observation that has valid values for all variables in the equation. Residuals and statistics based on residuals are calculated for all observations that have a valid predicted value and a valid value for the dependent variable. The missing-values option therefore affects the calculation of residuals and predicted values.
No residuals or predicted values are generated for cases deleted from the active dataset with SELECT IF, a temporary SELECT IF, or SAMPLE.
All variables are standardized before plotting. If the unstandardized version of a variable is requested, the standardized version is plotted.
Residuals processing is not available when the active dataset is a matrix file or is replaced by a matrix file with MATRIX OUT(*) on REGRESSION. If RESIDUALS, CASEWISE, SCATTERPLOT, PARTIALPLOT, or SAVE is used when MATRIX IN(*) or MATRIX OUT(*) is specified, the REGRESSION command is not executed.
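
The point about cases removed with a temporary SELECT IF can be sketched as follows, with hypothetical variable names. Cases that fail the condition are dropped before REGRESSION reads the data, so the casewise listing contains only the retained cases.

```spss
* Hypothetical variables: salary, educ, prevexp, jobcat.
TEMPORARY.
SELECT IF (jobcat = 1).
REGRESSION
  /DEPENDENT=salary
  /METHOD=ENTER educ prevexp
  /CASEWISE=PLOT(ZRESID) ALL.
```

The SELECT subcommand described under Options is the alternative that restricts estimation to a subset from within the REGRESSION command itself.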

For each analysis, REGRESSION can calculate the following types of temporary variables:

PRED. Unstandardized predicted values.

RESID. Unstandardized residuals.

DRESID. Deleted residuals.

ADJPRED. Adjusted predicted values.

ZPRED. Standardized predicted values.

ZRESID. Standardized residuals.

SRESID. Studentized residuals.

SDRESID. Studentized deleted residuals. ¹

SEPRED. Standard errors of the predicted values.

MAHAL. Mahalanobis distances.

COOK. Cook’s distances. ²

LEVER. Centered leverage values. ³

DFBETA. Change in the regression coefficient that results from the deletion of the ith case. A DFBETA value is computed for each case for each regression coefficient generated by a model. ⁴

SDBETA. Standardized DFBETA. An SDBETA value is computed for each case for each regression coefficient generated by a model. ⁵

DFFIT. Change in the predicted value when the ith case is deleted. ⁶

SDFIT. Standardized DFFIT. ⁷

COVRATIO. Ratio of the determinant of the covariance matrix with the ith case deleted to the determinant of the covariance matrix with all cases included. ⁸

MCIN. Lower and upper bounds for the prediction interval of the mean predicted response. A lower bound LMCIN and an upper bound UMCIN are generated. The default confidence level is 95%. The level can be reset with the CIN keyword on the CRITERIA subcommand. ⁹

ICIN. Lower and upper bounds for the prediction interval for a single observation. A lower bound LICIN and an upper bound UICIN are generated. The default confidence level is 95%. The level can be reset with the CIN keyword on the CRITERIA subcommand. ¹⁰
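
A few of the temporary variables listed above might be added to the active dataset as in the following sketch. The variable names and the new names in parentheses (pred_sal, sdres_sal) are hypothetical, and the 90% level is chosen only to show CIN on CRITERIA overriding the 95% default for the MCIN and ICIN bounds.

```spss
* Hypothetical variables: salary, educ, prevexp, jobtime.
REGRESSION
  /CRITERIA=CIN(90)
  /DEPENDENT=salary
  /METHOD=ENTER educ prevexp jobtime
  /SAVE=PRED(pred_sal) SDRESID(sdres_sal) MCIN ICIN.
```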

¹ Hoaglin, D. C., and R. E. Welsch. 1978. The hat matrix in regression and ANOVA. American Statistician, 32, 17-22.

² Cook, R. D. 1977. Detection of influential observations in linear regression. Technometrics, 19, 15-18.

³ Velleman, P. F., and R. E. Welsch. 1981. Efficient computing of regression diagnostics. American Statistician, 35, 234-242.

⁴ Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley and Sons.

⁵ Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley and Sons.

⁶ Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley and Sons.

⁷ Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley and Sons.

⁸ Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. New York: John Wiley and Sons.

⁹ Dillon, W. R., and M. Goldstein. 1984. Multivariate analysis: Methods and applications. New York: John Wiley and Sons.

¹⁰ Dillon, W. R., and M. Goldstein. 1984. Multivariate analysis: Methods and applications. New York: John Wiley and Sons.