Multiple linear regression

The multiple linear regression model is a versatile statistical model for evaluating relationships between a continuous target and one or more predictors.

Predictors can be continuous, categorical, or derived fields, so non-linear relationships are also supported. The model is linear because it consists of additive terms, where each term is a predictor multiplied by an estimated coefficient. A constant (intercept) term is also typically added to the model.
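
As an illustration, a model with one continuous predictor x and a derived field x² has the form y = b0 + b1*x + b2*x², where b0 is the intercept and b1 and b2 are the estimated coefficients.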

Linear regression is used to generate insights for charts that contain at least two continuous fields, with one identified as the target and the other as a predictor. In addition, a categorical predictor and two auxiliary continuous fields can be specified in a chart and used to generate an appropriate regression model. For each candidate model, IBM® Cognos Analytics conducts an F test of model significance.

Model fitting and testing

The multiple linear model is fitted with the following steps (a computational sketch follows the list):

  1. Construct a design matrix that contains one row for each data row and one column for each parameter in the regression model. Columns correspond to predictors or predictor categories.
  2. Compute the regression coefficients.
    1. Multiply the transposed design matrix with itself.
    2. Multiply the transposed design matrix with the vector of target values.
    3. Multiply the inverse of the matrix from substep 1 with the vector from substep 2.
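
The following sketch illustrates these steps with NumPy; the data values are illustrative, and the computation mirrors the list above rather than the exact product implementation.

    import numpy as np

    # Illustrative data: one continuous predictor x and a continuous target y.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Step 1: design matrix with one row per data row and one column per parameter
    # (an intercept column of ones plus the predictor column).
    X = np.column_stack([np.ones_like(x), x])

    # Step 2, substep 1: transposed design matrix multiplied with itself.
    XtX = X.T @ X
    # Step 2, substep 2: transposed design matrix multiplied with the target vector.
    Xty = X.T @ y
    # Step 2, substep 3: inverse of the first product multiplied with the second.
    beta = np.linalg.inv(XtX) @ Xty  # regression coefficients [intercept, slope]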

Using the obtained regression coefficients, compute the predicted target values for each data row. The differences between the predicted and observed target values are called residuals. The model is then tested for significance with the F test as follows (a sketch of the computation follows the list).

  1. Calculate the mean square for the error source (the unexplained variance).
    1. Calculate the sum of squares for residuals.
      1. Take the square of each residual and add them together.
    2. Divide the sum of squares for the error source by its degrees of freedom (the number of data rows minus the number of estimated coefficients, including the intercept).
  2. Calculate the mean square for the regression model (the explained variance).
    1. Calculate the sum of squares for the model.
      1. For each row, subtract the overall mean from the predicted target value.
      2. Take the square of each of these results and add them together.
    2. Divide the sum of squares for the regression model by its degrees of freedom (the number of predictor terms in the model).
  3. Divide the mean square for the regression model by the mean square for the error source. In other words, calculate the ratio of explained variance to unexplained variance. This ratio is the F value.
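
Continuing the sketch above, the F value can be computed from the fitted coefficients as follows; the variable names are illustrative.

    # Predicted target values and residuals from the fitted coefficients.
    y_hat = X @ beta
    residuals = y - y_hat

    n = len(y)          # number of data rows
    p = X.shape[1] - 1  # number of predictor terms (excluding the intercept)

    # Step 1: mean square for the error source (unexplained variance).
    ss_error = np.sum(residuals ** 2)
    ms_error = ss_error / (n - p - 1)

    # Step 2: mean square for the regression model (explained variance).
    ss_model = np.sum((y_hat - y.mean()) ** 2)
    ms_model = ss_model / p

    # Step 3: F value is the ratio of explained to unexplained variance.
    f_value = ms_model / ms_error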

The F value is compared to a theoretical F distribution to determine the probability of obtaining the F value by chance.

  • This probability is the significance value.
  • If the significance value is less than the significance level, the regression model is statistically significant.

Adjusted R² is used to estimate the predictive strength of the regression model. The significance level is set to 5%, and the model predictive strength must be greater than 10% to indicate a reliable predictive relationship between the target and an input field.
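
A minimal continuation of the sketch, assuming SciPy is available for the theoretical F distribution; the 5% significance level and 10% strength threshold follow the rules stated above.

    from scipy import stats

    # Significance value: probability of an F value at least this large by chance.
    significance = stats.f.sf(f_value, p, n - p - 1)

    # Adjusted R² as an estimate of predictive strength.
    r_squared = 1.0 - ss_error / np.sum((y - y.mean()) ** 2)
    adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

    # Reliable predictive relationship under the stated thresholds.
    reliable = (significance < 0.05) and (adj_r_squared > 0.10)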

Model selection

The model selection procedure depends on whether a categorical predictor is present. When only a continuous predictor is specified, the following three models are considered (the corresponding design matrices are sketched after the list).

  1. A constant model that always predicts the overall mean.
  2. A linear model where the single predictor is added to the constant model.
  3. A quadratic model where the squared predictor is added to the linear model.
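
As an illustration, assuming the same NumPy conventions as the earlier sketches, the three candidate models differ only in the columns of their design matrices:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative predictor values

    X_constant  = np.ones((len(x), 1))                           # intercept only
    X_linear    = np.column_stack([np.ones_like(x), x])          # intercept + x
    X_quadratic = np.column_stack([np.ones_like(x), x, x ** 2])  # intercept + x + x²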

The quadratic model is selected if it is significant and provides a relative improvement in predictive strength of at least 10% over the linear model. If selected, the quadratic fit line is reported together with the model predictive strength.

Otherwise, the linear model is selected if it satisfies the same conditions when compared to the constant model. If selected, the linear fit line is reported together with the model predictive strength.

If none of the previous models is selected, the overall mean is reported and no predictive relationship is reported between the target and the input.
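
A sketch of this selection logic, assuming each candidate model has already been fitted and tested; significant() and strength() are hypothetical helpers standing for the F test result and the adjusted R² of a fitted model.

    def improves(candidate, baseline, significant, strength, min_gain=0.10):
        # The candidate must be significant and improve the baseline's
        # predictive strength by at least the relative margin min_gain.
        return (significant(candidate)
                and strength(candidate) >= (1.0 + min_gain) * strength(baseline))

    def select_model(constant, linear, quadratic, significant, strength):
        if improves(quadratic, linear, significant, strength):
            return quadratic  # report the quadratic fit line
        if improves(linear, constant, significant, strength):
            return linear     # report the linear fit line
        return constant       # report only the overall mean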

When a categorical predictor is present, the selection process is more complex because up to eight different models are considered. The selection steps are similar to the previous steps: the most complex model that is significant and provides sufficient relative improvement over the first nested model is selected.

Predictive strength is reported for the selected model, together with the appropriate fit lines, depending on the model and the number of categories in the categorical predictor, if any. The number of categories in the categorical predictor is limited to 3 to reduce the number of displayed fit lines.