Example: Use Linear Regression to Analyze the Relation Between Age and Household Income

You are a report author at a bank. You are asked to analyze the relationship between age and household income.

The null hypothesis is that there is no relationship between the two variables.

Note: You must have IBM® Cognos® Statistics installed and configured to create this example.

Procedure

  1. Open the BANKLOAN_CS package in IBM Cognos Report Studio.
  2. In the Welcome dialog box, click Create a new report or template.
  3. In the New dialog box, click Statistics and then click OK.
  4. In the Select Statistic dialog box, expand Correlation and Regression, click Linear Regression, and then click OK.

    The statistical object wizard opens.

  5. From the metadata tree, expand BANKLOAN_CS and drag items to the following drop zones:
    • Drag the Household income in thousands item to the Dependent variable drop zone.
    • Drag the Age in years item to the Independent variables drop zone.
    • Drag the Customer ID item to the Cases variable drop zone to define a set of cases and click Next.
  6. Leave the default entry method and missing values options, and click Next.
  7. Leave the default output options, and click Finish.
  8. Run the report.

Results

By default, you see a variable entry table, a model summary, an ANOVA table, and a regression coefficients table.

In the model summary table, the statistic of interest is the R square statistic, in this case .244.

  • The coefficient of determination (R square) is the square of the correlation coefficient (R). It represents the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an R square value of .24 means that the regression equation can explain 24% of the variance in the dependent variable. The other 76% is unexplained.
  • Adjusted R square is a standard downward adjustment to correct for the possibility that, if there are many independent variables, some of the variance might be due to chance. The more independent variables there are, the greater the adjustment.
  • The standard error of the estimate is the standard deviation of the data points as they are distributed around the regression line.
  • The R square change refers to the amount R square increases or decreases when a variable is added to or deleted from the equation, as is done in stepwise regression.
  • The F change statistic shows the significance level associated with adding or removing the variable for each step. You can change this in the regression method area of the wizard. Steps that are not significant are not modeled.

Figure: An example of the results of linear regression showing a variables entry table, a model summary table, and an ANOVA table.
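
To make the model summary statistics concrete, the following is a minimal sketch in plain Python of how R square, adjusted R square, and the standard error of the estimate can be computed for a single independent variable. The data, the function names `fit_line` and `model_summary`, and the variable names are assumptions made for illustration; they are not taken from the BANKLOAN_CS package, and this is not how IBM Cognos Statistics computes its output internally.

```python
import math

# Hypothetical sample data (NOT from BANKLOAN_CS): age in years and
# household income in thousands for five made-up customers.
ages = [25, 30, 35, 40, 45]
incomes = [30, 45, 50, 70, 80]


def fit_line(x, y):
    """Return (slope, constant) of the least-squares line y = slope * x + constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def model_summary(x, y, n_predictors=1):
    """Return (R square, adjusted R square, standard error of the estimate)."""
    slope, constant = fit_line(x, y)
    n = len(y)
    mean_y = sum(y) / n
    # Variation left unexplained by the regression line.
    ss_residual = sum((yi - (slope * xi + constant)) ** 2 for xi, yi in zip(x, y))
    # Total variation of the dependent variable around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    r_square = 1 - ss_residual / ss_total
    df_residual = n - n_predictors - 1
    # Downward adjustment that grows with the number of independent variables.
    adjusted = 1 - (1 - r_square) * (n - 1) / df_residual
    # Standard deviation of the data points around the regression line.
    std_error = math.sqrt(ss_residual / df_residual)
    return r_square, adjusted, std_error


r_square, adjusted_r_square, std_error = model_summary(ages, incomes)
print(f"R square: {r_square:.3f}")
print(f"Adjusted R square: {adjusted_r_square:.3f}")
print(f"Std. error of the estimate: {std_error:.3f}")
```

With only five made-up cases the fit is far tighter than the .244 reported for the bank data; the sketch is meant only to show how the three statistics relate to one another.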

The variables table shows the variables that have been included in the analysis and the regression method that is used to enter the variables.

The ANOVA table tests the acceptability of the model. The Regression row displays information about the variation accounted for by your model. The Residual row displays information about the variation that is not accounted for by your model. If the significance value of the F statistic is less than 0.05, the variation that the model explains is unlikely to be due to chance alone, and you can reject the null hypothesis that there is no relationship between the variables.
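
As an illustration of how the Regression and Residual rows relate to the F statistic, here is a small sketch for a model with one independent variable. The data and the function name `anova_rows` are hypothetical and are not part of the product or the BANKLOAN_CS package. The sketch stops at the F statistic; the significance value would additionally require the F distribution with the listed degrees of freedom, which is not computed here.

```python
# Hypothetical sample data (NOT from BANKLOAN_CS): age in years and
# household income in thousands for five made-up customers.
ages = [25, 30, 35, 40, 45]
incomes = [30, 45, 50, 70, 80]


def anova_rows(x, y):
    """Return the sums of squares, degrees of freedom, and F for a one-predictor model."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    constant = mean_y - slope * mean_x
    predicted = [slope * xi + constant for xi in x]

    # Regression row: variation accounted for by the model.
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    df_regression = 1  # one independent variable
    # Residual row: variation not accounted for by the model.
    ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    df_residual = n - 2

    # F is the ratio of the two mean squares.
    f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {
        "ss_regression": ss_regression,
        "df_regression": df_regression,
        "ss_residual": ss_residual,
        "df_residual": df_residual,
        "f": f_statistic,
    }


table = anova_rows(ages, incomes)
for name, value in table.items():
    print(f"{name}: {value}")
```

A large F statistic means the mean square for the regression is large relative to the mean square for the residual, which is what a small significance value reflects.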

Next, look at the coefficients table.

Figure: An example of the results of linear regression showing a coefficients table.

The main statistic of interest in the coefficients table is the unstandardized regression coefficient (B), which is 2.523 for Age in years.

The regression equation is

dependent variable = slope * independent variable + constant

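The slope measures how steep the regression line is on a scatterplot: it is the change in the predicted value of the dependent variable for each one-unit increase in the independent variable. The constant is where the regression line crosses the y-axis, that is, the predicted value when the independent variable is 0.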
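
The slope and constant of such a line are typically obtained by ordinary least squares. The following sketch shows that calculation for one independent variable in plain Python. The data and the function name `least_squares_line` are assumptions for illustration only; the numbers are made up and do not reproduce the BANKLOAN_CS results.

```python
# Hypothetical sample data (NOT from BANKLOAN_CS): age in years and
# household income in thousands for five made-up customers.
ages = [25, 30, 35, 40, 45]
incomes = [30, 45, 50, 70, 80]


def least_squares_line(x, y):
    """Return (slope, constant) that minimize the squared vertical distances."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Slope: co-variation of x and y divided by the variation of x.
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    # Constant: chosen so the line passes through the point of means.
    constant = mean_y - slope * mean_x
    return slope, constant


slope, constant = least_squares_line(ages, incomes)
print(f"slope = {slope}, constant = {constant}")  # slope = 2.5, constant = -32.5
```

These values describe the hypothetical data only; the values for the bank data come from the coefficients table in the report output.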

In this example, the slope is 2.523, and the constant is -26.636. So the regression equation is

predicted value of household income = 2.523 * age in years - 26.636

That is, we would estimate the household income of an average person at age 30 to be

2.523 * 30 - 26.636 = 49.054 (in thousands)
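
The same arithmetic can be written as a small function, which makes it easy to try other ages. The slope and constant are the values reported in this example; the function name `predict_income` is an assumption introduced for illustration.

```python
# Coefficients reported in the coefficients table of this example.
SLOPE = 2.523       # unstandardized coefficient for Age in years
CONSTANT = -26.636  # constant (intercept)


def predict_income(age_in_years):
    """Return the predicted household income in thousands for a given age."""
    return SLOPE * age_in_years + CONSTANT


income_at_30 = predict_income(30)
print(f"Predicted household income at age 30: {income_at_30:.3f} thousand")
```

Because the equation is linear, each additional year of age raises the predicted income by 2.523 thousand. Predictions for ages well outside the range of the data should be treated with caution.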

Note: For multiple regression, the regression equation is similar. If you have three independent variables (IV1, IV2, and IV3), the regression equation is

dependent variable = B(IV1) * IV1 + B(IV2) * IV2 + B(IV3) * IV3 + constant
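
A prediction from a multiple regression equation of this form can be sketched as a sum of coefficient-times-value terms plus the constant. The coefficient values, the constant, and the function name `predict` below are entirely hypothetical; they are placeholders and do not come from any Cognos output.

```python
# Hypothetical unstandardized coefficients B for three independent variables.
coefficients = {"IV1": 2.0, "IV2": -1.5, "IV3": 0.5}
constant = 10.0  # hypothetical constant


def predict(coefficients, constant, values):
    """Return B(IV1) * IV1 + B(IV2) * IV2 + ... + constant for one case."""
    return sum(b * values[name] for name, b in coefficients.items()) + constant


# One hypothetical case with a value for each independent variable.
case = {"IV1": 3.0, "IV2": 2.0, "IV3": 4.0}
predicted = predict(coefficients, constant, case)
print(predicted)  # 2.0*3.0 - 1.5*2.0 + 0.5*4.0 + 10.0 = 15.0
```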

The coefficients table also includes the confidence interval for B and the standardized coefficients.