Start of change

Regression functions (REGR_AVGX, REGR_AVGY, REGR_COUNT, ...)

The regression functions support the fitting of an ordinary-least-squares regression line of the form y = a * x + b to a set of number pairs.

FL 507

Start of change
Passthrough-only expression: This function is passthrough-only and cannot run on Db2 for z/OS® without acceleration. For information about invoking this function, see Accelerating queries with passthrough-only expressions.
End of change
Read syntax diagramSkip visual syntax diagramREGR_AVGXREGR_AVGYREGR_COUNTREGR_INTERCEPTREGR_ICPTREGR_R2REGR_SLOPEREGR_SXXREGR_SXYREGR_SYY(expression1,expression2 )

The schema is SYSIBM.

expression1
An expression that returns a value of any built-in numeric data type. It is interpreted as a value of the dependent variable (that is, a y value).
expression2
An expression that returns a value of any built-in numeric data type. It is interpreted as a value of the independent variable (that is, an x value).

The REGR_COUNT function returns the number of non-null number pairs used to fit the regression line.

The REGR_INTERCEPT (or REGR_ICPT) function returns the y-intercept of the regression line (b in the equation mentioned previously).

The REGR_R2 function returns the coefficient of determination (R-squared or goodness-of-fit) for the regression.

The REGR_SLOPE function returns the slope of the line (a in the equation mentioned previously).

The REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SXY, and REGR_SYY functions return quantities that can be used to compute various diagnostic statistics needed for the evaluation of the quality and statistical validity of the regression model.

The data type of the result of REGR_COUNT is integer. For the remaining functions, if either argument is DECFLOAT(n), the data type of the result is DECFLOAT(34); otherwise, the data type of the result is double-precision floating-point. If either argument is a special decimal floating-point value, the rules for general arithmetic operations for decimal floating-point apply. For more information, see General arithmetic operation rules for DECFLOAT

The result can be null. When not null, the result of REGR_R2 is in the range 0–1, and the result of both REGR_SXX and REGR_SYY is non-negative.

Each function is applied to the set of (expression1, expression2) pairs derived from the argument values by the elimination of all pairs for which either expression1 or expression2 is null.

If the set is not empty and VARIANCE(expression2) is positive, REGR_COUNT returns the number of non-null pairs in the set, and the remaining functions return results that are defined as follows:
REGR_SLOPE(expression1,expression2) =
COVARIANCE(expression1,expression2)/VARIANCE(expression2)
REGR_INTERCEPT(expression1, expression2) =
AVG(expression1) - REGR_SLOPE(expression1, expression2) * AVG(expression2)
REGR_R2(expression1, expression2) =
POWER(CORRELATION(expression1, expression2), 2) if VARIANCE(expression1)>0
REGR_R2(expression1, expression2) = 1 if VARIANCE(expression1)=0
REGR_AVGX(expression1, expression2) = AVG(expression2)
REGR_AVGY(expression1, expression2) = AVG(expression1)
REGR_SXX(expression1, expression2) =
REGR_COUNT(expression1, expression2) * VARIANCE(expression2)
REGR_SYY(expression1, expression2) =
REGR_COUNT(expression1, expression2) * VARIANCE(expression1)
REGR_SXY(expression1, expression2) =
REGR_COUNT(expression1, expression2) * COVARIANCE(expression1, expression2)

If the set is not empty and VARIANCE(expression2) is equal to zero, then the regression line either has infinite slope or is undefined. In this case, the functions REGR_SLOPE, REGR_INTERCEPT, and REGR_R2 each return a null value, and the remaining functions return values as defined previously. If the set is empty, REGR_COUNT returns zero and the remaining functions return a null value.

The order in which the values are aggregated is undefined, but every intermediate result must be within the range of the result data type.

The regression functions are all computed simultaneously during a single pass through the data. In general, it is more efficient to use the regression functions to compute the statistics needed for a regression analysis than to perform the equivalent computations using ordinary column functions such as AVERAGE, VARIANCE, COVARIANCE, and so forth.

The usual diagnostic statistics that accompany a linear-regression analysis can be computed in terms of the functions listed previously. For example:
Adjusted R2
1 - ( (1 - REGR_R2) * ((REGR_COUNT - 1) / (REGR_COUNT - 2)) )
Standard error
SQRT( (REGR_SYY-(POWER(REGR_SXY,2)/REGR_SXX))/(REGR_COUNT-2) )
Total sum of squares
REGR_SYY
Regression sum of squares
POWER(REGR_SXY,2) / REGR_SXX
Residual sum of squares
(Total sum of squares)-(Regression sum of squares)
t statistic for slope
REGR_SLOPE * SQRT(REGR_SXX) / (Standard error)
t statistic for y-intercept
REGR_INTERCEPT/((Standard error) * SQRT((1/REGR_COUNT)+(POWER(REGR_AVGX,2)/REGR_SXX))

Examples

  • Using the EMPLOYEE table, compute an ordinary-least-squares regression line that expresses the bonus of an employee in department (WORKDEPT) 'A00' as a linear function of the employee's salary. Set the host variables SLOPE, ICPT, RSQR (double-precision floating point) to the slope, intercept, and coefficient of determination of the regression line, respectively. Also set the host variables AVGSAL and AVGBONUS to the average salary and average bonus, respectively, of the employees in department 'A00', and set the host variable CNT (integer) to the number of employees in department 'A00' for whom both salary and bonus data are available. Store the remaining regression statistics in host variables SXX, SYY, and SXY.
    SELECT REGR_SLOPE(BONUS,SALARY), REGR_INTERCEPT(BONUS,SALARY),
    REGR_R2(BONUS,SALARY), REGR_COUNT(BONUS,SALARY),
    REGR_AVGX(BONUS,SALARY), REGR_AVGY(BONUS,SALARY),
    REGR_SXX(BONUS,SALARY), REGR_SYY(BONUS,SALARY),
    REGR_SXY(BONUS,SALARY)
    INTO	:SLOPE, :ICPT,
    :RSQR, :CNT,
    :AVGSAL, :AVGBONUS,
    :SXX, :SYY,
    :SXY
    FROM EMPLOYEE
    WHERE WORKDEPT = 'A00'
    When using the sample table, the host variables are set to the following approximate values:
    SLOPE:  +1.71002671916749E-002
    ICPT:  +1.00871888623260E+002
    RSQR:  +9.99707928128685E-001
    CNT:  3
    AVGSAL:  +4.28333333333333E+004
    AVGBONUS:  +8.33333333333333E+002
    SXX:  +2.96291666666667E+008
    SYY:  +8.66666666666667E+004
    SXY:  +5.06666666666667E+006
End of change