Data mining — Concepts

A Regression model predicts the value of a numerical data field, called the target field, in a given data record from the known values of other data fields of the same record. These other data fields are called input data fields or explanatory data fields. They can be numerical or categorical. The predicted value might not be identical to any value contained in the data used to build the model.

You can use the Regression mining function to create and to test Regression models.

A regression model is created and trained on data records whose target field values are known. You can apply the trained model to known or to unknown data. In unknown data, the values of the input fields are known, but the value of the target field is not.

The application of the model to known data is called test mode. The main purpose of the test mode is to verify the quality and the predictive power of the model by comparing its predictions to the known actual target field values of the test data. If you use a mining model over a long period, perform test runs regularly to verify that the model is still useful. For example, the model might no longer be valid because of systematic changes in the data over time.
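
The comparison that test mode performs can be illustrated with a minimal sketch. It assumes the root mean squared error (RMSE) as the quality measure, which is a common choice but not necessarily the measure that Intelligent Miner reports. The function name and the sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between known target values and predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical test records: known target values and the model's predictions.
actual = [10.0, 12.0, 15.0, 11.0]
predicted = [9.0, 12.0, 17.0, 11.0]

test_rmse = rmse(actual, predicted)
print(f"RMSE on test data: {test_rmse:.4f}")
```

Tracking such a measure across regular test runs is one way to notice that a model has stopped fitting the data: a value that rises over time suggests that the data has changed systematically.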

Intelligent Miner® supports the following Regression algorithms:

Polynomial Regression
Linear Regression
Transform Regression
Radial Basis Function (RBF) Regression

By default, the Transform Regression algorithm is used. To use the Polynomial Regression algorithm, the Linear Regression algorithm, or the RBF Regression algorithm instead, specify DM_setAlgorithm('Polynomial'), DM_setAlgorithm('Linear'), or DM_setAlgorithm('RBF') in the DM_RegSettings object.

Transform Regression algorithm

The Transform Regression algorithm is an IBM® patented algorithm. It iteratively combines stepwise linear regression and nonlinear field transformations.

When the Transform Regression algorithm reads the training data to create a model, a part of the data is not used for the creation of the model. This part is used to validate the current model by measuring how well the model predicts the target field values of the records that were not used to create it.

The difference between training mode with validation and test mode is that validation steps are performed during model creation. These validation steps have an impact on the created model. The test mode runs, however, are performed after model creation. They do not impact the model.

Linear Regression algorithm

The Linear Regression algorithm assumes a linear relationship between the explanatory fields and the target field. It produces models that represent equations. The predicted value is expected to differ from the observed value, because a regression equation is only an approximation of the target field. This difference is called the residual.
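
As an illustration of a regression equation and its residuals, the following minimal sketch fits a line to a single explanatory field by ordinary least squares. It is a textbook method shown under stated assumptions, not the Intelligent Miner implementation. The function name and the data are hypothetical, and the residual is taken as the observed value minus the predicted value.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory field: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training records: one explanatory field and the target field.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]

# Residual: observed value minus predicted value, one per record.
residuals = [y - p for y, p in zip(ys, predictions)]
print(f"target ~ {intercept:.2f} + {slope:.2f} * x")
print("residuals:", [round(r, 2) for r in residuals])
```

For a least squares fit with an intercept, the residuals sum to zero, so the equation neither systematically overestimates nor underestimates the training data.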

The explanatory fields contribute to the explanation of the target field to different degrees. Some fields might be more important than other fields, and some fields might not be important at all. For example, a field might be redundant because it is highly correlated with other fields. Redundant fields can degrade the quality of a model significantly. Therefore, it is essential to apply domain-specific knowledge when you select the explanatory fields, or to use the automatic variable selection.

Intelligent Miner Modeling recognizes fields that do not have an explanatory value. To determine whether a field has an explanatory value, the Linear Regression algorithm performs statistical tests in addition to the automatic variable selection. If you know the fields that do not have an explanatory value, you can automatically select a subset of the explanatory fields for shorter run times.

The Linear Regression algorithm provides the following methods to automatically select subsets of explanatory fields:

Stepwise regression

For stepwise regression, you must specify a minimum significance level. Only the fields that have a significance level above the specified value are used by the Linear Regression algorithm.

r-squared regression

The r-squared regression method identifies an optimum model by optimizing a model quality measure. One of the following quality measures is used:

The squared Pearson correlation coefficient
The adjusted squared Pearson correlation coefficient

By default, the Linear Regression algorithm automatically selects subsets of explanatory fields by using the adjusted squared Pearson correlation coefficient to optimize the quality of the model.
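
The two quality measures can be sketched as follows. The sketch assumes the standard textbook adjustment, 1 - (1 - R²)(n - 1)/(n - p - 1) for n records and p explanatory fields. The source does not state the exact formula that Intelligent Miner uses, and the function names are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def adjusted_r_squared(r_squared, n_records, n_fields):
    """Textbook adjustment (an assumption): penalizes each extra explanatory field."""
    return 1.0 - (1.0 - r_squared) * (n_records - 1) / (n_records - n_fields - 1)

# Squared Pearson correlation between observed and predicted target values.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
r_squared = pearson_r(observed, predicted) ** 2

# The same unadjusted value scores lower when more fields were needed to reach it.
few_fields = adjusted_r_squared(0.90, n_records=21, n_fields=4)
many_fields = adjusted_r_squared(0.90, n_records=21, n_fields=10)
print(r_squared, few_fields, many_fields)
```

Under this assumption, the adjusted measure favors smaller subsets of explanatory fields when an additional field improves the fit only marginally.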

Polynomial Regression algorithm

The Polynomial Regression algorithm assumes a polynomial relationship between the explanatory fields and the target field. A Polynomial Regression model is an equation that consists of the following parts:

The maximum degree of polynomial regression
An approximation of the target field
The explanatory fields

The Linear Regression algorithm is applied to find the optimal model. In addition to the parameters of the Linear Regression algorithm, you can set the maximum degree of polynomial regression. The default value for the maximum degree of polynomial regression is 3.

Before you modify the default value for the maximum degree of polynomial regression, take the following considerations into account:

If you set the maximum degree of polynomial regression to 1, the Polynomial Regression algorithm is identical to the Linear Regression algorithm.
If you specify a high value for the maximum degree of polynomial regression, the Polynomial Regression algorithm tends to overfit. This means that the resulting model might approximate the training data very well but fail when it is applied to data that was not used for training. Also, a polynomial of high degree tends to behave badly when it is applied to data that differs substantially from the training data.
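
The effect of a high degree can be illustrated with a minimal sketch. It uses Lagrange interpolation as a stand-in for a polynomial fit whose degree is as high as the data allows (degree 5 for 6 records) and compares it with a straight-line fit. This is not the Intelligent Miner algorithm. The data is hypothetical and was chosen to follow the trend y = x with small deviations.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a polynomial of degree 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs) - 1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical training records that roughly follow y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]

intercept, slope = fit_line(xs, ys)

# The degree-5 polynomial reproduces every training record exactly.
max_train_error = max(abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys))

# Apply both models to a record outside the range of the training data.
x_new = 7.0
line_at_new = intercept + slope * x_new
poly_at_new = lagrange_predict(xs, ys, x_new)
print(f"degree 1 at x = 7: {line_at_new:.2f}")
print(f"degree 5 at x = 7: {poly_at_new:.2f}")
```

The high-degree polynomial has no error on the training records, but at x = 7 it leaves the trend of the data entirely, while the straight line stays close to it.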

The Polynomial Regression algorithm uses the following default settings:

The maximum degree of polynomial regression is set to 3
Subsets of explanatory fields are selected by using the adjusted squared Pearson correlation coefficient to optimize the quality of the model

RBF Regression algorithm

The RBF Regression algorithm assumes a relationship between the explanatory fields and the target field. This relationship can be expressed as a linear combination of Gaussian functions. Gaussian functions are specific Radial Basis Functions. The formula looks like this:

f(i, r) = exp(-βr²) with r = || x - c(i) ||

where the function value f(i, r) of each component of the linear combination depends only on the distance r of vector x=(x1, x2, x3, …) from center c(i). Centers are often referred to as regions.
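
The formula can be sketched in a few lines. The weights of the linear combination, the use of a single β for all centers, and all names and values in the sketch are assumptions made for illustration. The source does not describe how Intelligent Miner determines the centers, the weights, or β.

```python
import math

def gaussian_rbf(x, center, beta):
    """f(i, r) = exp(-beta * r^2), where r is the Euclidean distance from x to the center."""
    r_squared = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-beta * r_squared)

def rbf_predict(x, centers, weights, beta):
    """Linear combination of Gaussian functions, one per center."""
    return sum(w * gaussian_rbf(x, c, beta) for w, c in zip(weights, centers))

# Hypothetical model: two centers in a two-dimensional input space.
centers = [(0.0, 0.0), (1.0, 1.0)]
weights = [2.0, -1.0]
beta = 0.5

# At a center, r = 0, so that center's function value is exp(0) = 1.
at_center = gaussian_rbf((0.0, 0.0), centers[0], beta)

prediction = rbf_predict((0.0, 0.0), centers, weights, beta)
print(at_center, prediction)
```

Because each function value depends only on the distance r, a center influences the prediction most strongly for records close to it, and its influence decays as the distance grows.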
