You can use the Regression mining function to create Regression models. You can also test these models.
A regression model is created and trained on data sets of records whose target field values are known. You can apply the trained model to known or to unknown data. In unknown data, the values of the input fields are known; however, the value of the target field is not known.
The application of the model to known data is called test mode. The main purpose of the test mode is to verify the quality and the predictive power of the model by comparing its predictions with the known actual target field values of the test data. If you work with a mining model over a long period of time, regularly perform test runs to verify that the model is still useful. For example, the model might no longer be valid because of systematic changes in the data over time.
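As a rough illustration of what a test run measures, the following sketch compares a model's predictions with the known target values of test records. Mean squared error is used here as one simple quality measure; it is an illustrative choice, not necessarily the measure that Intelligent Miner reports.

```python
# Compare known actual target values with the model's predictions.
# A lower mean squared error indicates better predictive power.
def mean_squared_error(actuals, predictions):
    """Average of the squared prediction errors over the test records."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)

actuals = [1.0, 2.0, 3.0]
predictions = [1.0, 2.5, 2.5]   # hypothetical model output
error = mean_squared_error(actuals, predictions)
```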
By default, the Transform Regression algorithm is used. You can select the Polynomial Regression algorithm, the Linear Regression algorithm, or the RBF algorithm instead by calling DM_setAlgorithm('Polynomial'), DM_setAlgorithm('Linear'), or DM_setAlgorithm('RBF') on the DM_RegSettings object.
The Transform Regression algorithm is an IBM® patented algorithm. It iteratively combines stepwise linear regression and nonlinear field transformations.
When the Transform Regression algorithm reads the training data to create a model, a part of the data is not used for the creation of the model. It is used to validate the current model by measuring its ability to correctly predict the target fields of the records that were not used to create the model.
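The holdout idea described above can be sketched as follows. This is an illustrative Python sketch, not the Transform Regression implementation; the 80/20 split ratio and the function name are assumptions.

```python
# Set aside part of the training data for validation: the model is built on
# the training part and its predictions are checked against the held-out
# records, which were not used to create the model.
def holdout_split(records, validation_fraction=0.2):
    """Split records into a training part and a validation part."""
    n_validation = int(len(records) * validation_fraction)
    return records[n_validation:], records[:n_validation]

records = list(range(10))
train, validation = holdout_split(records)   # 8 training, 2 validation records
```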
The difference between training mode with validation and test mode is that the validation steps are performed during model creation and therefore influence the created model. Test mode runs, however, are performed after model creation and do not affect the model.
The Linear Regression algorithm assumes a linear relationship between the explanatory fields and the target field. It produces models that represent equations. The predicted value is expected to differ from the observed value, because a regression equation is an approximation of the target field. The difference is called residual.
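The following sketch illustrates the regression equation and the residuals described above for a single explanatory field. It is a minimal least-squares fit for illustration, not the Linear Regression algorithm's implementation; the sample data is invented.

```python
# Fit a one-variable least-squares line and compute the residuals,
# i.e. the differences between observed and predicted target values.
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]              # observed target values, roughly y = 2x
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

For a least-squares fit, the residuals sum to zero; individual residuals show how far each observed value lies from the regression equation's approximation.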
The explanatory fields contribute to the explanation of the target field to different degrees. Some fields might be more important than others, and some fields might not be important at all. For example, a field might be redundant because it is highly correlated with other fields. Redundant fields can degrade the quality of a model significantly. Therefore, it is essential to apply domain-specific knowledge when you select the explanatory fields, or to use the automatic variable selection.
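The redundancy mentioned above can be checked with a correlation measure. This is a hypothetical helper for illustration: it computes the Pearson correlation between two explanatory fields; values near 1 or -1 suggest that one of the fields is largely redundant.

```python
import math

# Pearson correlation coefficient between two fields: +1 or -1 means a
# perfect linear relationship, 0 means no linear relationship.
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```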
Intelligent Miner Modeling recognizes fields that do not have explanatory value. To determine whether a field has explanatory value, the Linear Regression algorithm performs statistical tests in addition to the automatic variable selection. If you know which fields do not have explanatory value, you can restrict the model to a subset of the explanatory fields for shorter run times.
By default, the Linear Regression algorithm automatically selects subsets of explanatory fields by using the adjusted squared Pearson correlation coefficient to optimize the quality of the model.
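The quality measure named above can be sketched numerically. The following is an illustrative computation of the adjusted squared correlation (adjusted R²), which penalizes models that add explanatory fields without improving the fit; the sample data and field count are invented.

```python
# R² measures the fraction of the target's variance explained by the model;
# the adjusted variant corrects for the number of explanatory fields used.
def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(ys, preds, n_fields):
    n = len(ys)
    r2 = r_squared(ys, preds)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_fields - 1)

ys = [1.0, 2.0, 3.0, 4.0, 5.0]      # observed target values
preds = [1.1, 1.9, 3.2, 3.8, 5.0]   # hypothetical model predictions
r2 = r_squared(ys, preds)
adj = adjusted_r_squared(ys, preds, n_fields=2)
```

Because the adjustment shrinks the score as more fields are used, comparing adjusted values across field subsets favors smaller models of equal fit.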
To find the optimal model, the Linear Regression algorithm is applied. In addition to the parameters of the Linear Regression algorithm, you can set the maximum degree of polynomial regression. The default value for the maximum degree of polynomial regression is 3.
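To illustrate what the maximum degree controls: a polynomial regression of degree d fits the target using the powers x, x², …, x^d of an explanatory field as inputs to the linear regression. The helper below is hypothetical, for illustration only.

```python
# Expand one explanatory value into its polynomial terms up to max_degree.
# With the default maximum degree of 3, one field yields three terms.
def polynomial_terms(x, max_degree=3):
    """Return [x, x**2, ..., x**max_degree] for one explanatory value."""
    return [x ** d for d in range(1, max_degree + 1)]

terms = polynomial_terms(2.0)   # degree-3 expansion of the value 2.0
```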
Before you modify the default value for the maximum degree of polynomial regression, take the following considerations into account:
The RBF algorithm builds a model as a linear combination of Gaussian components of the form f(i, r) = exp(-βr²) with r = ||x − c(i)||, where the function value f(i, r) of each component of the linear combination depends only on the distance r of the vector x = (x1, x2, x3, …) from the center c(i). Centers are often referred to as regions.
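A single Gaussian component of the form defined above can be evaluated as follows. The center coordinates and the β value in this sketch are illustrative only.

```python
import math

# One RBF component: f(i, r) = exp(-beta * r**2), where r is the Euclidean
# distance of the input vector x from the component's center c(i).
def rbf_component(x, center, beta):
    r2 = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
    return math.exp(-beta * r2)
```

The component's value is 1 at the center and falls off toward 0 as x moves away from it, so each component models the target's behavior in one region of the input space.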