Basic syntax and parameters for the PredictColumn procedure

The basic call of the PredictColumn procedure takes only the required parameters. It creates a view and a prediction model.

Basic Syntax

IDMMX.PredictColumn(viewName,
                    inputTable,
                    targetColumn)

You can choose to specify optional parameters in addition to the basic syntax.

Parameters

To predict future behavior, you must specify the following parameters for the PredictColumn procedure:
viewName
The name of the view that you want to build.
The PredictColumn procedure creates a view and a model. Depending on the mining function that is used to build the model, the model is stored in one of the following tables under the same name as the generated view:
  • IDMMX.ClassifModels if the target column is categorical
  • IDMMX.RegressionModels if the target column is numeric
If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

This parameter is of type VARCHAR. Its size is 240.

inputTable
The name of the input table or the input view.

The Easy Mining procedure ignores columns of the input table that are unlikely to be useful for creating a model, for example, key columns.

This parameter is of type VARCHAR. Its size is 257.

targetColumn
The name of the target column.

The PredictColumn procedure derives the values in this column from the values of the other columns in the input table. If the target column is categorical, the Classification mining function is used. If the target column is numeric, the Regression mining function is used.

This parameter is of type VARCHAR. Its size is 128.

For information about the valid SQL types of categorical and numerical fields, see Mining field types.
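
The following Python sketch shows one way to assemble a call with the three required parameters. It is an illustration, not part of the product: the helper names, the table name BANKING, the view name BANKING_PRED, and the target column INCOME are hypothetical. The sketch assumes that the procedure is invoked with the SQL CALL statement and treats the documented VARCHAR sizes as character counts.

```python
# Maximum lengths taken from the parameter descriptions above.
# Treating them as character counts is an assumption of this sketch.
MAX_LENGTHS = {"viewName": 240, "inputTable": 257, "targetColumn": 128}


def sql_literal(value: str) -> str:
    """Quote a value as an SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_predict_column_call(view_name: str, input_table: str, target_column: str) -> str:
    """Return a CALL statement for IDMMX.PredictColumn (hypothetical helper)."""
    args = {
        "viewName": view_name,
        "inputTable": input_table,
        "targetColumn": target_column,
    }
    for name, value in args.items():
        if not value:
            raise ValueError(f"{name} must not be empty")
        if len(value) > MAX_LENGTHS[name]:
            raise ValueError(f"{name} exceeds {MAX_LENGTHS[name]} characters")
    literals = ", ".join(sql_literal(value) for value in args.values())
    return f"CALL IDMMX.PredictColumn({literals})"


# Hypothetical names: BANKING is the input table, INCOME is the target column.
statement = build_predict_column_call("BANKING_PRED", "BANKING", "INCOME")
print(statement)
```

The resulting statement can be passed to any database client that is connected to the database that contains the input table.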

Data Flow

Figure 1 shows the data flow of the PredictColumn procedure. By applying the PredictColumn procedure to the input table with the specified target column, a model and a view are generated. The view includes the columns of the input table and the columns PREDICTION and CONFIDENCE.
Figure 1. Data flow of the PredictColumn procedure

Output

Based on the parameters, the PredictColumn procedure creates a view. This view contains the columns of the input table and the following additional columns:
PREDICTION
This column contains the predicted values of the target column. These values are derived from the values of the input table.
CONFIDENCE
This column contains the confidence value of the prediction.
If the target column is categorical, the confidence value can range from 0 to 1.
  • A value close to 0 indicates a low probability that the prediction is correct.
  • A value close to 1 indicates a high probability that the prediction is correct.

If the target column is numeric, this column contains only null values.

With the prediction confidence, you can select the most reliable predictions.
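
As a minimal sketch of such a selection, the following Python example filters predictions by a confidence threshold. It uses an in-memory SQLite table as a stand-in for the generated view, so it runs without a database server. The view name BANKING_PRED, the input columns CLIENT_ID and AGE, the sample rows, and the threshold of 0.8 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the generated view: input columns plus PREDICTION and CONFIDENCE.
conn.execute(
    "CREATE TABLE BANKING_PRED (CLIENT_ID INTEGER, AGE INTEGER, "
    "PREDICTION TEXT, CONFIDENCE REAL)"
)
conn.executemany(
    "INSERT INTO BANKING_PRED VALUES (?, ?, ?, ?)",
    [
        (1, 34, "yes", 0.95),
        (2, 51, "no", 0.55),
        (3, 27, "yes", 0.80),
        (4, 45, "no", 0.30),
    ],
)

MIN_CONFIDENCE = 0.8  # hypothetical threshold, choose one that suits your data

# Keep only the predictions whose confidence reaches the threshold.
reliable = conn.execute(
    "SELECT CLIENT_ID, PREDICTION, CONFIDENCE FROM BANKING_PRED "
    "WHERE CONFIDENCE >= ? ORDER BY CONFIDENCE DESC",
    (MIN_CONFIDENCE,),
).fetchall()
print(reliable)
```

A filter of this kind returns no rows for a numeric target column, because a comparison with a null CONFIDENCE value is never true.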

To analyze the prediction model in detail, you can use the visualizer in the Design Studio.

Training and Validation Data Sets

The PredictColumn procedure splits the input data in the following disjoint data sets:
Training data set
The training data set is used to compute the prediction model.
Validation data set
The validation data set is used to compute the quality of the prediction model.
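
The idea of a disjoint split can be sketched in a few lines of Python. This is not the procedure's internal algorithm: the function name, the random shuffling, the fixed seed, and the 30 percent validation share are assumptions made for illustration only.

```python
import random


def split_records(records, validation_fraction=0.3, seed=42):
    """Split records into disjoint training and validation lists (illustrative)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


records = list(range(10))  # record IDs standing in for rows of the input table
training, validation = split_records(records)
print(len(training), len(validation))
```

Each record ends up in exactly one of the two lists, so no record that is used for training is also used for validation.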

The model quality indicates how well the model might perform on unknown data. Typically, the model quality is better on the training data than on the validation data because the model might be tuned towards the records of the training data set.

In the extreme case, the model behaves as if it had learned all records of the training data set by heart. The model quality on the training data set would then be optimal, because the prediction for every record of the training data set would be correct. However, the model would not know what to predict for a record of the validation data set unless that record had the same values as a record in the training data set. Therefore, it is better to compute the quality of the model from data records that were not used in the training phase.
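
The extreme case can be made concrete with a small Python sketch. The "model" here is a plain lookup table that memorizes the training records. It is a deliberately naive stand-in, not the Classification mining function, and the records are invented for illustration.

```python
# Each record is (features, target value). The data is invented for illustration.
training = [
    ((25, "single"), "low"),
    ((40, "married"), "high"),
    ((33, "married"), "high"),
    ((58, "single"), "low"),
]
validation = [
    ((40, "married"), "high"),  # same values as a training record
    ((29, "single"), "low"),
    ((61, "married"), "high"),
    ((47, "single"), "high"),
]

# "Learning by heart": store every training record in a lookup table.
memorized = dict(training)


def predict(features):
    """Return the memorized target value, or None for unseen features."""
    return memorized.get(features)


def accuracy(records):
    """Fraction of records for which the prediction equals the target value."""
    correct = sum(1 for features, target in records if predict(features) == target)
    return correct / len(records)


print("training accuracy:", accuracy(training))
print("validation accuracy:", accuracy(validation))
```

The lookup table is perfect on the training records but is correct only for the one validation record whose values also occur in the training data, which is why quality measured on the training data alone would be misleading.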