IBM built-in analytics

You can perform analytics on data in your database by using IBM built-in analytics routines.

Classify your data by using a decision tree
- DECTREE - Build and prune a decision tree model
- PREDICT_DECTREE - Apply a decision tree model
Perform regression analysis on your data
- REGTREE - Build and prune a regression tree model
- PREDICT_DECTREE - Apply a decision tree model
Cluster your data using the K-means algorithm
- KMEANS - Build a K-means model
- PREDICT_KMEANS - Apply a K-means model

Prerequisite

Initialize the built-in analytics routines infrastructure by calling the INITIALIZE routine:

The INITIALIZE routine takes one input argument, an optional argument that is a string:

"force": Causes existing meta data to be recreated.

Classify your data using a decision tree

Build and prune a decision tree model by calling the DECTREE routine:

input-string

The DECTREE routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name to be given to the model that the DECTREE routine will build.

intable=input-table-name,

Specifies the name of the input table (the table containing the data that the routine will use to build the model.)

id=id-column-name,

Specifies the name of the ID column (the column containing the data which uniquely identifies each data instance, each row in the table.)

target=target-column-name,

Specifies the name of the target column (the column containing the class, the attribute upon which the classification is being performed.)

incolumn=column-specification-list,

The column-specification-list is a string which is a semi-colon-separated list of column-properties pairs. For example: incolumn=col1:nom:id;col2:cont,

:nom

Specifies to handle the data in the column as nominal (can only take one of a finite number of values.)

:cont

Specifies to handle the data in the column as continuous (can take any value between two values.) Numerical data is treated as continuous by default.

:id

Identifies the column as the ID column.

:target

Identifies the column as the target column.

:input

Identifies the column as an input column (the column will be included in the analytics operation.)

:ignore

Indicates the column should not be included in the analytics operation.

coldeftype=(nom|cont),

Specifies whether the data in columns should be processed by default as nominal (nom) or continuous (cont). If the coldeftype parameter is not specified, columns containing numerical data will be treated as continuous by default, and columns containing non-numerical data will be treated as nominal by default.

coldefrole=(input|ignore),

Specifies whether columns should be processed as input columns by default (input) or should not be included in the analytics operation by default (ignore). If the coldefrole parameter is not specified, columns will be treated as input columns by default.

colPropertiesTable=prop-table-name,

Specifies the name of the column properties table (the table specifying the properties of the columns in the input table.)

weights=weights-table-name,

Specifies the name of the weights table (the table specifying the weighting that should be applied to each of the columns in the input table.)

eval=(entropy|gini),

Specifies the class impurity measure (a measure of the variability of data in a class, used to determine whether to split the decision tree.) If the eval parameter is not specified, the "entropy" measure will be used.

minimprove=improve-number,

Specifies the minimum improvement in the class impurity measure to warrant splitting the decision tree. Valid values are double-precision floating-point number greater than or equal to zero. If the minimprove parameter is not specified, the value 0.01 will be used.

minsplit=split-number,

Specifies the minimum number of data instances required in each decision tree node. Valid values are integers greater than one. If the minsplit parameter is not specified, the value 50 will be used.

maxdepth=depth-number,

Specifies the maximum number of levels in the decision tree, including leaves. Valid values are integers greater than zero and less than 63. If the maxdepth parameter is not specified, the value 10 will be used.

valtable=validation-table-name,

Specifies the name of the validation table (the table that contains data for pruning the decision tree.) If the parameter valtable is not specified, the routine will not prune the decision tree.

valweights=validation-weights-table-name,

Specifies the name of the weights table that applies to the validation table.

qmeasure=(Acc|wAcc),

Specifies the pruning quality measure (used to determine whether to prune the decision tree.) If the qmeasure parameter is not specified, the "Acc" measure will be used.

statistics=(none|columns|values:n|all),

Specifies which statistics to collect:

none

No statistics will be collected.

columns

Statistics, such as the mean value, will be collected for the data in the columns in the input table.

values:n

Statistics will be collected for up to n data instances in the columns of the input table.

all

Equivalent to specifying statistics=values:100,.

If the statistics parameter is not specified, no statistics will be collected.
Apply the decision tree model by calling the PREDICT_DECTREE routine:

input-string

The PREDICT_DECTREE routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name of a decision tree model that was built by the DECTREE routine.

intable=input-table-name,

Specifies the name of the input table.

outtable=output-table-name,

Specifies the name of the output table (the table where the predictions will be stored.)

id=id-column-name,

Specifies the name of the ID column.

target=target-column-name,

Specifies the name of the target column.

prob=(true|false),

Specifies whether to include the probability of the predicted class in the output table. If the prob parameter is not specified, the probability will not be included in the output table.

outtableprob=probability-output-table-name,

Specifies the name of the probability output table (a table where the probability of the predicted class will be stored.)

Perform regression analysis on your data

Build and prune a regression tree model by calling the REGTREE routine:

input-string

The REGTREE routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name to be given to the model that the REGTREE routine will build.

intable=input-table-name,

Specifies the name of the input table.

id=id-column-name,

Specifies the name of the ID column.

target=target-column-name,

Specifies the name of the target column (the column containing the prediction target.)

incolumn=column-specification-list,

The column-specification-list is a string which is a semi-colon-separated list of column-properties pairs. For example: incolumn=col1:nom:id;col2:cont,

:nom

Specifies to handle the data in the column as nominal (can only take one of a finite number of values.)

:cont

Specifies to handle the data in the column as continuous (can take any value between two values.) Numerical data is treated as continuous by default.

:id

Identifies the column as the ID column.

:target

Identifies the column as the target column.

:input

Identifies the column as an input column (the column will be included in the analytics operation.)

:ignore

Indicates the column should not be included in the analytics operation.

coldeftype=(nom|cont),

Specifies whether the data in columns should be processed by default as nominal (nom) or continuous (cont). If the coldeftype parameter is not specified, columns containing numerical data will be treated as continuous by default, and columns containing non-numerical data will be treated as nominal by default.

coldefrole=(input|ignore),

Specifies whether columns should be processed as input columns by default (input) or should not be included in the analytics operation by default (ignore). If the coldefrole parameter is not specified, columns will be treated as input columns by default.

colPropertiesTable=prop-table-name,

Specifies the name of the column properties table (the table specifying the properties of the columns in the input table.)

eval=variance,

Specifies the split evaluation measure (used to determine whether to split the regression tree.) If the eval parameter is not specified, the "variance" measure will be used.

minimprove=improve-number,

Specifies the minimum improvement in the split evaluation measure to warrant splitting the regression tree. Valid values are double-precision floating-point number greater than or equal to 0.00001. If the minimprove parameter is not specified, the value 0.1 will be used.

minsplit=split-number,

Specifies the minimum number of data instances required in each regression tree node. Valid values are integers greater than one. If the minsplit parameter is not specified, the value 50 will be used.

maxdepth=depth-number,

Specifies the maximum number of levels in the regression tree, including leaves. Valid values are integers greater than zero and less than 63. If the maxdepth parameter is not specified, the value 10 will be used.

valtable=validation-table-name,

Specifies the name of the validation table. If the parameter valtable is not specified, the routine will not prune the regression tree.

qmeasure=(mse|r2),

Specifies the pruning quality measure. If the qmeasure parameter is not specified, the "mse" measure will be used.

statistics=(none|columns|values:n|all),

Specifies which statistics to collect:

none

No statistics will be collected.

columns

Statistics, such as the mean value, will be collected for the data in the columns in the input table.

values:n

Statistics will be collected for up to n data instances in the columns of the input table.

all

Equivalent to specifying statistics=values:100,.

If the statistics parameter is not specified, no statistics will be collected.
Apply the regression tree model by calling the PREDICT_REGTREE routine:

input-string

The PREDICT_REGTREE routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name of a regression tree model that was built by the REGTREE routine.

intable=input-table-name,

Specifies the name of the input table.

outtable=output-table-name,

Specifies the name of the output table.

id=id-column-name,

Specifies the name of the ID column.

target=target-column-name,

Specifies the name of the target column.

var=(true|false),

Specifies whether to include the variance of the prediction in the output table. If the var parameter is not specified, the variance will not be included in the output table.

Cluster your data using the K-means algorithm

Build a K-means model by calling the KMEANS routine:

input-string

The KMEANS routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name to be given to the model that the KMEANS routine will build.

intable=input-table-name,

Specifies the name of the input table.

outtable=output-table-name,

Specifies the name of the output table (the table where cluster assignments will be stored.)

id=id-column-name,

Specifies the name of the ID column.

distance=euclidean,

Specifies the distance function (a measure of the distance between a data instance and a cluster.)

k=number-of-centers,

Specifies the number of clusters. Valid values are integers greater than zero. If the k parameter is not specified, three clusters will be created.

maxiter=number-of-iterations,

Specifies the maximum number of iterations to perform. Valid values are integers greater than one and less than 1001. If the maxiter parameter is not specified, a maximum of five iterations will be performed.

randseed=generator-number,

Specifies a number to seed random number generation. Valid values are integers. If the randseed parameter is not specified, the value 12345 will be used.

idbased=(true|false),

Specifies whether to base the random number generation on the ID column value. If the idbased parameter is not specified, the random number generation will not be based on the ID column value.

incolumn=column-specification-list,

The column-specification-list is a string which is a semi-colon-separated list of column-properties pairs. For example: incolumn=col1:nom:id;col2:cont,

:nom

Specifies to handle the data in the column as nominal (can only take one of a finite number of values.)

:cont

Specifies to handle the data in the column as continuous (can take any value between two values.) Numerical data is treated as continuous by default.

:id

Identifies the column as the ID column.

:target

Identifies the column as the target column.

:input

Identifies the column as an input column (the column will be included in the analytics operation.)

:ignore

Indicates the column should not be included in the analytics operation.

coldeftype=(nom|cont),

Specifies whether the data in columns should be included by default as nominal (nom) or continuous (cont). If the coldeftype parameter is not specified, columns containing numerical data will be treated as continuous by default, and columns containing non-numerical data will be treated as nominal by default.

coldefrole=(input|ignore),

Specifies whether columns should be processed as input columns by default (input) or should not be included in the analytics operation by default (ignore). If the coldefrole parameter is not specified, columns will be treated as input columns by default.

colPropertiesTable=prop-table-name,

Specifies the name of the column properties table.

statistics=(none|columns|values:n|all),

Specifies which statistics to collect:

none

No statistics will be collected.

columns

Statistics, such as the mean value, will be collected for the data in the columns in the input table.

values:n

Statistics will be collected for up to n data instances in the columns of the input table.

all

Equivalent to specifying statistics=values:100,.

If the statistics parameter is not specified, no statistics will be collected.

transform=(L|N|S),

Specifies how table columns should be transformed:

L

Leave as-is, untransformed.

N

Normalize.

S

Standardize.

If the transform parameter is not specified, the columns will not be transformed.
Apply the K-means model by calling the PREDICT_KMEANS routine:

input-string

The PREDICT_KMEANS routine takes one input argument, a string which is a comma-separated list of parameter-values pairs:

model=model-name,

Specifies the name of a K-means model that was built by the MEANS routine.

intable=input-table-name,

Specifies the name of the input table.

outtable=output-table-name,

Specifies the name of the output table.

id=id-column-name,

Specifies the name of the ID column.