IDAX.TWOSTEP - Build a TwoStep clustering model

Use this stored procedure to synchronously or asynchronously build a TwoStep clustering model. The model distributes the input data into a hierarchical tree structure according to the distance between the data records, and then reduces the tree into k clusters. A second pass over the data associates the input data records to the next cluster.

Authorization

The privileges held by the authorization ID of the statement must include the IDAX_USER role.

Syntax

IDAX.TWOSTEP(in parameter_string varchar(32672), in synchronous boolean default true)

Parameter descriptions

parameter_string

Mandatory one-string parameter that contains pairs of <parameter>=<value> entries that are separated by a comma.

Data type: VARCHAR(32672)

The following list shows the parameter values:

model

Mandatory.

The name of the TwoStep clustering model that is to be built.

Data type: VARCHAR(64)

intable

Mandatory.

The name of the input table.

Data type: VARCHAR(128)

id

Mandatory.

The column of the input table that identifies a unique instance ID.

Data type: VARCHAR(128)

outtable

Optional.

The name of the output table in which the clusters are assigned to each record of the input table.

Data type: VARCHAR(128)

distance

Optional.

The distance function.

Allowed values are 'euclidean', 'norm_euclidean', and 'loglikelihood'.

Default: loglikelihood

Data type: VARCHAR(ANY)

k

Optional.

The number of clusters.

If k is 0 or less, the procedure determines the optimal number of clusters.

Default: 0

Data type: INTEGER

maxk

Optional.

The maximum number of clusters that can be determined automatically.

If k is bigger than 0, this parameter is ignored.

Default: 20 Min: 2

Data type: INTEGER

incolumn

Optional.

The columns of the input table that have specific properties, which are separated by a semi-colon (;).

Each column is succeeded by one or more of the following properties:

By type nominal (":nom", ":nominal") or by type continuous (":cont", ":continuous"). By default, numerical types are continuous, and all other types are nominal.
By type nominal (:nom) or by type continuous (:cont). By default, numerical types are continuous, and all other types are nominal.
By role id (":id"), target (":target"), input (":active", ":in", ":input") or ignore (":ignore", ":inactive").
By role :id, :target, :input, or :ignore.

If this parameter is not specified, all columns of the input table have default properties.

Default: none

Data type: VARCHAR(ANY)

coldeftype

Optional.

The default type of the input table columns.

Allowed values are nom and cont.

Allowed values are:

"nom" and "nominal" for type nominal
"cont" and "continuous" for type continuous

If the parameter is not specified, numeric columns are continuous, and all other columns are nominal.

Default: none

Data type: VARCHAR(ANY)

Data type: VARCHAR(10)

coldefrole

Optional.

The default role of the input table columns.

Allowed values are input and ignore.

Allowed values are:

"active", "in", and "input" for role input
"ignore" and "inactive" for role ignore

If the parameter is not specified, all columns are input columns.

Default: input

Data type: VARCHAR(ANY)

Data type: VARCHAR(8)

randseed

Optional.

The random generator seed.

If the randseed parameter is specified as <> 0, the models are identical for consecutive runs of the TwoStep procedure if these runs are done with the same data and the same parameter settings.

Default: 0

Data type: INTEGER

nodecapacity

Optional.

The branching factor of the internal tree that is used in pass 1.

Each node can have up to <nodecapacity> subnodes.

Default: 6 Min: 2

Data type: INTEGER

leafcapacity

Optional.

The number of clusters per leaf node in the internal tree that is used in pass 1.

Default: 8 Min: 2

Data type: INTEGER

outlierfraction

Optional.

The fraction of the records that is to be considered as outlier in the internal tree that is used in pass 1.

Clusters that contain less than <outlierfraction> times the mean number of data records per cluster are removed. Type: DOUBLE Default: 0.0 Min: 0.0 Max: 1.0

Default: 0.0 Min: 0.0 Max: 1.0

Data type: DOUBLE

spark_partitions

Optional.

Number of Spark partitions that are created with the input data.

This number determines the degree of parallelism that is used to create the model.

Note: If the number of rows per partition is too small, the resulting model might be imprecise. For small input data sets, even the default value might be too large. Therefore, explicitly set a smaller value.

Default: Number of available CPUs divided by 4.

Data type: INTEGER

synchronous

Optional.

Synchronous or asynchronous execution of the TWOSTEP stored procedure.

In a synchronous execution, the TWOSTEP stored procedure waits until the execution is completed. In an asynchronous execution, the TWOSTEP procedure returns as soon as it is submitted.

To get the status and the submission ID of an asynchronously executed TWOSTEP stored procedure, call the IDAX.APP_STATUS() stored procedure after you launch the TWOSTEP stored procedure.

To cancel an asynchronously executed TWOSTEP stored procedure, call the IDAX.CANCEL_APP() stored procedure while the TWOSTEP stored procedure is running.

Data type: BOOLEAN

Default: true

Returned information

The procedure does not return any SQL results.

Example

CALL IDAX.TWOSTEP('model=adult_mdl, randseed=12345, intable=adult_train, id=id');