IDAX.TWOSTEP - Build a TwoStep clustering model

Use this stored procedure to synchronously or asynchronously build a TwoStep clustering model. The model distributes the input data into a hierarchical tree structure according to the distance between the data records, and then reduces the tree into k clusters. A second pass over the data associates the input data records to the next cluster.

Authorization

The privileges held by the authorization ID of the statement must include the IDAX_USER role.

Syntax

IDAX.TWOSTEP(in parameter_string varchar(32672), in synchronous boolean default true)

Parameter descriptions

parameter_string
Mandatory one-string parameter that contains pairs of <parameter>=<value> entries that are separated by a comma.
Data type: VARCHAR(32672)
The following list shows the parameter values:
model
Mandatory.
The name of the TwoStep clustering model that is to be built.
Data type: VARCHAR(64)
intable
Mandatory.
The name of the input table.
Data type: VARCHAR(128)
id
Mandatory.
The column of the input table that identifies a unique instance ID.
Data type: VARCHAR(128)
outtable
Optional.
The name of the output table in which the clusters are assigned to each record of the input table.
Data type: VARCHAR(128)
distance
Optional.
The distance function.
Allowed values are 'euclidean', 'norm_euclidean', and 'loglikelihood'.
Default: loglikelihood
Data type: VARCHAR(ANY)
k
Optional.
The number of clusters.
If k is 0 or less, the procedure determines the optimal number of clusters.
Default: 0
Data type: INTEGER
maxk
Optional.
The maximum number of clusters that can be determined automatically.
If k is bigger than 0, this parameter is ignored.
Default: 20 Min: 2
Data type: INTEGER
incolumn
Optional.
The columns of the input table that have specific properties, which are separated by a semi-colon (;).
Each column is succeeded by one or more of the following properties:
  • By type nominal (":nom", ":nominal") or by type continuous (":cont", ":continuous"). By default, numerical types are continuous, and all other types are nominal.
  • By type nominal (:nom) or by type continuous (:cont). By default, numerical types are continuous, and all other types are nominal.
  • By role id (":id"), target (":target"), input (":active", ":in", ":input") or ignore (":ignore", ":inactive").
  • By role :id, :target, :input, or :ignore.
If this parameter is not specified, all columns of the input table have default properties.
Default: none
Data type: VARCHAR(ANY)
coldeftype
Optional.
The default type of the input table columns.
Allowed values are nom and cont.
Allowed values are:
  • "nom" and "nominal" for type nominal
  • "cont" and "continuous" for type continuous
If the parameter is not specified, numeric columns are continuous, and all other columns are nominal.
Default: none
Data type: VARCHAR(ANY)
Data type: VARCHAR(10)
coldefrole
Optional.
The default role of the input table columns.
Allowed values are input and ignore.
Allowed values are:
  • "active", "in", and "input" for role input
  • "ignore" and "inactive" for role ignore
If the parameter is not specified, all columns are input columns.
Default: input
Data type: VARCHAR(ANY)
Data type: VARCHAR(8)
randseed
Optional.
The random generator seed.
If the randseed parameter is specified as <> 0, the models are identical for consecutive runs of the TwoStep procedure if these runs are done with the same data and the same parameter settings.
Default: 0
Data type: INTEGER
nodecapacity
Optional.
The branching factor of the internal tree that is used in pass 1.
Each node can have up to <nodecapacity> subnodes.
Default: 6 Min: 2
Data type: INTEGER
leafcapacity
Optional.
The number of clusters per leaf node in the internal tree that is used in pass 1.
Default: 8 Min: 2
Data type: INTEGER
outlierfraction
Optional.
The fraction of the records that is to be considered as outlier in the internal tree that is used in pass 1.
Clusters that contain less than <outlierfraction> times the mean number of data records per cluster are removed. Type: DOUBLE Default: 0.0 Min: 0.0 Max: 1.0
Default: 0.0 Min: 0.0 Max: 1.0
Data type: DOUBLE
spark_partitions
Optional.
Number of Spark partitions that are created with the input data.
This number determines the degree of parallelism that is used to create the model.
Note: If the number of rows per partition is too small, the resulting model might be imprecise. For small input data sets, even the default value might be too large. Therefore, explicitly set a smaller value.
Default: Number of available CPUs divided by 4.
Data type: INTEGER
synchronous
Optional.
Synchronous or asynchronous execution of the TWOSTEP stored procedure.
In a synchronous execution, the TWOSTEP stored procedure waits until the execution is completed. In an asynchronous execution, the TWOSTEP procedure returns as soon as it is submitted.
To get the status and the submission ID of an asynchronously executed TWOSTEP stored procedure, call the IDAX.APP_STATUS() stored procedure after you launch the TWOSTEP stored procedure.
To cancel an asynchronously executed TWOSTEP stored procedure, call the IDAX.CANCEL_APP() stored procedure while the TWOSTEP stored procedure is running.
Data type: BOOLEAN
Default: true

Returned information

The procedure does not return any SQL results.

Example

CALL IDAX.TWOSTEP('model=adult_mdl, randseed=12345, intable=adult_train, id=id');