Use this stored procedure to synchronously or asynchronously build a TwoStep clustering
model. The model distributes the input data into a hierarchical tree structure according to the
distance between the data records, and then reduces the tree into k clusters. A second pass over the
data associates the input data records to the next cluster.
Authorization
The privileges held by the authorization ID of the statement must include the IDAX_USER role.
Syntax
IDAX.TWOSTEP(in parameter_string varchar(32672), in synchronous boolean default true)
Parameter descriptions
- parameter_string
- Mandatory one-string parameter that contains pairs of
<parameter>=<value> entries that are separated by a comma.
- Data type: VARCHAR(32672)
- The following list shows the parameter values:
-
- model
- Mandatory.
- The name of the TwoStep clustering model that is to be built.
- Data type: VARCHAR(64)
-
- intable
- Mandatory.
- The name of the input table.
- Data type: VARCHAR(128)
- id
- Mandatory.
- The column of the input table that identifies a unique instance ID.
- Data type: VARCHAR(128)
- outtable
- Optional.
- The name of the output table in which the clusters are assigned to each record of the input
table.
- Data type: VARCHAR(128)
- distance
- Optional.
- The distance function.
- Allowed values are 'euclidean', 'norm_euclidean', and 'loglikelihood'.
- Default: loglikelihood
- Data type: VARCHAR(ANY)
- k
- Optional.
- The number of clusters.
- If k is 0 or less, the procedure determines the optimal number of clusters.
- Default: 0
- Data type: INTEGER
- maxk
- Optional.
- The maximum number of clusters that can be determined automatically.
- If k is bigger than 0, this parameter is ignored.
- Default: 20 Min: 2
- Data type: INTEGER
- incolumn
- Optional.
- The columns of the input table that have specific properties, which are separated by a
semi-colon (;).
- Each column is succeeded by one or more of the following properties:
- By type nominal (":nom", ":nominal") or by type continuous (":cont",
":continuous"). By default, numerical types are continuous, and all other types are nominal.
- By type nominal (
:nom
) or by type continuous (:cont
). By
default, numerical types are continuous, and all other types are nominal.
- By role id (":id"), target (":target"), input (":active", ":in", ":input")
or ignore (":ignore", ":inactive").
- By role
:id
, :target
, :input
, or
:ignore
.
- If this parameter is not specified, all columns of the input table have default properties.
- Default: none
- Data type: VARCHAR(ANY)
- coldeftype
- Optional.
- The default type of the input table columns.
- Allowed values are
nom
and cont
.
- Allowed values are:
- "nom" and "nominal" for type nominal
- "cont" and "continuous" for type continuous
- If the parameter is not specified, numeric columns are continuous, and all other columns are
nominal.
- Default: none
- Data type: VARCHAR(ANY)
- Data type: VARCHAR(10)
- coldefrole
- Optional.
- The default role of the input table columns.
- Allowed values are
input
and ignore
.
- Allowed values are:
- "active", "in", and "input" for role input
- "ignore" and "inactive" for role ignore
- If the parameter is not specified, all columns are input columns.
- Default: input
- Data type: VARCHAR(ANY)
- Data type: VARCHAR(8)
- randseed
- Optional.
- The random generator seed.
- If the randseed parameter is specified as <> 0, the
models are identical for consecutive runs of the TwoStep procedure if these runs are done with the
same data and the same parameter settings.
- Default: 0
- Data type: INTEGER
- nodecapacity
- Optional.
- The branching factor of the internal tree that is used in pass 1.
- Each node can have up to <nodecapacity> subnodes.
- Default: 6 Min: 2
- Data type: INTEGER
- leafcapacity
- Optional.
- The number of clusters per leaf node in the internal tree that is used in pass 1.
- Default: 8 Min: 2
- Data type: INTEGER
- outlierfraction
- Optional.
- The fraction of the records that is to be considered as outlier in the internal tree that is
used in pass 1.
- Clusters that contain less than <outlierfraction> times the mean number of data records per
cluster are removed. Type: DOUBLE Default: 0.0 Min: 0.0 Max: 1.0
- Default: 0.0 Min: 0.0 Max: 1.0
- Data type: DOUBLE
- spark_partitions
- Optional.
- Number of Spark partitions that are created with the input data.
- This number determines the degree of parallelism that is used to create the model.
-
Note: If the number of rows per partition is too small, the resulting model might be imprecise. For
small input data sets, even the default value might be too large. Therefore, explicitly set a
smaller value.
- Default: Number of available CPUs divided by 4.
- Data type: INTEGER
- synchronous
- Optional.
- Synchronous or asynchronous execution of the TWOSTEP stored procedure.
- In a synchronous execution, the TWOSTEP stored procedure waits until the execution is completed.
In an asynchronous execution, the TWOSTEP procedure returns as soon as it is submitted.
- To get the status and the submission ID of an asynchronously executed TWOSTEP stored procedure,
call the IDAX.APP_STATUS() stored procedure after you launch the TWOSTEP stored procedure.
- To cancel an asynchronously executed TWOSTEP stored procedure, call the IDAX.CANCEL_APP() stored
procedure while the TWOSTEP stored procedure is running.
- Data type: BOOLEAN
- Default: true
Returned information
The procedure does not return any SQL results.
Example
CALL IDAX.TWOSTEP('model=adult_mdl, randseed=12345, intable=adult_train, id=id');