Examples for creating K-means clustering models
This example creates a clustering model for the customer churn data set.
The SAMPLES.CUSTOMER_CHURN table contains the data that is used in this example. For this example, a table that contains training data and a table that contains data for prediction, for example, applying the trained model, are needed.
The following call shows how to create these tables from the SAMPLES.CUSTOMER_CHURN table:
CALL IDAX.SPLIT_DATA('intable=samples.customer_churn, traintable=customer_churn_train, testtable=customer_churn_predict, id=cust_id, fraction=0.35');
The following call shows how to create a K-means clustering model:
CALL IDAX.KMEANS('intable=customer_churn_train, id=cust_id, k=5, maxiter=3, distance=euclidean, model=ci_km5c, outtable=ci_km5m_out');
This call uses the Euclidean function as a distance measure. It creates five clusters within three iterations at the most. It also creates four model tables and one output table that is specified through the outtable argument. This argument contains information about cluster membership for each instance from the intable training data set with the distance from the cluster center.
The following meta tables are specified through the model argument:
<model name>_MODEL
<model name>_CLUSTERS
<model name>_COLUMNS
<model name>_COLUMN_STATISTICS
The <model name>_MODEL contains the following information that applies to the entire clustering model.
- The <model name>_CLUSTERS contains all clusters in the model. It also contains information about clusters, for example, the cluster centers, the cluster size, and the sum of squared distances between cluster members and the center.
- The <model name >_COLUMNS contains all columns that are used by K-means clustering and scoring.
- The <model name>_COLUMN_STATISTICS contains column statistics information.
To analyze the created model, you can use the PRINT_KMEANS procedure as follows:
CALL IDAX.PRINT_MODEL('model=ci_km5c, mode=clusters');
Sample output is as follows:
Result set 1
--------------
CLUSTERID NAME SIZE RELSIZE WITHINSS DESCRIPTION
----------- ---------------- -------------------- ------------------------ ------------------------ ---------------
1 1 37 0.211428571428571 2.477881267144491E8 NULL
2 2 28 0.16 3482930.29850247 NULL
3 3 36 0.205714285714286 4.7292148007801384E7 NULL
4 4 27 0.154285714285714 2193463.04765905 NULL
5 5 47 0.268571428571429 8910472.36068028 NULL
5 record(s) selected.
Return Status = 0
To analyze the created model further, you can analyze the ci_km5c cluster table as follows:
SELECT * FROM ci_km5c_clusters ORDER BY clusterid;
The table contains one row for each cluster. It also contains one column for each attribute that is used for clustering with the mean attribute value for continuous attributes and the most frequent value for discrete attributes.
The following extra diagnostic columns are also included:
- clusterid
- The identification number of the cluster.
- withinss
- The sum of squared distances between the training instances that are assigned to the cluster and the cluster center.
To improve the readability of the cluster description, you can select only the diagnostic columns and skip the columns that represent the cluster centers as follows:
SELECT clusterid, size, withinss FROM ci_km5c_clusters ORDER BY clusterid;
The ci_km5m_out members table contains one row for each training instance and the following columns:
- id
- The distance identifier.
- cluster_id
- The identifier of the cluster to which the instance is assigned.
- distance
- The distance between the instance and the cluster center according to the distance measure of the clustering.
You can use this data to calculate extra indicators. The calculation is based on the assignment of training instances to clusters.
For example, to determine the distribution of the censor attribute, which indicates customer churns by means of 0 and 1, within the obtained clusters, you can use the following query:
SELECT cluster_id, censor, count(*)
FROM ci_km5m_out O, customer_churn_train T
WHERE O.id=T.cust_id
GROUP BY cluster_id, censor
ORDER BY cluster_id, censor;
You might specify a smaller value for k because the cluster sizes that are obtained by k=5 are unbalanced. Unbalanced means that each large cluster dominates the other clusters.
The following example shows how to better balance the clusters by specifying k=2:
CALL IDAX.KMEANS('intable=customer_churn_train, id=cust_id, k=2, maxiter=3, distance=euclidean, model=ci_km2c, outtable=ci_km2m_out');
SELECT clusterid, size, withinss FROM ci_km2c_clusters ORDER BY clusterid;
SELECT cluster_id, censor, count(*) FROM ci_km2m_out O, customer_churn_train T WHERE O.id=T.cust_id GROUP BY cluster_id, censor ORDER BY cluster_id, censor;
After you adjust the k value, the clusters are balanced.
You can apply the created clustering model that is based on the training set to new data as shown in the following example. The call generates cluster membership assignments for the customer churn predict set by using the clustering model that is created for k=5.
For scoring, the K-means clustering options and the statistics of columns and clusters all of which are used to build the K-means model are saved in meta tables. This information is used to score and predict new clusters.
The following list shows the meta tables:
- <model name>_CLUSTERS
- <model name>_COLUMNS
- <model name>_COLUMN_STATISTICS
CALL IDAX.PREDICT_KMEANS('intable=customer_churn_predict, id=cust_id, outtable=ci_km5m_predict, model=ci_km5c');
The output table contains one row for each instance from the specified data set and the following columns:
ID CLUSTER_ID DISTANCE
----------- ----------- ------------------------
You can use the output table to determine the distribution of the censor attribute within the obtained clusters on the predict set as follows:
SELECT cluster_id, censor, count(*)
FROM ci_km5m_predict O, customer_churn_predict T
WHERE O.id=T.id
GROUP BY cluster_id, censor
ORDER BY cluster_id, censor;