Collection of statistics for K-means clustering

For some applications, statistical information might not be necessary. To save space and time, statistics are therefore not collected by default.

To collect statistics, set the statistics parameter. This parameter specifies the level of detail at which statistics are collected.

Regardless of the statistics settings, all information that is required to score the model is stored.

For the K-means algorithm, the following information is stored:

  • Distance function
  • Mean values for each continuous column for each cluster
  • Modal values for each categorical column for each cluster
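To make the stored per-cluster information concrete, the following minimal Python sketch computes the mean of each continuous column and the modal value of each categorical column for each cluster. It is an illustration only, not the product's implementation: the function name summarize_clusters, the row layout, and the column names are hypothetical, and the cluster assignments are assumed to be already known.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_clusters(rows, assignments, continuous, categorical):
    """Return {cluster_id: {"means": {...}, "modes": {...}}}.

    Hypothetical helper: rows is a list of dicts, assignments holds the
    cluster ID of each row, and the two column lists name the continuous
    and categorical columns.
    """
    # Group the rows by the cluster they were assigned to.
    by_cluster = defaultdict(list)
    for row, cluster_id in zip(rows, assignments):
        by_cluster[cluster_id].append(row)

    summary = {}
    for cluster_id, members in by_cluster.items():
        # Mean value of each continuous column within the cluster.
        means = {col: mean(r[col] for r in members) for col in continuous}
        # Most frequent (modal) value of each categorical column.
        modes = {
            col: Counter(r[col] for r in members).most_common(1)[0][0]
            for col in categorical
        }
        summary[cluster_id] = {"means": means, "modes": modes}
    return summary


# Made-up sample data for illustration only.
rows = [
    {"age": 20, "income": 30.0, "region": "north"},
    {"age": 30, "income": 50.0, "region": "north"},
    {"age": 60, "income": 90.0, "region": "south"},
]
assignments = [0, 0, 1]
summary = summarize_clusters(rows, assignments, ["age", "income"], ["region"])
```

Together with the distance function, a summary of this kind is sufficient to assign a new record to its nearest cluster, which is why it is kept regardless of the statistics setting.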

If statistics=all[:N] or statistics=values:N, all statistical information is stored down to the value level of the individual columns. Discrete statistics are limited to at most N distinct values. The default value of N is 100. The default setting of the statistics parameter is none.

If statistics=columns, all statistical information is stored down to the level of the individual columns, but value-level statistics are omitted. <model name>_DISCRETE_STATISTICS and <model name>_NUMERIC_STATISTICS are not created.

If statistics=none, only statistical information that is necessary for scoring is stored.
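The settings described above can be summarized in a small Python sketch that interprets a statistics value. This is not the product's parser: the function name parse_statistics and the returned dictionary layout are hypothetical, and treating values without an explicit N as using the default of 100 is an assumption.

```python
DEFAULT_MAX_VALUES = 100  # default N for discrete statistics, per the text


def parse_statistics(value=None):
    """Interpret a statistics setting (hypothetical helper).

    Returns a dict with the level of detail and, for value-level
    statistics, the maximum number of distinct values N.
    """
    # Statistics are not collected by default.
    if value is None or value.strip() == "":
        return {"level": "none", "max_values": None}

    level, sep, limit = value.strip().lower().partition(":")

    if level in ("none", "columns"):
        if sep:
            raise ValueError(f"statistics={level} does not take a limit")
        return {"level": level, "max_values": None}

    if level in ("all", "values"):
        # Assumption: a missing N falls back to the default for both
        # settings, although the text shows it as optional only for all.
        n = int(limit) if sep else DEFAULT_MAX_VALUES
        if n < 1:
            raise ValueError("N must be a positive integer")
        return {"level": level, "max_values": n}

    raise ValueError(f"unknown statistics setting: {value!r}")
```

For example, parse_statistics("values:25") yields a limit of 25 distinct values, whereas parse_statistics("all") falls back to the default limit of 100.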