IBM Data WH K-Means Build Options Tab

By setting the build options, you can customize how the model is built.

If you want to build a model with the default options, click Run.

Distance measure. This parameter defines how the distance between data points is measured. Greater distances indicate greater dissimilarities. Select one of the following options:

  • Euclidean. The Euclidean measure is the straight-line distance between two data points.
  • Normalized Euclidean. The Normalized Euclidean measure is similar to the Euclidean measure, but it is normalized by the squared standard deviation. Unlike the Euclidean measure, the Normalized Euclidean measure is scale-invariant.
  • Mahalanobis. The Mahalanobis measure is a generalized Euclidean measure that takes correlations of input data into account. Like the Normalized Euclidean measure, the Mahalanobis measure is scale-invariant.
  • Manhattan. The Manhattan measure is the distance between two data points that is calculated as the sum of the absolute differences between their coordinates.
  • Canberra. The Canberra measure is similar to the Manhattan measure but it is more sensitive to data points that are closer to the origin.
  • Maximum. The Maximum measure is the distance between two data points that is calculated as the greatest absolute difference between their coordinates along any dimension.
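
The measures above can be illustrated with a short Python sketch. It uses the standard textbook definitions, which are assumed here and are not confirmed to match the product's internal formulas. The function names and the variances argument are hypothetical, and the Mahalanobis measure is omitted because it requires the inverse covariance matrix of the input data.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_euclidean(a, b, variances):
    # Each squared difference is divided by the variance (squared
    # standard deviation) of that coordinate. The variances are assumed
    # to be computed from the input data beforehand.
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))

def manhattan(a, b):
    # Sum of the absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Each absolute difference is weighted by the magnitudes of the
    # coordinates, so differences near the origin count more.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:  # skip terms where both coordinates are zero
            total += abs(x - y) / denom
    return total

def maximum(a, b):
    # Greatest absolute difference along any coordinate dimension.
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(maximum(p, q))     # 4.0
```

For the same two points, the Canberra sketch gives 3/5 + 4/8 = 1.1, which shows how the measure rescales each difference by the size of the coordinates involved.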

Number of clusters. This parameter defines the number of clusters to be created.

Maximum number of iterations. The algorithm repeats the same process over multiple iterations. This parameter defines the maximum number of iterations; model training stops when this number is reached.
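
To show how the number of clusters and the maximum number of iterations interact, here is a minimal sketch of a generic k-means loop in Python. It is an illustration under stated assumptions and not the product's implementation: the function name kmeans and its arguments are hypothetical, the Euclidean measure is used, initial centers are drawn at random from the data, and the loop also stops early when cluster assignments no longer change.

```python
import math
import random

def kmeans(points, k, max_iter, seed=0):
    # Hypothetical helper: k is the number of clusters, max_iter the
    # iteration cap, seed makes the initial centers reproducible.
    if max_iter < 1 or not 1 <= k <= len(points):
        raise ValueError("need max_iter >= 1 and 1 <= k <= len(points)")
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization strategy
    assignments = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so further iterations are pointless
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # iteration is the number of passes that were started.
    return centers, assignments, iteration

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, used = kmeans(data, k=2, max_iter=10)
print(sorted(centers), used)
```

A low iteration cap can stop training before the clusters settle, while a high cap only costs extra time when the data needs that many passes.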

Statistics. This parameter defines how many statistics are included in the model. Select one of the following options:

  • All. All column-related statistics and all value-related statistics are included.
    Note: This option includes the maximum number of statistics and might therefore affect the performance of your system. If you do not want to view the model in graphical format, specify None.
  • Columns. Column-related statistics are included.
  • None. Only statistics that are required to score the model are included.

Replicate results. Select this check box if you want to set a random seed to replicate analyses. You can specify an integer, or you can create a pseudo-random integer by clicking Generate.
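
The effect of a fixed seed can be sketched in Python as follows. This is only an illustration: pick_initial_centers and generate_seed are hypothetical names, and how the product itself uses the seed or generates one is not described here.

```python
import random

def pick_initial_centers(points, k, seed):
    # A generator created from the same seed always produces the same
    # sequence, so the same starting centers are chosen on every run.
    rng = random.Random(seed)
    return rng.sample(points, k)

def generate_seed():
    # Hypothetical stand-in for the Generate button: returns an integer
    # that can be stored and reused later to replicate an analysis.
    return random.SystemRandom().randrange(1, 2**31)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (5.0, 5.0)]
seed = generate_seed()
first = pick_initial_centers(points, 2, seed)
second = pick_initial_centers(points, 2, seed)
print(first == second)  # True: same seed, same starting centers
```

Because clustering of this kind starts from randomly chosen values, recording the seed is what allows a later run to reproduce the same model.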