IBM Support

Distance functions used with k-means and divisive clustering algorithms

Question & Answer


Question

What distance functions are available in the clustering modules and how can I use them?

Answer

IBM Netezza Analytics contains two clustering algorithms: k-means and divisive clustering.
Each algorithm is used in two steps:
1. Build a model (stored procedures: KMEANS and DIVCLUSTER)
2. Apply the model to new data (stored procedures: PREDICT_KMEANS and PREDICT_DIVCLUSTER)
When you call a clustering stored procedure, you can use the distance parameter to specify the distance function that the clustering algorithm uses. For continuous attributes, four distance functions are available:
1. euclidean (the default distance function)
2. manhattan
3. maximum
4. canberra
For nominal attributes, the hamming distance is used.
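
To make the difference between these options concrete, the following Python sketch gives the usual textbook definitions of the five distances. It is an illustration only: the function names are chosen here for readability, and the exact formulas that IBM Netezza Analytics applies internally (for example, any scaling, or how continuous and nominal attributes are combined) are not described in this document and may differ.

```python
# Textbook definitions of the distance functions named above.
# Illustrative assumption: these are NOT taken from the Netezza implementation.

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    # Largest absolute coordinate difference (Chebyshev distance).
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Sum of |x - y| / (|x| + |y|); terms where both values are 0 are skipped.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total

def hamming(a, b):
    # Number of positions where two nominal (categorical) vectors differ.
    return sum(x != y for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4), the euclidean distance is 5, the manhattan distance is 7, and the maximum distance is 4, so the choice of function can change which cluster center is nearest to a record.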

The following examples show how to specify the manhattan distance function instead of the default euclidean distance function for the k-means algorithm and for the divisive clustering algorithm:

call nza..KMEANS('model=adult_mdl, intable=nza..adult, outtable=adult_out, id=id, target=income, distance=manhattan, k=3');

or

call nza..DIVCLUSTER('model=adult_mdl, intable=nza..adult, outtable=adult_out, id=id, target=income, distance=manhattan, maxdepth=3');

where:

  • model=adult_mdl - defines the name of the table where the model is stored
  • intable=nza..adult - defines the name of the table containing the input dataset
  • outtable=adult_out - defines the name of the table where the cluster assignments are stored
  • id=id - defines the name of the column containing a unique instance identifier in the input table
  • target=income - defines the name of the target attribute (the clustering algorithm ignores this column)
  • distance=manhattan - defines the name of the distance function that the clustering algorithm uses
  • k=3 - defines the number of centers in the k-means algorithm
  • maxdepth=3 - defines the maximum number of levels in the cluster tree in the divisive clustering algorithm
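
If the call is generated from a client program, the parameter string can be assembled from key=value pairs. The following Python sketch shows one way to do that. The helper names build_params and kmeans_call are hypothetical and are not part of IBM Netezza Analytics; the sketch only builds the statement text and does not connect to a database or validate the values.

```python
# Hypothetical helpers that assemble the statement text shown above.
# They do no escaping or validation, so pass only trusted values.

def build_params(**params):
    # Join keyword arguments into a "key=value, key=value" string.
    return ", ".join(f"{key}={value}" for key, value in params.items())

def kmeans_call(**params):
    # Wrap the parameter string in a call to the KMEANS stored procedure.
    return f"call nza..KMEANS('{build_params(**params)}');"

sql = kmeans_call(
    model="adult_mdl",
    intable="nza..adult",
    outtable="adult_out",
    id="id",
    target="income",
    distance="manhattan",
    k=3,
)
print(sql)
```

Switching to another distance function then only requires changing the value of the distance argument, for example distance="canberra".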

Historical Number

NZ153287

Document Information

More support for:
IBM PureData System

Software version:
1.0.0

Document number:
460725

Modified date:
17 October 2019

UID

swg21568259