Overview (CLUSTER command)
CLUSTER produces hierarchical clusters of items based on distance measures of dissimilarity or similarity. The items being clustered are usually cases from the active dataset, and the distance measures are computed from their values for one or more variables. You can also cluster variables if you read in a matrix measuring distances between variables. Cluster analysis is discussed in Anderberg (1973).
Options
Cluster Measures and Methods. You can specify one of 37 similarity or distance measures on the MEASURE subcommand and any of the seven methods on the METHOD subcommand.
New Variables. You can save cluster membership for specified solutions as new variables in the active dataset using the SAVE subcommand.
Display and Plots. You can display cluster membership, the distance or similarity matrix used to cluster variables or cases, and the agglomeration schedule for the cluster solution with the PRINT subcommand. You can request either a horizontal or vertical icicle plot or a dendrogram of the cluster solution and control the cluster levels displayed in the icicle plot with the PLOT subcommand. You can also specify a variable to be used as a case identifier in the display on the ID subcommand.
Matrix Input and Output. You can write out the distance matrix and use it in subsequent CLUSTER, PROXIMITIES, or ALSCAL analyses, or read in matrices produced by other CLUSTER or PROXIMITIES procedures, using the MATRIX subcommand.
Basic Specification
The basic specification is a variable list. CLUSTER assumes that the items being clustered are cases and uses the squared Euclidean distances between cases on the variables in the analysis as the measure of distance.
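The default measure described above can be illustrated outside the procedure. The following Python sketch is only an illustration, not part of CLUSTER itself: the function names `squared_euclidean` and `distance_matrix` and the sample values are hypothetical. It computes the squared Euclidean distance between each pair of cases as the sum of squared differences across the analysis variables.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two cases across the variables."""
    if len(a) != len(b):
        raise ValueError("cases must have the same number of variables")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_matrix(cases):
    """Lower-triangular layout: row i holds distances from case i to cases 0..i-1."""
    return [
        [squared_euclidean(cases[i], cases[j]) for j in range(i)]
        for i in range(len(cases))
    ]


# Three hypothetical cases measured on two variables.
cases = [[1.0, 2.0], [4.0, 6.0], [1.0, 3.0]]
print(distance_matrix(cases))  # [[], [25.0], [1.0, 18.0]]
```

In this example the first and third cases are closest (distance 1.0), so they would be the first pair combined.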
Subcommand Order
- The variable list must be specified first.
- The remaining subcommands can be specified in any order.
Syntax Rules
- The variable list and subcommands can each be specified once.
- More than one clustering method can be specified on the METHOD subcommand.
Operations
The CLUSTER procedure involves four steps:
- First, CLUSTER obtains measures of similarity between, or distance separating, the initial clusters (individual cases, or individual variables if the input is a matrix measuring distances between variables).
- Second, it combines the two nearest clusters to form a new cluster.
- Third, it recomputes similarities or distances of existing clusters to the new cluster.
- Fourth, it returns to the second step and repeats until all items are combined in one cluster.
This process yields a hierarchy of cluster solutions, ranging from one overall cluster to as many clusters as there are items being clustered. Clusters at a higher level can contain several lower-level clusters. Within each level, the clusters are disjoint (each item belongs to only one cluster).
- CLUSTER identifies clusters in solutions by sequential integers (1, 2, 3, and so on).
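The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure's actual implementation: it uses squared Euclidean distance between cases, and it recomputes distances with single linkage (nearest neighbor) purely for brevity. The function name `agglomerate`, the 0-based case indexes, and the sample data are hypothetical.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(cases):
    """Return an agglomeration schedule as (cluster_a, cluster_b, distance) tuples."""
    # Step 1: distances between the initial clusters (one per case).
    members = {i: [i] for i in range(len(cases))}
    dist = {
        (i, j): squared_euclidean(cases[i], cases[j])
        for i in members for j in members if i < j
    }
    schedule = []
    while len(members) > 1:
        # Step 2: combine the two nearest clusters (b is absorbed into a).
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        schedule.append((a, b, d))
        members[a] = members[a] + members.pop(b)
        # Step 3: recompute distances from the new cluster to every other
        # cluster. Assumption: single linkage, so keep the smaller distance.
        for k in members:
            if k == a:
                continue
            d_ak = dist.pop((min(a, k), max(a, k)))
            d_bk = dist.pop((min(b, k), max(b, k)))
            dist[(min(a, k), max(a, k))] = min(d_ak, d_bk)
        del dist[(a, b)]
        # Step 4: loop back to step 2 until one cluster remains.
    return schedule


# Four hypothetical cases on a single variable.
print(agglomerate([[0.0], [1.0], [5.0], [7.0]]))
# [(0, 1, 1.0), (2, 3, 4.0), (0, 2, 16.0)]
```

Each tuple records one stage of the hierarchy: cases 0 and 1 merge first, then cases 2 and 3, and finally the two resulting clusters are combined. Other methods differ only in how step 3 recomputes the distances.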
Limitations
- CLUSTER stores cases and a lower-triangular matrix of proximities in memory. Storage requirements increase rapidly with the number of cases. You should be able to cluster 100 cases using a small number of variables in an 80K workspace.
- CLUSTER does not honor weights.
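To see why the storage requirement in the first limitation grows so quickly, it helps to count the entries in a lower-triangular proximity matrix. The Python sketch below assumes the diagonal is not stored. The function name `lower_triangle_entries` is hypothetical, and the actual memory layout used by CLUSTER is not specified here.

```python
def lower_triangle_entries(n_items):
    """Number of pairwise proximities among n_items, excluding the diagonal."""
    return n_items * (n_items - 1) // 2


for n in (100, 1000, 10000):
    print(n, lower_triangle_entries(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

The count grows with the square of the number of items, so a tenfold increase in cases requires roughly a hundredfold increase in matrix storage.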