# K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups
of cases based on selected characteristics, using an algorithm that
can handle large numbers of cases. However, the algorithm requires
you to specify the number of clusters. You can specify initial cluster
centers if you know this information. You can select one of two methods
for classifying cases, either updating cluster centers iteratively
or classifying only. You can save cluster membership, distance information,
and final cluster centers. Optionally, you can specify a variable
whose values are used to label casewise output. You can also request
analysis of variance *F* statistics. While these statistics are
opportunistic (the procedure tries to form groups that do differ),
the relative size of the statistics provides information about each
variable's contribution to the separation of the groups.
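
The two classification methods described above can be illustrated with a minimal sketch. This is not the procedure's own implementation: the function names (`nearest`, `classify_only`, `iterate_and_classify`), the `max_iter` parameter, and the stopping rule (stop when the centers no longer change) are assumptions made for illustration.

```python
import math

def nearest(case, centers):
    """Return (cluster index, Euclidean distance) for the closest center.
    Ties go to the center that appears first in the list."""
    dists = [math.dist(case, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]

def classify_only(cases, centers):
    """Assign each case to its nearest center without moving the centers."""
    return [nearest(case, centers) for case in cases]

def iterate_and_classify(cases, centers, max_iter=10):
    """Alternate between assigning cases and recomputing each center as the
    mean of its members, then classify against the final centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        labels = [nearest(case, centers)[0] for case in cases]
        new_centers = []
        for k, old in enumerate(centers):
            members = [c for c, lab in zip(cases, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, classify_only(cases, centers)

cases = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers, membership = iterate_and_classify(cases, [(0.0, 0.0), (10.0, 10.0)])
print(final_centers)
print(membership)
```

Each entry of `membership` pairs a cluster index with the distance from the case to its cluster center, which corresponds to the kind of casewise information the procedure can save.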

**Example.** What are some identifiable groups of television
shows that attract similar audiences within each group? With *k*-means
cluster analysis, you could cluster television shows (cases) into *k* homogeneous
groups based on viewer characteristics. This process can be used to
identify segments for marketing. Or you can cluster cities (cases)
into homogeneous groups so that comparable cities can be selected
to test various marketing strategies.

**Statistics.** Complete solution: initial cluster centers,
ANOVA table. Each case: cluster information, distance from cluster
center.
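
As a rough illustration of how the *F* statistics in the ANOVA table compare variables, the sketch below computes a one-way *F* ratio for a single variable given the cluster memberships. The function name `anova_f` and the sample data are hypothetical, and the sketch assumes at least two clusters and some within-cluster variation (otherwise a division by zero occurs).

```python
def anova_f(values, labels):
    """One-way ANOVA F ratio for one variable, grouped by cluster label:
    (between-cluster mean square) / (within-cluster mean square)."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # cluster means differ strongly
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]  # cluster means are identical
print(anova_f(separating, labels))   # 54.0
print(anova_f(overlapping, labels))  # 0.0
```

The variable with the larger *F* contributes more to separating the clusters. Because the clusters were formed to maximize such differences, the values are useful for comparison between variables, not as significance tests.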

## K-Means Cluster Analysis Data Considerations

**Data.** Variables should be quantitative at the interval
or ratio level. If your variables are binary or counts, use the Hierarchical
Cluster Analysis procedure.

**Case and initial cluster center order.** The default algorithm
for choosing initial cluster centers is not invariant to case ordering.
The Use running means option in the Iterate
dialog box makes the resulting solution potentially dependent on case
order, regardless of how initial cluster centers are chosen. If you
are using either of these methods, you may want to obtain several
different solutions with cases sorted in different random orders to
verify the stability of a given solution. Specifying initial cluster
centers and not using the Use running means option
will avoid issues related to case order. However, ordering of the
initial cluster centers may affect the solution if there are tied
distances from cases to cluster centers. To assess the stability of
a given solution, you can compare results from analyses with different
permutations of the initial center values.
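
To see why updating centers as each case is assigned can make a solution depend on case order, consider the following sketch of a single running-means pass. It is an illustrative simplification, not the procedure's exact algorithm: the function names are hypothetical, and it assumes that each center becomes the running mean of the cases assigned to it so far.

```python
import math

def running_means_pass(cases, initial_centers):
    """Assign cases one at a time, updating the chosen center immediately
    as the running mean of the cases assigned to it so far."""
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)
    clusters = [[] for _ in centers]
    for case in cases:
        k = min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))
        counts[k] += 1
        centers[k] = [c + (x - c) / counts[k] for c, x in zip(centers[k], case)]
        clusters[k].append(case)
    return centers, clusters

def partition(clusters):
    """Order-independent view of a solution, for comparing two runs."""
    return {frozenset(members) for members in clusters}

initial = [(0.0,), (10.0,)]
order_a = [(4.0,), (6.0,), (1.0,), (9.0,)]
order_b = [(1.0,), (9.0,), (4.0,), (6.0,)]  # same cases, different order

_, clusters_a = running_means_pass(order_a, initial)
_, clusters_b = running_means_pass(order_b, initial)
print(partition(clusters_a) == partition(clusters_b))  # False
```

With the first ordering, the cases 4, 6, and 1 end up together and 9 is alone. With the second ordering, 1 and 4 form one cluster and 9 and 6 form the other. Comparing partitions across several random orderings in this way is one approach to checking stability.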

**Assumptions.** Distances are computed using simple Euclidean
distance. If you want to use another distance or similarity measure,
use the Hierarchical Cluster Analysis procedure. Scaling of variables
is an important consideration. If your variables are measured on different
scales (for example, one variable is expressed in dollars and another
variable is expressed in years), your results may be misleading. In
such cases, you should consider standardizing your variables before
you perform the *k*-means cluster analysis (this task can be
done in the Descriptives procedure). The procedure assumes that you
have selected the appropriate number of clusters and that you have
included all relevant variables. If you have chosen an inappropriate
number of clusters or omitted important variables, your results may
be misleading.
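
The effect of scaling on Euclidean distance can be shown with a small sketch. The data and the `standardize` helper are hypothetical. The helper converts each variable to z-scores using the sample standard deviation, which is assumed here to match what you would save from the Descriptives procedure, and it assumes no variable is constant.

```python
import math
import statistics

def standardize(cases):
    """Convert each variable (column) to z-scores: (x - mean) / sample SD."""
    columns = list(zip(*cases))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(case, means, sds))
            for case in cases]

# Each case is (income in dollars, tenure in years).
raw = [(30000.0, 2.0), (30100.0, 30.0), (33000.0, 3.0), (60000.0, 15.0)]
z = standardize(raw)

# On the raw scale, dollars dominate: case 0 looks closest to case 1,
# even though their tenures differ by 28 years.
print(math.dist(raw[0], raw[1]) < math.dist(raw[0], raw[2]))  # True

# After standardizing, both variables contribute comparably and
# case 0 is closest to case 2.
print(math.dist(z[0], z[2]) < math.dist(z[0], z[1]))  # True
```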

## To Obtain a K-Means Cluster Analysis

This feature requires the Statistics Base option.

- From the menus choose:
- Select the variables to be used in the cluster analysis.
- Specify the number of clusters. (The number of clusters must be at least 2 and must not be greater than the number of cases in the data file.)
- Select either Iterate and classify or Classify only.
- Optionally, select an identification variable to label cases.

This procedure pastes QUICK CLUSTER command syntax.