Overview (QUICK CLUSTER command)
When the desired number of clusters is known, QUICK CLUSTER
groups cases efficiently into
clusters. It is not as flexible as CLUSTER, but it uses
considerably less processing time and memory, especially
when the number of cases is large.
Options
Algorithm Specifications. You can specify
the number of clusters to form with the CRITERIA
subcommand. You can also use CRITERIA
to control initial cluster selection and the criteria for iterating
the clustering algorithm. With the METHOD
subcommand, you can specify how to update cluster centers, and you
can request classification only, which is useful when working with
very large data files.
Initial Cluster Centers. By default, QUICK CLUSTER
chooses the initial cluster centers. Alternatively,
you can provide initial centers on the INITIAL
subcommand. You can also read initial cluster centers from IBM® SPSS® Statistics data files using the FILE
subcommand.
Optional Output. With the PRINT
subcommand, you can display the cluster
membership of each case and the distance of each case from its cluster
center. You can also display the distances between the final cluster
centers and a univariate analysis of variance between clusters for
each clustering variable.
Saving Results. You can write the final
cluster centers to a data file using the OUTFILE
subcommand. In addition, you can save the cluster
membership of each case and the distance from each case to its classification
cluster center as new variables in the active dataset using the SAVE
subcommand.
Basic Specification
The basic specification is a list of variables. By default, QUICK CLUSTER
produces two clusters. The
two cases that are farthest apart based on the values of the clustering
variables are selected as initial cluster centers and the rest of
the cases are assigned to the nearer center. The new cluster centers
are calculated as the means of all cases in each cluster, and if neither
the minimum change nor the maximum iteration criterion is met, all
cases are assigned to the new cluster centers again. When one of the
criteria is met, iteration stops, the final cluster centers are updated,
and the distance of each case is computed.
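The default starting point described above can be illustrated with a short Python sketch. It is not the procedure's implementation: it assumes Euclidean distance, uses a brute-force search over all pairs of cases, and covers only the two-cluster default. The function names `farthest_pair` and `assign` and the sample data are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def farthest_pair(cases):
    """Return the two cases with the largest Euclidean distance between them."""
    return max(combinations(cases, 2), key=lambda pair: dist(pair[0], pair[1]))

def assign(cases, centers):
    """Return, for each case, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: dist(case, centers[k]))
            for case in cases]

# Illustrative data: five cases measured on two clustering variables.
cases = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (0.5, 0.8)]

centers = list(farthest_pair(cases))  # initial centers: the two most distant cases
labels = assign(cases, centers)       # every case goes to the nearer center
```

The brute-force search compares every pair of cases, so it is suitable only for small illustrations.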
Subcommand Order
- The variable list must be specified first.
- Subcommands can be named in any order.
Operations
The procedure generally involves four steps:
- First, initial cluster centers are selected, either by choosing one case for each cluster requested or by using the specified values.
- Second, each case is assigned to the nearest cluster center, and the mean of each cluster is calculated to obtain the new cluster centers.
- Third, the maximum change between the new cluster centers and the initial cluster centers is computed. If the maximum change is not less than the minimum change value and the maximum iteration number is not reached, the second step is repeated and the cluster centers are updated. The process stops when either the minimum change or maximum iteration criterion is met. The resulting cluster centers are used as classification centers in the last step.
- In the last step, all cases are assigned to the nearest classification center. The final cluster centers are updated and the distance for each case is computed.
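The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's actual code: it uses Euclidean distance, measures change against the centers from the previous pass, and leaves an empty cluster at its old center. The names `quick_cluster_sketch`, `max_iter`, and `min_change`, along with their default values, are hypothetical and are not the command's keywords or defaults.

```python
from math import dist  # Euclidean distance, Python 3.8+

def nearest(case, centers):
    """Index of the center closest to the case."""
    return min(range(len(centers)), key=lambda k: dist(case, centers[k]))

def mean_centers(cases, labels, old_centers):
    """Mean of the cases in each cluster; an empty cluster keeps its old center."""
    new = []
    for k, old in enumerate(old_centers):
        members = [c for c, lab in zip(cases, labels) if lab == k]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)
    return new

def quick_cluster_sketch(cases, initial_centers, max_iter=10, min_change=1e-6):
    centers = list(initial_centers)                       # step 1: initial centers
    for _ in range(max_iter):
        labels = [nearest(c, centers) for c in cases]     # step 2: assign cases,
        new_centers = mean_centers(cases, labels, centers)  # then take cluster means
        # Step 3: largest movement of any center in this pass.
        change = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if change < min_change:
            break
    # Step 4: classify against the classification centers, update the final
    # centers, and compute each case's distance to its classification center.
    labels = [nearest(c, centers) for c in cases]
    final_centers = mean_centers(cases, labels, centers)
    distances = [dist(c, centers[lab]) for c, lab in zip(cases, labels)]
    return labels, final_centers, distances

cases = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (0.5, 0.8)]
labels, final_centers, distances = quick_cluster_sketch(
    cases, initial_centers=[(9.0, 9.5), (0.5, 0.8)])
```

With this data the assignments do not change after the first pass, so the loop stops on the second pass when no center moves.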
When the number of cases is large, directly clustering all cases may be impractical. As an alternative, you can cluster a sample of cases and then use the cluster solution for the sample to classify the entire group. This can be done in two phases:
- The first phase obtains a cluster solution for the sample. This involves all four steps of the QUICK CLUSTER algorithm. OUTFILE then saves the final cluster centers to a data file.
- The second phase requires only one pass through the data. First, the FILE subcommand specifies the file containing the final cluster centers from the first analysis. These final cluster centers are used as the initial cluster centers for the second analysis. CLASSIFY is specified on the METHOD subcommand to skip the second and third steps of the clustering algorithm, and cases are classified using the initial cluster centers. When all cases are assigned, the cluster centers are updated and the distance of each case is computed. This phase can be repeated until final cluster centers are stable.
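The second phase can be pictured with a small Python sketch of a classification-only pass. It is an illustration under assumptions, not the procedure's code: it uses Euclidean distance, the function name `classify_only` is hypothetical, and the centers in `sample_centers` are made-up values standing in for centers saved from a first-phase run on a sample.

```python
from math import dist  # Euclidean distance, Python 3.8+

def classify_only(cases, centers):
    """One pass: assign each case to the nearest given center, then update centers."""
    labels = [min(range(len(centers)), key=lambda k: dist(case, centers[k]))
              for case in cases]
    # Distance of each case from the center it was classified against.
    distances = [dist(case, centers[lab]) for case, lab in zip(cases, labels)]
    updated = []
    for k, old in enumerate(centers):
        members = [c for c, lab in zip(cases, labels) if lab == k]
        # An empty cluster keeps its old center (an assumption of this sketch).
        updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                       if members else old)
    return labels, updated, distances

# Hypothetical centers, standing in for those saved after clustering a sample.
sample_centers = [(8.5, 8.75), (1.0, 1.25)]
all_cases = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5),
             (0.5, 0.8), (7.0, 9.0), (2.0, 0.5)]

labels, centers, distances = classify_only(all_cases, sample_centers)

# Optionally repeat the pass until the centers stop changing (capped for safety).
for _ in range(20):
    labels, new_centers, distances = classify_only(all_cases, centers)
    if new_centers == centers:
        break
    centers = new_centers
```

Each call reads the cases once, which is why this phase stays cheap even when the full dataset is much larger than the sample.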