IBM Support

Clustering Binary Data (should be avoided)

Troubleshooting


Problem

I would like to perform a hierarchical cluster analysis, but my data are binary: each variable takes on only two possible values, e.g. "Yes" or "No". I have heard that this is not a good idea. Can I cluster these data?

Resolving The Problem

No, you should usually avoid clustering binary-valued data with hierarchical clustering. The resulting clusters tend to be arbitrary and are sensitive to the order in which cases appear in the file. In contrast to hierarchical clustering, the SPSS TwoStep Cluster procedure, which is available in the Base module in SPSS 11.5 or later versions, uses a likelihood-based distance measure that accommodates categorical variables, including binary variables. Consider TwoStep Cluster (Analyze->Classify->TwoStep Cluster) for clustering binary or other categorical variables.

To see why there can be problems in a hierarchical cluster analysis, count the number of disagreements for each pair of cases. For example, suppose Alice answers Yes, Yes, Yes to three questions, while Bob answers No, Yes, Yes. They disagree on one question. Similarly, if Cathy answers No, No, Yes, she disagrees with Alice on two questions but with Bob on only one. These counts are all the information that is left for the clustering algorithm to work with.
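The disagreement counts for Alice, Bob, and Cathy can be reproduced with a short Python sketch. This is an illustration only and is not part of the SPSS syntax; the function name `disagreements` is made up for this example. When the answers are coded 0 and 1, the same count equals the squared Euclidean distance between two cases, because each differing answer contributes exactly 1 to the sum.

```python
def disagreements(x, y):
    """Count the questions on which two respondents answer differently."""
    return sum(a != b for a, b in zip(x, y))

# Answers to the three questions, as in the example above.
alice = ("Yes", "Yes", "Yes")
bob = ("No", "Yes", "Yes")
cathy = ("No", "No", "Yes")

print(disagreements(alice, bob))    # 1
print(disagreements(alice, cathy))  # 2
print(disagreements(bob, cathy))    # 1
```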

Suppose that there are three variables. Then there are eight possible profiles which may be observed (though some profiles may occur many times, while others may not be observed at all).

Here are all eight possibilities:

0 0 0
0 0 1
0 1 1
0 1 0
1 1 0
1 0 0
1 0 1
1 1 1

At first they may appear to have been written in an unusual order, but observe that each pair of adjacent profiles disagrees on only one response, like Alice and Bob did above. Cluster these data, and look at the dendrogram: the resulting dendrogram will display the data in a different order, but with the same property. (Such an ordering, by the way, is known as a Gray Code.) Sort these data into any order you like, and cluster again. You will see that the clusters are quite arbitrary, though the ordering of the cases displayed in the dendrogram will almost certainly be a Gray Code.
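The one-disagreement property of the listed order can be checked with a small Python sketch (illustrative only; the helper names are hypothetical). It counts disagreements between each pair of adjacent profiles, first in the order listed above and then in ordinary sorted order for comparison.

```python
# The eight profiles in the order listed above.
profiles = ["000", "001", "011", "010", "110", "100", "101", "111"]

def disagreements(x, y):
    """Count the positions at which two profiles differ."""
    return sum(a != b for a, b in zip(x, y))

def adjacent_disagreements(order):
    """Disagreement count for each pair of neighbouring profiles."""
    return [disagreements(p, q) for p, q in zip(order, order[1:])]

listed = adjacent_disagreements(profiles)
in_sorted_order = adjacent_disagreements(sorted(profiles))

print(listed)           # [1, 1, 1, 1, 1, 1, 1]
print(in_sorted_order)  # [1, 2, 1, 3, 1, 2, 1]
```

Only the listed order has every neighbouring pair differing on exactly one response; plain sorted order does not.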

Changing the clustering method, the distance measure, and so on will not eliminate the problem.
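One way to see how little information the algorithm has is to tabulate the pairwise distances among the eight profiles. The Python sketch below is an illustration under stated assumptions, not a description of how the SPSS CLUSTER procedure works internally: it uses squared Euclidean distance on 0/1 codes and simply counts how many pairs fall at each distance.

```python
from collections import Counter
from itertools import combinations, product

# All eight profiles for three 0/1 variables.
profiles = list(product((0, 1), repeat=3))

def sq_euclid(x, y):
    """Squared Euclidean distance between two 0/1 profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Number of pairs of profiles at each distance.
counts = Counter(sq_euclid(p, q) for p, q in combinations(profiles, 2))

for distance in sorted(counts):
    print(distance, counts[distance])
# 1 12
# 2 12
# 3 4
```

The 28 pairs take only three distinct distance values, and 12 pairs are tied at the smallest one. With so many ties, the choice of which pair to join first presumably comes down to how ties are broken, which is consistent with the order sensitivity described above.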

The SPSS command syntax below illustrates this. The COMPUTE and SORT CASES commands together specify an arbitrary order; you could sort cases by a, b, or c as well.

================================================


* SPSS Command Syntax to illustrate problems with Binary data.

* Create the eight profiles possible for 3 yes-no questions.
DATA LIST LIST / a (F1.0) b (F1.0) c (F1.0) labl (A8).
BEGIN DATA
0 0 0 '000'
0 0 1 '001'
0 1 1 '011'
0 1 0 '010'
1 1 0 '110'
1 0 0 '100'
1 0 1 '101'
1 1 1 '111'
END DATA.


* Cluster binary data.
CLUSTER a b c
/METHOD BAVERAGE
/MEASURE= SEUCLID
/ID=labl
/PRINT SCHEDULE
/PLOT DENDROGRAM .

* Sort into a different order.
COMPUTE x = UNIFORM(1).
SORT CASES BY x .

* Cluster again.
CLUSTER a b c
/METHOD BAVERAGE
/MEASURE= SEUCLID
/ID=labl
/PRINT SCHEDULE
/PLOT DENDROGRAM .

* Repeat the COMPUTE and SORT CASES commands, then cluster again as desired.


Historical Number

19704

Document Information

Modified date:
16 June 2018

UID

swg21476716