Hierarchical Cluster Analysis Measures for Binary Data
The following similarity and dissimilarity measures are available for binary data:
- Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other.
- Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.
- Size difference. An index of asymmetry. It ranges from 0 to 1.
- Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n**2), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.
- Variance. Computed from a fourfold table as (b+c)/(4n), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.
- Dispersion. This similarity index has a range of −1 to 1.
- Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.
- Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.
- Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.
- Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.
- Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.
- Dice. This is an index in which joint absences are excluded from consideration, and matches are weighted double. Also known as the Czekanowski or Sorensen measure.
- Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.
- Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.
- Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
- Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.
- Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. Also known as the Bray-Curtis nonmetric coefficient.
- Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.
- Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.
- Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.
- Sokal and Sneath 1. This is an index in which double weight is given to matches.
- Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.
- Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
- Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.
- Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.
- Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.
- Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.
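As an illustration of how several of the measures above follow from the fourfold table, here is a minimal Python sketch. It is not the software's implementation. The function names (fourfold, capped_ratio, binary_measures) are hypothetical, the label d for the joint-absence cell is an assumption (the text names only a, b, and c), and the input is assumed to be coded 0/1 with at least one presence. The Jaccard and Dice formulas are the standard forms a/(a+b+c) and 2a/(2a+b+c), which match the verbal descriptions above.

```python
import math

# Value the text says is assigned when a ratio is undefined or exceeds it.
CAP = 9999.999


def fourfold(x, y):
    """Return cell counts (a, b, c, d) for two equal-length 0/1 sequences."""
    pairs = list(zip(x, y))
    a = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)  # present on both
    b = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)  # present on x only
    c = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)  # present on y only
    d = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)  # absent on both
    return a, b, c, d


def capped_ratio(num, den):
    """Ratio that falls back to CAP when undefined or larger than CAP."""
    if den == 0:
        return CAP
    return min(num / den, CAP)


def binary_measures(x, y):
    """Compute a subset of the listed measures (assumes a + b + c > 0)."""
    a, b, c, d = fourfold(x, y)
    n = a + b + c + d
    return {
        "euclidean": math.sqrt(b + c),
        "squared_euclidean": b + c,              # number of discordant cases
        "pattern_difference": b * c / n**2,
        "variance": (b + c) / (4 * n),
        "lance_williams": (b + c) / (2 * a + b + c),
        "simple_matching": (a + d) / n,          # matches / total
        "hamann": ((a + d) - (b + c)) / n,       # (matches - nonmatches) / total
        "jaccard": a / (a + b + c),              # joint absences excluded
        "dice": 2 * a / (2 * a + b + c),         # matches weighted double
        "kulczynski_1": capped_ratio(a, b + c),  # joint presences / nonmatches
        "sokal_sneath_3": capped_ratio(a + d, b + c),  # matches / nonmatches
    }


x = [1, 1, 0, 0, 1, 0, 1, 1]
y = [1, 0, 0, 1, 1, 0, 0, 1]
print(fourfold(x, y))  # (3, 2, 1, 2)
for name, value in binary_measures(x, y).items():
    print(f"{name}: {value:.4f}")
```

Which item supplies b and which supplies c is a labeling convention; every measure in the sketch is symmetric in b and c, so the choice does not affect the results.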
You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
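The effect of the Present and Absent settings can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the function name fourfold_with_codes and its parameters are hypothetical, and skipping a whole pair when either value is neither the Present nor the Absent code is one plausible reading of "ignore all other values", not a documented rule.

```python
def fourfold_with_codes(x, y, present=1, absent=0):
    """Count fourfold-table cells using caller-supplied Present/Absent codes.

    Assumption: a pair is skipped when either value is neither code.
    Returns (a, b, c, d): both present, x only, y only, both absent.
    """
    a = b = c = d = 0
    valid = (present, absent)
    for xi, yi in zip(x, y):
        if xi not in valid or yi not in valid:
            continue  # value outside the two codes: pair is ignored
        if xi == present and yi == present:
            a += 1
        elif xi == present and yi == absent:
            b += 1
        elif xi == absent and yi == present:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Items coded "Y"/"N"; "?" is neither code, so the fourth pair is skipped.
x = ["Y", "N", "Y", "?", "N"]
y = ["Y", "Y", "N", "Y", "N"]
print(fourfold_with_codes(x, y, present="Y", absent="N"))  # (1, 1, 1, 1)
```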