Distances Similarity Measures for Binary Data
The following similarity measures are available for binary data:
- Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.
- Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.
- Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.
- Dice. This is an index in which joint absences are excluded from consideration, and matches are weighted double. Also known as the Czekanowski or Sorensen measure.
- Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.
- Sokal and Sneath 1. This is an index in which double weight is given to matches.
- Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.
- Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
- Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
- Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.
- Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.
- Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of values. It ranges from -1 to 1.
- Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction in error (PRE) when one item is used to predict the other (predicting in both directions). Values range from 0 to 1.
- Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.
- Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table, and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.
- Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.
- Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.
- Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.
- Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.
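- Dispersion. This index is the product of the joint presences and joint absences minus the product of the two nonmatch counts, divided by the squared total number of values. It has a range of -1 to 1.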
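As an illustration of how several of these indexes relate to one another, the following Python sketch computes a subset of them from the four cells of a 2 x 2 table. The cell names (a = joint presences, b and c = the two kinds of nonmatch, d = joint absences), the function names, and the formulas themselves are assumptions based on the standard textbook definitions of these measures; they are not taken from the Distances procedure itself. Only the 9999.999 cap for Sokal and Sneath 3 and Kulczynski 1 comes from the descriptions above; other undefined cases (for example, Jaccard when a, b, and c are all 0) are not handled and raise ZeroDivisionError.

```python
import math

CAP = 9999.999  # value the text says is assigned when a ratio is undefined or larger


def capped_ratio(numerator, denominator):
    """Return numerator/denominator, replaced by CAP when undefined or above CAP."""
    if denominator == 0:
        return CAP
    return min(numerator / denominator, CAP)


def binary_similarities(a, b, c, d):
    """Compute selected binary similarity indexes from 2 x 2 cell counts.

    a: present in both items     b: present in the first item only
    c: present in the second only  d: absent in both items
    (hypothetical naming, chosen for this sketch)
    """
    n = a + b + c + d          # total number of values compared
    matches = a + d
    nonmatches = b + c
    cross = a * d - b * c      # numerator shared by Yule's Q, phi, and dispersion
    return {
        "russel_rao": a / n,
        "simple_matching": matches / n,
        "jaccard": a / (a + nonmatches),
        "dice": 2 * a / (2 * a + nonmatches),
        "rogers_tanimoto": matches / (matches + 2 * nonmatches),
        "sokal_sneath_1": 2 * matches / (2 * matches + nonmatches),
        "sokal_sneath_2": a / (a + 2 * nonmatches),
        "sokal_sneath_3": capped_ratio(matches, nonmatches),
        "kulczynski_1": capped_ratio(a, nonmatches),
        "hamann": (matches - nonmatches) / n,
        "yule_q": cross / (a * d + b * c),
        "yule_y": (math.sqrt(a * d) - math.sqrt(b * c))
        / (math.sqrt(a * d) + math.sqrt(b * c)),
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "phi": cross / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
        "dispersion": cross / n ** 2,
    }


scores = binary_similarities(3, 1, 1, 5)
print(scores["jaccard"], scores["dice"])  # 0.6 0.75
```

With a = 3, b = 1, c = 1, and d = 5, Jaccard ignores the five joint absences and gives 3/5, whereas simple matching counts them and gives 8/10. When b and c are both 0, the two ratio-based indexes return the 9999.999 cap rather than dividing by zero.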
You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
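The effect of the Present and Absent values can be sketched in Python as a tally of the four cells of a 2 x 2 table for a pair of items. The function name, the parameter names, and the default codes of 1 and 0 are hypothetical and chosen for this illustration. The sketch also assumes that "ignore" means a position is skipped whenever either item holds a value other than the two specified codes; the text above does not spell out that detail.

```python
def tally_2x2(x, y, present=1, absent=0):
    """Count joint presences, the two kinds of nonmatch, and joint absences.

    Returns (a, b, c, d):
      a: present in both x and y    b: present in x only
      c: present in y only          d: absent in both
    Positions where either value is neither `present` nor `absent` are skipped
    (an assumption about how other values are ignored).
    """
    a = b = c = d = 0
    valid = (present, absent)
    for xi, yi in zip(x, y):
        if xi not in valid or yi not in valid:
            continue  # value is not one of the two recognised codes
        if xi == present and yi == present:
            a += 1
        elif xi == present:
            b += 1
        elif yi == present:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Codes 9 and 2 are neither present (1) nor absent (0), so the last two positions are skipped.
print(tally_2x2([1, 1, 0, 0, 9, 1], [1, 0, 1, 0, 1, 2]))  # (1, 1, 1, 1)
```

Passing different codes, for example present="Y" and absent="N", changes which values are tallied without altering the data.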