Distances: Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

  • Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other.
  • Squared Euclidean distance. Computed as the number of discordant cases, that is, cases present on one item but absent on the other. Its minimum value is 0, and it has no upper limit.
  • Size difference. An index of asymmetry. It ranges from 0 to 1.
  • Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n**2), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.
  • Variance. Computed from a fourfold table as (b+c)/(4n), where b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.
  • Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.
  • Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)
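
The formulas in the list above can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure's own implementation. The function names fourfold and binary_dissimilarities are hypothetical, and the cell labels assume a = present on both items, b = present on the first only, c = present on the second only, d = absent on both. The list gives no formula for size difference or shape; the expressions used here, (b-c)**2/(n**2) and (n(b+c) - (b-c)**2)/(n**2), are the forms commonly documented for these measures and should be treated as an assumption.

```python
from math import sqrt


def fourfold(x, y):
    """Count the cells (a, b, c, d) of the fourfold table for two 0/1 sequences."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            a += 1  # present on both items
        elif xi == 1 and yi == 0:
            b += 1  # present on the first item only
        elif xi == 0 and yi == 1:
            c += 1  # present on the second item only
        elif xi == 0 and yi == 0:
            d += 1  # absent on both items
    return a, b, c, d


def binary_dissimilarities(a, b, c, d):
    """Return the dissimilarity measures for one fourfold table as a dict."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("the fourfold table is empty")
    lw_denominator = 2 * a + b + c
    return {
        "euclidean": sqrt(b + c),
        "squared_euclidean": b + c,
        "pattern_difference": b * c / n**2,
        "variance": (b + c) / (4 * n),
        # Undefined when neither item is ever present; NaN marks that case.
        "lance_williams": (b + c) / lw_denominator if lw_denominator else float("nan"),
        # Assumed formulas (not given in the list above).
        "size_difference": (b - c) ** 2 / n**2,
        "shape": (n * (b + c) - (b - c) ** 2) / n**2,
    }


if __name__ == "__main__":
    item1 = [1, 1, 0, 0, 1, 0, 1, 1]
    item2 = [1, 0, 0, 1, 1, 0, 0, 1]
    table = fourfold(item1, item2)  # a=3, b=2, c=1, d=2
    for name, value in binary_dissimilarities(*table).items():
        print(f"{name}: {value:.4f}")
```

For the two example items the table is a=3, b=2, c=1, d=2, so the squared Euclidean distance is 3 and the Lance and Williams measure is 3/9.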

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
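
The effect of the Present and Absent settings can be illustrated with a short Python sketch. The function name fourfold_with_codes and its parameters are hypothetical. The sketch also assumes that "ignore" means a case is skipped whenever either item holds a value other than the Present or Absent code; the text above does not spell out this detail.

```python
def fourfold_with_codes(x, y, present=1, absent=0):
    """Build the fourfold table (a, b, c, d) using custom present/absent codes."""
    valid = (present, absent)
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        # Assumption: a case with any other value on either item is skipped.
        if xi not in valid or yi not in valid:
            continue
        if xi == present and yi == present:
            a += 1
        elif xi == present:
            b += 1  # present on the first item, absent on the second
        elif yi == present:
            c += 1  # absent on the first item, present on the second
        else:
            d += 1
    return a, b, c, d


if __name__ == "__main__":
    item1 = ["Y", "N", "Y", "?", "N"]
    item2 = ["Y", "Y", "N", "N", "N"]
    # The fourth case is skipped because "?" is neither code.
    print(fourfold_with_codes(item1, item2, present="Y", absent="N"))
```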