Measures for Binary Data (PROXIMITIES command)

Different binary measures emphasize different aspects of the relationship between sets of binary values. However, all measures are specified in the same way. Each measure has two optional integer-valued parameters, p (present) and np (not present).

  • If both parameters are specified, PROXIMITIES uses the value of the first parameter as an indicator that a characteristic is present, and PROXIMITIES uses the value of the second parameter as an indicator that a characteristic is absent. PROXIMITIES skips all other values.
  • If only the first parameter is specified, PROXIMITIES uses that value to indicate presence and uses all other values to indicate absence.
  • If no parameters are specified, PROXIMITIES assumes that 1 indicates presence and 0 indicates absence.
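The three coding rules above can be sketched in Python. This is only an illustration, not PROXIMITIES code: the function name `code_value` and its return convention (True, False, None) are hypothetical, and the treatment of values other than 0 and 1 in the no-parameter case is an assumption (they are skipped here, as they would be with an explicit (1,0)).

```python
from typing import Optional


def code_value(value: int, p: Optional[int] = None,
               np: Optional[int] = None) -> Optional[bool]:
    """Return True (present), False (absent), or None (value is skipped)."""
    if p is None:
        # No parameters: 1 indicates presence and 0 indicates absence.
        # Assumption: any other value is skipped.
        p, np = 1, 0
    if value == p:
        return True
    if np is None:
        # Only p was given: every other value indicates absence.
        return False
    # Both p and np were given: values matching neither are skipped.
    return False if value == np else None


print(code_value(2, p=2))        # True: 2 is the presence indicator
print(code_value(7, p=2))        # False: all other values indicate absence
print(code_value(3, p=1, np=2))  # None: neither indicator, so skipped
```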

Using the indicators for presence and absence within each item (case or variable), PROXIMITIES constructs a 2×2 contingency table for each pair of items and uses this table to compute a proximity measure for the pair.

Table 1. 2 x 2 contingency table
                                Item 2 characteristics Present  Item 2 characteristics Absent
Item 1 characteristics Present  a                               b
Item 1 characteristics Absent   c                               d

PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallied across variables (when the items are cases) or cases (when the items are variables). For example, if variables V, W, X, Y, Z have values 0, 1, 1, 0, 1 for case 1 and have values 0, 1, 1, 0, 0 for case 2 (where 1 indicates presence and 0 indicates absence), the contingency table is as follows:

                                Case 2 characteristics Present  Case 2 characteristics Absent
Case 1 characteristics Present  2                               1
Case 1 characteristics Absent   0                               2

The contingency table indicates that the characteristic is present in both cases for two variables (W and X), absent in both cases for two variables (V and Y), and present in case 1 but absent in case 2 for one variable (Z). There are no variables for which the characteristic is absent in case 1 and present in case 2.
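The tally for this example can be reproduced with a short Python sketch. It assumes the default coding (1 indicates presence, 0 indicates absence); the function name `contingency` is hypothetical and not part of PROXIMITIES.

```python
def contingency(item1, item2):
    """Tally the 2 x 2 table (a, b, c, d) for two items coded 1/0."""
    a = b = c = d = 0
    for x, y in zip(item1, item2):
        if x == 1 and y == 1:
            a += 1          # present in both items
        elif x == 1 and y == 0:
            b += 1          # present in item 1, absent in item 2
        elif x == 0 and y == 1:
            c += 1          # absent in item 1, present in item 2
        elif x == 0 and y == 0:
            d += 1          # absent in both items
    return a, b, c, d


# Values of V, W, X, Y, Z for the two cases in the example.
case1 = [0, 1, 1, 0, 1]
case2 = [0, 1, 1, 0, 0]
print(contingency(case1, case2))  # (2, 1, 0, 2)
```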

The available binary measures include matching coefficients, conditional probabilities, predictability measures, and other measures.

Matching Coefficients. The following table shows a classification scheme for PROXIMITIES matching coefficients. In this scheme, matches are joint presences (value a in the contingency table) or joint absences (value d). Non-matches are equal in number to value b plus value c. Matches and non-matches may be weighted equally or not. The three coefficients JACCARD, DICE, and SS2 are related monotonically, as are SM, SS1, and RT. All coefficients in the table are similarity measures, and all coefficients except K1 and SS3 range from 0 to 1. K1 and SS3 have a minimum value of 0 and have no upper limit.

Table 2. Binary matching coefficients in PROXIMITIES
                                                                                    Joint absences excluded from numerator  Joint absences included in numerator
All matches included in denominator, equal weight for matches and non-matches       RR                                      SM
All matches included in denominator, double weight for matches                      -                                       SS1
All matches included in denominator, double weight for non-matches                  -                                       RT
Joint absences excluded from denominator, equal weight for matches and non-matches  JACCARD                                 -
Joint absences excluded from denominator, double weight for matches                 DICE                                    -
Joint absences excluded from denominator, double weight for non-matches             SS2                                     -
All matches excluded from denominator, equal weight for matches and non-matches     K1                                      SS3

RR[(p[,np])]. Russell and Rao similarity measure. This measure is the binary dot product.

SM[(p[,np])]. Simple matching similarity measure. This measure is the ratio of the number of matches to the total number of characteristics.

JACCARD[(p[,np])]. Jaccard similarity measure. This measure is also known as the similarity ratio.

DICE[(p[,np])]. Dice (or Czekanowski or Sorensen) similarity measure.

SS1[(p[,np])]. Sokal and Sneath similarity measure 1.

RT[(p[,np])]. Rogers and Tanimoto similarity measure.

SS2[(p[,np])]. Sokal and Sneath similarity measure 2.

K1[(p[,np])]. Kulczynski similarity measure 1. This measure has a minimum value of 0 and no upper limit. The measure is undefined when there are no non-matches (b=0 and c=0).

SS3[(p[,np])]. Sokal and Sneath similarity measure 3. This measure has a minimum value of 0 and no upper limit. The measure is undefined when there are no non-matches (b=0 and c=0).
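The text above does not state the formulas, so the following Python sketch uses the commonly cited definitions of these coefficients in terms of a, b, c, and d; treat them as assumptions rather than as the exact PROXIMITIES computation. The function name `matching_coefficients` is hypothetical, and the values a=2, b=1, c=0, d=2 are those of the worked example in this section.

```python
def matching_coefficients(a, b, c, d):
    """Commonly cited formulas for the nine matching coefficients."""
    n = a + b + c + d   # total number of characteristics
    m = a + d           # matches (joint presences + joint absences)
    u = b + c           # non-matches
    return {
        "RR": a / n,
        "SM": m / n,
        "JACCARD": a / (a + u),
        "DICE": 2 * a / (2 * a + u),
        "SS1": 2 * m / (2 * m + u),
        "RT": m / (m + 2 * u),
        "SS2": a / (a + 2 * u),
        # K1 and SS3 are undefined when there are no non-matches.
        "K1": a / u if u else None,
        "SS3": m / u if u else None,
    }


for name, value in matching_coefficients(2, 1, 0, 2).items():
    print(name, value)
```

With the example table, RR is 0.4, SM is 0.8, and K1 is 2.0, which shows K1 exceeding the 0 to 1 range that bounds the other coefficients.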

Conditional Probabilities. The following binary measures yield values that can be interpreted in terms of conditional probability. All three measures are similarity measures.

K2[(p[,np])]. Kulczynski similarity measure 2. This measure yields the average conditional probability that a characteristic is present in one item given that the characteristic is present in the other item. The measure is an average over both items that are acting as predictors. The measure has a range of 0 to 1.

SS4[(p[,np])]. Sokal and Sneath similarity measure 4. This measure yields the conditional probability that a characteristic of one item is in the same state (presence or absence) as the characteristic of the other item. The measure is an average over both items that are acting as predictors. The measure has a range of 0 to 1.

HAMANN[(p[,np])]. Hamann similarity measure. This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of −1 to +1 and is monotonically related to SM, SS1, and RT.
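As an illustration of the three measures above, the following Python sketch uses their commonly cited formulas, which are assumptions here because the text does not state them. The function name `conditional_measures` is hypothetical, and no guard is included for tables with an empty row or column, where K2 and SS4 would divide by zero.

```python
def conditional_measures(a, b, c, d):
    """Commonly cited formulas for K2, SS4, and HAMANN."""
    n = a + b + c + d
    return {
        # Average of P(present in item 2 | present in item 1) and the reverse.
        "K2": (a / (a + b) + a / (a + c)) / 2,
        # Average of the four conditional probabilities of agreement.
        "SS4": (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)) / 4,
        # P(same state) minus P(different state).
        "HAMANN": ((a + d) - (b + c)) / n,
    }


# a=2, b=1, c=0, d=2 are the values from the worked example in this section.
result = conditional_measures(2, 1, 0, 2)
print(result["HAMANN"])  # 0.6
```

Under these formulas HAMANN equals 2 × SM − 1 (here 2 × 0.8 − 1 = 0.6), which is one way to see the monotonic relationship with SM.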

Predictability Measures. The following four binary measures assess the association between items as the predictability of one item given the other item. All four measures yield similarities.

LAMBDA[(p[,np])]. Goodman and Kruskal’s lambda (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. Specifically, LAMBDA measures the proportional reduction in error, using one item to predict the other item when the directions of prediction are of equal importance. LAMBDA has a range of 0 to 1.

D[(p[,np])]. Anderberg’s D (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. D measures the actual reduction in the error probability when one item is used to predict the other item. The range of D is 0 to 1.

Y[(p[,np])]. Yule’s Y coefficient of colligation (similarity). This measure is a function of the cross ratio for a 2×2 table and has a range of −1 to +1.

Q[(p[,np])]. Yule’s Q (similarity). This measure is the 2×2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross ratio for a 2×2 table and has a range of −1 to +1.
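A minimal Python sketch of the four predictability measures follows, using their commonly cited 2×2 formulas; these formulas are assumptions, since the text above does not give them. The function name `predictability_measures` and the intermediate names `t1` and `t2` are hypothetical. Degenerate tables (for example, ad and bc both zero) are not guarded against.

```python
import math


def predictability_measures(a, b, c, d):
    """Commonly cited formulas for LAMBDA, D, Y, and Q."""
    n = a + b + c + d
    # Correct predictions when each item is predicted from the other.
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    # Correct predictions when each item is predicted from its own margin.
    t2 = max(a + c, b + d) + max(a + b, c + d)
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    return {
        "LAMBDA": (t1 - t2) / (2 * n - t2),  # proportional reduction in error
        "D": (t1 - t2) / (2 * n),            # actual reduction in error probability
        "Y": (root_ad - root_bc) / (root_ad + root_bc),
        "Q": (a * d - b * c) / (a * d + b * c),
    }


# a=2, b=1, c=0, d=2 are the values from the worked example in this section.
print(predictability_measures(2, 1, 0, 2))
```

For the example table, t1 is 8 and t2 is 6, so LAMBDA is 2/4 = 0.5 and D is 2/10 = 0.2. Because c is 0, the cross ratio is unbounded and both Y and Q reach their maximum of 1.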

Other Binary Measures. The remaining binary measures that are available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relationship between items.

OCHIAI[(p[,np])]. Ochiai similarity measure. This measure is the binary form of the cosine and has a range of 0 to 1.

SS5[(p[,np])]. Sokal and Sneath similarity measure 5. The range is 0 to 1.

PHI[(p[,np])]. Fourfold point correlation (similarity). This measure is the binary form of the Pearson product-moment correlation coefficient.

BEUCLID[(p[,np])]. Binary Euclidean distance. This measure is a distance measure. Its minimum value is 0, and it has no upper limit.

BSEUCLID[(p[,np])]. Binary squared Euclidean distance. This measure is a distance measure. Its minimum value is 0, and it has no upper limit.

SIZE[(p[,np])]. Size difference. This measure is a dissimilarity measure with a minimum value of 0 and no upper limit.

PATTERN[(p[,np])]. Pattern difference. This measure is a dissimilarity measure. The range is 0 to 1.

BSHAPE[(p[,np])]. Binary shape difference. This dissimilarity measure has no upper limit or lower limit.

DISPER[(p[,np])]. Dispersion similarity measure. The range is −1 to +1.

VARIANCE[(p[,np])]. Variance dissimilarity measure. This measure has a minimum value of 0 and no upper limit.

BLWMN[(p[,np])]. Binary Lance-and-Williams nonmetric dissimilarity measure. This measure is also known as the Bray-Curtis nonmetric coefficient. The range is 0 to 1.
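For reference, the remaining measures can be sketched in Python from their commonly cited formulas. The text above does not state these formulas, so they are assumptions and may differ in detail from what PROXIMITIES computes; the function name `other_binary_measures` is hypothetical. The sketch does not guard against tables with an empty row or column.

```python
import math


def other_binary_measures(a, b, c, d):
    """Commonly cited formulas for the remaining binary measures."""
    n = a + b + c + d
    # Product of the four marginal totals of the 2 x 2 table.
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    return {
        "OCHIAI": math.sqrt((a / (a + b)) * (a / (a + c))),
        "SS5": a * d / math.sqrt(margins),
        "PHI": (a * d - b * c) / math.sqrt(margins),
        "BEUCLID": math.sqrt(b + c),
        "BSEUCLID": b + c,
        "SIZE": (b - c) ** 2 / n ** 2,
        "PATTERN": b * c / n ** 2,
        "BSHAPE": (n * (b + c) - (b - c) ** 2) / n ** 2,
        "DISPER": (a * d - b * c) / n ** 2,
        "VARIANCE": (b + c) / (4 * n),
        "BLWMN": (b + c) / (2 * a + b + c),
    }


# a=2, b=1, c=0, d=2 are the values from the worked example in this section.
for name, value in other_binary_measures(2, 1, 0, 2).items():
    print(name, value)
```

In the example table the two cases differ on a single variable, so BSEUCLID and BEUCLID are both 1, and PATTERN is 0 because c is 0.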

Example

PROXIMITIES A B C
  /MEASURE=RR(1,2).
  • MEASURE computes Russell and Rao coefficients from data in which 1 indicates the presence of a characteristic and 2 indicates the absence. Other values are ignored.

Example

PROXIMITIES A B C
  /MEASURE=SM(2).
  • MEASURE computes simple matching coefficients from data in which 2 indicates presence and all other values indicate absence.