Measures for Binary Data (CLUSTER command)

Different binary measures emphasize different aspects of the relationship between sets of binary values. However, all the measures are specified in the same way. Each measure has two optional integer-valued parameters, p (present) and np (not present).

If both parameters are specified, CLUSTER uses the value of the first as an indicator that a characteristic is present and the value of the second as an indicator that a characteristic is absent. CLUSTER skips all other values.
If only the first parameter is specified, CLUSTER uses that value to indicate presence and all other values to indicate absence.
If no parameters are specified, CLUSTER assumes that 1 indicates presence and 0 indicates absence.
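
The three rules above can be sketched in Python as follows. This is only an illustration: the function name classify and the argument name np_ are hypothetical and are not part of CLUSTER. Treating the no-parameter case as equivalent to p=1 and np=0 (so that values other than 1 and 0 are skipped) is an assumption that the text does not state.

```python
from typing import Optional


def classify(value: int, p: Optional[int] = None,
             np_: Optional[int] = None) -> Optional[bool]:
    """Return True for presence, False for absence, None for a skipped value."""
    if p is None:
        # No parameters: 1 indicates presence and 0 indicates absence.
        # Assumption: this behaves like p=1, np=0, so other values are skipped.
        p, np_ = 1, 0
    if value == p:
        return True
    if np_ is None:
        # Only p given: every other value indicates absence.
        return False
    if value == np_:
        return False
    # Both p and np given: any other value is skipped.
    return None


print(classify(1))        # default coding, presence
print(classify(5, p=2))   # only p given, so 5 counts as absence
print(classify(3, 1, 2))  # neither p nor np, so the value is skipped
```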

Using the indicators for presence and absence within each item (case or variable), CLUSTER constructs a 2 x 2 contingency table for each pair of items in turn. It uses this table to compute a proximity measure for the pair.

Table 1. 2 x 2 contingency table
	Item 2 characteristics Present	Item 2 characteristics Absent
Item 1 characteristics Present	a	b
Item 1 characteristics Absent	c	d

CLUSTER computes all binary measures from the values of a, b, c, and d. These values are tallied across variables (when the items are cases) or across cases (when the items are variables). For example, if the variables V, W, X, Y, Z have values 0, 1, 1, 0, 1 for case 1 and values 0, 1, 1, 0, 0 for case 2 (where 1 indicates presence and 0 indicates absence), the contingency table is as follows:

Table 2. 2 x 2 contingency table
	Case 2 characteristics Present	Case 2 characteristics Absent
Case 1 characteristics Present	2	1
Case 1 characteristics Absent	0	2

The contingency table indicates that the characteristic is present in both cases for two variables (W and X), absent in both cases for two variables (V and Y), and present in case 1 but absent in case 2 for one variable (Z). There are no variables for which the characteristic is absent in case 1 and present in case 2.
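
The tally for this example can be reproduced with a minimal Python sketch. The function name contingency is hypothetical, and the sketch assumes the data are already coded so that one value marks presence and every other value marks absence.

```python
def contingency(item1, item2, present=1):
    """Tally the 2 x 2 table (a, b, c, d) for two equally long sequences."""
    a = b = c = d = 0
    for x, y in zip(item1, item2):
        if x == present and y == present:
            a += 1  # present in both items
        elif x == present:
            b += 1  # present in item 1 only
        elif y == present:
            c += 1  # present in item 2 only
        else:
            d += 1  # absent from both items
    return a, b, c, d


# Values of V, W, X, Y, Z for the two cases in the example.
case1 = [0, 1, 1, 0, 1]
case2 = [0, 1, 1, 0, 0]
print(contingency(case1, case2))  # (2, 1, 0, 2)
```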

The available binary measures include matching coefficients, conditional probabilities, predictability measures, and others.

Matching Coefficients. The tables below show a classification scheme for matching coefficients. In this scheme, matches are joint presences (value a in the contingency table) or joint absences (value d). The number of non-matches is value b plus value c. Matches and non-matches may or may not be weighted equally. The three coefficients JACCARD, DICE, and SS2 are related monotonically, as are SM, SS1, and RT. All coefficients in the tables are similarity measures, and all except two (K1 and SS3) range from 0 to 1. K1 and SS3 have a minimum value of 0 and no upper limit.

Table 3. Binary matching coefficients, all matches included in denominator
	Joint absences excluded from numerator	Joint absences included in numerator
Equal weight for matches and non-matches	RR	SM
Double weight for matches		SS1
Double weight for non-matches		RT

Table 4. Binary matching coefficients, joint absences excluded from denominator
	Joint absences excluded from numerator	Joint absences included in numerator
Equal weight for matches and non-matches	JACCARD
Double weight for matches	DICE
Double weight for non-matches	SS2

Table 5. Binary matching coefficients, all matches excluded from denominator
	Joint absences excluded from numerator	Joint absences included in numerator
Equal weight for matches and non-matches	K1	SS3

RR[(p[,np])]. Russell and Rao similarity measure. This is the binary dot product.

SM[(p[,np])]. Simple matching similarity measure. This is the ratio of the number of matches to the total number of characteristics.

JACCARD[(p[,np])]. Jaccard similarity measure. This is also known as the similarity ratio.

DICE[(p[,np])]. Dice (or Czekanowski or Sørensen) similarity measure.

SS1[(p[,np])]. Sokal and Sneath similarity measure 1.

RT[(p[,np])]. Rogers and Tanimoto similarity measure.

SS2[(p[,np])]. Sokal and Sneath similarity measure 2.

K1[(p[,np])]. Kulczynski similarity measure 1. This measure has a minimum value of 0 and no upper limit. It is undefined when there are no non-matches (b=0 and c=0).

SS3[(p[,np])]. Sokal and Sneath similarity measure 3. This measure has a minimum value of 0 and no upper limit. It is undefined when there are no non-matches (b=0 and c=0).
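
The classification scheme in Tables 3 to 5 can be written out as a short Python sketch. The formulas follow from the weighting and denominator rules in those tables, but they are not quoted from the text above, so treat them as assumptions to check against the SPSS algorithms documentation. The names ratio and matching_coefficients are hypothetical. Undefined values (zero denominators, as described for K1 and SS3) are returned as None.

```python
def ratio(num, den):
    """Return num / den, or None when the denominator is zero (undefined)."""
    return num / den if den else None


def matching_coefficients(a, b, c, d):
    """Matching coefficients computed from the 2 x 2 table values."""
    n = a + b + c + d   # total number of characteristics
    non = b + c         # number of non-matches
    return {
        # All matches included in the denominator (Table 3).
        "RR": ratio(a, n),
        "SM": ratio(a + d, n),
        "SS1": ratio(2 * (a + d), 2 * (a + d) + non),
        "RT": ratio(a + d, a + d + 2 * non),
        # Joint absences excluded from the denominator (Table 4).
        "JACCARD": ratio(a, a + non),
        "DICE": ratio(2 * a, 2 * a + non),
        "SS2": ratio(a, a + 2 * non),
        # All matches excluded from the denominator (Table 5).
        "K1": ratio(a, non),
        "SS3": ratio(a + d, non),
    }


# Table 2 example: a=2, b=1, c=0, d=2.
for name, value in matching_coefficients(2, 1, 0, 2).items():
    print(name, value)
```

For the Table 2 example this gives SM = 0.8 and K1 = 2.0. The value of K1 illustrates why K1 and SS3 are not confined to the range 0 to 1.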

Conditional Probabilities. The following binary measures yield values that can be interpreted in terms of conditional probability. All three are similarity measures.

K2[(p[,np])]. Kulczynski similarity measure 2. This yields the average conditional probability that a characteristic is present in one item given that the characteristic is present in the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.

SS4[(p[,np])]. Sokal and Sneath similarity measure 4. This yields the conditional probability that a characteristic of one item is in the same state (presence or absence) as the characteristic of the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.

HAMANN[(p[,np])]. Hamann similarity measure. This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of −1 to +1 and is monotonically related to SM, SS1, and RT.
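
As a hedged illustration, the three measures can be sketched in Python using the formulas commonly given for them. These formulas are not stated in the text above and should be verified against the SPSS algorithms documentation. The function name is hypothetical, and the sketch assumes that every row and column total of the 2 x 2 table is nonzero.

```python
def conditional_probability_measures(a, b, c, d):
    """K2, SS4, and HAMANN from the 2 x 2 table values (nonzero marginals)."""
    n = a + b + c + d
    # Average of P(present in 2 | present in 1) and P(present in 1 | present in 2).
    k2 = (a / (a + b) + a / (a + c)) / 2
    # Average of the four conditional probabilities of agreement,
    # two for joint presence and two for joint absence.
    ss4 = (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)) / 4
    # P(same state) minus P(different state).
    hamann = ((a + d) - (b + c)) / n
    return {"K2": k2, "SS4": ss4, "HAMANN": hamann}


# Table 2 example: a=2, b=1, c=0, d=2.
result = conditional_probability_measures(2, 1, 0, 2)
print(result)

# HAMANN equals 2 * SM - 1, which is why it is monotonically related to SM.
sm = (2 + 2) / 5
print(abs(result["HAMANN"] - (2 * sm - 1)) < 1e-12)  # True
```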

Predictability Measures. The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.

LAMBDA[(p[,np])]. Goodman and Kruskal’s lambda (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. Specifically, LAMBDA measures the proportional reduction in error using one item to predict the other when the directions of prediction are of equal importance. LAMBDA has a range of 0 to 1.

D[(p[,np])]. Anderberg’s D (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other. D measures the actual reduction in the error probability when one item is used to predict the other. The range of D is 0 to 1.

Y[(p[,np])]. Yule’s Y coefficient of colligation (similarity). This is a function of the cross-ratio for a 2 x 2 table. It has a range of −1 to +1.

Q[(p[,np])]. Yule’s Q (similarity). This is the 2 x 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-ratio for a 2 x 2 table and has a range of −1 to +1.
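
The four predictability measures can be sketched in Python as follows. The formulas are the ones commonly cited for these coefficients (with t1 and t2 as intermediate sums of maxima) and are not given in the text above, so they are assumptions to verify against the SPSS algorithms documentation. The function name is hypothetical, and the sketch assumes the denominators are nonzero (in particular, a*d + b*c > 0).

```python
import math


def predictability_measures(a, b, c, d):
    """LAMBDA, D, Y, and Q from the 2 x 2 table values (nonzero denominators)."""
    n = a + b + c + d
    # Correct predictions when the other item's state is known...
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    # ...and when it is not known (predict the more frequent state).
    t2 = max(a + c, b + d) + max(a + b, c + d)
    lam = (t1 - t2) / (2 * n - t2)   # proportional reduction in error
    big_d = (t1 - t2) / (2 * n)      # actual reduction in error probability
    ad, bc = a * d, b * c            # the cross-ratio is ad / bc
    q = (ad - bc) / (ad + bc)
    y = (math.sqrt(ad) - math.sqrt(bc)) / (math.sqrt(ad) + math.sqrt(bc))
    return {"LAMBDA": lam, "D": big_d, "Y": y, "Q": q}


# Table 2 example: a=2, b=1, c=0, d=2.
print(predictability_measures(2, 1, 0, 2))
```

For the Table 2 example, c is 0, so b*c is 0 and both Y and Q reach their upper limit of +1, while LAMBDA is 0.5 and D is 0.2.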

Other Binary Measures. The remaining binary measures available in CLUSTER are either binary equivalents of association measures for continuous variables or measures of special properties of the relationship between items.

OCHIAI[(p[,np])]. Ochiai similarity measure. This is the binary form of the cosine. It has a range of 0 to 1.

SS5[(p[,np])]. Sokal and Sneath similarity measure 5. The range is 0 to 1.

PHI[(p[,np])]. Fourfold point correlation (similarity). This is the binary form of the Pearson product-moment correlation coefficient.

BEUCLID[(p[,np])]. Binary Euclidean distance. This is a distance measure. Its minimum value is 0, and it has no upper limit.

BSEUCLID[(p[,np])]. Binary squared Euclidean distance. This is a distance measure. Its minimum value is 0, and it has no upper limit.

SIZE[(p[,np])]. Size difference. This is a dissimilarity measure with a minimum value of 0 and no upper limit.

PATTERN[(p[,np])]. Pattern difference. This is a dissimilarity measure. The range is 0 to 1.

BSHAPE[(p[,np])]. Binary shape difference. This dissimilarity measure has no upper or lower limit.

DISPER[(p[,np])]. Dispersion similarity measure. The range is −1 to +1.

VARIANCE[(p[,np])]. Variance dissimilarity measure. This measure has a minimum value of 0 and no upper limit.

BLWMN[(p[,np])]. Binary Lance-and-Williams nonmetric dissimilarity measure. This measure is also known as the Bray-Curtis nonmetric coefficient. The range is 0 to 1.
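
A few of these measures can be sketched in Python to show how they derive from the same 2 x 2 table. The formulas are the commonly cited ones and do not appear in the text above, so treat them as assumptions and check them against the SPSS algorithms documentation. The function name is hypothetical, SIZE and BSHAPE are omitted, and the sketch assumes nonzero denominators.

```python
import math


def other_binary_measures(a, b, c, d):
    """Selected other binary measures from the 2 x 2 table values."""
    n = a + b + c + d
    # Product of the four marginal totals, used by SS5 and PHI.
    marginals = (a + b) * (a + c) * (b + d) * (c + d)
    return {
        "OCHIAI": a / math.sqrt((a + b) * (a + c)),    # binary cosine
        "SS5": (a * d) / math.sqrt(marginals),
        "PHI": (a * d - b * c) / math.sqrt(marginals), # binary Pearson r
        "BEUCLID": math.sqrt(b + c),                   # distance
        "BSEUCLID": b + c,                             # squared distance
        "PATTERN": (b * c) / n ** 2,
        "DISPER": (a * d - b * c) / n ** 2,
        "VARIANCE": (b + c) / (4 * n),
        "BLWMN": (b + c) / (2 * a + b + c),
    }


# Table 2 example: a=2, b=1, c=0, d=2.
for name, value in other_binary_measures(2, 1, 0, 2).items():
    print(name, value)
```

Under these formulas BLWMN is the complement of DICE (0.2 versus 0.8 for the Table 2 example), and BSEUCLID is simply the number of non-matches.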