Measures for Binary Data (CLUSTER command)
Different binary measures emphasize different aspects of the relationship between sets of binary values. However, all the measures are specified in the same way. Each measure has two optional integer-valued parameters, p (present) and np (not present).
- If both parameters are specified, CLUSTER uses the value of the first as an indicator that a characteristic is present and the value of the second as an indicator that a characteristic is absent. CLUSTER skips all other values.
- If only the first parameter is specified, CLUSTER uses that value to indicate presence and all other values to indicate absence.
- If no parameters are specified, CLUSTER assumes that 1 indicates presence and 0 indicates absence.
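The three rules above can be sketched as a small Python function. This illustrates the documented behavior and is not CLUSTER's own code: the function name `classify` and the argument names `present` and `absent` (standing in for p and np) are hypothetical. It also assumes that, when no parameters are given, values other than 1 and 0 are skipped.

```python
def classify(value, present=None, absent=None):
    """Return True (present), False (absent), or None (value is skipped).

    `present` and `absent` are hypothetical stand-ins for p and np.
    """
    if present is None:
        # No parameters: 1 indicates presence and 0 indicates absence.
        # Assumption: any other value is skipped, as with explicit (1,0).
        present, absent = 1, 0
    elif absent is None:
        # Only the first parameter: every other value counts as absence.
        return value == present
    if value == present:
        return True
    if value == absent:
        return False
    return None  # both parameters known: all other values are skipped


print(classify(1))              # True  (default indicators)
print(classify(0, present=3))   # False (anything but 3 is absence)
print(classify(5, 2, 1))        # None  (neither indicator, skipped)
```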
Using the indicators for presence and absence within each item (case or variable), CLUSTER constructs a 2 x 2 contingency table for each pair of items in turn. It uses this table to compute a proximity measure for the pair.
| | Item 2 characteristics Present | Item 2 characteristics Absent |
|---|---|---|
| Item 1 characteristics Present | a | b |
| Item 1 characteristics Absent | c | d |
CLUSTER computes all binary measures from the values of a, b, c, and d. These values are tallied across variables (when the items are cases) or across cases (when the items are variables). For example, if the variables V, W, X, Y, Z have values 0, 1, 1, 0, 1 for case 1 and values 0, 1, 1, 0, 0 for case 2 (where 1 indicates presence and 0 indicates absence), the contingency table is as follows:
| | Case 2 characteristics Present | Case 2 characteristics Absent |
|---|---|---|
| Case 1 characteristics Present | 2 | 1 |
| Case 1 characteristics Absent | 0 | 2 |
The contingency table indicates that the characteristic is present in both cases for two variables (W and X), absent in both cases for two variables (V and Y), and present in case 1 but absent in case 2 for one variable (Z). There are no variables for which the characteristic is absent in case 1 and present in case 2.
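The tally for this example can be reproduced with a short Python sketch. The function name `contingency` and its arguments are hypothetical. The sketch assumes that a position is skipped when either item holds a value that is neither indicator.

```python
def contingency(item1, item2, present=1, absent=0):
    """Tally the 2 x 2 table cells (a, b, c, d) for two items."""
    a = b = c = d = 0
    for x, y in zip(item1, item2):
        # Assumption: skip the position if either value is neither indicator.
        if x not in (present, absent) or y not in (present, absent):
            continue
        if x == present and y == present:
            a += 1          # joint presence
        elif x == present:
            b += 1          # present in item 1 only
        elif y == present:
            c += 1          # present in item 2 only
        else:
            d += 1          # joint absence
    return a, b, c, d


# Values of V, W, X, Y, Z for the two cases in the example.
case1 = [0, 1, 1, 0, 1]
case2 = [0, 1, 1, 0, 0]
print(contingency(case1, case2))  # (2, 1, 0, 2)
```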
The available binary measures include matching coefficients, conditional probabilities, predictability measures, and others.
Matching Coefficients. The table below shows a classification scheme for matching coefficients. In this scheme, matches are joint presences (value a in the contingency table) or joint absences (value d). Non-matches are equal in number to value b plus value c. Matches and non-matches may or may not be weighted equally. The three coefficients JACCARD, DICE, and SS2 are related monotonically, as are SM, SS1, and RT. All coefficients in the table are similarity measures, and all except two (K1 and SS3) range from 0 to 1. K1 and SS3 have a minimum value of 0 and no upper limit.
| Denominator | Weighting | Joint absences excluded from numerator | Joint absences included in numerator |
|---|---|---|---|
| All matches included in denominator | Equal weight for matches and non-matches | RR | SM |
| | Double weight for matches | | SS1 |
| | Double weight for non-matches | | RT |
| Joint absences excluded from denominator | Equal weight for matches and non-matches | JACCARD | |
| | Double weight for matches | DICE | |
| | Double weight for non-matches | SS2 | |
| All matches excluded from denominator | Equal weight for matches and non-matches | K1 | SS3 |
RR[(p[,np])]. Russell and Rao similarity measure. This is the binary dot product.
SM[(p[,np])]. Simple matching similarity measure. This is the ratio of the number of matches to the total number of characteristics.
JACCARD[(p[,np])]. Jaccard similarity measure. This is also known as the similarity ratio.
DICE[(p[,np])]. Dice (or Czekanowski or Sørensen) similarity measure.
SS1[(p[,np])]. Sokal and Sneath similarity measure 1.
RT[(p[,np])]. Rogers and Tanimoto similarity measure.
SS2[(p[,np])]. Sokal and Sneath similarity measure 2.
K1[(p[,np])]. Kulczynski similarity measure 1. This measure has a minimum value of 0 and no upper limit. It is undefined when there are no non-matches (b=0 and c=0).
SS3[(p[,np])]. Sokal and Sneath similarity measure 3. This measure has a minimum value of 0 and no upper limit. It is undefined when there are no non-matches (b=0 and c=0).
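As a rough illustration, the nine matching coefficients can be written in terms of a, b, c, and d. The formulas below are the commonly published definitions of these coefficients, not ones given on this page, and the function name `matching_coefficients` is hypothetical. The sketch returns `None` for K1 and SS3 when there are no non-matches, and it does not guard against other zero denominators.

```python
def matching_coefficients(a, b, c, d):
    """Matching coefficients from 2 x 2 table counts (standard textbook forms)."""
    n = a + b + c + d          # total number of characteristics
    nonmatch = b + c           # number of non-matches
    return {
        "RR": a / n,
        "SM": (a + d) / n,
        "JACCARD": a / (a + nonmatch),
        "DICE": 2 * a / (2 * a + nonmatch),
        "SS1": 2 * (a + d) / (2 * (a + d) + nonmatch),
        "RT": (a + d) / (a + d + 2 * nonmatch),
        "SS2": a / (a + 2 * nonmatch),
        # K1 and SS3 are undefined when b = 0 and c = 0.
        "K1": a / nonmatch if nonmatch else None,
        "SS3": (a + d) / nonmatch if nonmatch else None,
    }


# Counts from the two-case example: a=2, b=1, c=0, d=2.
for name, value in matching_coefficients(2, 1, 0, 2).items():
    print(name, value)
```

For the example counts this gives RR = 0.4, SM = 0.8, JACCARD = 2/3, DICE = 0.8, SS1 = 8/9, RT = 2/3, SS2 = 0.5, K1 = 2, and SS3 = 4. K1 and SS3 exceed 1 because they have no upper limit.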
Conditional Probabilities. The following binary measures yield values that can be interpreted in terms of conditional probability. All three are similarity measures.
K2[(p[,np])]. Kulczynski similarity measure 2. This yields the average conditional probability that a characteristic is present in one item given that the characteristic is present in the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.
SS4[(p[,np])]. Sokal and Sneath similarity measure 4. This yields the conditional probability that a characteristic of one item is in the same state (presence or absence) as the characteristic of the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.
HAMANN[(p[,np])]. Hamann similarity measure. This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of −1 to +1 and is monotonically related to SM, SS1, and RT.
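The three conditional-probability measures can be sketched from the same four counts. The formulas are the commonly published definitions rather than ones stated on this page, and the function name `conditional_measures` is hypothetical. The sketch assumes that every marginal total (a+b, a+c, b+d, c+d) is nonzero.

```python
def conditional_measures(a, b, c, d):
    """K2, SS4, and HAMANN from 2 x 2 table counts (standard textbook forms)."""
    n = a + b + c + d
    # K2: average of P(present in one item | present in the other).
    k2 = (a / (a + b) + a / (a + c)) / 2
    # SS4: average over both states (presence and absence) and both directions.
    ss4 = (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)) / 4
    # HAMANN: P(same state) minus P(different state).
    hamann = ((a + d) - (b + c)) / n
    return {"K2": k2, "SS4": ss4, "HAMANN": hamann}


# Counts from the two-case example: a=2, b=1, c=0, d=2.
print(conditional_measures(2, 1, 0, 2))
```

For the example counts, K2 and SS4 both equal 5/6 and HAMANN equals 0.6. Under these definitions HAMANN equals 2 × SM − 1, which is one way to see the monotonic relationship with SM.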
Predictability Measures. The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.
LAMBDA[(p[,np])]. Goodman and Kruskal’s lambda (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. Specifically, LAMBDA measures the proportional reduction in error using one item to predict the other when the directions of prediction are of equal importance. LAMBDA has a range of 0 to 1.
D[(p[,np])]. Anderberg’s D (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other. D measures the actual reduction in the error probability when one item is used to predict the other. The range of D is 0 to 1.
Y[(p[,np])]. Yule’s Y coefficient of colligation (similarity). This is a function of the cross-ratio for a 2 x 2 table. It has a range of −1 to +1.
Q[(p[,np])]. Yule’s Q (similarity). This is the 2 x 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-ratio for a 2 x 2 table and has a range of −1 to +1.
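A hedged sketch of the four predictability measures follows. Y and Q use the standard cross-ratio forms. LAMBDA and D use the symmetric forms usually given for 2 x 2 tables, which are assumed here and are not taken from this page. The function name `predictability_measures` and the helper names `t1` and `t2` are hypothetical, and the sketch assumes nonzero denominators.

```python
import math


def predictability_measures(a, b, c, d):
    """LAMBDA, D, Y, and Q from 2 x 2 table counts (assumed standard forms)."""
    n = a + b + c + d
    # t1: correct predictions when the other item's state is known
    # (summed over both directions of prediction).
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    # t2: correct predictions from the marginal totals alone.
    t2 = max(a + c, b + d) + max(a + b, c + d)
    lam = (t1 - t2) / (2 * n - t2)   # proportional reduction in error
    big_d = (t1 - t2) / (2 * n)      # actual reduction in error probability
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    y = (root_ad - root_bc) / (root_ad + root_bc)
    q = (a * d - b * c) / (a * d + b * c)
    return {"LAMBDA": lam, "D": big_d, "Y": y, "Q": q}


# Hypothetical counts chosen so that Y and Q differ: a=3, b=1, c=1, d=3.
print(predictability_measures(3, 1, 1, 3))
```

With these illustrative counts the sketch gives LAMBDA = 0.5, D = 0.25, Y = 0.5, and Q = 0.8. With the counts from the two-case example (a=2, b=1, c=0, d=2), Y and Q both equal 1 because the product bc is 0.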
Other Binary Measures. The remaining binary measures available in CLUSTER are either binary equivalents of association measures for continuous variables or measures of special properties of the relationship between items.
OCHIAI[(p[,np])]. Ochiai similarity measure. This is the binary form of the cosine. It has a range of 0 to 1.
SS5[(p[,np])]. Sokal and Sneath similarity measure 5. The range is 0 to 1.
PHI[(p[,np])]. Fourfold point correlation (similarity). This is the binary form of the Pearson product-moment correlation coefficient.
BEUCLID[(p[,np])]. Binary Euclidean distance. This is a distance measure. Its minimum value is 0, and it has no upper limit.
BSEUCLID[(p[,np])]. Binary squared Euclidean distance. This is a distance measure. Its minimum value is 0, and it has no upper limit.
SIZE[(p[,np])]. Size difference. This is a dissimilarity measure with a minimum value of 0 and no upper limit.
PATTERN[(p[,np])]. Pattern difference. This is a dissimilarity measure. The range is 0 to 1.
BSHAPE[(p[,np])]. Binary shape difference. This dissimilarity measure has no upper or lower limit.
DISPER[(p[,np])]. Dispersion similarity measure. The range is −1 to +1.
VARIANCE[(p[,np])]. Variance dissimilarity measure. This measure has a minimum value of 0 and no upper limit.
BLWMN[(p[,np])]. Binary Lance-and-Williams nonmetric dissimilarity measure. This measure is also known as the Bray-Curtis nonmetric coefficient. The range is 0 to 1.
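A few of these measures have simple closed forms in a, b, c, and d. The sketch below covers only OCHIAI, PHI, BEUCLID, BSEUCLID, and BLWMN, using their commonly published definitions, which are assumed here and not taken from this page. The function name `other_measures` is hypothetical, and the sketch assumes nonzero denominators.

```python
import math


def other_measures(a, b, c, d):
    """A subset of the remaining binary measures (assumed standard forms)."""
    return {
        # Binary form of the cosine.
        "OCHIAI": a / math.sqrt((a + b) * (a + c)),
        # Binary form of the Pearson correlation.
        "PHI": (a * d - b * c)
        / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
        # Distances depend only on the number of non-matches.
        "BEUCLID": math.sqrt(b + c),
        "BSEUCLID": b + c,
        # Lance-and-Williams (Bray-Curtis) nonmetric dissimilarity.
        "BLWMN": (b + c) / (2 * a + b + c),
    }


# Counts from the two-case example: a=2, b=1, c=0, d=2.
print(other_measures(2, 1, 0, 2))
```

For the example counts, PHI is 2/3, BEUCLID and BSEUCLID are both 1, and BLWMN is 0.2. Under these definitions BLWMN equals 1 − DICE.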