Measures for Binary Data (PROXIMITIES command)
Different binary measures emphasize different aspects of the relationship between sets of binary values. However, all measures are specified in the same way. Each measure has two optional integer-valued parameters, p (present) and np (not present).
- If both parameters are specified, PROXIMITIES uses the value of the first parameter as an indicator that a characteristic is present, and PROXIMITIES uses the value of the second parameter as an indicator that a characteristic is absent. PROXIMITIES skips all other values.
- If only the first parameter is specified, PROXIMITIES uses that value to indicate presence and uses all other values to indicate absence.
- If no parameters are specified, PROXIMITIES assumes that 1 indicates presence and 0 indicates absence.
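The three parameter rules can be sketched in Python as a small classifier. This is an illustration only, not the PROXIMITIES implementation: the function name `classify` and its argument names are hypothetical, and the handling of values other than 0 and 1 in the no-parameter case (skipped here) is an assumption that the text above does not confirm.

```python
from typing import Optional


def classify(value, present_value: Optional[int] = None,
             absent_value: Optional[int] = None) -> Optional[bool]:
    """Return True (present), False (absent), or None (value is skipped)."""
    if present_value is None:
        # No parameters: 1 indicates presence and 0 indicates absence.
        # Assumption: any other value is skipped.
        present_value, absent_value = 1, 0
    if value == present_value:
        return True
    if absent_value is None:
        # Only p given: every other value counts as absence.
        return False
    # Both p and np given: anything that is neither indicator is skipped.
    return False if value == absent_value else None


print(classify(1))        # defaults: 1 is present
print(classify(5, 2))     # only p=2 given: 5 counts as absent
print(classify(3, 1, 2))  # p=1, np=2: 3 is skipped (None)
```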
Using the indicators for presence and absence within each item (case or variable), PROXIMITIES constructs a 2×2 contingency table for each pair of items and uses this table to compute a proximity measure for the pair.
Item 1 (rows) by Item 2 (columns) | Item 2 characteristics Present | Item 2 characteristics Absent |
---|---|---|
Item 1 characteristics Present | a | b |
Item 1 characteristics Absent | c | d |
PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallied across variables (when the items are cases) or cases (when the items are variables). For example, if variables V, W, X, Y, Z have values 0, 1, 1, 0, 1 for case 1 and have values 0, 1, 1, 0, 0 for case 2 (where 1 indicates presence and 0 indicates absence), the contingency table is as follows:
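Case 1 (rows) by Case 2 (columns) | Case 2 characteristics Present | Case 2 characteristics Absent |
---|---|---|
Case 1 characteristics Present | 2 | 1 |
Case 1 characteristics Absent | 0 | 2 |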
The contingency table indicates that two characteristics (W and X) are present in both cases, two characteristics (V and Y) are absent from both cases, and one characteristic (Z) is present in case 1 but absent from case 2. There are no variables for which the characteristic is absent from case 1 and present in case 2.
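As a check on the worked example, the tally of a, b, c, and d can be sketched in a few lines of Python. The helper name `tally` is hypothetical, and skipping a pair when either value is neither indicator is an assumption made for this sketch.

```python
def tally(item1, item2, present=1, absent=0):
    """Count the four cells (a, b, c, d) of the 2x2 table for one pair of items."""
    a = b = c = d = 0
    for x, y in zip(item1, item2):
        # Assumption: a pair is skipped if either value is neither indicator.
        if x not in (present, absent) or y not in (present, absent):
            continue
        if x == present and y == present:
            a += 1   # present in both items
        elif x == present:
            b += 1   # present in item 1 only
        elif y == present:
            c += 1   # present in item 2 only
        else:
            d += 1   # absent from both items
    return a, b, c, d


case1 = [0, 1, 1, 0, 1]   # values of V, W, X, Y, Z for case 1
case2 = [0, 1, 1, 0, 0]   # values of V, W, X, Y, Z for case 2
print(tally(case1, case2))  # (2, 1, 0, 2)
```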
The available binary measures include matching coefficients, conditional probabilities, predictability measures, and other measures.
Matching Coefficients. The following table shows a classification scheme for PROXIMITIES matching coefficients. In this scheme, matches are joint presences (value a in the contingency table) or joint absences (value d). Non-matches are equal in number to value b plus value c. Matches and non-matches may be weighted equally or not. The three coefficients JACCARD, DICE, and SS2 are related monotonically, as are SM, SS1, and RT. All coefficients in the table are similarity measures, and all coefficients except K1 and SS3 range from 0 to 1. K1 and SS3 have a minimum value of 0 and have no upper limit.
Denominator and weighting | Joint absences excluded from numerator | Joint absences included in numerator |
---|---|---|
All matches included in denominator, equal weight for matches and non-matches | RR | SM |
All matches included in denominator, double weight for matches | | SS1 |
All matches included in denominator, double weight for non-matches | | RT |
Joint absences excluded from denominator, equal weight for matches and non-matches | JACCARD | |
Joint absences excluded from denominator, double weight for matches | DICE | |
Joint absences excluded from denominator, double weight for non-matches | SS2 | |
All matches excluded from denominator, equal weight for matches and non-matches | K1 | SS3 |
RR[(p[,np])]. Russell and Rao similarity measure. This measure is the binary dot product.
SM[(p[,np])]. Simple matching similarity measure. This measure is the ratio of the number of matches to the total number of characteristics.
JACCARD[(p[,np])]. Jaccard similarity measure. This measure is also known as the similarity ratio.
DICE[(p[,np])]. Dice (or Czekanowski or Sørensen) similarity measure.
SS1[(p[,np])]. Sokal and Sneath similarity measure 1.
RT[(p[,np])]. Rogers and Tanimoto similarity measure.
SS2[(p[,np])]. Sokal and Sneath similarity measure 2.
K1[(p[,np])]. Kulczynski similarity measure 1. This measure has a minimum value of 0 and no upper limit. The measure is undefined when there are no non-matches (b=0 and c=0).
SS3[(p[,np])]. Sokal and Sneath similarity measure 3. This measure has a minimum value of 0 and no upper limit. The measure is undefined when there are no non-matches (b=0 and c=0).
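The classification scheme can be made concrete with the commonly published formulas for these coefficients, written in terms of a, b, c, and d. The sketch below assumes that PROXIMITIES follows these textbook definitions. The function name `matching_coefficients` is hypothetical, and the special handling that a real implementation needs for empty denominators is reduced to returning `None` for K1 and SS3.

```python
def matching_coefficients(a, b, c, d):
    """Textbook matching coefficients from the cells of a 2x2 table."""
    n = a + b + c + d          # total number of characteristics
    nonmatch = b + c           # number of non-matches
    result = {
        "RR": a / n,
        "SM": (a + d) / n,
        "JACCARD": a / (a + nonmatch),
        "DICE": 2 * a / (2 * a + nonmatch),
        "SS1": 2 * (a + d) / (2 * (a + d) + nonmatch),
        "RT": (a + d) / (a + d + 2 * nonmatch),
        "SS2": a / (a + 2 * nonmatch),
    }
    # K1 and SS3 are undefined when there are no non-matches (b=0 and c=0).
    result["K1"] = a / nonmatch if nonmatch else None
    result["SS3"] = (a + d) / nonmatch if nonmatch else None
    return result


# Cells from the worked example: a=2, b=1, c=0, d=2.
for name, value in matching_coefficients(2, 1, 0, 2).items():
    print(name, round(value, 4))
```

For these cells, RR is 0.4 and SM is 0.8, while K1 (2.0) and SS3 (4.0) exceed 1, which is consistent with their having no upper limit.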
Conditional Probabilities. The following binary measures yield values that can be interpreted in terms of conditional probability. All three measures are similarity measures.
K2[(p[,np])]. Kulczynski similarity measure 2. This measure yields the average conditional probability that a characteristic is present in one item given that the characteristic is present in the other item. The measure is an average over both items that are acting as predictors. The measure has a range of 0 to 1.
SS4[(p[,np])]. Sokal and Sneath similarity measure 4. This measure yields the conditional probability that a characteristic of one item is in the same state (presence or absence) as the characteristic of the other item. The measure is an average over both items that are acting as predictors. The measure has a range of 0 to 1.
HAMANN[(p[,np])]. Hamann similarity measure. This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of −1 to +1 and is monotonically related to SM, SS1, and RT.
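The conditional-probability reading of these three measures is easier to see in code. The sketch below uses the commonly published formulas and assumes that PROXIMITIES matches them. The function name `conditional_measures` is hypothetical, and zero row or column totals are not handled.

```python
def conditional_measures(a, b, c, d):
    """K2, SS4, and HAMANN from the cells of a 2x2 table (textbook forms)."""
    n = a + b + c + d
    # K2: average of P(present in item 2 | present in item 1) and the reverse.
    k2 = (a / (a + b) + a / (a + c)) / 2
    # SS4: also averages the two conditional probabilities for joint absence.
    ss4 = (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)) / 4
    # HAMANN: P(same state) minus P(different state).
    hamann = ((a + d) - (b + c)) / n
    return k2, ss4, hamann


k2, ss4, hamann = conditional_measures(2, 1, 0, 2)
print(round(k2, 4), round(ss4, 4), round(hamann, 4))  # 0.8333 0.8333 0.6
```

With these formulas HAMANN equals 2·SM − 1 (here 2 × 0.8 − 1 = 0.6), which is one way to see its monotonic relationship to SM.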
Predictability Measures. The following four binary measures assess the association between items as the predictability of one item given the other item. All four measures yield similarities.
LAMBDA[(p[,np])]. Goodman and Kruskal’s lambda (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. Specifically, LAMBDA measures the proportional reduction in error, using one item to predict the other item when the directions of prediction are of equal importance. LAMBDA has a range of 0 to 1.
D[(p[,np])]. Anderberg’s D (similarity). This coefficient assesses the predictability of the state of a characteristic on one item (present or absent) given the state on the other item. D measures the actual reduction in the error probability when one item is used to predict the other item. The range of D is 0 to 1.
Y[(p[,np])]. Yule’s Y coefficient of colligation (similarity). This measure is a function of the cross ratio for a 2×2 table and has a range of −1 to +1.
Q[(p[,np])]. Yule’s Q (similarity). This measure is the 2×2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross ratio for a 2×2 table and has a range of −1 to +1.
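Both Yule coefficients depend on the table only through the products ad and bc, whose ratio ad/bc is the cross ratio. A minimal sketch using the standard definitions follows. The function names `yule_q` and `yule_y` are hypothetical, and tables where ad and bc are both zero are not handled.

```python
import math


def yule_q(a, b, c, d):
    """Yule's Q: (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def yule_y(a, b, c, d):
    """Yule's Y: the same form as Q, applied to the square roots of ad and bc."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


print(yule_q(3, 1, 1, 3), yule_y(3, 1, 1, 3))  # 0.8 0.5
print(yule_q(2, 1, 0, 2), yule_y(2, 1, 0, 2))  # 1.0 1.0
```

The second call uses the cells from the worked example. Because c is 0, the product bc is 0 and both coefficients reach their maximum of +1.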
Other Binary Measures. The remaining binary measures that are available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relationship between items.
OCHIAI[(p[,np])]. Ochiai similarity measure. This measure is the binary form of the cosine and has a range of 0 to 1.
SS5[(p[,np])]. Sokal and Sneath similarity measure 5. The range is 0 to 1.
PHI[(p[,np])]. Fourfold point correlation (similarity). This measure is the binary form of the Pearson product-moment correlation coefficient.
BEUCLID[(p[,np])]. Binary Euclidean distance. This measure is a distance measure. Its minimum value is 0, and it has no upper limit.
BSEUCLID[(p[,np])]. Binary squared Euclidean distance. This measure is a distance measure. Its minimum value is 0, and it has no upper limit.
SIZE[(p[,np])]. Size difference. This measure is a dissimilarity measure with a minimum value of 0 and no upper limit.
PATTERN[(p[,np])]. Pattern difference. This measure is a dissimilarity measure. The range is 0 to 1.
BSHAPE[(p[,np])]. Binary shape difference. This dissimilarity measure has no upper limit or lower limit.
DISPER[(p[,np])]. Dispersion similarity measure. The range is −1 to +1.
VARIANCE[(p[,np])]. Variance dissimilarity measure. This measure has a minimum value of 0 and no upper limit.
BLWMN[(p[,np])]. Binary Lance-and-Williams nonmetric dissimilarity measure. This measure is also known as the Bray-Curtis nonmetric coefficient. The range is 0 to 1.
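Several of these measures have simple closed forms in a, b, c, and d. The sketch below covers a subset using the commonly published formulas and assumes that PROXIMITIES matches them. The function name `other_measures` is hypothetical, and zero marginal totals are not handled.

```python
import math


def other_measures(a, b, c, d):
    """A subset of the remaining binary measures (textbook forms)."""
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    return {
        # Binary cosine: geometric mean of the two conditional probabilities.
        "OCHIAI": math.sqrt((a / (a + b)) * (a / (a + c))),
        "SS5": a * d / math.sqrt(margins),
        # Binary form of the Pearson correlation.
        "PHI": (a * d - b * c) / math.sqrt(margins),
        # Distances depend only on the number of non-matches.
        "BEUCLID": math.sqrt(b + c),
        "BSEUCLID": b + c,
        "BLWMN": (b + c) / (2 * a + b + c),
    }


# Cells from the worked example: a=2, b=1, c=0, d=2.
for name, value in other_measures(2, 1, 0, 2).items():
    print(name, round(value, 4))
```

With these formulas BLWMN is the complement of DICE (here 0.2 against 0.8), so it orders pairs of items in exactly the reverse way.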
Example
PROXIMITIES A B C
/MEASURE=RR(1,2).
- MEASURE computes Russell and Rao coefficients from data in which 1 indicates the presence of a characteristic and 2 indicates the absence. Other values are ignored.
Example
PROXIMITIES A B C
/MEASURE=SM(2).
- MEASURE computes simple matching coefficients from data in which 2 indicates presence and all other values indicate absence.