Weights and record comparisons in DataStage

The information content of the data determines which record pairs are matches and which are nonmatches, both for an automated process like matching in DataStage® and for manual matching.

Each column within a record provides some information. The weight is a measure of the information content of the data and the likelihood that a record pair matches. Taken together, all the columns and comparisons determine the status of the pair that is being examined by the matching process.

Some columns provide more information more reliably than others. For example, it does not make sense to compare just the gender column and assert that if there is agreement, the record pair represents the same individual. However, it is more reasonable to compare records by using an individual identification number, and assert that if the number agrees then the record pair represents the same individual. You can use the discriminating power of each column to measure and predict matches.

Composite weights (DataStage)

For each record pair that you want to compare, a composite weight is computed.

The composite weight is the sum of the individual weights for all the match comparisons. It is the measure of confidence that the two records are a match.

Computation of weights (DataStage)

Match weights measure the extent to which two records match based on the designated matching criteria. Record pairs with higher weights are a stronger match than record pairs with lower weights.

The odds ratio is the ratio of the odds that an event occurs in one group to the odds that the event occurs in another group. An odds ratio of one (1) implies that the event is equally likely in both groups. An odds ratio greater than one (>1) implies that the event is more likely in the first group. An odds ratio less than one (< 1) implies that the event is less likely in the first group.

The weight calculation for a column uses the odds ratio of the likelihood that the values are the same and the record pair is a match versus the likelihood that the values are the same and the record pair is a nonmatch. Using such an odds ratio minimizes the number of clerical pairs for the chosen cutoff values.
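
The odds-ratio definition above translates directly into a short calculation. The following sketch uses illustrative event probabilities for two groups; the numbers are assumptions, not values produced by DataStage.

```python
# Illustrative event probabilities for two groups (assumed values).
p_event_group1 = 0.8   # the event occurs in 80% of group 1
p_event_group2 = 0.4   # the event occurs in 40% of group 2

# Odds = probability that the event occurs / probability that it does not.
odds_group1 = p_event_group1 / (1 - p_event_group1)    # 4.0
odds_group2 = p_event_group2 / (1 - p_event_group2)    # about 0.67

# An odds ratio greater than 1 means the event is more likely in the first group.
odds_ratio = odds_group1 / odds_group2
print(round(odds_ratio, 2))  # 6.0
```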

Agreement and disagreement weights

The match weight is derived by using the agreement and disagreement weights established for that column.

If a match comparison agrees for the record pair being compared, the agreement weight is added to the composite weight. The agreement weight is a positive value. If a match comparison disagrees for the record pair being compared, the disagreement weight is added. The disagreement weight is a negative value. Therefore, agreement weights add to the composite weight, and disagreement weights subtract from the composite weight. The higher the score, the greater the agreement.

Partial weight is assigned for non-exact or fuzzy matches.

Missing values have a default weight of zero.
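
The way that these per-column contributions accumulate can be sketched in a few lines of code. The column names, the individual weights, and the fixed factor used for partial (fuzzy) agreement are illustrative assumptions, not DataStage output or APIs.

```python
# Illustrative agreement and disagreement weights for each match column.
weights = {
    "family_name": {"agreement": 4.0, "disagreement": -3.0},
    "birth_date":  {"agreement": 5.5, "disagreement": -4.5},
    "gender":      {"agreement": 0.85, "disagreement": -2.3},
}

# Comparison outcome for one record pair: agree, disagree, partial, or missing.
outcomes = {
    "family_name": "agree",
    "birth_date":  "partial",
    "gender":      "missing",
}

def column_weight(column, outcome, partial_factor=0.5):
    """Return the contribution of one column comparison to the composite weight."""
    if outcome == "agree":
        return weights[column]["agreement"]       # positive: adds to the score
    if outcome == "disagree":
        return weights[column]["disagreement"]    # negative: subtracts from the score
    if outcome == "partial":
        # Non-exact (fuzzy) match: a partial weight between the two extremes.
        return weights[column]["agreement"] * partial_factor
    return 0.0                                    # missing values default to a weight of zero

composite_weight = sum(column_weight(c, o) for c, o in outcomes.items())
print(composite_weight)  # 4.0 + 2.75 + 0.0 = 6.75
```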

Reliability and chance agreement

Reliability: a discussion of m-prob

The m probability is the probability that a column value agrees with its pair value given that the record pair being examined is a matched pair. The m probability is effectively one minus the error rate of the column. For example, in a sample of matched records, if gender disagrees 10% of the time due to transcription errors or being misreported, the m probability for this variable is 0.9 (1 - 0.1).

A high m probability (0.9) indicates that the data in a given column is considered highly reliable. The higher the m probability, the higher the disagreement weight. A disagreement between two values in a reliable column is penalized more heavily than in an unreliable column for which you set a low m probability (0.1).

If a column is important, then the m probability can be given higher values. If the m probability is high, it is equivalent to saying that a disagreement of values from that column is a rare event in a matched pair. Consequently the penalty for a nonmatch is high.
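
As a sketch of the arithmetic described above, the m probability for a column can be derived from the rate at which that column disagrees in a sample of known matched pairs. The error rates below are illustrative, matching the gender example.

```python
# Observed disagreement (error) rates in a sample of known matched pairs.
error_rates = {
    "gender": 0.1,                    # misreported or mistyped 10% of the time
    "national_identity_number": 0.4,  # in error 40% of the time
}

# m probability = 1 - error rate: how often the column agrees in matched pairs.
m_probabilities = {column: 1 - rate for column, rate in error_rates.items()}
print(m_probabilities)  # {'gender': 0.9, 'national_identity_number': 0.6}
```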

Chance agreement: a discussion of u-prob

The u probability is the probability that a column value agrees with its pair value given that the record pair being examined is a nonmatched pair. Because there are so many more possible nonmatched pairs than matched pairs, the u probability is effectively the probability that the values agree at random.

Match uses a frequency analysis to determine the probability of a chance agreement for all values.

Rare values bring more weight to a match.

For example, the probability that two gender values agree at random is about 0.5. Given a uniform distribution, there are four possible combinations of the two values:
Table 1. Possible combinations of the gender values for a pair of records

File A    File B
M         F
M         M
F         M
F         F

The gender agrees in two of the four combinations. Therefore, the pair has a u probability of 0.5.
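
One common way to estimate the probability of chance agreement is to sum the squared relative frequencies of the values in a column: the chance that two randomly paired records carry the same value. The sketch below reproduces the 0.5 figure for a uniform gender distribution; the formula is a standard approximation, and the frequencies are assumptions rather than output from the Frequency stage.

```python
# Relative frequency of each value in the column (assumed uniform here).
gender_frequencies = {"M": 0.5, "F": 0.5}

# Probability that two randomly paired records agree by chance:
# both values are "M", or both values are "F".
u_probability = sum(p * p for p in gender_frequencies.values())
print(u_probability)  # 0.5, matching the four combinations in Table 1
```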

Weights calculations

Column weights reflect both the reliability of the data and the possibility for chance agreement.

The agreement weight for a column is calculated as the logarithm to the base two of the ratio of the m probability and u probability, as shown in the following equation:
log2(m probability/u probability)
The disagreement weight for a column is calculated as shown in the following equation:
log2((1 - m probability)/(1 - u probability))
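
These two formulas translate directly into code. The helper names below are illustrative, not DataStage functions.

```python
import math

def agreement_weight(m, u):
    """Weight added to the composite weight when the column values agree."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added to the composite weight when the column values disagree."""
    return math.log2((1 - m) / (1 - u))
```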

Agreement weight example

Gender and national identity numbers demonstrate how differences in frequency or reliability change the agreement weight for columns.

To see how the weight computation translates into actual values, consider the gender and national identity number columns. In this example, gender has a 10 percent error rate, and the national identity number has a 40 percent error rate.

The m probability for gender is 0.9. The u probability is 0.5. Therefore, the weight for gender is shown in the following example:

log2(m/u) = ln(m/u)/ln(2) = ln(0.9/0.5)/ln(2) = 0.85

Conservatively, assume that the probability of a chance agreement of the national identity number is one in 10 million. Given m as 0.6 (a 40% error rate in matched pairs), the weight for the national identity number is log2(0.6/0.0000001), which is about 22.5.

Therefore, the weight for a match on the gender column is 0.85, and the weight for a match on the national identity number is about 22.5. The weights capture what you might intuitively know about the columns.
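
The same arithmetic can be checked directly; this short snippet reproduces the two figures above.

```python
import math

print(round(math.log2(0.9 / 0.5), 2))        # 0.85 for gender
print(round(math.log2(0.6 / 0.0000001), 2))  # 22.52, roughly the 22.5 quoted above
```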

Cutoff values (DataStage)

Match and clerical cutoffs are thresholds that determine how to categorize scored record pairs.

Your goal when setting cutoffs is to minimize uncertainty in the match results while limiting the number of false categorizations.

Record pairs with composite weights equal to or greater than the match cutoff are considered matches. Record pairs with composite weights equal to or greater than the clerical cutoff but less than the match cutoff are called clerical pairs. The matching process is uncertain whether clerical pairs are matches or nonmatches. Pairs with composite weights below the clerical cutoff are considered nonmatches. You can set cutoffs at the same value, so that you eliminate clerical records.
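
A minimal sketch of this categorization rule follows; the function and parameter names are illustrative, not DataStage APIs.

```python
def classify(composite_weight, clerical_cutoff, match_cutoff):
    """Categorize a scored record pair against the clerical and match cutoffs."""
    if composite_weight >= match_cutoff:
        return "match"
    if composite_weight >= clerical_cutoff:
        return "clerical"   # uncertain: held for manual review
    return "nonmatch"

# Example with assumed cutoff values; setting both cutoffs to the same value
# eliminates the clerical category.
print(classify(14.2, clerical_cutoff=5.0, match_cutoff=12.0))  # match
print(classify(7.5, clerical_cutoff=5.0, match_cutoff=12.0))   # clerical
print(classify(-3.0, clerical_cutoff=5.0, match_cutoff=12.0))  # nonmatch
```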

You can set a high cutoff threshold to limit the results to better quality matches, though possibly fewer matches. A lower threshold can produce more matches, but some of these matches might be of lesser quality. Business requirements help drive decisions. Results can vary depending on whether you take a conservative or more aggressive approach to defining the cutoff values.

For example, matching for the purpose of docking a person's pay might require a more conservative approach than deduplicating a mailing list for shopping catalogs. As a best practice, keep in mind the business purpose when you tune the match settings.

The composite weights assigned to each record pair create a distribution of scores that range from very high positive to very high negative. The graph in Figure 1 focuses on the area of the histogram where the number of low-scoring pairs tails off and the number of high-scoring pairs starts to increase. In this area of the graph, pairs cannot confidently be classified as either matches or nonmatches.

You set the cutoff values to tell the matching process how to handle pairs in this range. Differences in the distribution of pairs help to determine the settings. The shape of the graph of matched versus nonmatched records relates to the cutoff points. You typically set cutoffs on the down slope of the nonmatched distribution and the up slope of the matched distribution. Where you set the cutoffs is influenced by both the business objective and the tolerance for error.

The weights between the vertical lines form a gray area, where one cannot say whether the pair is matched or not. You want enough variables to push the matched and nonmatched groups further apart and minimize the uncertain pairs. You know that you developed a good match strategy when the clerical area contains mostly records with blank, missing, and default values.

Figure 1. Histogram of weights
The histogram shows the number of pairs on the vertical axis and the comparison weight on the horizontal axis. Nonmatched records fall to the left of the low (clerical) cutoff, matched records fall to the right of the high (match) cutoff, and clerical records fall between the two cutoffs. The gray area in the middle, between the nonmatch and match areas, is labeled as the clerical area.

The fewer the records in the clerical area, the fewer cases there are to review, but the greater the probability of errors.

False positives are cases in which records are classified as matched records but really are nonmatched records. False negatives are cases in which records are classified as nonmatched records but really are matched records.

The goal of setting cutoffs is to minimize the number of clerical pairs and limit the number of false negatives and positives. You fine tune the results depending on the goals of your organization.

Guidelines when determining probability (DataStage)

You can use these guidelines as you assign probabilities.

The higher the m probability is, the greater the disagreement weight is. Therefore, if a column is important, give the m probability higher values. If the m probability is high, it is equivalent to saying that a disagreement of that column is a rare event in a matched pair, and consequently the penalty for a nonmatch is high. The weights computed from the probabilities are visible in the data viewer of the Match Designer so that you can inspect the results.

Use the following guidelines when determining m probabilities.

  • Give high m probabilities to the columns that are the most important and reliable.
  • Give lower m probabilities to the columns that are often in error or incomplete.
  • The m probability must always be greater than the u probability and must never be 0 or 1.

Agreement or disagreement between data values is more significant for reliable data and less significant for unreliable data.

As a starting point, you can guess the u probability because the matching process replaces any guess with actual values. A good estimate is to make the u probability 1/n, where n is the number of unique values for the column. By default, the u probability for each comparison is calculated automatically by the matching process using the frequency information from the Frequency stage. This calculated u probability is important for columns with non-uniform distributions.

The frequency information allows the matching process to vary the weights according to the particular values of a column. Rare values bring more weight to a match. For example, when you use a column such as FamilyName, the values Smith and Jones are common in the United States. However, a value such as Alcott is relatively rare in the United States. A match on the rare family name, Alcott, gets a higher weight than a match on the more common family names because the probability of chance agreement on Alcott is relatively low compared to chance agreement on values such as Smith or Jones.
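
The effect of frequency information can be sketched by letting the u probability depend on the specific value being compared, so that rarer values earn higher agreement weights. The name frequencies and m probability below are assumptions, and the actual per-value computation that the matching process performs is more involved than this.

```python
import math

# Illustrative relative frequencies of family names in the file.
name_frequencies = {"SMITH": 0.010, "JONES": 0.008, "ALCOTT": 0.0001}

m = 0.95  # assumed reliability of the FamilyName column

def value_specific_agreement_weight(value):
    """Agreement weight when both records carry this particular value."""
    u = name_frequencies[value]   # chance agreement on this specific value
    return math.log2(m / u)

print(round(value_specific_agreement_weight("SMITH"), 2))   # about 6.57
print(round(value_specific_agreement_weight("ALCOTT"), 2))  # about 13.21
```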

For columns with a uniform distribution of values and a high number of different values (such as individual identification numbers), it is better not to generate frequency information. Specify the vartype NOFREQ in the Variable Special Handling window of the Match Designer.

Even exact matching is subject to the same statistical laws as probabilistic matching. It is possible to have two records that contain identical values and yet do not represent the same entity. You cannot make a definitive determination when there is not enough information.