How QualityStage matches records in DataStage®
Matching is a probabilistic record linkage implementation. It is an open system that you can tune to create the best matches for your needs.
Weights, scores, and thresholds
The matching process assigns numeric scores, called weights, to the comparison of individual data elements. These scores measure the contribution of each data element to the overall, or composite, weight. The composite weight, in turn, is the sum of the weights of all defined comparisons. Some data elements contribute more weight because they are more critical to the match or more reliable than others. Statistical properties of the data elements, together with tuning parameters, determine how much weight each comparison contributes.
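The following sketch illustrates the general idea by using the classic probabilistic weighting scheme (Fellegi-Sunter style), in which each data element has an m probability (the chance that it agrees on a true match) and a u probability (the chance that it agrees on a random nonmatch). The field names and probabilities are hypothetical, and the sketch is not the QualityStage implementation; it shows only how per-element weights can sum to a composite weight.

```python
from math import log2

# Illustrative only: classic probabilistic record linkage weights.
# m = probability that a field agrees on a true match,
# u = probability that a field agrees on a random nonmatch.
def agreement_weight(m: float, u: float) -> float:
    return log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    return log2((1 - m) / (1 - u))

# Hypothetical m/u probabilities for three data elements.
fields = {
    "surname":    (0.95, 0.01),   # reliable and highly discriminating
    "birth_date": (0.90, 0.05),
    "city":       (0.85, 0.20),   # common values, so it contributes less weight
}

def composite_weight(agreements: dict) -> float:
    """Sum the per-field weights into one composite weight."""
    total = 0.0
    for field, (m, u) in fields.items():
        if agreements[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total

# Two fields agree, one disagrees; the result is one composite weight.
print(composite_weight({"surname": True, "birth_date": True, "city": False}))
```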
The weight for a data element is generated by one of the comparison functions that are available in matching. These include exact-match functions and a full spectrum of error-tolerant, or fuzzy, matching functions. You can adjust the resulting weight of a given comparison function to reflect the importance of the data element to its domain and to the overall comparison.
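As an illustration only, the following sketch shows one way that an error-tolerant comparison might prorate the weight between full agreement and full disagreement according to string similarity. QualityStage provides its own comparison functions; the similarity measure, thresholds, and weights here are hypothetical.

```python
from difflib import SequenceMatcher

# Illustrative sketch of an error-tolerant (fuzzy) comparison that prorates
# the weight by how similar the two values are. Not a QualityStage function.
def fuzzy_weight(a: str, b: str, agree_wt: float, disagree_wt: float) -> float:
    similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()  # 0.0 .. 1.0
    if similarity >= 0.9:          # near-exact: full agreement weight
        return agree_wt
    if similarity <= 0.5:          # too different: full disagreement weight
        return disagree_wt
    # Partial credit, interpolated between the two weights.
    frac = (similarity - 0.5) / 0.4
    return disagree_wt + frac * (agree_wt - disagree_wt)

# A near miss on surname still earns most of the agreement weight.
print(fuzzy_weight("JOHNSON", "JOHNSTON", agree_wt=6.6, disagree_wt=-4.2))
```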
The composite weight is compared against a set of thresholds (also called cutoffs) to determine how to translate the weight into a measurement of confidence. The confidence indicates the likelihood that a matching pair of records was identified.
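A minimal sketch of how a composite weight might be translated into a confidence category by two cutoffs follows, assuming one match cutoff and one clerical-review cutoff. The cutoff values are hypothetical; in practice you tune them for each pass.

```python
# Hypothetical cutoffs. At or above the match cutoff, the pair is treated as
# a match; between the cutoffs, it is flagged for clerical review; below the
# clerical cutoff, it is a nonmatch.
MATCH_CUTOFF = 10.0
CLERICAL_CUTOFF = 4.0

def classify(composite_weight: float) -> str:
    if composite_weight >= MATCH_CUTOFF:
        return "match"
    if composite_weight >= CLERICAL_CUTOFF:
        return "clerical review"
    return "nonmatch"

print(classify(12.3))   # match
print(classify(6.0))    # clerical review
print(classify(-1.5))   # nonmatch
```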
Computational loads of large volumes of data
A matching process consists of one or more passes. Each pass employs a technique called blocking to reduce the computational load by focusing on pairs of records that are more likely to be matches. The pass also specifies the columns and comparison functions that are used to compare records.
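The following sketch illustrates blocking in its simplest form: records are grouped by a blocking key, and only records that share the key are paired for comparison. The column names and the choice of key are hypothetical; a real pass might block on, for example, a postal code plus a phonetic name key.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; the blocking key is the postal code.
records = [
    {"id": 1, "postal": "10001", "surname": "SMITH"},
    {"id": 2, "postal": "10001", "surname": "SMYTH"},
    {"id": 3, "postal": "94105", "surname": "SMITH"},
]

# Group records by the blocking key.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["postal"]].append(rec)

# Generate candidate pairs only within each block.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # [(1, 2)] instead of all 3 possible pairs
```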
For data sources of any reasonable size, it is not feasible to compare all record pairs, because the number of possible pairs is the product of the number of records in each source. For example, when you have two sources with as few as 1,000 records each, there are 1,000,000 combinations of records, one from each source. But there are, at most, only 1,000 possible matches (if there are no duplicates in the sources). Therefore, the set of matched pairs contains, at most, 1,000 pairs, and the set of nonmatched pairs contains the remaining 999,000 pairs.
There are many more nonmatched pairs than matched pairs. Only when you look at pairs of records that have a high probability of being matches, and ignore all pairs with very low probabilities, does it become computationally feasible to link large volumes of data.
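To make the reduction concrete, the following sketch repeats the arithmetic of the example above and adds a hypothetical blocking scheme of 100 equally sized blocks.

```python
# Two sources of 1,000 records each, as in the example above.
source_a, source_b = 1_000, 1_000
all_pairs = source_a * source_b                 # 1,000,000 candidate pairs
max_matches = min(source_a, source_b)           # at most 1,000 true matches

# Hypothetical blocking: 100 blocks of roughly equal size, so only pairs that
# share a block value are compared.
blocks = 100
pairs_with_blocking = blocks * (source_a // blocks) * (source_b // blocks)

print(all_pairs, max_matches, pairs_with_blocking)   # 1000000 1000 10000
```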