The matching algorithm

The matching algorithm can be summarized as follows.

A block of records is read from the data source or, in a Reference Match, from both the data source and the reference source.
All columns are compared and a composite weight is computed for each possible record pair in the block.
- Reference Match
  A matrix of composite weights is created. The matrix size is n x m, where n is the number of data source records in the block and m is the number of reference source records in the block. The elements of the matrix are the composite weights.
- Unduplicate Match
  A matrix of composite weights is created. The matrix size is n x n, where n is the number of data source records in the block. The elements of the matrix are the composite weights.
Matches are assigned.
The assigned elements are examined. An element whose weight is greater than the match cutoff is a match. An element whose weight is greater than the clerical cutoff, but not the match cutoff, is a clerical review pair.
Duplicates are detected on both sources by examining the row and column of an assigned pair.
If that row or column contains more than one element whose weight is greater than the cutoff weight, the additional elements are potential duplicates.
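
The steps above can be sketched in Python for a single Reference Match block. This is a minimal illustration under stated assumptions, not the product's implementation: the names `match_block`, `weight`, `match_cutoff`, and `clerical_cutoff` are hypothetical, the assignment step is a simple greedy stand-in for whatever assignment method the matcher actually uses, and duplicates are tested against the match cutoff only.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Pair = Tuple[int, int]  # (data index, reference index)


def match_block(
    data: Sequence,
    reference: Sequence,
    weight: Callable[[object, object], float],
    match_cutoff: float,
    clerical_cutoff: float,
) -> Dict[str, object]:
    n, m = len(data), len(reference)
    # n x m matrix of composite weights, one element per candidate pair.
    matrix = [[weight(d, r) for r in reference] for d in data]

    # Assumption: greedy one-to-one assignment, highest weight first.
    cells = sorted(
        ((matrix[i][j], i, j) for i in range(n) for j in range(m)),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_rows: Set[int] = set()
    used_cols: Set[int] = set()
    assigned: List[Tuple[int, int, float]] = []
    for w, i, j in cells:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            assigned.append((i, j, w))

    # Examine the assigned elements against the two cutoffs.
    matches = [(i, j) for i, j, w in assigned if w > match_cutoff]
    clerical = [
        (i, j) for i, j, w in assigned if clerical_cutoff < w <= match_cutoff
    ]

    # Scan the row and column of each matched pair for other high weights.
    # Assumption: records already matched or under review are not duplicates.
    taken_rows = {i for i, _ in matches + clerical}
    taken_cols = {j for _, j in matches + clerical}
    dup_data: Set[int] = set()
    dup_ref: Set[int] = set()
    for i, j in matches:
        for k in range(m):  # row i: other reference records
            if k != j and k not in taken_cols and matrix[i][k] > match_cutoff:
                dup_ref.add(k)
        for k in range(n):  # column j: other data records
            if k != i and k not in taken_rows and matrix[k][j] > match_cutoff:
                dup_data.add(k)

    return {
        "matches": matches,
        "clerical": clerical,
        "duplicate_data": dup_data,
        "duplicate_reference": dup_ref,
    }


# Toy weight function (hypothetical): closer numbers score higher.
result = match_block(
    [10, 20], [10, 11, 30], lambda d, r: 10 - abs(d - r),
    match_cutoff=8, clerical_cutoff=5,
)
print(result["matches"])              # [(0, 0)]
print(result["duplicate_reference"])  # {1}
```

In the toy run, data record 0 is assigned to reference record 0, and reference record 1 also scores above the cutoff in the same row, so it is flagged as a potential duplicate. Data record 1 falls below both cutoffs and remains a residual.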

The matches, clerical review cases, and duplicates are removed from consideration in subsequent passes. This prevents a record from being matched to different records in each pass. The residuals from a pass participate in the next pass.
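
The way residuals flow from one pass to the next can be sketched as follows. This is an illustrative simplification, not the product's code: `run_passes`, the blocking-key functions, the toy `weight` function, and the single `cutoff` are all hypothetical, and clerical review and duplicate handling are omitted to keep the loop short.

```python
from collections import defaultdict


def run_passes(data, reference, passes, weight, cutoff):
    """Run each pass over the records left unmatched by earlier passes.

    passes: list of blocking-key functions, most restrictive first.
    """
    remaining_data = set(range(len(data)))
    remaining_ref = set(range(len(reference)))
    matched = {}  # data index -> (reference index, pass number)

    for pass_no, block_key in enumerate(passes, start=1):
        # Group only the residual records of both sources by blocking key.
        blocks = defaultdict(lambda: ([], []))
        for i in sorted(remaining_data):
            blocks[block_key(data[i])][0].append(i)
        for j in sorted(remaining_ref):
            blocks[block_key(reference[j])][1].append(j)

        for d_idx, r_idx in blocks.values():
            # Candidate pairs in this block, best weight first.
            pairs = sorted(
                ((weight(data[i], reference[j]), i, j)
                 for i in d_idx for j in r_idx),
                key=lambda p: (-p[0], p[1], p[2]),
            )
            for w, i, j in pairs:
                if w > cutoff and i in remaining_data and j in remaining_ref:
                    matched[i] = (j, pass_no)
                    # Matched records take no part in later passes.
                    remaining_data.discard(i)
                    remaining_ref.discard(j)

    return matched, remaining_data, remaining_ref


def weight(a, b):
    # Toy composite weight: number of agreeing columns.
    return sum(a[k] == b[k] for k in ("name", "zip", "year"))


data = [
    {"name": "ann", "zip": "111", "year": 1990},
    {"name": "bob", "zip": "222", "year": 1985},
]
reference = [
    {"name": "ann", "zip": "111", "year": 1990},
    {"name": "bob", "zip": "222", "year": 1958},  # transposed year
]
passes = [
    lambda r: (r["zip"], r["year"]),  # pass 1: restrictive blocking
    lambda r: r["zip"],               # pass 2: wider net
]
matched, left_data, left_ref = run_passes(data, reference, passes, weight, 1)
print(matched)  # {0: (0, 1), 1: (1, 2)}
```

The first record matches in pass 1. The second record has an error in a blocking column, so it falls through as a residual and is found only when pass 2 blocks on fewer columns.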

It is theoretically possible that a record could have matched a better record in a subsequent pass, but that pairing is never found because the record was already removed from consideration. In practice, try to make the early passes the gold-plated passes, with the most restrictive blocking criteria. In subsequent passes, the net is widened to find records that have errors in the most discriminating columns. Consequently, a record that matches on an early pass is generally a much better match than anything that would be found later.

The decision to remove matches from consideration in future passes makes data management much easier. If each pass produced a different set of possibilities, you would have to arbitrate the results from each set to decide on the correct outcome. This, of course, could remove matches from other sets. Thus, by keeping only one match for each record, performance is greatly improved, the quality of the matches is excellent, and user interaction is dramatically reduced.

If it is important to retain matches from a number of passes, you can specify a number of single-pass match stages and arbitrate the differences by grouping all match results into sets.
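
One way to picture the grouping of match results into sets is to treat every matched pair from every single-pass stage as a link and collect the linked records into connected groups. The sketch below does this with a small union-find structure. It is only an assumption about how such grouping could be done outside the product: `group_matches` and the record identifiers are hypothetical, and the arbitration rule applied to each set afterward is left to you.

```python
def group_matches(pass_results):
    """Merge matched pairs from independent passes into sets of record IDs.

    pass_results: one list of (record_id, record_id) pairs per pass.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    for pairs in pass_results:
        for a, b in pairs:
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=min)


# Two hypothetical single-pass stages that disagree about record B.
pass_results = [
    [("A", "B")],
    [("B", "C"), ("D", "E")],
]
sets = group_matches(pass_results)
print(len(sets))  # 2
```

Here records A, B, and C end up in one set because B was matched to a different record in each pass, and that set is the one that needs arbitration. D and E form a second set with no conflict.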