IBM Information Server, Version 8.1

The matching algorithm

The matching algorithm can be summarized as follows.

The matches, clerical review cases, and duplicates are removed from consideration in subsequent passes. This prevents a record from being matched to different records in each pass. The residuals from a pass participate in the next pass.
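The pass-by-pass removal described above can be illustrated with a minimal Python sketch. This is not the QualityStage implementation: the function name `run_passes`, the record layout, and the two example passes are hypothetical, weights and clerical review are omitted, and a simple yes/no comparison stands in for the real scoring.

```python
from collections import defaultdict

def run_passes(records, passes):
    """Run matching passes in order (hypothetical sketch).

    records: dict mapping record id -> dict of column values.
    passes:  list of (block_key, is_match) function pairs,
             ordered from most to least restrictive blocking.
    Returns (matches, residual_ids).
    """
    remaining = set(records)
    matches = []  # (pass number, master id, duplicate id)
    for pass_no, (block_key, is_match) in enumerate(passes, start=1):
        # Group only the records still under consideration into blocks.
        blocks = defaultdict(list)
        for rid in sorted(remaining):
            blocks[block_key(records[rid])].append(rid)
        matched = set()
        for ids in blocks.values():
            for i, a in enumerate(ids):
                if a in matched:
                    continue
                for b in ids[i + 1:]:
                    if b not in matched and is_match(records[a], records[b]):
                        matches.append((pass_no, a, b))
                        matched.update((a, b))
        # Matched records are removed; the residuals go to the next pass.
        remaining -= matched
    return matches, remaining

# Hypothetical sample data, for illustration only.
records = {
    1: {"name": "SMITH", "zip": "02139", "phone": "5551234"},
    2: {"name": "SMITH", "zip": "02139", "phone": "5551234"},
    3: {"name": "SMYTH", "zip": "02139", "phone": "5551234"},
    4: {"name": "JONES", "zip": "10001", "phone": "5559999"},
}
same_phone = lambda r, s: r["phone"] == s["phone"]
passes = [
    (lambda r: (r["name"], r["zip"]), same_phone),  # restrictive blocking
    (lambda r: r["zip"], same_phone),               # wider net
]
matches, residuals = run_passes(records, passes)
```

In this sample, records 1 and 2 match in the first pass and are removed. Record 3 would have matched record 1 under the wider blocking of the second pass, but record 1 is no longer under consideration, so record 3 remains a residual.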

It is theoretically possible that a record could have matched a better record in a subsequent pass, but that match is never found because the record was already removed from consideration. In practice, try to make the early passes the gold-plated passes, that is, the passes with the most restrictive blocking criteria. In subsequent passes, the net is widened to try to find records that have errors in the most discriminating columns. Consequently, a match found in an early pass is almost always a much better match than anything that would be found later.

The decision to remove matches from consideration in future passes makes data management much easier. If each pass produced a different set of possibilities, you would have to arbitrate the results from each set to decide on the correct outcome. This, of course, could remove matches from other sets. Thus, keeping only one match for each record greatly improves performance, yields excellent match quality, and dramatically reduces user interaction.

If it is important to retain matches from a number of passes, you can specify a number of single-pass match stages and arbitrate the differences by grouping all match results into sets.
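One way to group the results of several independent single-pass stages into sets is sketched below in Python. It assumes that "grouping into sets" means merging every pair that shares a record (a transitive closure computed with union-find). That interpretation, the function name `group_matches`, and the sample pairs are assumptions for illustration, not part of the product.

```python
def group_matches(pairs):
    """Merge matched pairs from any number of passes into disjoint sets."""
    parent = {}

    def find(x):
        # Locate the representative of x's set, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sets = {}
    for x in list(parent):
        sets.setdefault(find(x), set()).add(x)
    return sorted(sorted(s) for s in sets.values())

# Hypothetical pairs reported by two independent single-pass stages.
stage_one_pairs = [(1, 2)]
stage_two_pairs = [(1, 3), (2, 3), (4, 5)]
groups = group_matches(stage_one_pairs + stage_two_pairs)
```

Each resulting set can then be reviewed as a unit to decide which matches to keep.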


This topic is also in the IBM WebSphere QualityStage User Guide.

Last updated: 2008-09-30