Match column selection in DataStage®
For any given pass, you might or might not want to use some match columns as blocking columns. In general, if a column is a blocking column in a pass, you do not also make it a match column in that pass, because every record pair in a block already agrees on that column.
If all passes use the same comparisons, then the weights generated are identical for each pass. You might want to make some blocking columns match columns to keep the comparisons the same across all passes. Keeping the comparisons the same might make setting cutoff weights easier, but it might also reduce statistical accuracy.
Specifying only blocking columns and no matching columns creates an exact match. All record pairs agreeing on the blocking columns are considered to be matches and duplicates.
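The exact-match behavior described above can be sketched outside the product. The following is a minimal illustration in Python, not DataStage code; the function name `exact_match_pairs`, the column names, and the sample records are all hypothetical. It groups records by their blocking-column values and treats every pair inside a block as a match, because no match columns are available to separate them.

```python
from collections import defaultdict
from itertools import combinations


def exact_match_pairs(records, blocking_columns):
    """Return every record-id pair that agrees on all blocking columns.

    With no match columns, agreeing on the blocking columns is the only
    test, so each pair inside a block is reported as a match.
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        # The block key is the tuple of values in the blocking columns.
        key = tuple(rec[col] for col in blocking_columns)
        blocks[key].append(rec_id)

    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical sample data, keyed by record id.
records = {
    1: {"zip": "02139", "family_name": "SMITH"},
    2: {"zip": "02139", "family_name": "SMITH"},
    3: {"zip": "02139", "family_name": "SMYTHE"},
    4: {"zip": "10001", "family_name": "SMITH"},
}

narrow = exact_match_pairs(records, ["zip", "family_name"])
wide = exact_match_pairs(records, ["zip"])
```

In this sketch, blocking on both columns pairs only records 1 and 2, while blocking on `zip` alone also pairs record 3 with each of them, which shows how much the choice of blocking columns matters when no match columns are present.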
In a two-file match, it is a good idea to compare all columns that the two files have in common. People often want to omit columns that are not reliable. However, it is often useful to include an unreliable column and assign it a low m-probability, so that a mismatch on that column carries little penalty.
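To see why a low m-probability softens the penalty, the following sketch computes agreement and disagreement weights with the standard probabilistic record-linkage formulas, log2(m/u) and log2((1 - m)/(1 - u)). It assumes those formulas apply to the weights discussed here; the function names and the probability values are illustrative only.

```python
import math


def agreement_weight(m, u):
    """Weight added when a column agrees (assumed formula: log2(m / u))."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when a column disagrees (assumed: log2((1-m) / (1-u)))."""
    return math.log2((1 - m) / (1 - u))


# Illustrative values only: the same u-probability, two different m-probabilities.
u = 0.01
reliable_m = 0.9    # column that is usually correct in true matches
unreliable_m = 0.5  # column that is often wrong even in true matches

reliable_penalty = disagreement_weight(reliable_m, u)
unreliable_penalty = disagreement_weight(unreliable_m, u)
```

With these values, the disagreement weight is roughly -3.3 for the reliable column and roughly -0.99 for the unreliable one. The unreliable column still contributes a positive weight when it agrees, but a mismatch on it costs much less.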
Decreasing the number of match comparisons might result in more matches but might also reduce the quality of some of the matches. Fewer comparisons can decrease the ability of the matching process to differentiate the correct matches from the incorrect matches.
When you use a blocking column that contains a manufactured or encoded value, such as a Soundex code or part of a value (for example, the first three characters of a family name), the underlying values are not necessarily the same. If matching the underlying values must be part of the consideration, be sure to include the underlying values as a match comparison.
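The point about encoded blocking values can be illustrated with the partial-value case mentioned above. This is a minimal sketch, not DataStage code: the helper names and sample names are hypothetical, and a plain case-insensitive equality test stands in for whatever match comparison you would configure on the full column.

```python
def block_key(family_name):
    """Blocking value: the first three characters of the family name."""
    return family_name.strip().upper()[:3]


def in_same_block(name_a, name_b):
    """True when both names produce the same blocking value."""
    return block_key(name_a) == block_key(name_b)


def underlying_values_agree(name_a, name_b):
    """Stand-in match comparison on the full, underlying value."""
    return name_a.strip().upper() == name_b.strip().upper()


# Hypothetical pairs: both fall in block "JOH", but only one truly agrees.
different_names = ("Johnson", "Johnston")
same_name = ("Johnson", "JOHNSON")

different_blocked = in_same_block(*different_names)
different_agree = underlying_values_agree(*different_names)
same_blocked = in_same_block(*same_name)
same_agree = underlying_values_agree(*same_name)
```

Both pairs land in the same block, so blocking alone cannot tell them apart. Only the comparison on the full family name separates "Johnson" from "Johnston", which is why the underlying column needs its own match comparison when that distinction matters.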