Blocking examples

The best blocking columns have the largest possible number of distinct values and the highest reliability. Use these general guidelines to select useful blocking columns.
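One rough way to compare candidate columns is to count their distinct values and check how often they are populated. The following Python sketch is illustrative only: the function name, the record layout, and the use of fill rate as a stand-in for one aspect of reliability are all assumptions, not part of the product.

```python
def profile_column(records, column):
    """Return (distinct_value_count, fill_rate) for one candidate blocking column."""
    values = [rec.get(column) for rec in records]
    # Treat None and empty strings as missing values (an assumption of this sketch).
    present = [v for v in values if v not in (None, "")]
    fill_rate = len(present) / len(values) if values else 0.0
    return len(set(present)), fill_rate


# Hypothetical sample records.
records = [
    {"national_id": "A1", "gender": "F"},
    {"national_id": "B2", "gender": "M"},
    {"national_id": "", "gender": "F"},
    {"national_id": "C3", "gender": "M"},
]

id_profile = profile_column(records, "national_id")
gender_profile = profile_column(records, "gender")
```

In this sample, national_id has more distinct values than gender but is missing in one record. Fill rate does not reveal values that are present but wrong, so it is only a partial indicator of reliability.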

Individual identification numbers

Identification numbers are typically reliable. In a first pass, use individual identification numbers such as national identity numbers, medical record numbers, claim numbers, and so forth, even if the numbers are missing or in error in a sizable percentage of the records.

For example, suppose that the sources contain a national identity number in only 50 percent of the records. Pass 1 is blocked by national identity number, so the match skips all records with no national identity number and applies them to the second pass. Even so, a fairly large percentage of the records are matched easily in the first pass.
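The way a pass skips records with a missing blocking value can be sketched in Python as follows. This is a simplified illustration, not the product's implementation: the function block_pass, the column names, and the sample records are hypothetical, and the sketch only groups records into blocks without performing any matching.

```python
from collections import defaultdict


def block_pass(records, key_columns):
    """Group records by blocking key; records with a missing key value are skipped."""
    blocks = defaultdict(list)
    skipped = []
    for rec in records:
        key = tuple(rec.get(col) for col in key_columns)
        if any(v in (None, "") for v in key):
            skipped.append(rec)  # no usable blocking value in this pass
        else:
            blocks[key].append(rec)
    return dict(blocks), skipped


# Hypothetical sample: half of the records lack a national identity number.
records = [
    {"id": 1, "national_id": "123", "birth_year": 1980},
    {"id": 2, "national_id": "123", "birth_year": 1980},
    {"id": 3, "national_id": None, "birth_year": 1975},
    {"id": 4, "national_id": "", "birth_year": 1975},
]

# Pass 1 blocks on the identification number.
pass1_blocks, leftover = block_pass(records, ["national_id"])
# Pass 2 applies a different blocking column to the skipped records.
pass2_blocks, still_skipped = block_pass(leftover, ["birth_year"])
```

Records 1 and 2 share a block in the first pass. Records 3 and 4 are skipped there and fall into a common block in the second pass.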

If there are several identification numbers, use them on the first two passes. After that, try other columns. Identification numbers are ideal for blocking columns, because they partition the records into many sets.

Birth dates

Birth dates are excellent blocking columns.

For example, by using the Transformer stage, you can separate birth dates into these columns: BirthYear, BirthMonth, and BirthDay. For larger sources (over 100,000 records), use all three columns as first-pass blocking columns. For smaller sources, use BirthYear, BirthMonth, and an additional column such as Gender. Subsequent passes can use blocks containing BirthDay.
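Outside the Transformer stage, the same split and the size-based choice of first-pass columns can be sketched in Python. The function name, the record layout, and the constant name are assumptions made for illustration; only the 100,000-record guideline comes from the text above.

```python
from datetime import date

# Guideline from the text: sources over 100,000 records count as larger sources.
LARGE_SOURCE_THRESHOLD = 100_000


def first_pass_key(record, source_size):
    """Build a first-pass blocking key from a birth date, depending on source size."""
    birth = record["birth_date"]
    birth_year, birth_month, birth_day = birth.year, birth.month, birth.day
    if source_size > LARGE_SOURCE_THRESHOLD:
        # Larger sources: block on all three date parts.
        return (birth_year, birth_month, birth_day)
    # Smaller sources: block on year, month, and an additional column such as gender.
    return (birth_year, birth_month, record["gender"])


# Hypothetical sample record.
person = {"birth_date": date(1980, 5, 17), "gender": "F"}

large_key = first_pass_key(person, source_size=250_000)
small_key = first_pass_key(person, source_size=5_000)
```

The coarser key for smaller sources keeps blocks from becoming too sparse, and BirthDay remains available for a later pass.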

Event dates

Event dates, such as an accident date, claim date, hospital admission date, and so on, are useful as blocking columns.

Names

A phonetic encoding (such as Soundex or NYSIIS codes) of the family name is a useful blocking column. For large sources, combine this code with the first letter of the given name or with the birth year. Remember that different cultures use different conventions for family names, so do not rely exclusively on family names for blocking.
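To show how a phonetic code groups spelling variants, the following Python sketch implements the common American Soundex rules and combines the code with the first letter of the given name. It is an illustration under stated assumptions: the function names are hypothetical, only the letters A to Z are handled, and the encoding that the product applies might differ in detail.

```python
# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code, or "" if no letters A-Z."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


def name_block_key(family_name, given_name):
    """Combine the family-name code with the first letter of the given name."""
    return (soundex(family_name), given_name[:1].upper())


key_a = name_block_key("Smith", "John")
key_b = name_block_key("Smyth", "jon")
```

Both spellings produce the same key, so the two records land in the same block even though the family names differ.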

Addresses

Postal addresses present a wealth of information for blocking. For example, postal codes and a phonetic encoding (Soundex or NYSIIS) of street name or city name are all excellent choices.