Blocking in DataStage®
Blocking is an important factor in match performance and efficiency.
Blocking provides a method of focusing the scope of record pairs to examine. For sources of reasonable size, it is infeasible to compare all record pairs, because the number of possible pairs is the product of the number of records on each source. Blocking provides a method of limiting the number of pairs being examined.
Blocking partitions the sources into mutually exclusive and exhaustive subsets, and the matching process searches for matches only within a subset. If the subsets are designed to bring together pairs that have a higher likelihood of being matches and ignore those that are less likely matching pairs, successful matching becomes computationally feasible for large data volumes.
Records in a block already match exactly on one or several blocking columns. Blocking is an essential step to effective matching in your data cleansing project.
Blocking considerations
- The computational cost (examining too many records) that blocking helps to resolve
- The potential to increase the rate of false negatives (when a record pair that does represent the same entity is not a match) because the records are not members of the same block
- Choose columns with reliable data.
- Choose columns that make business sense to meet your objective. If you want to identify unique customers, then blocking by house number is not the best choice.
- The character discrete investigate reports help you select the blocking columns. The reports tell you how often a column is populated. If you choose columns with reliable data, then you are truly grouping like records, because the data values are reliable.
- Choose columns with a good distribution of values.
- Some columns do not have many possible values. For example, gender, as a column value, typically does not make blocks small enough.
- Sometimes columns do not have a good distribution of values. For example, if your data is from only a few states or provinces, then the states or provinces might not be the best columns to use.
- A combination of columns might be preferable.
- Use enough columns to keep the block size manageable.
Blocking analogy
You have the task of matching into pairs many recently washed socks that are in a box. You might select one sock and search through the entire box, sock by sock to find the partner of the sock. But most likely you decide to seek a more efficient method. You can sort or “block” socks by characteristic. You can sort by color first. Then you need to compare only white socks to white socks and not waste time comparing a white sock to a blue sock.
What if you found one pink sock that was potentially white originally but is now discolored? Due to color error, you do not put the pink sock with the white socks in your first pass of sorting socks. Using match terms, the pink sock does not make the block of white socks. In subsequent match passes, after you sort by color, you sort the remaining, unpaired socks by size and shape. In one of these subsequent passes, you might find the most likely match for the pink sock. Similarly, multiple passes help overcome the problem of records not making the correct block group.
Blocking guidelines
- Use blocking columns like sort keys
- All records that have the same value in the blocking columns are eligible for comparison during the matching phase.
- Make blocks as small as possible
- One to two hundred records per source is a good size. Efficiency decreases as block size increases. Blocking partitions the sources into subsets that make computation feasible. Using small blocks results in better system performance when processing records. To create an effective blocking strategy, use several columns for blocking in each pass.
- Implement a blocking strategy that is consistent with your matching goals
- Although small blocks of several columns are preferable, a blocking scheme that is highly restrictive can create too many blocks. The problem with too many blocks is that records that are potential targets for comparison are distributed in multiple blocks. If the records are distributed in multiple blocks, then the records are not compared within the pass, and your match results might be compromised.
- Avoid block overflow
- Block overflow occurs if, during a pass, more records are grouped in a particular block than the
number that you specified for the block overflow limit. When block overflow occurs, all records in
the block are skipped by the pass.
You can configure the block overflow setting. The default setting for the block overflow is 10,000 records. For a two-source match process, many-to-one match, the reference block size is the only block that is constrained by the overflow setting. If a block overflow occurs, examine the blocking criteria. Instead of raising the block overflow limit immediately, understand why the overflow occurs. Determine if one of the blocking columns has many different values (high cardinality).
If you raise the block overflow limit without understanding that you have a problem with your data, you might inadvertently hide a problem with your data or your blocking strategy. Raise the block overflow if you know why the block overflow occurred and if you have a logical reason to do so.
- Define missing values for blocking columns
- Define missing values for blocking columns. Blocks that have columns with missing values are skipped in the matching process. Convert all generic stand-ins for missing values (such as UNKNOWN or 99999) to nulls. When missing values are not converted to nulls, the matching process does not identify the values as missing. As a result, costly block overflows can occur. For example, if a national identity number is present in only half the records but the missing values are reported as spaces instead of nulls, the blank numbers form one large block. The large block might cause excessive computation and block overflow.
- For sources with limited information, use a reverse Soundex code
- The reverse Soundex is formed by looking at the name backwards and computing a Soundex. For example, the reverse of JONES would be SENOJ. Because the Soundex algorithm preserves the first letter, running a reverse Soundex allows for errors at the beginning of names.
Blocking examples
- Individual identification numbers
-
Identification numbers are typically reliable. In a first pass, use individual identification numbers such as national identity numbers, medical record numbers, claim numbers, and so forth, even if the numbers are missing or in error in a sizable percentage of the records.
For example, sources contain a national identity number in 50 percent of the records. Pass 1 is blocked by national identity number. Match skips all records with no national identity number. The skipped records are applied to the second pass. However, a fairly large percentage of the records are matched easily.
If there are several identification numbers, use them on the first two passes. After that, try other columns. Identification numbers are ideal for blocking columns, because they partition the records into many sets.
- Birth dates
-
Birth dates are excellent blocking columns.
For example, by using the Transformer stage, you can separate birth dates into these columns: BirthYear, BirthMonth, and BirthDay. For larger sources (over 100,000 records), use all three columns as a first-pass blocking column. For smaller sources, use BirthYear, BirthMonth, and an additional column such as Gender. Subsequent passes can use blocks containing BirthDay.
- Event dates
-
Event dates, such as an accident date, claim date, hospital admission date, and so on, are useful as blocking columns.
- Names
-
A phonetic encoding (such as Soundex or NYSIIS codes) of the family name is a useful blocking column. For large sources, combine this code with the first letter of the given name or birth year. Remember, different cultures use different conventions for family names, so do not rely exclusively on them.
- Addresses
-
Postal addresses present a wealth of information for blocking. For example, postal codes and a phonetic encoding (Soundex or NYSIIS) of street name or city name are all excellent choices.