Adding passes to a match specification in DataStage®

Add a pass to your match specification for each matching process that you want to run on your data.

About this task

Specify a pass's blocking columns, match commands, and cutoff values.

Procedure

  1. Click Add pass and create a pass.
  2. Click the pass's name to open its configuration settings and click Edit matches.
  3. Click Edit under Match specification columns to add blocking and matching columns.
  4. Click New block +. Specify the variable to use for your blocking column and the type of comparison to be done on it.
    1. Select Character comparison when the blocking column values are alphanumeric. Select Numeric comparison when the blocking column values are numeric only.

    Use blocking columns to create subsets or blocks of input data records that are likely to be associated. The records that have the same values in the blocking columns are compared to only one another. Blocks make the matching process faster and more efficient.

    If you want exact matching, specify only the blocking columns. If you specify only the blocking columns, all pairs of records that contain the same values in the blocking columns are considered matches.

  5. Click New match + to add a match command for each linkage variable you want to use. Select the data column that corresponds to your linkage variable and the comparison type. See Match comparisons in DataStage for a list of comparison types.
    1. In the m-prob field of the Command Options area, enter a value for the m probability (0 - 1). This value is the probability that two matching records have the same value for the linkage variable.
      The default value is 0.9.

      For important columns, use 0.999. For unimportant columns, use 0.8. The higher that you set the m-probability, the greater the penalty for values that do not match.

      The m-probability reflects the error rate of a column. For example, if values in a column do not match 10% of the time, set m-prob to 0.9. The m-probability also forces a column to be more important. For example, if you want a column to have a high penalty for non-matching values, set m-prob at a higher value such as 0.95. The higher value indicates that the values in that column are unlikely to be different and so different values should result in a high disagreement weight.

    2. In the u-prob field of the Command Options area, enter a value for the u-probability (0 - 1). This value is the probability that two nonmatching records happen to have the same value for the linkage variable.
      The default value is 0.01. For most data, use the default value of 0.01. The matching process computes an accurate u-probability by using the frequency data that contains information from all the values in a column. For age, use 0.02. For gender, use 0.5. The probability that the gender variable agrees at random is about 0.5.
    3. Optional: If your comparison choice allows reverse matching, select Reverse to assign the agreement weight when the columns disagree or the disagreement weight when the columns agree.
    4. Optional: If your selected comparison type supports vectors, you can click Vectors to display and compare column vectors. If you want to create vectors to use in your match specification, see Make Vector stage in DataStage.
    5. Specify other parameters, depending on the comparison type.
    6. Optional: In the Matching columns section under Input schema, click the three dots next to a column to specify a weight override. You can override weights by replacing the calculated weight with the weight that you specify or adding the weight you specify to the calculated weight with the Replace and Add options. Use the Scale option to replace values outside the range of your frequency with a specified weight, and scale the other values in proportion to the replaced weights. Use a weight override to avoid invalid matches for important values like telephone numbers and addresses by assigning negative weights to penalize missing or invalid values.
  6. In the Pass configuration window, specify cutoff values and an overflow value.

    Specify an overflow value to set the maximum number of records for a block. If the number of records in a block exceeds the overflow value, the records in that block will not be matched. Review Pass statistics during testing to identify if an overflow occurred and improve the blocking criteria or increase the overflow.

    Cutoffs are thresholds that determine how the result of each record pair comparison is classified. The composite weight from each comparison then determines, relative to the cutoffs, whether that record pair is considered a match, a clerical pair, a duplicate, or a nonmatch.

    Depending on the match type, you can set up to three of the following match cutoffs.
    • Match. Record pairs with composite weights equal to or greater than this cutoff are considered matches.
    • Clerical. Record pairs with composite weights that are less than the match cutoff but equal to or greater than the clerical cutoff can be reviewed to determine whether they are a match. Record pairs with composite weights that are less than the clerical cutoff are considered nonmatches and participate in the following pass. You can eliminate the clerical review category by setting the clerical cutoff equal to the match cutoff. Also, this cutoff is not available with the one-source independent or one-source transitive match type.
    • Duplicate. You can set a duplicate cutoff for use with only the two-source many-to-one-duplicate match type. This cutoff is optional and must be greater than the match cutoff. Each reference source record in a record pair that has a composite weight that is equal to or greater than the duplicate cutoff is categorized as a duplicate record.
  7. Apply your definitions and save your pass.
  8. To group records and specify treatment for them during matching, click the pass name and click Manage groups.