One-source Match stage

The One-source Match stage matches records from a single source file.

An example of grouping records might be that you locate all records that apply to the same individual, household, or event. In addition, you might deduplicate a file to group all invoices for a customer or merge a mailing list.

The One-source Match stage accomplishes the following actions.
  • Categorizes all records with weights greater than the match cutoff as a set of duplicates.
  • Identifies a master record by selecting the record within the set that matches to itself with the highest weight. The master record is associated with its set of duplicates.
  • Determines that records that are not part of a set of duplicates are nonmatched records. The nonmatched and master records are generally made available for the next pass.
  • Excludes duplicates in subsequent passes. However, you can choose the Independent match type if you want duplicates to be included in subsequent passes.

The output of the One-source Match stage can include master records, duplicates greater than the match cutoff, clerical duplicates, nonmatched records, and statistics about the results of the matching process.

Inputs to the One-source Match stage

The One-source Match stage accepts two input links: one for the source data, and one from the frequency information for that data. The One-source Match stage uses a one-source match specification to group and match the data.

To add source data to the One-source Match, you need links from the following sources:
  • The data that you want to find matches in. Any stage that produces output can be a source. However, as a best practice, standardize the data before you use the One-source Match stage.
  • The frequency information for that data, as generated by the Match Frequency stage. You can input frequency data from the Match Frequency stage as a part of the current job. You can also input frequency data that was output from a previously run Match Frequency job.

When you configure the stage, you must designate an existing match specification from the assets. This match specification must be of the one-source match type and be based on the column definitions of the data you are inputting to the One-source Match stage.

The One-source Match stage matches and groups your input data based on the match specification. You select the columns to output.

One-source Match stage workflow

The One-source Match stage requires a one-source match specification, standardized data, and frequency information to process data according to particular attributes.

A typical workflow for using the One-source Match stage includes the following tasks.
  • Standardize the source data.
  • Prepare a representative sample data set from the source data.
  • Use the Match Frequency stage to generate frequency information.
  • When the DataStage® flow with match related stage is imported, the related match specifications are imported in the assets as well. You can use those match specification files in the One-source Match stage.
  • Optional. If you want to reduce the amount of frequency data, you can run the Frequency Match stage job again. However, for this job run, use the match specification that you created for the one-source match. Using a one-source match specification limits the frequency data to only the columns that participate in the match job.
  • Create a DataStage flow that includes the One-source Match stage, with the source data and the frequency information as input.
  • Configure the One-source Match stage, which includes selecting the one-source match specification that you created.

Creating One-source Match stage jobs

The One-source Match stage job requires that you add the One-source Match stage to the job and link it to two source stages and up to five output stages.

Procedure

  1. From the palette, select the One-source Match stage and drag it onto the middle of the DataStage design canvas.
  2. From the palette, add two input stages. Add one for each of the following data sources.
    • Data source
    • Frequency information for the data source

    Typically, your source data comes from a file or database but you can also use other stages to preprocess the data before it is input to the One-source Match stage.

  3. Link the input stages in the following order.
    • Data
    • Data Frequency
  4. From the palette, add up to five output stages, one for each of the output options that you intend to use in the One-source Match stage.
    The following list shows the five output options.
    • Match
    • Clerical
    • Duplicate
    • Nonmatched
    • Match Statistics
    For the output stages, you can use any file, database, or processing stage.
  5. For the match output options that you intend to use, link the stages in the following order.
    1. Match
    2. Clerical
    3. Duplicate
    4. Nonmatched
    5. Match Statistics
    Note: Verify in the Link Ordering section on the Input tab and the Output tab that the link labels and link names are correct.
  6. Double-click an input stage, then configure it.
    1. In the Properties section of the Output tab, in the text box for File, type the file name, including the path, which the input data will read.
      If you used a stage other than a file stage for the input, select the appropriate input and define any additional properties that the source requires.
    2. In the Columns section of the Output tab, click Edit, then choose Importing existing data definitions. When you select the data definition for the frequency input, use the data definition that was created when the Frequency Match job was run.
    3. In the Properties section of the Input tab, in the text box for File, enter the file name to write the output data to.
    4. Click Apply and Return, then Save.
    5. Repeat these steps until all the input stages are configured.
  7. Double-click an output stage, then configure it.
    1. In the Properties section of the Input tab, in the text box for File, enter the file name to write the output data to.
      If you used a stage other than a file stage for the output, select the appropriate output and define the properties that the target requires.
    2. Click Apply and Return, then Save..
    3. Repeat these steps until all the output stages are configured.
  8. Optional: Rename the stages and links with meaningful names that reflect their functions in the job or project.

Configuring the One-source Match stage

After you create a job that contains this stage, select from available match specifications and other settings that match records in one data source.

  1. Open the DataStage flow that contains the One-source Match stage that you want to configure.
  2. Double-click the One-source Match stage.
  3. In the Match Specification field, click Browse and select a one-source match specification.
  4. To override any of the match cutoff values from the match specification that you selected, select Override match cutoffs, click Edit, then enter new values or job parameters in the Match or Clerical columns.
  5. To enter specific values for the override match cutoffs, click the pencil icon next to a value and enter the new value. Click the icon to take one of the following options.
    Option Action
    Use an existing job parameter. Select Insert Parameter and choose an existing parameter.
    Return to match cutoff values from the match specification. Select Use match specification.
  6. From the drop-down list for the Match Type, choose a match type.
  7. In the Match Outputs pane, select one or more outputs that you want to create. Be aware of the following conditions.
    • Each output that you select must have a corresponding output link.
    • No output can be sent to more than one link.
  8. In the Column section of the Output tab, for any one output , select Edit to modify output columns.
  9. Click Save.

Match types for the One-source Match stage

When you use the One-source Match stage, you select one of the following match types. For Two-source Match stage match types, see Match types for the Two-source Match stage.
  • Dependent.

    In a one-source dependent match, passes process the data sequentially. In each pass, groups are built around master records. The groups that are formed in all the passes for the same master record are combined to create the final group for the master.

    Each duplicate record in a group matches the group master record in one of the match passes. The master records and nonmatched records from a pass are made available for the subsequent pass. Duplicates are taken out of consideration so that they are not assigned to more than one group. Existing master records are given priority in group construction in subsequent passes.

  • Independent.

    Each pass in a one-source independent match processes all the input records. Like the one-source dependent match type, in each pass, groups are built around master records. But because each pass processes all records, a record can be a member of a group from more than one of the passes. (Similarly, a record can be a master in a group that was built in one pass while being a duplicate in a group that was built in another pass.)

    The groups from all the passes are merged so that groups that have a record in common form a single group. If record A is in a group with record B and record B is in a different group with record C, then those two groups are merged so that records A, B, and C are all in the same group. (A record ends up in no more than one group.) Groups are merged until all groups that have records in common are merged. At the pass level, the relationship that determines group membership is that of records matching a master record. However, for the merge process, the relationship is one of group membership. Thus, members in a group can be connected by a chain of relationships and do not necessarily all match a common master.

  • Transitive.

    Like one-source independent matches, each pass in a one-source transitive match also processes every record. But unlike a one-source independent match, the one-source transitive match type does not create pass-level groups. Instead, all record pairs that score above the match cutoff are used to produce the groups.

    Creating pass-level groups would discard the information that a record pair's score was above the match cutoff in a pass if each record ends up in a different group. The one-source transitive match type does not discard that information. It builds groups so that all records that score above the match cutoff in any pass are in the same group. For example, if record A and record B scored above the match cutoff in a pass and record B and record C scored above the match cutoff in a pass (possibly the same pass), then records A, B, and C are added to the same group. (A record ends up in no more than one group.) Like one-source dependent matches, members in a group can be connected by a chain of relationships and do not necessarily all match a common master. But the one-source transitive chain can extend further because it uses all the pairs that score above the match cutoff.

In most cases, choose the dependent match type, because you want duplicates removed from consideration so that they do not match to other records in subsequent passes.

However, the independent option is useful when you want to link people or organizations regardless of address. For example, you can link together all the locations where a doctor practices.

The transitive option is useful if you want to account for inconsistent data entry in fields that assist in duplicate identification, for example, date of birth and driver's license numbers.

An example of processing for the dependent and independent match types
The following example shows how to use the independent match type with the One-source Match stage. The table shows four records that describe the same person. You require that all records concerning the same person match without regard to address.
Table 1. Four records that describe the same person
Record Name Address Tax ID
1 William Nickson 123 Rodeo Drive 123456789
2 Bill Nixon 123 Rodeo Drive  
3 B Nickson 978 Sunset Blvd. 123456789
4 Nickson 456 Western Ave. 123456789
The matching process using this data yields different results depending on the match type that you choose:
  • Dependent
    • The first pass blocks and matches on Name and Address. Records 1 and 2 are considered a matched pair. Records 3 and 4 are considered nonmatched records.
    • If Record 2 (without the TaxID) is selected as the master, and Record 1 is considered a duplicate, then Record 1 is not available for the second pass.
    • If the second pass blocks and matches on Name and TaxID, then only Records 3 and 4 match. The result is two groups of matched records: Records 1 and 2, and Records 3 and 4.
  • Independent
    • The first pass results are the same as the dependent match. Records 1 and 2 are considered a matched pair. Records 3 and 4 are considered nonmatched records.
    • If Record 2 (without the TaxID) is selected as the master record in the second pass, the duplicate record, Record 1, is also compared to the rest of the records. When you block on Name and TaxID, records 1, 3, and 4 match. Since Record 1 matched Record 2 in the first pass, the output is one group with all four records linked.
An example of processing for the dependent and transitive match types
The following example shows how to use the transitive match type option with the One-source Match stage. The table shows six records that show a difference of one day between records of the same family name. You require that records of the same family name match if the dates have a difference of one day or less.
Table 2. Records that show a difference of one day between records of the same family name
Record Family name Date Given name
qsMatch
Type
qsMatch
Weight
qsMatch
Pass
Number
qsMatch
SetID
qsMatch
DataID
5 Clifford 19530831 Benn MP 0 1 5 5
7 Clifford 19530829 George DA 0 1 5 7
6 Clifford 19530830 George DA 0 1 5 6
8 Clifford 19530731 Thomas MP 0 1 8 8
9 Clifford 19530801 David DA 0 1 8 9
10 Clifford 19530802 David DA 0 1 8 10
The matching process that uses this data yields different results depending on the match type that you choose:
  • Dependent
    • The first pass blocks on Family Name and matches on Date by using a date tolerance of one day. Records 5 and 6 are considered a matched pair.
    • If Record 5 is selected as the master record, Record 6 is not available for the second pass and no other records match.
  • Transitive
    • The first pass blocks on Family Name and matches on Date by using a date tolerance of one day. Records 5 and 6 are considered a matched pair.
    • If Record 5 is selected as the master record, Record 6 is available for subsequent passes and is compared to the rest of the records. Records 6 and 7 are considered a matched pair. Because Record 5 matched Record 6 in the first pass, the result is one group in which all three records are linked. Records 5 and 6 are matched. Records 6 and 7 are matched. Therefore Records 5, 6, and 7 are within the same match set.

Match outputs for the One-source Match stage

Select from the following output options. Each option is output to a separate output link. Therefore, the number of output links corresponds to the number of output options that you select.
  • Match. The master records.
  • Clerical. The duplicates that fall in the clerical range.
  • Duplicate. The duplicate records that are greater than the match cutoff.
  • Nonmatched. The records that are not master, duplicate, or clerical records.
  • Match Statistics. Summary statistics about the matching results and statistics about the matching process for each match pass.

Use the Link Ordering section of the Output tab to associate the output options with specific output links. Check the Link Ordering section to ensure that the records for each output option that you select are output to the link to which you intend.

If you want, you can add other stages, such as the Funnel stage, to group some or all the output into a single file or table.

The columns that are available for output consist of all the input columns, plus more columns that are created by the match process.

The nonmatched output includes the following columns.
  • qsMatchDataID. The data record ID.
  • qsMatchType. The match ID for the record. One of:
    • MP. Master record.
      • DA. Duplicate record.
      • CP. Record that requires clerical review.
    • RA. Nonmatched record.
  • qsMatchSetId. The match set identifier.
The match, clerical, and duplicate output includes the previous three columns in addition to the following columns.
  • qsMatchWeight. The weight.
  • qsMatchPattern. The pattern.
  • qsMatchLRFlag. "L" for left, "R" for right.
  • qsMatchExactFlag. "X" if the match is exact.
  • qsMatchPassNumber. The number of the pass where the match was found.
If you select the Match Statistics output option, ensure that you use the default names for the output columns. The statistical output includes the following columns:
  • qsMatchPassNumber. The number of the pass where the match was found.
  • qsMatchStatType. The number used to identify the type of statistic.
  • qsMatchStatValue. The value for a particular statistic.
  • qsMatchWeight. The weight.