Two-source Match stage

The Two-source Match stage compares two sources of input data (reference records and data records) for matches.

The Two-source Match stage uses the following four sources of data for matches.
  • A data source
  • Frequency information about the data source, generated by the Match Frequency stage
  • A reference source
  • Frequency information about the reference source, generated by the Match Frequency stage

Inputs to the Two-source Match stage

The Two-source Match stage takes four input sources. The data and reference sources can be from any parallel database, file, or processing stage.

The four inputs to the Two-source Match stage come from the following sources.
  • Data source
  • Reference source
  • Frequency information about the data and reference source, generated by the Match Frequency stage. You can input frequency data from the Match Frequency stage as a part of the current job. You can also input frequency data that was output from a Match Frequency job that was run previously.

When you configure the stage, designate an existing two-source match specification from the assets based on the column definitions of the data and reference sources.

The Two-source Match stage matches and groups your input data based on the match specification. You select which columns to output.

Two-source Match stage workflow

The Two-source Match stage requires standardized data and reference data as source data, a two-source match specification, and frequency information for both sources.

A typical workflow for using the Two-source Match stage includes the following tasks.
  • Standardize the source data for the data source and the reference source.
  • Prepare representative sample data sets from the source data.
  • Use the Match Frequency stage to generate frequency information.
  • Optional. If you want to reduce the amount of frequency data that will be used in the Two-source Match job, you can run the Frequency Match stage job again. However, for this job run, select the two-source match specification that you created. Selecting the two-source match specification in the Frequency Match stage job limits the frequency data to only the columns that will participate in the match job.
  • Create a DataStage® asset that includes the Two-source Match stage, with data source, reference source, and the frequency information for each source as inputs.
  • Configure the Two-source Match stage, which includes selecting the two-source match specification that you created.

Creating Two-source Match stage jobs

A Two-source Match stage job requires that you add the Two-source Match stage to the job and link it to data and reference sources and output stages.

Procedure

  1. Click New asset +.
  2. Choose DataStage.
  3. From the palette, select the Two-source Match stage and drag it on the middle of the canvas.
  4. From the palette, add four input stages. Add one for each of the following data sources.
    1. Data source.
    2. Frequency information for the data source.
    3. Reference source.
    4. Frequency information for the reference source.
    Typically, your source data comes from a file or database but you can also use other stages to preprocess it before you input it to the Two-source Match stage.
  5. Link the input stages in the following order:
    1. Data
    2. Reference
    3. Data Frequency
    4. Reference Frequency
  6. From the palette, add up to seven output stages, one for each of the output options that you intend to use in the Two-source Match stage.
    The following list shows the seven output options.
    • Match
    • Clerical
    • Data Duplicate
    • Reference Duplicate
    • Data Nonmatched
    • Reference Nonmatched
    • Match Statistics
    For the output stages, you can use any file, database, or processing stage.
  7. For the match output options that you intend to use, link the stages in the following order.
    1. Match
    2. Clerical
    3. Data Duplicate
    4. Reference Duplicate
    5. Data Nonmatched
    6. Reference Nonmatched
    7. Match Statistics
    Note: Verify in the Link ordering section on the Input tab and the Output tab that the link labels and link names are correct.
  8. Double-click an input stage.
    1. In the Properties section of the Output tab, in the text box for File, type the file name, including the path, which the input data will read.
      If you used a stage other than a file stage for the input, select the appropriate input and define any additional properties that the source requires.
    2. In the Columns section of the Output tab, click Edit, then choose Importing existing data definitions.
      When you select the data definition for the frequency input, use the table definition that was created when the Frequency Match job was run.
    3. Click Apply and return, then click Save.
    4. Repeat these steps until all the input stages are configured.
  9. Double-click an output stage.
    1. In the Properties section of the Input tab, in the text box for File, enter the file name to write the output data to.
      If you used a stage other than a file stage for the output, select the appropriate output and define the properties that the target requires.
    2. Click Apply and return, then click Save.
    3. Repeat these steps until all the output stages are configured.
  10. Optional: Rename the stages and links with meaningful names that reflect their functions in the job or project.

Configuring the Two-source Match stage

After you create a job that contains this stage, select from available match specifications and other settings that determine matching records in two data sources.

  1. Open the DataStage flow that contains the Two-source Match stage that you want to configure.
  2. Double-click the Two-source Match stage.
  3. In the Match Specification field, click Browse and select a two-source match specification.
  4. To override any of the match cutoff values from the match specification that you selected, select Override cutoffs and enter new values or job parameters in the Match, Clerical, or Duplicate columns.
  5. To enter specific values for the override match cutoffs, click the pencil icon next to a value and enter the new value. Click the icon to take one of the following options.
    Option Action
    Use an existing job parameter. Select Insert Parameter and choose an existing parameter.
    Return to match cutoff values from the match specification. Select Use match specification.
  6. From the drop-down list for the Match Type, choose a match type.
  7. In the Match Outputs pane, select one or more outputs that you want to create. Be aware of the following conditions.
    • Each output that you select must have a corresponding output link.
    • No output can be sent to more than one link.
  8. Click Stage Properties.
    1. In the column section of the Output tab, for any one output, select Edit to modify output columns.
  9. Click Save.

Match types for the Two-source Match stage

When you use the Two-source Match stage, you select one of the following match types.
Many-to-one
Any reference source record can match many data source records. Any one data source record can match only one reference source record. For example, if 101 Main St. on the data source matches two records on the reference source: 101-199 Main St SW and 101-199 Main St SE, the first reference source record is the matched record and the second reference source record is not considered a match to this particular data source record. It is possible that the second reference source record will match another data source record.
Many-to-one multiple
Each reference source record having the same weight as the matched pair when it is scored against the data record is flagged as a duplicate record. Any one data source record might match more than one reference source record. For example, if 101 Main St. on the data source matches to two records on the reference source: 101-199 Main St SW and 101-199 Main St SE, one reference source record is the matched record and the other is the duplicate.
Many-to-one duplicate
Like the many-to-one multiple option, except that additional reference source records that match to a level beyond the duplicate cutoff value are flagged as duplicates. This functionality means that records with lesser weights than the match weight can be flagged as duplicates. For example, if 101 Main St on the data source matches to three records on the reference source: 101-199 Main St SW, 101-199 Main St SE, and 101 Main Rd, you get 101-199 Main St SW as the match, and both of the other addresses might be duplicates.
One-to-one
Matches a record on the data source to only one record on the reference source. A record on the reference source can match only to one data source record.

Match outputs for the Two-source Match stage

You can send records to different links by using one of the following options.
  • Match. The matched records for both inputs.
  • Clerical. The clerical review records for both inputs.
  • Data Duplicate. The duplicates in the data source.
  • Reference Duplicate. The duplicates in the reference source.
  • Data Nonmatched. The nonmatched records from the data input.
  • Reference Nonmatched. The nonmatched records from the reference input.
  • Match Statistics. Summary statistics about the matching results and statistics about the matching process for each match pass.

Use the Link ordering section of the Output tab to associate the output options with specific output links. Check the Link ordering section to ensure that the records for each output option that you select are output to the intended link.

If you want, you can add other stages, such as the Funnel stage, to group some or all the output into a single file or table.

The columns that are available for output consist of all the input columns, plus more columns that are created by the match process. If you select the Match Statistics output option, ensure that you use the default names for the output columns.