Two-source Match stage
The Two-source Match stage compares two sources of input data (reference records and data records) for matches.
- A data source
- Frequency information about the data source, generated by the Match Frequency stage
- A reference source
- Frequency information about the reference source, generated by the Match Frequency stage
Inputs to the Two-source Match stage
The Two-source Match stage takes four input sources. The data and reference sources can be from any parallel database, file, or processing stage.
- Data source
- Reference source
- Frequency information about the data and reference source, generated by the Match Frequency stage. You can input frequency data from the Match Frequency stage as a part of the current job. You can also input frequency data that was output from a Match Frequency job that was run previously.
When you configure the stage, designate an existing two-source match specification from the assets based on the column definitions of the data and reference sources.
The Two-source Match stage matches and groups your input data based on the match specification. You select which columns to output.
Two-source Match stage workflow
The Two-source Match stage requires standardized data and reference data as source data, a two-source match specification, and frequency information for both sources.
- Standardize the source data for the data source and the reference source.
- Prepare representative sample data sets from the source data.
- Use the Match Frequency stage to generate frequency information.
- Optional. If you want to reduce the amount of frequency data that will be used in the Two-source Match job, you can run the Frequency Match stage job again. However, for this job run, select the two-source match specification that you created. Selecting the two-source match specification in the Frequency Match stage job limits the frequency data to only the columns that will participate in the match job.
- Create a DataStage® asset that includes the Two-source Match stage, with data source, reference source, and the frequency information for each source as inputs.
- Configure the Two-source Match stage, which includes selecting the two-source match specification that you created.
Creating Two-source Match stage jobs
A Two-source Match stage job requires that you add the Two-source Match stage to the job and link it to data and reference sources and output stages.
Procedure
Configuring the Two-source Match stage
After you create a job that contains this stage, select from available match specifications and other settings that determine matching records in two data sources.
- Open the DataStage flow that contains the Two-source Match stage that you want to configure.
- Double-click the Two-source Match stage.
- In the Match Specification field, click Browse and select a two-source match specification.
- To override any of the match cutoff values from the match specification that you selected, select Override cutoffs and enter new values or job parameters in the Match, Clerical, or Duplicate columns.
- To enter specific values for the override match cutoffs, click the pencil
icon next to a value and enter the new value. Click the ⋮ icon to take one of
the following options.
Option Action Use an existing job parameter. Select Insert Parameter and choose an existing parameter. Return to match cutoff values from the match specification. Select Use match specification. - From the drop-down list for the Match Type, choose a match type.
- In the Match Outputs pane, select one or more outputs that
you want to create. Be aware of the following conditions.
- Each output that you select must have a corresponding output link.
- No output can be sent to more than one link.
- Click Stage Properties.
- In the column section of the Output tab, for any one output, select Edit to modify output columns.
- Click Save.
Match types for the Two-source Match stage
- Many-to-one
- Any reference source record can match many data source records. Any one data source record can
match only one reference source record. For example, if
101 Main St.
on the data source matches two records on the reference source:101-199 Main St SW
and101-199 Main St SE
, the first reference source record is the matched record and the second reference source record is not considered a match to this particular data source record. It is possible that the second reference source record will match another data source record. - Many-to-one multiple
- Each reference source record having the same weight as the matched pair when it is scored
against the data record is flagged as a duplicate record. Any one data source record might match
more than one reference source record. For example, if
101 Main St.
on the data source matches to two records on the reference source:101-199 Main St SW
and101-199 Main St SE,
one reference source record is the matched record and the other is the duplicate. - Many-to-one duplicate
- Like the many-to-one multiple option, except that additional reference source records that match
to a level beyond the duplicate cutoff value are flagged as duplicates. This functionality means
that records with lesser weights than the match weight can be flagged as duplicates. For example, if
101 Main St
on the data source matches to three records on the reference source:101-199 Main St SW
,101-199 Main St SE
, and101 Main Rd
, you get101-199 Main St SW
as the match, and both of the other addresses might be duplicates. - One-to-one
- Matches a record on the data source to only one record on the reference source. A record on the reference source can match only to one data source record.
Match outputs for the Two-source Match stage
- Match. The matched records for both inputs.
- Clerical. The clerical review records for both inputs.
- Data Duplicate. The duplicates in the data source.
- Reference Duplicate. The duplicates in the reference source.
- Data Nonmatched. The nonmatched records from the data input.
- Reference Nonmatched. The nonmatched records from the reference input.
- Match Statistics. Summary statistics about the matching results and statistics about the matching process for each match pass.
Use the Link ordering section of the Output tab to associate the output options with specific output links. Check the Link ordering section to ensure that the records for each output option that you select are output to the intended link.
If you want, you can add other stages, such as the Funnel stage, to group some or all the output into a single file or table.
The columns that are available for output consist of all the input columns, plus more columns that are created by the match process. If you select the Match Statistics output option, ensure that you use the default names for the output columns.