Designing match specifications in DataStage

When you group data with QualityStage stages, you must specify the criteria that are used to match records. Create match specification assets and test them with the Match designer.

The following sections describe how to design match specifications on the IBM Cloud Pak® for Data platform. For more detailed conceptual information on the matching process itself, see Matching data.

Creating a match specification asset

You can create a match specification as a reusable component to use in your DataStage® jobs.

  1. Open an existing project or create a project.
  2. Under Assets, click New Asset and select DataStage component from the available asset types.
  3. Select Match specification as the DataStage component type.
  4. Select the match type and click Create to open the Match designer. For more information about match types, see Match types for the One-source Match stage and Match types for the Two-source Match stage.

Preparing data

Before you can test a match specification, you must upload or create files that contain your sample data, frequency, and metadata.

  1. Click Configure to open the configuration settings for your match specification.
  2. Select a sequential file or data set to use as your Data sample data set. DOS newline and UNIX newline are supported as record delimiters in sequential files.
  3. Select a sequential file or data set to use as your Data frequency data set. DOS newline and UNIX newline are supported as record delimiters in sequential files. For more information, see Match Frequency stage. Click Save and return.
  4. Click Input schema to configure metadata for your sample.
  5. Select a data source. Metadata can be extracted from the sample data, but you can include more detail by using a data definition. See Defining data definitions in DataStage.
  6. Select a Default handling for missing weights. You can set missing values to be counted as zero, agreements, disagreements, or the average of the agreement and disagreement weights.
  7. Select a Maximum frequency value. The default value is 100.
  8. Under Execution environment for test, specify the environment in which to test the match specification.
  9. Click the Variable special handling tab to assign actions to perform on specific columns. The actions apply to all match passes for this specification.
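
The four options for Default handling for missing weights in step 6 can be illustrated with a small Python sketch. This is not QualityStage code: the function name, the option strings, and the example weights are hypothetical and serve only to show how each choice would change the weight that a comparison contributes when a value is missing.

```python
def resolve_missing_weight(agreement_weight, disagreement_weight, handling):
    """Return the weight to use when a compared value is missing.

    Hypothetical helper for illustration only; the option names mirror the
    choices described in the step above, not an actual product API.
    """
    if handling == "zero":
        # Missing values contribute nothing to the composite weight.
        return 0.0
    if handling == "agreement":
        # Treat a missing value as if the columns agreed.
        return agreement_weight
    if handling == "disagreement":
        # Treat a missing value as if the columns disagreed.
        return disagreement_weight
    if handling == "average":
        # Midpoint of the agreement and disagreement weights.
        return (agreement_weight + disagreement_weight) / 2
    raise ValueError(f"Unknown handling option: {handling}")


# Example weights are made up for the illustration.
for option in ("zero", "agreement", "disagreement", "average"):
    print(option, resolve_missing_weight(4.0, -2.0, option))
```

With these example weights, counting missing values as agreements raises the composite weight for a record pair, and counting them as disagreements lowers it.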

Adding passes

Add a pass to your match specification for each matching process that you want to run on your data. Specify a pass's blocking columns, match commands, and cutoff values. For more information, see Adding passes to a match specification in DataStage.
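
As a rough conceptual sketch of how the three parts of a pass fit together, the following Python example groups records by blocking columns, scores record pairs within each block, and classifies each pair against cutoff values. It is an assumption-laden illustration, not the Match designer's implementation: the function names, the compare callback that stands in for match commands, the column names, and the use of two cutoffs (labeled match and clerical here) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, blocking_columns):
    """Group records that share the same values in the blocking columns."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in blocking_columns)
        blocks[key].append(record)
    return blocks


def classify(weight, match_cutoff, clerical_cutoff):
    """Label a pair by comparing its composite weight with the cutoffs."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


def run_pass(records, blocking_columns, compare, match_cutoff, clerical_cutoff):
    """Compare only pairs inside the same block and classify each pair."""
    results = []
    for block in block_records(records, blocking_columns).values():
        for left, right in combinations(block, 2):
            weight = compare(left, right)  # stand-in for the match commands
            label = classify(weight, match_cutoff, clerical_cutoff)
            results.append((left["id"], right["id"], label))
    return results


# Made-up sample data and weights for the illustration.
sample = [
    {"id": 1, "zip": "10001", "name": "ann"},
    {"id": 2, "zip": "10001", "name": "ann"},
    {"id": 3, "zip": "10001", "name": "bob"},
    {"id": 4, "zip": "94105", "name": "ann"},
]


def name_compare(left, right):
    """Toy comparison: positive weight on agreement, negative otherwise."""
    return 5.0 if left["name"] == right["name"] else -3.0


pairs = run_pass(sample, ["zip"], name_compare, match_cutoff=4.0, clerical_cutoff=0.0)
print(pairs)
```

In this sketch, record 4 is never compared with records 1 through 3 because it falls into a different block, which is why the choice of blocking columns limits the comparisons that a pass makes.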

Testing passes

You can test match passes to identify how effectively they meet your matching goals and make adjustments as necessary. You can gain insight into your test results by looking at statistics and record weights. For more information, see Testing passes for a match specification in DataStage.

Provisioning

Click Provision to make the match specification available in Match stages such as the One-source Match stage, the Two-source Match stage, and the Match Frequency stage.