Match types in DataStage®

Matching provides several ways to process either a single source or two sources of data. You can apply the matching types on sequential files, data sets, records from relational databases, or as a real-time service.

Two-source matching in DataStage

Two-source matching involves a data source and a reference source.

The following types are available for two-source matching:

  • Two-source many-to-one
  • Two-source many-to-one, multiple
  • Two-source many-to-one, duplicate
  • Two-source one-to-one

For two-source matching, one or more columns on the data source must have equivalent columns on the reference source. For example, to match on family name and age, both sources require columns containing this information. The location, name, and length of a column in the data source can be different from its equivalent column in a reference source.

One-to-one matching and examples

In one-to-one matching, one record from the data source can be assigned to only one record from the reference source, and vice versa. Each record pertains to a single individual or event.

When a reference record is matched, the reference record cannot match to any other records and is removed from subsequent passes. After record pairs are scored and possible matches are determined, a linear-sum-assignment algorithm is used to optimize the assignment of matched pairs.

Examples of one-to-one matching are as follows:

  • A sample of individuals counted in the United States Census are matched to a Post Enumeration Survey. The test objective is to determine which individuals and households are present in both the census and survey.
  • The National Highway Traffic Safety Administration matches highway traffic accident reports, which are completed by law enforcement officers, to injury records, which are completed by emergency medical service and hospital personnel. Match provides a way to measure how highway safety countermeasures affect medical outcomes.
  • Matching files that contain administrative data are useful in a number of areas. For example, if juries are chosen from both driver license records and voter registration rolls, the two files can be matched to identify individuals that are listed in both files. Steps can be taken to ensure that individuals that are listed both files are not twice as likely to be selected for a jury as individuals listed in only one file.
  • Matching administrative files can also provide data that is missing from one set of files. For example, a driver's license file could provide information regarding age, gender, or race, while a voter's file might provide a national identity number.

Many-to-one matching and examples

You use many-to-one matching to match a single data source to a reference source. A reference record can match to many records on the data source.

The many-to-one, multiple type reports on all reference records that tie for the best match score for a data record. One reference record is chosen as a match and the others that have an identical weight to the matched pair are flagged as duplicates. The many-to-one, duplicate type identifies the best record as a match and flags all other reference records, that have a weight above the duplicate cutoff threshold, as duplicates.

Examples of many-to-one matching are as follows:

  • Matching a purchased mailing list against a list of customers. The mailing list can have duplicates, or many lists might be integrated. Consequently more than one record on the mailing list can match to the list of customers.
  • Matching a file of sales activities or accounts receivable to the file of customers. There might be more than one sale for any given customer.
  • Address matching to enrich data records. You can match a data source containing postal addresses to a reference source to obtain geographic coordinates, census tract numbers, special area codes, and other information.

    For example, 103 Main St. can match to a record like 101-199 Main St. There can be more than one user record within that block; therefore multiple records on the data source might match to a single record on the reference source.

  • Matching a file that contains data about park visitors to census reference sources to obtain the census tracts where the visitors reside. You can use the information to prepare a demographic profile of the visitors.
  • Matching a customer file to a geographic coordinate file to produce an automated map that shows the location of the customers.
  • Matching a file of ambulance calls or fires to a file containing fire district codes to produce a map that shows the locations of such events.

One-source matching in DataStage

One-source matching groups records that have similar attributes into sets.

The following types are available for one-source matching:

  • One-source dependent
  • One-source independent
  • One-source transitive
When you deduplicate a data source, you first identify groups of records that share common attributes. The identification allows you to correct, merge, or eliminate the duplicate entries. But one-source matching is not only for deduplication. One-source matching also provides the capability of grouping records even in the following situations:
  • The records do not have unique identifiers, such as tax identification numbers.
  • The content is subject to error.

One-source matching determines the relationship among the match passes, the records that are processed by each of the passes, and how the groups are formed. In most cases, you choose the dependent match type, because you want duplicates removed from consideration, so that they do not match to other records in subsequent passes. However, the independent match type is useful when you want to link people or organizations regardless of address. For example, you can link each doctor with all the locations where the doctor works. The transitive match type is useful if you want to account for inconsistent data entry in columns that assist in duplicate identification, for example, date of birth and driver's license numbers.

The passes process and group records based on the type that you select:

Dependent
The passes process the data sequentially. In each pass, groups are built around master records. The groups that are formed in all the passes for the same master record are combined to create the final group for the master. Each duplicate record in a group matches the group master record in one of the match passes. The master records and nonmatched records from a pass are made available for the subsequent pass. Duplicates are taken out of consideration, so that they are not assigned to more than one group. Existing master records are given priority in group construction in subsequent passes.
Independent
Each pass processes all the input records. Like the one-source dependent match type, in each pass, groups are built around master records. But because each pass processes all records, a record can be a member of a group from more than one of the passes. (Similarly, a record can be a master in a group that is built in one pass while being a duplicate in a group that is built in another pass.) The groups from all the passes are merged, so that groups that have a record in common form a single group.

If record A is in a group with record B, and record B is in a different group with record C, then the two groups are merged so that records A, B, and C are all in the same group. (A record ends up in no more than one group.) Groups are merged until all groups that have records in common are merged. At the pass level, the relationship that determines group membership is that of records matching a master record. However, for the merge process, the relationship is one of group membership. Therefore, members in a group can be connected by a chain of relationships and do not necessarily all match a common master.

Transitive
Like the one-source independent match type, the one-source transitive match type processes all the input records, in each of the passes. But unlike the one-source independent match type, the one-source transitive match type uses all pairs that have a score above the match cutoff in a pass, not just pairs that are grouped in a pass. Using only grouped pairs discards high scoring record pairs if the records end up in different groups.

The one-source transitive match type does not discard the information about high scoring record pairs that are in different groups. The one-source transitive match type builds groups so that all records that score above the match cutoff in any pass are in the same group. For example, if record A and record B scored above the match cutoff in a pass, and record B and record C scored above the match cutoff in a pass (possibly the same pass), then records A, B, and C are added to the same group. (A record ends up in no more than one group.)

Like the one-source independent match type, members in a group can be connected by a chain of relationships and do not necessarily all match a common master. But the one-source transitive match type chain can extend further because it uses all the pairs that score above the match cutoff.

Master record selection and group construction in DataStage

Match processes one block of records at a time. Each record in a block of records is compared to every other record in the block. A master record is designated. Each master record is used to create a group. When the block is completely read, another block is read, until the end of the input data is reached.

The following list describes the process:
  1. All pairs of records within a block are scored.
  2. From the records that have not been added to a group, the record pair with the highest composite weight over the cutoff thresholds is selected.
  3. The record from the pair with the highest score when compared to itself is designated as the master record.
  4. Other records are then added to create the group. A record whose composite weight when compared to the master record is above the match cutoff is assigned to the group as a match duplicate. A record whose weight is between the clerical and match cutoffs is assigned to the group as a clerical duplicate.
  5. The process is repeated within the block until there are no remaining pairs of ungrouped records whose weight is above either of the cutoffs.

Examples of one-source matching

One-source matching is often used to deduplicate lists and identify groups of related records

Examples of one-source matching are as follows:
  • Grouping all hospital charges for a patient.
  • Finding all members of a household. All persons at a specific address or building are grouped.
  • Deduplicating a file of customer invoices to identify individual customers.
  • Deduplicating a mailing list that was created by merging multiple files.
  • Removing duplicate records that pertain to the same individual, household, event, product, or part, from a data source before updating or creating a data warehouse.