Matching data in DataStage
Matching in QualityStage® is a probabilistic record linkage system that automates the process of identifying records that are likely to represent the same entity. The matching process improves the integrity of your data.
With matching, you can identify duplicates in your data, group records on any set of criteria, and build relationships among records in multiple files despite variations in the representation of the data and missing or inaccurate information.
Matching goals in DataStage
Some typical goals for matching include the following ones:
- Identifying duplicate records for entities such as individuals, companies, suppliers, products, or events
- Grouping records with the same or similar values, such as householding
- Enriching existing data with new attributes from external sources
- Locating entities in a data warehouse
- Reconciling inventory or transactions
Nest steps:
After you identify your goals, you define the requirements of what constitutes a match. You choose which columns to compare and how to compare the columns.
You then create and test customized match specifications by using the Match Designer. These match specifications are used by the Two-source Match and One-source Match stages in DataStage® jobs.
Matching in DataStage: key concepts
Probabilistic record linkage uses statistical properties of values to calculate the likelihood that records correspond to the same entity. The information content, completeness, reliability, contextual frequency, and representation of the data are considered to create a cumulative assessment of confidence.
Understanding conceptual information is necessary to define match specifications. You need to know how to evaluate the results, estimate probabilities, set thresholds, and perform related tasks. Some understanding of the theory of record linkage is required.
Record linkage and the matching process in DataStage
In practice, you compare record pairs and classify them into one of these sets: matched pairs and nonmatched pairs.
- Columns contain errors or missing values.
- Data can be unreliable.
- You want to find the matches with a reasonable statistical assurance.
- Scenario for record linkage in DataStage
-
Consider two sources of data. Each source consists of a number of records, and the records contain some number of columns. Typically, each record corresponds to an entity, and the columns are attributes identifying the entity, such as name, address, age, and gender. The objective of the record linkage or matching process is to identify and link the records on each source that correspond to the same entity. The records do not contain totally reliable unique identifiers that make the matching operation trivial. Also, the individual columns are all subject to error.
The columns in common between the two sources are useful for matching. Not all columns, however, contain an equal amount of information, and error rates vary. For example, a column such as Gender has only two value states, and consequently, cannot impart enough information to identify a match uniquely. Conversely, a column such as FamilyName imparts much more information, but it might frequently be reported or transcribed (keyed) incorrectly.
You use weights to measure the contribution of each column to the probability of making an accurate classification. Record linkage has the following states; a record pair is classified as follows:
Match: The composite weight is above a threshold (cutoff) value.
Non match: The composite weight is below a second threshold value.
Undecided situation: The composite weight is between the first and second thresholds.
Consider the level of information content in DataStage
One or more columns in one record must have equivalent columns in the other record to compare them. For example, in order to match on family name and age, both records must have columns containing family name and age information. Although for a two-source match, the metadata for the comparable columns does not need to be identical.
For a record linkage project to be feasible, it is possible for a human to examine the record pairs and declare with reasonable certainty which of the pairs are a match or a nonmatch. For example, if the only column in common between two sources is gender, you do not then conclude that, because the gender agrees, the pair represents the same individual.
Information content measures the significance of one column over another (discriminating value). For example, a gender code contributes less information than a tax identification number.
Information content also measures the significance of one value in a column over another. In the United States, John contributes less information than Dwezel in a GivenName column. The given name, John, in the United States is much more common than the given name, Dwezel. Significance is determined by the reliability of the value and the ability of the value to discriminate a match from a nonmatch. And any comparison of records needs enough information to reach a reliable conclusion. For example, two identical customer records that contain only the family name, Smith, and that miss values in every other name and address column, do not have enough information to determine that the records represent the same person.