Match comparisons in DataStage

Match comparisons analyze the values in columns to calculate the contribution to the composite weight. From the more than 24 available comparisons, choose an exact or error-tolerant comparison that suits both your matching objectives and your data.

Comparisons marked with an equal sign (=) can be used for both one-source matches and two-source matches. All other comparisons apply only to two-source matches.

Comparisons that can also be used for matching vectors are marked with an asterisk (*). If you want to create vectors to use in the Match Designer, see Make Vector stage in DataStage.

Table 1. Match comparisons that apply to characters
Comparison Description
CHAR * = Compares data values on a character-by-character basis. This comparison is often used to catch spelling mistakes or inverted letters.
LR_CHAR Compares place information from a data source with geocoding reference files by using a left-right character string comparison.
LR_UNCERT Compares place information from a data source with geocoding reference files by using a left-right string comparison algorithm based on information theory principles.
UNCERT * = Evaluates the similarity of two character strings by using an algorithm that is based on information theory principles.
Table 2. Match comparisons that apply to numbers
Comparison Description
ABS_DIFF * = Compares the absolute difference between two numbers to a value that you specify.
CNT_DIFF * = Compares two strings of numbers and assigns agreement or disagreement weights based on the number of differences between the numbers in the strings. Weights are prorated according to the magnitude of the disagreement.
DATE8 * = Compares dates in the format of YYYYMMDD by measuring the number of days difference between two dates. Weights are prorated according to the magnitude of the disagreement.
DELTA_PERCENT * = Compares columns in which the difference is measured as a percentage of the value that is compared. One use for DELTA_PERCENTAGE is comparing age.
DISTANCE = Computes the distance between two points and prorates the weight based on the distance between the points. You can use this comparison for matching geographic coordinates where the farther the points are from each other, the lesser the weight that is applied.
NUMERIC * = Converts two strings to integers and then performs a numeric comparison of the integers. When converting the strings to integers, leading and trailing spaces are ignored.
PRORATED * = Compares numeric columns and allows them to disagree by an absolute amount that you specify. Any difference between 0 and the specified amount receives a weight proportionally equal to that difference.
TIME * = Compares values in DataStage and QualityStage time or character columns in the format of HHMM or HHMMSS. TIME assigns proportionate weights to time differences that fall between an exact match and the maximum difference that you allow.
Table 3. Match comparisons that apply to strings
Comparison Description
MULT_ALIGN = Scores the similarity of two sequences of terms. This comparison combines your knowledge of how similar the terms are, the order of the similar terms, and the proximity of the similar terms. You can use MULT_ALIGN to compare addresses where the sequences of terms are in different orders.
MULT_EXACT = Compares all words in one column of a record with all words in the same column of a second record.
MULT_RANGE Compares a string in one column to a range of strings in another column. You can use MULT_RANGE to compare single house numbers to a list of house number ranges.
MULT_UNCERT = Compares all words in one column of a record with all words in the same column of a second record by using a string comparison algorithm based on information theory principles.
NAME_UNCERT * = Compares two strings. First, it right-truncates the longer string so that it contains the same number of characters as the shorter string. If that comparison is not an exact match, it evaluates the similarity of the strings by doing an UNCERT comparison. You can use NAME_UNCERT to compare given names, where one of the name strings is shorter than the other.
PREFIX * = Compares two strings on a character-by-character basis after right-truncating the longer string so that it contains the same number of characters as the shorter string.
Table 4. Match comparisons that apply to intervals
Comparison Description
AN_DINT Compares an alphanumeric string from a data source to two alphanumeric intervals from a reference source. You can use this comparison to compare house numbers with census, Etak, GDT DynaMap, postal code, or other files.
AN_INTERVAL Compares a single number from a data source to an interval or range of numbers from a reference source by using an alphanumeric odd-even interval comparison.
D_INT Compares a numeric string from a data source to two numeric intervals from a reference source. You can use this comparison to compare house numbers with census, Etak, GDT DynaMap, or postal code files.
D_USPS Compares an alphanumeric house number from a data source to two alphanumeric house number intervals from a reference source by using a left-right interval comparison. Control columns indicating the odd-even parity of the reference intervals are required.
INT_TO_INT = Compares an interval from a data source to an interval from a reference source. The results match if an interval in one file overlaps or is fully contained in an interval in another file.
INTERVAL_NOPAR Compares a single number from a data source to an interval from a reference source. The single number must be within the interval (inclusive of the end points) to be considered a match. The odd-even parity of the single number does not need to agree with the parity of the beginning value of the interval.
INTERVAL_PARITY Compares a single number from a data source to an interval from a reference source. The odd-even parity of the number must agree with the parity of the beginning value of the interval.
USPS Compares an alphanumeric house number from a data source to two alphanumeric house number intervals from USPS ZIP Code files or other reference sources that can contain alphanumeric ranges. Odd-even parity control information such as the USPS ZIP+4 control column is required.
USPS_DINT Compares an interval from a data source to two intervals from a reference source for columns that contain an address primary number. This match comparison can be used to compare information from a USPS ZIP+4 file to geographic reference files such as the Census Bureau TIGER file, GDT Dynamap files, or Etak MapBase files. Odd-even parity control information such as the USPS ZIP+4 control column is required.
USPS_INT = Compares an interval from a data source to an interval from a reference source for columns that contain address primary number. The results match if the interval in the data source overlaps any part of the interval from a reference source and the odd-even parity agrees. Parity control information such as the USPS ZIP+4 control column is required.