Match comparisons analyze the values in columns to calculate
the contribution to the composite weight. From the more than 24 available
comparisons, choose an exact or error-tolerant comparison that suits
both your matching objectives and your data.
Comparisons marked with an equal sign (=) can be used for both
one-source matches and two-source matches. All other comparisons apply
only to two-source matches.
Comparisons that can also be used for matching vectors are marked with an asterisk (*). If you
want to create vectors to use in the Match Designer, see Make Vector stage in DataStage.
Table 1. Match comparisons that apply to characters
Comparison |
Description |
CHAR *
= |
Compares data values on a character-by-character
basis. This comparison is often used to catch spelling mistakes or
inverted letters. |
LR_CHAR |
Compares place information from a data
source with geocoding reference files by using a left-right character
string comparison. |
LR_UNCERT |
Compares place information from a data
source with geocoding reference files by using a left-right string
comparison algorithm based on information theory principles. |
UNCERT *
= |
Evaluates the similarity of two character
strings by using an algorithm that is based on information theory
principles. |
Table 2. Match comparisons that apply to numbers
Comparison |
Description |
ABS_DIFF *
= |
Compares the absolute difference between
two numbers to a value that you specify. |
CNT_DIFF *
= |
Compares two strings of numbers and
assigns agreement or disagreement weights based on the number of differences
between the numbers in the strings. Weights are prorated according
to the magnitude of the disagreement. |
DATE8 *
= |
Compares dates in the format of YYYYMMDD
by measuring the number of days difference between two dates. Weights
are prorated according to the magnitude of the disagreement. |
DELTA_PERCENT * = |
Compares columns in which the difference
is measured as a percentage of the value that is compared. One use
for DELTA_PERCENTAGE is comparing age. |
DISTANCE = |
Computes the distance between two points
and prorates the weight based on the distance between the points.
You can use this comparison for matching geographic coordinates where
the farther the points are from each other, the lesser the weight
that is applied. |
NUMERIC *
= |
Converts two strings to integers and
then performs a numeric comparison of the integers. When converting
the strings to integers, leading and trailing spaces are ignored. |
PRORATED *
= |
Compares numeric columns and allows
them to disagree by an absolute amount that you specify. Any difference
between 0 and the specified amount receives a weight proportionally
equal to that difference. |
TIME *
= |
Compares values in DataStage and QualityStage time or character columns in the format of HHMM
or HHMMSS. TIME assigns proportionate weights to time differences that fall between an exact match
and the maximum difference that you allow. |
Table 3. Match comparisons that apply to strings
Comparison |
Description |
MULT_ALIGN = |
Scores the similarity of two sequences
of terms. This comparison combines your knowledge of how similar the
terms are, the order of the similar terms, and the proximity of the
similar terms. You can use MULT_ALIGN to compare addresses where the
sequences of terms are in different orders. |
MULT_EXACT = |
Compares all words in one column of
a record with all words in the same column of a second record. |
MULT_RANGE |
Compares a string in one column to
a range of strings in another column. You can use MULT_RANGE to compare
single house numbers to a list of house number ranges. |
MULT_UNCERT = |
Compares all words in one column of
a record with all words in the same column of a second record by using
a string comparison algorithm based on information theory principles. |
NAME_UNCERT *
= |
Compares two strings. First, it right-truncates
the longer string so that it contains the same number of characters
as the shorter string. If that comparison is not an exact match, it
evaluates the similarity of the strings by doing an UNCERT comparison.
You can use NAME_UNCERT to compare given names, where one of the name
strings is shorter than the other. |
PREFIX *
= |
Compares two strings on a character-by-character
basis after right-truncating the longer string so that it contains
the same number of characters as the shorter string. |
Table 4. Match comparisons that apply to intervals
Comparison |
Description |
AN_DINT |
Compares an alphanumeric string from
a data source to two alphanumeric intervals from a reference source.
You can use this comparison to compare house numbers with census,
Etak, GDT DynaMap, postal code, or other files. |
AN_INTERVAL |
Compares a single number from a data
source to an interval or range of numbers from a reference source
by using an alphanumeric odd-even interval comparison. |
D_INT |
Compares a numeric string from a data
source to two numeric intervals from a reference source. You can use
this comparison to compare house numbers with census, Etak, GDT DynaMap,
or postal code files. |
D_USPS |
Compares an alphanumeric house number
from a data source to two alphanumeric house number intervals from
a reference source by using a left-right interval comparison. Control
columns indicating the odd-even parity of the reference intervals
are required. |
INT_TO_INT = |
Compares an interval from a data source
to an interval from a reference source. The results match if an interval
in one file overlaps or is fully contained in an interval in another
file. |
INTERVAL_NOPAR |
Compares a single number from a data
source to an interval from a reference source. The single number must
be within the interval (inclusive of the end points) to be considered
a match. The odd-even parity of the single number does not need to
agree with the parity of the beginning value of the interval. |
INTERVAL_PARITY |
Compares a single number from a data
source to an interval from a reference source. The odd-even parity
of the number must agree with the parity of the beginning value of
the interval. |
USPS |
Compares an alphanumeric house number
from a data source to two alphanumeric house number intervals from
USPS ZIP Code files or other reference sources that can contain alphanumeric
ranges. Odd-even parity control information such as the USPS ZIP+4
control column is required. |
USPS_DINT |
Compares an interval from a data source
to two intervals from a reference source for columns that contain
an address primary number. This match comparison can be used to compare
information from a USPS ZIP+4 file to geographic reference files such
as the Census Bureau TIGER file, GDT Dynamap files, or Etak MapBase
files. Odd-even parity control information such as the USPS ZIP+4
control column is required. |
USPS_INT = |
Compares an interval from a data source
to an interval from a reference source for columns that contain address
primary number. The results match if the interval in the data source
overlaps any part of the interval from a reference source and the
odd-even parity agrees. Parity control information such as the USPS
ZIP+4 control column is required. |