It can be helpful to understand the logic used by the Master Data Engine to
compare two tokens.
- Initials match
- If both tokens are initials and they are identical,
there is an exact match and the exact match weight for the initial
is applied. If only one token is an initial and it matches the first
character of the other token, the weight is the exact match weight for
the initial reduced by the initial adjustment penalty.
- Full word comparison
- The exact match weight
is applied if the words match completely.
- Phonetic name comparison
- The words are checked for phonetic matching
and, if they match, the weight is computed by subtracting the phonetic
adjustment penalty from the smaller of the two exact match weights.
If this weight is smaller than the minimum phonetic weight, the minimum
weight is used.
- Nickname comparison
- If there is a NICKNAME
table specified in the mpi_cmpspec table, the nickname comparison
is performed. Two names match from the nickname table if they have
a common nickname or the nickname of one matches the original name
of the other.
- If there is a nickname match, the weight is computed by subtracting
the nickname adjustment penalty from the smaller of the two exact
match weights. If this weight is smaller than the minimum initial
weight, the minimum is used.
- Nickname-meta comparison
- Tokens can also match through a nick-meta (phonetic)
match. Two names match in the nickname table if they each have a nickname
that matches phonetically, or the nickname of one phonetically matches
the original name of the other.
- If there is a nick-meta match, the weight is computed by subtracting
the nick-meta adjustment penalty from the smaller of the two exact
match weights. If this value is less than the minimum nick-meta weight,
the minimum value is used.
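
The phonetic, nickname, and nick-meta cases above share one pattern: take the smaller of the two exact match weights, subtract the relevant penalty, and never go below the relevant minimum. The following is a minimal sketch of that arithmetic only. The function name, parameter names, and the sample weights are hypothetical and are not taken from the Master Data Engine.

```python
def adjusted_weight(exact_a: int, exact_b: int, penalty: int, minimum: int) -> int:
    """Return the penalized weight for a non-exact match.

    exact_a, exact_b: exact match weights of the two tokens (illustrative values).
    penalty: adjustment penalty for the comparison type (phonetic, nickname, ...).
    minimum: floor applied when the penalized weight falls below it.
    """
    penalized = min(exact_a, exact_b) - penalty
    return max(penalized, minimum)


# Smaller exact weight is 30; after a penalty of 10 the result is 20.
print(adjusted_weight(40, 30, penalty=10, minimum=5))
# Smaller exact weight is 10; 10 - 20 is below the floor, so the floor is used.
print(adjusted_weight(10, 30, penalty=20, minimum=5))
```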
- Edit distance comparison
- Edit distance measures the similarity
between two tokens by calculating the number of character insertions,
deletions, or transpositions it would take to make the tokens match.
- If EditDistance * edit-distance factor <= maximum (length of
the two words), then the edit distance weight is applied. In other
words, EditDistance / (length of the longer word) <= 1/edit-distance factor.
- Setting __EDITDIST_FACTOR = 5 means there is an edit distance
match if the edit distance is <= 1/5, or 20%, of the length of the
longer string.
- Setting __EDITDIST_FACTOR = 0 means that every string pair matches
by edit distance (this setting is not suggested).
- Setting __EDITDIST_FACTOR to a large number means that few if
any pairs match by edit distance.
- The weight is obtained by applying the edit-distance adjustment
penalty to the exact match weights, as in the previous cases. In other
words, the edit-distance factor determines how large an edit distance
is still considered a match.
- See Edit distance comparison functions for
more detail.
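
The threshold test described above can be sketched as follows. This is an illustration of the inequality only: the edit distance is passed in as a number rather than computed, and the function and parameter names are hypothetical.

```python
def edit_distance_matches(distance: int, a: str, b: str, factor: int) -> bool:
    """Apply the rule: distance * factor <= length of the longer word."""
    return distance * factor <= max(len(a), len(b))


# With a factor of 5, one edit on an 8-character word passes (5 <= 8),
# but two edits do not (10 > 8).
print(edit_distance_matches(1, "JOHNSON", "JOHNSTON", factor=5))
print(edit_distance_matches(2, "JOHNSON", "JOHNSTON", factor=5))
# With a factor of 0, the left side is always 0, so every pair passes.
print(edit_distance_matches(7, "ABC", "XYZWVUT", factor=0))
```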
- Prefix and Compound comparison
- Both AXP and
CXNM functions allow for prefix and compound word comparison and matching. By
using PREFIX or COMPOUND weight generation parameters, match cases
that might normally be missed in edit-distance processing can be accounted
for and produce more accurate results. For example, “Cleveland” and
“Cleve” are not close enough for an edit distance comparison to result
in a match. Using a prefix parameter would adjust for this distance
in the comparison. Similarly, the compound parameter would account for
“WALMART” in one string and “WAL MART” in another.
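
To make the two examples above concrete, here is a rough sketch of what a prefix test and a compound test could look like. It is not the logic behind the PREFIX or COMPOUND parameters, whose exact behavior is not described here. The function names and the `min_len` threshold are assumptions introduced for illustration.

```python
def is_prefix_match(a: str, b: str, min_len: int = 3) -> bool:
    """True if the shorter word is a prefix of the longer one.

    min_len is a hypothetical guard so that very short tokens do not match.
    """
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) >= min_len and longer.startswith(shorter)


def is_compound_match(a: str, b: str) -> bool:
    """True if the strings differ only in where the spaces fall."""
    return a != b and a.replace(" ", "") == b.replace(" ", "")


print(is_prefix_match("Cleveland", "Cleve"))
print(is_compound_match("WALMART", "WAL MART"))
```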
- Word mismatch
- If none of the previous comparisons
produces a match, the words are considered completely different. In
this case, a zero weight is assigned.
- Acronym comparison
- Using the acronym comparison,
if “ITS” is compared to “INTEGRATED TECHNOLOGY SOLUTIONS,” there is
a match.
- The acronym comparison works in both directions. The first string (“ITS”)
is checked as a possible acronym of the words in the second (INTEGRATED
TECHNOLOGY SOLUTIONS), and vice versa.
- Acronym matches are scored as initial matches. In the described
example there are three initial matches: I-INTEGRATED, T-TECHNOLOGY,
and S-SOLUTIONS.
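
The acronym check can be sketched as below, under two assumptions that are not stated in the text: both strings are already upper-cased, and every letter of the acronym must line up with exactly one word. The function names are hypothetical. Each returned pair corresponds to one initial match that would then be scored.

```python
def acronym_initial_pairs(acronym: str, phrase: str) -> list:
    """Pair each letter of `acronym` with the word of `phrase` it abbreviates.

    Returns an empty list unless every letter matches the first character
    of the word in the same position.
    """
    words = phrase.split()
    if len(acronym) != len(words):
        return []
    pairs = [(letter, word) for letter, word in zip(acronym, words)
             if word.startswith(letter)]
    return pairs if len(pairs) == len(words) else []


def acronym_match(a: str, b: str) -> list:
    """Try both directions: a as an acronym of b, then b as an acronym of a."""
    return acronym_initial_pairs(a, b) or acronym_initial_pairs(b, a)


pairs = acronym_match("INTEGRATED TECHNOLOGY SOLUTIONS", "ITS")
print(pairs)       # the letter/word pairs that would be scored as initial matches
print(len(pairs))  # number of initial matches
```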