It can be helpful to understand the logic that InfoSphere® MDM uses to compare two tokens.
- Initials match
- If both tokens are initials, there is an exact match and the exact match weight for the initials is applied. If one token is an initial and the other is not and the first characters match, the weight is the exact match weight for the initial with the initial adjustment penalty.
- Full word comparison
- The exact match weight is applied if the words match completely.
- Phonetic name comparison
- The words are checked for phonetic matching and, if they match, the weight is computed by subtracting the phonetic adjustment penalty from the smaller of the two exact match weights. If this weight is smaller than the minimum phonetic weight, the minimum weight is used.
- Nickname comparison
- If there is a NICKNAME table specified in the mpi_cmpspec table, the nickname comparison is performed. Two names match from the nickname table if they have a common nickname or the nickname of one matches the original name of the other.
- If there is a nickname match, the weight is computed by subtracting the nickname adjustment penalty from the smaller of the two exact match weights. If this weight is smaller than the minimum initial weight, the minimum is used.
- Nickname-meta comparison
- Tokens can also match through a nick-meta (phonetic) match. Two names match in the nickname table if they each have a nickname that matches phonetically, or the nickname of one phonetically matches the original name of the other.
- If there is a nick-meta match, the weight is computed by subtracting the nick-meta adjustment penalty from the smaller of the two exact match weights. If this value is less than the minimum nick-meta weight, the minimum value is used.
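The phonetic, nickname, and nick-meta rules share one pattern: take the smaller of the two exact match weights, subtract an adjustment penalty, and never go below a configured minimum. The following is a minimal sketch of that arithmetic only. The function name, parameter names, and numbers are hypothetical and are not part of the product.

```python
def adjusted_weight(exact_weight_a, exact_weight_b, penalty, minimum_weight):
    """Weight for a non-exact match such as phonetic, nickname, or nick-meta.

    All names here are illustrative; they are not InfoSphere MDM identifiers.
    """
    # Start from the smaller of the two exact match weights.
    base = min(exact_weight_a, exact_weight_b)
    # Subtract the adjustment penalty, but never drop below the minimum weight.
    return max(base - penalty, minimum_weight)


# Hypothetical weights chosen only for illustration.
phonetic = adjusted_weight(4.2, 3.5, penalty=1.0, minimum_weight=0.5)
floored = adjusted_weight(1.2, 0.9, penalty=1.0, minimum_weight=0.5)
print(phonetic, floored)
```

In the second call, the penalized weight falls below the minimum, so the minimum is returned instead.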
- Edit distance comparison
- Edit distance measures the similarity between two tokens by calculating the number of character insertions, deletions, or transpositions it would take to make the tokens match.
- If edit distance * edit-distance factor <= maximum (length of the two words), then the edit distance weight is applied. In other words, the edit distance must be no more than 1/edit-distance factor of the length of the longer word.
- Setting __EDITDIST_FACTOR = 5 means there is an edit distance match if the edit distance is <= 1/5 (20%) of the longest string length.
- Setting __EDITDIST_FACTOR = 0 means that every string pair matches by edit distance (this setting is not suggested).
- Setting __EDITDIST_FACTOR to a large number means that few if any pairs match by edit distance.
- The weight is obtained by applying the edit-distance adjustment penalty to the exact weights, as in the previous cases. In other words, the factor determines how large an edit distance still counts as a match, and the penalty determines how much weight such a match receives.
- See Edit distance comparison functions for more detail.
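The threshold test described above can be sketched as follows. This example assumes the widely used optimal string alignment distance, which counts substitutions in addition to insertions, deletions, and transpositions. The distance function that the product uses might differ, and the function names and the `factor` argument are hypothetical stand-ins for the configured edit-distance factor.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (assumed here for illustration)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent characters swapped: count as a single transposition.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_edit_distance_match(a, b, factor):
    """True when distance * factor <= length of the longer word."""
    return osa_distance(a, b) * factor <= max(len(a), len(b))


print(is_edit_distance_match("SMITH", "SMTIH", 5))      # one transposition in 5 characters
print(is_edit_distance_match("CLEVELAND", "CLEVE", 5))  # four deletions in 9 characters
```

With a factor of 5, the first pair is within the 20% limit and the second is not. With a factor of 0, the left side of the inequality is always 0, so every pair matches.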
- Prefix and Compound comparison
- Both the AXP and CXNM functions allow for prefix and compound word comparison and matching. The PREFIX and COMPOUND weight generation parameters account for matches that edit-distance processing might otherwise miss, which produces more accurate results. For example, “Cleveland” and “Cleve” are not close enough for an edit distance comparison to result in a match. Using a prefix parameter adjusts for this distance in the comparison. Similarly, the compound parameter accounts for “WALMART” in one string and “WAL MART” in another.
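As a rough illustration of the idea, and not of how the AXP or CXNM functions are implemented, prefix and compound checks could be sketched as shown. Both helper functions are hypothetical. The actual behavior depends on the PREFIX and COMPOUND parameter settings, which this sketch does not model.

```python
def is_prefix_match(a, b):
    """True when the shorter token is a leading substring of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) > 0 and longer.startswith(shorter)


def is_compound_match(tokens_a, tokens_b):
    """True when both token lists spell the same string once joined."""
    return "".join(tokens_a) == "".join(tokens_b)


print(is_prefix_match("CLEVELAND", "CLEVE"))
print(is_compound_match(["WALMART"], ["WAL", "MART"]))
```

Both calls report a match, although the same pairs are too far apart to match by edit distance alone.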
- Word mismatch
- If none of the previous comparisons results in a match, the words are considered completely different and a zero weight is assigned.
- Acronym comparison
- Using the acronym comparison, if “ITS” is compared to “INTEGRATED TECHNOLOGY SOLUTIONS,” there is a match.
- The acronym comparison works in both directions. First, “ITS” is checked to see whether it is an acronym of the words in “INTEGRATED TECHNOLOGY SOLUTIONS”, and then the reverse check is made.
- Acronym matches are scored as initial matches. In the described example there are three initial matches: I-INTEGRATED, T-TECHNOLOGY, and S-SOLUTIONS.
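The two-way acronym check can be sketched as follows. This is a simplified illustration under the assumption that an acronym is a single token whose letters line up one-to-one with the first letters of the other string's words. The function names are hypothetical, and the sketch returns the initial pairs without scoring them.

```python
def acronym_pairs(candidate, words):
    """Return (letter, word) pairs if candidate is an acronym of words, else []."""
    if not words or len(candidate) != len(words):
        return []
    pairs = list(zip(candidate, words))
    # Every letter must be the initial of the word in the same position.
    if all(word.startswith(letter) for letter, word in pairs):
        return pairs
    return []


def acronym_match(tokens_a, tokens_b):
    """Check whether either side is a one-token acronym of the other side."""
    if len(tokens_a) == 1:
        pairs = acronym_pairs(tokens_a[0], tokens_b)
        if pairs:
            return pairs
    if len(tokens_b) == 1:
        return acronym_pairs(tokens_b[0], tokens_a)
    return []


pairs = acronym_match(["ITS"], ["INTEGRATED", "TECHNOLOGY", "SOLUTIONS"])
print(pairs)
```

Each returned pair corresponds to one initial match, so the example yields the three pairs I-INTEGRATED, T-TECHNOLOGY, and S-SOLUTIONS regardless of which string is passed first.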