CXNM (comparison function)
The CXNM comparison routine is used for business name comparisons.
CXNM compares ordered name tokens. As a reminder, cmpargs are used to specify the type of phonetic function to be applied.
Greater accuracy can be achieved by using PREFIX or COMPOUND weight generation parameters. Without use of these parameters, match cases with prefixes or compound names might be missed in normal edit distance processing.
- Number of roles
- Weight table
- mpi_wgtsval, mpi_wgthead
- Weight generation parameters
- __PREFIX_FACTOR, __PREFIX_ADJWGT, __PREFIX_MINWGT, __PREFIX_MAXWGT,
__COMPOUND_ADJWGT, __COMPOUND_MINWGT, __COMPOUND_MAXWGT
Note: Set the weight generation parameters in the mpi_wgtsval table, similar to other comparison parameters.
The comparison function compares the two strings in an iterative manner, first looping through one of the compare strings and then through the other. The tokens are compared against one another during each iteration.
In business name parsing, the order of the words is important. Every time there is an order displacement, there is a penalty associated with the next match. For example: INITIATE SYSTEMS LTD and INITIATE SYSTEMS LTD match better than INITIATE SYSTEMS LTD and INITIATE LTD SYSTEMS.
The result of each token match could be one of the following:
- Initial match
- Partial match
- Phonetic match
- Acronym match
- Nickname match
- Nickname_meta match
- Prefix/compound match
- Edit distance match
- Total name mismatch
The comparison logic for two strings, both with a set of words/tokens, is as follows:
- Token-by-token comparison takes order into consideration. As such, if the strings “CLEVELAND CLINIC” and “CLINIC OF CLEVELAND” are compared, only one of the two tokens CLEVELAND and CLINIC would match.
- Penalties are applied for extra tokens between matches. If the strings “JIMS TRUCKS” and “JIMS PRETTY BIG TRUCKS” are compared, the tokens JIMS and TRUCKS would match. However, there would be a penalty applied to the second match, because of the extra tokens between JIMS and TRUCKS.
- Prefix and compound name matching is done for CXNM. A check is done to see whether the two tokens are a compound match (for example, INITIATE SYSTEMS and INITIATESYSTEMS).
- First the strings are checked to see if they are numeric, since no prefix or compound match is done for numerics.
- If the strings are a compound match, the weight of the non-compound string is taken and the compound adjustment weight is subtracted. This weight is then checked to see that it is within the min and the max ranges.
- Prefix matching is then checked. If string 1 matches the beginning of string 2 and its length is at least the length of string 2 divided by the prefix factor, then string 1 is considered a prefix of string 2. PREFIX_FACTOR has a default value of 2.
- The PREFIX_MATCH weight is then applied by subtracting the PREFIX_ADJWGT (prefix adjustment) weight from the min (string1 and string2) weights. This weight is then checked to be within the min and max range. If it is not within the range, the weight is limited to the min/max as appropriate.
- After all the comparisons have been done, the scores for all the matched tokens are added up, and then divided by the average of the two total token weights. This percentage is then normalized to a score from 0 to the maximum normalized index, which is typically 16. This index is used to look up the final weight in the PARM weight table.
- This normalized weight is compared to the different thresholds and the comparison output is MATCH, PARTIAL MATCH or MISMATCH.
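The normalization step described in the list above can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name is hypothetical, and truncating the scaled value to an integer index is an assumption.

```python
def normalized_index(matched_score, total_weight_1, total_weight_2, max_index=16):
    """Scale the matched score to an index between 0 and max_index.

    Hypothetical helper: the name and the truncation to an integer are
    assumptions made for illustration.
    """
    # Average of the two total token weights, as described in the text.
    average = (total_weight_1 + total_weight_2) / 2
    if average <= 0:
        return 0
    # Fraction of the available weight that was actually matched.
    fraction = matched_score / average
    # Scale to the index range and keep the result within bounds.
    index = int(fraction * max_index)
    return max(0, min(max_index, index))


print(normalized_index(100, 140, 120))
```

The returned index is the value that would be used to look up the final weight in the weight table.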
There are two types of parameters used by CXNM. The first is a set of parameters used during the comparison step of weight generation. The second set of parameters uses the outputs from weight generation and is applied during the actual data comparison to determine a final score.
Weight generation comparison parameters
With normal edit distance processing, some matching names might not be recognized. Using the following weight generation parameters for prefix and compound name matching can significantly improve comparisons.
For example, if string 1 contains “CLEVELAND” and string 2 contains “CLEVE,” the strings are not close enough for the edit distance partial match. However, the prefix parameter adjusts for this scenario.
In the case of compound names, if string 1 contains “WAL MART” and string 2 contains “WALMART,” the parameters do allow for credit as a match.
Weight generation parameters include:
- PREFIX_FACTOR – The threshold for a prefix match is defined by a configurable integer prefix match factor, PREFIX_MATCH_FACTOR. For example: Ni = “MICRO” and Mj = “MICROSOFT”.
If Ni matches the beginning of Mj and PREFIX_MATCH_FACTOR * len(Ni) >= len(Mj), then the match is a prefix match.
Setting PREFIX_MATCH_FACTOR to 2 means that Ni should be at least half the length of Mj. Setting PREFIX_MATCH_FACTOR to 3 means that Ni should be at least one-third the length of Mj.
The default value for PREFIX_MATCH_FACTOR is 2.
- PREFIX_ADJWGT – If a prefix match is identified, PREFIX_ADJWGT is used to adjust the weight of the prefix token (for example, “MICRO”) in the following manner:
(Prefix compare token) Ni_wgt = MIN(Ni_wgt, Mj_wgt) - PREFIX_ADJWGT
The default value for PREFIX_ADJWGT is 100 or 1.0.
- PREFIX_MINWGT – Used as a lower boundary for any prefix weight adjusted token. The prefix adjusted weight never falls below this weight value. The default value for PREFIX_MINWGT is 50 or .5.
- PREFIX_MAXWGT – Used as an upper boundary for any prefix weight adjusted token. The prefix adjusted weight never goes above this weight value. The default value for PREFIX_MAXWGT is 300 or 3.0.
- COMPOUND_ADJWGT – A compound match is detected when comparing the tokens “MICRO” “SOFT” with “MICROSOFT”. COMPOUND_ADJWGT is used to adjust the weight of the compound token (“MICRO”) in the following manner:
(Compound compare token “MICRO”) Ni_wgt = MIN(Ni_wgt, Mj_wgt) - COMPOUND_ADJWGT
The default value for COMPOUND_ADJWGT is 50 or .5.
- COMPOUND_MINWGT – Used as a lower boundary for any compound weight adjusted token. The compound adjusted weight never falls below this weight value. The default value for COMPOUND_MINWGT is 50 or .5.
- COMPOUND_MAXWGT – Used as an upper boundary for any compound weight adjusted token. The compound adjusted weight never goes above this weight value. The default value for COMPOUND_MAXWGT is 400 or 4.0.
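The prefix and compound rules above can be illustrated with a short sketch. The function names are hypothetical and the code is not taken from the product. It assumes weights in the integer notation (100 means 1.0), the default values listed above, and that identical tokens are handled elsewhere as exact matches.

```python
# Illustrative sketch only; all function names here are hypothetical.
# Weights use the integer notation from the text (100 means 1.0).
PREFIX_FACTOR = 2
PREFIX_ADJWGT, PREFIX_MINWGT, PREFIX_MAXWGT = 100, 50, 300
COMPOUND_ADJWGT, COMPOUND_MINWGT, COMPOUND_MAXWGT = 50, 50, 400


def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(high, value))


def is_prefix_match(ni, mj, factor=PREFIX_FACTOR):
    """True if ni starts mj and factor * len(ni) >= len(mj)."""
    # No prefix matching is done for numeric tokens.
    if ni.isdigit() or mj.isdigit():
        return False
    # Assumption: identical tokens count as exact matches, not prefixes.
    return ni != mj and mj.startswith(ni) and factor * len(ni) >= len(mj)


def is_compound_match(first, second, compound):
    """True if two adjacent tokens join to form the compound token."""
    if compound.isdigit():
        return False
    return first + second == compound


def prefix_weight(ni_wgt, mj_wgt):
    """MIN(Ni_wgt, Mj_wgt) - PREFIX_ADJWGT, limited to the prefix range."""
    return clamp(min(ni_wgt, mj_wgt) - PREFIX_ADJWGT, PREFIX_MINWGT, PREFIX_MAXWGT)


def compound_weight(ni_wgt, mj_wgt):
    """MIN(Ni_wgt, Mj_wgt) - COMPOUND_ADJWGT, limited to the compound range."""
    return clamp(min(ni_wgt, mj_wgt) - COMPOUND_ADJWGT, COMPOUND_MINWGT, COMPOUND_MAXWGT)


print(is_prefix_match("MICRO", "MICROSOFT"), prefix_weight(200, 350))
print(is_compound_match("WAL", "MART", "WALMART"), compound_weight(200, 350))
```

With a factor of 2, “MICRO” (5 characters) qualifies as a prefix of “MICROSOFT” (9 characters) because 2 * 5 >= 9, whereas “MIC” would qualify only with a factor of 3.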
The following conditions and penalties are used in comparison to determine the final score.
Position penalties: CXNM uses the following to check for position penalties.
- __CELLDIFF_ADJWGT %d (d=2,3,4). “d” refers to default disagreement weight.
CELLDIFF parameters are applied during the comparison of one token to another. When two tokens that match have a cell difference of 2, the total weight is reduced by CELLDIFF_ADJWGT_2. The weight is checked to make sure that it is within the range limits.
CELLDIFF_MAXIDX is used to check that the cell difference does not exceed this maximum.
- __POSITION_EXACT: two tokens match exactly and are in the same position. The default is 20.
- __POSITION_ADJ: two tokens are off by one position. The default is 10.
Edit-distance: The following parameters are used to check the phone edit-distance. If it is equal to MCCIDX_EQUAL, then it is an exact match, otherwise it is a partial match.
If the edit-distance from comparing phone numbers is equal to DIST_MCCIDX_EQUAL, they are considered an exact match. Similarly, if the distance is equal to DIST_MCCIDX_PARTIAL, they are considered a partial match.
Index calculation: CXNM uses these parameters to calculate the index for 2dim tables. It also checks for EQUAL or PARTIAL matches (similar to the ones above).
- __NORM_ADJWGT %d (d=2,3,4). “d” refers to default disagreement weight.
The NORM_MCCIDX values are used for the address equal or partial match.
NORM_MAXIDX and NORM_MINIDX are used to bind the Normalized value from the Address matching.
NORM_ADJWGT is the final lookup weight for CXNM.
Partial string matches: These parameters are used for penalties for the partial string matches.
EDITDIST_FACTOR is used to determine if the two strings can use an edit-distance comparison. EDITDIST_ADJWGT is the penalty applied for an edit-distance match. The final weight is then checked to be within the ranges by using the MIN and MAX values.
Acronym matches: The following use the MIN and MAX values to check the edit-distance in acronyms. If the weights are less than MIN or greater than MAX, they are adjusted back to the MIN and MAX.
Weight tokens: The _NORM_AVGINFO parameter controls the average information content that is used to calculate the normalized weight. Using a 0 results in the average being equal to the minimum weight. Using a 1 results in the average being equal to the maximum weight. The default value is 0.5 (50 in the wgtval notation), which is a true average. The _NORM_AVGINFO parameter is used to adjust the information that is used in the normalization process to reduce or increase the penalty for missing tokens. In earlier versions, the normalization process used the average information of the two records that are being compared. With the _NORM_AVGINFO parameter, the normalization process can use any value between the minimum and maximum weights of the two records being compared.
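One way to read the _NORM_AVGINFO behavior is as a linear interpolation between the smaller and the larger of the two total weights. The sketch below assumes that reading. The function name is hypothetical, and the parameter is given as a fraction (0.5) rather than in the wgtval notation (50).

```python
def normalization_base(total_weight_1, total_weight_2, avginfo=0.5):
    """Return the value used as the denominator in normalization.

    Hypothetical helper. Assumes linear interpolation: 0 gives the minimum
    of the two totals, 1 gives the maximum, 0.5 gives the true average.
    """
    low = min(total_weight_1, total_weight_2)
    high = max(total_weight_1, total_weight_2)
    return low + avginfo * (high - low)


for setting in (0, 0.5, 1):
    print(setting, normalization_base(120, 140, setting))
```

Under this reading, a value closer to 0 shrinks the denominator, which reduces the penalty for tokens that are missing from the shorter record. A value closer to 1 increases that penalty.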
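Consider comparing the strings BILL JOHNSONS TRUCKS and B JOHNSONS BIG TRUCKS. The following exact match weights and penalties are used (the values are the ones that appear in the steps below):
- Exact match weights: B = 20, BILL = 30, JOHNSONS = 50, BIG = 30, TRUCKS = 40
- Initial Adjustment Weight: 5
- Position Adjustment Weight: 5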
The comparison is done in this series of steps. At every step, the best weight is carried forward.
- B with [BILL, JOHNSONS, TRUCKS]
- B and BILL partially match and the weight is:
20-5=15 (exact match – initial adjustment weight)
- JOHNSONS with [BILL, JOHNSONS, TRUCKS]
- JOHNSONS and JOHNSONS match exactly and the weight is 50+15=65 (exact match + previous weight).
- BIG with [BILL, JOHNSONS, TRUCKS]
- No matches
- TRUCKS with [BILL, JOHNSONS, TRUCKS]
- TRUCKS and TRUCKS match exactly and the weight is: 65+40-5=100 (previous weight+exact match weight – position penalty)
The position penalty weight is applied in this case, since there was a positional gap between this match and the previous match.
- The total for B JOHNSONS BIG TRUCKS is 20+50+30+40=140
- The total for BILL JOHNSONS TRUCKS is 30+50+40=120
- The average is (140+120)/2=130
- The normalized score is (100/130)*16=12.30 (the scale runs from 0 to the maximum normalized index, which is 16 here). This is now looked up in the mpi_wgtsval table under _NORM_ADJWGT_12 and that is the score used.
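The arithmetic of this example can be reproduced with a simplified sketch. It is not the product's algorithm: the function name is hypothetical, and only exact and initial matches are handled. A maximum normalized index of 16 is assumed, which yields the index 12 used in the lookup.

```python
# Simplified, hypothetical walk-through of the example above.
WEIGHTS = {"B": 20, "BILL": 30, "JOHNSONS": 50, "BIG": 30, "TRUCKS": 40}
INITIAL_ADJ = 5    # initial adjustment weight
POSITION_ADJ = 5   # position adjustment weight


def ordered_match_score(tokens_a, tokens_b):
    """Accumulate match weights while walking both token lists in order."""
    score = 0
    last_a = last_b = -1  # positions of the previous matched pair
    for i, token in enumerate(tokens_a):
        # Only look at tokens after the previous match, so order matters.
        for j in range(last_b + 1, len(tokens_b)):
            other = tokens_b[j]
            if token == other:
                gained = WEIGHTS[token]                  # exact match
            elif len(token) == 1 and other.startswith(token):
                gained = WEIGHTS[token] - INITIAL_ADJ    # initial match
            else:
                continue
            # Penalize a gap between this match and the previous one.
            if last_a >= 0 and (i - last_a > 1 or j - last_b > 1):
                gained -= POSITION_ADJ
            score += gained
            last_a, last_b = i, j
            break
    return score


a = ["B", "JOHNSONS", "BIG", "TRUCKS"]
b = ["BILL", "JOHNSONS", "TRUCKS"]
score = ordered_match_score(a, b)
total_a = sum(WEIGHTS[t] for t in a)
total_b = sum(WEIGHTS[t] for t in b)
average = (total_a + total_b) / 2
index = int(score / average * 16)  # assumed maximum normalized index of 16
print(score, total_a, total_b, average, index)
```

The running score passes through 15, 65, and 100, matching the steps listed above.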