AXP

AXP (address and phone) performs edit distance, phonetic, and frequency-based analysis on address data and an edit distance analysis on phone number data.

AXP uses the output of the ADDR2 standardization functions. It uses the cmpargs argument in mpi_cmpspec to specify the type of phonetic function to be applied. The possible phonetic functions (cmpargs) are: METAPHONE, IDENTAPHONE, ARABPHONE, and PREFIXMAP.

The AXP address and phone comparison function is based on information content and similarity. If the address consists of street information and postal codes, this information is used for comparison. When the postal code is not present, the City, State (or city, country) is used in the comparison.

The AXP function compares ordered name tokens. To account for match cases with prefixes or compound names, which might be missed in normal edit distance processing, use PREFIX or COMPOUND weight generation parameters. These parameters help achieve accurate name matching.

AXP further accounts for addresses that contain unit information, such as APT, FLOOR, or SUITE. By using the string type of SET, which is defined in mpi_strset, the address input is broken up by unit boundaries.

In some instances, you might want to give greater emphasis (greater weight) to certain tokens in the address string. For example, you might want to weight the ZIP Code higher than the streetline. The ADDR_ parameters enable this weighting of address tokens by affecting the average weight (AVERAGEWGT) used in the comparison. When you use these parameters, scoring is not penalized by missing tokens.

The AXP comparison function uses four weight tables. The mpi_wgtsval table contains the exact weights for the individual tokens (street address lines, numbers, postal codes, city, and state) and the parameters used in the AXP comparison. The mpi_wgt1dim table contains the weight for numeric exact match values in the address line. The mpi_wgt2dim is the final lookup table to get the final weight (for an address and phone index combination). The mpi_wgthead table contains the weight definitions for calculating match scores.

Number of roles: 2
Number of dimensions (in each role): 1
cmpargs: strcode containing unit indicators; the format is UNITTYPES=strcode
dvdargs: ALLOWHASH (The dvdargs setting is made in the standardization function, not in the actual AXP comparison function.)
Weight tables: mpi_wgt1dim, mpi_wgt2dim, mpi_wgthead, mpi_wgtsval
Other tables: mpi_cmpspec, mpi_cmphead, mpi_cmpfunc

The following steps describe the comparison.

  1. The street address is divided into subcomponents of streetline, unit boundary, region (city and state), and postal code. The address is broken by unit boundary if a strcode for unittypes is defined. If the ALLOWHASH dvdarg is passed from standardization, any # (number sign) in the address is not treated as punctuation. Otherwise, it is treated as punctuation and is stripped from the string.
  2. The information content for each of these subcomponents is calculated in both strings. The information content is the weight associated with each token and is computed as follows:
    • If the token type is ZIP code, the weight is taken from the 1DIMZIP table defined in mpi_wgt1dim.
    • If the token type is any other numeric, the weight is taken from the 1DIM table in mpi_wgt1dim.
    • If the token type is non-numeric, the weight is taken from the mpi_wgtsval table.

      Using the example of “12345 PARK RD TN”, the weight for 12345 is looked up at index 6 in the 1DIMZIP table in mpi_wgt1dim (length of numeric data). Weights for PARK, RD, and TN are looked up in mpi_wgtsval.

  3. The weight adjustment is done for both strings as follows. If the total streetline weight is greater than ADDR_STREET_MAXWGT, the streetline weight and each individual token weight within the streetline are scaled down proportionally so that the total equals the maximum. This process is repeated for the region and postal code weights.

    For example:

    6001 PLAZA ST
    6001 - 10
    PLAZA - 20
    ST - 10
    Total streetline weight = 40
    MaxWgt for streetline = 25

    The adjustment is as follows:
    6001 - 10 * 25/40 = 6.25
    PLAZA - 20 * 25/40 = 12.5
    ST - 10 * 25/40 = 6.25

  4. If the address contains a unit, the weight adjustment is as follows.
    1. Unit starting positions are determined based on settings in mpi_strset. Only the tokens in the first cmpval dimension (stline components) are examined.
    2. If the primary address components are equal and the unit components are unequal, then the weight is adjusted by the value of the ADDR_UNIT_ADJWGT parameter. Otherwise, the addresses are equal or one or both did not contain unit information, so no adjustment is made. For example, consider the following addresses:

      5001 Plaza on the Lake Blvd Suite 111 and 5001 Plaza on the Lake Blvd Suite 222

      These addresses would allow for the adjustment since they are equal in primary components (everything left of Suite) and unequal in unit components (everything right of Suite).

  5. The comparison is performed iteratively in a grid, like CXNM comparisons. The resulting weights in this comparison are the similarity weights. The tokens from the two street addresses being compared are checked for all possible matches (including compound name matching) as listed at the beginning of this section. At each grid comparison, the best match position is also noted. The positional penalty is CELLDIFF_ADJWGT_n, where n is the difference between table cells. The penalty is applied if the previous best match position and the current match are more than one cell apart. The resulting weight from this comparison is an input to the TOTALWGT.
  6. Post-grid processing is performed in the following manner:
    1. If there is a postal code in both strings and if the weights of both postal code strings are greater than the AddrPostalMin weight, then the postal code similarity weight (as calculated in step 5) is added to the TOTALWGT.

      The AVERAGEWGT is calculated by scaling the weights from two strings based on the ADDR_POSTAL_NORM.

      If AddrPostalNorm = 0, the minimum of the two string weights is added to the AVERAGEWGT.

      If AddrPostalNorm = 1, the maximum is added.

      For anything in between, the appropriate scaled weights are added.

    2. If the conditions from step 6.1 fail, a check is made to see if there is a region specified in both strings and if the region weights are more than AddrRegionMin. If they are, the region similarity (as calculated in step 5) is added to the TOTALWGT.

      The AVERAGEWGT processing is done in a manner like step 6.1 using the AddrRegionNorm factor.

    3. If there is not a “good” postal code or region (that is, step 6.2 fails), then (AddrRegionMinwgt + AddrPostalMinwgt)/2 is added to the AVERAGEWGT.
    4. The Streetline part of the address is then checked. If there are at least two street tokens in both strings and both streetline weights are more than AddrStreetMinwgt, the similarity from Streetline is added to the TOTALWGT.

      The AVERAGEWGT is calculated as in step 6.1 using the AddrStreetNorm factor.

    5. If there is not “good” street data (step 6.4 fails), then the AddrStreetMinwgt is added to the AVERAGEWGT.
  7. The Normalized index is then calculated for Address. The Normalized Similarity = NormMaxIdx(15) * TOTALWGT/AVERAGEWGT.

    Normalized Index = 15 - Normalized Similarity

    If the Normalized Index is greater than NormMaxIdx, then the Normalized Index = NormMaxIdx.

    If the Normalized Index is less than or equal to NormMinIdx, then the Normalized Index = 1.

    This index is the ADDRESSINDEX.

  8. The Phone comparison is done by using edit-distance compare. The string tokens are compared against each other and the best match (or minimum edit-distance) is taken. This distance is the PHONEINDEX (in the mpi_wgt1dim table, the edit-distance of 0 is index 1, and so on).
  9. For 2dim weights, the AddressIndex and Phone Index are the inputs to look up the mpi_wgt2dim table for the FINAL weight for AXP.
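
The post-grid accumulation in step 6 can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the `Component` structure, the `post_grid` function, and the dictionary keys are hypothetical names. The sketch assumes, as the worked example in this section does, that similarity weights accumulate in TOTALWGT while the scaled information-content weights accumulate in AVERAGEWGT.

```python
from typing import NamedTuple, Optional, Tuple


class Component(NamedTuple):
    """Weights for one address subcomponent (hypothetical structure)."""
    wgt_a: float        # information content in string A
    wgt_b: float        # information content in string B
    similarity: float   # similarity weight from the grid comparison
    tokens_a: int = 1   # token counts; only checked for the streetline
    tokens_b: int = 1


def scaled_average(w1: float, w2: float, norm: float) -> float:
    """norm=0 gives the minimum, norm=1 the maximum, 0.5 the true average."""
    lo, hi = min(w1, w2), max(w1, w2)
    return lo + norm * (hi - lo)


def post_grid(postal: Optional[Component], region: Optional[Component],
              street: Optional[Component], p: dict) -> Tuple[float, float]:
    """Return (TOTALWGT, AVERAGEWGT). None means the component is
    missing from at least one of the two strings."""
    total = average = 0.0

    def good(c: Optional[Component], min_wgt: float) -> bool:
        return c is not None and c.wgt_a > min_wgt and c.wgt_b > min_wgt

    if good(postal, p["postal_min"]):                      # step 6.1
        total += postal.similarity
        average += scaled_average(postal.wgt_a, postal.wgt_b, p["postal_norm"])
    elif good(region, p["region_min"]):                    # step 6.2
        total += region.similarity
        average += scaled_average(region.wgt_a, region.wgt_b, p["region_norm"])
    else:                                                  # step 6.3
        average += (p["region_min"] + p["postal_min"]) / 2

    if good(street, p["street_min"]) and min(street.tokens_a, street.tokens_b) >= 2:
        total += street.similarity                         # step 6.4
        average += scaled_average(street.wgt_a, street.wgt_b, p["street_norm"])
    else:                                                  # step 6.5
        average += p["street_min"]
    return total, average


# Illustrative parameter values; region_min and region_norm are assumed.
params = {"postal_min": 9, "postal_norm": 0.0,
          "region_min": 8, "region_norm": 0.5,
          "street_min": 10, "street_norm": 0.5}
```

Calling `post_grid(Component(10, 10, 10), None, Component(30, 30, 15, 3, 3), params)` with these values yields a TOTALWGT of 25 and an AVERAGEWGT of 40.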
AXP example:

The strings to be compared are:

6001 Plaza St, 78750
6001 New Plaza, 78750

Assume the following weights:

Plaza, 6001, St, New, 78750 = 10
Cell_diff_2 = 5
AddrPostalMinwgt = 9
AddrStreetMinWgt = 10
AddrPostalNorm = 0
AddrStreetNorm = 0.5

            6001    Plaza           St             78750
  6001      10*     10              10             10
  New       10      10              10             10
  Plaza     10      10+10-5=15**    15 (Stline)    15
  78750     10      15              15             15+10=25

  1. Streetline: 6001 Plaza St
    Postal code: 78750

    Streetline: 6001 New Plaza
    Postal code: 78750
  2. Information content:
    6001 Plaza St - 30
    78750 - 10

    6001 New Plaza - 30
    78750 - 10
  3. Assume that this information content is within the maximum limits.
  4. The comparison is token by token. 6001 and 6001 are the first compared tokens and they match exactly. This comparison is the best match so far (10*). As the comparison continues along Row 1, the best weight from the top, left, and diagonal cells is carried forward. Row 2 continues the same way because there is no match. In Row 3, Plaza and Plaza match exactly, so the weight is the carried-forward weight plus the exact match weight: 10 + 10 = 20. However, the last best match was in Row 1 and the current match is in Row 3. Because the cell difference is 2, the Cell_diff_2 penalty is applied: 20 - 5 = 15, which is the new best weight.

    StreetLine Similarity = 15
    Postal Code Similarity = 10 (Match wgt for 78750 and 78750)

  5. Since Postal code and Streetline satisfy the conditions, TotalWgt = 15 + 10 = 25.

    StreetLine AverageWgt = (30+30)/2 = 30 (since AddrStreetNorm = 0.5)
    PostalCode Average Wgt = Min(10, 10) = 10 (since AddrPostalNorm = 0)
    AverageWgt = 30 + 10 = 40

    As a variation, suppose that AddrPostalNorm = 0.5 and that the weights of the two postal codes are 10 and 20. Then PostalcodeAvgwgt = Min + AddrPostalNorm * (Max - Min) = 10 + 0.5 * (20 - 10) = 15.

  6. The NormIdx = 15 – 25/40 = 14.
  7. On the phone comparison, assume that the comparison gave an exact match. Thus, the edit-distance was 0, so the index is 1.
  8. The 2dim lookup index is (14,1).
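
The grid in the example above can be reproduced with a short Python sketch. It rests on several assumptions that the text does not confirm: only exact token matches are scored, the cell difference is measured between the row of the previous best match and the row of the current match, and penalties are supplied as a mapping from cell difference to penalty. The function name `similarity_grid` and its arguments are hypothetical.

```python
def similarity_grid(tokens_a, tokens_b, weights, penalties):
    """Build the grid of best cumulative weights.

    Columns are tokens of string A, rows are tokens of string B.
    `penalties` maps a cell difference to a positional penalty,
    for example {2: 5} for a Cell_diff_2 penalty of 5 (assumed format).
    """
    rows, cols = len(tokens_b), len(tokens_a)
    # Row 0 and column 0 are zero padding so every cell has neighbours.
    grid = [[0] * (cols + 1) for _ in range(rows + 1)]
    last_match_row = None
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            # Carry forward the best weight from the top, left, and diagonal.
            best = max(grid[r - 1][c], grid[r][c - 1], grid[r - 1][c - 1])
            if tokens_b[r - 1] == tokens_a[c - 1]:
                score = grid[r - 1][c - 1] + weights[tokens_a[c - 1]]
                if last_match_row is not None:
                    # Penalise matches more than one row from the last best match.
                    score -= penalties.get(r - last_match_row, 0)
                if score > best:
                    best = score
                    last_match_row = r
            grid[r][c] = best
    return [row[1:] for row in grid[1:]]


weights = {"6001": 10, "PLAZA": 10, "ST": 10, "NEW": 10, "78750": 10}
grid = similarity_grid(["6001", "PLAZA", "ST", "78750"],
                       ["6001", "NEW", "PLAZA", "78750"],
                       weights, {2: 5})
for row in grid:
    print(row)
```

Under these assumptions the third row holds 15 from the Plaza column onward and the final cell holds 25, which matches the streetline similarity of 15 plus the postal code similarity of 10.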

AXP parameters

There are two types of parameters used by AXP. The first set is used during the comparison step of weight generation. The second set uses the outputs from weight generation and is applied during the actual data comparison to determine a final score.

Weight generation comparison parameters:

As previously mentioned, with normal edit-distance processing, there are names that might not be recognized. Using the following weight generation parameters for prefix and compound address matching can significantly improve comparisons.

For example, if string A contains “CLEVELAND” and string B contains “CLEVE,” the strings are not close enough for the edit distance partial match. However, the prefix parameter adjusts for this scenario.

When comparing compound names, if string A contains “WAL MART” and string B contains “WALMART,” the compound parameters allow credit for a match.

Another example is string A with an address of “123 MAIN St SUITE 456” and string B “123 MAIN ST SUITE456.” Using the compound parameters would allow credit for a match.

These weight generation parameters include the following descriptions.

Note: These parameters are defined in the mpi_wgtsval table by using a wgtcode of PARM. When you generate your weights in InfoSphere® MDM Workbench, you can right-click and select Get Weights. The contents of mpi_wgtsval are copied into your project directory. From the project directory, you can open and edit the file.
  • PREFIX_FACTOR – The threshold for a prefix match is defined by a configurable integer, the prefix match factor (PREFIX_MATCH_FACTOR). For example: Ni = “MICRO” and Mj = “MICROSOFT”

    If Ni matches the beginning of Mj and PREFIX_MATCH_FACTOR* len(Ni )>=len(Mj), then the match is a prefix match.

    Setting PREFIX_MATCH_FACTOR to 2 means that Ni should be at least half the length of Mj. Setting PREFIX_MATCH_FACTOR to 3 indicates that Ni should be at least one-third the length of Mj.

    The default value for PREFIX_MATCH_FACTOR is 2.

  • PREFIX_ADJWGT – If a prefix match is identified, PREFIX_ADJWGT is used to adjust the prefix token (for example, “MICRO”) weight in the following manner:

    (Prefix compare token)Ni_wgt = MIN(Ni_wgt, Mj_wgt) - PREFIX_ADJWGT

    The default value for PREFIX_ADJWGT is 100 or 1.0.

  • PREFIX_MINWGT – Used as a lower boundary for any prefix weight adjusted token. The prefix adjusted weight never falls below this weight value. The default value for PREFIX_MINWGT is 50 or 0.5.
  • PREFIX_MAXWGT – Used as an upper boundary for any prefix weight adjusted token. The prefix adjusted weight never goes above this weight value. The default value for PREFIX_MAXWGT is 300 or 3.0.
  • COMPOUND_ADJWGT – A compound match is detected when comparing tokens “MICRO” “SOFT” versus “MICROSOFT”. COMPOUND_ADJWGT is used to adjust the compound token (“MICRO”) weight in the following manner:

    (Compound compare token “MICRO”)Ni_wgt = MIN(Ni_wgt, Mj_wgt) - COMPOUND_ADJWGT

    The default value for COMPOUND_ADJWGT is 50 or 0.5.

  • COMPOUND_MINWGT – Used as a lower boundary for any compound weight adjusted token. The compound adjusted weight never falls below this weight value. The default value for COMPOUND_MINWGT is 50 or 0.5.
  • COMPOUND_MAXWGT – Used as an upper boundary for any compound weight adjusted token. The compound adjusted weight never goes above this weight value. The default value for COMPOUND_MAXWGT is 400 or 4.0.
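
The prefix and compound rules above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions rather than statements from the text: the prefix token must be strictly shorter than the other token (an exact match is presumed to be handled earlier), and a compound match is taken to mean that two adjacent tokens concatenate to a single token in the other string. The default values are the ones listed above, expressed on the integer scale.

```python
def is_prefix_match(ni: str, mj: str, factor: int = 2) -> bool:
    """True when ni is a proper prefix of mj and
    PREFIX_MATCH_FACTOR * len(ni) >= len(mj)."""
    return len(ni) < len(mj) and mj.startswith(ni) and factor * len(ni) >= len(mj)


def is_compound_match(first: str, second: str, mj: str) -> bool:
    """True when two adjacent tokens join to form the other string's token."""
    return first + second == mj


def adjusted_weight(ni_wgt: int, mj_wgt: int,
                    adjwgt: int, minwgt: int, maxwgt: int) -> int:
    """MIN(Ni_wgt, Mj_wgt) - ADJWGT, bounded below by MINWGT and above by MAXWGT."""
    return max(minwgt, min(maxwgt, min(ni_wgt, mj_wgt) - adjwgt))


# Documented defaults for the two parameter families.
PREFIX = {"adjwgt": 100, "minwgt": 50, "maxwgt": 300}
COMPOUND = {"adjwgt": 50, "minwgt": 50, "maxwgt": 400}

print(is_prefix_match("CLEVE", "CLEVELAND"))         # 2 * 5 >= 9
print(is_prefix_match("MIC", "MICROSOFT"))           # 2 * 3 < 9
print(is_compound_match("WAL", "MART", "WALMART"))
print(adjusted_weight(400, 350, **PREFIX))           # 350 - 100, within bounds
```

With a factor of 3, `is_prefix_match("MIC", "MICROSOFT", factor=3)` succeeds because 3 * 3 >= 9, which shows how a larger factor accepts shorter prefixes.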

Comparison parameters:

The following conditions and penalties are used in comparison to determine the final score.

Boundary conditions. AXP uses these parameters to check boundary conditions for Streetlines, Postal code, and Region.

  • __ADDR_POSTAL_MAXWGT
  • __ADDR_POSTAL_MINWGT
  • __ADDR_REGION_MAXWGT
  • __ADDR_REGION_MINWGT
  • __ADDR_STREET_MAXWGT
  • __ADDR_STREET_MINWGT
  • __ADDR_UNIT_ADJWGT

The sum of the token weights in the streetline is checked to make sure that it does not exceed ADDR_STREET_MAXWGT. If it does, it is scaled down to the MAX value. Similarly, ADDR_POSTAL_MAXWGT and ADDR_REGION_MAXWGT are used for the Postal and Region subcomponents of the Address.

If the streetline token weights are less than the ADDR_STREET_MINWGT, then the streetline weight does not contribute to the Avgwgt. Instead the ADDR_STREET_MINWGT is used.

Similarly, if the Postal code and Region weights are less than their respective MINWGT, the average of the two MINWGTs contributes to the total average weight.
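
The scaling against a maximum weight can be sketched in Python as follows, using the figures from the 6001 PLAZA ST illustration earlier in this section. The function name `scale_to_max` and the list-of-pairs input format are hypothetical; the sketch assumes that every token weight is multiplied by the same factor, MAXWGT divided by the total.

```python
def scale_to_max(token_weights, max_wgt):
    """Scale (token, weight) pairs down proportionally when their
    total exceeds max_wgt; otherwise return them unchanged."""
    total = sum(w for _, w in token_weights)
    if total <= max_wgt:
        return list(token_weights)
    factor = max_wgt / total
    return [(tok, w * factor) for tok, w in token_weights]


streetline = [("6001", 10), ("PLAZA", 20), ("ST", 10)]
print(scale_to_max(streetline, 25))
# [('6001', 6.25), ('PLAZA', 12.5), ('ST', 6.25)]
```

After scaling, the token weights sum to the maximum of 25, and their relative proportions are unchanged.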

Position penalties. AXP uses these parameters to check for position penalties.

  • __CELLDIFF_ADJWGT_2
  • __CELLDIFF_ADJWGT_3
  • __CELLDIFF_ADJWGT_4
  • __CELLDIFF_MAXIDX
  • __CELLDIFF_MINIDX

CELLDIFF parameters are applied during the comparison of one token to another. When two tokens that match have a cell difference of 2, the total weight is reduced by CELLDIFF_ADJWGT_2. The weight is checked to make sure that it is within the range limits.

CELLDIFF_MAXIDX is used to check that the cell difference does not exceed this Max.

Edit-distance. The following parameters are used to check the phone edit-distance. If it is equal to MCCIDX_EQUAL, then it is an exact match, otherwise it is a partial match.

  • __DIST_MINIDX
  • __DIST_MAXIDX
  • __DIST_MCCIDX_EQUAL
  • __DIST_MCCIDX_PARTIAL

If the edit-distance from the phone number comparison is equal to DIST_MCCIDX_EQUAL, the phone numbers are considered an exact match. Similarly, if the distance is equal to DIST_MCCIDX_PARTIAL, they are considered a partial match.
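
The phone comparison can be sketched in Python as follows. The text does not say which edit-distance variant AXP uses, so the sketch assumes plain Levenshtein distance (insert, delete, substitute). The function names are hypothetical, and the mapping from distance to index follows the earlier statement that an edit-distance of 0 is index 1. Bounding by DIST_MINIDX and DIST_MAXIDX is left out because their values are not given here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_index(tokens_a, tokens_b) -> int:
    """Compare every phone token pair, keep the minimum distance,
    and convert it to an index (distance 0 -> index 1).
    Assumes both token lists are non-empty."""
    best = min(edit_distance(a, b) for a in tokens_a for b in tokens_b)
    return best + 1


print(phone_index(["5125551234"], ["5125551234"]))                 # exact match
print(phone_index(["5125551234", "5125550000"], ["5125551235"]))   # one digit differs
```

Taking the minimum over all token pairs means that a record with several phone numbers is scored on its closest pair.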

Index calculation. AXP uses these parameters to calculate the index for 2dim tables. It also checks for EQUAL or PARTIAL matches (like the ones previously described).

  • __NORM_MINIDX
  • __NORM_MAXIDX
  • __NORM_MCCIDX_EQUAL
  • __NORM_MCCIDX_PARTIAL

Like the Phone parameters above, the NORM_MCCIDX values are used for the address equal or partial match.

NORM_MAXIDX and NORM_MINIDX are used to bound the normalized value from the Address matching.
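
The index calculation and its bounds can be sketched in Python as follows, transcribing the formula as written in the comparison steps (Normalized Similarity = NormMaxIdx * TOTALWGT / AVERAGEWGT, Normalized Index = NormMaxIdx - Normalized Similarity). The function name is hypothetical, the NORM_MINIDX default of 1 is an assumption, and the text does not state how the value is rounded to an integer index, so the sketch leaves it unrounded.

```python
def normalized_index(total_wgt: float, average_wgt: float,
                     norm_max_idx: int = 15, norm_min_idx: int = 1) -> float:
    """Convert TOTALWGT and AVERAGEWGT to a bounded address index.
    Assumes average_wgt is positive."""
    similarity = norm_max_idx * total_wgt / average_wgt
    index = norm_max_idx - similarity
    if index > norm_max_idx:
        return norm_max_idx      # capped at NORM_MAXIDX
    if index <= norm_min_idx:
        return 1                 # floor stated in the text
    return index


print(normalized_index(40, 40))  # identical strings reach the floor
print(normalized_index(0, 40))   # no similarity reaches the cap
```

A negative TOTALWGT, which can arise from penalties, pushes the raw value above the cap and is therefore returned as NORM_MAXIDX.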

Partial string matches. These parameters are used for penalties for the partial string matches.

  • __EDITDIST_ADJWGT
  • __EDITDIST_FACTOR
  • __EDITDIST_MINWGT
  • __EDITDIST_MAXWGT
  • __FULLNAME_MAXWGT
  • __INITIAL_ADJWGT
  • __INITIAL_MINWGT
  • __INITIAL_MAXWGT
  • __NICKMETA_ADJWGT
  • __NICKMETA_MINWGT
  • __NICKMETA_MAXWGT
  • __NICKNAME_ADJWGT
  • __NICKNAME_MINWGT
  • __NICKNAME_MAXWGT
  • __PHONETIC_ADJWGT
  • __PHONETIC_MINWGT
  • __PHONETIC_MAXWGT

EDITDIST_FACTOR is used to determine whether the two strings can use an edit-distance comparison. EDITDIST_ADJWGT is the penalty applied for an edit-distance match. The final weight is then checked to be within the ranges of the MIN and MAX values.

Weight tokens. The following parameters are used to give greater weight to tokens. Each parameter is a number between 0 and 1. Using 0 results in the average being equal to the minimum weight. Using 1 results in the average being equal to the maximum weight. Using 0.5, the default value, results in the average being a true average. Here the average refers to the average weight for the two strings being compared.

  • ADDR_STREET_NORM (this parameter allows you to control the weight of the streetline)
  • ADDR_REGION_NORM (this parameter allows you to control the weight of the address region code)
  • ADDR_POSTAL_NORM (this parameter allows you to control the weight of the postal code)

The following weight tables are expected for AXP:

  • mpi_wgthead - the following entries must be present for AXP in the mpi_wgthead table
    1|1|A|CMPID-AXP-1DIM|1DIM|CMPID-AXP-1DIM|9|0|0|0|
    1|1|A|CMPID-AXP-1DIMZIP|1DIM|CMPID-AXP-1DIMZIP|9|0|0|0|
    1|1|A|CMPID-AXP-2DIM|2DIM|CMPID-AXP-2DIM|16|8|0|0|
    1|1|A|CMPID-AXP-PARM|SVAL|CMPID-AXP-PARM|0|0|0|0|
    1|1|A|CMPID-AXP-XACT|SVAL|CMPID-AXP-XACT|0|0|0|0|
    
  • mpi_wgtsval - mpi_wgtsval table for AXP should contain the *PARM tables and the *XACT tables. A partial sampling of these tables is listed here:
    1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_FACTOR|2|
    1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_MAXWGT|944|
    1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_MINWGT|50|
    1|1|A|CMPPATIENT-PATADDRESS-XACT|22ND|831|
    1|1|A|CMPPATIENT-PATADDRESS-XACT|26TH|822|
    1|1|A|CMPPATIENT-PATADDRESS-XACT|28TH|825|
    1|1|A|CMPPATIENT-PATADDRESS-XACT|35TH|831|
    1|1|A|CMPPATIENT-PATADDRESS-XACT|36TH|808|
    
  • mpi_wgt1dim - mpi_wgt1dim table should contain the *1DIM and *1DIMZIP tables. For example:
    1|1|A|CMPPATIENT-PATADDRESS-1DIM|0|0|
    1|1|A|CMPPATIENT-PATADDRESS-1DIM|1|794|
    
    1|1|A|CMPPROVIDER-AXP-1DIMZIP|1|494|
    1|1|A|CMPPROVIDER-AXP-1DIMZIP|2|267|
    
  • mpi_wgt2dim - the final weights are read from the mpi_wgt2dim table. The following is a sample of those weights:
    1|1|A|CMPID-AXP-2DIM|0|0|482|222|59|9|-31|-71|-83|0|0|0|0|0|0|0|0|
    1|1|A|CMPID-AXP-2DIM|1|482|524|339|292|331|358|359|395|0|0|0|0|0|0|0|0|
    1|1|A|CMPID-AXP-2DIM|2|359|455|295|280|281|258|257|289|0|0|0|0|0|0|0|0|
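
When a weights file is edited by hand, it can help to load the mpi_wgtsval rows into a lookup structure first. The following Python sketch parses rows in the pipe-delimited layout of the mpi_wgtsval sample above. The interpretation of the columns is an assumption based only on those sample rows: the fourth field is taken as the weight-table code, the fifth as the parameter or token, and the sixth as the integer weight. The function name `parse_wgtsval` is hypothetical.

```python
def parse_wgtsval(lines):
    """Group pipe-delimited mpi_wgtsval rows by weight-table code.

    Returns {table_code: {key: integer_weight}}. The first three
    fields are ignored because their meaning is not described here.
    """
    tables = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        table_code, key, value = fields[3], fields[4], int(fields[5])
        tables.setdefault(table_code, {})[key] = value
    return tables


sample = [
    "1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_FACTOR|2|",
    "1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_MAXWGT|944|",
    "1|1|A|CMPPATIENT-PATADDRESS-PARM|__PREFIX_MINWGT|50|",
    "1|1|A|CMPPATIENT-PATADDRESS-XACT|22ND|831|",
]
tables = parse_wgtsval(sample)
print(tables["CMPPATIENT-PATADDRESS-PARM"]["__PREFIX_MAXWGT"])
print(tables["CMPPATIENT-PATADDRESS-XACT"]["22ND"])
```

Keeping the PARM and XACT entries in separate dictionaries mirrors the way the text distinguishes parameter values from exact token weights.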