
Default Algorithm for InfoSphere MDM Probabilistic Matching Engine

White Papers


Abstract

The InfoSphere MDM Probabilistic Matching Engine is tailored to be a dedicated matching engine for InfoSphere MDM. The InfoSphere MDM Probabilistic Matching Engine provides organizations with the ability to perform party matching and suspected duplicate processing using a sophisticated and configurable scoring algorithm.

The InfoSphere MDM Probabilistic Matching Engine generates matching scores based on its probabilistic scoring system, and then InfoSphere MDM takes the score and uses it to determine survivorship and to decide what suspected duplicate processing actions must be completed.

Content

A basic algorithm configuration has three standard functions that work together to produce matching results. These are the common basic components:

1. Standardization
Standardization is the process by which all inputs (members) are processed in the same standard way so that they are all in the same format. This makes them easier to compare and gives the best results. The results of standardization (the standardized values) are then used in comparison.

2. Bucketing
Bucketing is the strategy that defines how input records are grouped together. It determines which input records are selected for comparison and used in a search; this is called Candidate Selection. Imagine grouping a set of clothes by color, or by color + material: this helps us find what we want faster and also gives us a better chance of finding matches.

A good bucketing strategy is one that does not repeat the same kind of members in different buckets and does not create too many members per bucket.

3. Comparison
Once the input records are standardized and bucketed, they can be compared against each other to produce a weight. A weight is a statistically derived value based on the data. Each attribute contributes a different weight based on the kind of comparison used for that attribute. The sum of the weights across all attributes is called a Score. The score determines whether the two records in question are a match, a non-match, or a suspect.


Configurable Options

PME allows greater user control and tuning of the algorithm components through a set of configuration files. The following lists some of the basic configurations used for this purpose.

1. ANON:
ANON refers to an Anonymous value. In a data set there are many values that should not be used in comparison or bucketing. Eg: BABY for names. A match on BABY does not give any useful information about the name for matching.

Anonymous values can be specified in a stranon configuration. If the input matches any value in this configuration, that input value is removed from further processing.

2. CMAP:
CMAP refers to the character mapping configuration. This configuration is used to map characters to other equivalent character values. e.g. if character A is mapped to character X in the CMAP configuration, all occurrences of A are replaced with X when this configuration is used. A small sketch of both ANON and CMAP handling follows.
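
This sketch is a rough Python illustration of how ANON filtering and CMAP character mapping could be applied to an input value. The table contents and function names are assumptions for illustration, not actual PME configuration or code.

# Hypothetical tables for illustration only.
ANON_VALUES = {"BABY", "UNKNOWN"}     # values that carry no matching information
CMAP = {"É": "E", "Ñ": "N"}           # character-to-character mappings

def apply_cmap(value):
    # Replace each character with its mapped equivalent, if one is defined.
    return "".join(CMAP.get(ch, ch) for ch in value.upper())

def apply_anon(value):
    # Drop the value entirely if it is listed as anonymous.
    return None if value.upper() in ANON_VALUES else value.upper()

print(apply_cmap("José"))       # -> JOSE
print(apply_anon("BABY"))       # -> None (removed from further processing)
print(apply_anon("PATRICIA"))   # -> PATRICIA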

3. EQUI:
EQUI refers to the equivalent value parameter. EQUI values can be specified in a strequi configuration. This configuration contains a list of nicknames or equivalent values:

Value | Equivalent value

EQUI can be used in two different ways:
i. Standardization and Comparison:
When an EQUI parameter is specified, the value is looked up in the strequi configuration and the corresponding equivalent value is used in its place.

When an EQUI parameter is specified in standardization, only input values that are present in the strequi configuration will be used for further processing. All other values will be removed.

Eg: We know that gender can take 2 values: M, F. So any other value will not be processed.

ii. Bucketing
When EQUI arguments are used in bucketing, they will be used to convert a set of values to a common equivalent value as defined by the configuration.

Eg: If PATRICIA and PATTY both have an EQUI value defined as PAT, the value of PAT will be used in the place of both PATRICIA and PATTY for further processing and bucketing. In other words, they are equivalent names. A sketch of both uses of EQUI follows.
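
This sketch illustrates the two uses of EQUI described above in Python. The strequi contents shown here are hypothetical examples, not the shipped configuration.

# Hypothetical strequi-style table: value -> equivalent value.
STREQUI = {"PATRICIA": "PAT", "PATTY": "PAT", "M": "M", "F": "F"}

def equi_for_standardization(value):
    # Only values present in the EQUI configuration survive;
    # anything else (for example an unexpected gender code) is removed.
    return STREQUI.get(value.upper())          # None means "removed"

def equi_for_bucketing(value):
    # Map a value to its common equivalent so that nickname variants
    # land in the same bucket; unknown values pass through unchanged.
    return STREQUI.get(value.upper(), value.upper())

print(equi_for_standardization("X"))    # -> None (not a known gender code)
print(equi_for_bucketing("PATTY"))      # -> PAT
print(equi_for_bucketing("PATRICIA"))   # -> PAT (same bucket value)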


4. STRSET:
The strset configuration is currently used in standardizing addresses. Two types of arguments are currently used.

i. PATTERNSET – This specifies which patterns the postcodes fall under. It is used in the standardization of addresses and postal codes.

ii. UNITTYPES – This specifies which words/characters are used to indicate unit information (Floor, Suite, etc.) in an address. It is used in the Address Comparison function.


Person Algorithm
1. Name
(PERLEGALNAME, PERALIASNAME, PERMAIDENNAME, PERPREFNAME, PERPREVNAME, PERAKNAME, PERNICKNAME, PERBUSNAME)

Special Notes
All the name attributes are combined under a common comparison structure and this structure is used in Comparison. This is different from putting each type of name attribute under its own structure. This is done because we want to be able to match better even if ANY of the name attributes match.

Eg: Consider the following names:
PERLEGALNAME: PATRICIA
PERALIASNAME: PATZIE
Name under its own structure: PATRICIA and PATZIE will not be compared against each other.
Name under a combined structure: PATRICIA and PATZIE will be compared against each other, thus bringing more possible matches together.

Standardization
All person name attributes are standardized the same way using PXNM name standardization
Arguments: ANON string code - ANAME

Why PXNM?
PXNM is designed specifically to handle personal name standardization. It takes into account the structure of a personal name (Last Name, First Name, Middle Name, and so on) and its possible variations, and splits the name into multiple tokens that are easy to compare. This function also handles suffix and prefix standardization.

What does it do?
i. The name is split into different components (known as tokens)
ii. Special characters are removed. All digits are removed. The name is then converted to uppercase
iii. The CMAP table is checked for any character conversion. Refer to the CMAP configuration in the Configurable Options above
iv. The tokens are then placed into their respective components of Suffix, Prefix, Last name, etc., based on comparison with a set of internal tables
v. All tokens are then checked against the ANON table. If present, they are removed from the list going forward
Eg: Ms. JENNIFER\DE CATHELINE 77 JUNIOR will be standardized as:
DECATHELINE::JENNIFER:.JR:MS
<LastName>:<no middle name>:<FirstName>:.<Suffix>:<Prefix>
A simplified sketch of these steps follows.
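
This sketch is a much-simplified Python version of steps i–v. It is not the PXNM function itself (for example, it does not join compound surnames or shorten JUNIOR to JR), and all the lookup tables are hypothetical.

import re

# Hypothetical lookup tables for illustration only.
CMAP     = {}                                   # character conversions
PREFIXES = {"MS", "MR", "MRS", "DR"}
SUFFIXES = {"JR", "SR", "JUNIOR", "SENIOR"}
ANAME    = {"BABY", "UNKNOWN"}                  # ANON values for names

def standardize_person_name(name):
    # i.  split into tokens; ii. drop special characters and digits, uppercase
    tokens = re.split(r"[\s\\,.]+", name.upper())
    tokens = [re.sub(r"[^A-Z]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # iii. apply CMAP character conversions
    tokens = ["".join(CMAP.get(c, c) for c in t) for t in tokens]
    # iv.  place tokens into prefix / suffix / name components
    prefix = [t for t in tokens if t in PREFIXES]
    suffix = [t for t in tokens if t in SUFFIXES]
    names  = [t for t in tokens if t not in PREFIXES | SUFFIXES]
    # v.   remove anonymous values
    names  = [t for t in names if t not in ANAME]
    return {"names": names, "suffix": suffix, "prefix": prefix}

print(standardize_person_name("Ms. JENNIFER\\DE CATHELINE 77 JUNIOR"))
# -> {'names': ['JENNIFER', 'DE', 'CATHELINE'], 'suffix': ['JUNIOR'], 'prefix': ['MS']}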


Bucketing
i. Name Only (EQUI)
ii. Name + Zip (EQUIMETA + POSTCODE)
iii. Name + DOB (EQUIMETA + DOB)

Arguments:
i. MaximumBucketSize
When we create buckets using the above strategies, we want to limit the bucket size by providing a maximum. The limit for all PME buckets is now set to 5000. This means that no more than 5000 members per bucket are created.

ii. BKTANON
When we want certain input values to be used for comparison but not for bucketing, because they may create large buckets, we specify a set of ANON values. Eg: Middle name initials

Why these buckets?
There are many strategies for bucketing, but some tried and tested practices have worked well in the past. The above bucketing strategies are based on some of these past learned behaviors.

i. Name (EQUI) – Group all names that are nicknames of each other into a single bucket, so that when someone searches for a name using the full name or a nickname, we are still able to get that result.
ii. Name + ZIP (EQUIMETA + POSTCODE) – Group names that have the same phonetic value for their nicknames + postcode into a single group. This gathers more candidates together and also ensures that we don't repeat buckets.
iii. Name + DOB (EQUIMETA + DOB) – Group names that have the same phonetic value for their nicknames + date of birth. Same reasoning as ii. A sketch of these bucket keys follows below.
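
This sketch derives the three bucket values in Python. The nickname table and the phonetic function are simplified stand-ins for the configured EQUI table and the metaphone-style encoding that PME actually uses.

# Hypothetical nickname (EQUI) table and a crude phonetic stand-in.
NICKNAMES = {"PATRICIA": "PAT", "PATTY": "PAT", "ROBERT": "BOB"}

def equi(name):
    return NICKNAMES.get(name.upper(), name.upper())

def meta(name):
    # Crude phonetic stand-in: keep the first letter, drop later vowels.
    name = name.upper()
    return name[:1] + "".join(c for c in name[1:] if c not in "AEIOU")

def name_bucket(name):
    return equi(name)                             # i.  Name only (EQUI)

def name_zip_bucket(name, postcode):
    return meta(equi(name)) + ":" + postcode      # ii. Name + Zip (EQUIMETA + POSTCODE)

def name_dob_bucket(name, dob):
    return meta(equi(name)) + ":" + dob           # iii. Name + DOB (EQUIMETA + DOB)

print(name_bucket("Patty"))                   # -> PAT
print(name_zip_bucket("Patricia", "78756"))   # -> PT:78756
print(name_dob_bucket("Patty", "1980-02-14")) # -> PT:1980-02-14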

Comparison
QXNM function is used for Name comparison

Why QXNM?
QXNM name comparison compares names based on multiple measures of similarity and information content. It compares based on nickname matching, phonetic matching, edit-distance matching, nickname + phonetic matching, and initial matching. These comparisons are also controlled by arguments that can be tuned by the user. This name comparison also limits the weights to a MAX weight, so that we can be sure not to exceed an allowable value.

What does it do?
i. QXNM compares every token against every other token.
ii. Align the pairs based on all the matches (exact, nickname, phonetic, and so on)
iii. Align the non-matching pairs
iv. Add the weights for all the matched pairs
v. Add a positional bonus weight of 0.20 if the matched pair were in the exact same location (Eg: BOB ROBERT vs. ROBIN ROBERT. Here ROBERT and ROBERT match at the same position)
vi. Add a positional bonus weight of 0.10 if the matched pair were off by one location (Eg: BOB ROBERT vs. ROBERT KING)
vii. Subtract the weight for non-matching pairs
viii. The result is the final weight for the name match. A simplified sketch of this scoring follows.
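
This sketch walks through a greatly simplified version of the token-pair scoring in Python. The match and mismatch weights are fixed, illustrative numbers; only the 0.20 and 0.10 positional bonuses come from the description above.

# Illustrative only: fixed weights instead of statistically derived ones.
MATCH_WEIGHT, MISMATCH_PENALTY = 3.0, -1.0

def tokens_match(a, b):
    # Stand-in for exact / nickname / phonetic / edit-distance matching.
    return a == b

def qxnm_like_score(name1, name2):
    t1, t2 = name1.upper().split(), name2.upper().split()
    used, score = set(), 0.0
    for i, a in enumerate(t1):
        for j, b in enumerate(t2):
            if j in used or not tokens_match(a, b):
                continue
            used.add(j)
            score += MATCH_WEIGHT                  # iv.  weight for the matched pair
            if i == j:
                score += 0.20                      # v.   same-position bonus
            elif abs(i - j) == 1:
                score += 0.10                      # vi.  off-by-one bonus
            break
        else:
            score += MISMATCH_PENALTY              # vii. penalty for an unmatched token
    return score

print(qxnm_like_score("BOB ROBERT", "ROBIN ROBERT"))  # ROBERT matches at the same position
print(qxnm_like_score("BOB ROBERT", "ROBERT KING"))   # ROBERT matches one position off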


2. Person Address
(Primary Residence, Mailing Address, Other Residence, Secondary residence, Temporary Residence, Summer Residence)

Special Notes:
All of the above residence addresses are combined into the same comparison structure. Refer to Special Notes on Person Name comparison for the reasoning.

Standardization
The address standardization has 2 components:
i. POSTCODE using RZIP standardization
ii. ADDRESS using INTADDR2R standardization
The postcode is standardized with RZIP because it will be used along with the Name to form a bucket (EQUIMETA + POSTCODE). Refer to Name bucketing above for details.

Why RZIP and INTADDR2R?
INTADDR2R and RZIP standardizations are designed for all kinds of addresses, with user-configurable options that help handle any kind of postal code. They also let the user specify options so that the comparison can treat parts of the address, such as the unit number, differently.

What does it do?
StreetLine processing:
i. If the address has at least one component and fewer than 4 input values, it is treated as StreetLine input. The CMAP character table is applied
ii. The characters are all converted to uppercase and digits are retained. If the user specifies an argument of ALLOWHASH, the # sign is also retained; otherwise, it is converted to a space
iii. The words are separated by spaces. Each of these words (or tokens) is converted to its abbreviated version if applicable (Eg: NORTH EAST is converted to NE)
iv. Each token is then identified as numeric or alphanumeric. Numeric tokens contain digits but are not ordinals such as 1st or 2nd. An "N" is added to each numeric token and an "S" to each non-numeric token.

City and State processing:
i. If the address has more than 4 but fewer than 6 input values, it is assumed to have City and State components
ii. City and State are standardized by converting to uppercase and removing non-alphabetic characters and spaces

ZipCode processing (this is the same for INTADDR2R and RZIP):
i. If there are more than 6 address values, the zipcode processing is done
ii. The non-alphanumeric characters are removed.
iii. The input is then converted to a pattern (Eg: 7875A is converted to NNNNA).
iv. These patterns are then checked against the valid patterns defined by the user in the mpi_strset table. If the converted input pattern does not match any pattern in mpi_strset, the postcode is treated as anonymous
v. The arguments for the address function specify the name of the table that defines these patterns in the mpi_strset table
vi. The actual input value is checked against the ANON table and removed if present
A simplified sketch of this zipcode processing follows.
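
This sketch illustrates steps ii–vi for a postcode in Python. The pattern set and anonymous values stand in for the user-defined mpi_strset PATTERNSET and ANON entries and are examples only.

import re

# Hypothetical PATTERNSET and ANON entries for illustration.
VALID_POSTCODE_PATTERNS = {"NNNNN", "NNNNA", "ANANAN"}
POSTCODE_ANON = {"00000", "99999"}

def standardize_postcode(value):
    value = re.sub(r"[^A-Za-z0-9]", "", value).upper()             # ii.  drop non-alphanumerics
    pattern = "".join("N" if c.isdigit() else "A" for c in value)  # iii. build the pattern
    if pattern not in VALID_POSTCODE_PATTERNS:                     # iv.  unknown pattern -> anonymous
        return None
    if value in POSTCODE_ANON:                                     # vi.  listed anonymous value
        return None
    return value

print(standardize_postcode("7875A"))    # -> 7875A  (pattern NNNNA is allowed)
print(standardize_postcode("78-756"))   # -> 78756  (pattern NNNNN is allowed)
print(standardize_postcode("ABC"))      # -> None   (pattern AAA is not configured)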

Bucketing
The POSTCODE from the address components is combined with Name to define a bucket. Refer to Name bucketing for more details.

Comparison
AXP comparison used

Why AXP?
AXP is an Address x Phone comparison that combines the results of both Address and Phone in a two-dimensional comparison. This comparison uses the information content of a string and its similarity to determine a match. It can use multiple components of the address, such as street line, city, and state, and weight them differently.
AXP uses 2 inputs – Phone strings and Address strings

What does it do?
For Phone Strings:
The phone strings are compared using an edit-distance compare. Refer to Appendix I for more on edit-distance

For Address Strings:
i. The address strings are compared by an ordered token comparison. This means each token in one string is compared with every other token in the other string while keeping the order intact. Order is important for this matching, and out-of-order tokens are penalized.
ii. For each comparison, either a numeric comparison or a string comparison is done. For the string comparison, edit-distance, phonetic, and frequency-based comparisons are done. Compound name and prefix matching is also done at this step.
iii. If the address has unit information such as #, Unit, Floor, etc., AXP will weight it differently based on user-defined arguments
iv. If the postcode is not present, the City and State are used as a replacement
v. Once the ordered name tokens are compared, the resulting weight is normalized with respect to the maximum possible weight from the two strings. This also accounts for missing tokens. The maximum possible weight of a token is the value that the token would get when compared to itself.
vi. The user can choose arguments to define parts of the address that should be weighted more than others. Eg: if the street name is more important than City and State.
vii. The normalized value obtained above is then looked up in a two-dimensional table along with the phone comparison result (see Phone Strings above) to get the final match weight.

Eg: Record 1 :
Address : 6001, Plaza Street, MA, 78756
Phone: 7879900

Record 2:
Address: 5001, Plaza Lake, MA, 77668
Phone: 7879900

Assume the result of the address compare is 3 (after normalization)
Assume the result of the phone compare is 1 (edit distance)

The final weight for the comparison of these two records will be the value in the two-dimensional table at index (3, 1). A sketch of this lookup follows.
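
This sketch shows the two-dimensional lookup for the example above in Python. The table values are made up for illustration; real PME weights are statistically derived from the data.

# Hypothetical two-dimensional weight table: rows are indexed by the normalized
# address comparison result, columns by the phone comparison result.
AXP_WEIGHTS = [
    # phone: 0     1     2+
    [ -2.0, -3.0, -4.0],   # address result 0
    [  0.5, -0.5, -1.5],   # address result 1
    [  2.0,  1.0,  0.0],   # address result 2
    [  4.5,  3.5,  2.0],   # address result 3
]

def axp_weight(address_result, phone_result):
    phone_result = min(phone_result, 2)     # clamp large phone results
    return AXP_WEIGHTS[address_result][phone_result]

# Record 1 vs Record 2 above: address result 3, phone result 1.
print(axp_weight(3, 1))   # -> 3.5, the value at index (3, 1) of the table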

3. Phone
All phone attributes are combined under the same structure. Refer to Name for more details.

Standardization
ATTR standardization used

Arguments : ANON value of PHONE

Why ATTR?
ATTR is a generic attribute standardization that can be used with any attribute. Its results can then be used with any kind of comparison and bucketing.

What does it do?
i. All alphanumeric characters are converted to uppercase. Non-alphanumeric characters are removed
ii. If the resulting value is present in the ANON table, it is treated as anonymous and removed from comparison
iii. If an EQUI table is specified, the table is checked to see whether the input values are defined there. If they are NOT, they are removed from comparison.

Bucketing
Ngram bucketing is used

Why Ngram?
Ngram is used to improve candidate selection when attributes contain similar values that may vary in spelling or position. Eg: 512 634 5116 and 634 5116 both contain the same last 7 digits. To make sure we bucket these together, we use Ngram.

What does it do?
It splits the input into overlapping substrings (n-grams) of length N, and each n-gram becomes a bucket value. N is configurable by the user.
Eg: N = 4 with an input of 634 5111 produces the following buckets:
6345 3451 4511 5111
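
This minimal Python sketch reproduces the example above; the helper name ngram_buckets is illustrative, not part of PME.

def ngram_buckets(value, n=4):
    # Remove spaces, then emit every contiguous substring of length n.
    digits = value.replace(" ", "")
    return [digits[i:i + n] for i in range(len(digits) - n + 1)]

print(ngram_buckets("634 5111"))       # -> ['6345', '3451', '4511', '5111']
print(ngram_buckets("512 634 5116"))   # contains every 4-gram of '634 5116'
print(ngram_buckets("634 5116"))       # so the two phone numbers share buckets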

Comparison
Phone is compared using AXP. Refer above to Address attribute comparison for details

4. Business Address & Business Phone
Business addresses and business phones are handled separately because we expect they may differ from the personal details, so we want to match them differently for better results. The way we standardize and compare these attributes is similar to personal address and phone. They differ from personal address and phone in the kind of matching weight assigned to them: business addresses are weighted lower than personal addresses, which makes sense for a Person algorithm.

Special notes for PERSON Address and BUSINESS Address:
If a person has both a personal and a business address in the input, we use only the better of the two addresses, that is, the one with the higher weight. This ensures that we don't over-weight without adding any real value for matching.
The way to implement this is to specify the same CmpGroup value for both the Person and Business Address comparisons. In the eME algorithm, the AXP comparison functions for both person and business comparisons have a CmpGroup property value that is set to 1.

Standardization
Business Address: Uses INTADDR2R and RZIP for the address and postcode respectively
Business Phone: Uses ATTR

Bucketing
NGram bucketing used.
Refer to Phone( above) for details

Comparison
AXP comparison used
Refer to AXP ( Above) for details

5. SSN

Standardization
ATTRN is used

Arguments : SSN ANON value

Why ATTRN?
ATTRN is designed to specifically handle numeric attributes

What does it do?
i. Remove all non-numeric characters.
ii. Convert everything to uppercase
iii. Check if any value is anonymous. If it is, remove it

Bucketing
ATTR used

Why ATTR?
ATTR is a general bucketing function that can handle any kind of attribute. SSN is not combined with any other attribute to form a bucket because its unique value makes it a good indicator of a person's identity.

Comparison
DR1D1C is used

Why DR1D1C?
DR1D1C is a function designed to handle edit-distance comparisons. Since SSN values are numeric, edit-distance will be a good indicator of matching. Refer to Appendix I for more details on edit-distance

What does it do?
The strings are matched iteratively using edit-distance. The larger the edit-distance, the less likely it is that the strings match. A sketch of such an edit-distance comparison follows.
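
This Python sketch computes an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions (see Appendix I) and maps it to a weight through a hypothetical table; DR1D1C's actual weights are derived from the data.

# The distance-to-weight table below is illustrative, not a real DR1D1C weight set.
DISTANCE_WEIGHTS = {0: 5.0, 1: 3.0, 2: 0.5}   # larger distance -> smaller weight

def edit_distance(a, b):
    # Counts insertions, deletions, substitutions, and adjacent transpositions.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def ssn_weight(ssn1, ssn2):
    return DISTANCE_WEIGHTS.get(edit_distance(ssn1, ssn2), -2.0)  # default penalty

print(edit_distance("123456789", "123456798"))  # -> 1 (one transposition)
print(ssn_weight("123456789", "123456798"))     # -> 3.0
print(ssn_weight("123456789", "987654321"))     # -> -2.0 (too different)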

6. SIN

Standardization
ATTR is used

Arguments : SIN ANON value

Why ATTR?
ATTR is designed to specifically handle alpha-numeric attributes

What does it do?
i. Remove all non-alpha numeric characters.
ii. Convert everything to uppercase
iii. Remove any anonymous value

Bucketing
ATTR used

Why ATTR?
ATTR is a general bucketing function that can handle any kind of attribute. SIN is not combined with any other attribute to form a bucket because its unique value makes it a good indicator of a person's identity.

Comparison
DR1D1C is used

Why DR1D1C?
DR1D1C is a function designed to handle edit-distance comparisons. Since SIN values are numeric, edit-distance is a good indicator of matching. Refer to Appendix I for more details on edit-distance.

What does it do?
The strings are matched iteratively using edit-distance. The larger the edit-distance, the less likely it is that the strings match.

7. DOB

Standardization
DATE2 standardization used

Arguments: DATE ANON, DATE2 EQUI

Why DATE2?
DATE2 handles incomplete dates besides regular complete dates. It has configurable options that give the user more control on the standardization

What does it do?

i. Check against the DATE2 EQUI configuration to replace dates as applicable
ii. The month and day are checked. If the month is between 1 and 12, the month is left as is. Otherwise, the month is set to 00.
iii. If the output month is not valid and the day is between 1 and 31, the day is left as is. Otherwise, the day is set to 00.
iv. If the month and year are valid, the month and year are used to determine whether the day is valid. If the day is valid, it is left as is. Otherwise, the day is set to 00.
v. If the month is valid but the year is not, a determination is made on whether the day is valid (a leap year is assumed). If the day is valid, it is left as is. Otherwise, the day is set to 00.
vi. After completing the day analysis, if one or both of the month and day entries is invalid:
a. Return to the original month and day and transpose month/day. The analysis is repeated.
b. If the analysis yields a valid month and day, the transposed date is used in the output with a prefix of 'T'.
A simplified sketch of this logic follows.
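
This Python sketch is a simplified version of the logic above, using an assumed YYYYMMDD output layout with 00 placeholders; it is not the actual DATE2 function.

import calendar

def _valid_day(year, month, day):
    if not 1 <= day <= 31:
        return False
    if 1 <= month <= 12:
        # Assume a leap year when the year is unknown or invalid (step v).
        y = year if year > 0 else 2000
        return day <= calendar.monthrange(y, month)[1]
    return True                                # month unknown: generic 1-31 check

def standardize_date(year, month, day):
    out_month = month if 1 <= month <= 12 else 0
    out_day = day if _valid_day(year, out_month, day) else 0
    if out_month and out_day:
        return f"{year:04d}{out_month:02d}{out_day:02d}"
    # vi. try transposing month and day before giving up
    t_month = day if 1 <= day <= 12 else 0
    t_day = month if _valid_day(year, t_month, month) else 0
    if t_month and t_day:
        return "T" + f"{year:04d}{t_month:02d}{t_day:02d}"
    return f"{year:04d}{out_month:02d}{out_day:02d}"   # keep the 00 placeholders

print(standardize_date(1980, 2, 14))   # -> 19800214
print(standardize_date(1980, 14, 2))   # -> T19800214 (month/day transposed)
print(standardize_date(1980, 2, 31))   # -> 19800200  (day invalid for February)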

Bucketing


Date attribute is combined with Personal Name attribute to form a bucket. Refer to Name bucketing for more details

Comparison
DATE2 comparison is used

Why DATE2?
DATE2 comparison is designed to handle dates that the DATE2 standardization outputs. Both these functions are designed to work in tandem. In addition, DATE2 also handles Date-Date, Date-Age, and Age-Age comparisons.

What does DATE2 do?
i. DATE2 matches records based on weights held in 4 different weight files.
ii. The different weight files are based on the kind of comparison done with the date (for example Year, Month, Day, or Age).
iii. The final matching weight is obtained by adding the weights for the different kinds of comparisons (as described in step ii).

8. GENDER

Standardization
ATTR is used

Arguments: GENDER equi

Why ATTR?
ATTR is a generic attribute standardization function. Since Gender can take values such as M, F, Male, Female, 1, 2, etc., ATTR is used because it handles all kinds of values.

What does it do?
Refer to the SIN standardization for the ATTR description.

Since there is a GENDER equi parameter, only values that are mentioned in the EQUI configuration will be used. Any input value that does not match these EQUI values will be removed from the algorithm.

Bucketing
We do not bucket on gender because it provides little value for matching.

Comparison
EQVD is used

Why EQVD?
EQVD is a string comparison that has a binary output (match or no match). Since the gender attribute is a simple string value, we use a simple string comparison.

What does EQVD do?
The two values are compared, and the resulting weight is based on whether or not they match.

9. FPF2
FPF2 is a False Positive Filter function. False positive filters are used to filter information, rather than function as normal comparison functions.

False positive filters are used to ascertain whether the records do indeed represent the same person rather than members belonging to the same family.
Currently this function is not used in eME. It is a configurable option and can be used at a later point.



Organization Algorithm

1. Name(ORGLEGALNAME,..)
All types of Organization names are combined into a common structure for comparison. Refer to Person Name Special notes for more information

Standardization
All name segments are standardized using CXNM name standardization

Arguments: ANON string code – ACNAME
CMAP string code – CMAP-CXNM

Why CXNM?
CXNM is a name standardization designed to handle company/business names well by building the standardization around the characteristics of business names, which may include special characters. CXNM can also handle single-token ANONs and multi-token ANONs. Single-token anonymous values are removed on a per-word basis, whereas whole-value anonymous values cause the entire name to be treated as anonymous (missing).

What does CXNM do?
i. If a CMAP argument is specified, the CMAP character conversion is done
ii. All characters are converted to uppercase. Digits are not removed. Special characters such as & are removed. Eg: AT & T becomes ATT
iii. If there are spaces between single characters in a word, they are removed. Eg: I B M becomes IBM
iv. Single-token and multi-token anonymous values are removed
A simplified sketch of these steps follows.
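
This Python sketch is a simplified version of steps i–iv (CMAP omitted); the anonymous-value lists are hypothetical examples, not the shipped ACNAME configuration.

import re

# Hypothetical ACNAME entries: single-token and whole-value anonymous names.
SINGLE_TOKEN_ANON = {"INC", "LLC", "CORP"}
WHOLE_VALUE_ANON = {"UNKNOWN COMPANY"}

def standardize_org_name(name):
    if name.upper().strip() in WHOLE_VALUE_ANON:
        return None                          # whole value is anonymous (missing)
    # ii. uppercase, keep digits, drop special characters such as &
    #     together with the spaces around them, so AT & T becomes ATT
    cleaned = re.sub(r"\s*[^A-Z0-9\s]+\s*", "", name.upper())
    # iii. join runs of single characters separated by spaces: I B M -> IBM
    tokens, joined, run = cleaned.split(), [], []
    for t in tokens:
        if len(t) == 1:
            run.append(t)
            continue
        if run:
            joined.append("".join(run))
            run = []
        joined.append(t)
    if run:
        joined.append("".join(run))
    # iv. drop single-token anonymous values
    return [t for t in joined if t not in SINGLE_TOKEN_ANON]

print(standardize_org_name("AT & T"))       # -> ['ATT']
print(standardize_org_name("I B M Corp"))   # -> ['IBM']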

Bucketing
i. Nickname+Meta(EQUIMETA)
ii. Nickname+Postcode(EQUI + POSTCODE)
Refer to Person Name bucketing for more details

Arguments: CNICKNAME – Nickname string

Comparison
CXNM name comparison used

Arguments: METAPHONE (type of phonetic encoding to be applied)

Why CXNM?
CXNM uses edit distance, phonetic, nickname, compound name, and prefix matching to score members. It also penalizes for order and missing tokens, thus giving more accurate matches.

What does CXNM do?
CXNM compares tokens iteratively keeping the order intact just like AXP. Refer to AXP for more details.

2. Business Address
Similar to person Business Address algorithm

3. Business Phone
Similar to Person Business Phone algorithm

4. OrgDunsNumber
Similar to Person SSN algorithm

Arguments: DUNSNUM anon

5. OrgCorporateTaxID
Similar to Person SIN algorithm

Argument: CORPTAXID anon



Appendix I:

Edit distance: Edit distance measures the similarity between two tokens by calculating the number of character insertions, deletions, or transpositions it would take to make the tokens match. e.g. 123 and 132 have an edit distance of 1 through transposition

Phonetic name comparison (META): Applying a phonetic encoding to a token.

Nickname comparison (EQUI): Applying a nickname equivalent of a name.

Nickname-meta comparison (NICKMETA): Applying a phonetic encoding to nickname equivalents.

Compound Matching: Matching of disjoint names or abbreviations. Eg: IBM USA and IBMUSA
Prefix Matching: Matching based on the first N characters of the strings. Eg: INIT and INITIATE

Weight: Weight is a statistically derived value based on the data.

Score: Sum of all weights across all the attributes

Token: Part of the input or the entire input. Can be used interchangeably with component

Input: The input value. Can be used interchangeably with record and member.


Threshold:
AL (Auto-link): This is a score above which the records are automatically linked.
CR (Clerical Review): This is a score below which the records are never linked together.
Tasks: The records that score between AL and CR. Tasks are the equivalent of suspects. A sketch of this classification follows.
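
This Python sketch classifies a score against the two thresholds; the threshold values themselves are hypothetical and are normally set during algorithm tuning.

# Hypothetical threshold values for illustration only.
AUTOLINK_THRESHOLD = 11.0        # AL
CLERICAL_REVIEW_THRESHOLD = 5.0  # CR

def classify(score):
    if score >= AUTOLINK_THRESHOLD:
        return "auto-link"   # records are linked automatically
    if score >= CLERICAL_REVIEW_THRESHOLD:
        return "task"        # suspect: routed for clerical review
    return "no link"         # below CR: the records are not linked

print(classify(12.4))   # -> auto-link
print(classify(7.0))    # -> task
print(classify(2.1))    # -> no link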

Original Publication Date

15 April 2015

[{"Product":{"code":"SSWSR9","label":"IBM InfoSphere Master Data Management"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF016","label":"Linux"},{"code":"PF033","label":"Windows"}],"Version":"10.0;10.0.0;10.1;10.1.0;11.0;11.0.0;11.3;11.4;9.7;9.5;9.2","Edition":"Standard Edition;Advanced Edition","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Product Synonym

MDS;Master Data Service;MDM;MDM SE;MDMSE;Master Data Management;IBM Infosphere Master Data Service;MDM Standard Edition;MDM Hybrid Edition;Initiate;Hybrid;Physical MDM;Virtual MDM;Hybrid MDM;PME Engine;Big Match

Document Information

Modified date:
27 April 2022

UID

swg27045724