IBM Match 360 with Watson matching algorithms

IBM Match 360 with Watson uses matching algorithms to resolve data records into master data entities. Data engineers can define different matching algorithms for each entity type in their data. The matching algorithms can then analyze the data to evaluate and compare records, and then collect matched records into entities.

There are two common reasons to run matching on your data:

For record deduplication and entity resolution, the matching process analyzes your data to determine whether it contains any duplicate records. Suspected duplicate records are merged into master data entities to establish a single, trusted, 360-degree view of your data.
To create other types of entity associations, the matching process analyzes your data to collect records into entities that represent different kinds of groupings, such as a household.

In this topic:

Matching to create more than one type of entity
The matching process
Components of the matching algorithm
Edit distance

Matching to create more than one type of entity

IBM Match 360 matching algorithms are driven by the entity type of the associated data. You can define more than one entity type for each record type in the data model. For each entity type, configure and tune its corresponding matching algorithm to ensure that IBM Match 360 creates entities that meet your organization's requirements.

A single record can be part of more than one separate entity. If your data model includes more than one entity type, you can run different types of matching across the same data set. For example, consider a data set that includes person records from across your enterprise. If the Person record type includes definitions for a Person entity type and a Household entity type, then you can run the Person matching algorithm for entity resolution and deduplication, and also run the Household matching algorithm to create entities made up of person records that belong to the same household.
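As a rough illustration of the Person and Household example, the following sketch groups the same three records in two different ways. It is not how IBM Match 360 matches records: the exact-key grouping stands in for a real matching algorithm, and the record fields and the `group_by` helper are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical person records; the field names are illustrative only.
records = [
    {"id": "r1", "name": "ANA RUIZ", "birth_date": "1980-02-01", "address": "12 OAK ST"},
    {"id": "r2", "name": "ANA RUIZ", "birth_date": "1980-02-01", "address": "12 OAK ST"},
    {"id": "r3", "name": "LUIS RUIZ", "birth_date": "1978-07-15", "address": "12 OAK ST"},
]


def group_by(records, key_fields):
    """Group record ids by an exact key (a stand-in for a real matching algorithm)."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec["id"])
    return sorted(groups.values())


# One "algorithm" per entity type, both run over the same data set.
person_entities = group_by(records, ["name", "birth_date"])
household_entities = group_by(records, ["address"])

print(person_entities)
print(household_entities)
```

In this toy data, record r1 belongs to one person entity (with r2) and also to one household entity (with r2 and r3).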

The matching process

The matching engine goes through a defined process to match records into entities. The matching process includes three major steps:

Standardization. During this step, the algorithm standardizes the format of the data so that it can be processed by the matching engine.
Bucketing. The algorithm sorts data into various categories or "buckets" so that it can compare like-to-like pieces of information.
Comparison. The algorithm compares data to determine a final comparison score. The algorithm then uses the comparison score to determine whether the records are a match.

Each of these steps is defined and configured by the matching algorithm.
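The three steps can be sketched in a few lines of Python. This is a simplified illustration, not the IBM Match 360 implementation: the function names, the token-based bucket keys, the field-equality score, and the threshold value are all assumptions chosen to keep the example small.

```python
from collections import defaultdict
from itertools import combinations


def standardize(record):
    """Step 1: normalize case and whitespace so that values are comparable."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}


def bucket_keys(record):
    """Step 2: derive coarse keys; only records that share a key are compared."""
    return {"NAME:" + token for token in record["name"].split()}


def compare(a, b):
    """Step 3: toy score, one point per field whose standardized values are equal."""
    return sum(1 for field in ("name", "city") if a[field] == b[field])


def match_pairs(records, threshold=2):
    """Run the three steps and return the pairs that reach the threshold."""
    std = {rid: standardize(rec) for rid, rec in records.items()}
    buckets = defaultdict(set)
    for rid, rec in std.items():
        for key in bucket_keys(rec):
            buckets[key].add(rid)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return sorted(p for p in candidates if compare(std[p[0]], std[p[1]]) >= threshold)


records = {
    "r1": {"name": "john  smith", "city": "Austin"},
    "r2": {"name": "John Smith", "city": "austin"},
    "r3": {"name": "Jane Smith", "city": "Austin"},
}
print(match_pairs(records))
```

Bucketing keeps the comparison step from having to score every possible pair: records that share no bucket key are never compared.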

Components of the matching algorithm

Two main types of components define an IBM Match 360 matching algorithm:

Standardizers
Entity types

Standardizers

As the name suggests, standardizers define how data gets standardized. Standardization enables the matching algorithm to convert the values of different attributes to a standardized representation that can be processed by the matching engine.

The matching algorithm uses multiple standardizers. Each standardizer is suited to process specific attribute types found in record data.

Standardizers are defined by JSON objects. Each standardizer's JSON object definition contains three elements:

label - A label that identifies this standardizer.
inputs - The inputs list has one element, which is a JSON object. That JSON object has two elements: fields and attributes:
- fields - The list of fields to use for standardization.
- attributes - The list of attributes to use for standardization.
standardizer_recipe - A list of JSON objects in which each object represents one step to be run during the standardization process of the associated standardizer. Each object in the standardizer_recipe list consists of the following main elements:
- label - A label that identifies this step in the standardizer recipe.
- method - The internal method used. This element is just for reference and must not be edited.
- inputs - A single element of the inputs list defined one level above.
- fields - A list of the fields to be used for this step. This is generally a subset of all the fields defined within the inputs list one level above. Not every step needs to process all of the inputs fields.
- set_resource - The name of a set type customizable resource used for this step.
- map_resource - The name of a map type customizable resource used for this step.
Depending on the behavior of a step, there might be more configuration elements that are required in the corresponding JSON object.
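To make the shape of a standardizer definition concrete, here is a minimal sketch written as a Python dictionary that mirrors the JSON structure described above. Only the key names come from this topic. Every value is a hypothetical placeholder, the way the step's inputs element refers back to the outer inputs list is an assumption, and the `check_standardizer` helper is an illustrative lint, not part of any IBM tool.

```python
# All values are hypothetical placeholders; only the key names come from the text.
standardizer = {
    "label": "Example name standardizer",
    "inputs": [
        {"fields": ["given_name", "last_name"], "attributes": ["legal_name"]}
    ],
    "standardizer_recipe": [
        {
            "label": "Example step",
            "method": "<internal method, do not edit>",
            # Assumption: a reference to the single element of the outer inputs list.
            "inputs": [1],
            "fields": ["given_name"],
            "set_resource": "example_set_resource",
            "map_resource": "example_map_resource",
        }
    ],
}


def check_standardizer(std):
    """Return a list of problems found against the structure described in the text."""
    problems = [
        f"missing element: {key}"
        for key in ("label", "inputs", "standardizer_recipe")
        if key not in std
    ]
    if problems:
        return problems
    if len(std["inputs"]) != 1:
        return ["inputs must have exactly one element"]
    declared = set(std["inputs"][0].get("fields", []))
    for step in std["standardizer_recipe"]:
        # The text says step fields are "generally" a subset; flag anything else.
        extra = set(step.get("fields", [])) - declared
        if extra:
            problems.append(f"step {step.get('label')!r} uses undeclared fields: {sorted(extra)}")
    return problems


print(check_standardizer(standardizer))
```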

Entity types

Within a single matching algorithm, each record type can have multiple entity type definitions (entity_type JSON objects). For example, in an algorithm defined for a person record type, you might need to create more than one entity type definition, such as person entity, household entity, location entity, and others.

Each entity type can be used to match and link records in different ways. An entity type defines how records are bucketed and compared during the matching process.

Each entity type definition (entity_type) in the matching algorithm has four JSON elements:

clerical_review_threshold - Records that have a comparison score lower than the clerical review threshold are considered non-matches.
auto_link_threshold - Records that have a comparison score higher than the auto-link threshold are considered strong enough matches to be matched automatically.
bucket_generators - This section contains the definition of the bucket generators configured for an entity type. There are two types of bucket generators: buckets and bucket groups.
- Buckets involve bucketing for only one attribute. Each bucket definition includes four elements:
  - label - A label that identifies the bucket generator.
  - maximum_bucket_size - The size above which a bucket is considered too large. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
  - inputs - For buckets, the inputs list has only one element, which is a JSON object. That JSON object has two elements: fields and attributes:
    - fields - The list of fields to use for bucketing.
    - attributes - The list of attributes to use for bucketing.
  - bucket_recipe - A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
    - label - A label that identifies the bucket recipe element.
    - method - The internal method used. This element is just for reference and must not be edited.
    - inputs - A single element of the inputs list defined one level above.
    - fields - A list of the fields to be used for this bucket. This is generally a subset of all the fields defined within the inputs list one level above.
    - min_tokens - The minimum number of tokens to use when the recipe is forming a bucket hash.
    - max_tokens - The maximum number of tokens to use together when the recipe is forming a bucket hash.
    - count - A limit on the number of bucket hashes that a bucket generator produces for a single record. If a record generates many bucket hashes, only the number of hashes set by this element are used.
    - bucket_group - The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes would not be assigned a sequence number.
    - order - Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
    - maximum_bucket_size - A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level; also having it at the bucket recipe level gives you finer control over large individual buckets.
- Bucket groups involve bucketing for more than one attribute. Each bucket_group definition includes five elements:
  - label - A label that identifies the bucket generator.
  - maximum_bucket_size - The size above which a bucket is considered too large. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
  - inputs - For bucket groups, the inputs list has more than one JSON object element. The JSON objects each have two elements: fields and attributes:
    - fields - The list of fields to use for bucketing.
    - attributes - The list of attributes to use for bucketing.
  - bucket_recipe - A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
    - label - A label that identifies the bucket recipe element.
    - method - The internal method used. This element is just for reference and must not be edited.
    - inputs - A single element of the inputs list defined one level above.
    - fields - A list of the fields to be used for this bucket. This is generally a subset of all the fields that are defined within the inputs list one level above.
    - min_tokens - The minimum number of tokens to use when the recipe is forming a bucket hash.
    - max_tokens - The maximum number of tokens to use together when the recipe is forming a bucket hash.
    - count - A limit on the number of bucket hashes that a bucket generator produces for a single record. If a record generates many bucket hashes, only the number of hashes set by this element are used.
    - bucket_group - The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes would not be assigned a sequence number.
    - order - Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
    - maximum_bucket_size - A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level. Being able to define it at the bucket recipe level gives you finer control over large individual buckets.
    - set_resource - The name of a set type resource used for a bucket recipe.
    - map_resource - The name of a map type resource used for a bucket recipe.
    - output_fields - If this recipe produces new fields after it completes bucketing functions on the input fields, this element contains a list of the names of the generated fields.
  - bucket_group_recipe - A bucket group recipe section is typically used for defining buckets that consist of more than one attribute. Every element of a bucket_group_recipe list is a JSON object defining the construct for a single bucket group.
    - The inputs list within bucket_group_recipe has more than one element, which means it refers to more than one attribute defined in the inputs array one level above.
    - The fields element is a list of lists. Every inner list of fields is associated with the respective attributes list.
    - The min_tokens and max_tokens lists have more than one element, with each element corresponding to the respective attributes list.
    Note: In some bucketing recipe definitions, there is a property that is named search_only. By default, its value is false. If set to true, this property indicates that a bucket or bucket group is used only for probabilistic search scenarios and is not used for entity resolution (matching) scenarios.
compare_methods - Definitions of the comparison methods that are configured for an entity type. Each compare_methods JSON object consists of definitions of various compare methods. The matching algorithm adds up the scores from each compare method definition to get the final comparison score. Each compare method's JSON object contains three elements:
- label - A label that identifies the compare method.
- methods - A list of comparators that form a comparison group. Every element in this array represents one comparator, meant for one type of matching attribute. The matching algorithm considers the maximum of the scores from all the comparators in a methods list as the final score from this comparison group. Each comparator definition includes two elements:
  - inputs - For comparators, the inputs list has only one element, which is a JSON object. That JSON object has two elements: fields and attributes:
    - fields - The list of fields to use for comparison.
    - attributes - The list of attributes to use for comparison.
  - compare_recipe - This list is used mainly for defining the comparison steps. Typically, there is only one JSON element in this array, representing only one step for doing the comparison. This step has five elements:
    - label - A label that identifies the comparison step.
    - method - The internal method used. This element is just for reference and must not be edited.
    - inputs - A single element of the inputs list defined one level above.
    - fields - The fields to be used for this comparison out of all of the fields that are defined in the inputs list one level above.
    - comparison_resource - The name of a customizable comparison resource used for this comparison step.
- weights - Each comparison that is done by a comparator results in a numeric score from 0 to 10. This number is called the distance or dissimilarity measure. A distance of 0 indicates that the values being compared are exactly the same. A distance of 10 indicates that they are completely different. Corresponding to the 11 distinct values (0 - 10), 11 weights are defined for each comparator. After calculating the distance, the compare method looks up the corresponding weight in the weights list. That weight is the comparator's contribution to the comparison score. Data engineers can customize the weights as needed, based on the data quality, distribution, or other factors.
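The scoring rules described for compare_methods, weights, and the two thresholds can be sketched as follows. This is an illustration under stated assumptions, not the product's code: the function names, weight values, and threshold values are made up, and the handling of scores that fall between the two thresholds (labeled clerical_review here, based only on the threshold's name) or exactly on a threshold is not stated in this topic.

```python
def comparator_score(distance, weights):
    """Map a distance (0 to 10) to its weight; one weight per distance value."""
    if len(weights) != 11:
        raise ValueError("a comparator needs exactly 11 weights")
    if not 0 <= distance <= 10:
        raise ValueError("distance must be between 0 and 10")
    return weights[distance]


def group_score(comparators):
    """A comparison group scores the maximum over its comparators."""
    return max(comparator_score(d, w) for d, w in comparators)


def total_score(groups):
    """The final comparison score is the sum over all compare methods."""
    return sum(group_score(group) for group in groups)


def classify(score, clerical_review_threshold, auto_link_threshold):
    """Apply the thresholds; the middle band is an assumption (see lead-in)."""
    if score > auto_link_threshold:
        return "auto_link"
    if score < clerical_review_threshold:
        return "non_match"
    return "clerical_review"


# Hypothetical weights: index 0 is an exact match, index 10 is completely different.
name_weights = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
address_weights = [3, 2, 1, 0, 0, -1, -1, -2, -2, -3, -3]

name_group = [(1, name_weights), (3, name_weights)]  # two comparators, distances 1 and 3
address_group = [(0, address_weights)]               # one comparator, distance 0

score = total_score([name_group, address_group])
print(score, classify(score, clerical_review_threshold=3, auto_link_threshold=6))
```

Here the name group contributes 4 (the larger of 4 and 2) and the address group contributes 3, so the total is 7, which is above the hypothetical auto-link threshold of 6.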

Edit distance

The IBM Match 360 matching engine calculates edit distance as one of the internal functions during comparison and matching of various attributes. Edit distance is a measurement of how dissimilar two strings are from each other. It is calculated by counting the number of changes required to transform one string into the other.
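One widely published way to count such changes is the Levenshtein distance, which counts single-character insertions, deletions, and substitutions. This topic does not name the exact function that IBM Match 360 uses, so the sketch below is only an example of the general technique.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


print(levenshtein("SMITH", "SMYTH"))
print(levenshtein("kitten", "sitting"))
```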

There are different ways to define edit distance by using different sets of string operations. By default, IBM Match 360 uses a standard edit distance function that is publicly available in literature. As an alternative, you can choose to use a specialized IBM Match 360 edit distance function.

The standard edit distance function gives the matching engine better performance. For this reason, it is the default comparison configuration for all attributes except for the Telephone attribute type.
The specialized edit distance function is built for hyper-precision use cases. This option takes into consideration typos or similar-looking characters, such as 8 and B, 0 and O, 5 and S, or 1 and I. When there is a mismatch in two compared values based on similar-looking characters, the assigned dissimilarity measure is less than what would be assigned by a standard edit distance function. As a result, these types of mismatches are not penalized as strongly by the specialized function.
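The idea behind the specialized function can be illustrated with a weighted edit distance in which substituting similar-looking characters costs less than any other substitution. The character pairs come from the description above. The cost of 0.5, the function names, and the overall approach are assumptions for illustration only, because the actual IBM Match 360 calculation is not described in this topic.

```python
# Similar-looking pairs named in the text; the comparison is case-sensitive.
SIMILAR = {frozenset(pair) for pair in ("8B", "0O", "5S", "1I")}


def substitution_cost(x, y, similar_cost=0.5):
    """Full cost for most substitutions, reduced cost for look-alike characters."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in SIMILAR:
        return similar_cost  # hypothetical reduced penalty
    return 1.0


def weighted_edit_distance(a, b, similar_cost=0.5):
    """Levenshtein-style distance with a cheaper substitution for look-alikes."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,      # deletion
                curr[j - 1] + 1.0,  # insertion
                prev[j - 1] + substitution_cost(ca, cb, similar_cost),
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("SMITH", "5MITH"))  # look-alike mismatch
print(weighted_edit_distance("SMITH", "XMITH"))  # ordinary mismatch
```

With these assumed costs, the look-alike mismatch (S and 5) yields a smaller distance than the ordinary mismatch (S and X), which is the behavior the text describes.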

Important: The specialized edit distance function includes some complex calculations. As a result, choosing this option has an impact on system performance during the matching process.

Note: Prior to IBM Cloud Pak for Data 4.0 refresh 7 (4.0.7), the specialized edit distance function was the only option for calculating edit distance.

For information about customizing your matching algorithm, including using the API to customize the edit distance, see Customizing and strengthening your matching algorithm.
