MULT_ALIGN comparison

Scores the similarity of two sequences of terms. This comparison combines your knowledge of how similar the terms are, the order of the similar terms, and the proximity of the similar terms. You can use MULT_ALIGN to compare addresses where the sequences of terms are in different orders.

Three independent scores factor into the final score:
  • Similarity of the terms
  • Order of similar terms in their original sequence
  • Proximity of similar terms in their original sequence

Required Columns

The following data source and reference source columns are required:
  • Data. The character string from the data source.
  • Reference. The character string from the reference source (only applies to a two-source match).

Parameters

The following three parameters control the relative importance that each of the three independent scores has to the final score. Assign the highest number to the score that is the most important to you. For example, if you enter a value of 200 for MatchMix, 100 for OrderMix, and 100 for CompactMix, that means that the similarity score is twice as important as the order score and proximity score. It also means that the order score and proximity score are equally important.
MatchMix
Enter a positive integer that represents the relative importance of the similarity score for all of the matched terms.
OrderMix
Enter a positive integer that represents the relative importance of the order score for matched terms that score at or above the value that you enter for the FactorCutoff parameter.
CompactMix
Enter a positive integer that represents the relative importance of the proximity score for matched terms that score at or above the value that you enter for the FactorCutoff parameter.
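The relative weighting described above can be sketched as a weighted average. This is only an illustration: the text does not give the exact formula that MULT_ALIGN uses, so the weighted-average form and the function name combine_scores are assumptions.

```python
def combine_scores(similarity, order, proximity,
                   match_mix, order_mix, compact_mix):
    """Blend three component scores by their relative importance.

    Assumption: the final score is a weighted average of the three
    component scores. The real engine may combine them differently.
    """
    total = match_mix + order_mix + compact_mix
    if total <= 0:
        raise ValueError("at least one Mix parameter must be positive")
    weighted = (match_mix * similarity
                + order_mix * order
                + compact_mix * proximity)
    return weighted / total


# With MatchMix=200, OrderMix=100, CompactMix=100, the similarity score
# carries half of the total and the other two scores a quarter each.
share_of_similarity = combine_scores(1.0, 0.0, 0.0, 200, 100, 100)
share_of_order = combine_scores(0.0, 1.0, 0.0, 200, 100, 100)
```

Under this assumption only the ratios between the three values matter, so 200/100/100 behaves the same as 2/1/1.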
The following parameters control the similarity score:
MatchParm
Enter an integer from 0 to 900 that represents the weight that the UNCERT match comparison uses to determine its tolerance of errors. Higher numbers mean that the comparison is less tolerant of differences in the strings. MatchParm is similar to the Param 1 parameter for the UNCERT comparison. Use these values as a rough guideline:
  • 900. The two strings must be identical.
  • 850. The two strings can be safely considered to be the same.
  • 800. The two strings are probably the same.
  • 750. The two strings are probably different.
  • 700. The two strings are almost certainly different.

The assigned weight is proportioned linearly between the agreement and disagreement weights. For example, if you specify 700 and the score is 700 or less, then the full disagreement weight is assigned. If the strings agree exactly, the full agreement weight is assigned.

As another example, suppose you specify 850 for MatchParm, which means that the tolerance is relatively low. A score of 800 gets the full disagreement weight because it is lower than the value that you specified. Even though a score of 800 normally means that the strings are probably the same, the low tolerance that you set treats the pair as a disagreement.
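The linear proportioning described above might look like the following sketch. The function name, the weight values, and the exact interpolation endpoints (MatchParm maps to the full disagreement weight, 900 maps to the full agreement weight) are assumptions made for illustration and are not taken from the product.

```python
def proportioned_weight(score, match_parm, agreement_weight, disagreement_weight):
    """Map an UNCERT-style score (0-900) to a weight.

    Assumptions: 900 means the strings are identical, any score at or
    below match_parm earns the full disagreement weight, and scores in
    between are interpolated linearly.
    """
    if score >= 900:
        return agreement_weight
    if score <= match_parm:
        return disagreement_weight
    fraction = (score - match_parm) / (900 - match_parm)
    return disagreement_weight + fraction * (agreement_weight - disagreement_weight)


# Hypothetical weights: +10 for agreement, -10 for disagreement.
strict = proportioned_weight(800, 850, 10.0, -10.0)    # below MatchParm
lenient = proportioned_weight(800, 700, 10.0, -10.0)   # halfway between 700 and 900
```

With MatchParm set to 850 the score of 800 earns the full disagreement weight, while with MatchParm set to 700 the same score lands halfway between the two weights.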

MultType
Select one of the following values to determine how the match normalizes the score for two sequences of terms when the sequences do not contain the same number of terms:
  • 0 – Maximum number of words in the two sequences
  • 1 – Minimum number of words in the two sequences
  • 2 – Number of words in the first sequence
  • 3 – Number of words in the second sequence
  • 6 – Minimum number of words plus x, where x is the result of the ExtraTerm computation.
ExtraTerm
When the MultType value is 6, enter a positive integer for the percent of the difference between the greater and lesser of the two word counts to add to the minimum word count. An ExtraTerm value of 0 is equivalent to a MultType value of 1. An ExtraTerm value of 100 is equivalent to a MultType value of 0.
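As a rough illustration, the word count that each MultType value selects for normalization can be computed as follows. The function name is hypothetical, and because the text does not say how a fractional result is rounded for MultType 6, the sketch returns it unrounded.

```python
def normalizing_word_count(first_count, second_count, mult_type, extra_term=0):
    """Return the word count used to normalize the score.

    first_count and second_count are the numbers of words in the first
    and second sequences. extra_term is a percent and is only used when
    mult_type is 6.
    """
    longest = max(first_count, second_count)
    shortest = min(first_count, second_count)
    if mult_type == 0:
        return longest
    if mult_type == 1:
        return shortest
    if mult_type == 2:
        return first_count
    if mult_type == 3:
        return second_count
    if mult_type == 6:
        # Minimum plus a percentage of the gap between the two counts.
        return shortest + (longest - shortest) * extra_term / 100
    raise ValueError(f"unsupported MultType value: {mult_type}")


# Sequences of 4 and 6 words with MultType 6 and ExtraTerm 50.
midpoint = normalizing_word_count(4, 6, 6, extra_term=50)
```

The two equivalences stated for ExtraTerm fall out of the arithmetic: 0 percent adds nothing to the minimum, and 100 percent adds the whole difference, which yields the maximum.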
MatchRange
Enter a positive integer for the percent of the number of terms in the longer of the two sequences (percentage of the maximum word count). The resulting number of terms establishes a comparison radius that determines how different the position of two terms in their respective sequences can be and still be compared. For example, if the longer sequence contains 20 terms and you enter 50 for the MatchRange parameter, the match compares only the terms that are within 10 positions of each other.
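The radius calculation in this example can be sketched as follows. The function names are hypothetical, and two details are assumptions: that a fractional radius is left unrounded, and that "within 10 positions" includes a distance of exactly 10.

```python
def comparison_radius(first_count, second_count, match_range):
    """Radius, in term positions, derived from the longer sequence."""
    return max(first_count, second_count) * match_range / 100


def can_compare(first_position, second_position, radius):
    """True when two term positions are close enough to be compared."""
    return abs(first_position - second_position) <= radius


# Longer sequence of 20 terms with MatchRange 50 gives a radius of 10.
radius = comparison_radius(20, 12, 50)
near = can_compare(3, 13, radius)   # distance 10
far = can_compare(3, 14, radius)    # distance 11
```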
OutOfRangeScore
Enter a positive integer for the percent of the default or rare value disagreement weight that is used to calculate a missing term weight. All terms in the shorter sequence must be scored against something. If all of the terms in the longer sequence that are within the range that is determined by the MatchRange parameter are paired with other terms, the value of the OutOfRangeScore parameter is used as the score for the unpaired terms.
The following parameter controls which pairs of matched terms are used in the calculations of the order and proximity scores:
FactorCutoff
Enter a positive integer for the percent of the default or rare value agreement weight that is used to set a cutoff point for matched terms that are scored for order and proximity. Setting a cutoff score eliminates marginally positive and negative scores because those terms are really not matching. For example, for a FactorCutoff of 33, the lowest-scoring third of the term pairs will not be scored for order and proximity.
The following parameter controls the order score:
OrderParm
The value of this parameter determines the order score tolerance for errors. Enter a positive integer for the percent of the difference between the default agreement and disagreement weights that is used to penalize each out-of-order matched term. A lower number translates to more tolerance and a higher number translates to less tolerance.
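A minimal sketch of the per-term penalty follows. The function name and the weight values are hypothetical, and the text does not say how out-of-order terms are counted, so the count is passed in as an input.

```python
def order_penalty(out_of_order_terms, agreement_weight, disagreement_weight, order_parm):
    """Total penalty for matched terms that are out of order.

    Each out-of-order term costs order_parm percent of the difference
    between the agreement and disagreement weights.
    """
    per_term = (agreement_weight - disagreement_weight) * order_parm / 100
    return out_of_order_terms * per_term


# Hypothetical weights of +10 and -10, two out-of-order terms.
tolerant = order_penalty(2, 10.0, -10.0, 10)   # lower OrderParm, smaller penalty
strict = order_penalty(2, 10.0, -10.0, 50)     # higher OrderParm, larger penalty
```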
The following parameters control the proximity score:
GapOpen
Enter a positive integer for the percent of the default or rare value agreement weight that is used to determine the proximity score penalty for the occurrence of each gap between matched terms.
GapExtend
Enter a positive integer for the percent of the default or rare value agreement weight that is used to determine the proximity score penalty for each additional space in a gap.
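Taken together, GapOpen and GapExtend describe what is commonly called an affine gap penalty: a fixed cost for each gap plus a smaller cost for each additional space in it. The sketch below shows that arithmetic. The function name and the weight value are hypothetical, and the way the penalty feeds into the proximity score is not specified in the text.

```python
def gap_penalty(gap_lengths, agreement_weight, gap_open, gap_extend):
    """Total penalty for the gaps between matched terms.

    gap_lengths lists the number of spaces in each gap. The first space
    of a gap costs gap_open percent of the agreement weight, and every
    additional space costs gap_extend percent.
    """
    open_cost = agreement_weight * gap_open / 100
    extend_cost = agreement_weight * gap_extend / 100
    penalty = 0.0
    for length in gap_lengths:
        if length <= 0:
            continue  # no gap, no penalty
        penalty += open_cost + (length - 1) * extend_cost
    return penalty


# Hypothetical agreement weight of 10, GapOpen 50, GapExtend 10.
one_space = gap_penalty([1], 10.0, 50, 10)
three_spaces = gap_penalty([3], 10.0, 50, 10)
```

Setting GapExtend lower than GapOpen penalizes one long interruption less than several short ones of the same total length.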

Examples

The following examples illustrate how term order and term proximity are scored.

In the first example, the order score is higher for the first pair because all matched terms are in the same order.

First pair:
  Apartment 4-B Building 5
  Apartment 4-B Building 5
Second pair:
  Building 5 Apartment 4-B
  Apartment 4-B Building 5

In the next example, the proximity score is higher for the first pair of sequences because the second pair has a term (Upstairs) that interrupts the sequence of matched terms.

First pair:
  Building 5 Apartment 4-B
  Apartment 4-B Building 5
Second pair:
  Building 5 Apartment 4-B
  Apartment 4-B Upstairs Building 5