MULT_ALIGN comparison
Scores the similarity of two sequences of terms. You can use MULT_ALIGN to compare addresses where the sequences of terms are in different orders. The comparison combines three factors:
- Similarity of the terms
- Order of similar terms in their original sequence
- Proximity of similar terms in their original sequence
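As a rough illustration of how three factor scores might be blended, the sketch below treats the MatchMix, OrderMix, and CompactMix parameters (described under Parameters) as weights in a weighted average. The weighted-average formula and the function name `blended_score` are assumptions made for illustration; this document states only that the three Mix values express relative importance.

```python
def blended_score(match_score, order_score, compact_score,
                  match_mix, order_mix, compact_mix):
    """Blend three factor scores using the Mix values as relative weights.

    Assumption: a plain weighted average. The documentation describes the
    Mix parameters only as "relative importance", not as an exact formula.
    """
    total_mix = match_mix + order_mix + compact_mix
    if total_mix <= 0:
        raise ValueError("at least one Mix value must be positive")
    weighted_sum = (match_mix * match_score
                    + order_mix * order_score
                    + compact_mix * compact_score)
    return weighted_sum / total_mix


# Hypothetical factor scores; similarity counts twice as much as each other factor.
print(blended_score(10.0, 4.0, 1.0, match_mix=2, order_mix=1, compact_mix=1))
```

Under this reading, only the ratio between the three Mix values matters: 2, 1, 1 and 20, 10, 10 give the same result.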
Required Columns
- Data. The character string from the data source.
- Reference. The character string from the reference source (only applies to a two-source match).
Parameters
- MatchMix
- Enter a positive integer that represents the relative importance of the similarity score for all of the matched terms.
- OrderMix
- Enter a positive integer that represents the relative importance of the order score for matched terms that score at or above the value that you enter for the FactorCutoff parameter.
- CompactMix
- Enter a positive integer that represents the relative importance of the proximity score for matched terms that score at or above the value that you enter for the FactorCutoff parameter.
- MatchParm
- Enter an integer from 0 to 900 that represents the weight that the UNCERT match comparison uses to determine its tolerance of errors. Higher numbers mean that the comparison is less tolerant of differences in the strings. MatchParm is similar to the Param 1 parameter for the UNCERT comparison. Use these values as a rough guideline:
- 900. The two strings must be identical.
- 850. The two strings can be safely considered to be the same.
- 800. The two strings are probably the same.
- 750. The two strings are probably different.
- 700. The two strings are almost certainly different.
The assigned weight is proportioned linearly between the agreement and disagreement weights. For example, if you specify 700 and the score is 700 or less, then the full disagreement weight is assigned. If the strings agree exactly, the full agreement weight is assigned.
As another example, suppose you specify 850 for the MatchParm, which means that the tolerance is relatively low. A score of 800 would get the full disagreement weight because it is lower than the parameter that you specified. Even though a score of 800 means that the strings are probably the same, you require a low tolerance.
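The linear proportioning described above can be sketched as follows. The function name `proportioned_weight` and the sample weights are hypothetical. The sketch assumes that the weight rises linearly from the full disagreement weight at a score equal to MatchParm to the full agreement weight at a score of 900; the text above confirms only the two end points.

```python
def proportioned_weight(score, match_parm, agreement_weight, disagreement_weight):
    """Map a 0-900 similarity score to a weight by linear interpolation.

    Assumption: interpolation runs from match_parm (full disagreement
    weight) up to 900 (full agreement weight).
    """
    if not 0 <= match_parm <= 900:
        raise ValueError("match_parm must be between 0 and 900")
    if score >= 900:
        # The strings agree exactly.
        return agreement_weight
    if score <= match_parm:
        # At or below the cutoff, the full disagreement weight applies.
        return disagreement_weight
    fraction = (score - match_parm) / (900 - match_parm)
    return disagreement_weight + fraction * (agreement_weight - disagreement_weight)


# Hypothetical weights: agreement 10.0, disagreement -5.0.
print(proportioned_weight(800, 850, 10.0, -5.0))  # -5.0, below the 850 cutoff
print(proportioned_weight(875, 850, 10.0, -5.0))  # 2.5, halfway between the weights
```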
- MultType
- Select one of the following values to determine how the match normalizes the score for two sequences of terms when the sequences do not contain the same number of terms:
- 0 – Maximum number of words in the two sequences
- 1 – Minimum number of words in the two sequences
- 2 – Number of words in the first sequence
- 3 – Number of words in the second sequence
- 6 – Minimum number of words plus x, where x is the result of the ExtraTerm computation.
- ExtraTerm
- When the MultType value is 6, enter a non-negative integer for the percent of the difference between the greater and lesser of the two word counts to add to the minimum word count. An ExtraTerm value of 0 is equivalent to a MultType value of 1. An ExtraTerm value of 100 is equivalent to a MultType value of 0.
- MatchRange
- Enter a positive integer for the percent of the number of terms in the longer of the two sequences (percentage of the maximum word count). The resulting number of terms establishes a comparison radius that determines how different the position of two terms in their respective sequences can be and still be compared. For example, if the longer sequence contains 20 terms and you enter 50 for the MatchRange parameter, the match compares only the terms that are within 10 positions of each other.
- OutOfRangeScore
- Enter a positive integer for the percent of the default or rare value disagreement weight that is used to calculate a missing term weight. All terms in the shorter sequence must be scored against something. If all of the terms in the longer sequence that are within the range that is determined by the MatchRange parameter are paired with other terms, the value of the OutOfRangeScore parameter is used as the score for the unpaired terms.
- FactorCutoff
- Enter a positive integer for the percent of the default or rare value agreement weight that is used to set a cutoff point for matched terms that are scored for order and proximity. Setting a cutoff score excludes marginally positive and negative scores because those terms do not really match. For example, with a FactorCutoff of 33, term pairs that score below 33 percent of the agreement weight are not scored for order and proximity.
- OrderParm
- The value of this parameter determines the order score tolerance for errors. Enter a positive integer for the percent of the difference between the default agreement and disagreement weights that is used to penalize each out-of-order matched term. A lower number translates to more tolerance and a higher number translates to less tolerance.
- GapOpen
- Enter a positive integer for the percent of the default or rare value agreement weight that is used to determine the proximity score penalty for the occurrence of each gap between matched terms.
- GapExtend
- Enter a positive integer for the percent of the default or rare value agreement weight that is used to determine the proximity score penalty for each additional space in a gap.
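The term-count arithmetic behind the MultType, ExtraTerm, and MatchRange parameters can be sketched as follows. The function names are hypothetical. The parameter descriptions do not say how fractional results are rounded, so this sketch leaves them unrounded.

```python
def normalizing_term_count(mult_type, first_count, second_count, extra_term=0):
    """Return the term count used to normalize the score for a MultType value."""
    longer = max(first_count, second_count)
    shorter = min(first_count, second_count)
    if mult_type == 0:
        return longer
    if mult_type == 1:
        return shorter
    if mult_type == 2:
        return first_count
    if mult_type == 3:
        return second_count
    if mult_type == 6:
        # Minimum count plus ExtraTerm percent of the difference in counts.
        return shorter + (longer - shorter) * extra_term / 100
    raise ValueError(f"unsupported MultType value: {mult_type}")


def comparison_radius(first_count, second_count, match_range):
    """Return the MatchRange radius: a percent of the longer sequence's length."""
    return max(first_count, second_count) * match_range / 100


print(normalizing_term_count(6, 4, 8, extra_term=50))  # 6.0
print(comparison_radius(20, 12, match_range=50))       # 10.0
```

With sequences of 4 and 8 terms, an ExtraTerm value of 0 gives 4 (the same as MultType 1) and a value of 100 gives 8 (the same as MultType 0), which agrees with the equivalences stated for ExtraTerm.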
Example
The following examples illustrate how term order and term proximity are scored.
In the first example, the order score is higher for the first pair because all matched terms are in the same order.
First pair:
Apartment 4-B Building 5
Apartment 4-B Building 5
Second pair:
Building 5 Apartment 4-B
Apartment 4-B Building 5
In the next example, the proximity score is higher for the first pair of sequences because the second pair contains a term (Upstairs) that interrupts the sequence of matched terms.
First pair:
Building 5 Apartment 4-B
Apartment 4-B Building 5
Second pair:
Building 5 Apartment 4-B
Apartment 4-B Upstairs Building 5
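The following sketch counts the out-of-order terms and gaps in the example pairs. It is a simplified illustration, not the product's scoring algorithm: it assumes that terms are split on white space and match only when they are identical (the real comparison scores term similarity), and it counts out-of-order terms as those outside the longest run of matched terms that keeps the same relative order. All function names are hypothetical.

```python
from bisect import bisect_left


def matched_positions(data, reference):
    """Pair each data term with the first unused identical reference term."""
    ref_terms = reference.split()
    used = set()
    pairs = []  # (data index, reference index), in data order
    for i, term in enumerate(data.split()):
        for j, ref_term in enumerate(ref_terms):
            if j not in used and term == ref_term:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def out_of_order_count(pairs):
    """Count matched terms outside the longest in-order run of matches."""
    tails = []  # standard longest-increasing-subsequence bookkeeping
    for _, ref_index in pairs:
        k = bisect_left(tails, ref_index)
        if k == len(tails):
            tails.append(ref_index)
        else:
            tails[k] = ref_index
    return len(pairs) - len(tails)


def gap_counts(positions):
    """Return (number of gaps, unmatched terms inside gaps) for matched positions."""
    ordered = sorted(positions)
    gaps = [b - a - 1 for a, b in zip(ordered, ordered[1:]) if b - a > 1]
    return len(gaps), sum(gaps)


same = matched_positions("Apartment 4-B Building 5", "Apartment 4-B Building 5")
swapped = matched_positions("Building 5 Apartment 4-B", "Apartment 4-B Building 5")
interrupted = matched_positions("Building 5 Apartment 4-B",
                                "Apartment 4-B Upstairs Building 5")

print(out_of_order_count(same))                # 0
print(out_of_order_count(swapped))             # 2
print(gap_counts([j for _, j in swapped]))      # (0, 0)
print(gap_counts([j for _, j in interrupted]))  # (1, 1)
```

A lower out-of-order count corresponds to a higher order score, and fewer or shorter gaps correspond to a higher proximity score, which matches the ranking of the pairs in the examples.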