UNCERT comparison
Evaluates the similarity of two character strings by using an algorithm that is based on information theory principles.
The weight assigned is based on the difference between the two strings being compared as a function of the string length, the number of transpositions, and the number of unassigned insertions, deletions, or replacements of characters. String length is an important consideration because longer words can tolerate more errors than shorter words can. In other words, a longer word with a given number of errors is easier to recognize than a shorter word with the same number of errors.
Required Columns
The following data source and reference source columns are required:
- Data. The character string from the data source.
- Reference. The character string from the reference source (only applies for a two-source match).
You can use this comparison with vectors and reverse matching. If you want to create vectors to use in the Match Designer, see Make Vector stage in DataStage.
Required Parameter
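The Param 1 parameter is required. It specifies the threshold score for the comparison. The following values are reference points for interpreting scores: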
- 900. The two strings are identical.
- 850. The two strings can be safely considered to be the same.
- 800. The two strings are probably the same.
- 750. The two strings are probably different.
- 700. The two strings are almost certainly different.
A higher value for the Param 1 parameter causes the match to tolerate fewer differences than it would with a lower value for the Param 1 parameter.
Example
The assigned weight is prorated linearly between the agreement and disagreement weights. For example, if you specify 700 for Param 1 and the score is 700 or less, the full disagreement weight is assigned. If the strings agree exactly, the full agreement weight is assigned.
As another example, suppose you specify 850 for Param 1, which means that the tolerance for differences is relatively low. A score of 800 gets the full disagreement weight because it is lower than the threshold that you specified. Even though a score of 800 means that the strings are probably the same, the threshold of 850 requires a closer match.
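The weight assignment described in these examples can be sketched in Python. This is a minimal illustration, not the product's implementation: the function name uncert_weight and its arguments are hypothetical, and the exact interpolation formula (a straight line from the threshold up to 900) is an assumption based on the statement that the weight is distributed linearly between the two weights. The sketch takes the similarity score as an input and does not compute it.

```python
MAX_SCORE = 900  # score at which the two strings are identical


def uncert_weight(score, threshold, agreement_weight, disagreement_weight):
    """Return a weight for a similarity score (hypothetical helper).

    Assumption: scores between the threshold (Param 1) and 900 are
    mapped linearly between the disagreement and agreement weights.
    """
    if score >= MAX_SCORE:
        # Exact agreement: full agreement weight.
        return agreement_weight
    if score <= threshold:
        # At or below the threshold: full disagreement weight.
        return disagreement_weight
    # Fraction of the way from the threshold to an exact match.
    fraction = (score - threshold) / (MAX_SCORE - threshold)
    return disagreement_weight + fraction * (agreement_weight - disagreement_weight)
```

With a threshold of 850, a score of 800 yields the full disagreement weight, which matches the second example. With a threshold of 700, the same score of 800 would fall halfway between the two weights under the linear assumption.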