The QualityStage user guide (Chapter 11, Match Comparisons) clearly specifies that the CNT_DIFF comparison type is only for numeric data.
For example, see:
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.qs.ug.doc/topics/logarithm.html (CNT_DIFF is in table 2 "Match comparisons that apply to numbers").
However, I tried this comparison with alphabetic strings and it works without any problem, exactly as it does for numeric strings. For example, the pairs AAAAAAAAAAAAAAAAAAAA / AAAAAAAAAAAAAAAAAbcd (alpha) and 11111111111111111111 / 11111111111111111234 (numeric) produce the same weight of 1.6.
Is it safe to use CNT_DIFF for non-numeric strings? Is this perhaps a documentation bug?
Which algorithm does the CNT_DIFF comparison use (Edit Distance, Jaro-Winkler, ...)?
3 replies Latest Post - 2011-01-28T17:36:37Z by OlegT.
Pinned topic CNT_DIFF also works for non-numeric strings (documentation bug)?
Updated on 2011-01-28T17:36:37Z by OlegT.
smithha 110000PAKN23 PostsACCEPTED ANSWER
Re: CNT_DIFF also works for non-numeric strings (documentation bug)? 2011-01-28T17:30:10Z in response to OlegT.
Hi Oleg,
It's more of a documentation nuance: in "Compares two strings of numbers", 'strings' includes character data (char, varchar, etc.).
CNT_DIFF is designed to test for keystroke errors and it does evaluate strings, including strings of numeric or alpha values. It is most commonly used against numeric or date strings, as those types of strings are more likely to only have keystroke errors. If you have data such as license or part 'numbers' that contain mixed alphanumeric data, those are also good candidates to test using CNT_DIFF.
For most text data that has any freeform aspect to it (names, products, addresses, descriptions, etc.) I would not use CNT_DIFF.
The algorithms used in the Match Comparisons are proprietary to IBM.
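Since the actual algorithm is proprietary, the following is only a rough illustration of the general idea of counting keystroke differences, not IBM's implementation. It is a minimal Python sketch that assumes a simple position-by-position comparison; the function name `count_differences` is hypothetical. It shows why a pair of alpha strings and a pair of numeric strings shaped like the ones in the question could plausibly be treated identically: a character-level count does not care whether the characters are digits or letters.

```python
from itertools import zip_longest


def count_differences(a: str, b: str) -> int:
    """Count the positions at which two strings differ.

    Hypothetical helper for illustration only; this is NOT the
    QualityStage CNT_DIFF algorithm, which is proprietary.
    A position present in one string but missing in the other
    also counts as a difference.
    """
    return sum(1 for x, y in zip_longest(a, b) if x != y)


# Pairs shaped like the ones in the question:
# 20 characters each, with the last three characters changed.
alpha = count_differences("A" * 20, "A" * 17 + "bcd")
numeric = count_differences("1" * 20, "1" * 17 + "234")

print(alpha, numeric)  # prints "3 3"
```

Both pairs give the same count under this assumption, which is at least consistent with the identical weights reported in the question. How QualityStage turns a difference count into a weight is not covered here.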