Hi,
Is there a best practice for choosing the Maximum Frequency value for the Match Frequency stage (for example, "as large as possible" or "it is enough to select the 1000 values that cover 90% of all cases")? For example, I have 50000 unique values for first name, but the first 500 of them cover more than 80% of cases. Is it enough to specify 500 for Maximum Frequency, or should I increase this value?
Another question: what value does QualityStage choose for the u-probability of values that are not included in the frequency statistics? Does it use the uprob value from the match command configuration, or does it calculate it as 1/<number of remaining values>?
I am also interested in a criterion for skipping frequency statistics gathering (in other words, for specifying the NOFREQ variable special handling for a column), for example "use NOFREQ when more than 25% of the values in the dataset are unique".
Regards,
Oleg
This topic has been locked.
1 reply
Latest Post
2011-02-09T14:54:33Z by smithha
ACCEPTED ANSWER
Pinned topic Match Frequency stage - choosing Maximum Frequency Entry
2011-02-09T12:59:02Z

Updated on 2011-02-09T14:54:33Z by smithha

ACCEPTED ANSWER
Re: Match Frequency stage - choosing Maximum Frequency Entry
2011-02-09T14:54:33Z in response to OlegT.
Hi Oleg,
It really depends on the industry you are in, the type of data you are working with, and how precise the outcome needs to be.
For example, if you need a legally defensible outcome (such as in jury selection) or a true statistical result, then you'd want to use a complete frequency distribution across all values.
In most instances, a cutoff at a certain percentage or number of records is sufficient, usually based on the volume of data being processed and the spread of the data frequencies. Working with 1M rows of data, where the uprob of a value that occurs once is 0.000001 (and, for discussion, putting mprob at 0.99), the agreement weight for that value would be ~19.9. If the value occurs twice in 1 million (0.000002), then the agreement weight drops to ~18.9. At 10 in a million, you are down to ~16.6. At 100 in a million, you are at ~13.3.
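The weights quoted above are consistent with the usual Fellegi-Sunter agreement weight, log2(mprob / uprob). Here is a minimal Python sketch that reproduces them, assuming that formula and assuming the uprob of a value is simply its relative frequency in the data. The function name is made up for illustration and is not a QualityStage API.

```python
import math

def agreement_weight(mprob, uprob):
    """Fellegi-Sunter agreement weight: log2(m / u). Illustrative helper only."""
    return math.log2(mprob / uprob)

TOTAL_ROWS = 1_000_000
MPROB = 0.99  # value assumed "for discussion" in the post

for occurrences in (1, 2, 10, 100):
    # Assumption: uprob is the value's relative frequency in the data.
    uprob = occurrences / TOTAL_ROWS
    weight = agreement_weight(MPROB, uprob)
    print(f"{occurrences:>3} in a million -> weight {weight:.1f}")
```

Each doubling of a value's frequency costs one point of agreement weight, which is why the weight falls slowly as values become more common.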
If your 1M rows have 50k distinct values, where 1000 cover 80% (800k rows) and the other 49k cover the remaining 200k rows, the average occurrence in the remainder is ~4 in a million (~17.9 weight). If you cut off at that point, then the remaining items will get the maximum agreement weight, bumping up some slightly less rare occurrences. The primary question is: is there harm in increasing the likelihood of matching? For patient data in healthcare, I would say yes. For customer financial accounts, there could be. For customer data for marketing, probably not. It really depends on the data you are working with.
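To gauge how much a cutoff inflates weights in the tail, the same arithmetic can be applied to the example above. This is only a sketch under stated assumptions: the weight is log2(mprob / uprob), and "maximum agreement weight" is taken to mean the weight of a value that occurs once in the data, which the post does not spell out.

```python
import math

TOTAL_ROWS = 1_000_000
DISTINCT_VALUES = 50_000
TOP_VALUES = 1_000      # values kept in the frequency table
TOP_ROWS = 800_000      # rows covered by those values (80%)
MPROB = 0.99            # value assumed "for discussion" in the post

tail_values = DISTINCT_VALUES - TOP_VALUES   # 49,000 values below the cutoff
tail_rows = TOTAL_ROWS - TOP_ROWS            # 200,000 rows below the cutoff

# Average number of occurrences per value in the tail (about 4).
avg_tail_occurrences = tail_rows / tail_values

# Weight a typical tail value would get from a full frequency table.
tail_weight = math.log2(MPROB / (avg_tail_occurrences / TOTAL_ROWS))

# Assumption: values outside the table are weighted like a single occurrence.
max_weight = math.log2(MPROB / (1 / TOTAL_ROWS))

print(f"average tail occurrences: {avg_tail_occurrences:.2f}")
print(f"typical tail weight:      {tail_weight:.1f}")
print(f"weight after cutoff:      {max_weight:.1f}")
print(f"average boost:            {max_weight - tail_weight:.1f}")
```

Under these assumptions the cutoff adds roughly two points of agreement weight to a typical tail value. Whether that is acceptable is the business question raised above.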
As for the situations in which I would use the NOFREQ option:
1) when you want a specific agreement/disagreement weight regardless of frequency (e.g. if the business indicates that if a Taxid matches, then the record is considered a match; if it doesn't, then it's not).
2) when there are few distinct values and the expectation is an even distribution (i.e. slight skews don't matter)
3) when all data is unique (i.e. it doesn't buy you anything to apply frequency)
Harald