Is there a best practice for choosing the Maximum Frequency value for the Match Frequency stage (for example, "as many as possible" or "it is enough to select the 1000 values that cover 90% of all cases")? For example, I have 50000 unique values for first name, but the first 500 of them cover more than 80% of cases. Is it enough to specify 500 for Maximum Frequency, or should I increase this value?
Another question: what value does QualityStage choose for the u-probability of values that are not included in the frequency statistics? Does it use the u-prob value from the match command configuration, or does it calculate it as 1/<number of remaining values>?
I am also interested in criteria for skipping frequency statistics gathering (in other words, specifying the NOFREQ variable special handling for a column), for example "use NOFREQ when more than 25% of the values in the dataset are unique".
1 reply Latest Post - 2011-02-09T14:54:33Z by smithha
Pinned topic Match Frequency stage - choosing Maximum Frequency Entry
smithha (accepted answer)
Re: Match Frequency stage - choosing Maximum Frequency Entry
2011-02-09T14:54:33Z, in response to OlegT
Hi Oleg,
It really depends on the industry you are in, the type of data you are working with, and how much precision you need in the outcome.
For example, if you need a legally defensible outcome such as in jury selection or a true statistical result, then you'd want to use a complete frequency distribution across all values.
In most instances, a cutoff at a certain percent or number of records is sufficient, usually based on the volume of data being processed and the spread of the data frequencies. Working with 1M rows of data, where the u-prob of a single event is .000001 (and for discussion put m-prob at .99), the agreement weight on a single instance would be ~19.9. If the value occurs twice in 1 million (.000002), then the agreement weight drops to ~18.9. At 10 in a million, you are down to ~16.6. At 100 in a million, you are at ~13.3.
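The weights quoted above are consistent with the usual base-2 log-ratio agreement weight, log2(m-prob / u-prob). The sketch below reproduces them under that assumption; the function and variable names are made up for illustration and are not QualityStage settings or APIs.

```python
import math

def agreement_weight(m_prob, u_prob):
    """Agreement weight, assuming the log2(m/u) form."""
    return math.log2(m_prob / u_prob)

M_PROB = 0.99           # m-prob used for discussion in the post
TOTAL_ROWS = 1_000_000  # 1M rows

# u-prob is taken as (occurrences of the value) / (total rows)
weights = {
    count: round(agreement_weight(M_PROB, count / TOTAL_ROWS), 1)
    for count in (1, 2, 10, 100)
}

for count, weight in weights.items():
    print(f"{count:>3} in a million: weight ~{weight}")
```

Each doubling of a value's frequency costs exactly one point of weight, which is why the drop from 1 to 2 occurrences is the same size as the drop from 50 to 100.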
If your 1M rows have 50k distinct values, where 1000 cover 80% (800k) and 49k cover the remaining 200k, the average occurrence in the remainder is ~4 in a million (~17.9 weight). If you cut off at that point, then the remaining items will get the maximum agreement weight, bumping up the weight of some slightly less rare occurrences. The primary question is: is there harm in increasing the likelihood of matching? For patient data in healthcare, I would say yes. For customer financial accounts, there could be. For customer data for marketing, probably not. It really depends on the data you are working with.
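As a rough check of the cutoff arithmetic above, the following sketch computes the average occurrence of the values left outside the frequency table and compares its weight with the maximum weight. It assumes the log2(m/u) form of the agreement weight; all names are illustrative only.

```python
import math

TOTAL_ROWS = 1_000_000   # 1M rows
DISTINCT_VALUES = 50_000 # 50k distinct values
KEPT_VALUES = 1_000      # values kept in the frequency table
COVERED_ROWS = 800_000   # rows covered by the kept values (80%)
M_PROB = 0.99            # m-prob used for discussion in the post

remainder_rows = TOTAL_ROWS - COVERED_ROWS        # 200k rows
remainder_values = DISTINCT_VALUES - KEPT_VALUES  # 49k values

# Average number of occurrences of a value below the cutoff
avg_count = remainder_rows / remainder_values

# Weight such a value would get from its own average frequency
avg_weight = math.log2(M_PROB / (avg_count / TOTAL_ROWS))

# Maximum weight, i.e. a value treated as occurring once in 1M rows
max_weight = math.log2(M_PROB / (1 / TOTAL_ROWS))

print(f"average occurrence in remainder: ~{avg_count:.1f} in a million")
print(f"weight at average frequency:     ~{avg_weight:.1f}")
print(f"maximum weight after cutoff:     ~{max_weight:.1f}")
```

The gap between the two weights (about 2 points here) is the average amount by which the cutoff inflates agreement on values outside the table.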
As for the situations in which I would use the NOFREQ option:
1) when you want a specific agreement/disagreement weight regardless of frequency (e.g. if the business indicates that if a Taxid matches, then the record is considered a match; if it doesn't, then it's not).
2) when there are minimal distinct values and the expectation is an even distribution (i.e. slight skews don't matter).
3) when all data is unique (i.e. it doesn't buy you anything to apply frequency).