I try to understand how is weight calcualted (for each column) during matching.
For example we have small file with 10 records, column CODE contains four values (in parentheses number of rows with this value and calculated disagreement weight):
X10 (5, -2.32)
Y20 (2, -2.99)
Z20 (2, -2.99)
X20 (1, -3.16)
Now when I run test for simple matching specification (in Match Command for CODE column I use CHAR comparison type) I get such matching results (disagreement weights for two rows with specified values in CODE column):
X10&Y20 = -2.9
X10&X20 = -3.16
X10&Z20 = -2.99
X20&Z20 = -2.99
X20&Y20 = -3.16
1. What disagreement weight is choosed for CHAR comparison type when there are two different disagreement weights (for each column value, for example for case X10&Z20 with disagr.weight -2.32/-2.99)?
2. How to explain different disagreement weight for cases X10&Y20 and X10&Z20 (Y20 and Z20 value have the same frequency!!!)?
3. Is there any document/white paper/redbook which describe this matching process in more details (I looked at QualityStage user guide and redbook "IBM WebSphere QualityStage Methodologies, Standardization, and Matching" - but these documents doesn't answer to my questions)?
This topic has been locked.
6 replies Latest Post - 2011-02-02T14:42:30Z by OlegT.
Pinned topic how match weight is calculated?
Answered question This question has been answered.
Unanswered question This question has not been answered yet.
Updated on 2011-02-02T14:42:30Z at 2011-02-02T14:42:30Z by OlegT.
Re: how match weight is calculated?2011-01-26T17:44:02Z in response to OlegT.You can find some discussion on calculating weights and penalties here: link: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/topic/com.ibm.swg.im.iis.qs.ug.doc/topics/logarithm.html
Also some reference starting on page 89 of the QualityStage Redbook.
However, here is a short summary on match weights and their calculation:
A match process includes some number of fields on a record that will be used in calculating the match weight/score.
- Each field agreeing between records is going to add an agreement weight to the probability that the records are a match
- Each field disagreeing is going to contribute a disagreement weight
- Missing values get a weight of zero.
- If a field is very reliable, it is more useful for classification than a field with many errors or missing values
Weights take into account two properties of statistical variables:
- Reliability of the field (measured by the m-Probability)
- Probability of a random agreement of values (measured by the u-Probability)
Agreement weight is log2(m/u)
Disagreement weight is log2((1-m)/(1-u))
m is the m-probability, u is the u-probability; logs are base 2
The agreement weight is applied to the field if it matches in the record pair being examined, else the disagreement weight is applied (i.e. a penalty).
At a record level, the weights are composites:
- If a field matches, the agreement weight is used - This is a positive number
- If a field doesn’t match, the disagreement weight is used - This is a negative number
- Weights for all field comparisons are summed to form a composite weight
- If most fields match, the result will be a large positive value
- If most fields do not match, the result will be a large negative value
The m-probability is defined as the probability of the variable agreeing in a matched pair
It is equivalent to 1 - error rate for the field
- For example, if the variable street type has a 10% error rate, then the m-probability is 0.90
The higher the m-probability, the higher the disagreement weight will be for the field not matching since errors are relatively rare events.
m-probability is an estimate
0.9 is generally good for most fields
Use 0.95 or greater if the field is fairly important as well as reliable
- note that this is what you enter for a particular field for a given match comparison http://e.g. you choose a CHAR comparison on a Country Code field, there is an m-probability value that defaults to 0.9, but can be changed
The second property of a variable measured by weight is the probability that a variable agrees in an unmatched record (the u-probability). Consequently, the u-probability can be approximated as the probability that a field agrees at random (by chance).
u-Probability for each value is calculated as:
- of occurrences of the value / total number of occurrences for all values
Relatively rare events are more useful for record classification than common events
A match frequency stage can be used to calculate the u-probability automatically for a given field
The smaller the u-probability, the higher the agreement weight will be for the field as the likelihood of a match is rare, so when a match occurs it is likely good.
Some example calculations
Field m-prob u-prob weight penalty
last name 0.9 0.01 6.491853096 -3.307428525
first name 0.9 0.001 9.813781191 -3.320484678
taxid 0.999 0.0001 13.28626896 -9.965640008
date of birth 0.99 0.001 9.951284715 -6.642412773
gender 0.9 0.5 0.847996907 -2.321928095
If all of these were CHAR comparisons, they would either match (and get the full weight) or not match (and get the full penalty). Other comparisons will return various prorated scores between the maximum weight and the maximum penalty.
Changes to u-prob impact weight; changes to m-prob impact penalty
On your questions:
1. What disagreement weight is chosen for CHAR comparison type when there are two different disagreement weights (for each column value, for example for case X10&Z20 with disagr.weight -2.32/-2.99)?
Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement and it will be calculated based on the m-prob entered in the match comparison and the u-prob will be based either on a value from a Match Frequency stage output, or what you enter in the match comparison.
2. How to explain different disagreement weight for cases X10&Y20 and X10&Z20 (Y20 and Z20 value have the same frequency!!!)?
It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.
Re: how match weight is calculated?2011-01-27T06:55:44Z in response to smithhaHi Harald,
> Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement ...
As I used frequency table (generated with Match Frequency stage) I have two disagreement scores - for X10 and Z20 values ( -2.32 and -2.99). As you can see in my example Qualitystage choosed value -2.99 and I guessed that for disagreement is used minimum value from two weights. But unfortunately my guess is not rights because in other case - X20&Z20 (weights respectively -3.16 and -2.99) - Qualitystage choosed maximum value -2.99.
> It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.
Example is very simple, input data (columns BlockId, Descr, mcode):
Descr and mcode used in matching commands (with UNCERT and CHAR comparison types respectivelly).
I also attached screen snapshot of Match Designer (file simple10.PNG)
And frequency table for mcode column:
VALUE FREQUENCY MPROB UPROB AGRWGT DISWGT MISWGT
ALL-VALUES-IN-TABLE 0 0.899999976 0.01 1.84000003 -2.9000001 0
X10 5 0.899999976 0.5 0.839999974 -2.31999993 0
X20 1 0.899999976 0.100000001 3.16000009 -3.16000009 0
Y20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
Z20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
Re: how match weight is calculated?2011-01-28T22:17:46Z in response to OlegT.Hi Oleg,
Ok, I can see the pieces you are looking at with that detail.
What you are seeing is primarily a case of sequencing based on the input data set.
QualityStage will look at the data here record-by-record.
First record is X10, nothing else to match to, so it forms the first match group. However, that record's frequency information will not be the basis of the subsequent scoring.
Next record is Y20. It is compared to X10. You are using a CHAR comparison, so values must be exact to match. They aren't, so disagreement weight is based on the frequency of Y20 occurring: -2.99. Y20 will also become the second match group available.
Next record is X10. That is a match and you will get agreement weight based on X10 occurring: 0.84.
When you get to X20, it will compare to X10 and get a disagreement based on X20 occurring: -3.16.
When you get to Z20, it will compare first to X10 and get a disagreement based on Z20 occurring: -2.99.
It will subsequently compare to Y20, and then X20, in both cases getting same disagreement: -2.99.
If you change the sequence of the file so that Z20 comes first, and then X10, when X10 goes to match with Z20, that would get the disagreement of: -2.32.
So, think of the agreement or disagreement weight applied as the rarity of the value being compared to the group (not as the rarity of the group base or as the higher/lower of the possible weights).
Hope that helps.
Re: how match weight is calculated?2011-02-01T13:56:18Z in response to smithhaHi Harald,
thanks for you help and patience.
I am still confused with disagreement weight choosing during matching.
I tried to use input dataset with different sort order on MCODE coumn and result was the same (there was no any difference compared to my result which I describe in my previous post).
BlockId Descr mcode 1 AAAAAAAAAAAAAAAAAABB Z20 1 AAAAAAAAAAAAAAAAAACC Z20 1 AAAAAAAAAABBBBBBBBBB Z20 1 AAAAAAAAAABBBBBBBBBB Z20 1 AAAAAAAAAAAAAAAAAAAA X10 1 AAAAAAAAAAAAAAAAAAAB X10 1 AAAAAAAAAAAAAAAAAAAC X10 1 AAAAAAAAAAAAAAAAAAAF X10 1 AAAAAAAAAAAAAAAAAAAE X10 1 AAAAAAAAAAAAAAAAAAAF X10 1 AAAAAAAAAAAAAAAAAAAG X10 1 AAAAAAAAAABBBBBBBBBB X10 1 AAAAAAAAAABBBBBBBBBB X20 1 AAAAAAAAAAAAAAAAAACC X20 1 AAAAAAAAAAAAAAAAAAEE Y20 1 AAAAAAAAAAAAAAAAAAFF Y20 1 AAAAAAAAAABBBBBBBBBB Y20 1 AAAAAAAAAAAAAAAAAAGG Y20
VALUE FREQUENCY MPROB UPROB AGRWGT DISWGT X10 8 0.90 0.44 1.01 -2.47 *ALL-VALUES-IN-TABLE* 0 0.90 0.01 1.84 -2.90 Y20 4 0.90 0.22 2.01 -2.95 Z20 4 0.90 0.22 2.01 -2.95 X20 2 0.90 0.11 3.01 -3.15
QSMATCHSETID QSMATCHTYPE QSMATCHDATAID BlockId Descr mcode QSMATCHWEIGHT MCODE_WEIGHT 3 XA 3 1 AAAAAAAAAABBBBBBBBBB Z20 8.50 3 DA 13 1 AAAAAAAAAABBBBBBBBBB X20 3.54 -2.95 3 DA 17 1 AAAAAAAAAABBBBBBBBBB Y20 3.54 -2.95 3 DA 12 1 AAAAAAAAAABBBBBBBBBB X10 4.02 -2.47 3 DA 4 1 AAAAAAAAAABBBBBBBBBB Z20 8.50 2.01 8 XA 8 1 AAAAAAAAAAAAAAAAAAAF X10 7.50 8 DA 14 1 AAAAAAAAAAAAAAAAAACC X20 0.76 -2.47 8 DA 15 1 AAAAAAAAAAAAAAAAAAEE Y20 0.76 -2.47 8 DA 18 1 AAAAAAAAAAAAAAAAAAGG Y20 0.76 -2.47 8 DA 1 1 AAAAAAAAAAAAAAAAAABB Z20 0.76 -2.47 8 DA 2 1 AAAAAAAAAAAAAAAAAACC Z20 0.76 -2.47 8 DA 16 1 AAAAAAAAAAAAAAAAAAFF Y20 2.39 -2.47 8 DA 5 1 AAAAAAAAAAAAAAAAAAAA X10 5.87 1.01 8 DA 6 1 AAAAAAAAAAAAAAAAAAAB X10 5.87 1.01 8 DA 7 1 AAAAAAAAAAAAAAAAAAAC X10 5.87 1.01 8 DA 9 1 AAAAAAAAAAAAAAAAAAAE X10 5.87 1.01 8 DA 11 1 AAAAAAAAAAAAAAAAAAAG X10 5.87 1.01 8 DA 10 1 AAAAAAAAAAAAAAAAAAAF X10 7.50 1.01
For me strange things are:
- for first matching group (SetId=3, master record is with mcode="Z20") disagreement weight usually is from matched record (for record with mcode=Y20 weight is for Y20 value, for record mcode=X10 weights is for X10 value); But why for record mcode=X20 is choosed weight -2.95 (X20 has disagreement weight -3.15 )
- for second matching group (SetId=8, master record is with mcode="X10") for all mathed rows with different value in mcode Qualitystage choosed disagreement weight -2.47 (it is value for X10)
- why for the first matching group (SetId=3) Qualitystage didn't choose master record with mcode=X20 (in this group all values for Descr column is the same, but "X20" value has highest agreement weight)
Re: how match weight is calculated?2011-02-01T20:19:07Z in response to OlegT.Hi Oleg,
My earlier response indicating selection based on the record order was incorrect. I did confirm that the selection between the pair will be based on whichever value is most frequently occurring.
What you see in this selection is correct:
- for instance, between Z20 and X20, Z20 occurs more frequently, therefore gets the disagreement weight of Z20.
- in the group with X10 as the master, X10 always occurs most frequently, therefore all pairings get the disagreement weight of X10.
Now I'm not sure why in the original post you may have seen a difference. Were you reviewing output in the Match Designer or on an output file?
Re: how match weight is calculated?2011-02-02T14:42:30Z in response to smithhaHarald,
I got these weights from Match Designer. Now I am cannot reproduce this problem,
maybe there are additional factors which produce this strange results (for example I changed input file format - record delimiter).
Now I see that during mismatch Qualitystage choose weight for most frequent value (or maximum disagreement weight from two values).
What about my third question from my last post? As I know master record is a record with a highest composite weight calculated against this record itself. And according this definition it is a record with mcode="X20" (agreement weights for Descr column is 6.49 because I set NOFREQ spec.var.processing, and weight for X20 value is 3.16 so composite weight is 9.65 > 8.5 - composite weight for row with mcode="Y20"). Am I right?
Is there a way to output from Macth Stage for each row deatlisation of composite weight (weight for each attribute)?