Topic
• 16 replies
• Latest Post - ‏2017-10-17T01:00:26Z by yichen.kp
OlegT.
14 Posts

# Pinned topic how match weight is calculated?

‏2011-01-26T16:48:20Z |
Hi,
I try to understand how is weight calcualted (for each column) during matching.
For example we have small file with 10 records, column CODE contains four values (in parentheses number of rows with this value and calculated disagreement weight):
X10 (5, -2.32)
Y20 (2, -2.99)
Z20 (2, -2.99)
X20 (1, -3.16)

Now when I run test for simple matching specification (in Match Command for CODE column I use CHAR comparison type) I get such matching results (disagreement weights for two rows with specified values in CODE column):
X10&Y20 = -2.9
X10&X20 = -3.16
X10&Z20 = -2.99
X20&Z20 = -2.99
X20&Y20 = -3.16

Questins:
1. What disagreement weight is choosed for CHAR comparison type when there are two different disagreement weights (for each column value, for example for case X10&Z20 with disagr.weight -2.32/-2.99)?
2. How to explain different disagreement weight for cases X10&Y20 and X10&Z20 (Y20 and Z20 value have the same frequency!!!)?
3. Is there any document/white paper/redbook which describe this matching process in more details (I looked at QualityStage user guide and redbook "IBM WebSphere QualityStage Methodologies, Standardization, and Matching" - but these documents doesn't answer to my questions)?

Thanks,
Oleg
Updated on 2011-02-02T14:42:30Z at 2011-02-02T14:42:30Z by OlegT.
• smithha
23 Posts

#### Re: how match weight is calculated?

‏2011-01-26T17:44:02Z
You can find some discussion on calculating weights and penalties here: link: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/topic/com.ibm.swg.im.iis.qs.ug.doc/topics/logarithm.html

Also some reference starting on page 89 of the QualityStage Redbook.

However, here is a short summary on match weights and their calculation:

A match process includes some number of fields on a record that will be used in calculating the match weight/score.
• Each field agreeing between records is going to add an agreement weight to the probability that the records are a match
• Each field disagreeing is going to contribute a disagreement weight
• Missing values get a weight of zero.
• If a field is very reliable, it is more useful for classification than a field with many errors or missing values

Weights take into account two properties of statistical variables:
• Reliability of the field (measured by the m-Probability)
• Probability of a random agreement of values (measured by the u-Probability)

Agreement weight is log2(m/u)
Disagreement weight is log2((1-m)/(1-u))

m is the m-probability, u is the u-probability; logs are base 2

The agreement weight is applied to the field if it matches in the record pair being examined, else the disagreement weight is applied (i.e. a penalty).

At a record level, the weights are composites:
• If a field matches, the agreement weight is used - This is a positive number
• If a field doesn’t match, the disagreement weight is used - This is a negative number
• Weights for all field comparisons are summed to form a composite weight
• If most fields match, the result will be a large positive value
• If most fields do not match, the result will be a large negative value

m-probability
The m-probability is defined as the probability of the variable agreeing in a matched pair
It is equivalent to 1 - error rate for the field
• For example, if the variable street type has a 10% error rate, then the m-probability is 0.90

The higher the m-probability, the higher the disagreement weight will be for the field not matching since errors are relatively rare events.

m-probability is an estimate
0.9 is generally good for most fields
Use 0.95 or greater if the field is fairly important as well as reliable

u-probability
The second property of a variable measured by weight is the probability that a variable agrees in an unmatched record (the u-probability). Consequently, the u-probability can be approximated as the probability that a field agrees at random (by chance).

u-Probability for each value is calculated as:
1. of occurrences of the value / total number of occurrences for all values
Each value of a field has a u-probability
Relatively rare events are more useful for record classification than common events
A match frequency stage can be used to calculate the u-probability automatically for a given field

The smaller the u-probability, the higher the agreement weight will be for the field as the likelihood of a match is rare, so when a match occurs it is likely good.
Some example calculations

Field m-prob u-prob weight penalty
last name 0.9 0.01 6.491853096 -3.307428525
first name 0.9 0.001 9.813781191 -3.320484678
taxid 0.999 0.0001 13.28626896 -9.965640008
date of birth 0.99 0.001 9.951284715 -6.642412773
gender 0.9 0.5 0.847996907 -2.321928095

If all of these were CHAR comparisons, they would either match (and get the full weight) or not match (and get the full penalty). Other comparisons will return various prorated scores between the maximum weight and the maximum penalty.

Changes to u-prob impact weight; changes to m-prob impact penalty

1. What disagreement weight is chosen for CHAR comparison type when there are two different disagreement weights (for each column value, for example for case X10&Z20 with disagr.weight -2.32/-2.99)?

Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement and it will be calculated based on the m-prob entered in the match comparison and the u-prob will be based either on a value from a Match Frequency stage output, or what you enter in the match comparison.

2. How to explain different disagreement weight for cases X10&Y20 and X10&Z20 (Y20 and Z20 value have the same frequency!!!)?
It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.

Harald
• OlegT.
14 Posts

#### Re: how match weight is calculated?

‏2011-01-27T06:55:44Z
• smithha
• ‏2011-01-26T17:44:02Z
You can find some discussion on calculating weights and penalties here: link: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/topic/com.ibm.swg.im.iis.qs.ug.doc/topics/logarithm.html

Also some reference starting on page 89 of the QualityStage Redbook.

However, here is a short summary on match weights and their calculation:

A match process includes some number of fields on a record that will be used in calculating the match weight/score.
• Each field agreeing between records is going to add an agreement weight to the probability that the records are a match
• Each field disagreeing is going to contribute a disagreement weight
• Missing values get a weight of zero.
• If a field is very reliable, it is more useful for classification than a field with many errors or missing values

Weights take into account two properties of statistical variables:
• Reliability of the field (measured by the m-Probability)
• Probability of a random agreement of values (measured by the u-Probability)

Agreement weight is log2(m/u)
Disagreement weight is log2((1-m)/(1-u))

m is the m-probability, u is the u-probability; logs are base 2

The agreement weight is applied to the field if it matches in the record pair being examined, else the disagreement weight is applied (i.e. a penalty).

At a record level, the weights are composites:
• If a field matches, the agreement weight is used - This is a positive number
• If a field doesn’t match, the disagreement weight is used - This is a negative number
• Weights for all field comparisons are summed to form a composite weight
• If most fields match, the result will be a large positive value
• If most fields do not match, the result will be a large negative value

m-probability
The m-probability is defined as the probability of the variable agreeing in a matched pair
It is equivalent to 1 - error rate for the field
• For example, if the variable street type has a 10% error rate, then the m-probability is 0.90

The higher the m-probability, the higher the disagreement weight will be for the field not matching since errors are relatively rare events.

m-probability is an estimate
0.9 is generally good for most fields
Use 0.95 or greater if the field is fairly important as well as reliable

u-probability
The second property of a variable measured by weight is the probability that a variable agrees in an unmatched record (the u-probability). Consequently, the u-probability can be approximated as the probability that a field agrees at random (by chance).

u-Probability for each value is calculated as:
1. of occurrences of the value / total number of occurrences for all values
Each value of a field has a u-probability
Relatively rare events are more useful for record classification than common events
A match frequency stage can be used to calculate the u-probability automatically for a given field

The smaller the u-probability, the higher the agreement weight will be for the field as the likelihood of a match is rare, so when a match occurs it is likely good.
Some example calculations

Field m-prob u-prob weight penalty
last name 0.9 0.01 6.491853096 -3.307428525
first name 0.9 0.001 9.813781191 -3.320484678
taxid 0.999 0.0001 13.28626896 -9.965640008
date of birth 0.99 0.001 9.951284715 -6.642412773
gender 0.9 0.5 0.847996907 -2.321928095

If all of these were CHAR comparisons, they would either match (and get the full weight) or not match (and get the full penalty). Other comparisons will return various prorated scores between the maximum weight and the maximum penalty.

Changes to u-prob impact weight; changes to m-prob impact penalty

1. What disagreement weight is chosen for CHAR comparison type when there are two different disagreement weights (for each column value, for example for case X10&Z20 with disagr.weight -2.32/-2.99)?

Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement and it will be calculated based on the m-prob entered in the match comparison and the u-prob will be based either on a value from a Match Frequency stage output, or what you enter in the match comparison.

2. How to explain different disagreement weight for cases X10&Y20 and X10&Z20 (Y20 and Z20 value have the same frequency!!!)?
It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.

Harald
Hi Harald,
> Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement ...
As I used frequency table (generated with Match Frequency stage) I have two disagreement scores - for X10 and Z20 values ( -2.32 and -2.99). As you can see in my example Qualitystage choosed value -2.99 and I guessed that for disagreement is used minimum value from two weights. But unfortunately my guess is not rights because in other case - X20&Z20 (weights respectively -3.16 and -2.99) - Qualitystage choosed maximum value -2.99.

> It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.
Example is very simple, input data (columns BlockId, Descr, mcode):
BlockId,Descr,mcode
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Y20
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Y20
1,AAAAAAAAAAAAAAAAAAAA,X20
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Z20
1,AAAAAAAAAAAAAAAAAAAA,Z20

Descr and mcode used in matching commands (with UNCERT and CHAR comparison types respectivelly).
I also attached screen snapshot of Match Designer (file simple10.PNG)

And frequency table for mcode column:
pre
VALUE FREQUENCY MPROB UPROB AGRWGT DISWGT MISWGT
ALL-VALUES-IN-TABLE 0 0.899999976 0.01 1.84000003 -2.9000001 0
X10 5 0.899999976 0.5 0.839999974 -2.31999993 0
X20 1 0.899999976 0.100000001 3.16000009 -3.16000009 0
Y20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
Z20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
[/pre]

Regards,
Oleg

• smithha
23 Posts

#### Re: how match weight is calculated?

‏2011-01-28T22:17:46Z
• OlegT.
• ‏2011-01-27T06:55:44Z
Hi Harald,
> Each field comparison has one score -- in your example X10 and Z20 are not a match, so they will get a disagreement ...
As I used frequency table (generated with Match Frequency stage) I have two disagreement scores - for X10 and Z20 values ( -2.32 and -2.99). As you can see in my example Qualitystage choosed value -2.99 and I guessed that for disagreement is used minimum value from two weights. But unfortunately my guess is not rights because in other case - X20&Z20 (weights respectively -3.16 and -2.99) - Qualitystage choosed maximum value -2.99.

> It would be necessary to look at the details of your match specification, and any match frequency tables created for use with that.
Example is very simple, input data (columns BlockId, Descr, mcode):
BlockId,Descr,mcode
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Y20
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Y20
1,AAAAAAAAAAAAAAAAAAAA,X20
1,AAAAAAAAAAAAAAAAAAAA,X10
1,AAAAAAAAAAAAAAAAAAAA,Z20
1,AAAAAAAAAAAAAAAAAAAA,Z20

Descr and mcode used in matching commands (with UNCERT and CHAR comparison types respectivelly).
I also attached screen snapshot of Match Designer (file simple10.PNG)

And frequency table for mcode column:
pre
VALUE FREQUENCY MPROB UPROB AGRWGT DISWGT MISWGT
ALL-VALUES-IN-TABLE 0 0.899999976 0.01 1.84000003 -2.9000001 0
X10 5 0.899999976 0.5 0.839999974 -2.31999993 0
X20 1 0.899999976 0.100000001 3.16000009 -3.16000009 0
Y20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
Z20 2 0.899999976 0.200000003 2.16000009 -2.99000001 0
[/pre]

Regards,
Oleg
Hi Oleg,

Ok, I can see the pieces you are looking at with that detail.
What you are seeing is primarily a case of sequencing based on the input data set.

QualityStage will look at the data here record-by-record.
First record is X10, nothing else to match to, so it forms the first match group. However, that record's frequency information will not be the basis of the subsequent scoring.

Next record is Y20. It is compared to X10. You are using a CHAR comparison, so values must be exact to match. They aren't, so disagreement weight is based on the frequency of Y20 occurring: -2.99. Y20 will also become the second match group available.

Next record is X10. That is a match and you will get agreement weight based on X10 occurring: 0.84.

When you get to X20, it will compare to X10 and get a disagreement based on X20 occurring: -3.16.

When you get to Z20, it will compare first to X10 and get a disagreement based on Z20 occurring: -2.99.
It will subsequently compare to Y20, and then X20, in both cases getting same disagreement: -2.99.

If you change the sequence of the file so that Z20 comes first, and then X10, when X10 goes to match with Z20, that would get the disagreement of: -2.32.

So, think of the agreement or disagreement weight applied as the rarity of the value being compared to the group (not as the rarity of the group base or as the higher/lower of the possible weights).

Hope that helps.

Harald
• OlegT.
14 Posts

#### Re: how match weight is calculated?

‏2011-02-01T13:56:18Z
• smithha
• ‏2011-01-28T22:17:46Z
Hi Oleg,

Ok, I can see the pieces you are looking at with that detail.
What you are seeing is primarily a case of sequencing based on the input data set.

QualityStage will look at the data here record-by-record.
First record is X10, nothing else to match to, so it forms the first match group. However, that record's frequency information will not be the basis of the subsequent scoring.

Next record is Y20. It is compared to X10. You are using a CHAR comparison, so values must be exact to match. They aren't, so disagreement weight is based on the frequency of Y20 occurring: -2.99. Y20 will also become the second match group available.

Next record is X10. That is a match and you will get agreement weight based on X10 occurring: 0.84.

When you get to X20, it will compare to X10 and get a disagreement based on X20 occurring: -3.16.

When you get to Z20, it will compare first to X10 and get a disagreement based on Z20 occurring: -2.99.
It will subsequently compare to Y20, and then X20, in both cases getting same disagreement: -2.99.

If you change the sequence of the file so that Z20 comes first, and then X10, when X10 goes to match with Z20, that would get the disagreement of: -2.32.

So, think of the agreement or disagreement weight applied as the rarity of the value being compared to the group (not as the rarity of the group base or as the higher/lower of the possible weights).

Hope that helps.

Harald
Hi Harald,
thanks for you help and patience.
I am still confused with disagreement weight choosing during matching.
I tried to use input dataset with different sort order on MCODE coumn and result was the same (there was no any difference compared to my result which I describe in my previous post).

Input data:
``````
BlockId Descr                mcode 1       AAAAAAAAAAAAAAAAAABB Z20 1       AAAAAAAAAAAAAAAAAACC Z20 1       AAAAAAAAAABBBBBBBBBB Z20 1       AAAAAAAAAABBBBBBBBBB Z20 1       AAAAAAAAAAAAAAAAAAAA X10 1       AAAAAAAAAAAAAAAAAAAB X10 1       AAAAAAAAAAAAAAAAAAAC X10 1       AAAAAAAAAAAAAAAAAAAF X10 1       AAAAAAAAAAAAAAAAAAAE X10 1       AAAAAAAAAAAAAAAAAAAF X10 1       AAAAAAAAAAAAAAAAAAAG X10 1       AAAAAAAAAABBBBBBBBBB X10 1       AAAAAAAAAABBBBBBBBBB X20 1       AAAAAAAAAAAAAAAAAACC X20 1       AAAAAAAAAAAAAAAAAAEE Y20 1       AAAAAAAAAAAAAAAAAAFF Y20 1       AAAAAAAAAABBBBBBBBBB Y20 1       AAAAAAAAAAAAAAAAAAGG Y20```
```

Weight info:
``````
VALUE                 FREQUENCY MPROB UPROB AGRWGT DISWGT X10                   8         0.90  0.44  1.01   -2.47 *ALL-VALUES-IN-TABLE* 0         0.90  0.01  1.84   -2.90 Y20                   4         0.90  0.22  2.01   -2.95 Z20                   4 0.90  0.22  2.01   -2.95 X20                   2         0.90  0.11  3.01   -3.15```
```

Matching result:
``````
QSMATCHSETID QSMATCHTYPE QSMATCHDATAID BlockId Descr                mcode QSMATCHWEIGHT MCODE_WEIGHT 3            XA          3             1       AAAAAAAAAABBBBBBBBBB Z20   8.50 3             DA         13            1       AAAAAAAAAABBBBBBBBBB X20   3.54          -2.95 3             DA         17            1       AAAAAAAAAABBBBBBBBBB Y20   3.54          -2.95 3             DA         12            1       AAAAAAAAAABBBBBBBBBB X10   4.02          -2.47 3             DA         4             1       AAAAAAAAAABBBBBBBBBB Z20   8.50           2.01 8            XA          8             1       AAAAAAAAAAAAAAAAAAAF X10   7.50 8             DA         14            1       AAAAAAAAAAAAAAAAAACC X20   0.76          -2.47 8             DA         15            1       AAAAAAAAAAAAAAAAAAEE Y20   0.76          -2.47 8             DA         18            1       AAAAAAAAAAAAAAAAAAGG Y20   0.76          -2.47 8             DA         1             1       AAAAAAAAAAAAAAAAAABB Z20   0.76          -2.47 8             DA         2             1       AAAAAAAAAAAAAAAAAACC Z20   0.76          -2.47 8             DA         16            1       AAAAAAAAAAAAAAAAAAFF Y20   2.39          -2.47 8             DA         5             1       AAAAAAAAAAAAAAAAAAAA X10   5.87           1.01 8             DA         6             1       AAAAAAAAAAAAAAAAAAAB X10   5.87           1.01 8             DA         7             1       AAAAAAAAAAAAAAAAAAAC X10   5.87           1.01 8             DA         9             1       AAAAAAAAAAAAAAAAAAAE X10   5.87           1.01 8             DA         11            1       AAAAAAAAAAAAAAAAAAAG X10   5.87           1.01 8             DA         10            1       AAAAAAAAAAAAAAAAAAAF X10   7.50           1.01```
```

For me strange things are:
• for first matching group (SetId=3, master record is with mcode="Z20") disagreement weight usually is from matched record (for record with mcode=Y20 weight is for Y20 value, for record mcode=X10 weights is for X10 value); But why for record mcode=X20 is choosed weight -2.95 (X20 has disagreement weight -3.15 )
• for second matching group (SetId=8, master record is with mcode="X10") for all mathed rows with different value in mcode Qualitystage choosed disagreement weight -2.47 (it is value for X10)
• why for the first matching group (SetId=3) Qualitystage didn't choose master record with mcode=X20 (in this group all values for Descr column is the same, but "X20" value has highest agreement weight)

Regards,
Oleg
• smithha
23 Posts

#### Re: how match weight is calculated?

‏2011-02-01T20:19:07Z
• OlegT.
• ‏2011-02-01T13:56:18Z
Hi Harald,
thanks for you help and patience.
I am still confused with disagreement weight choosing during matching.
I tried to use input dataset with different sort order on MCODE coumn and result was the same (there was no any difference compared to my result which I describe in my previous post).

Input data:
<pre class="jive-pre">``` BlockId Descr mcode 1 AAAAAAAAAAAAAAAAAABB Z20 1 AAAAAAAAAAAAAAAAAACC Z20 1 AAAAAAAAAABBBBBBBBBB Z20 1 AAAAAAAAAABBBBBBBBBB Z20 1 AAAAAAAAAAAAAAAAAAAA X10 1 AAAAAAAAAAAAAAAAAAAB X10 1 AAAAAAAAAAAAAAAAAAAC X10 1 AAAAAAAAAAAAAAAAAAAF X10 1 AAAAAAAAAAAAAAAAAAAE X10 1 AAAAAAAAAAAAAAAAAAAF X10 1 AAAAAAAAAAAAAAAAAAAG X10 1 AAAAAAAAAABBBBBBBBBB X10 1 AAAAAAAAAABBBBBBBBBB X20 1 AAAAAAAAAAAAAAAAAACC X20 1 AAAAAAAAAAAAAAAAAAEE Y20 1 AAAAAAAAAAAAAAAAAAFF Y20 1 AAAAAAAAAABBBBBBBBBB Y20 1 AAAAAAAAAAAAAAAAAAGG Y20``` </pre>

Weight info:
<pre class="jive-pre">``` VALUE FREQUENCY MPROB UPROB AGRWGT DISWGT X10 8 0.90 0.44 1.01 -2.47 *ALL-VALUES-IN-TABLE* 0 0.90 0.01 1.84 -2.90 Y20 4 0.90 0.22 2.01 -2.95 Z20 4 0.90 0.22 2.01 -2.95 X20 2 0.90 0.11 3.01 -3.15``` </pre>

Matching result:
<pre class="jive-pre">``` QSMATCHSETID QSMATCHTYPE QSMATCHDATAID BlockId Descr mcode QSMATCHWEIGHT MCODE_WEIGHT 3 XA 3 1 AAAAAAAAAABBBBBBBBBB Z20 8.50 3 DA 13 1 AAAAAAAAAABBBBBBBBBB X20 3.54 -2.95 3 DA 17 1 AAAAAAAAAABBBBBBBBBB Y20 3.54 -2.95 3 DA 12 1 AAAAAAAAAABBBBBBBBBB X10 4.02 -2.47 3 DA 4 1 AAAAAAAAAABBBBBBBBBB Z20 8.50 2.01 8 XA 8 1 AAAAAAAAAAAAAAAAAAAF X10 7.50 8 DA 14 1 AAAAAAAAAAAAAAAAAACC X20 0.76 -2.47 8 DA 15 1 AAAAAAAAAAAAAAAAAAEE Y20 0.76 -2.47 8 DA 18 1 AAAAAAAAAAAAAAAAAAGG Y20 0.76 -2.47 8 DA 1 1 AAAAAAAAAAAAAAAAAABB Z20 0.76 -2.47 8 DA 2 1 AAAAAAAAAAAAAAAAAACC Z20 0.76 -2.47 8 DA 16 1 AAAAAAAAAAAAAAAAAAFF Y20 2.39 -2.47 8 DA 5 1 AAAAAAAAAAAAAAAAAAAA X10 5.87 1.01 8 DA 6 1 AAAAAAAAAAAAAAAAAAAB X10 5.87 1.01 8 DA 7 1 AAAAAAAAAAAAAAAAAAAC X10 5.87 1.01 8 DA 9 1 AAAAAAAAAAAAAAAAAAAE X10 5.87 1.01 8 DA 11 1 AAAAAAAAAAAAAAAAAAAG X10 5.87 1.01 8 DA 10 1 AAAAAAAAAAAAAAAAAAAF X10 7.50 1.01``` </pre>

For me strange things are:
• for first matching group (SetId=3, master record is with mcode="Z20") disagreement weight usually is from matched record (for record with mcode=Y20 weight is for Y20 value, for record mcode=X10 weights is for X10 value); But why for record mcode=X20 is choosed weight -2.95 (X20 has disagreement weight -3.15 )
• for second matching group (SetId=8, master record is with mcode="X10") for all mathed rows with different value in mcode Qualitystage choosed disagreement weight -2.47 (it is value for X10)
• why for the first matching group (SetId=3) Qualitystage didn't choose master record with mcode=X20 (in this group all values for Descr column is the same, but "X20" value has highest agreement weight)

Regards,
Oleg
Hi Oleg,
My earlier response indicating selection based on the record order was incorrect. I did confirm that the selection between the pair will be based on whichever value is most frequently occurring.

What you see in this selection is correct:
• for instance, between Z20 and X20, Z20 occurs more frequently, therefore gets the disagreement weight of Z20.
• in the group with X10 as the master, X10 always occurs most frequently, therefore all pairings get the disagreement weight of X10.

Now I'm not sure why in the original post you may have seen a difference. Were you reviewing output in the Match Designer or on an output file?

Harald
• OlegT.
14 Posts

#### Re: how match weight is calculated?

‏2011-02-02T14:42:30Z
• smithha
• ‏2011-02-01T20:19:07Z
Hi Oleg,
My earlier response indicating selection based on the record order was incorrect. I did confirm that the selection between the pair will be based on whichever value is most frequently occurring.

What you see in this selection is correct:
• for instance, between Z20 and X20, Z20 occurs more frequently, therefore gets the disagreement weight of Z20.
• in the group with X10 as the master, X10 always occurs most frequently, therefore all pairings get the disagreement weight of X10.

Now I'm not sure why in the original post you may have seen a difference. Were you reviewing output in the Match Designer or on an output file?

Harald
Harald,
I got these weights from Match Designer. Now I am cannot reproduce this problem,
maybe there are additional factors which produce this strange results (for example I changed input file format - record delimiter).
Now I see that during mismatch Qualitystage choose weight for most frequent value (or maximum disagreement weight from two values).

What about my third question from my last post? As I know master record is a record with a highest composite weight calculated against this record itself. And according this definition it is a record with mcode="X20" (agreement weights for Descr column is 6.49 because I set NOFREQ spec.var.processing, and weight for X20 value is 3.16 so composite weight is 9.65 > 8.5 - composite weight for row with mcode="Y20"). Am I right?

Is there a way to output from Macth Stage for each row deatlisation of composite weight (weight for each attribute)?

Thanks,
Oleg
• EHARF
1 Post

#### Re: how match weight is calculated?

‏2017-03-23T09:56:34Z

Hello,
Kindly I need your help to know how i can add new dictionary of the "Arabe Countries" and how I can download it, for uses in standardization rules.

• yichen.kp
5 Posts

#### Re: how match weight is calculated?

‏2017-07-26T23:51:22Z

a) By default, the u probability for each comparison is calculated automatically by the matching process using the frequency information from the Frequency stage. This calculated u probability is important for columns with non-uniform distributions.

from

b) u-prob will be based either on a value from a Match Frequency stage output, or what you enter in the match comparison.

(1)According to above reference webpage, I guess i can get automatically calculated U probability.

so my question is how can I configure a Frequency stage (or other stage) to get U probability

(2) when in match designer-> match command->choose a certain column->properties

and it yells fails to retrieve freq data. How to fix this?

Updated on 2017-07-26T23:53:58Z at 2017-07-26T23:53:58Z by yichen.kp
• RobertDickson
69 Posts

#### Re: how match weight is calculated?

‏2017-09-13T11:58:30Z
• yichen.kp
• ‏2017-07-26T23:51:22Z

a) By default, the u probability for each comparison is calculated automatically by the matching process using the frequency information from the Frequency stage. This calculated u probability is important for columns with non-uniform distributions.

from

b) u-prob will be based either on a value from a Match Frequency stage output, or what you enter in the match comparison.

(1)According to above reference webpage, I guess i can get automatically calculated U probability.

so my question is how can I configure a Frequency stage (or other stage) to get U probability

(2) when in match designer-> match command->choose a certain column->properties

and it yells fails to retrieve freq data. How to fix this?

The frequency stage will get the frequencies to be used in the calculation. The frequency stage just accepts columns to compute. Make sure you use the defined output from that stage.

You need to have data and frequency (in data set form, not text file) and then load that data with 'Configure Specfification > Test Environment'

I hope this helps

Robert

• yichen.kp
5 Posts

#### Re: how match weight is calculated?

‏2017-09-13T16:10:14Z

The frequency stage will get the frequencies to be used in the calculation. The frequency stage just accepts columns to compute. Make sure you use the defined output from that stage.

You need to have data and frequency (in data set form, not text file) and then load that data with 'Configure Specfification > Test Environment'

I hope this helps

Robert

Hi Robert,

Thank you so much for your help. Could you please tell me more about  Test environment ?

The Test environment need the ODBC connection, right?

For the server, I have configured and test a DB2Wire connection

but for the client, I only have following ODBC Data Source. Could you please let me know what should I do on the client side or on the server side?

Best,

Yichen

• RobertDickson
69 Posts

#### Re: how match weight is calculated?

‏2017-09-13T16:57:47Z
• yichen.kp
• ‏2017-09-13T16:10:14Z

Hi Robert,

Thank you so much for your help. Could you please tell me more about  Test environment ?

The Test environment need the ODBC connection, right?

For the server, I have configured and test a DB2Wire connection

but for the client, I only have following ODBC Data Source. Could you please let me know what should I do on the client side or on the server side?

Best,

Yichen

Hi Yichen,

You do need to configure identical ODBC connectiom names on the client and server. For the client (pictured above), select the 'System' tab (not the user tab) and select 'Add'. You should see the IBM DB2 Wire Protocol driver.

Robert

• yichen.kp
5 Posts

#### Re: how match weight is calculated?

‏2017-09-13T19:25:55Z
• yichen.kp
• ‏2017-09-13T16:10:14Z

Hi Robert,

Thank you so much for your help. Could you please tell me more about  Test environment ?

The Test environment need the ODBC connection, right?

For the server, I have configured and test a DB2Wire connection

but for the client, I only have following ODBC Data Source. Could you please let me know what should I do on the client side or on the server side?

Best,

Yichen

Hi Robert,

Actually under System tab -> Add. I can not see the IBM DB2 Wire Protocol driver.

Does this mean that I need to install the IBM DB2 Wire Protocol driver on the clients computer first?

Is there any other ways to get the U-prob without configuration the test environment or install data driver on the clients side ?  Thank you so much!

• RobertDickson
69 Posts

#### Re: how match weight is calculated?

‏2017-09-19T21:52:52Z
• yichen.kp
• ‏2017-09-13T19:25:55Z

Hi Robert,

Actually under System tab -> Add. I can not see the IBM DB2 Wire Protocol driver.

Does this mean that I need to install the IBM DB2 Wire Protocol driver on the clients computer first?

Is there any other ways to get the U-prob without configuration the test environment or install data driver on the clients side ?  Thank you so much!

Hi,

You need to install the IBM ODBC Drivers. They should have been automatically installed with the QualityStage client.

You can develop a match without 'testing', but the Match Designer is the only place to see the u-prob.

Robert

• yichen.kp
5 Posts

#### Re: how match weight is calculated?

‏2017-09-26T21:34:54Z

Hi,

You need to install the IBM ODBC Drivers. They should have been automatically installed with the QualityStage client.

You can develop a match without 'testing', but the Match Designer is the only place to see the u-prob.

Robert

Hi Robert,

Yes, you are right! The IBM ODBC Driver has installed automatically. I need to navigate to the place for the ODBC Driver. After setting up the testing environment, in the match designer, Match Commands panel, I can right-click a data column or a reference column-> properties

Now I can get fig 1 and fig 2. It is the frequency for a column in data or reference data set. But according to

For match columns, the Frequency field provides u-prob and m-prob. Could you please help me figure out how to find the u-prob?

Thank you so much!

Best regards,

Yichen

#### Attachments

• RobertDickson
69 Posts

#### Re: how match weight is calculated?

‏2017-09-27T14:22:52Z

Hi,

It looks like the doc may be a bit outdated.

1) After testing the pass, select two rows and then 'Compare Weights > Based on Last Match Run'

To see:

I hope this helps.

Robert

• yichen.kp
5 Posts

#### Re: how match weight is calculated?

‏2017-10-17T01:00:26Z

Hi,

It looks like the doc may be a bit outdated.

1) After testing the pass, select two rows and then 'Compare Weights > Based on Last Match Run'

To see:

I hope this helps.

Robert

Hi Robert,

Thank you so much! I get the same result but I am afraid that this u-prob and m-prob is what I designed (input) in the match parameter instead of the "probability of accidental agreement calculated from frequency of data.

Thank you again for your help and in time response!

Best,

Yichen