Guidelines when determining probability

You can use these guidelines as you assign probabilities.

The higher the m probability is, the greater the disagreement weight is. Therefore, if a column is important, give its m probability a higher value. A high m probability is equivalent to saying that a disagreement on that column is a rare event in a matched pair, and consequently the penalty for a disagreement on that column is high. The weights computed from the probabilities are visible in the data viewer of the Match Designer so that you can inspect the results.

Use the following guidelines when determining m probabilities.

  • Give high m probabilities to the columns that are the most important and reliable.
  • Give lower m probabilities to the columns that are often in error or incomplete.
  • The m probability must always be greater than the u probability and must never be 0 or 1.
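
The effect of the m probability on the weights can be sketched with the standard probabilistic record linkage formulas, where the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). This is an illustration only: the function names are hypothetical, and it is an assumption that the weights shown in the Match Designer follow exactly these formulas.

```python
import math


def check_probabilities(m: float, u: float) -> None:
    """Enforce the guideline: 0 < u < m < 1."""
    if not (0.0 < u < m < 1.0):
        raise ValueError("m must be greater than u, and neither may be 0 or 1")


def agreement_weight(m: float, u: float) -> float:
    """Weight added when the column agrees (assumed formula: log2(m / u))."""
    check_probabilities(m, u)
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added when the column disagrees (assumed: log2((1 - m) / (1 - u))).

    The result is negative, so it acts as a penalty.
    """
    check_probabilities(m, u)
    return math.log2((1.0 - m) / (1.0 - u))


if __name__ == "__main__":
    # Hypothetical values: a reliable column (m = 0.99) and a less
    # reliable column (m = 0.90), both with u = 0.01.
    for m in (0.99, 0.90):
        print(m, agreement_weight(m, 0.01), disagreement_weight(m, 0.01))
```

With u held constant, raising m from 0.90 to 0.99 makes the disagreement weight more negative, which is why a reliable column penalizes a disagreement more heavily.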

Agreement or disagreement between data values is more significant for reliable data and less significant for unreliable data.

As a starting point, you can guess the u probability because the matching process replaces any guess with actual values. A good estimate is to set the u probability to 1/n, where n is the number of unique values for the column. By default, the u probability for each comparison is calculated automatically by the matching process using the frequency information from the Frequency stage. This calculated u probability is important for columns with non-uniform distributions.
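
The 1/n starting estimate can be sketched as follows. The function name and the sample values are hypothetical, and ignoring missing values when counting unique values is an assumption made for the illustration.

```python
from typing import Iterable, Optional


def initial_u_estimate(values: Iterable[Optional[str]]) -> float:
    """Return 1/n, where n is the number of unique non-missing values."""
    distinct = {v for v in values if v is not None}
    if len(distinct) < 2:
        # With fewer than two unique values, 1/n would be 1 (or undefined),
        # which is not a usable u probability.
        raise ValueError("need at least two unique values")
    return 1.0 / len(distinct)


if __name__ == "__main__":
    # Hypothetical column with two unique values.
    print(initial_u_estimate(["M", "F", "M", None, "F"]))
```

This estimate treats every value as equally likely, so it is only a reasonable guess for columns whose values are roughly uniformly distributed.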

The frequency information allows the matching process to vary the weights according to the particular values of a column. Rare values bring more weight to a match. For example, in a column such as FamilyName, the values Smith and Jones are common in the United States, whereas a value such as Alcott is relatively rare. A match on the rare family name, Alcott, gets a higher weight than a match on the more common family names because the probability of chance agreement on Alcott is relatively low compared to chance agreement on other values such as Smith or Jones.
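
The way value frequencies can change the weight is sketched below. The sketch assumes that the chance-agreement probability for a value is approximated by its relative frequency in the column and that the agreement weight is log2(m/u); the product's actual calculation may differ. The function name, the m probability, and the counts are made up for the illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def value_specific_agreement_weights(
    values: Iterable[str], m: float
) -> Dict[str, float]:
    """Map each value to log2(m / u_value).

    u_value is approximated by the value's relative frequency (an assumption).
    """
    counts = Counter(values)
    total = sum(counts.values())
    weights = {}
    for value, count in counts.items():
        u_value = count / total
        if not (0.0 < u_value < m < 1.0):
            raise ValueError(f"m must exceed the u probability for {value!r}")
        weights[value] = math.log2(m / u_value)
    return weights


if __name__ == "__main__":
    # Hypothetical FamilyName column: two common names and one rare name.
    family_names = ["Smith"] * 50 + ["Jones"] * 45 + ["Alcott"] * 5
    for name, weight in value_specific_agreement_weights(family_names, m=0.9).items():
        print(name, round(weight, 2))
```

Because Alcott has the lowest relative frequency in this sample, agreement on Alcott receives the largest weight.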

For columns with a uniform distribution of values and a high number of different values (such as individual identification numbers), it is better not to generate frequency information. Specify the vartype NOFREQ in the Variable Special Handling window of the Match Designer.

Even exact matching is subject to the same statistical laws as probabilistic matching. It is possible to have two records that contain identical values and yet do not represent the same entity. You cannot make a definitive determination when there is not enough information.