IBM InfoSphere Master Data Management, Version 10.1

Method for evaluating data quality

To evaluate the quality of the data, the values in specific fields across multiple records need to be compared to one another.

The method used to evaluate data quality should differentiate between data quality issues that range from cosmetic differences to missing and even totally incorrect values.

The significance of a data quality issue should be adequately reflected by the data quality penalty. The method should impose higher penalties on severe information quality issues and lower penalties on less significant ones.

Policy monitoring uses a fuzzy comparison method to evaluate data quality. A fuzzy comparison is the technique of finding values with an approximate or exact match, rather than only an exact match.
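The difference between an exact and a fuzzy comparison can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's difflib similarity ratio as a stand-in, not the MDM matching algorithms, and the function names and the 0.75 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def is_exact_match(a: str, b: str) -> bool:
    """Exact comparison: the two values must be identical."""
    return a == b


def is_fuzzy_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Fuzzy comparison: accept exact or approximate matches.

    The threshold is an arbitrary illustrative value, not an MDM setting.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


for pair in [("SMITH", "SMITH"), ("SMITH", "SMYTH"), ("SMITH", "JONES")]:
    print(pair, is_exact_match(*pair), is_fuzzy_match(*pair))
```

A value that differs by one character fails the exact comparison but passes the fuzzy one, while an unrelated value fails both.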

The following example shows the values for the Name attribute. The golden record value is WILLIAM SMITH. There are five matching source records.

Record	Value
A	WILLIAM SMITH
B	B. SMITH
C	WILLIAM SMYTH
D	NAME UNKNOWN
E	JAMES WILLIAMS

The five source records are linked and share the same enterprise identifier (EID) with the golden record. Even for records D and E, where the names differ significantly from the golden record source, there is enough similarity in other matching attributes, such as address, phone number, and email, to match and link the records with a common identifier.

With the fuzzy comparison method, the records are evaluated with the following results:

Record A is an exact match with the golden record source.
Record B shows a deviation from the golden record source. The value is almost right, except that the first name is incomplete and the nickname Bill is implied.
Record C shows a deviation from the golden record source. The value for the last name is incorrect but only one character is different.
Record D does not contain a value for the Name attribute. The blank is replaced with an anonymous value of NAME UNKNOWN.
Record E displays a completely incorrect name that seems to belong to another person.

Data quality penalties

The fuzzy comparison functions in the MDM matching algorithms calculate the data quality for each attribute and the data quality penalty.

For record B the matching algorithm recognizes that William and Bill are synonymous and that Bill is a nickname for William. In calculating the data quality, the attribute content is penalized slightly since some information has been lost.

For record C the data quality penalty is higher than for record B but not significantly higher since the data attribute still contains valuable information. The matching algorithm recognizes that the name contains only one incorrect character and that the stored value WILLIAM SMYTH is close to the golden record source value WILLIAM SMITH.

For record D the value is NAME UNKNOWN. The matching algorithms recognize this value and interpret it as equivalent to a blank. The data quality penalty for a blank or anonymous attribute results in a data quality of zero (0).

For record E the content is incorrect and displays the wrong name. This results in the highest data quality penalty. Not only is the original content for the name WILLIAM SMITH lost, but the attribute contains misleading or confusing information represented by the incorrect name JAMES WILLIAMS. In this case the data quality penalty is higher than for record D and the data quality is negative.

Data quality scores

A matching algorithm function called MemScore is used to compare two values and return a score quantifying the similarity between the two values. The following equation is the foundation of the method that is used for calculating data quality scores:

Data Quality = MemScore (source record value, golden record value) / MemScore (golden record value, golden record value)

The MemScore function in the numerator of the equation computes the similarity between the source record value and the golden record value. The MemScore function in the denominator represents the result of self-scoring the golden record value. The score in the numerator is always lower than or equal to the score in the denominator. The scores are equal when the source record value and the golden record value are the same.
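The normalization in the equation can be sketched as follows. This is a minimal illustration, not the product's implementation: the scorer is passed in as a parameter because MemScore itself is not reproduced here, and the raw scores in the table are made-up numbers that exist only to exercise the division.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def data_quality(source_value: str, golden_value: str, score: Scorer) -> float:
    """Divide the source-versus-golden score by the golden value's self-score."""
    self_score = score(golden_value, golden_value)
    if self_score <= 0:
        # Guard added for the sketch: the ratio is undefined without a
        # positive self-score.
        raise ValueError("golden value must have a positive self-score")
    return score(source_value, golden_value) / self_score


# Made-up raw scores (assumption), standing in for real MemScore results.
RAW_SCORES = {
    ("WILLIAM SMITH", "WILLIAM SMITH"): 8.0,
    ("WILLIAM SMYTH", "WILLIAM SMITH"): 7.0,
}


def table_scorer(a: str, b: str) -> float:
    """Hypothetical scorer that looks up the made-up raw scores."""
    return RAW_SCORES[(a, b)]


identical = data_quality("WILLIAM SMITH", "WILLIAM SMITH", table_scorer)
near_miss = data_quality("WILLIAM SMYTH", "WILLIAM SMITH", table_scorer)
print(identical, near_miss)
```

Because the self-score is the largest score the golden value can receive, an identical source value yields exactly 1 and any other value yields less.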

The data quality scores for the records in this example are:

Record A scores 1 or 100%.
Records B and C score positive numbers that are less than 100%.
Record D scores 0 because one of the values is not available or anonymous.
Record E scores a negative number because the compared values differ significantly from each other.
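The pattern of scores above can be reproduced with a toy scorer. The sketch below is a hypothetical stand-in for MemScore: the token weights, the nickname table, the anonymous-value list, and the 0.75 similarity threshold are all invented for illustration, and it assumes every name has the form GIVEN FAMILY.

```python
from difflib import SequenceMatcher

# Invented weights and lookup tables (assumptions, not MDM values).
EXACT, NICKNAME, CLOSE, DIFFERENT = 4.0, 3.5, 3.0, -2.0
NICKNAMES = {"WILLIAM": {"BILL", "WILL"}}
ANONYMOUS = {"", "NAME UNKNOWN"}


def token_score(a: str, b: str) -> float:
    """Score one pair of name tokens."""
    a, b = a.strip("."), b.strip(".")
    if a == b:
        return EXACT
    # A nickname or initial that is consistent with the other token.
    for short, full in ((a, b), (b, a)):
        candidates = {full} | NICKNAMES.get(full, set())
        is_initial = len(short) == 1 and any(c.startswith(short) for c in candidates)
        if short in candidates or is_initial:
            return NICKNAME
    # A near-identical spelling still carries most of the information.
    if SequenceMatcher(None, a, b).ratio() >= 0.75:
        return CLOSE
    # Conflicting tokens are penalized with a negative weight.
    return DIFFERENT


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for MemScore."""
    a, b = a.strip().upper(), b.strip().upper()
    if a in ANONYMOUS or b in ANONYMOUS:
        return 0.0  # blank or anonymous values contribute nothing
    return sum(token_score(x, y) for x, y in zip(a.split(), b.split()))


def data_quality(source: str, golden: str) -> float:
    """Source-versus-golden score divided by the golden self-score."""
    return toy_score(source, golden) / toy_score(golden, golden)


GOLDEN = "WILLIAM SMITH"
RECORDS = {
    "A": "WILLIAM SMITH",
    "B": "B. SMITH",
    "C": "WILLIAM SMYTH",
    "D": "NAME UNKNOWN",
    "E": "JAMES WILLIAMS",
}
SCORES = {rec: data_quality(value, GOLDEN) for rec, value in RECORDS.items()}
for rec, score in SCORES.items():
    print(rec, score)
```

With these invented weights, record A scores 1.0, records B and C score 0.9375 and 0.875, record D scores 0.0, and record E scores -0.5. Only the ordering and the signs are meaningful; the magnitudes are artifacts of the chosen weights.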


Last updated: 18 Oct 2012
