IBM InfoSphere Master Data Management, Version 10.1
To evaluate the quality of the data, the values in specific fields across multiple records need to be compared to one another.
The method used to evaluate data quality should differentiate between data quality issues that range from cosmetic differences to missing and even totally incorrect values.
The significance of a data quality issue should be adequately reflected by the data quality penalty. The method should impose higher penalties on severe information quality issues and lower penalties on less significant ones.
Policy monitoring uses a fuzzy comparison method to evaluate data quality. A fuzzy comparison finds values that match approximately as well as values that match exactly, rather than accepting only exact matches.
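The difference between an exact and a fuzzy comparison can be sketched in a few lines of Python. This is only a generic illustration built on the standard library's `difflib.SequenceMatcher`; it is not the comparison function that MDM uses, and the helper names `similarity` and `is_fuzzy_match` as well as the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0.0, 1.0]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two values as matching when they are similar enough.

    The threshold is an arbitrary illustrative value, not an MDM setting.
    """
    return similarity(a, b) >= threshold


# An exact comparison rejects a value that differs by one character ...
exact = "WILLIAM SMITH" == "WILLIAM SMYTH"                  # False
# ... while a fuzzy comparison still accepts it as an approximate match.
fuzzy = is_fuzzy_match("WILLIAM SMITH", "WILLIAM SMYTH")    # True
```

A plain character-based ratio like this one cannot recognize nicknames or placeholder values; those require additional knowledge in the matching algorithm.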
| Record | Value |
|---|---|
| A | WILLIAM SMITH |
| B | BILL SMITH |
| C | WILLIAM SMYTH |
| D | NAME UNKNOWN |
| E | JAMES WILLIAMS |
The five source records are linked and share the same enterprise identifier (EID) with the golden record. Even for records D and E, where the names differ significantly from the golden record source, there is enough similarity in other matching attributes, such as address, phone number, and email, to match and link the records with a common identifier.
The fuzzy comparison functions in the MDM matching algorithms calculate the data quality for each attribute and the data quality penalty.
For record B the matching algorithm recognizes that William and Bill are synonymous and that Bill is a nickname for William. In calculating the data quality, the attribute content is penalized slightly because some information has been lost.
For record C the data quality penalty is higher than for record B but not significantly higher since the data attribute still contains valuable information. The matching algorithm recognizes that the name contains only one incorrect character and that the stored value WILLIAM SMYTH is close to the golden record source value WILLIAM SMITH.
For record D the value is NAME UNKNOWN. The matching algorithms recognize this value and interpret it as equivalent to a blank. The data quality penalty for a blank or anonymous attribute results in a data quality of zero (0).
For record E the content is incorrect and displays the wrong name. This results in the highest data quality penalty. Not only is the original content for the name WILLIAM SMITH lost, but the attribute contains misleading or confusing information represented by the incorrect name JAMES WILLIAMS. In this case the data quality penalty is higher than for record D and the data quality is negative.
Data Quality = MemScore (source record value, golden record value) / MemScore (golden record value, golden record value)
The MemScore function in the numerator of the equation computes the similarity between the source record value and the golden record value. The MemScore function in the denominator represents the result of self-scoring the golden record value. The score in the numerator is always lower than or equal to the score in the denominator. The scores are equal when the source record value and the golden record value are the same.
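The ratio can be illustrated with a small Python sketch. The real MemScore function is internal to the MDM matching algorithms, so `toy_memscore` below is a hypothetical stand-in: its token weights, the `NICKNAMES` and `ANONYMOUS` tables, and the 0.75 spelling threshold are all assumptions chosen only to reproduce the ordering described above (identical, nickname, misspelling, anonymous, wrong name).

```python
from difflib import SequenceMatcher
from itertools import zip_longest

# All tables and weights are illustrative assumptions, not MDM internals.
NICKNAMES = {"BILL": "WILLIAM"}      # hypothetical nickname table
ANONYMOUS = {"", "NAME UNKNOWN"}     # values treated like a blank


def toy_memscore(value: str, reference: str) -> float:
    """Toy stand-in for MemScore: sums one weight per name token."""
    if value in ANONYMOUS or reference in ANONYMOUS:
        return 0.0                   # a blank neither agrees nor disagrees
    score = 0.0
    for v, r in zip_longest(value.split(), reference.split(), fillvalue=""):
        if v == r:
            score += 4.0             # exact agreement
        elif NICKNAMES.get(v, v) == NICKNAMES.get(r, r):
            score += 3.0             # nickname: slight loss of information
        elif SequenceMatcher(None, v, r).ratio() >= 0.75:
            score += 2.0             # close spelling, such as one wrong character
        else:
            score -= 2.0             # disagreement: misleading content
    return score


def data_quality(source: str, golden: str) -> float:
    """Score of source against golden, divided by the golden self-score."""
    self_score = toy_memscore(golden, golden)
    if self_score == 0:
        raise ValueError("golden record value is blank; the ratio is undefined")
    return toy_memscore(source, golden) / self_score


golden = "WILLIAM SMITH"
names = ["WILLIAM SMITH", "BILL SMITH", "WILLIAM SMYTH",
         "NAME UNKNOWN", "JAMES WILLIAMS"]
for name in names:
    print(f"{name:15} {data_quality(name, golden):+.3f}")
```

With these assumed weights the sketch prints 1.000 for the identical value, 0.875 for the nickname, 0.750 for the misspelling, 0.000 for the anonymous value, and -0.500 for the wrong name. Only the ordering and the signs are meaningful; the actual values produced by MDM depend on its own scoring weights.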