Before you start
Editor's note: All personal data appearing in this tutorial is fictitious and was created for sample purposes only.
Enterprises often face data problems that arise from a lack of standards. Data may be entered inconsistently across different systems, so records that describe the same entity look different. For example, the following two records describe the same person at the same address, even though the name and address appear to be quite different:
|Name|Address|
|Bob Christiansan|614 Columbus Ave #3, Boston, Massachusetts 02116|
|R.J. Christensen|614 Columbus Suite #3, Suffolk County 02116|
Another common source of "data surprises" is misplaced data. In the following example, several fields contain the wrong type of information: the name field contains address information, the tax ID field contains a telephone number, and the telephone field contains a city name. Misplaced data of this kind often leads to application errors.
|Name|Tax ID|Telephone|
|Becker & Co. C/O Bill|025-37-1998|415-392-2770|
|B Smith DBA Lime Cons.|228-02-1695|6173380220|
|1st Natl Provident|34-2671854|3309321|
|HP 15 State St.|508-466-1550|Orlando|
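A check for this kind of misplacement can be sketched with a few regular expressions. The following Python sketch is illustrative only and does not reflect how QualityStage works internally; the function names, field names, and patterns (US-style tax IDs and telephone numbers) are assumptions chosen to fit the sample records above.

```python
import re

# Assumed formats, chosen to fit the sample records (not an exhaustive list).
PATTERNS = {
    "tax_id": re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}|\d{10}|\d{7}"),
}

# Which kind of value each field is expected to hold (hypothetical field names).
EXPECTED = {"tax_id": "tax_id", "telephone": "phone"}


def guess_kind(value):
    """Return the name of the first pattern that fully matches, else "text"."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(value.strip()):
            return kind
    return "text"


def misplaced_fields(record):
    """List (field, found_kind) pairs where the content does not fit the field."""
    return [
        (field, guess_kind(record[field]))
        for field, kind in EXPECTED.items()
        if guess_kind(record[field]) != kind
    ]


record = {"name": "HP 15 State St.", "tax_id": "508-466-1550", "telephone": "Orlando"}
print(misplaced_fields(record))
```

For the last sample record, the sketch reports that the tax ID field holds something shaped like a telephone number and that the telephone field holds plain text.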
A third common data standardization problem is the lack of consistent identifiers. The following three records contain product descriptions that look different but actually describe the same product:
|91-84-301 RS232 Cable 5' M-F CandS|
|CS-89641 5 ft. Cable Male-F, RS232 #87951|
|C&SUCH6 Male/Female 25 PIN 5 Foot Cable|
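To see why these records can be treated as the same product, it helps to reduce each description to a few standard attributes. The following Python sketch is a hedged illustration, not QualityStage logic; the `extract_features` function, the attribute names, and the two regular expressions are assumptions written only for these three sample records.

```python
import re

# Hypothetical patterns covering the variants seen in the sample records:
# lengths written as 5', 5 ft. or 5 Foot, and connectors written as
# M-F, Male-F or Male/Female.
LENGTH = re.compile(r"(\d+)\s*(?:'|ft\.?|foot|feet)(?![a-z])", re.IGNORECASE)
GENDER = re.compile(r"\b(?:m|male)\s*[-/]\s*(?:f|female)\b", re.IGNORECASE)


def extract_features(description):
    """Reduce a free-form product description to standard attributes."""
    features = {}
    length = LENGTH.search(description)
    if length:
        features["length_ft"] = int(length.group(1))
    if GENDER.search(description):
        features["connectors"] = "M-F"
    return features


descriptions = [
    "91-84-301 RS232 Cable 5' M-F CandS",
    "CS-89641 5 ft. Cable Male-F, RS232 #87951",
    "C&SUCH6 Male/Female 25 PIN 5 Foot Cable",
]
for text in descriptions:
    print(extract_features(text))
```

All three descriptions reduce to the same length and connector attributes, which is the kind of common ground a later matching step can compare.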
InfoSphere QualityStage (hereafter called QualityStage), a component of InfoSphere Information Server, helps identify and resolve the issues described above and provides a way to maintain an accurate view of master data entities. QualityStage has the following capabilities:
- Investigation — Helps you understand the nature and scope of data anomalies
- Standardization — Parses individual fields and makes them uniform according to business standards
- Matching — Identifies duplicate records within and across data sources
- Survivorship — Helps eliminate duplicate records and create the best-of-breed record of data
Standardization parses or separates free-form fields into single component fields or assigns data to its appropriate metadata fields in a standard format.
Data is frequently captured with variations resulting from:
- Data entry errors
- Different conventions for representing the same data value
- Semantic differences across systems
- Multiple sources for the same data element
- Lack of data quality standards
Target systems, however, require cleansed data for reporting and decision-making. Standardization improves the addressability of data stored in free-form columns and ensures that each data element has relevant content and format. It normalizes data values to standard forms, helps identify and remove invalid data values, and prepares data elements for more effective matching and other downstream processing.
Standardization is driven by special instructions called rule sets. Types of rule sets include:
- Country identifier, such as COUNTRY
- Domain pre-processor, such as USPREP
- Domain-specific, such as USNAME
- Validation, such as VDATE
Most of the packaged rule sets are country-specific. For example, there are different name standardization rule sets for the United States and Japan. As of InfoSphere Information Server V8.5, these rule sets are packaged with QualityStage. Advanced users can create rule sets based on their business requirements.
Rule sets have three required components:
- Classification Table — Contains the keywords, standard value, and user-defined class
- Dictionary File — Defines the layout of the output columns
- Pattern-Action File — Contains the logic to populate output columns and parsing parameters
Figure 1. Standardization process overview
As Figure 1 shows, the standardization process:
- Parses input data using the pattern-action file parameters (SEPLIST/STRIPLIST)
- Assigns user-defined classes from the classification table and applies default classes to the remaining tokens
- Forms output fields using the dictionary file
- Populates data into the output fields using the pattern-action file
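The four steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the QualityStage implementation or its pattern-action language: the separator and strip lists, the classification entries, the class symbols, the output column names, and the single pattern are all hypothetical values chosen for one sample address.

```python
# Hypothetical parsing parameters, standing in for SEPLIST and STRIPLIST.
SEPLIST = " ,#"    # characters that end a token
STRIPLIST = " ,"   # separators that are dropped rather than kept as tokens

# Hypothetical classification table: keyword -> (standard value, class).
CLASSIFICATION = {
    "AVE": ("AVE", "T"),
    "AVENUE": ("AVE", "T"),
    "SUITE": ("STE", "U"),
    "#": ("#", "U"),
}

# Hypothetical dictionary: the layout of the output columns.
DICTIONARY = ["HouseNumber", "StreetName", "StreetType",
              "UnitType", "UnitValue", "UnhandledData"]

# Hypothetical pattern-action logic: class pattern -> target columns.
PATTERN_ACTIONS = {
    "^ ? T U ^": ["HouseNumber", "StreetName", "StreetType",
                  "UnitType", "UnitValue"],
}


def tokenize(text):
    """Step 1: split on SEPLIST characters and drop those in STRIPLIST."""
    tokens, current = [], ""
    for ch in text.upper():
        if ch in SEPLIST:
            if current:
                tokens.append(current)
                current = ""
            if ch not in STRIPLIST:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def classify(token):
    """Step 2: use the table if possible, else fall back to a default class."""
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if token.isdigit():
        return token, "^"   # assumed default class for numeric tokens
    if token.isalpha():
        return token, "?"   # assumed default class for unknown words
    return token, "@"       # assumed default class for mixed tokens


def standardize(text):
    """Steps 3 and 4: form the output record and populate its columns."""
    classified = [classify(token) for token in tokenize(text)]
    pattern = " ".join(cls for _, cls in classified)
    record = dict.fromkeys(DICTIONARY, "")
    columns = PATTERN_ACTIONS.get(pattern)
    if columns is None:
        record["UnhandledData"] = text   # no pattern matched this input
        return record
    for (value, _), column in zip(classified, columns):
        record[column] = value
    return record


print(standardize("614 Columbus Ave #3"))
```

Input that matches no pattern, such as "614 Columbus Suite #3" in this sketch, is kept whole in the hypothetical UnhandledData column so that it can be reviewed rather than silently lost.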
The remaining sections of the tutorial contain detailed steps, with examples, to create standardize jobs using different types of rule sets.