Standardize your data using InfoSphere QualityStage

Dhanunjaya Lokireddy is a Senior QA Engineer working for the InfoSphere QualityStage team at IBM India Software Lab, Hyderabad. He has six years of experience at IBM, working for different QA teams in the Information Server product area.

Summary:  Data standardization is a process that ensures that data conforms to quality rules. This tutorial introduces data standardization concepts and demonstrates how you can achieve standardized data using IBM® InfoSphere® QualityStage™. A reader who is new to QualityStage standardization will get a basic understanding of the process. Readers should have basic knowledge of InfoSphere DataStage® job development. This tutorial covers standardization using country identifier, domain pre-processor, domain-specific and validation types of rule sets.

Date:  11 Aug 2011
Level:  Intermediate

Before you start

Editor's note: All personal data appearing in this tutorial is fictitious and was created for sample purposes only.

InfoSphere QualityStage overview

Enterprises often face issues with data arising out of lack of standards. Data may be entered in inconsistent ways across different systems, causing records to appear different even though they are actually the same. For example, the following two records describe the same person at the same address, even though the name and address appear to be quite different:

Bob Christiansan     614 Columbus Ave #3, Boston, Massachusetts 02116
R.J. Christensen     614 Columbus Suite #3, Suffolk County 02116

Another common error leading to "data surprises" is that data can be misplaced. Here is an example where several of the fields contain the wrong type of information. The name field contains address information, the tax ID field contains telephone numbers, and the telephone field contains city name information. This misplacement of data often leads to application errors.

Name                      Tax ID          Telephone
Becker & Co. C/O Bill     025-37-1998     415-392-2770
B Smith DBA Lime Cons.    228-02-1695     6173380220
1st Natl Provident        34-2671854      3309321
HP 15 State St.           508-466-1550    Orlando
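
As a rough illustration of how misplaced values can be spotted, the following Python sketch checks each value against the shape its column is expected to have. It is not part of QualityStage. The regular expressions are assumptions about what a tax ID and a telephone number look like in this sample, and the field boundaries in `rows` are inferred from the sample table. The name column is left unchecked because names have no regular shape.

```python
import re

# Assumed value shapes for this sample only; real formats vary by country.
SHAPES = {
    "tax_id": re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}"),
    "telephone": re.compile(r"(?:\d{3}-)?\d{3}-\d{4}|\d{7}|\d{10}"),
}

# Sample rows from the table above, with field boundaries inferred.
rows = [
    {"name": "Becker & Co. C/O Bill", "tax_id": "025-37-1998", "telephone": "415-392-2770"},
    {"name": "B Smith DBA Lime Cons.", "tax_id": "228-02-1695", "telephone": "6173380220"},
    {"name": "1st Natl Provident", "tax_id": "34-2671854", "telephone": "3309321"},
    {"name": "HP 15 State St.", "tax_id": "508-466-1550", "telephone": "Orlando"},
]


def find_misplaced(rows):
    """Return (row index, column, value) for each value that does not fit its column's shape."""
    problems = []
    for index, row in enumerate(rows):
        for column, shape in SHAPES.items():
            if not shape.fullmatch(row[column]):
                problems.append((index, column, row[column]))
    return problems


misplaced = find_misplaced(rows)
for problem in misplaced:
    print(problem)
```

With these assumed shapes, only the last row is flagged: its tax ID value is shaped like a telephone number, and its telephone value is a city name.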

A third common data standardization problem is the lack of consistent identifiers. The following three records all describe the same product, but they look different because each one uses different identifiers and abbreviations.

91-84-301 RS232 Cable 5' M-F CandS
CS-89641 5 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 5 Foot Cable
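
To make the idea concrete, the following Python sketch reduces each description to a small comparable key. It is only an illustration under stated assumptions, not how QualityStage works internally: the `product_key` function, the two regular expressions, and the choice of attributes (item, length in feet, connector gender) are hypothetical and cover only the spellings that appear in these three records.

```python
import re

# Assumed spellings of "feet" and of connector gender, taken from the sample records.
LENGTH_FT = re.compile(r"(\d+)\s*(?:'|ft\.?|foot|feet)", re.IGNORECASE)
GENDER = re.compile(r"\bm(?:ale)?\s*[-/]\s*f(?:emale)?\b", re.IGNORECASE)
ITEM = re.compile(r"\bcable\b", re.IGNORECASE)


def product_key(description):
    """Return (item, length in feet, connector gender); None marks an attribute not found."""
    length = LENGTH_FT.search(description)
    return (
        "CABLE" if ITEM.search(description) else None,
        int(length.group(1)) if length else None,
        "M/F" if GENDER.search(description) else None,
    )


records = [
    "91-84-301 RS232 Cable 5' M-F CandS",
    "CS-89641 5 ft. Cable Male-F, RS232 #87951",
    "C&SUCH6 Male/Female 25 PIN 5 Foot Cable",
]

keys = [product_key(record) for record in records]
for key in keys:
    print(key)
```

All three records produce the same key, which is what allows a later matching step to treat them as the same product.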

InfoSphere QualityStage (hereafter called QualityStage), a component product of InfoSphere Information Server, helps identify and resolve the issues described above and provides a way to maintain an accurate view of master data entities. QualityStage has the following capabilities:

  • Investigation — Helps you understand the nature and scope of data anomalies
  • Standardization — Parses individual fields and makes them uniform according to business standards
  • Matching — Identifies duplicate records within and across data sources
  • Survivorship — Helps eliminate duplicate records and create the best-of-breed record of data

Understanding the standardization process

Standardization parses or separates free-form fields into single component fields or assigns data to its appropriate metadata fields in a standard format.

Data is frequently captured with variations resulting from:

  • Data entry errors
  • Different conventions for representing the same data value
  • Semantic differences across systems
  • Multiple sources for the same data element
  • Lack of data quality standards

But the target systems require cleansed data for reporting and decision-making. Standardization helps improve the addressability of data stored in free-form columns and ensures that each data element has relevant content and format. It normalizes data values to standard forms and prepares data elements for more effective matching. It also helps in identifying and removing invalid data values. Standardization is important because it prepares the data for further processing.

Standardization is driven by special instructions called rule sets. The types of rule sets, each with an example, are:

  • Country identifier, such as COUNTRY
  • Domain pre-processor, such as USPREP
  • Domain-specific, such as USNAME
  • Validation, such as VDATE

Most of the packaged rule sets are country-specific. For example, there are different name standardization rule sets for the United States and Japan. As of InfoSphere Information Server V8.5, these rule sets are packaged with QualityStage. Advanced users can create rule sets based on their business requirements.

Rule sets have three required components:

  • Classification Table — Contains the keywords, standard value, and user-defined class
  • Dictionary File — Defines the layout of the output columns
  • Pattern-Action File — Contains the logic to populate output columns and parsing parameters

Figure 1. Standardization process overview
(Image: standardization process flow diagram)

Figure 1 shows an overview of the standardization process, which runs in four steps:

  1. Parses the input data into tokens using the parsing parameters (SEPLIST/STRIPLIST) in the pattern-action file
  2. Assigns user-defined classes from the classification table and applies default classes to the remaining tokens
  3. Forms the output fields as defined in the dictionary file
  4. Populates the output fields using the logic in the pattern-action file
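
The four steps can be sketched in Python to make the data flow concrete. This is not QualityStage rule set syntax. The `SEPLIST` and `STRIPLIST` strings, the classification entries, the dictionary column names, the class codes (`^`, `?`, `@`, `T`, `U`), and the pattern table are all hypothetical stand-ins chosen for illustration, and the separator handling is a simplification.

```python
# Hypothetical stand-ins for the three rule set components (not QualityStage syntax).
SEPLIST = " ,#"    # characters that end a token
STRIPLIST = " ,"   # separators that are dropped instead of kept as tokens

# Classification table: keyword -> (standard value, user-defined class)
CLASSIFICATION = {
    "AVE": ("AVE", "T"), "AVENUE": ("AVE", "T"),
    "ST": ("ST", "T"), "STREET": ("ST", "T"),
    "SUITE": ("STE", "U"), "STE": ("STE", "U"), "#": ("#", "U"),
}

# Dictionary: layout of the output columns
DICTIONARY = ["HouseNumber", "StreetName", "StreetType", "UnitType", "UnitValue"]

# Pattern-action table: class pattern -> output column for each token, in order
PATTERN_ACTIONS = {
    "^?T": ["HouseNumber", "StreetName", "StreetType"],
    "^?TU^": ["HouseNumber", "StreetName", "StreetType", "UnitType", "UnitValue"],
}


def tokenize(text):
    """Step 1: split the input on SEPLIST characters and discard STRIPLIST characters."""
    tokens, current = [], ""
    for ch in text.upper():
        if ch in SEPLIST:
            if current:
                tokens.append(current)
                current = ""
            if ch not in STRIPLIST:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def classify(token):
    """Step 2: use the classification table, else fall back to an illustrative default class."""
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if token.isdigit():
        return token, "^"   # all digits
    if token.isalpha():
        return token, "?"   # alphabetic word not in the table
    return token, "@"       # anything else


def standardize(text):
    """Steps 3 and 4: build empty output columns, then fill them if the pattern is known."""
    classified = [classify(token) for token in tokenize(text)]
    pattern = "".join(cls for _, cls in classified)
    record = dict.fromkeys(DICTIONARY, "")
    for column, (value, _) in zip(PATTERN_ACTIONS.get(pattern, []), classified):
        record[column] = value
    return pattern, record


pattern, record = standardize("614 Columbus Ave #3")
print(pattern, record)
```

An input whose class pattern is not in `PATTERN_ACTIONS` comes back with every output column empty, which in this sketch marks the record as unhandled.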

The remaining sections of the tutorial give detailed steps, with examples, for creating Standardize jobs using the different types of rule sets.
