Standardize your data using InfoSphere QualityStage

Data standardization is a process that ensures that data conforms to quality rules. This tutorial introduces data standardization concepts and demonstrates how you can achieve standardized data using IBM® InfoSphere® QualityStage™. A reader who is new to QualityStage standardization will get a basic understanding of the process. Readers should have basic knowledge of InfoSphere DataStage® job development. This tutorial covers standardization using country identifier, domain pre-processor, domain-specific and validation types of rule sets.

Dhanunjaya Lokireddy (dhanunjaya@in.ibm.com), Senior QA Engineer, IBM India

Dhanunjaya Lokireddy is a Senior QA Engineer working for the InfoSphere QualityStage team at IBM India Software Lab, Hyderabad. He has six years of experience in IBM working for different QA teams in the Information Server product area.



11 August 2011

Before you start

Editor's note: All personal data appearing in this tutorial is fictitious and was created for sample purposes only.

InfoSphere QualityStage overview

Enterprises often face issues with data arising out of lack of standards. Data may be entered in inconsistent ways across different systems, causing records to appear different even though they are actually the same. For example, the following two records describe the same person at the same address, even though the name and address appear to be quite different:

Bob Christiansan    614 Columbus Ave #3, Boston, Massachusetts 02116
R.J. Christensen    614 Columbus Suite #3, Suffolk County 02116

Another common error leading to "data surprises" is that data can be misplaced. Here is an example where several of the fields contain the wrong type of information. The name field contains address information, the tax ID field contains telephone numbers, and the telephone field contains city name information. This misplacement of data often leads to application errors.

Name                      Tax ID          Telephone
Becker & Co. C/O Bill     025-37-1998     415-392-2770
B Smith DBA Lime Cons.    228-02-1695     6173380220
1st Natl Provident        34-2671854      3309321
HP 15 State St.           508-466-1550    Orlando

A third common data standardization problem is the lack of consistent identifiers. The following example has three records containing a product description. They look different, but they describe the same product; nothing in the records identifies it consistently.

91-84-301 RS232 Cable 5' M-F CandS
CS-89641 5 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 5 Foot Cable

InfoSphere QualityStage (hereafter called QualityStage), a component product of InfoSphere Information Server, helps identify and resolve the issues described above and provides a way to maintain an accurate view of master data entities. QualityStage has the following capabilities:

  • Investigation: Helps you understand the nature and scope of data anomalies
  • Standardization: Parses individual fields and makes them uniform according to business standards
  • Matching: Identifies duplicate records within and across data sources
  • Survivorship: Helps eliminate duplicate records and create the best-of-breed record of data

Understanding the standardization process

Standardization parses or separates free-form fields into single component fields or assigns data to its appropriate metadata fields in a standard format.

Data is frequently captured with variations resulting from:

  • Data entry errors
  • Different conventions for representing the same data value
  • Semantic differences across systems
  • Multiple sources for the same data element
  • Lack of data quality standards

But the target systems require cleansed data for reporting and decision-making. Standardization helps improve the addressability of data stored in free-form columns and ensures that each data element has relevant content and format. It normalizes data values to standard forms and prepares data elements for more effective matching. It also helps in identifying and removing invalid data values. Standardization is important because it prepares the data for further processing.

Standardization works based on special instructions called rule sets. Some rule sets are:

  • Country identifier, such as COUNTRY
  • Domain pre-processor, such as USPREP
  • Domain-specific, such as USNAME
  • Validation, such as VDATE

Most of the packaged rule sets are country-specific. For example, there are different name standardization rule sets for the United States and Japan. As of InfoSphere Information Server V8.5, these rule sets are packaged with QualityStage. Advanced users can create rule sets based on their business requirements.

Rule sets have three required components:

  • Classification Table: Contains the keywords, standard value, and user-defined class
  • Dictionary File: Defines the layout of the output columns
  • Pattern-Action File: Contains the logic to populate output columns and parsing parameters

Figure 1. Standardization process overview
Image shows standardization process flow diagram

Figure 1 shows an overview of the standardization process:

  1. Parses the input data using the pattern-action file parameters (SEPLIST and STRIPLIST)
  2. Assigns user-defined classes from the classification table and applies default classes to the remaining tokens
  3. Forms the output fields using the dictionary file
  4. Populates the output fields using the pattern-action file
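
The four steps can be mimicked in a few lines of Python. This is only a rough sketch of the idea, not the QualityStage engine: the separator and strip lists, the classification entries, the default class symbols, the output column names, and the single pattern rule are all assumptions made up for illustration.

```python
# Toy walk-through of the four standardization steps.
# Every table and name below is hypothetical, not taken from a packaged rule set.

SEPLIST = " ,#"    # characters that end a token (and become tokens themselves)
STRIPLIST = " ,"   # characters dropped from the token stream

# Classification table: keyword -> (standard value, user-defined class)
CLASSIFICATION = {
    "ST": ("ST", "T"),
    "STREET": ("ST", "T"),
    "AVE": ("AVE", "T"),
    "AVENUE": ("AVE", "T"),
}

# Dictionary: layout of the output columns
DICTIONARY = ["HouseNumber", "StreetName", "StreetType"]


def tokenize(text):
    """Step 1: split on SEPLIST characters and drop STRIPLIST characters."""
    tokens, current = [], ""
    for ch in text:
        if ch in SEPLIST:
            if current:
                tokens.append(current)
                current = ""
            if ch not in STRIPLIST:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def classify(token):
    """Step 2: use the classification table, else fall back to a default class."""
    key = token.upper()
    if key in CLASSIFICATION:
        return CLASSIFICATION[key]
    if token.isdigit():
        return (token, "^")   # assumed default class for numeric tokens
    if token.isalpha():
        return (key, "?")     # assumed default class for unknown words
    return (token, "@")       # assumed default class for everything else


def standardize_address(text):
    """Steps 3 and 4: lay out the output columns and fill them from a pattern rule."""
    classified = [classify(token) for token in tokenize(text)]
    pattern = " ".join(cls for _, cls in classified)
    output = dict.fromkeys(DICTIONARY, "")
    if pattern == "^ ? T":    # the one pattern-action rule in this sketch
        values = [value for value, _ in classified]
        output.update(zip(DICTIONARY, values))
    return pattern, output
```

Calling standardize_address("135 Exhibition Street") yields the pattern "^ ? T" and fills the three columns with 135, EXHIBITION, and ST. An input that does not match the single rule comes back with empty columns, which is where a real pattern-action file would carry many more rules.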

The remaining sections of the tutorial give detailed steps, with examples, for creating standardize jobs using the different types of rule sets.


Implementing the country identifier rule set

The country identifier rule set helps to identify the country using the given data. For example, take the following data:

Listing 1. Data records for country identifier example
Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42

The data contains records that belong to various countries. The steps below show how to use QualityStage to identify the country for each record.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 2. Configure the input sequential file stage to read the input file, which contains the example records listed above.

Figure 2. Parallel job with sequential and standardize stages
Screen capture shows diagram with data on the left, going through the standardize_country job, producing output_country on the right

Figure 3 shows the designer palette where the standardize stage is selected.

Figure 3. Designer palette showing standardize stage
Screen capture shows standardize option selected

Figure 4 shows the input sequential file with the data from the listing above.

Figure 4. Input sequential file view data
Image shows data input sequential file view data

Step 2: Configure the standardize stage

  1. Create a new process. Use the New Process button in the toolbar.
    Figure 5. Standardize stage properties
    Screen capture shows standardize stage properties where you select New Process

    The next screen is the standardize new rule process window, with the available columns listed.

    Figure 6. Standardize new rule process window
    Image shows data column listed as available
  2. For the listed data column, which is the input sequential file metadata, select Rule Sets > Other > COUNTRY.
    Figure 7. Rule set selection
    Image shows country selected
  3. Click the > button to move the Data column to the Selected column area.
    Figure 8. Standardize rule process window with selected rule set and columns
    Image shows standardize rule process window with selected rule set and columns
  4. Add the metadata delimiter. The metadata delimiter plays an important role in this type of rule set: it sets the default country code. If the country rule set can't determine the country based on the information provided, it defaults to the delimiter value. The format of the metadata delimiter is ZQ<Country Code>ZQ. In this example, we are setting US as the default country. Enter ZQUSZQ in the Literal field.
    Figure 9. Standardize rule process window with metadata delimiter entered
    Image shows ZQUSZQ in the Literal field
  5. Click the > button beside the Literal field.
    Figure 10. Using literal to set the country code
    Image shows literal moved to selected column area
  6. Use the Move Up and Move Down buttons to arrange the metadata delimiter in the following way:

    ZQUSZQ
    Data

    Click OK to add the process.
    Figure 11. Standardize rule process window with the metadata delimiter arranged in order
    Image shows standardize rule process window with the metadata delimiter arranged in order
    Figure 12. Standardize stage properties window with created rule process
    Image shows standardize stage properties window with created rule process
  7. Map the output columns (Stage Properties > Output > Mapping)
    The standardize stage produces columns based on the rule set selected. The following columns were selected in this example: ISOCountryCode_COUNTRY and IdentifierFlag_COUNTRY, along with the Data input column.

    Drag and drop the columns listed above to the output.
    Figure 13. Standardize stage output column mapping
    Image shows standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required properties, such as the file name, and adjust other settings, such as the format, as needed. Run the job and verify the output. Here is the output produced:

Figure 14. Output sequential file view data
Image shows country code for each record

Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
Country code for this record is identified as AU (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)

Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
Country code for this record is identified as GB (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)

Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
Country code for this record is identified as CA (ISOCountryCode_COUNTRY)
Country code is identified based on the data only (IdentifierFlag_COUNTRY)

Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42
Country code for this record is identified as US (ISOCountryCode_COUNTRY)
The country could not be identified from the data, so the default country code from the metadata delimiter (US) was used (IdentifierFlag_COUNTRY)
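
The fallback behavior of the metadata delimiter can be illustrated with a short Python sketch. It is not how the COUNTRY rule set works internally: the three regular expressions are hypothetical clues chosen only to cover the sample records, and the Y/N flag values are an assumption about how the identifier flag is reported.

```python
import re

# Metadata delimiter format from the tutorial: ZQ<Country Code>ZQ
DELIMITER = re.compile(r"^ZQ([A-Z]{2})ZQ$")

# Hypothetical clues, just enough to recognize the sample records.
COUNTRY_CLUES = [
    ("AU", re.compile(r"\b(VIC|NSW|QLD)\s+\d{4}\b")),                  # state + 4-digit postcode
    ("GB", re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s+\d[A-Z]{2}\b")),   # UK-style postcode
    ("CA", re.compile(r"\b[A-Z]\d[A-Z]\s+\d[A-Z]\d\b")),               # Canadian postal code
]


def identify_country(delimiter, data):
    """Return (country code, flag); the flag is 'Y' when the data decided, 'N' when the default was used."""
    match = DELIMITER.match(delimiter)
    if match is None:
        raise ValueError("delimiter must look like ZQ<Country Code>ZQ")
    default_code = match.group(1)

    text = data.upper()
    for code, clue in COUNTRY_CLUES:
        if clue.search(text):
            return code, "Y"
    # No clue matched, so fall back to the code carried by the delimiter.
    return default_code, "N"
```

With the delimiter ZQUSZQ, the first three sample records resolve to AU, GB, and CA from their postcodes, while the fourth record matches none of the clues and falls back to US.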


Implementing the domain pre-processor

The domain pre-processor will identify different domains (like name, address and area) from the given data and populate them to the correct fields. Let's take the following data:

"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"

It has three fields: Field1, Field2, and Field3 (see Figure 16), but the name, address, and area data are scattered across them. For example, the name is in Field3 in the first record but in Field1 in the second and third records. We will create a standardize job that uses the domain pre-processor rule set to identify the different domains.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 15. Configure the input sequential file stage to read the input file, which contains the example records listed above.

Figure 15. Parallel job with sequential and standardize stages
Image shows parallel job with sequential and standardize stages
Figure 16. Input sequential file view data
Image shows input sequential file view data

Step 2: Configure the standardize stage

  1. Create a new process.
    Figure 17. Standardize stage properties
    Image shows standardize stage properties
    Figure 18. Standardize new rule process window
    Image shows standardize new rule process window
  2. Select the USPREP rule set (Standardization Rules > USA > USPREP > USPREP) for the available columns Field1, Field2, and Field3, which is the input sequential file metadata.
    Figure 19. Rule set selection
    Image shows rule set selection
  3. Click the > button for the three fields to move them to the selected column area.
    Figure 20. Standardize rule process window with selected rule set and columns
    Image shows standardize rule process window with selected rule set and columns
  4. Add metadata delimiters. Metadata delimiters convey what kind of information we expect in each of the input fields. If the pre-processor cannot determine the domain of a token, the token defaults to the domain specified by the metadata delimiter. The format of the metadata delimiter is ZQ<Domain>ZQ. In this example, we anticipate that Field1 contains Name data, Field2 contains Address data, and Field3 contains Area data, so we add three delimiters: ZQNAMEZQ, ZQADDRZQ, and ZQAREAZQ. Start by entering ZQNAMEZQ in the Literal field.
    Figure 21. Standardize rule process window with metadata delimiter entered
    Image shows standardize rule process window with metadata delimiter entered
  5. Click the > button.
    Figure 22. Standardize rule process window with metadata delimiter selected
    Image shows standardize rule process window with metadata delimiter selected
  6. Repeat steps 4 and 5 to add delimiters ZQADDRZQ and ZQAREAZQ.
    Figure 23. Standardize rule process window with all metadata delimiters selected
    Image shows standardize rule process window with all metadata delimiters selected
  7. Use the Move Up and Move Down buttons to arrange the metadata delimiters in the following way:

    ZQNAMEZQ
    Field1
    ZQADDRZQ
    Field2
    ZQAREAZQ
    Field3

    Click OK to add the process.
    Figure 24. Standardize rule process window with all metadata delimiters arranged in order
    Image shows standardize rule process window with all metadata delimiters arranged in order
    Figure 25. Standardize stage properties window with created rule process
    Image shows standardize stage properties window with created rule process
  8. Map the output columns (Stage Properties > Output > Mapping)
    The standardize stage produces columns based on the rule set selected. The following columns were selected in this example: NameDomain_USPREP, AddressDomain_USPREP, and AreaDomain_USPREP.

    Drag and drop the columns listed above to the output.
    Figure 26. Standardize stage output column mapping
    Image shows standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required properties, such as the file name, and adjust other settings, such as the format, as needed. Run the job and verify the output. Figure 27 shows the output produced.

Figure 27. Output sequential file view data
Image shows output sequential file view data

"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"International Business Machines Corp" is identified as name domain (NameDomain)
"1480 CARRIAGE LN APT 301" is address domain (AddressDomain)
"AUBURN IN 467069555" is area domain (AreaDomain)

"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"Dr Jeffery David Thomson Jnr PHD" is identified as name domain (NameDomain)
"52280A NC 42 72 HWY # 42" is address domain (AddressDomain)
"KNOXVILLE TN 37920" is area domain (AreaDomain)

"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
"Peter heines" is identified as name domain (NameDomain)
"930 SOUTH BROAD ST EAST APT H" is address domain (AddressDomain)
"ASHVILLE NEW YORK 147109762" is area domain (AreaDomain)
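
The way the delimiters act as per-field defaults can be sketched in Python. This is a simplification under stated assumptions: the real pre-processor decides per token, whereas this sketch decides per field, and the three regular-expression clues are hypothetical rules picked to fit the sample data, not USPREP logic.

```python
import re

# Metadata delimiter format from the tutorial: ZQ<Domain>ZQ
DELIMITER = re.compile(r"^ZQ(NAME|ADDR|AREA)ZQ$")

# Hypothetical clues for guessing the domain of a whole field.
ZIP_AT_END = re.compile(r"\b(\d{5}|\d{9})$")     # ends with a 5- or 9-digit ZIP code
STARTS_WITH_NUMBER = re.compile(r"^\d")          # begins with a house number
LETTERS_ONLY = re.compile(r"^[A-Za-z .]+$")      # words only, no digits

OUTPUT_COLUMNS = {"NAME": "NameDomain", "ADDR": "AddressDomain", "AREA": "AreaDomain"}


def guess_domain(value):
    """Return 'AREA', 'ADDR', or 'NAME' when a clue matches, else None."""
    text = value.strip()
    if ZIP_AT_END.search(text):
        return "AREA"
    if STARTS_WITH_NUMBER.match(text):
        return "ADDR"
    if LETTERS_ONLY.match(text):
        return "NAME"
    return None


def preprocess(selected):
    """selected is a list of (delimiter, field value) pairs in rule-process order."""
    output = {column: "" for column in OUTPUT_COLUMNS.values()}
    for delimiter, value in selected:
        match = DELIMITER.match(delimiter)
        if match is None:
            raise ValueError("delimiter must be ZQNAMEZQ, ZQADDRZQ, or ZQAREAZQ")
        # Use the guessed domain, or the delimiter's domain when no clue matches.
        domain = guess_domain(value) or match.group(1)
        column = OUTPUT_COLUMNS[domain]
        output[column] = (output[column] + " " + value).strip()
    return output
```

For the first sample record, the value under ZQNAMEZQ starts with a number and is routed to the address domain, the value under ZQADDRZQ ends with a ZIP code and is routed to the area domain, and the value under ZQAREAZQ contains only words and is routed to the name domain. A value that matches no clue stays in the domain named by its delimiter.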


Implementing name standardization

This is the domain-specific type of standardization. Let's take the following name examples.

Dr Jeffery David Thomson Jnr PHD
International Business Machines Corp
Peter heines

These examples contain both individual and organization names, and we assume that they belong to the United States. Our intention here is to identify the different parts of each name, such as the first name, middle name, and primary (last) name.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 28. Configure the input sequential file stage to read the input file, which contains the example records listed above.

Figure 28. Parallel job with sequential and standardize stages
Image shows parallel job with sequential and standardize stages
Figure 29. Input sequential file view data
Image shows input sequential file view data

Step 2: Configure the standardize stage

  1. Create a new process.
    Figure 30. Standardize stage properties
    Image shows standardize stage properties
    Figure 31. Standardize new rule process window
    Image shows standardize new rule process window
  2. Select the USNAME rule set (Standardization Rules > USA > USNAME > USNAME) for the column "name," which is the input sequential file metadata.
    Figure 32. Rule set selection
    Image shows rule set selection
  3. Click the > button.
    Figure 33. Standardize rule process window with rule set selected
    Image shows standardize rule process window with rule set selected
  4. Do not add the "Optional NAMES Handling" option. The Optional NAMES Handling field has the following options:
    • Process All as Individual — All columns are standardized as individual names.
    • Process All as Organization — All columns are standardized as organization names.
    • Process Undefined as Individual — All unhandled columns are standardized as individual names.
    • Process Undefined as Organization — All unhandled columns are standardized as organization names.
    This option is useful if we know the types of names in the input file. For example, if the file mainly contains organization names, specifying Process All as Organization enhances performance by eliminating the processing steps of determining the name's type.
  5. Click OK.
    Figure 34. Standardize rule process window with selected rule set and columns
    Image shows standardize rule process window with selected rule set and columns
    Figure 35. Standardize stage properties window with created rule process
    Image shows standardize stage properties window with created rule process
  6. Map the output columns (Stage Properties > Output > Mapping)
    The standardize stage produces columns based on the rule set selected. In this example, the following columns were selected: NameType_USNAME, GenderCode_USNAME, NamePrefix_USNAME, FirstName_USNAME, MiddleName_USNAME, PrimaryName_USNAME, NameGeneration_USNAME, and NameSuffix_USNAME

    Drag and drop the above columns to the output.
    Figure 36. Standardize stage output column mapping
    Image shows standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required properties, such as the file name, and adjust other settings, such as the format, as needed. Run the job and verify the output. Figure 37 shows the output produced.

Figure 37. Output sequential file view data
Image shows output sequential file view data

Dr Jeffery David Thomson Jnr PHD
The data is identified as an individual name (NameType).
Gender is male (GenderCode).
Dr is the name prefix (NamePrefix).
Jeffery is the first name (FirstName).
David is the middle name (MiddleName).
Thomson is the primary name (PrimaryName).
Jr is identified as the generation (NameGeneration). The input contains Jnr, but the standardize stage outputs the commonly used standard form.
PHD is the name suffix (NameSuffix).

International Business Machines Corp
The data is identified as the organization name (NameType).
International Business Machines is the primary name (PrimaryName).
Corp is the name suffix (NameSuffix).

Peter heines
The data is identified as the individual name (NameType).
Gender is male (GenderCode).
Peter is the first name (FirstName).
Heines is the primary name (PrimaryName).
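
To make the output columns more concrete, here is a small Python sketch that splits a name in the same spirit. It is a toy under stated assumptions and not the USNAME rule set: the lookup tables are hypothetical and hold only what the three samples need, the I/O values for the name type are assumed codes, and the sketch uppercases everything for simplicity.

```python
# Hypothetical lookup tables, standing in for a classification table.
PREFIXES = {"DR": "DR", "MR": "MR", "MRS": "MRS"}
GENERATIONS = {"JNR": "JR", "JR": "JR", "SNR": "SR", "SR": "SR"}
SUFFIXES = {"PHD": "PHD", "MD": "MD"}
ORG_WORDS = {"CORP": "CORP", "INC": "INC", "LTD": "LTD"}
GENDER_BY_FIRST_NAME = {"JEFFERY": "M", "PETER": "M"}

COLUMNS = ["NameType", "GenderCode", "NamePrefix", "FirstName",
           "MiddleName", "PrimaryName", "NameGeneration", "NameSuffix"]


def standardize_name(text):
    """Split a free-form name into the columns used in this section."""
    tokens = [token.strip(".,").upper() for token in text.split()]
    tokens = [token for token in tokens if token]
    out = dict.fromkeys(COLUMNS, "")

    # Any organization keyword marks the whole name as an organization.
    if any(token in ORG_WORDS for token in tokens):
        out["NameType"] = "O"
        if tokens[-1] in ORG_WORDS:
            out["NameSuffix"] = ORG_WORDS[tokens.pop()]
        out["PrimaryName"] = " ".join(tokens)
        return out

    out["NameType"] = "I"
    if tokens and tokens[0] in PREFIXES:
        out["NamePrefix"] = PREFIXES[tokens.pop(0)]
    if tokens and tokens[-1] in SUFFIXES:
        out["NameSuffix"] = SUFFIXES[tokens.pop()]
    if tokens and tokens[-1] in GENERATIONS:
        out["NameGeneration"] = GENERATIONS[tokens.pop()]   # JNR becomes JR
    if tokens:
        out["PrimaryName"] = tokens.pop()        # last remaining word
    if tokens:
        out["FirstName"] = tokens.pop(0)         # first remaining word
    out["MiddleName"] = " ".join(tokens)         # whatever is left in between
    out["GenderCode"] = GENDER_BY_FIRST_NAME.get(out["FirstName"], "")
    return out
```

The order of the checks matters in this sketch: the suffix and generation are peeled off the end before the primary name is taken, which is why Jnr is normalized to JR instead of being mistaken for the last name.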


Implementing validation

This type of rule set is mainly used to validate the data (VDATE, VEMAIL, for example). Let's take the following date examples:

OCT021983
09211991
02/29/2011

These are some of the acceptable input formats. The standardization job verifies whether each date is valid. If it is, the job sets the valid flag and produces the output in the standard format CCYYMMDD; otherwise, it sets an invalid reason code.

Step 1: Create the parallel job

Create a parallel job as shown in Figure 38. Configure the input sequential file stage to read the input file, which contains the above example records.

Figure 38. Parallel job with sequential and standardize stages
Image shows parallel job with sequential and standardize stages
Figure 39. Input sequential file view data
Image shows input sequential file view data

Step 2: Configure the standardize stage

  1. Create a new process.
    Figure 40. Standardize stage properties
    Image shows standardize stage properties
    Figure 41. Standardize new rule process window
    Image shows standardize new rule process window
  2. Select the VDATE rule set (Standardization Rules > Other > VDATE) for the column "Date," which is the input sequential file metadata.
    Figure 42. Rule set selection
    Image shows rule set selection
  3. Click the > button.
    Figure 43. Standardize rule process window with rule set selected
    Image shows standardize rule process window with rule set selected
  4. Click OK.
    Figure 44. Standardize rule process window with selected rule set and columns
    Image shows standardize rule process window with selected rule set and columns
    Figure 45. Standardize stage properties window with created rule process
    Image shows standardize stage properties window with created rule process
  5. Map the output columns (Stage Properties > Output > Mapping)
    The standardize stage produces columns based on the rule set selected. In this example, the following columns were selected: ValidFlag_VDATE, DateCCYYMMDD_VDATE, and InvalidReason_VDATE, along with the input column "Date."

    Drag and drop the above columns to the output.
    Figure 46. Standardize stage output column mapping
    Image shows standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required properties, such as the file name, and adjust other settings, such as the format, as needed. Run the job and verify the output. Here is the output produced:

Figure 47. Output sequential file view data
Image shows output sequential file view data

OCT021983
Valid date (ValidFlag_VDATE)
19831002 is the standard format (DateCCYYMMDD_VDATE)

09211991
Valid date (ValidFlag_VDATE)
19910921 is the standard format (DateCCYYMMDD_VDATE)

02/29/2011
Invalid date (ValidFlag_VDATE)
The reason is that it is an invalid leap-year date: 2011 is not a leap year, so February 29 does not exist (InvalidReason_VDATE)
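
The same validate-then-normalize idea can be sketched in Python. This is an illustration only, not the VDATE rule set: it recognizes just the three sample formats, and the boolean flag and plain-English reason strings are assumptions standing in for the real flag values and reason codes.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# Hypothetical patterns covering only the three sample formats.
PATTERNS = [
    re.compile(r"^(?P<mon>[A-Za-z]{3})(?P<d>\d{2})(?P<y>\d{4})$"),   # OCT021983
    re.compile(r"^(?P<m>\d{2})(?P<d>\d{2})(?P<y>\d{4})$"),           # 09211991
    re.compile(r"^(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})$"),         # 02/29/2011
]


def validate_date(text):
    """Return (valid flag, date as CCYYMMDD, invalid reason)."""
    for pattern in PATTERNS:
        match = pattern.match(text.strip())
        if match is None:
            continue
        parts = match.groupdict()
        if "mon" in parts:
            month = MONTHS.get(parts["mon"].upper())
            if month is None:
                return (False, "", "unknown month name")
        else:
            month = int(parts["m"])
        year, day = int(parts["y"]), int(parts["d"])
        try:
            date(year, month, day)   # raises ValueError for dates that do not exist
        except ValueError:
            return (False, "", "no such calendar date")
        return (True, f"{year:04d}{month:02d}{day:02d}", "")
    return (False, "", "unrecognized format")
```

Letting the date constructor reject impossible dates keeps the leap-year logic out of the sketch: 02/29/2011 matches the third pattern but fails construction, so it comes back flagged invalid with a reason and no standardized value.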


Conclusion

In this tutorial, you have learned what the standardization process is and how it can be achieved by using InfoSphere QualityStage. You have also learned about standardization using different types of rule sets like country identifier, domain pre-processor, domain-specific, and validation.


Download

Description             Name                   Size
Sample jobs and data    SampleJobDesigns.zip   10KB
