Tutorial: Creating parallel jobs
In this tutorial, you use InfoSphere® DataStage® to develop jobs that extract, transform, and load data. By transforming and cleansing the source data and applying consistent formatting, you enhance the quality of the data.
In this scenario, the worldwide companies, GlobalCo and WorldCo, are merging. Because their businesses are similar, the two companies have many customers in common. The merged company, GlobalCo Worldwide, wants to build a data warehouse for their delivery and billing information. The exercises in this tutorial focus on a small portion of the work that needs to be done to accomplish this goal.
Your part of the project is to work on the GlobalCo data that includes billing records for customer data. You read this data from comma-separated values (CSV) files, and then cleanse and transform the data in preparation for it to be merged with the equivalent data from WorldCo. This data forms the GlobalCo_billing dimension table in the finished data warehouse. Another developer merges this dimension table with the WorldCo_billing dimension table to create the billing information for GlobalCo Worldwide.
This tutorial guides you through the tasks that you complete to extract, transform, and load the billing data for GlobalCo. The following steps summarize the sequence of actions that you complete:
- In Module 1, you open the samplejob job and explore each stage that composes it. Understanding how the stages operate is important before you begin designing your own jobs. You also learn how to compile the job, run it, and view the generated output.
- In Module 2, you create your first job by adding stages and links to the InfoSphere DataStage and QualityStage® Designer canvas. You then import metadata to create table definitions that you use throughout this tutorial. You also create parameters and parameter sets that simplify your job design and promote reuse across your jobs.
- In Module 3, you design a transformation job that cleanses the GlobalCo billing data so that it can be merged with the WorldCo billing data. You then expand this job with additional transformations that apply stricter data typing to the billing data.
- In Module 4, you load the cleansed GlobalCo billing data into a relational database so that other developers in the GlobalCo Worldwide organization can access the data. You create a data connection object, import column metadata from a database table, and then write the output to an existing table in the database.
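Although you build these jobs graphically in the Designer, the underlying extract, transform, and load flow can be sketched in ordinary code. The following Python sketch is illustrative only, not DataStage output; the column names (CUSTOMER_ID, ITEM, AMOUNT) and the SQLite target are hypothetical stand-ins for the tutorial's actual source files and target database.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw billing rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: trim stray whitespace and apply stricter data typing,
    rejecting rows whose amount is not a valid number."""
    cleansed = []
    for row in rows:
        try:
            cleansed.append({
                "customer_id": row["CUSTOMER_ID"].strip(),
                "item": row["ITEM"].strip().title(),
                "amount": round(float(row["AMOUNT"]), 2),  # enforce numeric type
            })
        except (KeyError, ValueError):
            continue  # drop rows that fail the stricter typing
    return cleansed

def load(rows, conn):
    """Load: write the cleansed rows to a relational dimension table."""
    conn.execute("CREATE TABLE IF NOT EXISTS GlobalCo_billing "
                 "(customer_id TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO GlobalCo_billing VALUES (?, ?, ?)",
                     [(r["customer_id"], r["item"], r["amount"]) for r in rows])
    conn.commit()
```

In a DataStage job, each of these functions corresponds roughly to a stage on the canvas (a file source stage, a Transformer stage, and a database target stage) connected by links that carry the rows between them.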
Learning objectives
As you work through the tutorial, you learn how to complete the following tasks:
- Design parallel jobs that extract, transform, and load data
- Create reusable objects that can be included in other job designs
- Modify your job design to implement stricter data typing
- Run the jobs that you design and view the results
Time required
Before you begin the tutorial, ensure that your InfoSphere DataStage and QualityStage Administrator completed the steps in Setting up the parallel job tutorial environment. The time that is required to install and configure the tutorial depends on your InfoSphere DataStage environment. This tutorial takes approximately four hours to finish. If you explore other concepts that are related to this tutorial, it can take longer to complete.
System requirements
This tutorial requires the following hardware and software:
- InfoSphere DataStage clients installed on a Windows platform
- A connection to an InfoSphere DataStage server that runs on a Windows or UNIX platform

Tip: Windows servers can be on the same computer as the clients.
Prerequisites
This tutorial is intended for novice InfoSphere DataStage designers who want to learn how to create parallel jobs. Knowing about basic InfoSphere DataStage concepts, such as jobs, stages, and links, might be helpful, but is not required.

- Setting up the parallel job tutorial environment
  Before you can start the tutorial, your IBM InfoSphere DataStage and QualityStage Administrator must create folders, create the tutorial project, import source files, and complete other setup tasks.
- Module 1: Opening and running the sample job
  The tutorial includes a sample job that you explore to better understand basic concepts about jobs. You open the sample job, explore the stages that comprise the job, and then compile and run the job.
- Module 2: Designing your first job
  You learned how to open, compile, and run the sample job. However, that job was prebuilt, so it is time to learn how to design your own job.
- Module 3: Transforming data
  You developed a job that writes data from a source file to a target file. Now that you understand the basics of job design, you learn how to design a job that transforms data.
- Module 4: Loading data to a relational database
  In this module, you write the GlobalCo_billing data to a relational database. This database is the final target for the GlobalCo billing data that you transformed and cleansed.