In this tutorial, you use InfoSphere® DataStage® to
develop jobs that extract, transform, and load data. By transforming
and cleansing the source data and applying consistent formatting,
you enhance the quality of the data.
In this scenario, the worldwide companies, GlobalCo and WorldCo, are
merging. Because their businesses are similar, the two companies have
many customers in common. The merged company, GlobalCo Worldwide,
wants to build a data warehouse for its delivery and billing information.
The exercises in this tutorial focus on a small portion of the work
that needs to be done to accomplish this goal.
Your part of the project is to work on the GlobalCo data, which includes
customer billing records. You read this data from comma-separated
values (CSV) files, and then cleanse and transform the data, in preparation
for it to be merged with the equivalent data from WorldCo. This data
forms the GlobalCo_billing dimension table in the
finished data warehouse. Another developer merges this dimension table
with the WorldCo_billing dimension table to create
the billing information for GlobalCo Worldwide.
This tutorial guides you through the tasks that you complete to extract,
transform, and load the billing data for GlobalCo. The following steps
summarize the sequence of actions that you complete:
- In Module 1, you open the samplejob job
and explore each stage in the job. Understanding
how the stages operate is important before you begin designing your
own job. You also learn how to compile the job, run the job, and view
the generated output.
- In Module 2, you create your first job by adding stages and links
to the InfoSphere DataStage and QualityStage® Designer canvas.
You then import metadata to create table definitions that you use
throughout this tutorial. You also create parameters and parameter
sets that simplify your job design and that you can reuse
across your jobs.
- In Module 3, you design a transformation job that cleanses the
GlobalCo billing data so that it can be merged with the WorldCo billing
data. You then expand this job by adding multiple transformations
that apply stricter data typing to the billing data.
- In Module 4, you load the cleansed GlobalCo billing data into
a relational database so that other developers in the GlobalCo Worldwide
organization can access the data. You create a data connection object,
import column metadata from a database table, and then write the output
to an existing table in the database.
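InfoSphere DataStage jobs are designed graphically, so the tutorial itself involves no hand-written code. Purely as a conceptual analogy, the extract, cleanse, and load sequence that Modules 3 and 4 describe could be sketched in Python as follows. The column names, the sample rows, and the SQLite table are assumptions made for illustration; they are not taken from the tutorial files or from any InfoSphere DataStage API.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical sample that stands in for a GlobalCo billing CSV file.
SAMPLE_CSV = """customer_id,customer_name,bill_date,amount
 101 , acme corp ,2010-01-15, 250.00
102,Widget Ltd,2010-02-01,99.5
,Missing Id,2010-02-03,10.00
"""


def extract(text):
    """Extract: read raw rows from CSV text as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, reject rows without an ID, apply strict types."""
    cleansed = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if not row["customer_id"]:
            continue  # a record without an ID cannot be merged later
        cleansed.append((
            int(row["customer_id"]),                # string -> integer
            row["customer_name"].title(),           # consistent formatting
            datetime.strptime(row["bill_date"], "%Y-%m-%d").date().isoformat(),
            float(row["amount"]),                   # string -> number
        ))
    return cleansed


def load(rows, connection):
    """Load: write the cleansed rows to a (hypothetical) billing table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS globalco_billing ("
        "customer_id INTEGER, customer_name TEXT, bill_date TEXT, amount REAL)"
    )
    connection.executemany(
        "INSERT INTO globalco_billing VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
    load(transform(extract(SAMPLE_CSV)), conn)
    for record in conn.execute("SELECT * FROM globalco_billing"):
        print(record)
```

In the tutorial, each of these three functions corresponds loosely to one or more stages on the Designer canvas, and the links between stages play the role of the values passed from one function to the next.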
Learning objectives
As you work through the tutorial, you learn how
to complete the following tasks:
- Design parallel jobs that extract, transform, and load data
- Create reusable objects that can be included in other job designs
- Modify your job design to implement stricter data typing
- Run the jobs that you design and view the results
Time required
Before you begin the tutorial, ensure that your
InfoSphere DataStage and QualityStage Administrator
completed the steps in
Setting up the parallel job tutorial environment.
The time that is required to install and configure the tutorial depends
on your
InfoSphere DataStage environment.
This tutorial takes approximately four hours to finish. If you explore
other concepts related to this tutorial, it can take longer to complete.
System requirements
This tutorial requires the following hardware
and software:
Prerequisites
This tutorial is intended for novice InfoSphere DataStage designers
who want to learn how to create parallel jobs. Knowing about basic
InfoSphere DataStage concepts, such as jobs, stages, and links,
might be helpful, but is not required.
Notices: GlobalCo, WorldCo, and GlobalCo Worldwide
depict fictitious business operations with sample data used to develop
sample applications for IBM and IBM customers. These fictitious records
include sample data for sales transactions, product distribution,
finance, and human resources. Any resemblance to actual names, addresses,
contact numbers, or transaction values, is coincidental. Other sample
files might contain fictional data manually or machine generated,
factual data compiled from academic or public sources, or data used
with permission of the copyright holder, for use as sample data to
develop sample applications. Product names referenced might be the
trademarks of their respective owners. Unauthorized duplication is
prohibited.