Tutorial: Creating parallel jobs
In this tutorial, you use InfoSphere® DataStage® to develop jobs that extract, transform, and load data. By transforming and cleansing the source data and applying consistent formatting, you enhance the quality of the data.
In this scenario, the worldwide companies, GlobalCo and WorldCo, are merging. Because their businesses are similar, the two companies have many customers in common. The merged company, GlobalCo Worldwide, wants to build a data warehouse for their delivery and billing information. The exercises in this tutorial focus on a small portion of the work that needs to be done to accomplish this goal.
Your part of the project is to work on the GlobalCo data that includes billing records for customer data. You read this data from comma-separated values (CSV) files, and then cleanse and transform the data in preparation for it to be merged with the equivalent data from WorldCo. This data forms the GlobalCo_billing dimension table in the finished data warehouse. Another developer merges this dimension table with the WorldCo_billing dimension table to create the billing information for GlobalCo Worldwide.
This tutorial guides you through the tasks that you complete to extract, transform, and load the billing data for GlobalCo. The following steps summarize the sequence of actions that you complete:
- In Module 1, you open the samplejob job and explore each stage that composes it. Understanding how the stages operate is important before you begin designing your own jobs. You also learn how to compile the job, run it, and view the generated output.
- In Module 2, you create your first job by adding stages and links to the InfoSphere DataStage and QualityStage® Designer canvas. You then import metadata to create table definitions that you use throughout this tutorial. You also create parameters and parameter sets that simplify your job design and promote reuse across your jobs.
- In Module 3, you design a transformation job that cleanses the GlobalCo billing data so that it can be merged with the WorldCo billing data. You then expand this job with additional transformations that apply stricter data typing to the billing data.
- In Module 4, you load the cleansed GlobalCo billing data into a relational database so that other developers in the GlobalCo Worldwide organization can access the data. You create a data connection object, import column metadata from a database table, and then write the output to an existing table in the database.
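Although you build these jobs graphically in the Designer, the underlying extract, transform, and load flow can be sketched in ordinary code. The following Python sketch is illustrative only, not DataStage output; the column names (CUSTOMER_ID, ITEM, AMOUNT) and the SQLite target are hypothetical stand-ins for the tutorial's actual source files and target database.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw billing rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: trim stray whitespace and apply stricter data typing,
    rejecting rows whose amount is not a valid number."""
    cleansed = []
    for row in rows:
        try:
            cleansed.append({
                "customer_id": row["CUSTOMER_ID"].strip(),
                "item": row["ITEM"].strip().title(),
                "amount": round(float(row["AMOUNT"]), 2),  # enforce numeric type
            })
        except (KeyError, ValueError):
            continue  # drop rows that fail the stricter typing
    return cleansed

def load(rows, conn):
    """Load: write the cleansed rows to a relational dimension table."""
    conn.execute("CREATE TABLE IF NOT EXISTS GlobalCo_billing "
                 "(customer_id TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO GlobalCo_billing VALUES (?, ?, ?)",
                     [(r["customer_id"], r["item"], r["amount"]) for r in rows])
    conn.commit()
```

In a DataStage job, each of these functions corresponds roughly to a stage on the canvas (a file source stage, a Transformer stage, and a database target stage) connected by links that carry the rows between them.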
Learning objectives
As you work through the tutorial, you learn how to complete the following tasks:
- Design parallel jobs that extract, transform, and load data
- Create reusable objects that can be included in other job designs
- Modify your job design to implement stricter data typing
- Run the jobs that you design and view the results
Time required
Before you begin the tutorial, ensure that your InfoSphere DataStage and QualityStage Administrator completed the steps in Setting up the parallel job tutorial environment. The time that is required to install and configure the tutorial depends on your InfoSphere DataStage environment. This tutorial takes approximately four hours to finish. If you explore other concepts that are related to this tutorial, it can take longer to complete.
System requirements
This tutorial requires the following hardware and software:
- InfoSphere DataStage clients installed on a Windows platform
- A connection to an InfoSphere DataStage server that runs on a Windows or UNIX platform

Tip: Windows servers can be on the same computer as the clients.
Prerequisites
This tutorial is intended for novice InfoSphere DataStage designers who want to learn how to create parallel jobs. Knowing about basic InfoSphere DataStage concepts, such as jobs, stages, and links, might be helpful, but is not required.

- Setting up the parallel job tutorial environment
  Before you can start the tutorial, your IBM InfoSphere DataStage and QualityStage Administrator must create folders, create the tutorial project, import source files, and complete other setup tasks.
- Module 1: Opening and running the sample job
  The tutorial includes a sample job that you explore to better understand basic concepts about jobs. You open the sample job, explore the stages that comprise the job, and then compile and run the job.
- Module 2: Designing your first job
  You learned how to open, compile, and run the sample job. However, that job was prebuilt, so it is time to learn how to design your own job.
- Module 3: Transforming data
  You developed a job that writes data from a source file to a target file. Now that you understand the basics of job design, you learn how to design a job that transforms data.
- Module 4: Loading data to a relational database
  In this module, you write the GlobalCo_billing data to a relational database. This database is the final target for the GlobalCo billing data that you transformed and cleansed.