Developing DataStage and QualityStage parallel jobs

You design parallel jobs to transform and to cleanse data.

Parallel jobs are compiled and run on the IBM® InfoSphere® Information Server engine.

DataStage parallel jobs
Parallel jobs consist of individual stages. Each stage describes a particular process, such as accessing a database or transforming data in some way.

Designing parallel jobs
Parallel jobs bring the power of parallel processing to your data extraction and transformation applications.

Parallel jobs and NLS
These topics give details about NLS in InfoSphere DataStage® parallel jobs.

Stage editors
The parallel job stage editors all use a generic user interface, with the exception of the Transformer, Shared Container, and Complex Flat File stages.

Reading and writing files
Use the stages in the File section of the palette to read data from files and write data to files.

Processing data
Use the stages in the Processing section of the palette to manipulate data that you have read from a data source before writing it to a data target.

Cleansing your data
Use the stages in the Data Quality section of the palette to cleanse your data.

Restructuring data
Use the stages in the Restructure section of the palette to restructure complex data.

Debugging parallel jobs
Run your parallel jobs in debug mode or use the debugging stages to help you debug your parallel job designs.

Viewing the job log
View the entries in the job log when you run your current job within the IBM InfoSphere DataStage and QualityStage® Designer client.

Introduction to InfoSphere DataStage Balanced Optimization
Use Balanced Optimization to improve the performance of some InfoSphere DataStage parallel jobs.

Managing data sets
Parallel jobs use data sets to store data being operated on in a persistent form. Data sets are operating system files, each referred to by a descriptor file, usually with the suffix .ds.

The parallel engine configuration file
Use configuration files to specify what processing, storage, and sorting facilities on your system are used to run a parallel job.

Grid deployment
Grid computing can improve the ability of any IT organization to maximize resource value. Information integration solutions that are built on grid technology can increase computing capacity at a lower cost.

Remote deployment
Remote deployment of parallel jobs allows job scripts to be stored and run on a separate machine from the engine tier. The remote deployment option can, for example, be used to run jobs on a computer grid.

Schemas
Schemas are an alternative way for you to specify column definitions for the data used by parallel jobs.

Parallel Transform functions
These topics describe the functions that are available from the expression editor under the Function... menu item. You can use these functions when defining a column derivation in a Transformer stage.

Fillers
These topics describe how to create fillers for the Complex Flat File stage.