Running a data transformation job

You use the Transform data page of DataStage® to run data transformation jobs.

About this task

Create either parallel jobs or sequence jobs to transform your data.
Parallel jobs
Parallel jobs consist of individual stages. Each stage describes a particular process, such as accessing a database or transforming data in some way. For example, one stage might extract data from a data source, while another transforms it. Parallel jobs bring the power of parallel processing to your data extraction and transformation applications.
Sequence jobs
For more complex designs, you can build sequence jobs that run multiple jobs as part of a single workflow. By using sequence jobs, you can integrate programming controls into your job workflow, such as branching and looping.

Procedure

  1. If the job is not already compiled, click the Compile icon on the toolbar to compile it.
    Messages indicate whether the compilation succeeded. If the compilation fails, error messages are shown for the whole job rather than for one stage at a time. Hover over a node to see any error messages for that node.
  2. Click the Run icon on the toolbar to run the job.

What to do next

You can also work with jobs in the following ways.
Editing an existing job
To edit an existing job, open the Jobs tab, click the vertical ellipsis menu, and then click Edit. The menu is the same in both the tile view and the list view.
Using containers
If a job has many stages and links, you can use containers to describe a particular sequence of steps. Containers are linked to other stages or containers in the job by input and output stages. They visually simplify a complex design, making it easier to understand on the job canvas. Both local and shared containers are available: a local container is accessible only to the job in which it is created, while a shared container is accessible to multiple jobs.

To create a container, drag the container icon from the palette to the canvas, and then double-click the container icon to open it. An input stage and an output stage are automatically provided in the container. Add stages and links between the input and output stages, and then rename the input and output links to match the names of the links that go into and come out of the container itself.

Renaming links and stages
You can rename links and stages from the Details card by clicking the pencil icon that is next to the name.
Changing link types
You can change the type of a link. For example, you can change a stream link into a reject link for nodes that support reject links.
Loading columns
You can load columns from table definitions as part of a stage. You can append to or replace the existing columns, and the columns are automatically propagated to downstream stages.
Mapping columns
Use the Columns tab in the Details card to map each column on an output link. You can load new columns from table definitions.
Editing column metadata 
You can define and edit column metadata. For example, for a column you can specify what delimiter character should separate text strings or you can set null field values.
Updating configuration files
You can create, edit, and delete configuration files. To work with a configuration file, go to the projects dashboard, click the settings icon, and then select Configurations. Select the configuration file that you want to work with from the drop-down menu, or select <new> to create a new configuration file. From there, you can save, check, or delete the configuration file.
A configuration file, such as the file that is specified by the APT_CONFIG_FILE environment variable, is a parallel engine file that defines the processing and storage resources that belong to your system. Processing resources include nodes. Storage resources include disks for the permanent storage of data and disks for the temporary storage of data (scratch disks). The parallel engine uses this information to determine how to arrange resources for parallel execution.
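For example, a minimal two-node configuration file might look like the following sketch. The node names, fastname value, and directory paths are placeholders that you replace with values from your own environment:

  {
    node "node1"
    {
      fastname "server1"
      pools ""
      resource disk "/opt/data/node1" {pools ""}
      resource scratchdisk "/opt/scratch/node1" {pools ""}
    }
    node "node2"
    {
      fastname "server1"
      pools ""
      resource disk "/opt/data/node2" {pools ""}
      resource scratchdisk "/opt/scratch/node2" {pools ""}
    }
  }

Each node entry defines one logical processing node. The fastname value identifies the host that runs the node, and the resource disk and resource scratchdisk entries point to the directories that the parallel engine uses for permanent and temporary storage.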
Previewing a data sample
You can preview a sample of data from relational connectors (over a live connection) and from sequential files.
Using the canvas settings
You can customize your canvas experience by changing various canvas settings. For example, you can add names to the links in your job. Settings for auto layout, showing and hiding annotations, showing and hiding link names, and showing and hiding node types are also available.

The Smart Palette feature arranges the connectors and stages in the palette according to how often you use them, which keeps your favorite connectors and stages at the top.

The Suggested Stages feature suggests stages when you click a stage on the canvas that has no outputs. The suggested stages are highlighted in the palette and displayed on the canvas with dotted lines. You can select one of the suggestions by either dragging the suggested stage from the palette or clicking the suggested stage on the canvas. When you add a stage by either method, the new stage is automatically linked to the stage that you originally clicked.

Preference settings are automatically saved for the next session.

Configuring parameters

You can create, edit, and delete job parameters of type encrypted, date, integer, float, pathname, date and time, and environment variable.

To create, edit, or delete job parameters, open the Jobs tab, click the vertical ellipsis menu, and then click Properties. Click the Parameters tab to see the options for adding, editing, and deleting parameters.

For parallel jobs, you can create runtime parameters by opening a job, clicking the View icon, selecting Properties, and then clicking the Parameters tab. Click + Add, specify the parameter type, name, default value, prompt, and help text, and then click OK. Next, click your data source, select the job parameter on the Properties tab, and click OK. Save and compile the job. When you run the job, the new parameter that you created is shown on the Parameters tab and is used in the run.
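In text properties, such as a file path for a sequential file, a job parameter is typically referenced by enclosing its name in number signs. The following sketch assumes a hypothetical parameter named SourceDir that holds a directory path:

  #SourceDir#/customers.csv

When the job runs, the reference is replaced with the value that you supply for the parameter.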

You can also create, edit, and delete List type parameters. List type parameters can contain multiple string values, giving you more flexibility and efficiency in your jobs.

Configuring parameter sets
You can create, edit, and delete parameter sets. To create a parameter set, click the + Create icon on the Parameter Sets dashboard. To edit a parameter set, click the vertical ellipsis menu, and then click Edit. To delete a parameter set, click the vertical ellipsis menu, and then click Delete.

Use parameter sets to define job parameters that you are likely to reuse in different jobs, such as connection details for a particular database. Then, when you need this set of parameters in a job design, you can insert them into the job properties from the parameter set.
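For example, an individual value in a parameter set is typically referenced as #SetName.ParameterName#. The following sketch assumes a hypothetical parameter set named DBConnect that contains Hostname and Username parameters, used in the connection properties of a database connector:

  Hostname: #DBConnect.Hostname#
  Username: #DBConnect.Username#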

Configuring table definitions
You can edit and delete table definitions. To edit a table definition, on the Table Definitions tab, click the vertical ellipsis menu next to the table definition, and then click Edit. The menu is the same in both the tile view and the list view. To delete a table definition, click the vertical ellipsis menu, and then click Delete.
Clustering and identifying jobs based on machine learning
DataStage uses a machine learning thread to identify similar jobs, for example, jobs that connect to the same source database, use the same form of processing, and connect to the same target database. DataStage can then group those jobs into useful clusters. By reviewing these clusters, you can find and eliminate duplicate or forgotten jobs, freeing up computing resources.
Automatically adding a single column table definition for Kafka
DataStage improves the usability of Kafka by automatically adding a table definition with a single column.
Reviewing log files after a job run
Open the project that contains the job and check the job status at the bottom of each job tile. Click the details at the bottom of the tile to view the job log and copy log details.
Running jobs on multiple nodes
To run data transformation jobs on multiple nodes, you must set up a configuration file to distribute resources. For more information, see Running a job on multiple nodes.
Scheduling jobs
To schedule a job, open the job and click the schedule icon in the job canvas. Set the job to automatically run at specified times every day, week, or month. Click Limits to set the threshold for canceling scheduled jobs after a certain number of warnings are encountered. You must successfully compile a job before you can create a schedule for it.
Exporting jobs
To export a job, open the Jobs dashboard and click the export icon. Select the job that you want to export and optionally include job dependencies, the job design, and job executables. Click Export. After the export completes, click the export icon again, and select the .isx file for your export. Click Save File to save the file in the default downloads folder for your browser.
Importing jobs
To import a job, table definition, or connection, open the Jobs dashboard and switch to the list view. Click the vertical ellipsis menu and select Import. Drag or select the .isx file that contains the assets that you want to import. Select Replace existing assets if you want to overwrite assets that already exist in your project. Optionally, review the assets and then select Import. Refresh the page to view the assets that are imported.
Managing source control with Git
Integrate your Git server with transformation jobs. This integration allows you to publish jobs and related artifacts to different Git branches and load other versions of a job from Git onto the job design canvas. For more information, see Managing source control with Git.