Creating a DataStage flow

DataStage® flows are the design-time assets that contain data integration logic.

You can create an empty DataStage flow and add connectors and stages to it, or you can import an existing DataStage flow from an ISX or ZIP file.

The basic building blocks of a flow are:
  • Data sources that read data
  • Stages that transform the data
  • Data targets that write data
  • Links that connect the sources, stages, and targets
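The building blocks above can be pictured as a small directed graph: sources and targets are nodes at the edges of the flow, stages are nodes in between, and links are the edges that connect them. The following sketch models that structure in Python; the class, method, and node names are illustrative only and are not part of any DataStage API.

```python
# Conceptual model of a DataStage flow as a directed graph.
# All names here are illustrative, not a DataStage API.

class Flow:
    def __init__(self):
        self.nodes = {}   # node name -> kind ("source", "stage", or "target")
        self.links = []   # (from_node, to_node) pairs

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def link(self, src, dst):
        # A link connects two nodes that are already on the canvas.
        assert src in self.nodes and dst in self.nodes
        self.links.append((src, dst))

flow = Flow()
flow.add_node("customers_db", "source")   # reads data
flow.add_node("dedupe", "stage")          # transforms data
flow.add_node("warehouse", "target")      # writes data
flow.link("customers_db", "dedupe")
flow.link("dedupe", "warehouse")
```

Reading the links in order gives the path the data takes: source, then stage, then target.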

Figure: Palette and canvas in IBM DataStage

DataStage flows and their associated objects are organized in projects. To start, open an existing project or create a new project.

Creating a DataStage flow

To create a DataStage flow, complete the following steps.

  1. Open an existing project or create a project.
  2. On the Assets tab, click New asset > Graphical builders > DataStage.
  3. On the Create a DataStage flow page, use one of the following two methods to create the DataStage flow:
    • Click the New tab, add the necessary details for the DataStage flow, then click Create. The new DataStage flow opens with no objects on the DataStage designer canvas.
    • Click the Local file tab, upload an ISX or ZIP file from your local computer, then click Create. When the import process is complete, close the import report page and open the imported DataStage flow from the Assets tab of the project.
  4. Drag connectors or stages from the palette onto the DataStage design canvas as nodes and arrange them as needed. To connect two nodes, hover your pointer over a node until an arrow appears on it, then click the arrow icon and drag it to the node that you want to connect to.

    This action creates a link between the nodes.

    To connect to remote data, see Connecting to a data source in DataStage.

  5. Double-click a node to open its properties panel, where you can specify configurations and settings for the node.
  6. Click Run when you are done setting up the flow.

    The flow is automatically saved, compiled, and run. You can view logs for both the compilation and job run.
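The run step above can be sketched as a two-phase sequence: a compile phase that validates the flow, followed by a run phase, with logs produced for both. This is a conceptual model only, not the DataStage implementation; the validation rule shown (every node must be linked) is one illustrative example of a compile check.

```python
# Conceptual sketch of "Run": compile the flow, then run it,
# collecting logs for both phases. Not the DataStage implementation.

def compile_and_run(nodes, links):
    """nodes: {name: kind}; links: [(src, dst)] -- illustrative shapes."""
    logs = ["COMPILE: started"]
    linked = {n for pair in links for n in pair}
    unlinked = [n for n in nodes if n not in linked]
    if unlinked:
        # An illustrative compile error: a node with no links.
        logs.append(f"COMPILE: error, unlinked nodes: {unlinked}")
        return logs, False
    logs.append("COMPILE: succeeded")
    logs.append("RUN: job started")
    logs.append("RUN: job finished")
    return logs, True

logs, ok = compile_and_run({"src": "source", "tgt": "target"},
                           [("src", "tgt")])
```

If the compile phase fails, the run phase never starts, which matches the behavior of viewing separate compilation and job-run logs.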

After the flow is compiled into a job, you can rerun the job, set a schedule, monitor the job, and update the environment that you want to run it in. For more information about updating the DataStage environment where you want your jobs to run, see Setting DataStage environment definitions.

Editing a DataStage flow

You can use the following actions to edit a DataStage flow.

  • Drag a stage or connector from the palette and drop it on a link between two nodes that are already on the DataStage design canvas. Links are automatically added on either side of the new node, and columns are automatically propagated. Click Run again to see the results.
  • Manually detach and reattach links from nodes on the DataStage design canvas by hovering your pointer over them and clicking the end points of the links.
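The drag-onto-a-link edit described above can be sketched as a graph operation: the original link is replaced by two new links, one on each side of the dropped node. The function and node names below are illustrative, not DataStage internals.

```python
# Sketch of dropping a new node onto an existing link: the link
# (src, dst) is split into (src, new_node) and (new_node, dst).
# Names are illustrative.

def drop_on_link(links, link, new_node):
    src, dst = link
    remaining = [l for l in links if l != link]
    return remaining + [(src, new_node), (new_node, dst)]

links = [("customers_db", "warehouse")]
links = drop_on_link(links, ("customers_db", "warehouse"), "dedupe")
# links is now [("customers_db", "dedupe"), ("dedupe", "warehouse")]
```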

Considerations

Naming files in sources and targets to avoid data corruption
In most cases, do not use the same file name in the source as in the target when the source and target point to the same database or storage system. This rule applies to files and to database tables. If the names are the same, the data can be corrupted.
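A flow can guard against this rule with a simple check before it runs: flag any source/target pair that uses the same name on the same system. The helper and field names below are illustrative, not part of DataStage.

```python
# Illustrative check for the naming rule above: the same name on the
# same database or storage system can corrupt data.

def same_name_conflict(source, target):
    return (source["system"] == target["system"]
            and source["name"] == target["name"])

src = {"system": "db1", "name": "customers"}
tgt = {"system": "db1", "name": "customers"}
# same_name_conflict(src, tgt) -> True: choose a different target name
```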
Writing and reading persistent data
When writing data from a stage, use persistent storage that is mounted at /px-storage so that all parallel processes that are running on the conductor or compute pods can access the data. Paths that are local to individual pods, such as /tmp, are not recommended.
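The storage guideline above amounts to a path check: accept only paths under the shared /px-storage mount, and reject pod-local paths such as /tmp. The helper below is an illustrative sketch, not a DataStage function.

```python
# Illustrative check: is the path under the shared /px-storage mount,
# which all conductor and compute pods can see?
from pathlib import PurePosixPath

def is_shared_path(path):
    parts = PurePosixPath(path).parts
    return parts[:2] == ("/", "px-storage")

is_shared_path("/px-storage/out/data.csv")   # True: shared mount
is_shared_path("/tmp/data.csv")              # False: pod-local path
```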
Automatic column propagation
When you change a column's metadata, the changes are automatically propagated downstream. However, after you modify a column's metadata in a stage, upstream changes no longer apply to that column. Similarly, if you delete a column, modifying the column in a later stage does not add the column back.
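The propagation rule above can be sketched as follows: upstream metadata flows through until a stage overrides a column, at which point the override wins and upstream changes to that column stop applying; deleted columns stay gone. This is a conceptual model only, not DataStage internals.

```python
# Conceptual model of column propagation: overrides shadow upstream
# metadata, and deleted columns are not re-added downstream.

def effective_columns(upstream, overrides, deleted):
    cols = {}
    for name, meta in upstream.items():
        if name in deleted:
            continue                            # deleted columns stay gone
        cols[name] = overrides.get(name, meta)  # a local override wins
    return cols

upstream = {"id": "int32", "name": "varchar(20)", "obsolete": "date"}
overrides = {"name": "varchar(50)"}   # this stage modified "name"
deleted = {"obsolete"}                # an earlier stage deleted "obsolete"
cols = effective_columns(upstream, overrides, deleted)
# cols -> {"id": "int32", "name": "varchar(50)"}
```

Note that a later upstream change to "name" (for example, to varchar(30)) would still be shadowed by the override, matching the rule that upstream changes do not apply once a column's metadata is modified.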
Adding parameters
See Adding parameters.