Overview of IBM DataStage Flow Designer

IBM® DataStage® Flow Designer is a web-based user interface for DataStage, in contrast to DataStage Designer, which is a Windows-based thick client. You can use IBM DataStage Flow Designer to create, edit, load, and run DataStage jobs, and it offers many benefits over DataStage Designer. You can also use IBM DataStage Flow Designer and DataStage Designer to complement each other.

IBM DataStage Flow Designer gives you the following benefits:
Backwards compatibility – no need to migrate jobs
Many companies have thousands of jobs in a single project, and they depend on these jobs to run 24 hours a day, 7 days a week. Migration, with the likely possibility of errors and outages, is not an option for them. You can take any existing DataStage job and render it in IBM DataStage Flow Designer, so there’s no need to migrate those jobs to a new location.
Time and money saving
Some companies have hundreds of DataStage developers. Keeping up with the latest version of the software not only requires a server upgrade, but also an upgrade of all the thick clients. Valuable time and resources are used to make these upgrades. You don't need to worry about thick-client upgrades with the web-based IBM DataStage Flow Designer. You can save even more money with IBM DataStage Flow Designer because you don’t need to purchase Microsoft Windows or Citrix licenses.
More productivity from developers
IBM DataStage Flow Designer has features like built-in search and a quick tour to get you jump-started, automatic metadata propagation, and simultaneous highlighting of all compilation errors. Developers can use these features to be more productive.
  • Search: Find what you need fast by using the flexible Search feature.
  • Quick Tour: Take the built-in Quick Tour to familiarize yourself with the product.
  • Automatic metadata propagation: Changing a stage in a DataStage job can be time consuming because you must go to each subsequent stage and redo the change. IBM DataStage Flow Designer automatically propagates the metadata to subsequent stages in that flow, increasing productivity.
  • Highlighting of all compilation errors: The DataStage thick client identifies compilation errors one at a time. Large jobs with 30 to 50 stages or more can take a long time to troubleshoot one error at a time. IBM DataStage Flow Designer highlights all errors and lets you see each problem with a quick hover over the stage, so you can fix multiple problems before you recompile the job.
  • Built-in machine learning features like Smart Palette and Suggested Stages: These machine learning features provide efficient stage palette ordering and tailored suggestions for next stages to add to a job, which reduces developer friction. The Smart Cluster feature for jobs also lets you quickly identify sets of jobs that might be duplicates or slightly varying clones of each other.

Dashboards

IBM DataStage Flow Designer features the following dashboards, providing quick access to the essential parts of DataStage:
  • Projects
  • Connections
  • Table definitions
  • Jobs
  • Parameter sets
Each of the dashboards has the same layout, with common functionality.
  • Items are shown in tile view, with the option for you to select a list view instead.
  • Menu options include Edit, Rename, Clone, and Delete (support varies depending on dashboard).
  • When you try to delete a connection, table definition, or job, a dialog box shows you where else the item is used so you can decide whether you still want to delete it.
  • You create a live connection or a job by clicking the + Create icon. When the live connection is used on the canvas, you can view the data for an existing table from the details card.
  • You can navigate jobs by category by selecting Group by > Category from the Jobs dashboard. This grouping is effectively a ‘folder view.’ You can then drill down from each category into its lower-level categories. Use the breadcrumb trail to navigate the category levels.
  • You can rename jobs, connections, and table definitions from the corresponding dashboards by clicking the vertical ellipsis icon on each object.
  • You can import connections from the toolbar on the Connections tab or from the Live Connection Asset Browser.
  • You import table definitions by using command-line tools (see the sketch after this list).
  • Click the light bulb Take a tour icon to take a tour of dashboard features.
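
As an illustration, if your installation includes the istool command-line client, importing table definitions from an .isx archive might look like the following sketch. The host names, credentials, project name, and archive path are placeholders, and the available options can vary by Information Server release, so check the istool documentation for your version:
  # Sketch only: adjust the domain, credentials, archive, and project for your environment.
  istool import \
    -domain services_host:9443 \
    -username isadmin -password mypassword \
    -archive /tmp/table_definitions.isx \
    -datastage '"engine_host/MyProject"'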

Jobs

In IBM DataStage Flow Designer, you can load jobs from DataStage or GitHub, and you can create, edit, and publish jobs.

Creating a job
Click the + Create icon on the Jobs dashboard to create a job. You can create a parallel, sequence, or Spark job. Spark jobs can only be created if the InfoSphere® Information Server services tier is running on RHEL.
Using the canvas palette
You use the canvas palette to select connectors or stages to work with. You can add a source or target stage to the canvas in two ways:
  1. Drag a database connector from the palette to the canvas, which opens the Table Definition Asset Browser. This browser shows a schema representation of the connection. You use the Table Definition Asset Browser to configure the connection.
  2. Drag the ‘Connections’ connector from the palette to the canvas, which opens the Live Connection Asset Browser. From the browser, you pick an existing connection, schema, table, or column. When you choose a table from the Live Connection Asset Browser, you can also select an available view that was created by using the Data Virtualization service. Because the connection already exists, the connection details, such as URL, user ID, password, and table, are added to the connector automatically.

Find your connector or stage quickly by using the find icon in the palette.

To delete an object from the canvas, select the object on the canvas that you want to delete, then click the Delete icon on the toolbar. Or select the object and press DELETE on your keyboard if you are using Windows. Press FN + DELETE if you are using a Mac.

Using containers
If a job has many stages and links, you can use containers to describe a particular sequence of steps. Containers are linked to other stages or containers in the job by input and output stages. They visually simplify a complex design, making it easier to understand on the job canvas. Both local and shared containers are available. A local container is accessible only to the job in which it is created. A shared container is accessible to multiple jobs.

To create a container, drag the container icon from the palette to the canvas, then double-click the container icon to open it. An input stage and an output stage are automatically provided in the container. Add more stages and links between the input and output stages, then rename the input and output links to match the names of the links going into and coming out of the container itself.

Reviewing node details with the side bar
To review details for a node on the canvas, double-click the node and the side bar opens on the right side of the canvas. From there, you can set configuration values such as operations, keys, properties, and columns.
Saving a job
To save a job, click the Save icon in the upper left toolbar. Jobs that you save from IBM DataStage Flow Designer are visible in the DataStage thick client and jobs that you save in the thick client are visible in IBM DataStage Flow Designer.
Compiling and running a job
When you’re done adding items to the canvas, use the toolbar on the top of the canvas to compile the job. Messages indicate whether the compilation was successful or not, and if the compilation fails the error messages are shown for the whole job, not just one stage at a time. Hover over a node to see any error messages for that node.
Editing an existing job
To edit an existing job, open the Jobs dashboard, click the vertical ellipsis menu, and then click Edit. The menu is the same in both the tile view and the list view.
Scheduling a job
To schedule a job, open the job on the Jobs dashboard, click the vertical ellipsis menu, and click Schedule. Use the job scheduling feature to automatically run jobs at specified times every day, week, or month. Click Properties to set the threshold for canceling scheduled jobs after a certain number of warnings are encountered. You must successfully compile a job before you can create a schedule for it.
Importing jobs
To import a job, table definition, or connection, open the Jobs dashboard and switch to the list view. Click the vertical ellipsis menu and select Import. Drag and drop or select the .isx file that contains the assets that you want to import. Select Replace existing assets if you want to overwrite assets that already exist in your project. Optionally, review the assets and then select Import. Refresh the page to view the assets that are imported.
Exporting jobs
To export a job, open the Jobs dashboard and click the export icon. Select the job that you want to export and optionally include job dependencies, the job design, and job executables. Click Export. After the export is processed, click the export icon again, and select the .isx file for your export. Click Save File to save the file in the default downloads folder for your browser.
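If you prefer to script exports, an equivalent export with the istool command-line client might look like the following sketch. The host names, credentials, and the project and job names are placeholders, and the .pjb suffix here assumes a parallel job:
  # Sketch only: adjust the domain, credentials, archive, and asset path for your environment.
  istool export \
    -domain services_host:9443 \
    -username isadmin -password mypassword \
    -archive /tmp/MyJob.isx \
    -datastage '"engine_host/MyProject/Jobs/MyJob.pjb"'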
Renaming links and stages
You can rename links and stages from the detail card by clicking the pencil icon that is next to the name.
Changing link types
You can change the type of a link. For example, you can change a stream link into a reject link for nodes that support reject links. For more information on link types, see Linking parallel stages.
Loading columns
You can load columns from table definitions as part of a stage. You can append to or replace existing columns, and the changes are automatically propagated to downstream stages.
Mapping columns
Use the Columns tab in the Details card to map input columns to the columns on an output link. You can also load new columns from table definitions.
Editing column metadata 
You can define and edit column metadata. For example, for a column you can specify what delimiter character should separate text strings or you can set null field values.
Updating configuration files
You can create, edit, and delete configuration files. To work with a configuration file, open the Projects dashboard. Find the project that you want to create or modify a configuration file for, and click the vertical ellipsis menu. Select Configurations, and then select the configuration file that you want to work with from the drop-down menu. To create a new configuration file, select <new>. From there, you can save, check, or delete the configuration file.
A configuration file, such as the file identified by the APT_CONFIG_FILE environment variable, is a parallel engine file that defines the processing and storage resources that belong to your system. Processing resources include nodes. Storage resources include both disks for the permanent storage of data and disks for the temporary storage of data (scratch disks). The parallel engine uses this information to determine how to arrange resources for parallel execution.
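For reference, a minimal two-node configuration file typically looks like the following sketch; the host name and resource paths are placeholders for your own environment.
  {
    node "node1"
    {
      fastname "etl_host"
      pools ""
      resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
      resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
      fastname "etl_host"
      pools ""
      resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
      resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
  }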
Configuring parameters
You can create, edit, and delete job parameters. To access the job parameter workspace, open the Jobs dashboard, click the vertical ellipsis menu, and then click Properties.

Job parameters allow you to specify properties and values without having to define them first. At job run time, each parameter can use either its defined default or a new value, which makes jobs reusable and flexible. For example, you might define a job against a particular test database; when the job is thoroughly tested and ready for production, key connector properties, such as database name, table name, user, and password, can be replaced with job parameters and the job can be deployed to a production environment.
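As an illustration, stage and connector properties in a job design reference job parameters with the #parameter# notation, and the actual values are supplied when the job runs. The parameter names in this sketch are hypothetical:
  Server   = #DBServer#
  Database = #DBName#
  User     = #DBUser#
  Password = #DBPassword#
  Table    = #TableName#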

You can create, edit, and delete job parameters of type encrypted, date, integer, float, pathname, date and time, and environment variable, as well as configuration files for parallel and sequence jobs.

For Spark jobs, all connector properties for a connection, such as host, port, database, user name, table name, row limit, and byte limit, are supported as string type job parameters. SSL parameter substitution is not supported for Spark jobs; the certificate must be explicitly identified in the property for specific connections, such as the HDFS connector.

After you define job parameters, you are prompted to either accept the parameter defaults or replace them when the job is run.

You can also create, edit, and delete List type parameters. List type parameters can contain multiple string values, giving you more flexibility and efficiency in your jobs.

Configuring parameter sets
You can create, edit, and delete parameter sets. To create a parameter set, click the + Create icon on the Parameter Sets dashboard. To edit a parameter set, click the vertical ellipsis menu, and then click Edit. To delete a parameter set, click the vertical ellipsis menu, and then click Delete.

Use parameter sets to define job parameters that you are likely to reuse in different jobs, such as connection details for a particular database. Then, when you need this set of parameters in a job design, you can insert them into the job properties from the parameter set. You can also define different sets of values for each parameter set.
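When a parameter set is added to a job, its parameters are referenced with the set name as a qualifier, in the form #SetName.ParameterName#. For example, a hypothetical parameter set named psWarehouse that holds database connection details might be referenced in connector properties like this:
  Database = #psWarehouse.DBName#
  User     = #psWarehouse.DBUser#
  Password = #psWarehouse.DBPassword#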

Previewing a data sample
You can preview a sample of data from relational connectors that use a live connection, and from sequential files.
Using the canvas settings
You can customize your canvas experience by changing various canvas settings. For example, you can add names to the links in your job. Settings for auto layout, show/hide annotations, show/hide link names, and show/hide node types are also available.

The Smart Palette feature arranges the connectors and stages by how often they are used. This arrangement keeps your most frequently used connectors and stages at the top.

The Suggested Stages feature suggests stages when you click a stage on the canvas that has no outputs. The suggested stages are highlighted in the palette and displayed on the canvas with dotted lines. You can select one of the suggestions by either dragging the suggested stage from the palette or clicking the suggested stage on the canvas. When you add a stage by either method, the added stage is automatically linked to the stage that you clicked that had no outputs.

You can configure partitioning for your data to help ensure an even load across your processors. For more information on partitioning, see Partitioning.

Preference settings are automatically saved for the next session.

Managing source control
You can integrate IBM DataStage Flow Designer with Git repositories to publish parallel and Spark jobs, as well as related artifacts, and load different job versions from Git onto the canvas. You can create new branches, map a Git version to a version in the metadata repository (xmeta), and satisfy requirements around continuous integration and continuous delivery (CICD) and auditing. The supported Git repositories are GitHub, BitBucket, Microsoft Team Foundation Server, and GitLab.

If you are an administrator, you set up access to Git by clicking your profile icon, then clicking Setup > Server > Git.

If you are a regular user, you manage your Git settings by clicking your profile icon, then clicking Setup > Git user. If you do not have a Git repository ID, you are given the option to clone the Git repository.

Moving jobs from Git branches to different environments by using the command-line interface
You can move InfoSphere DataStage parallel and Spark jobs from a Git repository into the Information Server XMETA repository by using the command line. This enables a continuous integration and continuous delivery (CICD) model because you can develop jobs in one environment and move them to another. For example, after testing jobs in a quality assurance (QA) environment, you can move them to a production environment for deployment.
Clustering and identifying jobs based on machine learning
IBM DataStage Flow Designer can use a machine learning thread to identify similar jobs, for example, jobs that connect to the same source database, use the same form of processing, and connect to the same target database. IBM DataStage Flow Designer can then group those jobs into useful clusters. By reviewing these clusters, you can find and eliminate duplicates or forgotten jobs, freeing up computing resources.
Automatically adding a single column table definition for Kafka
IBM DataStage Flow Designer improves the usability of Kafka by automatically adding a table definition with a single column.

Table definitions

Table definitions are a key element to job design and specify the data to be used at each stage of a job. Table definitions are stored in the metadata repository and are shared by all the jobs in a project. You can load table definitions into supported stages by clicking Load on the Columns tab on the Inputs page for supported stages. You can create, import, or edit table definitions by using IBM DataStage Flow Designer.
Creating a table definition
Click the + Create icon on the Table Definitions dashboard to create a table definition. On the General tab, specify information about your data source. On the Columns tab, click Add to manually add new columns or click Load to select existing columns from your data source. The Columns tab has the following fields:
Column name
The name of the column.
Key
Indicates whether the column is part of the primary key.
SQL type
The SQL data type.
Extended
This column gives you further control over data types used in parallel jobs when National Language Support is enabled. The available values depend on the base data type:
  • For Char, VarChar, and LongVarChar, you can select Unicode to specify that these columns require mapping (by default, each character is assumed to represent an ASCII character that does not need mapping).
  • For Time, you can select microseconds to indicate that the field contains microseconds. (By default, a Time type comprises hours, minutes, and seconds.)
  • For Timestamp, you can select microseconds to indicate that the field contains microseconds. (By default, the time part of a Timestamp type comprises hours, minutes, and seconds.)
  • For integer types, you can select unsigned to specify that the underlying data type is a uint (unsigned integer).
Length
The data precision. This is the length for Char data and the maximum length for VarChar data.
Scale
The data scale factor. (For Sequential File stages the scale should not exceed 9.)
Nullable
Specifies whether the column can contain null values. This field indicates whether the column is subject to a NOT NULL constraint; it does not itself enforce the constraint.
Editing a table definition
To edit an existing table definition, open the Table Definitions dashboard, click the vertical ellipsis menu, and then click Edit. The menu is the same in both the tile view and the list view.
Importing table definitions
To import table definitions, open the Jobs dashboard and switch to the list view. Click the vertical ellipsis menu and select Import. Drag and drop or select the .isx file that contains the table definitions that you want to import. Select Replace existing assets if you want to overwrite assets that already exist in your project. Optionally, review the assets and then select Import. Refresh the page to view the assets that are imported.

InfoSphere DataStage on Spark

You can run IBM DataStage Flow Designer jobs on two runtime engines: the traditional parallel engine (PX) or a Spark engine.

To run Spark jobs, you must configure a connection to a remote Spark cluster in the Setup > Server > Spark menu.

Jobs with the job type parallel or sequence can be run only on the parallel engine. Jobs with the job type Spark can be run only on a Spark engine.