Dynamic workload management in DataStage

Use dynamic workload management to run DataStage® workloads more quickly and efficiently across the available compute pod resources in each PX instance.

One of the great strengths of DataStage is that, when you design flows and run parallel jobs, you do not have to worry about the underlying structure of your system beyond being aware of its parallel processing capabilities. If your compute pod resources change, if you upgrade or improve your system, or if you move the flow to another environment, you do not necessarily have to change your flow design.

DataStage learns about the shape and size of the system from a configuration file. The configuration file describes the resources that are available to a particular job, and the parallel engine organizes the job's work according to what that file defines. If more PX instance compute pods are added to the system, or you make other system changes, you change the file, not the DataStage jobs or flows themselves.
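
For example, a static parallel engine configuration file with two processing nodes has a structure like the following sketch. The node names, host names, and directory paths are illustrative placeholders; actual values depend on your PX instance compute pods and storage layout.

    {
        node "node1"
        {
            fastname "ds-px-compute-0"
            pools ""
            resource disk "/px-storage/datasets/node1" {pools ""}
            resource scratchdisk "/px-storage/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "ds-px-compute-1"
            pools ""
            resource disk "/px-storage/datasets/node2" {pools ""}
            resource scratchdisk "/px-storage/scratch/node2" {pools ""}
        }
    }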

With dynamic workload management, DataStage generates a parallel configuration file at job run time based on the PX instance and the runtime environment definition. The number of compute pods in the generated configuration file is determined by the load on the compute pods at job run time.

Dynamic workload management also handles, among other tasks, queuing jobs when instance limits are reached, allocating compute pods for jobs that set the APT_CONFIG_FILE environment variable, and auto-scaling compute pods.

If auto-scaling is enabled, more compute pods are deployed for the PX instance as needed to meet increasing workloads. When workloads decrease, the compute pods for the PX instance are scaled back down until they are needed again. When auto-scaling is enabled for a PX instance, a job that runs in that PX instance automatically runs across the available compute pods without developer intervention.

You can also create your own dynamic or static parallel engine configuration files and set the APT_CONFIG_FILE environment variable to point to a specific configuration file. For more information, see Creating and setting the APT_CONFIG_FILE environment variable in DataStage.
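
For example, you might point the environment variable at a custom configuration file that is stored on the PX instance storage. The file path below is a hypothetical example; the actual location depends on where you store your configuration files.

    APT_CONFIG_FILE=/px-storage/config/two_node.apt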

Dynamic workload management generates parallel engine configuration files in real time based on the resources (memory and CPUs, not necessarily the number of jobs) that are available on the currently running compute pods. If no compute pods are available because they are all at their resource limits and auto-scaling is disabled, jobs are queued.

You can create a dynamic configuration file that automatically chooses which compute pod to run jobs on at run time, eliminating the need to create the configuration manually. You specify a configuration file by setting the APT_CONFIG_FILE environment variable.

The APT_CONFIG_FILE environment variable identifies the parallel engine configuration file that defines the logical nodes that a job runs on, along with disk resources such as scratch space. Because you can have many different parallel engine configuration files on disk, the environment variable tells the job which one to use. If you don't specify APT_CONFIG_FILE, a configuration file is automatically generated for you when the job runs. For more information, see Creating and setting the APT_CONFIG_FILE environment variable in DataStage.
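
As a sketch of how logical nodes relate to physical resources, the following example (again with placeholder host names and paths) defines two logical nodes on the same compute pod, so a job runs with two partitions on that pod. The number of node entries in the file sets the default degree of parallelism, and each node entry can point at its own scratch space.

    {
        node "part1"
        {
            fastname "ds-px-compute-0"
            pools ""
            resource disk "/px-storage/datasets/part1" {pools ""}
            resource scratchdisk "/px-storage/scratch/part1" {pools ""}
        }
        node "part2"
        {
            fastname "ds-px-compute-0"
            pools ""
            resource disk "/px-storage/datasets/part2" {pools ""}
            resource scratchdisk "/px-storage/scratch/part2" {pools ""}
        }
    }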

Dynamic workload management is used by default in DataStage versions 4.0.2 and later.