Designing IBM InfoSphere DataStage and QualityStage jobs as services

Exposing IBM® InfoSphere® DataStage® jobs and IBM InfoSphere QualityStage® jobs as services implies a set of constraints and guidelines. The service-oriented architecture (SOA) platform supports three job topologies for different load and work style requirements: batch jobs, batch jobs with a service output stage, and jobs with service input and output stages.

The design of a job determines whether it is always up and running or runs once to completion. All jobs that are exposed as services process requests on an ad-hoc, 24x7 basis. The IBM InfoSphere Information Services Director server starts job instances on one or more InfoSphere DataStage servers for load balancing and scalability.

You must specify whether a job is always running or enabled for service in the job properties window of the IBM InfoSphere DataStage and QualityStage Designer.

Make sure that the column names in your InfoSphere DataStage and InfoSphere QualityStage jobs do not contain a numeric digit followed by a lowercase letter, such as 2b. A column name of this type causes your jobs to fail when you invoke them as InfoSphere Information Services Director services.
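The following sketch shows one way to screen column names for this pattern before you deploy a job as a service. It is a stand-alone illustration, not part of any IBM tool: the function name and the sample column names are hypothetical, and the check assumes that "digit" and "lowercase letter" mean the ASCII ranges 0-9 and a-z.

```python
import re

# Pattern from the guideline: a numeric digit immediately followed by a
# lowercase letter, such as "2b". Restricting to ASCII is an assumption.
DIGIT_THEN_LOWER = re.compile(r"[0-9][a-z]")


def risky_column_names(column_names):
    """Return the names that contain a digit followed by a lowercase letter."""
    return [name for name in column_names if DIGIT_THEN_LOWER.search(name)]


# Hypothetical column names, for illustration only.
columns = ["CUSTOMER_ID", "ADDR_2b", "LINE2", "ZIP4_EXT", "phone1a"]
print(risky_column_names(columns))  # ['ADDR_2b', 'phone1a']
```

Names such as LINE2 and ZIP4_EXT pass because the digit is followed by the end of the name or by an underscore, not by a lowercase letter.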

A service accepts requests from client applications, mapping request data to input rows and passing them to the underlying jobs. A job instance can include database lookups, transformations, data standardization and matching, and other data integration tasks. A job instance can then return output rows that can be mapped to service response data and sent back to the client.

Batch jobs

Topology I uses new or existing batch jobs that are exposed as services. A batch job starts on demand: each service request starts one instance of the job, and that instance runs to completion. A job of this type typically initiates a batch process from a real-time process that does not need direct feedback on the results. It is tailored for processing bulk data sets and can accept job parameters as input arguments. Topology I jobs have the following characteristics:
Start and stop times
The elapsed time for starting and stopping a batch job, also known as latency, is high. This factor contributes to a low throughput rate in communication with the service client.
Job instances
The Information Service Framework (ISF) agent starts job instances on demand to process service requests, up to a maximum that you configure. For load balancing, you can run the jobs on multiple InfoSphere DataStage servers.
Input and output
An information service that is based on a batch job can use job parameters as input arguments. This type of service returns no output. When you design the information service, you can set values for job parameters. If the job ends abnormally, the service client receives an exception.
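The characteristics above can be modeled with a small sketch. It is a toy simulation, not an IBM API: the class, the exception, the fake_job function, and the LOAD_DATE parameter are all hypothetical. It also assumes that a request waits when the configured maximum number of instances is already running, which the text above does not state.

```python
import threading


class JobAbended(Exception):
    """Hypothetical exception that the client sees when a job ends abnormally."""


class BatchJobService:
    """Illustrative model of a Topology I service; not an IBM API."""

    def __init__(self, run_job, max_instances):
        self._run_job = run_job  # callable(params) -> True on normal completion
        self._slots = threading.BoundedSemaphore(max_instances)

    def invoke(self, **job_params):
        # Each request starts one job instance that runs to completion.
        with self._slots:  # assumption: wait while the maximum is in use
            finished_ok = self._run_job(job_params)
        if not finished_ok:
            raise JobAbended(f"job failed for parameters {job_params}")
        # A Topology I service returns no output.
        return None


def fake_job(params):
    # Stand-in for a real job run: fails when LOAD_DATE is missing.
    return "LOAD_DATE" in params


service = BatchJobService(fake_job, max_instances=2)
service.invoke(LOAD_DATE="2024-01-31")  # completes and returns None
```

Calling service.invoke() without the LOAD_DATE parameter raises JobAbended, which mirrors the exception that the service client receives when a job ends abnormally.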

Batch jobs with a service output stage

Topology II uses an existing batch job and adds an output stage. The InfoSphere Information Services Director output stage is the exit point from the job, returning one or more rows to the client application as a service response. Its table definition maps to the output arguments of a service operation, such as the return value of a Web service operation. Return values can consist of an atomic value (one column), a structure (multiple columns), or an array of structures (multiple rows). As the following figure shows, these jobs typically initiate a batch process from a real-time process that requires feedback or data from the results. They are designed to process large data sets and can accept job parameters as input arguments. In all other respects, the requirements for a Topology II job are identical to those for Topology I jobs.

Figure 1. Topology II job
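The three return shapes (atomic value, structure, array of structures) can be illustrated with a short sketch. It is a simplified model, not product code: the function name, the shape labels, and the sample columns are hypothetical, and rows are represented as plain tuples.

```python
def shape_response(rows, columns, shape):
    """Map output rows to a response value in one of three shapes.

    rows:    list of tuples, one per output row (illustrative)
    columns: column names from the output stage's table definition
    shape:   "atomic", "structure", or "array" (hypothetical labels)
    """
    structures = [dict(zip(columns, row)) for row in rows]
    if shape == "array":
        return structures  # array of structures: multiple rows
    if len(structures) != 1:
        raise ValueError("atomic and structure shapes need exactly one row")
    if shape == "structure":
        return structures[0]  # structure: one row, multiple columns
    if shape == "atomic":
        if len(columns) != 1:
            raise ValueError("atomic shape needs exactly one column")
        return structures[0][columns[0]]  # atomic value: one row, one column
    raise ValueError(f"unknown shape: {shape}")


print(shape_response([(42,)], ["TOTAL"], "atomic"))
print(shape_response([("Ann", "Paris")], ["NAME", "CITY"], "structure"))
print(shape_response([("Ann", "Paris"), ("Bob", "Rome")], ["NAME", "CITY"], "array"))
```

The same three shapes apply to the input arguments of a job that uses an InfoSphere Information Services Director input stage.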

Jobs with service input and output stages

In Topology III, jobs use both an InfoSphere Information Services Director input stage and a service output stage. The input stage is the entry point to a job, accepting one or more rows during a service request. These jobs are always running. This topology is typically used to process high volumes of smaller transactions where response time is important. It is tailored to process many small requests rather than a few large requests. The following figure shows an example of this topology.

Figure 2. Topology III job
Topology III jobs have the following characteristics:
InfoSphere Information Services Director input
The InfoSphere Information Services Director input stage supports one output link. Its table definition maps to the input arguments of a service operation, such as the input arguments of an EJB method. Input values can consist of an atomic value (single column), a structure (multiple columns), or an array of structures (multiple rows).
Defining stages
Jobs with passive stages that have both input and output links are not eligible to implement service operations. A job that has a data source stage with both input and output links cannot be used to implement a service operation, because the data source stage acts as a synchronization point. If you need a data source stage with both input and output links, do not use an InfoSphere Information Services Director input stage. If you omit the InfoSphere Information Services Director input stage from a job, each service request needs more processing time, because that time includes starting and stopping the job.

Also, a database query with results that are returned on an output link is run only one time, when the job instance is started. It is not rerun for each service request. If you want to extract data from a data source, use a reference link to perform the lookup.
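The difference between a query that runs once at startup and a lookup that runs for each request can be shown with a toy model. This sketch is not DataStage code: the class and method names are hypothetical, a Python dictionary stands in for the database table, and the sketch assumes that the reference lookup consults the data source on every request.

```python
class AlwaysOnJobSketch:
    """Toy model of a long-running job instance; not DataStage code."""

    def __init__(self, table):
        self._table = table  # stands in for a live database table
        # Output-link style: the query runs once, when the instance starts.
        self._snapshot = dict(table)

    def handle_with_startup_query(self, key):
        # Served from the rows read at startup; later changes are invisible.
        return self._snapshot.get(key)

    def handle_with_reference_lookup(self, key):
        # Reference-link style: the data source is consulted per request.
        return self._table.get(key)


rates = {"EUR": 1.10}
job = AlwaysOnJobSketch(rates)
rates["EUR"] = 1.25  # the table changes while the job keeps running

print(job.handle_with_startup_query("EUR"))     # 1.1
print(job.handle_with_reference_lookup("EUR"))  # 1.25
```

Because an always-running instance serves many requests, data that is read only at startup grows stale for as long as the instance lives.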

Always-on behavior
A job that conforms to Topology III is always running. When you design the service operation, you select the following options:
  • Minimum number of job instances
  • Maximum number of job instances
Each job instance handles multiple requests during its lifetime. The ASB agent starts job instances prior to service requests, which virtually eliminates latency. This factor contributes to a high throughput rate in communication with the service client. If the job stops, the service client receives an exception.

To avoid timeouts among your database connections, set the maximum lifetime of job instances just below the lowest timeout limit. This is a deployment step. At runtime, the ASB agent recycles instances to meet demand and to maintain the minimum required instances.
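As a worked example of this rule, the following sketch derives a lifetime from a set of connection timeouts. The function name, the use of seconds as the unit, and the 60-second safety margin are assumptions made for illustration; use the units and margin that suit your deployment.

```python
def max_instance_lifetime(connection_timeouts_s, margin_s=60):
    """Return a job-instance lifetime just below the lowest timeout.

    connection_timeouts_s: timeout of each database connection, in seconds
    margin_s:              hypothetical safety margin, in seconds
    """
    if not connection_timeouts_s:
        raise ValueError("at least one timeout is required")
    lowest = min(connection_timeouts_s)
    if lowest <= margin_s:
        raise ValueError("margin must be smaller than the lowest timeout")
    return lowest - margin_s


# Three hypothetical connections with timeouts of 1 hour, 30 minutes, and 8 hours.
print(max_instance_lifetime([3600, 1800, 28800]))  # 1740
```

With these values, the 30-minute connection is the limiting one, so instances are recycled after 1740 seconds, before that connection can time out.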

Job parameters
Any job parameters that belong to a Topology III job are static for all job instances. When you design the service operation, you can set job parameter values.
Reuse considerations
Jobs with an InfoSphere Information Services Director input stage carry design constraints.