Designing IBM InfoSphere DataStage and QualityStage jobs as services
Exposing IBM® InfoSphere® DataStage® jobs and IBM InfoSphere QualityStage® jobs as services is subject to a set of constraints and guidelines. The service-oriented architecture (SOA) platform supports three job topologies for different load and work style requirements: batch jobs (Topology I), batch jobs with a service output stage (Topology II), and jobs with service input and output stages (Topology III).
The design of a job determines whether it is always up and running or runs once to completion. All jobs that are exposed as services process requests on an ad-hoc, 24x7 basis. The IBM InfoSphere Information Services Director server starts job instances on one or more InfoSphere DataStage servers for load balancing and scalability.
You must specify whether a job is always running or enabled for service in the job properties window of the IBM InfoSphere DataStage and QualityStage Designer.
Make sure that the column names in your InfoSphere DataStage and IBM InfoSphere QualityStage jobs do not contain a numeric digit followed by a lowercase letter, such as 2b. A column name of this type causes your jobs to fail when you invoke them as IBM InfoSphere Information Services Director services.
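The naming rule can be checked mechanically before a job is deployed as a service. The following is a minimal sketch, not a product utility: the function name `find_unsafe_columns` and the sample column names are hypothetical, and the pattern encodes only the rule stated here (a digit immediately followed by a lowercase letter).

```python
import re

# A numeric digit immediately followed by a lowercase letter, such as "2b".
_UNSAFE = re.compile(r"[0-9][a-z]")


def find_unsafe_columns(column_names):
    """Return the column names that contain a digit followed by a lowercase letter."""
    return [name for name in column_names if _UNSAFE.search(name)]


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    sample = ["CUSTOMER_ID", "addr2b", "LINE2B", "phone_2", "2nd_contact"]
    print(find_unsafe_columns(sample))  # ['addr2b', '2nd_contact']
```

In this sketch, `LINE2B` and `phone_2` pass because the digit is not followed by a lowercase letter.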
A service accepts requests from client applications, mapping request data to input rows and passing them to the underlying jobs. A job instance can include database lookups, transformations, data standardization and matching, and other data integration tasks. A job instance can then return output rows that can be mapped to service response data and sent back to the client.
Batch jobs
- Start and stop times
- The elapsed time for starting and stopping a batch job, also known as latency, is high. This factor contributes to a low throughput rate in communication with the service client.
- Job instances
- The Information Service Framework (ISF) agent starts job instances on demand to process service requests, up to a maximum that you configure. For load balancing, you can run the jobs on multiple InfoSphere DataStage servers.
- Input and output
- An information service that is based on a batch job can use job parameters as input arguments. When you design the information service, you can set values for the job parameters. This type of service returns no output. If the job ends abnormally, the service client receives an exception.
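The behavior described for batch jobs can be modeled in a few lines. This is a conceptual sketch only, not the agent's implementation: `BatchJobService`, `JobAbended`, and `run_job` are hypothetical names, and waiting for a free slot when the configured maximum is reached is an assumption, because the text does not say how requests beyond the maximum are handled.

```python
import threading


class JobAbended(Exception):
    """Raised to the service client when the batch job ends abnormally."""


class BatchJobService:
    """Toy model: one job instance per request, bounded by a configured maximum."""

    def __init__(self, run_job, max_instances):
        # run_job is a hypothetical callable: run_job(params) -> True if the job finished normally.
        self._run_job = run_job
        self._slots = threading.BoundedSemaphore(max_instances)

    def invoke(self, **job_parameters):
        # Assumption: a request waits until fewer than max_instances are running.
        with self._slots:
            finished_ok = self._run_job(job_parameters)  # start, run to completion, stop
        if not finished_ok:
            raise JobAbended("job ended abnormally")
        return None  # a service based on a plain batch job returns no output


def fake_job(params):
    # Stand-in for a real job run; fails only for a made-up parameter value.
    return params.get("REGION") != "BAD"


if __name__ == "__main__":
    service = BatchJobService(fake_job, max_instances=2)
    print(service.invoke(REGION="EMEA"))  # None
```

Every call pays the full start and stop cost of a job instance, which is why this topology suits infrequent, large requests rather than many small ones.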
Batch jobs with a service output stage
Topology II uses an existing batch job and adds an output stage. The InfoSphere Information Services Director output stage is the exit point from the job, returning one or more rows to the client application as a service response. Its table definition maps to the output arguments of a service operation, such as the return value of a Web service operation. Return values can consist of an atomic value (one column), a structure (multiple columns), or an array of structures (multiple rows). As the following figure shows, these jobs typically initiate a batch process from a real-time process that requires feedback or data from the results. This topology is designed to process large data sets and can accept job parameters as input arguments. In all other respects, the requirements for a Topology II job are identical to those for a Topology I job.
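The three shapes of return value can be illustrated with a small mapping function. This is a sketch under stated assumptions: `shape_response`, its `kind` argument, and the sample column names are hypothetical and are not part of the product. In the product, the shape follows from the table definition of the output stage and the design of the service operation.

```python
def shape_response(rows, columns, kind):
    """Map output rows to a service return value.

    kind is "atomic" (one column), "structure" (multiple columns),
    or "array" (multiple rows, each a structure).
    """
    if kind == "atomic":
        return rows[0][0]
    if kind == "structure":
        return dict(zip(columns, rows[0]))
    if kind == "array":
        return [dict(zip(columns, row)) for row in rows]
    raise ValueError(f"unknown kind: {kind}")


if __name__ == "__main__":
    columns = ["ID", "NAME"]                 # hypothetical table definition
    rows = [(1, "Ada"), (2, "Grace")]        # hypothetical output rows
    print(shape_response([(42,)], ["COUNT"], "atomic"))   # 42
    print(shape_response(rows, columns, "structure"))     # {'ID': 1, 'NAME': 'Ada'}
    print(shape_response(rows, columns, "array"))
```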
Jobs with service input and output stages
In Topology III, jobs use both an InfoSphere Information Services Director input stage and a service output stage. The input stage is the entry point to a job, accepting one or more rows during a service request. These jobs are always running. This topology is typically used to process high volumes of smaller transactions where response time is important. It is tailored to process many small requests rather than a few large requests. The following figure shows an example of this topology.
- InfoSphere Information Services Director input
- The InfoSphere Information Services Director input stage supports one output link. Its table definition maps to the input arguments of a service operation, such as the input arguments of an EJB method. Input values can consist of an atomic value (single column), a structure (multiple columns), or an array of structures (multiple rows).
- Defining stages
- Jobs with passive stages that have both input and output links are not eligible to implement service operations. A job that has a data source stage with input and output links cannot be used to implement a service operation, because the data source stage acts as a synchronization point. If you need a data source stage with both input and output links, do not use an InfoSphere Information Services Director input stage. If you omit the InfoSphere Information Services Director input stage from a job, each service request needs more processing time, including the time to start and stop the job.
Also, a database query with results that are returned on an output link is run only one time, when the job instance is started. It is not re-run during every service call request. If you want to extract data from a data source, use a reference link to perform the lookup.
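The difference between a query on an output link and a lookup through a reference link can be pictured as follows. This is a conceptual model only: `AlwaysOnJob` is a hypothetical class, and a plain dictionary stands in for the data source.

```python
class AlwaysOnJob:
    """Toy model of a long-running job instance that serves many requests."""

    def __init__(self, database):
        self._database = database
        # Output-link style: the query runs one time, when the instance starts.
        self._snapshot = dict(database)

    def handle_with_snapshot(self, key):
        # Uses the rows that were read at startup; later changes are not seen.
        return self._snapshot.get(key)

    def handle_with_lookup(self, key):
        # Reference-link style: the source is consulted for each request.
        return self._database.get(key)


if __name__ == "__main__":
    db = {"C001": "Gold"}            # hypothetical lookup data
    job = AlwaysOnJob(db)
    db["C001"] = "Platinum"          # data changes after the instance started
    print(job.handle_with_snapshot("C001"))  # Gold
    print(job.handle_with_lookup("C001"))    # Platinum
```

In this model, a request that arrives after the data changes sees the stale value unless the lookup is performed per request.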
- Always-on behavior
- A job that conforms to Topology III is always running. When you design the service operation, you select the following options:
- Minimum number of job instances
- Maximum number of job instances
To avoid timeouts among your database connections, set the maximum lifetime of job instances just below the lowest timeout limit. This is a deployment step. At runtime, the ASB agent recycles instances to meet demand and to maintain the minimum required instances.
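The sizing rule for the maximum lifetime, and the effect of the minimum and maximum settings, can be sketched as simple arithmetic. Both functions are hypothetical illustrations: the 60-second margin is an arbitrary example value, and the scaling policy in `instances_to_start` is an assumption about how demand, minimum, and maximum might interact, not a description of the agent's algorithm.

```python
def max_instance_lifetime(connection_timeouts_s, margin_s=60):
    """Return a lifetime, in seconds, just below the lowest connection timeout."""
    lowest = min(connection_timeouts_s)
    if lowest <= margin_s:
        raise ValueError("margin must be smaller than the lowest timeout")
    return lowest - margin_s


def instances_to_start(running, minimum, maximum, pending_requests):
    """Assumed policy: follow demand, but stay between minimum and maximum."""
    wanted = max(minimum, min(maximum, pending_requests))
    return max(0, wanted - running)


if __name__ == "__main__":
    # Hypothetical timeouts for three database connections, in seconds.
    print(max_instance_lifetime([3600, 1800, 7200]))  # 1740
    # One instance left after recycling, minimum of two, no pending requests.
    print(instances_to_start(running=1, minimum=2, maximum=5, pending_requests=0))  # 1
```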
- Job parameters
- Any job parameters that belong to a Topology III job are static for all job instances. When you design the service operation, you can set job parameter values.
- Reuse considerations
- Jobs with an InfoSphere Information Services Director input stage carry design constraints.