Introduction to batch programming using WebSphere Extended Deployment Compute Grid

Commonly thought of as a legacy "mainframe" technology, batch processing is showing itself to be a venerable workload style with growing demand in Java™ and distributed environments. This article introduces an exciting new capability for Java batch processing from IBM®, the leader in batch processing systems for the last 40 years. This content is part of the IBM WebSphere Developer Technical Journal.


Christopher Vignola (cvignola@us.ibm.com), WebSphere Architect, Java Batch and Compute Grid, IBM, Software Group

Chris Vignola is an IBM Senior Technical Staff Member with over 23 years industry experience in mainframe and distributed systems. Drawing from his diverse mainframe background, covering such areas as batch programming, operations, and workload management, and combined with ground-floor experience with Java, J2EE, distributed objects and parallel processing, Chris is a leading proponent behind IBM's Java Batch system offerings and Batch Modernization Strategy.



23 January 2008

Introduction

Compute Grid is a feature of IBM WebSphere Extended Deployment V6.1 that provides the most complete enterprise Java batch programming solution available. With Compute Grid, you have:

  • A concise yet powerful programming model based on POJOs (plain old Java objects).
  • Simple packaging.
  • A simple deployment model.
  • Full-feature job control language (JCL).
  • A sophisticated job scheduler.
  • A robust execution environment.
  • Comprehensive workload management and administrative tools.

While Compute Grid is designed to work with and leverage other WebSphere Extended Deployment features, you can also purchase and deploy it separately. In a production environment, Compute Grid operates with IBM WebSphere Application Server Network Deployment, which is a distributed, multi-machine configuration, but Compute Grid also provides a unit test environment that you can run on a standalone WebSphere Application Server. Compute Grid also offers an Eclipse-based development experience, and additionally supports IBM Rational® Application Developer as a full-function development environment.

This article explains how you can use Compute Grid for Java batch programming. First, though, it is essential that you understand the anatomy of a batch job and the programming model provided by Compute Grid for building batch applications. After covering both of these topics, this article will guide you through the development of a simple batch application using Compute Grid and a batch simulator test utility.

See Resources for more information about Compute Grid and its capabilities.


Anatomy of a batch job

At a high level, a batch job is a declarative construct that directs the execution of a series of one or more batch applications, and specifies their inputs and outputs. A batch job performs this set of tasks in sequence to accomplish a particular business function. Batch applications are programs designed to execute non-interactively in the background. Inputs and outputs are generally accessed by the batch application as logical constructs and are mapped to concrete data resources by the batch job definition.

Batch jobs commonly process large volumes of input/output data that are frequently record-oriented, usually representing critical business data, such as customer accounts, sales, and so on. Business processing tasks performed by batch jobs can range widely, including invoice generation, account optimization, opportunity analysis, and so on. Batch jobs have been used in the System z (mainframe) environment for decades and continue as a backbone of many large and medium sized businesses to this day.

The basic anatomy of a batch job includes the elements shown in Figure 1.

Figure 1. Batch job anatomy

The job definition describes the batch steps to execute and the sequence in which they will run. Each step is defined with a particular batch application to invoke and its input and output data. Common sources and destinations for data include files, databases, transaction systems, message queues, and so on.
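
To make the anatomy concrete, the following minimal sketch models a job definition as plain Java data. The class names (JobDefinitionSketch, StepDefinitionSketch) and the bind and addStep helpers are hypothetical and are not part of Compute Grid; the sketch only illustrates how a job lists its steps in execution order and maps each step's logical inputs and outputs to concrete resources.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: these classes are hypothetical, not Compute Grid APIs.
final class StepDefinitionSketch {
    final String name;
    final String applicationClass;
    // Logical stream name (used by the application) -> concrete resource (chosen by the job).
    final Map<String, String> streams = new LinkedHashMap<>();

    StepDefinitionSketch(String name, String applicationClass) {
        this.name = name;
        this.applicationClass = applicationClass;
    }

    // Maps a logical input or output name to a concrete data resource.
    StepDefinitionSketch bind(String logicalName, String resource) {
        streams.put(logicalName, resource);
        return this;
    }
}

final class JobDefinitionSketch {
    final String name;
    // List order is the order in which the steps would run.
    final List<StepDefinitionSketch> steps = new ArrayList<>();

    JobDefinitionSketch(String name) {
        this.name = name;
    }

    JobDefinitionSketch addStep(StepDefinitionSketch step) {
        steps.add(step);
        return this;
    }
}
```

With this model, the same step class could be bound to a different file in each job simply by changing the bind calls, which is the point of keeping inputs and outputs logical inside the application.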


Batch programming model

A Compute Grid batch application consists of a set of POJOs and runs under the control of the Compute Grid batch container, which itself runs as an extension to a standard WebSphere Application Server. Figure 2 depicts the application components and their relationship to the batch container.

Figure 2. Batch application anatomy

The batch container runs a batch job under the control of an asynchronous bean, which you can think of as a container-managed thread. The batch container processes a job definition and carries out its lifecycle, using an async bean as the unit of execution.

A batch application is made up of these user-provided components:

  • Batch job step: This POJO provides the business logic to execute as a step in a batch job. The batch container invokes the batch job step during the course of processing a job definition.

  • Batch data stream: This POJO provides the batch job step with access to data. A batch application can be written to access one batch data stream or several. A batch data stream can be written to provide access to any sort of data, including data from RDBs, file systems, message queues, through J2C connectors, and so on.

    The batch container is responsible for open, close, and checkpoint-related callbacks onto a batch data stream during the lifecycle of a job step. The batch job step itself calls methods on the batch data stream to get and put data.

A batch application can optionally include these user-provided components:

  • Checkpoint algorithm: The batch container provides a checkpoint/restart mechanism to support job restart from a known point of consistency. A job might need to be interrupted and then subsequently restarted after a planned or unplanned outage. The batch container calls the checkpoint algorithm periodically to determine if it is time to take a checkpoint.

    Compute Grid provides two pre-built checkpoint algorithms, one that supports a time-based checkpoint interval, and another that supports a checkpoint interval based on record-count.

  • Results algorithm: Each batch job step supplies a return code when it is done. The results algorithm has visibility to the return codes from all steps in a batch job and returns a final, overall return code for the job as a whole.

    Compute Grid provides a pre-built results algorithm that returns the numerically highest step return code as the overall job return code.
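
The "highest return code wins" rule is simple enough to sketch in a few lines. The class below is a hypothetical, standalone illustration and does not implement the Compute Grid results algorithm interface; treating an empty job as return code 0 is an assumption made for the example.

```java
import java.util.Collection;
import java.util.Collections;

// Hypothetical sketch of the "numerically highest step return code" rule.
final class HighestReturnCodeSketch {
    private HighestReturnCodeSketch() {
    }

    // Returns the highest step return code as the overall job return code.
    static int overallReturnCode(Collection<Integer> stepReturnCodes) {
        if (stepReturnCodes.isEmpty()) {
            return 0; // assumption: a job with no steps reports 0
        }
        return Collections.max(stepReturnCodes);
    }
}
```

For example, a job whose steps ended with return codes 0, 8, and 4 would report 8 under this rule.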

The batch container is part of the Compute Grid runtime. From this brief introduction, the important things to keep in mind are that the batch container:

  • Orchestrates the lifecycle of a batch job, according to a job definition.
  • Generates a job log to capture the history of a job’s execution, including standard stream output from the batch job steps.
  • Collects performance and usage metrics to facilitate workload management and accounting functions.
  • Provides a powerful job failover model, based on checkpoint/restart semantics.

Additional details of the batch container are beyond the scope of this article. See Resources for further information.

Batch programming interfaces

The Compute Grid batch programming model consists of four principal interfaces, two of which are essential to building a batch application, and two of which are optional and intended for advanced scenarios:

  • Essential interfaces
    • BatchJobStepInterface defines the interaction between the batch container and the batch application.

      Table 1. BatchJobStepInterface Methods
      Return type | Method | Summary
      void | createJobStep() | Called by the Batch Container before calling processJobStep.
      int | destroyJobStep() | Called when the Batch Container has finished processing the job step. Any cleanup code can be added here.
      java.util.Properties | getProperties() | Returns the properties specified in xJCL for the batch job step.
      int | processJobStep() | Should contain all the business logic for the batch job step.
      void | setProperties(java.util.Properties properties) | Called by the Batch Container to make the properties specified in the xJCL available to the batch job step.
    • BatchDataStream abstracts a particular input source or output destination for a batch application and defines the interaction between Compute Grid and a concrete BatchDataStream implementation.

      Table 2. BatchDataStream Methods
      Return type | Method | Summary
      void | close() | Called by the Batch Container to indicate that the user of the batch data stream (BDS) is done working with it.
      java.lang.String | externalizeCheckpointInformation() | Called by the Batch Container during the checkpoint completion phase of processing.
      java.lang.String | getName() | Returns the logical name of this batch data stream.
      java.util.Properties | getProperties() | Returns the properties specified in xJCL for this BDS.
      void | initialize(java.lang.String logicalname, java.lang.String jobstepid) | Called by the Batch Container during the initialization of the job step. This allows the BDS to initialize the stream for use by the batch step.
      void | intermediateCheckpoint() | Called by the Batch Container to indicate to the BDS that a checkpoint has just completed.
      void | internalizeCheckpointInformation(java.lang.String chkptinfo) | Called by the Batch Container during the restart of a batch step. This allows the BDS to restore its internal state to the point it was at when the last successful checkpoint was processed.
      void | open() | Called by the Batch Container to indicate that the BDS is about to be used and to prepare the BDS for operation.
      void | positionAtCurrentCheckpoint() | Called by the Batch Container to signal to the BDS that it should start processing the stream at the point defined by the internalizeCheckpointInformation method.
      void | positionAtInitialCheckpoint() | Called by the Batch Container to signal to the BDS that it should start processing the stream at the initial point as defined by the xJCL inputs.
      void | setProperties(java.util.Properties properties) | Called by the Batch Container to pass BDS properties specified in xJCL to the BDS as a java.util.Properties object.
  • Optional interfaces
    • CheckpointPolicyAlgorithm defines the interaction between Compute Grid and a custom checkpoint policy implementation. A checkpoint policy is used to determine when Compute Grid will checkpoint a running batch job to enable restart after a planned or unplanned interruption. Compute Grid includes two ready-to-use checkpoint policies, shown in Table 3.

    • ResultsAlgorithm defines the interaction between Compute Grid and a custom results algorithm. The purpose of the results algorithm is to provide the overall return code for a job. The algorithm has visibility to the return codes from each of the job steps. Compute Grid includes one ready-to-use results algorithm, shown in Table 3.

      Table 3. Ready-to-use algorithms
      Algorithm | Class name | Description
      Checkpoint policy | com.ibm.wsspi.batch.checkpointalgorithms.timebased | Checkpoints batch job step based on a time interval.
      Checkpoint policy | com.ibm.wsspi.batch.checkpointalgorithms.recordbased | Checkpoints batch job step based on number of input records processed.
      Results | com.ibm.wsspi.batch.resultsalgorithms.jobsum | Returns highest step return code.
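
As a rough illustration of what a record-based checkpoint policy decides, the sketch below counts processed records and signals when a checkpoint is due. It is a hypothetical, standalone class and does not implement the real CheckpointPolicyAlgorithm interface, whose method signatures are not shown in this article; the recordProcessed method name and the interval parameter are assumptions.

```java
// Hypothetical sketch of a record-count checkpoint decision.
final class RecordCountCheckpointSketch {
    private final int interval;
    private int recordsSinceCheckpoint;

    RecordCountCheckpointSketch(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    // Call once per processed record; returns true when a checkpoint is due.
    boolean recordProcessed() {
        recordsSinceCheckpoint++;
        if (recordsSinceCheckpoint >= interval) {
            recordsSinceCheckpoint = 0; // start counting toward the next checkpoint
            return true;
        }
        return false;
    }
}
```

A time-based policy would follow the same shape, comparing elapsed time since the last checkpoint instead of a record count.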

Developing a batch application

With the basic concepts of a batch job and the Compute Grid batch programming model under your belt, it’s time to apply these concepts with a simple exercise that spotlights the essential batch interfaces, BatchJobStepInterface and BatchDataStream. The remainder of this article walks through the steps for implementing a sample batch job step and batch data stream using the Eclipse development environment, and testing them using a utility, included with this article, called the Batch Simulator.

  1. Set up your environment

    Install Eclipse V3.2 or higher (if you do not have it already installed), and download the Batch Simulator utility. (The Batch Simulator utility is provided for demonstration purposes only for use within the context of this article.) Download and installation instructions for the utility can be found in the README PDF file included in the download file.

  2. Create batch data streams

    1. Since the Compute Grid batch framework provides only a BatchDataStream interface, it is useful to extend this framework with abstract classes that provide basic implementations of common patterns. For the purpose of this article, you will build a generic file-based batch data stream that:
      • Has an underlying file of line-oriented text.
      • Supports the basic checkpoint/restart model.
      • Supports read or write mode, but not read/write.
      • Writes strictly by appending to end of file.
      Call this batch data stream TextFileBatchDataStream. The class declaration looks like this:
      Listing 1
      package com.ibm.websphere.samples;

      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.util.Properties;

      import com.ibm.websphere.batch.BatchContainerDataStreamException;
      import com.ibm.websphere.batch.BatchDataStream;

      public abstract class TextFileBatchDataStream
              implements BatchDataStream {
      ...
      }
    2. This class is declared abstract to emphasize that it is a framework class, intended for extension by subclass implementers. Since the stream supports either read or write mode, but not both, the subclass must make this decision. Therefore, make the default constructor private, and require that the subclass implementer use this constructor:
      Listing 2
      protected enum ACCESS_MODE {R, W};
      protected ACCESS_MODE _access_mode;

      protected TextFileBatchDataStream(ACCESS_MODE access_mode) {
          _access_mode = access_mode;
      }
    3. Since this batch data stream will be based on an underlying text file, you need a way to know the file name. The file name will be passed into the batch data stream implementation as a batch data stream property. This property is set in the job definition and then passed into the batch data stream object by the batch container as part of the batch data stream initialization. For the purpose of receiving these properties, the BatchDataStream interface declares a setProperties method, which is called by the batch container:
      Listing 3
      public void setProperties(Properties properties) {
          props = properties;
      }

      The BatchDataStream interface further declares initialize and open methods. These are called in that order, after setProperties. The initialize method is called to give the batch data stream an opportunity to prepare to be opened. The container passes two unique identifiers to the batch data stream on the initialize method:
      • Logical name is the name by which the batch job step refers to a particular batch data stream. This name is mapped to a specific batch data stream implementation through the job definition. A late binding mechanism can then enable the same batch job step to process different batch data stream implementations, based on the job definition.
      • Job step id uniquely identifies the job instance, and the specific step within the instance, that is currently accessing the batch data stream implementation.
      There is no defined purpose for these two values in a batch data stream implementation, but at a minimum, they are useful for inclusion in trace messages to aid in debugging.
    4. In your initialize method implementation, obtain and store the file name for this batch data stream. Remember, you will require that this property be passed through the job definition that specifies this batch data stream as a job step’s input:
      Listing 4
      public void initialize(String logicalname, String jobstepid) {
          Properties prop = getProperties();
          // Store the file name in the _fn field that open() reads.
          _fn = prop.getProperty("FILENAME");
      }
    5. Since this batch data stream is based on a text file, use the Java RandomAccessFile class to interact with it. In the open method, set up this structure and open the file:
      Listing 5
      public void open() throws BatchContainerDataStreamException {
          try {
              if (_access_mode == ACCESS_MODE.R) {
                  _file = new RandomAccessFile(_fn, "r");
              } else {
                  _file = new RandomAccessFile(_fn, "rw");
              }
          } catch (IOException e) {
              throw new BatchContainerDataStreamException(e);
          }
      }
    6. You need to declare methods to get and put data. The BatchDataStream interface does not declare a signature for this, since there can be great variations on this signature among different batch data stream types. Therefore, introduce these two signatures to your implementation:
      Listing 6a
      public String getNextRecord()
              throws BatchContainerDataStreamException {
          String input = null;
          try {
              input = _file.readLine();
          } catch (IOException e) {
              throw new BatchContainerDataStreamException(e);
          }
          return input;
      }
      Listing 6b
      public void putNextRecord(String r)
              throws BatchContainerDataStreamException {
          try {
              _file.writeBytes(r);
              _file.write('\n');
              // Track the write position: record length plus the newline.
              this._position += r.length() + 1;
          } catch (IOException e) {
              throw new BatchContainerDataStreamException(e);
          }
      }

      The close and checkpoint-related methods are omitted here for simplicity, but you can view them in the sample code included in the Eclipse project that accompanies this article.
    7. Implement two subclasses of your base class to create an input and output batch data stream:
      Listing 7a
      package com.ibm.websphere.samples;
      public class TestInputBatchDataStream
       extends TextFileBatchDataStream {
      	public TestInputBatchDataStream() {
      		super(ACCESS_MODE.R);
      	}
      }
      Listing 7b
      package com.ibm.websphere.samples;
      public class TestOutputBatchDataStream
              extends TextFileBatchDataStream {
          public TestOutputBatchDataStream() {
              super(ACCESS_MODE.W);
          }
      }
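
The checkpoint-related methods are only described in this article, so here is a minimal sketch of how a file-based reader might support restart, assuming the checkpoint information is simply the current file offset stored as a string. CheckpointableTextReaderSketch is a hypothetical, standalone class: it borrows method names from the BatchDataStream interface for readability but does not implement it, it throws IOException directly rather than BatchContainerDataStreamException, and it is not the sample code that accompanies the article.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical standalone sketch; it does not implement the real BatchDataStream interface.
final class CheckpointableTextReaderSketch {
    private final String fileName;
    private RandomAccessFile file;
    private long restartPosition; // offset restored from a checkpoint; 0 for a fresh start

    CheckpointableTextReaderSketch(String fileName) {
        this.fileName = fileName;
    }

    void open() throws IOException {
        file = new RandomAccessFile(fileName, "r");
    }

    String getNextRecord() throws IOException {
        return file.readLine(); // null at end of file
    }

    // Captures the current file offset as the checkpoint token.
    String externalizeCheckpointInformation() throws IOException {
        return Long.toString(file.getFilePointer());
    }

    // Remembers the offset recorded by an earlier checkpoint.
    void internalizeCheckpointInformation(String chkptinfo) {
        restartPosition = Long.parseLong(chkptinfo);
    }

    // Fresh start: begin at the first record.
    void positionAtInitialCheckpoint() throws IOException {
        file.seek(0);
    }

    // Restart: resume at the offset restored from the checkpoint.
    void positionAtCurrentCheckpoint() throws IOException {
        file.seek(restartPosition);
    }

    void close() throws IOException {
        file.close();
    }
}
```

On a restart, the stored offset would be handed back through internalizeCheckpointInformation and applied by positionAtCurrentCheckpoint, so the next call to getNextRecord returns the first record that was not yet read when the checkpoint was taken.
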
  3. Create batch job step

    Now that you have implemented your batch data streams, writing the batch job step is pretty simple. All batch job steps must implement BatchJobStepInterface, so declare your code like this:

    Listing 8
    package com.ibm.websphere.samples;

    import com.ibm.websphere.batch.BatchConstants;
    import com.ibm.websphere.batch.BatchJobStepInterface;
    import com.ibm.websphere.batch.BatchDataStreamMgr;
    import com.ibm.websphere.batch.BatchContainerDataStreamException;
    import com.ibm.websphere.batch.JobStepID;
    import com.ibm.websphere.batch.context.JobStepContextMgr;
    import com.ibm.websphere.batch.context.JobStepContext;

    public class TestBatchJobStep
            implements BatchJobStepInterface {
    ...
    }

    In this example, there are five methods to implement; the ones of interest here are the createJobStep and processJobStep methods. (You can view the getProperties, setProperties, and destroyJobStep methods in the sample code.)

    • The createJobStep method will be used to set up the input and output batch data streams so they will be ready for use by the processJobStep method:
      Listing 9
      public void createJobStep() {
          try {
              JobStepID id = getJobStepID();

              _testInputBatchDataStream = (TestInputBatchDataStream)
                      BatchDataStreamMgr.getBatchDataStream(
                              "input", id.getJobstepid());

              _testOutputBatchDataStream = (TestOutputBatchDataStream)
                      BatchDataStreamMgr.getBatchDataStream(
                              "output", id.getJobstepid());
          } catch (BatchContainerDataStreamException e) {
              throw new RuntimeException(e);
          }
      }

      JobStepID and BatchDataStreamMgr are further elements of the Compute Grid batch programming model:
      • BatchDataStreamMgr is a service class that provides access to a batch job step’s batch data streams.
      • JobStepID is a helper class, used to encapsulate the identity of a job step as it is known to the batch container.
      Notice that the createJobStep method uses a private method, called getJobStepID():
      Listing 10
      private JobStepID getJobStepID() {
        JobStepContext ctx= JobStepContextMgr.getContext();
        return ctx.getJobStepID();
      }

      This method reveals the final part of the batch programming model: job step context. The JobStepContextMgr service class enables the batch job step to obtain a reference to its JobStepContext object. JobStepContext provides two important functions:
      • Access to information that uniquely identifies the context in which the current batch job step is executing (like jobid).
      • A work area where application-specific information can be passed among the batch programming framework methods during the life of the batch job step.

      The JobStepContext object exposes this interface:

      Table 4. JobStepContext Methods
      Return type | Method | Summary
      java.lang.String | getJobID() | Returns job name of current job.
      JobStepID | getJobStepID() | Returns JobStepID object for current step.
      java.lang.String | getStepID() | Returns step name of current step.
      java.lang.Object | getUserData() | Returns the user data stored in this context.
      void | setUserData(java.lang.Object o) | Sets user data object in this context.
    • The processJobStep method contains the business logic that processes the batch data stream. For this example, your business logic simply copies the input batch data stream to the output batch data stream:
      Listing 11
      public int processJobStep() {
          try {
              String l = _testInputBatchDataStream.getNextRecord();
              if (l == null) {
                  // End of input: tell the container the step is done.
                  return BatchConstants.STEP_COMPLETE;
              } else {
                  _testOutputBatchDataStream.putNextRecord(l);
                  return BatchConstants.STEP_CONTINUE;
              }
          } catch (Exception e) {
              throw new RuntimeException(
                      "TestBatchJobStep: error in processJobStep: ", e);
          }
      }

    The entire sample code is included in the Eclipse project that accompanies this article.
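
Because processJobStep handles one record per call and returns either STEP_CONTINUE or STEP_COMPLETE, something must call it repeatedly. The sketch below shows one way such a driver loop could look. Everything in it is a hypothetical stand-in: StepSketch mimics three methods of BatchJobStepInterface, the constant values are invented for the example, and treating the value returned by destroyJobStep as the step return code is an assumption. It is not the batch container's or the Batch Simulator's actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for BatchJobStepInterface; constant values are invented.
interface StepSketch {
    int STEP_CONTINUE = 0;
    int STEP_COMPLETE = 1;

    void createJobStep();

    int processJobStep();

    int destroyJobStep();
}

// Copies one record per processJobStep call, like the article's sample step.
final class CopyStepSketch implements StepSketch {
    private final Iterator<String> input;
    final List<String> output = new ArrayList<>();

    CopyStepSketch(List<String> records) {
        this.input = records.iterator();
    }

    public void createJobStep() {
        // nothing to set up in this in-memory sketch
    }

    public int processJobStep() {
        if (!input.hasNext()) {
            return STEP_COMPLETE;
        }
        output.add(input.next());
        return STEP_CONTINUE;
    }

    public int destroyJobStep() {
        return 0; // assumption: 0 signals success
    }
}

final class StepDriverSketch {
    private StepDriverSketch() {
    }

    // Drives a step through create, repeated process calls, and destroy.
    static int run(StepSketch step) {
        step.createJobStep();
        while (step.processJobStep() == StepSketch.STEP_CONTINUE) {
            // A real container could consult the checkpoint algorithm between calls.
        }
        return step.destroyJobStep();
    }
}
```

Keeping each processJobStep call small is what gives a driver the opportunity to take checkpoints between records.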

  4. Test the code

    As mentioned earlier, Compute Grid provides a test environment that executes inside a standalone WebSphere Application Server, but to simplify things even further for this example (and avoid additional application packaging and deployment tasks), a testing aid called the Batch Simulator is included for your use with this article. The Batch Simulator enables you to initially test your batch POJOs right inside the Eclipse environment. Bear in mind, though, that the Batch Simulator runs in a J2SE environment, whereas the actual Compute Grid execution environment is Java EE, since it runs inside WebSphere Application Server. Still, the Batch Simulator is useful for testing basic flow among POJOs and testing the essential business logic in the batch job step.

    To run the Batch Simulator:

    1. You need to provide a job definition that describes the batch job step you want to execute and its batch data streams. A sample job definition that is ready to use for this sample batch job step can be found in the Eclipse workspace shown in Figure 3.
      Figure 3. Eclipse jobdefs folder
      Review the block comments in that file to learn about the properties used to describe the job definition. Note that Compute Grid job definitions are typically XML files that conform to a proprietary job control language format called xJCL. The simulator uses a property-based approach for simplicity. (The Batch Simulator also features an option to generate a proper Compute Grid xJCL file based on a Batch Simulator properties file.)
    2. You need an Eclipse run configuration that specifies:

      Table 5. Run configuration attributes
      Attribute | Value
      Main class | com.ibm.websphere.batch.BatchSimulator
      Program arguments | "${workspace_loc:${project_name}}/jobdefs/testbatchjobstep.properties"
      When you run this configuration, you should see output from the Batch Simulator and your sample application that looks like this:
      Listing 12
      BatchSimulator: start job JOB1
      ...
      BatchSimulator: end job JOB1 - RC=0

      The Batch Simulator supports an additional run option, writexJCL, that writes an xJCL file, based on the Batch Simulator input properties. When specified, it must be the second parameter, following the input file specification. This option takes the input Batch Simulator properties and writes out a proper Compute Grid xJCL file. You can use this later when you test your batch job step in the Compute Grid server environment. You can set system property com.ibm.websphere.batch.simulator.xjcldir to specify the output directory in which you want the xJCL written. A console message informs you of the xJCL name and location.
    3. For added flexibility, the Batch Simulator workspace also includes an Ant task that launches the Batch Simulator:

      com.ibm.websphere.batch.BatchSimulatorTask

      The workspace includes a pair of sample Ant scripts you can use to launch this task, as shown in Figure 4.
      Figure 4. Eclipse scripts folder

Conclusion

WebSphere Extended Deployment Compute Grid provides a simple abstraction of a batch job step and its inputs and outputs. The programming model is concise and straightforward to use. The built-in checkpoint/restart mechanism makes it easy to build robust, restartable Java batch applications.

The Batch Simulator utility provided with this article offers an alternative test environment that runs inside your Eclipse (or Rational Application Developer) development environment. Its xJCL generator can help you jump-start the next phase of testing in the Compute Grid unit test server.


Download

Description | Name | Size
Test demonstration utility | ComputeGridBatchSimulatorPkg.zip | 4.3 MB

Resources

Learn

Get products and technologies
