Topic
  • 6 replies
  • Latest Post - ‏2011-08-07T18:21:28Z by SystemAdmin
SystemAdmin
SystemAdmin
783 Posts

Pinned topic Comparing Spring Batch to WebSphere XD Compute Grid

‏2008-01-22T20:43:15Z |
The following post provides some background on layers within the batch landscape


Spring Batch is a batch application container. The technology does not provide a transaction manager, security manager, connection management, log management, inherent high availability, and other such infrastructure services. The technology enables the configuration of a batch application via dependency injection and delegates to both the business logic and infrastructure services as needed during batch execution.
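To make the "container that delegates" idea concrete, here is a minimal sketch in plain Java. It does not use the Spring Batch API; `ContainerSketch`, `BatchStep`, and `TransactionService` are hypothetical names invented for illustration. The only point is that the container owns the processing loop, while the business logic and the infrastructure service are injected from outside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; these are not Spring Batch or Compute Grid types.
class ContainerSketch {

    /** An infrastructure service that the container expects someone else to supply. */
    interface TransactionService {
        void begin();
        void commit();
    }

    /** The "container": it owns the loop, and everything else is injected. */
    static final class BatchStep {
        private final Function<String, String> businessLogic;
        private final TransactionService transactions;

        BatchStep(Function<String, String> businessLogic, TransactionService transactions) {
            this.businessLogic = businessLogic;
            this.transactions = transactions;
        }

        /** Delegates to the transaction service around the work and to the business logic per record. */
        List<String> execute(List<String> records) {
            List<String> output = new ArrayList<>();
            transactions.begin();
            for (String record : records) {
                output.add(businessLogic.apply(record));
            }
            transactions.commit();
            return output;
        }
    }
}
```

In this sketch, swapping the hosting environment amounts to injecting a different `TransactionService`, which is roughly the division of labor the paragraph above describes.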



WebSphere XD Compute Grid on the other hand provides technology for three of the four batch layers described: a scheduler, tightly-integrated with the batch execution environment, for managing the execution of XD batch applications; a batch execution environment that provides infrastructure services like the transaction manager, security manager, high availability, and so on; and also a batch application container that delegates to the business logic and to the infrastructure services as appropriate.



The attached image, courtesy of Chris Vignola, provides a straightforward comparison of the two technologies.


Message was edited by: Snehal Antani
Updated on 2011-08-07T18:21:28Z at 2011-08-07T18:21:28Z by SystemAdmin
  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2008-06-17T20:08:18Z  


    A number of customers have been asking how Spring Batch compares to WebSphere XD Compute Grid (aka WebSphere Batch). Think of a batch processing infrastructure as being composed of two key parts: a batch runtime infrastructure, whose role is to host and govern the execution of the batch job; and the actual programming model to which the batch job is written. (Yes, I know I talked about "4 layers of batch" in previous posts, but perhaps that was not a very clear description :)


    WebSphere XD Compute Grid delivers both an end-to-end batch runtime infrastructure as well as a programming model. Spring Batch delivers just a programming model and requires a hosting environment. Spring Batch more directly competes with the Compute Grid programming model and more specifically, the BDS Framework.


    For the most part, the programming technique used to develop batch applications shouldn't really matter. Customers are in production today with Compute Grid where they develop applications using Spring, hook into Compute Grid via the BDS Framework, and deploy to Compute Grid running on a variety of platforms (Windows, Linux, z/OS). We've built batch applications using an early version of Spring Batch in the labs as well. Today there are specific places where you have to give Spring Batch control, which perhaps warrants a pattern in the BDS Framework, but the plan is to fully support hosting Spring Batch applications in the future. This way Compute Grid can focus on delivering an end-to-end batch processing infrastructure, providing features such as: OLTP + Batch processing interleave (working with workload management to be sure both workloads play nice); integrating with external schedulers like Tivoli Workload Scheduler; delivering a parallel processing infrastructure (that handles multi-node dispatching, failover, load balancing among cluster members, etc); disaster recovery types of features; and so on.


    The most important fact to keep in mind when deciding on a batch processing infrastructure isn't the programming model (Spring Batch versus the BDS Framework). That's a decision based on the developer's preference. Instead, the real challenge is choosing the infrastructure that will host the batch applications. I see two important factors here: first, where is the data located; and second, what batch assets already exist, where are they, and how can we integrate with them (enterprise schedulers, for example). For now I'll just focus on the proximity-to-data topic. If the data is hosted on z/OS (DB2 for z/OS, datasets, etc), the batch processing infrastructure should be running on z/OS. Hands down. If the data is on a distributed machine (DB2 UDB, Oracle, etc), a technology like WebSphere eXtreme Scale should be examined and used to bring the data closer to the business logic for processing (we call this "eXtreme Batch" because using "eXtreme" seems to be the cool thing to do these days :)


    Ultimately the location of the data should dictate the platform on which the batch applications will execute. A very large percentage of enterprise data is located on z/OS, therefore most batch should be running on that platform. WebSphere XD Compute Grid is the only batch processing infrastructure that will run on every platform supported by WebSphere, including z/OS. Moreover, WebSphere XD Compute Grid integrates with z/OS in many ways and therefore treats the platform as a first-class citizen. The important point here though is that application developers shouldn't care which platform their apps are running on. Batch applications hosted by Compute Grid can run anywhere that Compute Grid can run, and this deployment decision can be completely transparent to the application developers.


    So when selecting a batch processing infrastructure, the choice should be based on where the data is located today, and where the data could be located tomorrow. The last thing a customer wants to deal with is rebuilding their batch processing infrastructure because their data is to be consolidated to z/OS. Which perhaps is the next trend.


    You can find more information on Compute Grid (WebSphere Batch) at:

  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2008-06-30T20:20:10Z  


    An interesting post about Spring Batch, CommonJ work threads, and the ramifications on WAS z/OS (http://blog.springsource.com/main/2007/06/23/the-power-of-batch)


    When running in WebSphere, Spring Batch jobs will execute on CommonJ work threads (at least, according to the post).


    One thing to keep in mind with CommonJ threads is that with WebSphere Application Server for z/OS, CommonJ threads aren't visible to the z Workload Manager (zWLM). The two most significant ramifications are these: first, zWLM doesn't 'see' CommonJ threads because they don't run within a WLM enclave, which means your server could be running 1000 CommonJ threads while the Servant appears to be idle to the system, so you could overrun your allocated CPU and other physical resources; and second, because the work doesn't run within a zWLM enclave, WAS z/OS Servant Regions can't be dynamically started by incoming CommonJ work. That can only be done by Dispatch Worker Threads, which are visible to zWLM.


    A 'hack' to fix this could be to wrap the CommonJ work thread in a servlet or EJB, which runs within a Dispatch Worker thread (and therefore runs within a WLM enclave and is visible to zWLM). The problem with this workaround is the risk of transaction timeouts. Application servers have a single, process-wide timeout value for each protocol; HTTP, for example, has its own timeout (the default is 120 seconds, I believe). So in order to use this workaround, you have to be sure that your CommonJ activity completes within the specified transaction timeout. The 'hack' to the 'hack' is to crank your transaction timeout values up really high. This is a problem if you're running mixed workloads, because if problems occur your application server will not be resilient (see: http://www.ibm.com/developerworks/webservices/library/ws-soa-resilient2/index.html#N1015A).
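The timeout risk in this workaround can be sketched with plain `java.util.concurrent`, used here as a stand-in for CommonJ work threads. Nothing below is a WebSphere or CommonJ API; `WrappedWorkSketch` and `runWithinTimeout` are hypothetical names, and the timeout values are arbitrary. The calling thread plays the role of the servlet or EJB request, which must give up once its timeout expires even if the background work is still running.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: a plain executor stands in for CommonJ work threads.
class WrappedWorkSketch {

    /** Runs the work on a background thread and waits for it, but no longer than timeoutMillis. */
    static String runWithinTimeout(Callable<String> work, long timeoutMillis) {
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = background.submit(work);
            try {
                return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // interrupt the unfinished work
                return "TIMED_OUT";
            } catch (ExecutionException e) {
                return "FAILED";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
                return "INTERRUPTED";
            }
        } finally {
            background.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Short work fits inside the timeout.
        System.out.println(runWithinTimeout(() -> "DONE", 1000));
        // Long-running "batch" work does not, so the wrapping request gives up.
        System.out.println(runWithinTimeout(() -> {
            Thread.sleep(5000);
            return "DONE";
        }, 100));
    }
}
```

Raising the timeout makes the second call succeed, but in a real server that same value would then apply to every request on the protocol, which is the resiliency concern raised above.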


    In the context of batch processing, since a large volume of data to be processed in batch mode resides on z/OS, and proximity of data is critical to batch performance, the likely deployment target for the Spring Batch application will be WebSphere on z/OS.


    One way I've seen customers work around this is to combine Spring with WebSphere's batch facility, aka WebSphere XD Compute Grid. The WebSphere batch facility runs as a dispatch worker thread with custom timeouts, so the problems I described are solved; moreover, with the WebSphere batch facility you can actually leverage WLM and classify your batch jobs with WLM transaction classes that map to WLM service classes. It may be possible to combine Spring Batch with Compute Grid today (and there are plans to ensure this works tomorrow), but I have seen plain Spring used with the BDS Framework (development tooling from Compute Grid, you can read more at: http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=190624&tstart=0) to build batch applications that run properly in the Compute Grid environment. I'm working on a developerWorks article that describes the Spring + Compute Grid integration, which should hopefully be published soon.


    Some links on the topic:

    http://www.ibm.com/developerworks/websphere/library/techarticles/0602_ashok/0602_ashok.html
    http://www.ibm.com/developerworks/webservices/library/ws-soa-resilient/index.html
    http://www.ibm.com/developerworks/webservices/library/ws-soa-resilient2/index.html
    Updated on 2008-06-30T20:20:10Z at 2008-06-30T20:20:10Z by SystemAdmin
  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2009-09-22T08:51:03Z  
    The comparison jpg is awfully outdated, and even at the time it was posted it seems to have had an IBM bias. Spring Batch has always had support for multithreaded execution. Now it also supports multi-JVM execution (remote chunking).
  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2009-09-22T08:53:40Z  
    Spring Batch has also supported restart/failover since the beginning, as far as I know.
  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2009-09-25T05:03:31Z  
    Thanks for the comments. Let's re-evaluate the claims made in this thread and elsewhere:

    • ".jpg is outdated, and is skewed towards IBM". Spring Batch is a programming style for expressing batch applications. This is clearly stated by Spring in their documentation, forums, etc. They don't claim to be a batch processing platform, nor are they in any way. The .jpg comparison makes this clear, and the claims made in the image are still true today: Spring Batch requires a host to attach to (WAS, JBoss, Tomcat, etc); Multi-JVM management does not exist; no Job Scheduler; no complete high availability or multi-datacenter disaster recovery model; no Job Management console (or console of any sort); no notion of service level agreements (SLA) or SLA management; no z/OS integration of any sort (in fact, I've posted that since Spring Batch uses CommonJ under the covers, zWLM can't do much at all, like dynamically start WAS z/OS servant regions, with SB batch jobs); no job log management; no explicit external scheduler integration. With regard to checkpoint/restart/failover, that should probably be clarified to say "transactional checkpoint/restart/failover", since SB's mechanism doesn't appear to be truly transactional, which I point out in my response to the following blog post: Link: http://www.cforcoding.com/2009/07/spring-batch-or-how-not-to-design-api.html

    • "multi-JVM execution and remote chunking is supported now". Multi-JVM execution and parallel processing are two very different things. Parallel processing is the coordinated execution of some complex task across a collection of resources. QoS like operational control (start/stop/cancel/restart/etc), job log management & aggregation, etc are expected. Multi-JVM support simply means that multiple instances of the container can be instantiated, aka clustered. Compute Grid provides a Parallel Job Manager component whose sole purpose is to apply partitioning algorithms to batch jobs, dispatch those parallel jobs across a cluster, and manage the parallel instances on behalf of the operator. Clustering (aka multi-JVM support) has been part of Compute Grid since 2005. Clustering is a fundamental requirement of any production topology, and something that is expected by customers, therefore not something that needs to be explicitly highlighted.

    • The following post in the SB forum mentions the lack of multi-threaded support in SB: Link: http://forum.springsource.org/showthread.php?p=252361.


    WebSphere Compute Grid is a batch processing platform. The programming style used to describe your batch applications is your decision, and you can certainly write your application using Spring Batch and leverage Compute Grid as the underlying execution engine. I strongly recommend against doing this for a variety of reasons: several customers have demonstrated to me that Spring Batch's checkpoint/restart is not truly transactional; splitting the business logic between Spring Batch .xml files and Java makes application development/debug/maintenance/life-cycle mgmt a nightmare for large projects; and finally, the better the contract between the container/platform and the application, the more interesting the QoS that can be provided, so by running in SB you will be limited here.
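For readers unsure what "transactional" checkpoint/restart is usually taken to mean, here is a minimal in-memory sketch: the output of a chunk and the restart position are committed together, so a failure mid-chunk leaves neither behind. This is plain Java and makes no claim about how Spring Batch or Compute Grid actually implement checkpoints; `CheckpointSketch`, `Store`, and the `failAt` parameter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: an in-memory stand-in for a transactional resource.
class CheckpointSketch {

    /** Committed state: the written records and the restart position move together. */
    static final class Store {
        final List<String> written = new ArrayList<>();
        int checkpoint = 0; // index of the next unprocessed input record

        /** "Commits" a chunk: output and checkpoint are updated in one step. */
        void commit(List<String> chunkOutput, int newCheckpoint) {
            written.addAll(chunkOutput);
            checkpoint = newCheckpoint;
        }
    }

    /**
     * Processes input from the stored checkpoint in chunks of chunkSize.
     * failAt simulates a crash when that input index is reached (-1 means never).
     * Returns true if the job ran to completion.
     */
    static boolean run(List<String> input, Store store, int chunkSize, int failAt) {
        int position = store.checkpoint;
        while (position < input.size()) {
            int end = Math.min(position + chunkSize, input.size());
            List<String> chunkOutput = new ArrayList<>();
            for (int i = position; i < end; i++) {
                if (i == failAt) {
                    return false; // crash: the uncommitted chunkOutput is discarded
                }
                chunkOutput.add(input.get(i).toUpperCase());
            }
            store.commit(chunkOutput, end);
            position = end;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d", "e");
        Store store = new Store();
        System.out.println(run(input, store, 2, 3));  // fails inside the second chunk
        System.out.println(store.checkpoint + " " + store.written);
        System.out.println(run(input, store, 2, -1)); // restart resumes at the checkpoint
        System.out.println(store.written);
    }
}
```

If the output and the checkpoint were instead saved in two independent steps, a crash between them could leave records written twice or skipped on restart, which is the kind of behavior the word "transactional" is meant to rule out.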

    As I've described in the following post Link: https://www.ibm.com/developerworks/forums/thread.jspa?messageID=14216935&#14216935 using Spring Core (AOP and dependency injection specifically) is perfectly reasonable, and numerous customers I work with have combined Spring Core with the Batch Data Stream (BDS) framework.

    Finally, I recently posted a paper and presentation on designing batch applications, which may be of interest to you: Link: https://www.ibm.com/developerworks/forums/thread.jspa?threadID=275702&tstart=0
  • SystemAdmin

    Re: Comparing Spring Batch to WebSphere XD Compute Grid

    ‏2011-08-07T18:21:28Z  
    "This is clearly claimed by Spring in their documentation, forums, etc. They don't claim to be a batch processing platform, nor are they in any way. The .jpg comparison makes this clear, and the claims made in the image are still true today: Spring Batch requires a host to attach to (WAS, JBoss, Tomcat, etc);"

    Well, Spring is a lightweight framework, and does not impose a certain type of runtime. As such, its overhead is minimal. You can run it as a simple Java process in the JVM (java -jar) or you can run it in a managed environment such as Tomcat or JBoss. Generally, I don't see a great advantage in running it inside a container. You may be able to re-use some container services you are already relying on in your apps, but in that case the container is already given, and putting your Spring Batch app on that container can be achieved in a few minutes.

    "no z/OS integration of any sort"

    What are you looking for here? It's Java, so it can integrate with JCA adapters and JDBC drivers, which would be my preferred means of integration with any system, staying with well-defined interfaces. Of course, an OS-level integration won't be there in the same way for an open source framework as for a framework done by the same people who've done the OS...

    " Multi-JVM execution and parallel processing are two very different things. Parallel processing is the coordinated execution of some complex task across a collection of resources. QoS like operational control (start/stop/cancel/restart/etc), job log management & aggregation, etc are expected. Multi-JVM support simply means that multiple instances of the container can be instantiated, aka clustered. Compute Grid provides a Parallel Job Manager component whose sole purpose is to apply partitioning algorithms to batch jobs, dispatch those parallel jobs across a cluster, and manage the parallel instances on behalf of the operator. Clustering (aka multi-JVM support) has been part of Compute Grid since 2005. Clustering is a fundamental requirement of any production topology, and something that is expected by customers, therefore not something that needs to be explicitly highlighted."

    Please have a look at the reference documentation on scaling (http://static.springsource.org/spring-batch/reference/html-single/index.html#scalability). Spring Batch has four different scaling mechanisms, of which partitioning is one. Partitioning is there to do exactly what you describe: it allows datasets to be partitioned and the partitions to be executed on different nodes in a grid.
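As a rough illustration of the partitioning idea being debated here (split the input into ranges, dispatch each range to a worker, combine the results), the following plain-Java sketch uses a local thread pool in place of grid nodes. It is not the Spring Batch partitioning API or the Compute Grid Parallel Job Manager; `PartitionSketch`, `partition`, and `parallelSum` are hypothetical names, and the sketch assumes `partitions` is at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: worker threads stand in for nodes in a grid.
class PartitionSketch {

    /** Splits [0, total) into at most `partitions` contiguous {start, end} ranges (end exclusive). */
    static List<int[]> partition(int total, int partitions) {
        List<int[]> ranges = new ArrayList<>();
        int base = total / partitions;
        int remainder = total % partitions;
        int start = 0;
        for (int p = 0; p < partitions; p++) {
            int size = base + (p < remainder ? 1 : 0); // spread the remainder over the first ranges
            if (size == 0) {
                continue; // more partitions than records: skip empty ranges
            }
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    /** Sums the data by handing each partition to a worker and combining the partial results. */
    static long parallelSum(int[] data, int partitions) {
        ExecutorService workers = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int[] range : partition(data.length, partitions)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        sum += data[i];
                    }
                    return sum;
                };
                results.add(workers.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // wait for each partition to finish
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("a partition failed", e);
        } finally {
            workers.shutdown();
        }
    }
}
```

What the sketch leaves out is what this thread is really arguing about: operational control over the running partitions, log aggregation, and restart of a single failed partition.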

    "The following post in the SB forum mentions the lack of multi-threaded support in SB: Link: http://forum.springsource.org/showthread.php?p=252361."

    I suggest you look at the documentation instead of rants on forums. You've got a few questionable sources in your posts.

    All in all, I appreciate that you have some bias here, as do I. I think it would be courteous to declare your IBM affiliation, though, before you post this kind of FUD. While this is posted on ibm.com, I would have expected more courtesy and professionalism from someone from IBM, especially when they are portraying the initial post as an unbiased comparison.