Anant Jhingran, CTO of Information Management, recently blogged about Hadoop in the cloud. The challenge with Hadoop, as he described it, is that it's not really a good option for elastic data: in its current form, at least, it's not very good at spinning up new instances on the fly as data grows. Moreover, map/reduce patterns are a good solution for semi-structured or unstructured data, but not necessarily for relational data, which of course is the form in which most enterprise data currently resides. WebSphere Compute Grid is quite compelling for data-intensive applications running in the cloud, especially when processing relational data and honoring the transactional integrity requirements that are expected.
The following excerpts from papers I've written describe the value of WebSphere Compute Grid in the cloud, and reference images that are attached with this post:
WebSphere Compute Grid administrators define policies that influence how batch jobs are executed. These policies, coupled with autonomics built into the infrastructure, serve as the foundation for cloud-enabled batch. Figure1_ComputeGridiCloud.tiff depicts these elements:
WebSphere Compute Grid administrators can define policies that govern how jobs are executed. These policies can be managed through standard life-cycle management technologies, and their life cycle is independent of the applications themselves.
- a. Dispatch policies can be configured that influence where batch jobs are executed. For example, a dispatch policy can require that all jobs of a certain type run in a 64-bit JVM.
- b. Partitioning policies can be configured that define how batch jobs should be broken into parallel processing elements. These policies typically match how the data used by the batch applications has been partitioned. Dispatch policies can be defined in conjunction with the parallel partitioning policies to create a highly parallel solution with data-aware routing.
- c. Job-specific qualities of service (QoS) can be configured that influence how jobs are executed within the batch container.
- Both the batch container tier and the underlying infrastructure cloud convey capacity and execution metrics to the job-dispatching tier, which uses them to determine the best endpoint on which to execute each job. Autonomic algorithms ensure that jobs are load-balanced across the system.
- As spikes in batch workload occur, elasticity services may be needed so that the batch container and infrastructure cloud can scale up or down to meet demand. The z/OS platform has these services built in, and WebSphere Compute Grid takes full advantage of them. On distributed platforms, WebSphere Compute Grid can be combined with WebSphere Virtual Enterprise to create an elastic infrastructure.
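To make the metrics-driven dispatching idea concrete, here is a minimal sketch in Java. The `Endpoint` record, its fields, and the `select` method are all hypothetical names invented for illustration; the actual WebSphere Compute Grid autonomics are far richer than a single comparator. The sketch only shows the core idea: apply a dispatch policy as a filter (here, 64-bit JVMs only), then choose the least-loaded eligible endpoint from the reported metrics.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of metrics-based job dispatch; not the real
// WebSphere Compute Grid API. It illustrates filtering endpoints by a
// dispatch policy, then picking the least-loaded one.
public class DispatchSketch {

    // Capacity/execution metrics a batch endpoint might report.
    record Endpoint(String name, double cpuUtilization, int activeJobs, boolean is64Bit) {}

    // Dispatch policy: require a 64-bit JVM, then pick the endpoint with
    // the lowest CPU utilization (ties broken by fewest active jobs).
    static Endpoint select(List<Endpoint> endpoints) {
        return endpoints.stream()
                .filter(Endpoint::is64Bit)
                .min(Comparator.comparingDouble(Endpoint::cpuUtilization)
                        .thenComparingInt(Endpoint::activeJobs))
                .orElseThrow(() -> new IllegalStateException("no eligible endpoint"));
    }

    public static void main(String[] args) {
        List<Endpoint> grid = List.of(
                new Endpoint("serverA", 0.85, 12, true),
                new Endpoint("serverB", 0.40, 5, true),
                new Endpoint("serverC", 0.10, 1, false)); // excluded: 32-bit
        System.out.println(select(grid).name()); // prints serverB
    }
}
```

In the product, this decision is made continuously by the job-dispatching tier as endpoints report fresh metrics, rather than once per job list as in this sketch.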
Writing multi-threaded code can significantly reduce the elapsed time of a batch job, but maintaining and debugging that code can be expensive. By designing an application that is data-injected (records are fed into the kernel record processor, which is single-threaded and operates on one record at a time), the application can scale dynamically and run in parallel without imposing the parallel processing details on the developer.

As Figure2_ParallelBatchExample illustrates, the input data stream (Input DS) can be designed to accept parameters that dictate the range of data to retrieve. In this example, a constraint in the predicate of the SQL query is used, but byte ranges in files, record ranges in fixed-block datasets, and so on, are also possible. An external parallel processing engine can create multiple instances of a batch job; in this case, the Parallel Job Manager component of WebSphere Compute Grid applies a partitioning algorithm that determines the range of data each job instance will operate on. To maximize throughput, the partitioning strategy for a batch job should match how the data has been partitioned. Moreover, the partitioning strategy can be independent of the application: the developer is not responsible for determining how many instances (or threads) to execute, nor what data each thread will operate on. By shifting this burden to the infrastructure, workload management data, capacity metrics, service-level agreements, and database optimizations can be used to determine the ideal concurrency setting at a specific moment.
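The partitioning step described above can be sketched as follows. This is not the Parallel Job Manager's actual SPI; the `Partition` record, the `partition` method, and the `ACCOUNT_ID` column are hypothetical names used for illustration. The sketch shows how an infrastructure component could split a key range into contiguous sub-ranges and hand each job instance an SQL predicate, so the single-threaded record processor never needs to know how many siblings are running.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of range partitioning for parallel batch: split a
// key range into near-equal contiguous slices and generate the SQL
// predicate each job instance would use for its input data stream.
public class PartitionSketch {

    record Partition(long low, long high) {
        // Predicate constraining the input data stream to this slice.
        String predicate(String keyColumn) {
            return keyColumn + " >= " + low + " AND " + keyColumn + " <= " + high;
        }
    }

    // Divide [lowKey, highKey] into `count` near-equal partitions.
    static List<Partition> partition(long lowKey, long highKey, int count) {
        List<Partition> parts = new ArrayList<>();
        long total = highKey - lowKey + 1;
        long base = total / count, remainder = total % count;
        long next = lowKey;
        for (int i = 0; i < count; i++) {
            long size = base + (i < remainder ? 1 : 0); // spread the remainder
            parts.add(new Partition(next, next + size - 1));
            next += size;
        }
        return parts;
    }

    public static void main(String[] args) {
        // e.g. keys 1..1000 split across 4 job instances
        for (Partition p : partition(1, 1000, 4)) {
            System.out.println(p.predicate("ACCOUNT_ID"));
        }
        // first line: ACCOUNT_ID >= 1 AND ACCOUNT_ID <= 250
    }
}
```

In practice the partition count would come from the workload management and capacity metrics mentioned above rather than a hard-coded value, and the ranges would ideally align with the database's own partitioning so each instance stays data-local.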
The papers excerpted above:
- Batch Processing with WebSphere Compute Grid: Delivering Business Value
- Designing Batch Applications
- Data-intensive Processing with WebSphere