Perspectives on grid: Application programming and extreme-scale data infrastructure support in WebSphere Extended Deployment

A closer look at application patterns

Level: Intermediate

Matt Haynos (mph@us.ibm.com), Product Manager, WebSphere Extended Deployment, IBM 

12 Jun 2007

IBM® WebSphere® Extended Deployment (Extended Deployment) is infrastructure software for application servers, which many associate with the ability to optimize and lower the costs of an application infrastructure, prioritize resources for use by the most important applications, and improve availability. It does this via virtualization and intelligent workload management. But WebSphere Extended Deployment also contains capabilities to effectively manage large amounts of data in extreme-scale data fabrics via ObjectGrid and to run additional application types beyond OLTP. Here, we take a closer look at WebSphere Extended Deployment's ObjectGrid and Java™ batch programming support.

Introduction

Application servers continue to be an important part of the enterprise IT landscape. With the maturation of application servers, their numbers continue to grow. With this growth comes a struggle to use and manage them efficiently. Servers are often added to run constrained applications even though spare server capacity exists, because that capacity is dedicated to other lower-priority or lightly used applications.

Add different types of application servers — you might, for example, add open source application servers or get new application servers through acquisition — and the situation gets even worse. The result can be low utilization, missed business opportunities, and infrastructure inefficiencies. Combine these results with the needs for essential management and qualities of service, plus the need to be able to increase application performance in the face of growing data volumes, and you'll likely discover that you need new application infrastructure capabilities.

IBM WebSphere Extended Deployment is not an application server, but it’s often characterized as one. It's software that complements application servers and execution environments to provide advanced qualities of service, infrastructure for dealing with growing data volumes, and application patterns beyond online transaction processing (OLTP).

In a previous article (see Resources), I said WebSphere Extended Deployment is an example of an integrated grid platform. It contains three components: Operations Optimization, Data Grid, and Compute Grid. In V6.1, it works not only with the WebSphere Application Server but also with BEA WebLogic, WebSphere Community Edition, JBoss, and Apache Tomcat. This is an important direction for WebSphere Extended Deployment, because many people think of it as applicable only to the WebSphere Application Server.

Operations Optimization helps prioritize work and applications, manage applications and servers, and drive up utilization of application infrastructures through virtualization and intelligent workload management. It contains integrated health and operational management capabilities. Its name implies a lot about what it does. Data Grid helps deal with unrelenting volumes of data (up to terabytes): you develop applications, and Data Grid takes care of everything else (scalability, performance, and high availability). Compute Grid provides facilities for developing and running application types beyond OLTP on existing application infrastructures. One of these application types, Java batch, is particularly attractive in mainframe environments. We'll talk about the growing trend to use Java for batch programming and the benefits it can provide.

The main focus of this article is to dive deeper into both the Data Grid and Compute Grid components of WebSphere Extended Deployment. In particular, we'll take a closer look at the ObjectGrid component of Data Grid and concentrate on the Java batch support in Compute Grid. I'll focus on the benefits and the infrastructure provided by Data Grid and Compute Grid, and introduce, at a high level, the programming models. In successive articles, I'll provide a deeper treatment of the programming models, including code examples.





ObjectGrid: Advanced data infrastructure support

Data volumes continue to grow exponentially. In many cases, the volume of data outstrips the ability of applications and the underlying infrastructure to deal with it. What usually results is slowing or inconsistent application performance. Traditional “scale-up” infrastructure approaches can be expensive and often remedy performance issues only temporarily. What's required is a different approach. ObjectGrid represents an increasingly popular infrastructure technique for dealing with unrelenting growth of data volumes.

ObjectGrid is a flexible infrastructure for realizing high-performance, scalable, data-intensive applications (see Figure 1). It can be used in many ways, from a simple cache to an extreme-scale data grid where data is striped and managed across hundreds or thousands of data servers. In a previous article, I provided a sampling of some of these configurations. ObjectGrid has been designed to support a wide variety of usage scenarios.


Figure 1. ObjectGrid: Support for flexible data infrastructures

Extreme-scale data infrastructures like ObjectGrid represent an interesting trend in data management: they flip the traditional "application-first" mentality around. Traditionally, applications were the initial, or primary, focus, while data access and management were secondary. This is changing. Now, because the challenges associated with data volume growth are so acute, the first priority is to ensure a robust and scalable infrastructure for dealing with data. Then applications can be built around it.

Further, when information is spread across the memory provided by many distributed servers, the traditional role of persistent storage (in databases, for example) is called into question. Certainly, the permanent storage of information will always be required, but because information is usually held redundantly (using replication) in more than one location in the data grid, it can always be accessed as long as a representative set of servers is running. The one problem with relying on memory alone, and the reason permanent storage is still required, is that it doesn't adequately deal with the catastrophic failure of the entire data grid.





ObjectGrid infrastructure and qualities of service

You can use ObjectGrid as a simple cache or in a tiered-cache environment where a subset of data is accessed in a very large data cache, perhaps with a small grouping of servers. But we'll concentrate on larger, more distributed, and more interesting ObjectGrid configurations. Let's take a look at the infrastructure characteristics and the significant qualities of service provided by ObjectGrid.

From an infrastructure perspective, ObjectGrid consists of a set of servers that host data. Often, this data is referred to as mapsets, and the general presumption is that the data is stored in memory. Sometimes it's impossible to store all the requisite data in memory due to sheer size, so eviction rules are often specified to indicate when data should be removed from the memory cache and either discarded or written to permanent storage.
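
To make the idea of an eviction rule concrete, here is a minimal sketch in plain Java. It is not the ObjectGrid evictor API; the class name, the capacity parameter, and the backing-store map that stands in for permanent storage are all assumptions made for illustration. It shows one common rule, least recently used (LRU): when the in-memory map grows past a limit, the entry that was touched longest ago is removed and, optionally, written elsewhere.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: this is NOT the ObjectGrid evictor API.
// It models an LRU eviction rule on top of a standard LinkedHashMap.
class LruEvictionSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;          // assumed capacity limit for the in-memory cache
    private final Map<K, V> backingStore;  // stands in for permanent storage; may be null

    LruEvictionSketch(int maxEntries, Map<K, V> backingStore) {
        super(16, 0.75f, true);            // accessOrder=true keeps entries in LRU order
        this.maxEntries = maxEntries;
        this.backingStore = backingStore;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() <= maxEntries) {
            return false;
        }
        if (backingStore != null) {
            // Write the evicted entry to "permanent storage" instead of discarding it.
            backingStore.put(eldest.getKey(), eldest.getValue());
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        LruEvictionSketch<String, String> cache = new LruEvictionSketch<>(2, store);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");       // touch "a" so that "b" becomes the least recently used entry
        cache.put("c", "3");  // exceeds the limit, so "b" is evicted to the backing store
        System.out.println("in memory: " + cache.keySet() + ", evicted: " + store.keySet());
    }
}
```

Passing null for the backing store gives the other behavior described above, where evicted data is simply discarded.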

ObjectGrid can run in any J2SE-compliant container. It's very lightweight and has a small footprint (20 MB). It's usually embedded within a container, such as the WebSphere Application Server or JBoss, and because it requires only J2SE, it can run with a wide variety of Java hosting environments and application servers.

The mapsets have a data layout or schema associated with them. An ObjectGrid data grid can manage multiple mapsets. In ObjectGrid V6.1, mapsets can be dynamically modified. What this means is that new fields can be added to the schema, and applications that are executing can continue to run, although they obviously won't be aware of a new field until they are modified. This is extremely important because sometimes it's challenging or impossible to stop applications using the data grid.

Further, all updates to data in a partition, hosted on a server, are done with transactional integrity. In addition, the creation of replicas can be done with transactional semantics. ObjectGrid supports a single-phase commit transactional protocol, and it does so to ensure high levels of performance and scalability.

Availability

ObjectGrid supports asynchronous and synchronous replication of mapsets. Replicating data and managing the replicas is an important capability of ObjectGrid that supports high availability and resiliency. Also, data servers act in a self-sufficient manner; upon startup, they contact a catalog service once to obtain bootstrap information. So the ObjectGrid architecture is inherently resilient. Below, we'll look at how replication is specified in the configuration file.

Performance

ObjectGrid continually rebalances and optimizes data placement across available servers to ensure optimal performance. Data is located in memory, and you can co-locate applications to realize extremely high levels of application performance.

Scalability

As data volumes grow, you scale simply by adding servers to the data grid. When a new server is added, ObjectGrid puts it to use by moving data to it and optimizing the layout of data across all of the servers in the data grid. What this means is that application response times stay consistent as data volumes increase, which isn't normally the case.
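
The following sketch shows one way data can be moved onto a newly added server. It is a simplified model under stated assumptions, not ObjectGrid's actual placement algorithm: I assume keys hash to a fixed number of partitions, and that a new server takes partitions from the busiest existing servers until it holds its fair share. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of partition placement; NOT ObjectGrid's real algorithm.
class PartitionPlacementSketch {
    private final String[] owner;                       // owner[p] = server hosting partition p
    private final List<String> servers = new ArrayList<>();

    PartitionPlacementSketch(int partitionCount) {
        owner = new String[partitionCount];
    }

    // A key always maps to the same partition, no matter how many servers exist.
    int partitionFor(Object key) {
        return Math.floorMod(key.hashCode(), owner.length);
    }

    String serverFor(Object key) {
        return owner[partitionFor(key)];
    }

    // Number of partitions currently hosted by the given server.
    int load(String server) {
        int count = 0;
        for (String s : owner) {
            if (server.equals(s)) {
                count++;
            }
        }
        return count;
    }

    // Adds a server and rebalances; returns how many partitions had to move.
    int addServer(String name) {
        servers.add(name);
        if (servers.size() == 1) {
            Arrays.fill(owner, name);                   // the first server hosts everything
            return 0;
        }
        int fairShare = owner.length / servers.size();
        int moved = 0;
        while (load(name) < fairShare) {
            String busiest = servers.get(0);
            for (String s : servers) {
                if (load(s) > load(busiest)) {
                    busiest = s;
                }
            }
            for (int p = 0; p < owner.length; p++) {    // move one partition off the busiest server
                if (owner[p].equals(busiest)) {
                    owner[p] = name;
                    moved++;
                    break;
                }
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        PartitionPlacementSketch grid = new PartitionPlacementSketch(12);
        grid.addServer("serverA");
        grid.addServer("serverB");
        int moved = grid.addServer("serverC");
        System.out.println("partitions moved when serverC joined: " + moved);
        System.out.println("key Kevin is hosted on " + grid.serverFor("Kevin"));
    }
}
```

The point of the design is that only a fraction of the partitions move when a server joins, and applications keep addressing data by key without knowing where it lives.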





The ObjectGrid configuration file

Figure 2 shows a visual representation of the ObjectGrid application configuration file. This file tells ObjectGrid how to manage the application; based on that configuration, ObjectGrid does everything else automatically.

What's really cool about ObjectGrid from an infrastructure perspective is that it automatically manages all of the qualities of service we outlined. You focus on the application concerns and ObjectGrid takes care of the infrastructure concerns. This is important given that a lot of planning and effort usually goes into dealing with and managing scalable and resilient data architecture. You can think of ObjectGrid as providing Google-like capabilities (support for map/reduce, for example) to your infrastructure.


Figure 2. The ObjectGrid application configuration file





Application programming in ObjectGrid

So, what does an ObjectGrid application look like? Figure 3 shows a visual representation of how an ObjectGrid application is packaged.


Figure 3. The anatomy of an ObjectGrid application

ObjectGrid supports two programming styles: the traditional JCache-style Map API and, with V6.1, the EntityManager API. The EntityManager API is more streamlined and offers a conceptually simpler programming model for in-memory data management than the lower-level Map API.

The Map API is based on the Java Map interface, extended to allow operations to be grouped into transactional blocks. It allows a set of keywords to be associated with a given key.


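Listing 1. Mapping keywords
                
// sess is an ObjectGrid Session; mapA is an ObjectMap obtained from it
sess.begin();                            // start the transactional block
mapA.insert("Kevin", someValue);
mapA.update("Perry", someOtherValue);
List i = new ArrayList();                // keys to fetch in a single call
i.add("Raj");
i.add("Ken");
List l = mapA.getAll(i);                 // one value per requested key
sess.commit();                           // commit everything since begin()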
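
To illustrate what grouping map operations into a transactional block means, here is a small self-contained model in plain Java. It does not use the ObjectGrid classes and is not how ObjectGrid is implemented; the class TxMapSketch and its buffering strategy are assumptions chosen to keep the sketch short. Changes made between begin() and commit() are held aside and only become visible in the committed data when the transaction commits; rollback() throws them away.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a transactional map; NOT the ObjectGrid Map API.
class TxMapSketch {
    private final Map<String, Object> committed = new HashMap<>();
    private Map<String, Object> pending;   // null when no transaction is active

    void begin() {
        if (pending != null) {
            throw new IllegalStateException("transaction already active");
        }
        pending = new HashMap<>();
    }

    // insert() requires that the key does not exist yet.
    void insert(String key, Object value) {
        requireTransaction();
        if (committed.containsKey(key) || pending.containsKey(key)) {
            throw new IllegalStateException("duplicate key: " + key);
        }
        pending.put(key, value);
    }

    // update() requires that the key already exists.
    void update(String key, Object value) {
        requireTransaction();
        if (!committed.containsKey(key) && !pending.containsKey(key)) {
            throw new IllegalStateException("no such key: " + key);
        }
        pending.put(key, value);
    }

    // Reads see the transaction's own uncommitted changes first.
    Object get(String key) {
        if (pending != null && pending.containsKey(key)) {
            return pending.get(key);
        }
        return committed.get(key);
    }

    void commit() {
        requireTransaction();
        committed.putAll(pending);         // apply all buffered changes together
        pending = null;
    }

    void rollback() {
        requireTransaction();
        pending = null;                    // discard all buffered changes
    }

    private void requireTransaction() {
        if (pending == null) {
            throw new IllegalStateException("no active transaction");
        }
    }

    public static void main(String[] args) {
        TxMapSketch map = new TxMapSketch();
        map.begin();
        map.insert("Kevin", "draft");
        map.rollback();                    // "Kevin" never becomes visible
        map.begin();
        map.insert("Kevin", "final");
        map.commit();
        System.out.println("Kevin -> " + map.get("Kevin"));
    }
}
```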

The EntityManager API allows graphs of objects to be annotated with metadata and be both read from and written to the data grid. Each entity/object in the graph corresponds to a map, and ObjectGrid automatically maintains relationships and detects changes to those objects. You simply indicate to persist the graph to the data grid or retrieve the graph from the data grid. This is much more abstract and easier to use than the Map API.





Compute Grid

Now, let's turn our attention to Compute Grid. Compute Grid is the package in WebSphere Extended Deployment for developing, executing, and managing asynchronous application types. By "asynchronous," we mean work that does not run online; such programs are often referred to as batch programs. But the term batch has certain connotations, so we'll refer to them as asynchronous.

Traditionally, application server infrastructures are dedicated to OLTP workload. Compute Grid is intended to broaden the type of applications you can develop and run beyond OLTP and to make the programming models very easy to use. Compute Grid supports three types of application patterns: Java batch, compute-intensive, and native execution. Java batch is the one we want to emphasize here, particularly its utility on mainframe environments using z/OS® (a significant platform for batch applications).

An important part of Compute Grid is not only the application patterns themselves but the additional infrastructure capabilities it offers. Asynchronous applications are by their very nature different from OLTP applications and require different approaches to executing and managing them. For example, while a long response time would be problematic in an OLTP environment, it's expected of asynchronous applications that are often long-running. So workload management capabilities need to be able to understand this.





Compute Grid application patterns

Java batch

The Java batch application pattern allows the development of traditional record-processing batch applications using Java and an easy-to-use Plain Old Java Object (POJO) programming model. As with ObjectGrid, the value here lies in offloading infrastructure concerns so you can focus on the development of applications. You don't have to worry about check-pointing, managing log files, and so on.

Services include:

  • Check-pointing: The ability to record progress at a selected interval so that batch work can be resumed from the last checkpoint
  • Result processing: The ability to intercept step and job return codes and process them using any Java Enterprise Edition (JEE) facility
  • Batch data stream management: The ability to handle reading and positioning data streams to files, relational databases, and many other input/output sources
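
The check-pointing service is easier to picture with a small example. The sketch below is plain Java and does not use the Compute Grid APIs; the RecordProcessor interface, the interval parameter, and the in-memory checkpoint are assumptions for illustration (a real container would persist the checkpoint). It processes records in order, records its position every few records, and after a failure resumes from the last recorded position rather than from the beginning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of check-pointing; NOT the Compute Grid API.
class CheckpointSketch {
    // Assumed callback that handles a single record and may throw on failure.
    interface RecordProcessor {
        void process(String record);
    }

    private int lastCheckpoint = 0;   // index of the first record not covered by a checkpoint

    int lastCheckpoint() {
        return lastCheckpoint;
    }

    // Processes records starting at the last checkpoint, saving progress every 'interval' records.
    void run(List<String> records, int interval, RecordProcessor processor) {
        int position = lastCheckpoint;
        while (position < records.size()) {
            processor.process(records.get(position));   // a failure here leaves lastCheckpoint untouched
            position++;
            if (position % interval == 0) {
                lastCheckpoint = position;
            }
        }
        lastCheckpoint = position;                      // final checkpoint at end of input
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6");
        List<String> processed = new ArrayList<>();
        boolean[] failedOnce = {false};
        RecordProcessor flaky = record -> {
            if (record.equals("r5") && !failedOnce[0]) {
                failedOnce[0] = true;
                throw new IllegalStateException("simulated failure at " + record);
            }
            processed.add(record);
        };

        CheckpointSketch job = new CheckpointSketch();
        try {
            job.run(records, 2, flaky);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + "; restarting at record index " + job.lastCheckpoint());
        }
        job.run(records, 2, flaky);                     // resumes from the checkpoint, not from r0
        System.out.println(processed);
    }
}
```

Note that the record processed after the last checkpoint but before the failure is handled a second time on restart. In a transactional environment that work would have been rolled back, which is why checkpoint intervals are a trade-off between overhead and the amount of work redone.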

Java as a batch programming language has some interesting benefits. These are particularly attractive on mainframe (z/OS) environments. First, by standardizing on a single language and development environment across OLTP and batch programming, efficiencies are realized. It's just far simpler to support a single language and associated tooling across what was traditionally two worlds.

Related to this, and a very interesting trend, is that organizations can now move toward writing a single code base for both OLTP and batch environments. Traditionally, these two worlds have been distinct. But the use of a single language (the Java language) allows a move toward unification of these two environments. What I mean by unification is a single code base, sometimes used in a services context, and sometimes used in an asynchronous context. Certainly, there are considerations when doing this — architecture and performance, for example — but the usage of Java in both environments facilitates this trend.

Further, we've seen a number of examples of customers (mostly on z/OS) who, wanting to use Java as their batch development language, have pieced together architectural elements — message queues, for example — to realize their goals. Often, these approaches are cumbersome and inefficient. The Java batch support in Compute Grid provides an entire infrastructure framework for developing and executing Java batch programs with integrated qualities of service, and in many cases, this is far more efficient from both a cost and performance standpoint.

In Figure 4, we see a high-level representation of steps (or methods) developers write when creating Java batch applications. The main method here is processJobStep(), and you can see the simple, yet powerful, nature of the programming approach. Further, packaging and deployment use standard Java and JEE practices.


Figure 4. The Java batch programming framework

Compute-intensive

The compute-intensive application pattern is for high-performance divide-and-conquer types of applications, such as those seen in financial services (portfolio analysis, portfolio optimization, or risk mitigation) or in drug discovery (compound analysis). As multicore architectures and, in turn, parallel processing become increasingly popular, these types of applications are becoming more commonplace.
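
As a rough picture of the divide-and-conquer style, the sketch below splits a calculation into chunks, runs the chunks in parallel, and combines the partial results. It uses only the standard java.util.concurrent library rather than the Compute Grid programming model, and the "total exposure" calculation and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical divide-and-conquer example using the standard Java concurrency library.
class DivideAndConquerSketch {
    // Sums the positions by splitting them into at most 'chunks' slices processed in parallel.
    static double totalExposure(double[] positions, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Double>> partials = new ArrayList<>();
            int chunkSize = (positions.length + chunks - 1) / chunks;   // ceiling division
            for (int start = 0; start < positions.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, positions.length);
                Callable<Double> task = () -> {
                    double sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += positions[i];
                    }
                    return sum;
                };
                partials.add(pool.submit(task));                        // divide
            }
            double total = 0;
            for (Future<Double> partial : partials) {
                total += partial.get();                                 // combine
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] positions = {100.0, 250.0, 75.0, 300.0, 125.0};
        System.out.println("total exposure: " + totalExposure(positions, 2));
    }
}
```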

Native execution

Native execution isn't really an application pattern, because there isn't a programming model associated with it. Rather, it's a way to specify a job whose constituent parts can be any executable program. So it's an easy way to run non-Java programs (really, any executable) on your application infrastructure.





Compute Grid infrastructure

There are two key elements to the Compute Grid infrastructure. First is the notion of a job. To execute an application using Compute Grid, a job must be described. This is done through a job control language called xJCL. The job is then submitted to the second element of the Compute Grid infrastructure: the job scheduler.

xJCL

xJCL is the Compute Grid job control language (JCL). The format of xJCL is XML. It's similar to the z/OS JCL, hence the name. In xJCL, at a minimum the job type (Java batch, for example) and the job steps need to be specified. In each of the job steps, XML elements and attributes specify things like the executable name, environment variables, and arguments to pass to the executable. We won't present an exhaustive treatment of the xJCL syntax, but Listing 2 provides an xJCL snippet as an introduction.


Listing 2. xJCL job control language
                
<?xml …>
<job name="Batch">
    <!-- step name is illustrative -->
    <job-step name="Step1">
        <env-entries>
            <env-var name="PATH" value="…"/>
            <env-var name="CLASSPATH" value="…"/>
        </env-entries>
        <exec executable="java">
            <arg line="tryit"/>
        </exec>
    </job-step>
</job>

Job scheduler

The Compute Grid job scheduler is responsible for submitting the work specified by xJCL for execution. It works in conjunction with WebSphere Extended Deployment's management of OLTP workload and service policy to optimize OLTP and Compute Grid work across a shared infrastructure. A server can execute OLTP and Compute Grid workloads concurrently.

Some people might want to replace the WebSphere Extended Deployment job scheduler with another scheduler. This is relevant if you already have an existing scheduling solution in place, such as Tivoli® Workload Scheduler or Platform Computing LSF. These schedulers can then submit, monitor, and control Compute Grid jobs and integrate them into an enterprise or more global workload schedule. Further, there are facilities to make the Compute Grid job scheduler highly available using traditional clustering techniques.

Job classes and classification rules

Asynchronous applications are by their very nature quite different from online applications. One concern is resource consumption. Because asynchronous applications usually run for a longer time than online transactions and there is no interactivity, it's important to provide controls that ensure that a job does not potentially consume more resources than it should.

Therefore, Compute Grid contains two elements — job classes and classification rules — that provide important capabilities for administering Compute Grid jobs. Job classes provide administrative control over resource consumption and are defined via a scheduler configuration. Job classes are named policies that control:

  • Maximum execution time
  • Maximum number of concurrent jobs per endpoint
  • Maximum job log size
  • Job log retention
  • Execution record retention

The job class for a job is assigned via a class= keyword in xJCL.

You can see from these policies the type of administrative control WebSphere Extended Deployment Compute Grid provides.

Classification rules provide administrative rules for service policy assignment. They are also defined via a scheduler configuration and are represented as an ordered list of rules evaluated in the specified order, where the first match assigns the service policy. Each rule is a Boolean expression formed using:

  • Job name
  • Job class
  • Submitter identity, group
  • Job type (batch, for example)
  • Time, date
  • Platform (z/OS, for example)
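
The first-match behavior of classification rules can be sketched in a few lines of plain Java. This is not how Compute Grid stores or evaluates its rules; the Job fields, the policy names, and the use of a default policy when nothing matches are assumptions for illustration. Each rule pairs a Boolean condition over job attributes with a service policy, and the rules are tried in order until one matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of ordered, first-match classification rules; NOT the Compute Grid API.
class ClassificationSketch {
    // A few of the job attributes a rule can test.
    static final class Job {
        final String name;
        final String jobClass;
        final String platform;

        Job(String name, String jobClass, String platform) {
            this.name = name;
            this.jobClass = jobClass;
            this.platform = platform;
        }
    }

    private static final class Rule {
        final Predicate<Job> condition;
        final String servicePolicy;

        Rule(Predicate<Job> condition, String servicePolicy) {
            this.condition = condition;
            this.servicePolicy = servicePolicy;
        }
    }

    private final List<Rule> rules = new ArrayList<>();   // kept in the order they were added
    private final String defaultPolicy;                   // assumed fallback when no rule matches

    ClassificationSketch(String defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    ClassificationSketch addRule(Predicate<Job> condition, String servicePolicy) {
        rules.add(new Rule(condition, servicePolicy));
        return this;
    }

    // Evaluates the rules in order; the first one whose condition holds assigns the policy.
    String classify(Job job) {
        for (Rule rule : rules) {
            if (rule.condition.test(job)) {
                return rule.servicePolicy;
            }
        }
        return defaultPolicy;
    }

    public static void main(String[] args) {
        ClassificationSketch classifier = new ClassificationSketch("DefaultPolicy")
            .addRule(j -> j.platform.equals("z/OS") && j.jobClass.equals("nightly"), "GoldPolicy")
            .addRule(j -> j.name.startsWith("report"), "BronzePolicy");

        // This job satisfies both rules; the earlier rule wins.
        System.out.println(classifier.classify(new Job("reportSales", "nightly", "z/OS")));
    }
}
```

Because the first match wins, the order of the list matters: more specific rules belong before more general ones.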

z/OS enhancements

We noted earlier the particular utility of Java batch for mainframe batch environments. Compute Grid also includes enhancements that integrate it with native z/OS workload management (WLM) to improve job execution and management:

  • SMF accounting records for batch jobs — SMF 120 (JEE) records tailored to jobs that include the job ID, user and CPU time
  • Dynamic servants for batch job dispatch — Exploits z/OS WLM to start new servants to execute batch jobs on demand
  • Service policy classification and delegation — Uses the Compute Grid job classification to select the z/OS service class by propagating the transaction class from the scheduler to the z/OS application server for job registration with z/OS WLM

Conclusion

In this article, we've taken a closer look at the application patterns supported by WebSphere Extended Deployment and the infrastructure capabilities behind them. We’ve concentrated on WebSphere Extended Deployment's ObjectGrid and support for Java batch. In upcoming articles, we'll cover the associated programming models more extensively and with code examples.



Resources

Learn

Get products and technologies

Discuss


About the author

Matt Haynos

Matt Haynos is the product manager for IBM WebSphere Extended Deployment. Before that, he was on IBM's grid computing team since its inception as an emerging business opportunity, and he had various responsibilities covering a broad range of initiatives and strategies related to building IBM's grid computing business. He has held a variety of technical and managerial positions within IBM in the application development, program direction, and business development areas. He holds a bachelor's degree in computer science/applied mathematics and cognitive science, with honors, from the University of Rochester; and a master's degree in computer science from the University of Vermont. He lives with his wife and two sons in Connecticut.



