Level: Intermediate
Matt Haynos (mph@us.ibm.com), Product Manager, WebSphere Extended Deployment, IBM
12 Jun 2007
IBM® WebSphere® Extended Deployment (Extended Deployment) is infrastructure software for application servers,
which many associate with the ability to optimize and lower the costs of an application infrastructure, prioritize resources for use by the most
important applications, and improve availability. It does this via virtualization and intelligent workload management. But
WebSphere Extended Deployment also contains capabilities to effectively manage large amounts of data in extreme-scale data fabrics via
ObjectGrid and to run additional application types beyond OLTP. Here, we take a closer look at WebSphere Extended Deployment's ObjectGrid and Java™ batch
programming support.
Introduction
Application servers continue to be an important part of the enterprise IT landscape. With the maturation of application servers,
their numbers continue to grow. With this growth comes a struggle to use and manage them efficiently. Servers are often added to run
constrained applications, and while the server capacity might exist, it's dedicated to other lower-priority or lower-usage applications.
Add different types of application servers — you might, for example, add open source application servers or get new
application servers through acquisition — and the situation gets even worse. The result can be low utilization, missed
business opportunities, and infrastructure inefficiencies. Combine these results with the needs for essential management and qualities of
service, plus the need to be able to increase application performance in the face of growing data volumes, and you'll likely discover that you
need new application infrastructure capabilities.
IBM WebSphere Extended Deployment is not an application server, but it’s often characterized as one. It's software that complements application
servers and execution environments to provide advanced qualities of service, infrastructure for dealing with growing data volumes, and
application patterns beyond online transaction processing (OLTP).
In a previous article (see Resources), I said WebSphere Extended Deployment is an example of an integrated
grid platform. It contains three components: Operations Optimization, Data Grid, and Compute Grid. In V6.1, it works not only with the
WebSphere Application Server but also with BEA WebLogic, WebSphere Community Edition, JBoss, and Apache Tomcat. This is an
important direction for WebSphere Extended Deployment, because many people assume that it applies only to the
WebSphere Application Server.
Operations Optimization helps prioritize work and applications, manage applications and servers, and drive up utilization of application
infrastructures through virtualization and intelligent workload management. It contains integrated health and operational management capabilities.
Its name implies a lot about what it does. Data Grid helps deal with
unrelenting volumes of data (up to terabytes) — you develop applications, Data Grid takes care of everything else (scalability,
performance, and high availability). Compute Grid provides facilities for developing and running application types beyond OLTP on existing
application infrastructures. One of these application types, Java batch, is particularly attractive
in mainframe environments. We'll talk about the growing trend to use Java for batch programming and the benefits it can provide.
The main focus of this article is to dive deeper into both the Data Grid and Compute Grid components of WebSphere Extended Deployment.
In particular, we'll take a closer look at the ObjectGrid component of Data Grid and concentrate on the Java batch support in
Compute Grid. I'll focus on the benefits and the infrastructure provided by Data Grid and Compute Grid, and introduce, at a high level, the
programming models. In successive articles, I'll provide a deeper treatment of the programming models, including code examples.
ObjectGrid: Advanced data infrastructure support
Data volumes continue to grow exponentially. In many cases, the volume of data outstrips the ability of applications and the underlying
infrastructure to deal with it. What usually results is slowing or inconsistent application performance. Traditional “scale-up” infrastructure
approaches can be expensive and often remedy performance issues only temporarily. What's required is a different approach. ObjectGrid
represents an increasingly popular infrastructure technique for dealing with unrelenting growth of data volumes.
ObjectGrid is a flexible infrastructure for realizing high-performance, scalable, data-intensive applications (see Figure 1). It can be used in
very flexible ways, from a simple cache to an extreme-scale data grid where data is striped — and managed —
across hundreds or thousands of data servers. In a previous article, I provided a sampling of some of these configurations. ObjectGrid has been
designed to support a wide variety of usage scenarios.
Figure 1. ObjectGrid: Support for flexible data infrastructures
Extreme-scale data infrastructures like ObjectGrid reflect an interesting trend in data management: they flip the traditional "application-first"
mentality around. Traditionally, applications were the initial, or primary, focus, while data access and management were secondary. This is
changing. Now, because the challenges associated with data volume growth are so acute, the first priority is to ensure a robust and scalable
infrastructure for dealing with data. Then applications can be built around it.
Further, when information is spread across the memory provided by many distributed servers, the traditional role of persistent storage
(in databases, for example) is called into question. Certainly, the permanent storage of information will always be required, but because
information is usually held redundantly — using replication — in more than one location in the data grid,
information can always be accessed as long as a representative set of servers is running. One problem with this approach,
and the reason permanent storage of information is required, is that it doesn't adequately deal with the catastrophic failure
of the entire data grid.
ObjectGrid infrastructure and qualities of service
You can use ObjectGrid as a simple cache or in a tiered-cache environment where a subset of data is accessed in a very large data
cache, perhaps with a small grouping of servers. But we'll concentrate on larger, more distributed, and more interesting ObjectGrid configurations. Let's take a look
at the infrastructure characteristics and the significant qualities of service provided by ObjectGrid.
From an infrastructure perspective, ObjectGrid consists of a set of servers that host data. Often, this data is referred to as mapsets,
and the general presumption is that the data is stored in memory. Sometimes it's impossible to store all the requisite data in
memory due to sheer size, so eviction rules are often specified to indicate when data should be removed from the memory cache and either
discarded or written to permanent storage.
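To make the idea of an eviction rule concrete, here is a minimal sketch of a least-recently-used policy built on the standard JDK LinkedHashMap. It is not ObjectGrid's evictor API. The class name LruCache and the onEvict callback (which stands in for "write to permanent storage") are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical in-memory cache that evicts the least recently used entry. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;
    private final BiConsumer<K, V> onEvict; // e.g. write the entry to a database

    LruCache(int maxEntries, BiConsumer<K, V> onEvict) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
        this.onEvict = onEvict;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked by put(); returning true removes the eldest entry.
        boolean evict = size() > maxEntries;
        if (evict) {
            onEvict.accept(eldest.getKey(), eldest.getValue());
        }
        return evict;
    }
}
```

Passing a callback that does nothing gives the "discard" behavior; passing one that persists the entry gives the "write to permanent storage" behavior.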
ObjectGrid can run in any J2SE-compliant container, is very lightweight, and has a small footprint (20 MB). It's usually embedded in a
container, such as the WebSphere Application Server or JBoss, and because it requires only J2SE, it can run with a wide variety of Java hosting
environments and application servers.
The mapsets have a data layout or schema associated with them. An ObjectGrid data grid can manage multiple mapsets. In ObjectGrid V6.1,
mapsets can be dynamically modified. What this means is that new fields can be added to the schema, and applications that are
executing can continue to run, although they obviously won't be aware of the new field until they are modified. This is extremely important
because sometimes it's challenging or impossible to stop applications using the data grid.
Further, all updates to data in a partition, hosted on a server, are done with transactional integrity. In addition, the creation of replicas can
be done with transactional semantics. ObjectGrid supports a single-phase commit transactional protocol, and it does so to ensure high levels
of performance and scalability.
Availability
ObjectGrid supports asynchronous and synchronous replication of mapsets. Replication of data and the management of replicas is an
important capability of ObjectGrid to support high availability and resiliency. Also, data servers act in a self-sufficient manner; upon startup,
they contact a catalog service once to obtain bootstrap information. So the ObjectGrid architecture is inherently resilient. Below, we'll look at
how replication is specified in the configuration file.
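The difference between the two replication modes can be sketched in a few lines of plain Java. This is only an illustration of the concept, not ObjectGrid code: the class ReplicatedMap, its fields, and the drainAsync() method are all hypothetical, and plain in-process maps stand in for replicas on other servers.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical primary map with one synchronous and one asynchronous replica. */
class ReplicatedMap {
    final Map<String, String> primary = new HashMap<>();
    final Map<String, String> syncReplica = new HashMap<>();
    final Map<String, String> asyncReplica = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>();

    void put(String key, String value) {
        primary.put(key, value);
        // Synchronous replication: the replica is updated before put() returns.
        syncReplica.put(key, value);
        // Asynchronous replication: the change is queued and applied later.
        pending.add(new String[] {key, value});
    }

    /** Applies queued changes to the asynchronous replica, in order. */
    void drainAsync() {
        String[] change;
        while ((change = pending.poll()) != null) {
            asyncReplica.put(change[0], change[1]);
        }
    }
}
```

The trade-off the sketch shows is the usual one: a synchronous replica is always current but adds work to every update, while an asynchronous replica keeps updates fast but can briefly lag behind the primary.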
Performance
ObjectGrid continually rebalances and optimizes data placement across available servers to ensure optimal performance. Data is located
in memory, and you can co-locate applications to realize extremely high levels of application performance.
Scalability
As data volumes grow, scalability is realized by simply adding additional servers to the data grid. When new servers are added,
ObjectGrid will use the new server by moving data to it and optimizing the layout of data across all of the servers in the data grid. What
this means is that as data volumes increase, application response times remain consistent, which isn't normally the case.
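As a rough illustration of how data can be spread over servers and re-spread when a server joins, the sketch below hashes keys into a fixed number of partitions and assigns partitions round-robin to the servers currently available. This is an assumed, simplified scheme, not ObjectGrid's actual placement algorithm, and the names PartitionPlacement, partitionFor, and place are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical placement of a fixed set of partitions onto a changing set of servers. */
class PartitionPlacement {
    /** Maps a key to one of partitionCount partitions; the result is never negative. */
    static int partitionFor(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Spreads partitions round-robin over the servers currently in the grid. */
    static Map<String, List<Integer>> place(int partitionCount, List<String> servers) {
        Map<String, List<Integer>> layout = new TreeMap<>();
        for (String server : servers) {
            layout.put(server, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            layout.get(servers.get(p % servers.size())).add(p);
        }
        return layout;
    }
}
```

Calling place() again after adding a server yields a new layout in which each server hosts fewer partitions. A production placement service would also try to minimize how many partitions move, which this sketch does not attempt.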
The ObjectGrid configuration file
Figure 2 depicts a visual representation of the ObjectGrid application configuration file. The application configuration file tells ObjectGrid
how to manage the application, and based on the configuration, it does everything automatically.
What's really cool about ObjectGrid from an infrastructure perspective is that it automatically manages all of the qualities of service we outlined.
You focus on the application concerns and ObjectGrid takes care of the infrastructure concerns. This is important given that a lot of planning
and effort usually goes into dealing with and managing scalable and resilient data architecture. You can think of ObjectGrid as providing
Google-like capabilities (support for map/reduce, for example) to your infrastructure.
Figure 2. The ObjectGrid application configuration file
Application programming in ObjectGrid
So, what does an ObjectGrid application look like? Figure 3 shows a visual representation of how an ObjectGrid application is packaged.
Figure 3. The anatomy of an ObjectGrid application
ObjectGrid supports two programming styles: You can use the traditional JCache-style Map API, or with V6.1, you can use the
EntityManager API. The EntityManager API is more streamlined and has a conceptually simpler programming model for in-memory data
management than the lower-level Map API.
The Map API is based on the Java Map interface, extended to allow operations to be grouped into transactional blocks. It allows a set of keywords
to be associated with a given key.
Listing 1. Mapping keywords
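// Group several map operations into one transactional block.
sess.begin();
mapA.insert("Kevin", someValue);
mapA.update("Perry", someOtherValue);
// Fetch the values for several keys in a single call.
List keys = new ArrayList();
keys.add("Raj");
keys.add("Ken");
List values = mapA.getAll(keys);
sess.commit();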
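To show what grouping map operations into a transactional block means in practice, here is a small self-contained sketch in plain Java. It is not the ObjectGrid Session or ObjectMap implementation. The class TxMap and its methods are hypothetical, and it handles only a single transaction at a time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map whose writes become visible to others only on commit(). */
class TxMap<K, V> {
    private final Map<K, V> committed = new HashMap<>();
    private Map<K, V> pending; // null when no transaction is open

    void begin() {
        if (pending != null) throw new IllegalStateException("transaction already open");
        pending = new HashMap<>();
    }

    void put(K key, V value) {
        requireOpen();
        pending.put(key, value); // buffered, not yet committed
    }

    V get(K key) {
        // An open transaction sees its own uncommitted writes first.
        if (pending != null && pending.containsKey(key)) return pending.get(key);
        return committed.get(key);
    }

    void commit() {
        requireOpen();
        committed.putAll(pending); // all buffered writes are applied together
        pending = null;
    }

    void rollback() {
        requireOpen();
        pending = null; // buffered writes are dropped
    }

    private void requireOpen() {
        if (pending == null) throw new IllegalStateException("no open transaction");
    }
}
```

The point of the sketch is the all-or-nothing behavior: either every write made between begin() and commit() takes effect, or, after rollback(), none of them does.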
The EntityManager API allows graphs of objects to be annotated with metadata and be both read from and written to the data grid. Each
entity/object in the graph corresponds to a map, and ObjectGrid automatically maintains relationships and detects changes to those objects.
You simply indicate to persist the graph to the data grid or retrieve the graph from the data grid. This is much more abstract and easier to use
than the Map API.
Compute Grid
Now, let's turn our attention to Compute Grid. Compute Grid is the package in WebSphere Extended Deployment for developing, executing,
and managing asynchronous application types. By "asynchronous," we mean not online, often referred to as batch programs. But the
term batch has certain connotations, so we'll refer to them as asynchronous.
Traditionally, application server infrastructures are dedicated to OLTP workload. Compute Grid is intended to broaden the type of applications
you can develop and run beyond OLTP and to make the programming models very easy to use. Compute Grid supports three types of application
patterns: Java batch, compute-intensive, and native execution. Java batch is the one we want to emphasize here, particularly its utility on mainframe
environments using z/OS® (a significant platform for batch applications).
An important part of Compute Grid is not only the application patterns themselves but the additional infrastructure capabilities it offers.
Asynchronous applications are by their very nature different from OLTP applications and require different approaches to executing and
managing them. For example, while a long response time would be problematic in an OLTP environment, it's expected of asynchronous
applications that are often long-running. So workload management capabilities need to be able to understand this.
Compute Grid application patterns
Java batch
The Java batch application pattern allows the development of traditional record-processing batch applications using Java and an easy-to-use
Plain Old Java Object (POJO) programming model. As with ObjectGrid, the value here is in offloading infrastructure concerns so you can focus on
the development of applications. You don't have to worry about check-pointing, managing log files, and so on.
Services include:
- Check-pointing: The ability to resume batch work from a checkpoint taken at a selected interval
- Result processing: The ability to intercept step and job return codes and process them using any Java Enterprise Edition (JEE) facility
- Batch data stream management: The ability to handle reading and positioning data streams to files, relational databases, and many other input/output sources
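Of these services, check-pointing is the easiest to picture in code. The following sketch shows the general technique only, under stated assumptions: the class CheckpointedStep is hypothetical, an in-memory field stands in for the durable checkpoint store, and the failAt parameter exists purely to simulate a crash. In Compute Grid the container does this work for you.

```java
import java.util.List;

/** Hypothetical record-processing loop that checkpoints every N records. */
class CheckpointedStep {
    private final int checkpointInterval;
    private int lastCheckpoint = 0; // stands in for a durable checkpoint store

    CheckpointedStep(int checkpointInterval) {
        this.checkpointInterval = checkpointInterval;
    }

    int lastCheckpoint() {
        return lastCheckpoint;
    }

    /**
     * Processes records starting at the last checkpoint and returns how many
     * were handled in this run. Pass a negative failAt to run without failure.
     */
    int run(List<String> records, int failAt) {
        int handled = 0;
        for (int i = lastCheckpoint; i < records.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated failure at record " + i);
            }
            handled++; // real work on records.get(i) would happen here
            if ((i + 1) % checkpointInterval == 0) {
                lastCheckpoint = i + 1; // everything before this index is done
            }
        }
        lastCheckpoint = records.size();
        return handled;
    }
}
```

With an interval of 3 and a failure at record 7, a restart resumes at record 6 rather than at record 0. Record 6 is processed twice, which is why work between checkpoints generally needs to be safe to repeat or be covered by a transaction.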
Java as a batch programming language has some interesting benefits. These are particularly attractive on mainframe (z/OS) environments.
First, by standardizing on a single language and development
environment across OLTP and batch programming, efficiencies are realized. It's just far simpler to support a single language and associated
tooling across what was traditionally two worlds.
Related to this, and a very interesting trend, is that organizations can now move toward writing a single code base for both OLTP and batch
environments. Traditionally, these two worlds have been distinct. But the use of a single language (the Java language) allows a move toward
unification of these two environments. What I mean by unification is a single code base, sometimes used in a services context, and
sometimes used in an asynchronous context. Certainly, there are considerations when doing this — architecture and
performance, for example — but the usage of Java in both environments facilitates this trend.
Further, we've seen a number of examples of customers (mostly on z/OS) who, wanting to use Java as their batch development language, have pieced
together architectural elements — message queues, for example — to realize their goals. Often, these
approaches are cumbersome and inefficient. The Java batch support in Compute Grid provides an entire infrastructure framework for
developing and executing Java batch programs with integrated qualities of service, and in many cases, this is far more efficient from both a
cost and performance standpoint.
In Figure 4, we see a high-level representation of steps (or methods) developers write when creating Java batch applications. The main
method here is processJobStep(), and you can see the simple, yet powerful, nature of the programming approach.
Further, packaging and deployment use standard Java and JEE practices.
Figure 4. The Java batch programming framework
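For readers who prefer code to diagrams, the shape of such a step can be sketched as follows. Only the method name processJobStep() comes from this article. The BatchStep interface, the createJobStep() and destroyJobStep() hooks, the boolean return convention, and the StepRunner class are assumptions made for illustration and should not be read as the product's actual interface.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical step contract; only processJobStep() is named in the article. */
interface BatchStep {
    void createJobStep();     // assumed setup hook
    boolean processJobStep(); // handles one record; false when input is exhausted
    void destroyJobStep();    // assumed cleanup hook
}

/** A POJO step that upper-cases each input record. */
class UppercaseStep implements BatchStep {
    private final List<String> input;
    private final List<String> output;
    private Iterator<String> cursor;

    UppercaseStep(List<String> input, List<String> output) {
        this.input = input;
        this.output = output;
    }

    @Override
    public void createJobStep() {
        cursor = input.iterator();
    }

    @Override
    public boolean processJobStep() {
        if (!cursor.hasNext()) return false;
        output.add(cursor.next().toUpperCase(Locale.ROOT));
        return true;
    }

    @Override
    public void destroyJobStep() {
        cursor = null;
    }
}

/** Stands in for the container: it drives the loop so the step doesn't have to. */
class StepRunner {
    static int run(BatchStep step) {
        step.createJobStep();
        int iterations = 0;
        try {
            while (step.processJobStep()) iterations++;
        } finally {
            step.destroyJobStep();
        }
        return iterations;
    }
}
```

The step itself contains only business logic. Looping, cleanup, and (in a real container) check-pointing and logging live in the runner.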
Compute-intensive
The compute-intensive application pattern is for high-performance divide-and-conquer types of applications, such as those seen in financial
services (portfolio analysis, portfolio optimization, or risk mitigation) or in drug discovery (compound analysis). As multicore architectures and,
in turn, parallel processing become increasingly popular, these types of applications are becoming more commonplace.
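A divide-and-conquer computation of this kind can be sketched with the JDK's standard fork/join framework. This uses plain java.util.concurrent rather than any Compute Grid API, and the class name SumOfSquares, the THRESHOLD value, and the workload itself are assumptions chosen to keep the example short.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Splits an array in half recursively and sums the squares of its elements. */
class SumOfSquares extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this size, compute directly
    private final long[] values;
    private final int from;
    private final int to;

    SumOfSquares(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += values[i] * values[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumOfSquares left = new SumOfSquares(values, from, mid);
        SumOfSquares right = new SumOfSquares(values, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half here, then combine
    }

    static long sum(long[] values) {
        return ForkJoinPool.commonPool().invoke(new SumOfSquares(values, 0, values.length));
    }
}
```

The same split, compute, and combine structure applies when the pieces are distributed across servers instead of across cores.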
Native execution
Native execution isn't really an application pattern. There isn't a programming model associated with it. Rather, it's a way to specify a job
whose constituent parts can be any executable program. So it's an easy way to run non-Java programs (really, any executable) on your application
infrastructure.
Compute Grid infrastructure
There are two key elements to the Compute Grid infrastructure. First is the notion of a job. To execute an application using Compute Grid,
a job must be described. This is done through a job control language called xJCL. The job is then submitted to the second element of the
Compute Grid infrastructure: the job scheduler.
xJCL
xJCL is the Compute Grid job control language (JCL). The format of xJCL is XML. It's similar to the z/OS JCL, hence the name.
In xJCL, at a minimum the job type (Java batch, for example) and the job steps need to be specified. In each of the job steps, XML variables can be
set to specify things like the executable name, environment variables, and arguments to pass to the executable. We won't present
an exhaustive treatment of the xJCL syntax, but Listing 2 provides an xJCL snippet as an introduction.
Listing 2. xJCL job control language
<?xml …>
<job name="Batch">
  <job-step name="…">
    <env-entries>
      <env-var name="PATH" value="…"/>
      <env-var name="CLASSPATH" value="…"/>
    </env-entries>
    <exec executable="java">
      <arg line="tryit"/>
    </exec>
  </job-step>
</job>
Job scheduler
The Compute Grid job scheduler is responsible for submitting the work specified by xJCL for execution. It works in conjunction with
WebSphere Extended Deployment's management of OLTP workload and service policy to optimize OLTP and Compute Grid work across a
shared infrastructure. A server can execute OLTP and Compute Grid workloads concurrently.
Some people might want to replace the WebSphere Extended Deployment job scheduler with another scheduler. This is relevant if you already have an
existing scheduling solution in place, such as Tivoli® Workload Scheduler or Platform Computing LSF. These schedulers can
then submit, monitor, and control Compute Grid jobs and integrate them into an enterprise or more global workload schedule. Further, there
are facilities to make the Compute Grid job scheduler highly available using traditional clustering techniques.
Job classes and classification rules
Asynchronous applications are by their very nature quite different from online applications. One concern is resource consumption. Because
asynchronous applications usually run for a longer time than online transactions and there is no interactivity, it's important to provide controls
that ensure that a job does not consume more resources than it should.
Therefore, Compute Grid contains two elements — job classes and classification rules — that provide
important capabilities for administering Compute Grid jobs. Job classes provide administrative control over resource consumption and are
defined via a scheduler configuration. Job classes are named policies that control:
- Maximum execution time
- Maximum number of concurrent jobs per endpoint
- Maximum job log size
- Job log retention
- Execution record retention
The job class for a job is assigned via the class= keyword in xJCL.
You can see from these policies the type of administrative control WebSphere Extended Deployment Compute Grid provides.
Classification rules provide administrative rules for service policy assignment. They are also defined via a scheduler configuration and
are represented as an ordered list of rules evaluated in the specified order, where the first match assigns the service policy. Each rule is a
Boolean expression formed using:
- Job name
- Job class
- Submitter identity, group
- Job type (batch, for example)
- Time, date
- Platform (z/OS, for example)
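The first-match evaluation described above can be sketched in plain Java. This is an illustration of the rule-evaluation idea only: the classes Job and Classifier, their fields, and the policy names used in the usage note are hypothetical, and real classification rules are defined in the scheduler configuration rather than in code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical view of the job attributes a rule can test. */
class Job {
    final String name;
    final String type;
    final String submitter;

    Job(String name, String type, String submitter) {
        this.name = name;
        this.type = type;
        this.submitter = submitter;
    }
}

/** Ordered rule list; the first rule whose condition matches assigns the policy. */
class Classifier {
    private static final class Rule {
        final Predicate<Job> condition;
        final String servicePolicy;

        Rule(Predicate<Job> condition, String servicePolicy) {
            this.condition = condition;
            this.servicePolicy = servicePolicy;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String defaultPolicy;

    Classifier(String defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    /** Appends a rule; rules are evaluated in the order they were added. */
    Classifier addRule(Predicate<Job> condition, String servicePolicy) {
        rules.add(new Rule(condition, servicePolicy));
        return this;
    }

    String classify(Job job) {
        for (Rule rule : rules) {
            if (rule.condition.test(job)) return rule.servicePolicy;
        }
        return defaultPolicy; // no rule matched
    }
}
```

Because evaluation stops at the first match, more specific rules belong earlier in the list. For example, a rule for batch jobs submitted by "ops" must be added before a general rule for all batch jobs, or the general rule will always win.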
z/OS enhancements
We spoke about the particular utility of Java batch for mainframe batch environments. Compute Grid also includes
enhancements that integrate it more closely with native z/OS workload
management (WLM) to enhance job execution and management:
- SMF accounting records for batch jobs: SMF 120 (JEE) records tailored to jobs that include the job ID, user, and CPU time
- Dynamic servants for batch job dispatch: Exploits z/OS WLM to start new servants to execute batch jobs on demand
- Service policy classification and delegation: Uses the Compute Grid job classification to select the z/OS service class by propagating the transaction class from the scheduler to the z/OS application server for job registration with z/OS WLM
Conclusion
In this article, we've taken a closer look at the application patterns supported by WebSphere Extended Deployment and the infrastructure
capabilities behind them. We’ve concentrated on WebSphere Extended Deployment's ObjectGrid and support for Java batch. In upcoming
articles, we'll cover the associated programming models more extensively and with code examples.
Resources
About the author
Matt Haynos is the product manager for IBM WebSphere Extended Deployment. Before that, he was on IBM's grid computing team since its inception as an emerging business opportunity, and he had various responsibilities covering a broad range of initiatives and strategies related to building IBM's grid computing business. He has held a variety of technical and managerial positions within IBM in the application development, program direction, and business development areas. He holds a bachelor's degree in computer science/applied mathematics and cognitive science, with honors, from the University of Rochester; and a master's degree in computer science from the University of Vermont. He lives with his wife and two sons in Connecticut.