Level: Introductory
Frank De Gilio (degilio@us.ibm.com), Senior complex solutions architect, IBM, IBM Design Center for On Demand Business
21 Sep 2004
Grid resource managers manage workload from requesters to the available grid
engines. What happens when there's more work than available engines can handle?
Traditionally, this condition causes queuing and additional wait times for the user
community. What would happen if the grid resource manager could appeal to some
outside entity to add engine resources? What if there were multiple grids within an
enterprise and this outside entity could determine which grid most needed resources?
This article discusses how resources can be managed into and out of a grid
environment using an example infrastructure.
Introduction
What's the biggest roadblock to acceptance of an on demand business environment?
It isn't technology. The e-business world provides many examples of technology
evolving quickly to support new needs. The biggest roadblock is politics. By
itself, on demand business is an apolitical model: it looks at what's needed to
ensure that all resources are used to the best benefit of the enterprise. But the
enterprise is extremely political. Since these two models -- the apolitical on
demand business environment and the political enterprise -- are diametrically
opposed, we need technology that encourages enterprise "kingdoms" to share their
resources. As grid computing moves from purely scientific and mathematical use to
a more utility-based model, the technology to leverage the proper use of servers
in this environment must be in place.
In this article, I'll draw examples from some work we did in the IBM Design
Center for On Demand Business for a financial institution. The models used were
based on grid workloads that followed trading examples, but they're representative
of basic grid workload models we've seen for a number of different business
clients.
For the financial institution, we used IBM Tivoli® Intelligent Orchestrator
(TIO) software because it enables an organization to add and remove servers from a
processing environment based on the needs of that environment. Traditionally, TIO
has been deployed in Web-based environments to ensure the best use of servers
throughout multiple tiers. This has been accomplished by analyzing the CPU use of
the server and the rate of work to the server from the network. If TIO can be
adapted to serve the grids as well, it would be a powerful tool for managing
servers across multiple heterogeneous environments. Suddenly, servers become
commodities to share across departments -- hoarding of departmental server
resources can become a thing of the past. This article describes the methodology
used to extend the TIO product from its traditional Web-based world to one that
spans multiple environments.
The business problem
Enterprises constantly struggle to find the best way to manage their hardware,
software, and management resources. Often new applications drag with them a new
set of servers. To make sure the servers can handle expected demand, capacity
planners frequently overestimate the load, leaving room for growth as application
usage rises. If the estimate is too low, performance suffers. If the estimate is
too high, resources are wasted. Since the typical political climate discourages
sharing of resources, the surplus is never used. The disturbing thing for CIOs and
IT organizations is that such wasted resources can never be brought to bear on
resource-starved applications. Some users can be stuck with poorly performing
applications while perfectly useful resources lie idle.
The solution
Several technologies are coming together now to solve this problem. The advent of
efficient Web services, the proliferation of J2EE underpinnings for those Web
services, and the power of grid computing allow application components to be
efficiently deployed within a heterogeneous environment. Applications have become
less platform- and infrastructure-dependent and more focused on solving business
problems. This paves the way for using grids in a utility model. In environments
where multiple grids must contend for the same resources or must share resources
with non-grid environments, we need something outside the grid to ensure proper
deployment of servers. In our work with the financial institution, we used IBM
Tivoli Intelligent Orchestrator (TIO) to fill this need. TIO also records
deployments of server environments, so you can see how and for whom each server
deployment was carried out. Simply put, TIO
provides the framework for removing the political as well as technical barriers
that might stand in the way of an enterprise becoming more of an on demand
business.
In our work with the financial institution, we combined Tivoli Intelligent
Orchestrator with DataSynapse GridServer. Let's look briefly at these products so
you can get an idea of how they work together.
About DataSynapse GridServer
DataSynapse GridServer is an application environment for grid computing. It
provides a component architecture for distributing compute-based workloads.
GridServer delivers adaptive, non-deterministic load balancing, dynamic
scheduling, and a job-task paradigm. GridServer has three basic components:
- The Director -- This component is the primary contact point for the grid. It
knows all the brokers in the environment and which applications those brokers
manage. When a client contacts the director with work, the director refers the
client to the broker associated with the application.
- The Broker -- This component schedules work to engines in the grid. It takes
requests from the client and provides job tasks to engines performing the work
in this environment.
- The Engine -- Engines are the components that perform the required
tasks.
When an engine daemon is started on a server, it contacts the broker to tell it
that it is available. At initialization time it gets any updates that are
available. Once registered with a broker, this daemon is ready to receive work.
Engines normally have a home broker to which they are attached, but they can be
moved from broker to broker based on the needs of the grid environment. You can
manage this environment through either a Web interface, which provides granular
views and controls of the execution environment, or a Web services
interface. Overall, DataSynapse GridServer is a very flexible, lightweight
infrastructure for grid computing.
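The division of labor among the three components can be pictured with a small Python sketch. This is only an illustration of the roles described above, not the GridServer API: the class and method names are invented, and the round-robin dispatch is a deliberate simplification of GridServer's adaptive load balancing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Hypothetical stand-in for an engine daemon running on one server."""
    name: str

    def run(self, task):
        # An engine simply performs the task it is handed.
        return task()


@dataclass
class Broker:
    """Hypothetical broker: queues work for one application and feeds engines."""
    application: str
    engines: list = field(default_factory=list)
    pending: deque = field(default_factory=deque)

    def register(self, engine):
        # Called when an engine daemon starts and announces it is available.
        self.engines.append(engine)

    def release(self, engine):
        # Detach an engine so it can be moved to another broker.
        self.engines.remove(engine)

    def submit(self, task):
        self.pending.append(task)

    def dispatch(self):
        # Simplification: hand queued tasks to engines in round-robin order.
        results = []
        if not self.engines:
            return results  # no engines registered, so the work stays queued
        turn = 0
        while self.pending:
            engine = self.engines[turn % len(self.engines)]
            results.append((engine.name, engine.run(self.pending.popleft())))
            turn += 1
        return results


class Director:
    """Hypothetical director: knows every broker and what it manages."""

    def __init__(self):
        self.brokers = {}

    def add_broker(self, broker):
        self.brokers[broker.application] = broker

    def lookup(self, application):
        # Refer the client to the broker that manages this application.
        return self.brokers[application]
```

A client would ask the director for the broker of its application, submit tasks to that broker, and collect results as engines finish them. Moving an engine between brokers is then a release on one broker followed by a register on another.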
About Tivoli Intelligent Orchestrator
There are two primary components of TIO: the Provisioning Manager and the
Orchestrator. These two constructs share a Data Center Model (DCM), which contains
all relevant data about the server and network environment. The DCM records the
types of network devices, the network connections between those devices and the
servers, server details, and relevant software stacks, giving a detailed picture
of what the data center controls.
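To make the idea of a shared model concrete, here is a minimal Python sketch of the kind of information such a model holds. The class names, fields, and the `servers_with` query are assumptions made for illustration; they are not the actual DCM schema.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkDevice:
    name: str
    kind: str  # for example "switch" or "load balancer"


@dataclass
class Server:
    name: str
    connected_to: str  # name of the network device this server plugs into
    software_stack: list = field(default_factory=list)


@dataclass
class DataCenterModel:
    """Hypothetical, much-reduced picture of what the data center controls."""
    devices: dict = field(default_factory=dict)
    servers: dict = field(default_factory=dict)

    def add_device(self, device):
        self.devices[device.name] = device

    def add_server(self, server):
        # Keep the model consistent: a server must attach to a known device.
        if server.connected_to not in self.devices:
            raise ValueError(f"unknown network device: {server.connected_to}")
        self.servers[server.name] = server

    def servers_with(self, software):
        # Names of all servers whose software stack includes `software`.
        return sorted(
            name
            for name, server in self.servers.items()
            if software in server.software_stack
        )
```

Because both components read the same model, a question such as "which servers already carry the grid engine software?" has a single answer for provisioning and for orchestration.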
Tivoli Provisioning Manager (TPM)
TPM is a framework for creating rules that govern how servers will be configured.
Through the TPM tool set, users create workflows that define the steps to set up a
server. Workflows use Java™ technology-like semantics that allow you to
tie predefined Java plug-ins together to perform configuration tasks on servers
and network appliances within TPM's domain. In addition to the provided Java
plug-ins, users can create their own Java plug-ins to perform additional tasks in
the workflows. In most cases this is unnecessary since TPM already has plug-ins
for communicating with servers, cluster managers (like IBM Direct and Cluster
Server Manager), and network appliances.
Since workflows can call other workflows, each workflow can be focused on a
particular task. Efficient modularization in workflows can create a very
well-structured and reusable deployment infrastructure.
Listing 1 shows a simple workflow used for adding a
server to a DataSynapse Grid.
Notice that TPM has a sophisticated flow structure that allows for workflows to
catch errors in the deployment process and recover from them. While TPM is
described here in the TIO context, it is also available separately.
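The two ideas in this section, workflows that call other workflows and workflows that catch and recover from errors, can be sketched in Python. This is not the TPM workflow language and it does not reproduce Listing 1; the step names and the `run_workflow` helper are hypothetical and exist only to show the control flow.

```python
class StepFailed(Exception):
    """Raised by a step that cannot complete."""


def run_workflow(name, steps, on_error=None, log=None):
    """Run steps in order; if one fails, run the recovery steps instead."""
    log = log if log is not None else []
    try:
        for step in steps:
            step(log)
        log.append(f"{name}: ok")
        return True, log
    except StepFailed as exc:
        log.append(f"{name}: failed ({exc})")
        for step in on_error or []:
            step(log)
        return False, log


def as_step(name, steps):
    """Wrap a workflow so another workflow can call it as a single step."""
    def step(log):
        ok, _ = run_workflow(name, steps, log=log)
        if not ok:
            raise StepFailed(f"sub-workflow {name} failed")
    return step


# Hypothetical steps for adding a server to a grid.
def install_engine_software(log):
    log.append("install engine software")


def start_engine_daemon(log):
    log.append("start engine daemon")


def return_server_to_pool(log):
    log.append("return server to spare pool")
```

A top-level "add server" workflow could then be `run_workflow("add-server", [as_step("prepare", [install_engine_software]), start_engine_daemon], on_error=[return_server_to_pool])`. If any step raises `StepFailed`, the recovery step returns the server to the pool rather than leaving it half-configured.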
Tivoli Orchestrator
The Orchestrator component uses TPM to deploy the resources in the DCM to ensure
that the Service Level Agreements (SLAs) are met. The Orchestrator gets data from
servers and network devices and, using a technology called an Objective Analyzer
(OA), determines how the application of servers in this environment will affect
the SLA for a particular workload. An SLA defines an agreement on operational
characteristics between a service provider and a consumer. For example, an SLA
could define a response time for a query. (All queries to database A will be
returned in one second or less). Often SLAs are in place to define all major
requirements a consumer has on a provider of service.
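The response-time example can be written down as a small check. The following Python sketch assumes an SLA that consists of nothing more than a maximum response time; the class name and its method are illustrative, not part of TIO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseTimeSLA:
    """Hypothetical SLA: every request to `service` completes within a limit."""
    service: str
    max_seconds: float

    def breaches(self, response_times):
        # Return the observed response times that exceed the agreed limit.
        return [t for t in response_times if t > self.max_seconds]
```

For the database A example, `ResponseTimeSLA("database A", 1.0).breaches(samples)` returns the measurements that took longer than one second. A response of exactly one second still meets the agreement.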
In addition to understanding the effect of servers on a particular workload, the
Orchestrator must determine if the application of servers to a particular workload
is advisable based on the needs of the entire data center. Thus, even if a
particular workload needs additional resources to run faster, it might not get
them if more important work requires the same resources.
It's important to understand that the Orchestrator doesn't manage workloads
within a server's context; rather, it ensures enough servers are available to meet
the SLA assigned to the workload in general. Stated another way, the Orchestrator
doesn't pick which server should be used to perform a task, but it ensures enough
servers are available to service the scheduled work.
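The arbitration described in this section, where more important work is served first and the Orchestrator decides only how many servers each workload gets, can be sketched as a simple greedy allocation. This is an assumption about how such a decision could be made, not a description of the Orchestrator's actual algorithm; the function name and the numeric priorities are invented.

```python
def allocate(spare_servers, requests):
    """Grant spare servers to workloads, most important first.

    requests: list of (workload, priority, servers_wanted) tuples,
    where a larger priority means more important work.
    """
    grants = {}
    for workload, _priority, wanted in sorted(requests, key=lambda r: -r[1]):
        granted = min(wanted, spare_servers)
        grants[workload] = granted
        spare_servers -= granted
    return grants
```

With five spare servers, a high-priority trading workload asking for three and a low-priority batch workload asking for four, the batch workload receives only the two servers left over. Note that the result is a count per workload; which engine runs which task remains the grid broker's decision.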
TIO applied to grid
The OAs shipped with Orchestrator were based on Capacity on Demand models that
were well suited to Web workloads but not to grid workloads. Capacity on Demand
predicts utilization by tying the CPU data received from the server with arrival
rates received from network appliances. Given the "transactional" request-response
nature of the Web, this model characterizes the workload effectively. The chaotic
nature of grid workload, combined with the grid's tendency to consume all
available server CPUs, makes the traditional Capacity on Demand model ineffective
in the grid environment.
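One way to see why this model fits the Web but not the grid is the standard utilization law, U = arrival rate x CPU demand per request / number of servers. The Python sketch below uses that law as a stand-in for a Capacity on Demand style calculation. It is an assumption for illustration only, since the article does not give the formulas TIO uses, and the function names are invented.

```python
import math


def service_demand(cpu_utilization, arrival_rate, servers):
    """CPU-seconds per request, from U = rate * demand / servers."""
    return cpu_utilization * servers / arrival_rate


def servers_needed(arrival_rate, demand, target_utilization):
    """Servers required to keep utilization at or below the target."""
    return math.ceil(arrival_rate * demand / target_utilization)
```

For a Web tier, measured CPU use and arrival rate give a stable demand per request, so the number of servers needed at a higher arrival rate follows directly. In a grid, engines drive the CPUs to full utilization whenever any work is queued, so the measured utilization stays near 1.0 whether the backlog is ten tasks or ten thousand. The calculation then says nothing useful about how many more engines are needed.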
TIO's Objective Analyzers at work
Each OA builds a "probability of breach surface," which defines the probability of
missing (breaching) an SLA. This probability is based on the current workload and
the resources available to that workload. If the "probability of breach" is high
enough, the Orchestrator asks the OA what effect adding servers would
have. It then determines the optimal server allocation to provide to a workload.
To do this accurately for grid, we looked at the information the DataSynapse
GridServer could provide to an OA so that it could determine what the workload
required. After some experimentation, DataSynapse provided the OA with queue depth,
the length of time each unit of work was taking, and a measure of the
current workload.
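As a rough illustration of how queue depth and task duration could feed a probability of breach, consider the Python sketch below. Everything in it is an assumption: the article does not describe the OA's actual mathematics, so the drain-time estimate, the logistic curve, the `steepness` parameter, and the function names are all hypothetical.

```python
import math


def drain_seconds(queue_depth, task_seconds, engines):
    """Estimated time to work off the queue; engines must be at least 1."""
    return queue_depth * task_seconds / engines


def breach_probability(queue_depth, task_seconds, engines,
                       deadline_seconds, steepness=4.0):
    """Map drain time relative to the deadline onto a 0..1 probability."""
    ratio = drain_seconds(queue_depth, task_seconds, engines) / deadline_seconds
    # Logistic curve centered where drain time equals the deadline.
    return 1.0 / (1.0 + math.exp(-steepness * (ratio - 1.0)))


def engines_to_add(queue_depth, task_seconds, engines,
                   deadline_seconds, max_probability, spare):
    """Smallest number of spare engines that brings the risk under the limit."""
    for extra in range(spare + 1):
        risk = breach_probability(queue_depth, task_seconds,
                                  engines + extra, deadline_seconds)
        if risk <= max_probability:
            return extra
    return spare  # even every spare engine is not enough; offer them all
```

Evaluating `breach_probability` over a range of engine counts traces one slice of a breach surface, and `engines_to_add` plays the role of asking what effect additional servers would have. When the estimated drain time equals the deadline, this sketch reports a probability of 0.5, and adding engines lowers it.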
Applying OAs to the grid
We also determined that there's a need to accurately categorize the type of grid
workload to be managed. For example, the financial institution we were working
with had two basic workloads to be modeled: one was very bursty, the other very
consistent. The bursty workload characterized an environment where a number of
users would dump a set of calculations into the grid with unexpected think times
between runs. The consistent workload was characterized by a consistent submission
of calculations that could be predicted and managed uniformly. The bursty workload
characterized an active user workload, and the consistent workload characterized a
batch workload.
TIO provides the ability to stack OAs to build a complex set of rules defining
when additional servers are required to ensure an SLA is not breached. Each OA in
the stack monitors a specific aspect of the environment. This
way, no single OA becomes too complex and OAs can be reused in multiple
environments with similar characteristics. For the grid, we created an OA
responsible for delaying the release of servers from a grid. This ensured that
servers were not prematurely removed from the grid that was serving the bursty
workload. Even when there was no work in the grid, this OA made sure that
servers were not released in case new work was imminent. This also prevented
unnecessary oscillation in server allocations to the grid.
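The stacking idea and the release-delay OA can be sketched in Python as follows. The interface is invented for illustration: each hypothetical OA reports how many servers it wants, and the stack keeps the largest request, so any one OA can hold servers in place. The real OA interface and the way TIO combines OAs are not described in this article.

```python
class QueueDepthOA:
    """Hypothetical OA: wants one engine per `tasks_per_engine` queued tasks."""

    def __init__(self, tasks_per_engine):
        self.tasks_per_engine = tasks_per_engine

    def desired(self, now, queue_depth, current):
        # Ceiling division without floating point.
        return -(-queue_depth // self.tasks_per_engine)


class ReleaseDelayOA:
    """Hypothetical OA: holds current servers until the grid stays idle."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.last_busy = None

    def desired(self, now, queue_depth, current):
        if queue_depth > 0 or self.last_busy is None:
            self.last_busy = now  # work is present, so restart the idle clock
        idle_for = now - self.last_busy
        return current if idle_for < self.hold_seconds else 0


def stacked_desired(oas, now, queue_depth, current):
    # Ask every OA in the stack and keep the largest request.
    return max(oa.desired(now, queue_depth, current) for oa in oas)
```

With a five-minute hold, a grid that empties its queue keeps its servers for the next burst and gives them up only after five idle minutes. That is the behavior the bursty, user-driven workload needed, and it keeps the queue-depth logic separate from the release policy so that each OA stays simple and reusable.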
Grid orchestration -- moving servers in and out of grids
Now that grids are moving into the mainstream of the IT environment, their
effective inclusion is dependent on the ability to determine how much resource
should be applied against particular workloads. Since the discrete usage of
servers in the grid environment can be tracked, owners of these resources can
demonstrate their servers' contribution to the enterprise. Thus, the owner can
recoup individual expenses or share expenses among the grid community. While
traditional CPU scavenging is viable in grid environments, it tends to be more
parasitic in nature. The more controlled usage of servers within an orchestrated
environment benefits not only the resource users but the resource providers as
well. Since the servers' owner benefits from someone else using the resource, the
owner is more motivated to share processing cycles that would otherwise go unused.
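The tracking that makes this possible amounts to simple bookkeeping. The Python sketch below assumes each deployment is logged as a record of owner, consuming grid, and start and end hour; the record layout and function names are hypothetical, not a TIO report format.

```python
from collections import defaultdict


def hours_contributed(usage_records):
    """Server-hours each owner contributed to any grid.

    usage_records: iterable of (owner, grid, start_hour, end_hour) tuples.
    """
    totals = defaultdict(float)
    for owner, _grid, start, end in usage_records:
        totals[owner] += end - start
    return dict(totals)


def hours_consumed(usage_records):
    """Server-hours each grid consumed, regardless of who owns the servers."""
    totals = defaultdict(float)
    for _owner, grid, start, end in usage_records:
        totals[grid] += end - start
    return dict(totals)
```

The first total lets an owner demonstrate a contribution to the enterprise; the second gives a basis for recouping or sharing expenses among the grids that used the servers.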
About the author
Frank De Gilio has been an IBM employee since 1985. He worked in MVS system development, tools and middleware development, and has worked on projects that tied MVS to workstations in client/server and Internet environments. In 1997, he joined the IBM S/390® new technology center, where his experience in UNIX®, MVS, and Microsoft® Windows® application/middleware development was key in showing customers how to use the latest OS/390® technology in Web-enabled environments. Currently in the new IBM Design Center for On Demand Business, he shows customers how the latest on demand technologies can be used to energize their infrastructures.