High availability topologies for IBM PureApplication System

A frequent question from customers regarding IBM® PureApplication System is "How do you set it up for high availability?" This article provides an overview and recommendations on this topic.

Kyle Brown (brownkyl@us.ibm.com), Distinguished Engineer, IBM  

Kyle Brown is a Distinguished Engineer with IBM Software Services for WebSphere and specializes in SOA and emerging technologies. Kyle provides consulting services, education, and mentoring on SOA, object-oriented topics, and J2EE technologies to Fortune 500 clients. He is a co-author of Java Programming with IBM WebSphere and Persistence in the Enterprise. He is also a frequent conference speaker on the topics of SOA, Enterprise Java, OO design, and design patterns.



Andre Tost, Senior Technical Staff Member, IBM

Andre Tost works as a Senior Technical Staff Member in the IBM WebSphere organization, where he has helped IBM customers establish service-oriented architectures for the past eight years. His special focus at the moment is on cloud computing platforms and integrated expert systems. Before his current assignment, he spent over ten years in various consulting, development, and architecture roles in IBM. He has worked with large IT organizations across the globe on SOA and BPM and has acted as Lead Architect for many large IT projects, specifically around the Enterprise Service Bus pattern. He started his career at IBM as a C++ and Java developer and still enjoys developing code.



Rohith Ashok, Senior Technical Staff Member, IBM

Rohith Ashok works in IBM Software Group as the Lead Architect of the PureApplication System and PureSystems family of products. Before focusing on integrated systems, Rohith worked on IBM Workload Deployer and the WebSphere CloudBurst Appliance. Rohith has spent 10 years working on everything from zSeries to Power systems to highly integrated systems. In his spare time, Rohith loves spending time with his family and watching technology unfold into everyday life.



27 June 2012


Introduction

One of the most commonly asked questions we have heard from customers regarding IBM PureApplication System is "How do you set it up for high availability?" This article provides a brief introduction to this topic along with recommendations. Note that we are not covering issues related to continuous availability in this article. Those issues are more complex and will be covered separately. What's more, we are only considering high availability (HA) cases involving virtual systems built on top of WebSphere® Application Server (hereafter called Application Server) and DB2®. The HA cases for virtual applications, especially inter-rack cases, and other IBM middleware options, such as WebSphere MQ and WebSphere Message Broker, will be covered in future articles.

Note that this is an introductory article that describes high-level solutions to the problems described. A future set of articles will provide detailed examples to achieve the levels of availability described in this article.


High availability

There are two different kinds of high availability mechanisms that we need to consider. The first are intra-rack mechanisms, which are built into the firmware and hardware of PureApplication System. The management environment itself is also redundant within the rack, as each PureApplication System device contains two management nodes, one of which is a backup for the other.

PureApplication System has been carefully designed to have no single points of failure within each of its racks. Thus, if you consider a WebSphere and DB2 virtual system that has multiple WebSphere Application Server cluster members running entirely inside the rack, the redundancy built into the rack protects you from the failure of any one piece of hardware (a compute node, a hard drive, or even a top of rack (TOR) switch). If a compute node fails, the WebSphere HTTP Server plugin notices, because the JVMs running on that node disappear, and all traffic is seamlessly rerouted to the remaining cluster members.

PureApplication System itself will then notice that the compute node is down and move the affected virtual machines to another compute node within the same cloud group; the recovered cluster members are then rejoined into the cluster by the plugin and start taking traffic again. Similarly, if you have a DB2 HADR system built entirely within the frame and the primary database fails, the system seamlessly starts directing requests to the backup database. Finally, the placement algorithms of PureApplication System are intelligent enough that, in most cases, they avoid placing two cluster members on a single compute node, provided the configuration of the cloud group and the availability of compute resources within the cloud group allow it.
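
As a concrete illustration, the following wsadmin (Jython) sketch lists the members of a WebSphere cluster and whether each one is currently running, which is essentially the view the HTTP Server plugin reacts to when it reroutes traffic. The cluster name is a placeholder for your own environment; run the script with wsadmin.sh -lang jython -f.

Listing 1. Checking cluster member status with wsadmin (sketch)

# Assumption: a cluster named "AppCluster" created by the virtual system pattern
clusterName = 'AppCluster'

cluster = AdminConfig.getid('/ServerCluster:%s/' % clusterName)
for memberId in AdminConfig.list('ClusterMember', cluster).splitlines():
    name = AdminConfig.showAttribute(memberId, 'memberName')
    node = AdminConfig.showAttribute(memberId, 'nodeName')
    # A running member has a live Server MBean; an empty result means it is down
    mbean = AdminControl.completeObjectName('type=Server,node=%s,name=%s,*' % (node, name))
    state = (mbean and 'RUNNING') or 'STOPPED'
    print '%s on node %s: %s' % (name, node, state)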

These mechanisms can probably take care of 90% of all failover needs. However, as good as that is, for many of our customers it is not enough to cover what they mean when they say "HA". What they specifically mean are inter-rack mechanisms: how do you provide for situations where the entire rack fails due to some catastrophic event (for example, when a pipe in the water-cooled door fails and sprays water all over the rack)? That is where you can take advantage of standard facilities in WebSphere, DB2, and IBM Workload Deployer to provide this level of high availability as well.

What you can do is first use standard DB2 HADR patterns to set up a primary database on one rack, with the backup database on the other. That provides the ability to handle database requests even when the primary database fails, either due to a hardware failure within the rack (failure of a compute node) or a failure of the entire rack (the broken pipe example). Likewise, you can take advantage of traditional Application Server HA mechanisms to provide two different failover approaches: either create Application Server instances in the "second" rack and federate them to a Deployment Manager (DMgr) in the "first" rack, or create two separate cells (one per rack) and manage load distribution between the cells externally to the racks, for example, with a DataPower appliance using its Application Optimization load balancing capabilities. All of these mechanisms require some level of manual intervention on the part of the PureApplication System administrator who is creating the system, and the amount of manual intervention needed differs from component to component.
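
As a sketch of the database half of that setup, the commands below pair a DB2 HADR primary on the Rack A instance with a standby on Rack B using the standard HADR configuration parameters. The database name, host names, and service port are placeholders, and the script simply wraps the DB2 command line processor; run the equivalent commands on the appropriate instance in your own environment.

Listing 2. Pairing a DB2 HADR primary and standby across racks (sketch)

# Assumption: database APPDB, instances reachable as db2-rack-a / db2-rack-b
import subprocess

DB = 'APPDB'
PRIMARY_HOST = 'db2-rack-a.example.com'
STANDBY_HOST = 'db2-rack-b.example.com'

def db2(cmd):
    # Invoke the DB2 command line processor from an environment where it is sourced
    subprocess.check_call(['db2'] + cmd.split())

# Run on the Rack A (primary) instance; mirror the settings on the Rack B standby
db2('update db cfg for %s using '
    'HADR_LOCAL_HOST %s HADR_REMOTE_HOST %s '
    'HADR_LOCAL_SVC 55001 HADR_REMOTE_SVC 55001 '
    'HADR_SYNCMODE NEARSYNC' % (DB, PRIMARY_HOST, STANDBY_HOST))

# Start HADR on the standby first (on Rack B: db2 start hadr on db APPDB as standby),
# then activate the primary on Rack A
db2('start hadr on db %s as primary' % DB)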

There are two different scenarios to examine:

  1. The first is inside the data center, where we are trying to provide HA across two (or potentially more) PureApplication System racks.
  2. The second is across two geographically distributed data centers.

Inside the data center

In the first scenario, what you are protecting against is the failure of an entire PureApplication System rack. Given the redundancy of all of the components inside the rack, this is a highly unlikely event. However, there are other reasons why this configuration may be desirable, such as handling bursts of traffic beyond what a single rack, even a high-performance rack, can sustain. Figure 1 shows the final system configuration that you want to achieve.

Figure 1. HA configuration inside a DC with a single shared cell

In this scenario (referred to as the "single cell" model), you begin by creating a virtual system pattern that defines a cell consisting of a DMgr, IBM HTTP Server (IHS) nodes, and WebSphere Application Server nodes in the first rack (IPAS A in Figure 1). You then create a second virtual system pattern on IPAS B that contains only IHS nodes and WebSphere Application Server nodes, which you then manually federate into the cell created previously. This defines the cell boundary as crossing both machines, as shown in Figure 1. You likewise create a virtual system pattern for the primary DB2 HADR node in IPAS A, and a second virtual system pattern for the secondary DB2 HADR node in IPAS B. Note that in order for this to work, you need to configure an external load balancer to be aware of all of the IHS instances in the two racks. You also have to consider HTTP session management across the two racks; the simplest option in this approach is to enable database session persistence to the shared database.
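
That manual federation step is ordinary WebSphere node federation. A minimal sketch follows, with placeholder host name, credentials, and install path; it would be run on each node that the second pattern creates on IPAS B.

Listing 3. Federating IPAS B nodes into the IPAS A cell (sketch)

# Assumption: default WebSphere install path inside the virtual machine and
# the deployment manager's default SOAP connector port (8879)
import subprocess

DMGR_HOST = 'dmgr-ipas-a.example.com'   # DMgr created by the first pattern on IPAS A
DMGR_SOAP_PORT = '8879'

# Run on each WebSphere Application Server node deployed on IPAS B
subprocess.check_call([
    '/opt/IBM/WebSphere/AppServer/bin/addNode.sh',
    DMGR_HOST, DMGR_SOAP_PORT,
    '-username', 'wasadmin',
    '-password', 'password',
])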

In this configuration, you are now tolerant of a complete failure of either rack. If Rack A fails, the IHS instances and WebSphere Application Server nodes on Rack B continue to take requests from the external load balancer, and the DB2 HADR secondary takes over from the failed primary node. The only function that is lost is the ability to deploy changes to the cluster members on Rack B, since the DMgr is no longer available. If Rack B fails, Rack A continues to function as normal, taking requests from the external load balancer as usual.
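
The takeover itself uses standard DB2 HADR facilities. The sketch below shows the two relevant pieces: configuring automatic client reroute, so that Application Server data sources can reconnect to the standby without JDBC changes, and forcing a takeover when the primary rack is lost. The database name, host name, and port are placeholders.

Listing 4. HADR takeover and automatic client reroute (sketch)

import subprocess

def db2(cmd):
    subprocess.check_call(['db2'] + cmd.split())

# On the Rack A primary, point clients at the Rack B standby as the alternate
# server (and configure the reverse on Rack B); 50000 is the default DB2 port
db2('update alternate server for database APPDB '
    'using hostname db2-rack-b.example.com port 50000')

# If Rack A is lost entirely, promote the standby on Rack B
db2('takeover hadr on db APPDB by force')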

Across two data centers

The case of two geographically separated PureApplication System racks is a bit more complicated. In this scenario (which we will call the "dual-cell" model), you need to create at least two different cells using a shared database, as shown in Figure 2.

Using HTTP session replication across two cells via a shared database is possible, but it is rarely done. In most cases, session affinity is configured in the external load balancer; in other words, requests for a session that was started in a particular cell are always routed to that cell. If you can tolerate losing session data in case of a failover, you can set up a separate local database in each cell for session persistence.
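
If you choose the local session database option, enabling database session persistence is a standard WebSphere configuration change, shown here as a wsadmin (Jython) sketch to be run once against each cell. The JNDI name is a placeholder for a data source that points at the cell-local session database.

Listing 5. Enabling database session persistence for a cell (sketch)

# Assumption: a data source bound at jdbc/SessionDB that targets a local database
SESSION_DS_JNDI = 'jdbc/SessionDB'

for sm in AdminConfig.list('SessionManager').splitlines():
    AdminConfig.modify(sm, [['sessionPersistenceMode', 'DATABASE']])
    dbPersist = AdminConfig.showAttribute(sm, 'sessionDatabasePersistence')
    AdminConfig.modify(dbPersist, [['datasourceJNDIName', SESSION_DS_JNDI]])
AdminConfig.save()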

Figure 2. Two Active-Active WebSphere cells

Take note of the configuration of the cells shown in Figure 2. In this scenario, you create individual, but identical, cells in each PureApplication System. As noted, the cell boundaries are entirely contained within each system. In this way, the WebSphere cells are configured in an Active-Active mode, while the DB2 HADR database is configured in an Active-Passive mode, as it was in the previous scenario. The difference here is that the cells are independent of each other; there is no communication between the WebSphere Application Server nodes on the two racks.

Perhaps the easiest way to implement this approach is to create the first cell with a virtual system in IPAS A, export the virtual system pattern and import it into IPAS B, and finally create a new instance of that pattern in IPAS B. Likewise, you create a DB2 HADR primary using a virtual system pattern in IPAS A and its secondary with another virtual system pattern in IPAS B, just as in the previous example. In this case, the external load balancer is again set up to feed traffic to the full set of IHS instances in both cells. If either rack fails completely, the other continues to take traffic uninterrupted.
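
Neither the product nor the patterns configure the external load balancer for you, so a simple operational check is useful after deployment. The following sketch is not a product API; it only probes a placeholder list of IHS endpoints in both cells to confirm that every target the load balancer spreads traffic across is reachable.

Listing 6. Probing the IHS endpoints in both cells (sketch)

# Assumption: these host names stand in for the IHS instances in IPAS A and IPAS B
import urllib2

IHS_ENDPOINTS = [
    'http://ihs1-ipas-a.example.com/',
    'http://ihs2-ipas-a.example.com/',
    'http://ihs1-ipas-b.example.com/',
    'http://ihs2-ipas-b.example.com/',
]

for url in IHS_ENDPOINTS:
    try:
        code = urllib2.urlopen(url, timeout=5).getcode()
        print '%-40s HTTP %s' % (url, code)
    except Exception, e:
        print '%-40s UNREACHABLE (%s)' % (url, e)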

The main reason you want to create two cells in this scenario, as opposed to joining all the instances into a single cell as in the previous example, is that you generally do not want a cluster member to be geographically separated from its DMgr. The communication necessary between the DMgr and its cluster members (for management, code distribution, and so on) is not efficient over a wide area network, so as a best practice we do not recommend stretching a cell across long distances.


Comparing the scenarios

In the single-cell scenario, all of the WebSphere server instances are managed from one central point. That is, there is one Deployment Manager and there is one admin console to apply changes to the environment across both racks. Moreover, additional load balancing and server affinity settings are available from the IHS plugins on both racks, ensuring that resources are efficiently utilized. In essence, the fact that you are using two separate racks is transparent to your WebSphere setup.

However, this comes at the cost of a more complex initial setup. The nodes on Rack B are configured via a separate pattern, which connects them to the DMgr on Rack A. All of the IHS instances have to be configured into the external load balancer, and that step has to be repeated whenever a new IHS instance is created and started on either rack. Also, from a virtual system standpoint, you have two distinct virtual systems, one on each rack, and all required changes have to be applied to both, ideally simultaneously. Finally, there is an additional (ideally very fast) network hop between the racks, which adds to the cost of, for example, calls to the database from the second rack to the first, requests passed between IHS nodes on one rack and cluster member JVMs on the other, or communication between the DMgr and the individual node agents on the second rack. But again, assuming that you have established a very fast network connection between the racks, this overhead is tolerable.

In the dual-cell scenario, the setup is simpler. In fact, you can just as easily execute this scenario within one data center. Both cells are managed separately, both from a WebSphere admin console perspective and from a PureApplication System management perspective. For example, this means that you can take full advantage of PureApplication System mechanisms, such as routing and scaling policies, without having to consider cross-rack implications. It also allows the same pattern to be deployed on each rack, with each deployment using different IP addresses.

At the same time, there is a downside: all administrative changes need to be made to two WebSphere cells through two separate consoles. You also need to manually configure the external load balancer, just like in the first scenario. And in both cases, there is only one database server active at any given time, meaning that at least some of the resources on the rack with the secondary database are not utilized under normal operating conditions.


Conclusion

This article described how you can achieve high availability for WebSphere Application Server and DB2 applications within a data center by using either a single-cell or dual-cell approach, and across data centers by using the dual-cell approach.

Note that additional configurations are possible, especially in cases where you have more than two racks. For example, you can have two racks per data center, with a cell defined on each, bringing the total number of cells to four. Or, you can define more than one cell per rack. Moreover, you may have varying numbers of clusters within your cells.

However, all of these cases follow the same steps and best practices that you would use in a non-PureApplication environment; we have simply added the considerations that apply when running on PureApplication System, and in that context the scenarios described above also cover those alternative setups. A scenario that we have deliberately not covered is running a single WebSphere cell across two data centers, because doing so is highly discouraged.

While the examples we have discussed are limited to WebSphere Application Server and DB2 applications, you can extrapolate these approaches in a limited way to other product configurations as well. We will discuss approaches for other enterprise solutions, such as WebSphere MQ and WebSphere Message Broker, in future articles.
