One more time, again
Welcome back to the fifth installment of this series, in which I have attempted to either provide answers to questions I have been asked repeatedly about WebSphere Application Server, or offer information to help you determine what's best for you in cases where a definitive answer isn't possible.
This time, I'm only going to cover one question, which came about as the result of my discussion (make that two recent discussions, in Part 2 and Part 3) on WebSphere Application Server cells spanning data centers. The question is:
Q: I want to run multiple data centers. How should I deploy WebSphere Application Server across these data centers for high availability?
A: For starters, when I get asked this question, I typically say you don't need multiple data centers for high availability -- you only need one, with sufficient capacity. Of course, the usual response to this is something along the lines of "but what happens if the data center goes down?" -- which is exactly what I want to hear. Why, you ask? Because now the question is framed properly: it is about disaster recovery, not high availability. It's critical to make that distinction, since these are two very different problem spaces with different answers.
Additionally, I'm going to assume that you are not going to run a single WebSphere Application Server cell across two data centers; this is a very bad idea and, for all the reasons I've mentioned before, should not be considered.
As far as what I would recommend for multiple data centers, each with one or more WebSphere Application Server cells, let's first outline the common options for such a topology:
Active/Passive: One data center is serving requests, the other is not performing work but is standing by in case it is needed. Application data and application state are replicated asynchronously between data centers so that when an outage occurs, failover to the surviving data center can be accomplished quickly with a minimal outage (minimal could mean some number of minutes or even a few hours in this context).
Active/Active: Both data centers are serving requests. Because each data center is active, application data (minimally) and application state (possibly) must be shared between the data centers, and replication of the application data and application state needs to be synchronous, with a minimal latency, in order to avoid inconsistencies.
Reformed Active/Active: The key difference in this variation to Active/Active is that the data centers do not share application state and, if possible, there is no sharing of application data -- or it is at least minimized, if it can't be avoided altogether. For this to work, you need to provide affinity for requests to a specific data center, which is best accomplished with a network switch that distributes the load to the data centers (which, naturally, means it doesn't reside in either data center).
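The data center affinity described for Reformed Active/Active can be illustrated with a minimal sketch. It assumes a stable hash of a user identifier is used to pick a "home" data center, and that the distributing switch knows which data centers are currently healthy. The data center names, the function names, and the choice of hashing are all hypothetical. A real network switch would typically use its own affinity mechanism (for example, cookies or source address).

```python
import hashlib

# Hypothetical data center names, for illustration only.
DATA_CENTERS = ["dc-east", "dc-west"]


def home_data_center(user_id, centers=DATA_CENTERS):
    """Map a user to a fixed 'home' data center using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(centers)
    return centers[index]


def route(user_id, healthy, centers=DATA_CENTERS):
    """Return the data center that should serve this user's requests.

    Users stay pinned to their home data center while it is healthy;
    only when it is unavailable are they sent to a surviving one.
    """
    home = home_data_center(user_id, centers)
    if home in healthy:
        return home
    survivors = [dc for dc in centers if dc in healthy]
    if not survivors:
        raise RuntimeError("no data center available")
    return survivors[0]
```

Because the mapping depends only on the user identifier, every request from the same user lands in the same data center during normal operation, which is what removes the need to share application state between data centers.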
I will discuss these further in a minute, but the tradeoffs for these options look like this:
Active/Passive is the easiest and safest to implement since only one data center is serving requests -- but it is also the most expensive, at least in terms of hardware, since you have two data centers, each with 100 percent capacity.
Reformed Active/Active is slightly more complex than Active/Passive -- but with some additional effort, it can work in a reliable manner.
Active/Active is the most complex topology to implement; doing so requires a sizable investment in the network between the data centers, since you need not only to minimize latency but also to provide adequate capacity.
Network capacity and latency are key issues that impact data center failover design. To make this a bit more concrete, let's look at these two attributes:
Network latency: It is certainly possible to invest in your fiber optic network so that latency is not a barrier. Briefly, the speed of light in a vacuum is roughly 300,000 km/sec, so that's your theoretical maximum speed -- but since we live in the real world, you need to consider what networks can realistically achieve. The speed of light in fiber optics is roughly 200,000 km/sec. Thus, for data transmitted over fiber a distance of 1000 km, a one-way latency of 5 ms is possible (1000 km / 200,000 km/sec = 0.005 sec = 5 ms), and for a round trip (the request and the response) the latency over this distance would be 10 ms. There are studies that document that the Internet backbone is operating within a factor of 2 of the theoretical best possible round-trip delay, so in the example case of two data centers 1000 km apart, a round-trip latency of 20 ms is likely achievable.
I will add that you can validate that the latencies I describe are readily achievable -- with a sufficient investment in network infrastructure. You can access the current AT&T Global IP Network statistics for the WWW backbone, which include latency metrics. (When I looked one recent afternoon, the latency for San Francisco to Seattle, a distance of 1090 km, was 23 ms.)
Network capacity: Of course, sufficiently low latency is only half of what needs to be considered; adequate capacity is also required. In speaking with customers, I have heard mention of an OC-1 or OC-3 connection between data centers; at first glance, it's easy to think that this is sufficient, but let's examine this a bit more closely.
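The latency arithmetic above can be written as a short sketch. The function names and the `overhead_factor` parameter are hypothetical. The factor stands in for the "within a factor of 2" observation about real backbones and is an assumption, not a measured value.

```python
# Approximate speed of light in optical fiber, as used in the text.
FIBER_KM_PER_SEC = 200_000


def one_way_latency_ms(distance_km):
    """Best-case one-way propagation delay over fiber, in milliseconds."""
    return distance_km / FIBER_KM_PER_SEC * 1000.0


def round_trip_latency_ms(distance_km, overhead_factor=1.0):
    """Round-trip delay: request plus response, scaled by a real-world factor."""
    return 2 * one_way_latency_ms(distance_km) * overhead_factor
```

For 1000 km this gives 5 ms one way, 10 ms for the ideal round trip, and 20 ms with a factor of 2. For the 1090 km San Francisco to Seattle distance, the same calculation gives about 10.9 ms ideal and about 21.8 ms with a factor of 2, which is in line with the 23 ms observed.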
An OC-1 line provides for transmission speeds of up to 51.84 Mbit/s, with a payload of 50.112 Mbit/s and an overhead of 1.728 Mbit/s. (This base rate is multiplied for use by other OC-n standards. For example, an OC-3 connection is 3 times the rate of OC-1.) For comparison, the maximum data rate over an 802.11g wireless connection is 54 Mbit/s. In other words, the (theoretical) maximum data rate over wireless from a single computer is more than the capacity of an OC-1 line! As a result, unless you only run a single computer in each data center, it's paramount that the network between data centers has adequate capacity to provide the desired level of data synchronization. How much capacity you need is going to depend on how much data is changing (inserts, updates, deletes) for both application data and application state in each data center, as well as the level of dependency between the two data centers. Stated another way, the greater the dependencies (or sharing of data and state) between the data centers, the greater the network capacity required.
As an example, one customer I recently worked with was using an OC-3 connection between data centers and felt that should be adequate. However, it turned out they were updating more than 25 MB of application state data per second, which, when you perform a capacity calculation, meant that they were trying to transfer roughly 30% more data than they had capacity for (application state data changes were occurring at over 200 Mbit/s and the network only had effective capacity for ~153 Mbit/s) ...and this was only the application state data (for example, HTTP sessions); there was also the application data from the database servers that needed to be transmitted on top of this. It should come as no surprise that they were encountering latency problems when trying to run the two data centers in an Active/Active configuration. The customer was left with two options: either invest in a significant upgrade of network capacity, or stop running the data centers as Active/Active. They chose the latter. Before moving on, I want to reemphasize that I do not advocate sharing an HTTP session between cells, regardless of whether the cells are in the same data center or not. I say "reemphasize" because I discussed this topic in Part 2.
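To make the OC-n comparison concrete, here is a small sketch that scales the OC-1 figures quoted above. It assumes, as the text does, that an OC-n rate is simply n times the OC-1 rate. The payload figure for higher levels is therefore an approximation, since real framing overhead can differ slightly. The function names are hypothetical.

```python
# OC-1 figures as quoted in the text (Mbit/s).
OC1_LINE_MBPS = 51.84
OC1_PAYLOAD_MBPS = 50.112

# Maximum nominal 802.11g data rate (Mbit/s), for comparison.
WIFI_80211G_MBPS = 54.0


def oc_line_rate_mbps(n):
    """Nominal line rate of an OC-n connection: n times the OC-1 rate."""
    return n * OC1_LINE_MBPS


def oc_payload_mbps(n):
    """Approximate payload rate of OC-n, assuming simple scaling of OC-1."""
    return n * OC1_PAYLOAD_MBPS
```

With these assumptions, an OC-3 line comes out at 155.52 Mbit/s, and a single 802.11g link (54 Mbit/s) nominally exceeds an OC-1 line (51.84 Mbit/s).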
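The capacity calculation in this example is worth sketching, since the unit conversion (megabytes to megabits) is easy to overlook. The function names below are hypothetical, and the 70 percent target utilization is an assumed planning threshold rather than a figure from the customer engagement.

```python
def mbytes_to_mbits(mbytes_per_sec):
    """Convert a data rate from MB/s to Mbit/s (8 bits per byte)."""
    return mbytes_per_sec * 8


def link_utilization(change_rate_mbytes_per_sec, link_capacity_mbits_per_sec):
    """Fraction of the link consumed by replication traffic (1.0 = full)."""
    demand = mbytes_to_mbits(change_rate_mbytes_per_sec)
    return demand / link_capacity_mbits_per_sec


def has_headroom(change_rate_mbytes_per_sec, link_capacity_mbits_per_sec,
                 target_utilization=0.7):
    """True if replication traffic stays under an assumed planning target."""
    utilization = link_utilization(change_rate_mbytes_per_sec,
                                   link_capacity_mbits_per_sec)
    return utilization <= target_utilization


# The customer example: 25 MB/s of state changes over ~153 Mbit/s effective.
utilization = link_utilization(25, 153)
```

Here 25 MB/s is 200 Mbit/s, so `utilization` works out to roughly 1.3: the link was oversubscribed by application state alone, before any database traffic is counted.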
As you might have guessed by now, I'm not an advocate of trying to run multiple data centers in an Active/Active configuration; not only can it be expensive, but it's also complex. For example, if you share sessions across cells, this means that you're relying on a single database server (or database server and failover server), or trying to replicate data between data centers. As a result, maintenance on the database server(s) of any type results in an outage (or the steps to avoid an outage) across two cells. This adds complexity. In a like fashion, updates to the application server runtime can also make planning and executing an outage more difficult, as a software update (for example, to WebSphere Application Server) could result in the existing database server version no longer being supported. On the other hand, if each cell is independent, including the cells between data centers and the database server, then maintenance and updates can be applied on a cell by cell basis (meaning the WebSphere Application Server cell and its associated infrastructure), and an outage in one cell won't impact the other cell, which can continue to run and service requests. Returning to the issue of complexity, the more complex this infrastructure and the greater the dependency between data centers (or cells), the greater the likelihood of an outage. In other words, the very thing you're trying to avoid -- an outage -- is more likely to occur. While this may seem like a paradox, it's not totally unexpected. It's no accident that the fundamental tenet of reliability engineering is making things simpler -- or, stated another way, the more complex something is, the more likely it is to experience an outage!
To avoid outages, or at least an extended outage, my advice is to rely on an Active/Passive dual data center configuration. This recommendation is made not only because of the complexities noted above, but also in recognition that an unplanned outage, the loss of a data center, or a catastrophic event is a rarity, such that some interruption of service is ultimately acceptable (even to the most demanding customers). In the postmortem to any such event, the fact that application state (such as HTTP session) wasn't shared between data centers, and that a customer had to log back in as a result, will not be anywhere near the top of the priority list. I recommend to customers that it's better to tolerate a relatively small number of outages, as viewed from the end user perspective; trying to totally eliminate these outages by adding dependencies (such as shared application state) between data centers or cells could add such a level of complexity that service interruptions actually increase in frequency. You also need to consider that it's better to have an expected outage of a short duration than to have to deal with an unplanned outage of indeterminate duration because something you didn't consider actually occurred.
The preferred option?
Of course, there is the (not unexpected) perception that running in Active/Passive means you have a large amount of capacity standing by "doing nothing," and this doesn't always resonate well with management. If this is your situation, then I'd suggest you look at the variation on Active/Active, which I call Reformed Active/Active.
Thus, from a capacity perspective, both data centers are being utilized. However, in the event of an unplanned outage (that is, a "disaster"), there will be a loss of service while users' requests fail and they then attempt to reconnect (and are implicitly moved to the secondary data center). Yes, this also means that since application state isn't being shared, customers will have to log back in -- but you have just experienced a disaster, so having to manually log back in probably isn't the highest priority. There is also the fact that, by design, application state is intended to be transient, so the loss of this data shouldn't be viewed as critical, and there are application tactics that can minimize the loss, such as maintaining application state via URL rewriting, or via smart serialization. A far more important issue that needs to be addressed is the implication of losing 50% of your system capacity should one data center go out of service. This might mean that some less critical applications run in degraded mode (or perhaps don't run at all). Since it's important that your critical applications continue to function in order to provide for continuity of business -- which usually doesn't require "continuous availability" for all applications -- you must categorize your applications so that you're prepared for such an occurrence.
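One way to think about categorizing applications for a loss of capacity is sketched below. It assumes each application has been given a criticality tier and an estimate of the share of total capacity it needs. The `App` type, the tier scale, the sample applications, and the greedy selection are all hypothetical illustrations, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    tier: int              # 1 = most critical (hypothetical scale)
    capacity_share: float  # fraction of total normal capacity required


def plan_degraded_mode(apps, available_capacity=0.5):
    """Pick which applications keep running when capacity is reduced.

    Applications are considered in order of criticality; each one runs
    if it still fits in the remaining capacity, otherwise it is shed.
    """
    running, shed = [], []
    remaining = available_capacity
    for app in sorted(apps, key=lambda a: a.tier):
        # Small tolerance guards against floating-point rounding.
        if app.capacity_share <= remaining + 1e-9:
            running.append(app.name)
            remaining -= app.capacity_share
        else:
            shed.append(app.name)
    return running, shed


# Hypothetical portfolio: with 50% capacity, "reports" is shed.
apps = [App("orders", 1, 0.3), App("reports", 3, 0.3), App("search", 2, 0.2)]
running, shed = plan_degraded_mode(apps)
```

The point is less the algorithm than the inputs: the tiers and capacity estimates have to exist before the disaster, not be worked out during it.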
Q: In the unlikely event of a true disaster, what is most important?
A: For the purposes of this discussion I have assumed that the answer is "business continuity," and that you wouldn't really be all that bothered about losing a few application sessions. However, if you are seriously concerned about losing application sessions, then you are likely making inappropriate use of application state (for example, HTTP session). In other words, if the liability of losing state is high, you should use an appropriate persistence mechanism, such as a database accessed via JDBC.
Q: If you are in a disaster recovery state, what percentage of normal operating capacity do you need to stay in business?
A: If the answer is 100 percent, then you need two data centers, each at full capacity, and thus there is absolutely no reason to do anything other than Active/Passive. If the response is that you can't afford to provision two redundant data centers, and that you could limp along fine with one at 50 percent, then you are in an Active/Active topology during normal business, and the question then becomes what should be shared, and how? Hopefully, the answer here is only the application database, using database facilities, but no session state, and all users are pinned to one and only one data center -- except during disaster recovery.
When asked, my standard recommendation is that organizations not pursue an Active/Active configuration for dual (or multiple) data centers, and instead concentrate on improving reliability within each data center and on decreasing recovery time for failover between data centers running in an Active/Passive (or Reformed Active/Active) configuration. That would be my approach, and I say this having experienced a disaster (in my case, an earthquake) that took out our data center. Rest assured, all of us evacuating the building weren't immediately concerned about users having to log back in again! Of course, it doesn't have to be something as devastating as an earthquake to bring down a data center, since something as small as a single berserk router can do the trick, and I know of a couple of places where this has occurred.
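The provisioning arithmetic behind this answer can be sketched as follows. It assumes equally sized data centers and the loss of exactly one of them. The function name and parameters are hypothetical, and capacities are expressed as fractions of normal peak load.

```python
def provisioning(dr_fraction, data_centers=2):
    """Return (capacity per data center, total capacity provisioned).

    dr_fraction is the share of normal load that must still be served
    after one data center is lost. Each site must be large enough for
    the survivors to carry that share, and the sites together must also
    be able to carry the full normal load.
    """
    if data_centers < 2:
        raise ValueError("need at least two data centers")
    if not 0 < dr_fraction <= 1:
        raise ValueError("dr_fraction must be in (0, 1]")
    per_site = max(dr_fraction / (data_centers - 1), 1 / data_centers)
    return per_site, per_site * data_centers
```

With two data centers, requiring 100 percent in disaster recovery means each site is sized for the full load (200 percent provisioned in total), which is the Active/Passive cost. Accepting 50 percent means each site carries half the load and nothing is idle, which is the case where both sites must be active during normal business.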
There may be internal reasons that require you to attempt an Active/Active configuration; hopefully, it's not because you're experiencing outages in your current single data center or multiple independent data centers. If you are, it's doubtful that adding dependencies between data centers will improve availability for the reasons discussed above. My advice in this case is to expend the effort on eliminating the root causes of the current outages, rather than add additional potential causes of an outage.
I know that in the past these FAQ columns have dealt with multiple questions on different topics, and that this one only deals with a single topic -- and on a topic that I already discussed twice, no less -- but I hope that the coverage of this one, very popular subject is comprehensive enough so that you don't feel too shortchanged.
Thanks to Keys Botzum, Paul Ilechko and Alex Polozoff for their suggestions and comments.
- Everything You Wanted to Know About WebSphere Application Server, but were afraid to ask, Part 2
- Everything You Wanted to Know About WebSphere Application Server, but were afraid to ask, Part 3
- Improving HttpSession Performance with Smart Serialization