In each column, The WebSphere® Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.
One more time, again
While I never seem to notice a shortage of questions related to high availability or disaster recovery, the previous installment of The WebSphere Contrarian, which dealt with WebSphere Application Server management high availability options, seems to have prompted an increase in the number of these questions as of late. Therefore, I’m going to continue with the high availability theme here, plus add some thoughts on what’s required to achieve high availability (HA) and continuous availability (CA). But before we start discussing these, let’s make sure we all have a common understanding of these two terms:
- High availability: The infrastructure (or the applications running on it) cannot undergo an unplanned outage for more than a few seconds or minutes at a time, but might do so without serious impact to an enterprise’s business. In addition, it’s acceptable to the enterprise to bring down the application on occasion for a few hours for scheduled maintenance.
- Continuous availability: The infrastructure (or the applications running on it) cannot be interrupted at all. Essentially, there is no allowance for any outage, either unplanned or planned. This availability level is often referred to as the "Five 9s" or 99.999% availability, which translates into just over 5 minutes per year of planned or unplanned outages in total.
I’ll add that oftentimes someone will state that they "only" require "Four 9s" (99.99%) availability or some similar figure, thinking that this categorizes them as HA, when in fact there’s little meaningful difference between 99.99% and 99.999% availability over the course of a year. If you do the math, you’ll see that with 99.99% availability you’re still requiring total outages to be just under 53 minutes per year in total; stated another way, it’s unlikely that you’re any more tolerant of an unplanned outage than with 99.999% availability, and it’s equally unlikely that you’re going to make an allowance for a planned outage.
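The arithmetic behind these availability figures is easy to check. The following is a minimal sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is made up for illustration and is not part of any WebSphere tooling.

```python
# Yearly outage budget implied by an availability percentage.
# Assumption: a 365-day year (leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the total outage time per year, in minutes, that a given
    availability percentage allows (planned and unplanned combined)."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for label, pct in [("Three 9s", 99.9), ("Four 9s", 99.99), ("Five 9s", 99.999)]:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes per year")
```

Run as a script, this reports roughly 525.6 minutes (under 9 hours) for 99.9%, about 52.56 minutes for 99.99%, and about 5.26 minutes for 99.999%. Neither of the last two budgets leaves room for a scheduled maintenance window of a few hours.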
I’m not going to go into the specific operational procedures for HA or CA when running WebSphere Application Server Network Deployment here, since they are already documented in both the book and the article listed at the end of this column. After reading one or both of these references it should be apparent that while a single Network Deployment cell can provide HA when carefully managed with good procedures and careful planning, CA is all but certain to require dual Network Deployment cells. Moreover, while dual cells slightly increase the administration effort (because you have two cells to administer), the administration is greatly simplified from an operational perspective for either an HA or a CA environment by virtue of the benefits described below.
While two (or more) cells do not require separate hardware -- you can employ coexistence to run multiple cells on the same hardware -- it is better to dedicate hardware to each cell. This is because if each cell has separate hardware, it provides complete hardware and network isolation between the cells. If a server, server frame, router, or other device in one cell were to become inoperable, isolation of the hardware ensures that it doesn’t impact another cell. This way, a cell with the failed device can be taken offline, while production is serviced by the remaining cell (or cells). Any repair work done on the failed device will not affect the other cells and, once repaired, it can be independently tested without impact to production. Multiple cells, each with independent hardware, also provide a way to perform hardware upgrades. This is because a cell can be rotated out of production, with the other cells handling the load while servers or server frames are swapped out so that memory and CPU upgrades can take place. As with the case of a repair, once the upgrade is completed, the hardware/software combination can be tested without negatively affecting any users if some problem occurred with the upgrade and caused a hardware malfunction.
With multiple cells, the ability to rotate a cell out of production enables software maintenance (for example, fixpacks, patches, and so on) to be applied to the OS, the infrastructure middleware, or the application itself. Once the upgrade is completed, the hardware/software combination can be tested without negatively affecting any users if some problem occurred with the upgrade.
Obviously, this scenario is much more complex if application upgrades require corresponding changes to a shared database schema. Additional database update strategies need to be considered when using more than one cell for this type of update, because the cell running in production will not be aware of the new database schema.
Speaking of database updates, the operational complexity that is introduced when trying to update a single database server while still serving application data requests is analogous to trying to employ a single cell to satisfy an HA or CA service level requirement while simultaneously applying hardware or software maintenance. This is why the additional administrative effort (which should be as simple as running the same administrative scripts twice, once in each cell) should be an obvious tradeoff for the simplified operational procedures that result.
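Running the same administrative script once in each cell can itself be wrapped in a small driver. The following is a minimal sketch, not a definitive implementation: the `Cell` class, the host names, the script name `deploy_app.py`, and the choice to stop at the first failure are all assumptions made for illustration. The `wsadmin` options used (`-lang`, `-conntype`, `-host`, `-port`, `-f`) are standard, but the path to `wsadmin.sh` and the deployment manager's SOAP port (8879 is assumed here) depend on your installation.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Connection details for one cell's deployment manager (hypothetical values)."""
    name: str
    dmgr_host: str
    soap_port: int


def build_command(cell: Cell, script: str, wsadmin: str = "wsadmin.sh") -> list:
    """Build the wsadmin command line that runs one Jython script against one cell."""
    return [
        wsadmin,
        "-lang", "jython",
        "-conntype", "SOAP",
        "-host", cell.dmgr_host,
        "-port", str(cell.soap_port),
        "-f", script,
    ]


def run_on_cells(cells, script, runner=subprocess.run):
    """Run the same script in each cell, one cell at a time.

    Stops at the first non-zero return code so that a bad change is not
    repeated in the cells that are still serving production.
    Returns a dict of cell name -> return code for the cells attempted.
    """
    results = {}
    for cell in cells:
        completed = runner(build_command(cell, script), check=False)
        results[cell.name] = completed.returncode
        if completed.returncode != 0:
            break
    return results


if __name__ == "__main__":
    # Hypothetical cells; replace with your own deployment manager hosts.
    cells = [
        Cell("cellA", "dmgr-a.example.com", 8879),
        Cell("cellB", "dmgr-b.example.com", 8879),
    ]
    print(run_on_cells(cells, "deploy_app.py"))
```

Stopping at the first failure is a design choice in this sketch: it keeps the untouched cell available to service production while the failed change is investigated.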
Insurance against a catastrophic outage
While scripting of all administrative actions for a production environment is fundamental in providing the repeatable processes that are requisite for maintaining an HA or CA environment, changes can introduce mistakes or other disastrous results. Inadvertent keystrokes by fatigued system administrators have resulted in files being deleted from file systems, whole applications uninstalled, or memory upgrades frying hardware internals. When running a highly available site, you must assume that a catastrophic incident will occur one day in a cell. This is where the independent cell(s) are invaluable. If a change doesn’t go as planned in one cell -- one that had hopefully been rehearsed in a pre-production environment -- the other cell can continue to service production while corrective action is taken, both to the cell and to the source of the problem.
Related to administering separate cells and the additional effort involved, it’s not uncommon for me to encounter a client who is using disk replication to maintain a mirror image for a second cell (or site). If you’re using or contemplating such an approach, think carefully about what happens in the case of a mistake, such as those described above, and the impact of having a mistake automatically replicated from your production cell -- one that takes it out of service -- to your standby cell. It’s not that I’m against using an automated mechanism such as disk replication for propagating changes or data from one environment to another -- this technology works quite well -- but make sure that you have a file system backup or "snapshot" of the environment before applying changes so that you have a recovery point in the case of problems. I consider running scripts multiple times to be the easiest approach to maintaining consistent environments and the effort involved to be a minor cost, but you may decide that disk replication with a backup is an approach that works better in your environment. The point here is that if you do rely on an automated mechanism, make sure that you have recovery plans in case you end up propagating a problem across your entire infrastructure.
Keep in mind that disk replication is often the only feasible way to create a disaster recovery site that is transactionally consistent with the primary site. Thus, if you have a requirement for both disaster recovery and continuous availability as described here, one good approach is two cells in one data center, each being managed using scripts, where the contents of the cells (configuration, logs, data, and so on) are replicated to a remote data center.
The more the merrier?
Related to running multiple cells is the question: Are two cells enough? The reason I mention this is the "rule of 3." Essentially, if you have two of anything -- cells, servers, routers, and so on -- and one is removed from service (either for maintenance or as the result of breakage), the single remaining cell or component is now a single point of failure. Additionally, you’re now running at one-half capacity. You’ll need to carefully consider how many levels of redundancy are required in order to meet the operational requirements of your enterprise, perhaps by buying three (or more) levels for your infrastructure. Obviously there are limits to how much redundancy you can obtain; aside from financial constraints, there’s also the fact that "You can’t have everything. Where would you put it?"
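The "rule of 3" reasoning can be made concrete with a few lines of arithmetic. This is a minimal sketch under two simplifying assumptions that the column does not state: every cell (or component) has equal capacity, and load is spread evenly across them. The function names are made up for illustration.

```python
def remaining_capacity(total_units: int, units_out: int) -> float:
    """Fraction of total capacity left when some identical units are out of service."""
    if total_units <= 0 or not 0 <= units_out <= total_units:
        raise ValueError("units_out must be between 0 and total_units")
    return (total_units - units_out) / total_units


def is_single_point_of_failure(total_units: int, units_out: int) -> bool:
    """True when exactly one unit is left carrying production."""
    return total_units - units_out == 1


if __name__ == "__main__":
    # One unit removed for maintenance or because of a failure.
    for total in (2, 3, 4):
        capacity = remaining_capacity(total, 1)
        spof = is_single_point_of_failure(total, 1)
        print(f"{total} cells, 1 out: {capacity:.0%} capacity, single point of failure: {spof}")
```

With two cells and one out of service, half the capacity remains and the survivor is a single point of failure. With three, about two-thirds of the capacity remains and there is still redundancy.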
Thanks to Alex Polozoff for his suggestions and comments.
- The WebSphere Contrarian: Run time management high availability options, redux
- IBM WebSphere: Deployment and Advanced Configuration by Roland Barcia, Bill Hines, Tom Alcott and Keys Botzum, IBM Press, 2004
- Maintain continuous availability while updating WebSphere Application Server enterprise applications
- Information Center:
- IBM developerWorks WebSphere