My first exposure to high availability (HA) was in the 1980s, working on a commercial geosynchronous communications satellite. Because signals must pass through the satellite reliably, the platform's primary goal was pointing. Pointing refers to the requirement to point the satellite (or an instrument on the satellite) toward some ground object. When a satellite loses pointing, the data path through the satellite can be lost as well, resulting in lost revenue.
Satellites commonly use a variety of methods for availability, from radiation hardening their components to incorporating redundant systems and buses. But fault tolerance went even deeper in this platform, within the processor itself. The main processor was a 16-bit avionics-class processor that incorporated two internal CPUs executing instructions in lock-step. The outputs of the dual internal CPUs were compared, and if a difference was found (commonly the result of an alpha particle causing an upset event), the processor was deemed transiently unreliable, and a hot swap to the redundant processor occurred. The redundant processor (or hot spare) was kept up to date on the overall operation through a simple memory shared between the master and slave, so it could immediately take control of the satellite and ensure no loss of pointing.
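The lock-step scheme can be illustrated with a toy sketch. This is not the satellite's actual firmware, just a hedged model of the idea: two "CPUs" execute the same step, a comparator detects divergence (as if from a radiation-induced upset), and the spare resumes from shared state.

```python
def step(state, fault=False):
    """One lock-step instruction: increment the counter held in shared state."""
    result = state["counter"] + 1
    return result ^ 1 if fault else result  # a fault flips a bit in the output

shared_state = {"counter": 0}          # shared memory keeping the spare current
active, spare = "cpu-A", "cpu-B"

out_a = step(shared_state)             # primary internal CPU
out_b = step(shared_state, fault=True) # duplicate CPU hit by an upset event

if out_a != out_b:                     # comparator detects the divergence
    active, spare = spare, active      # hot swap to the redundant processor
    out = step(shared_state)           # re-execute the step from shared state
else:
    out = out_a

shared_state["counter"] = out
print(active, shared_state["counter"])  # cpu-B 1
```

Because the spare reads the same shared state, the re-executed step produces the result the faulted step should have produced, and pointing control continues without interruption.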
These concepts were certainly not new in the 1980s, and they continue to be applied today to support HA in all aspects of systems design. The goal in designs such as this built around redundancy is the elimination (to an extent) of the single point of failure. This article explores the concepts of redundancy and failover as applied to virtualization environments in addition to the open source offerings for software HA.
The approaches discussed here are capable of operating with commodity hardware. This still implies the failure rates consistent with commodity hardware, but the approaches do improve on them. True enterprise-level fault tolerance, as found in IBM mainframes, is a complex and detailed science covered by a combination of hardware and software. This article explores only those techniques implemented in pure software.
HA with failover
Redundancy is a common approach to increasing the availability of a system. This approach works well at all levels of a system, from redundancy of network connections (to manage outages and general mis-cabling) to redundancy in power supplies (to minimize loss of power) to redundancy in storage (through replication such as mirroring or with the use of error codes such as Exclusive OR [XOR] with redundant array of independent disks [RAID]-5).
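The XOR-based redundancy behind RAID-5 is simple enough to sketch directly. In this hedged illustration, the parity block for a stripe is the XOR of its data blocks, so any single lost block can be rebuilt by XOR-ing the survivors:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # one stripe of data blocks
parity = xor_blocks(data)                        # parity block on a fourth disk

lost = data[1]                                   # simulate losing one disk
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)                  # XOR of survivors recovers it
print(rebuilt == lost)                           # True
```

The same property holds no matter which single block is lost, which is why RAID-5 tolerates any one disk failure per stripe.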
This approach also works at higher levels, where entire servers are redundant to minimize downtime. With redundant servers, the process of swapping from a failed server to a spare server is called failover (or hot-swapping). The server being failed to is called the spare—a hot-spare if the server is active or a cold-spare if the server is off and must be powered and initialized before the failover can be performed. Although a hot-spare minimizes the time required for failover, it also consumes power and is therefore less efficient.
Let's look at two approaches to failover (stateless and stateful) and the various models they support.
Stateless failover is the simplest approach but also the least useful, because no state is maintained between the failed server and the backup server.
In the simplest case, a server is paired with a backup that serves as the failover target. Using some scheme to identify when the active server (or its application) has failed (such as through a heartbeat), the backup server takes control. In the case of a web server, a load balancer removes the failed server and replaces it with the backup server as the target for web requests. This process is shown in Figure 1.
Figure 1. Simple stateless failover example
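The heartbeat-and-swap logic from Figure 1 can be sketched as follows. The names and timeout here are illustrative assumptions, not Linux-HA APIs: a load balancer routes to the active server until its heartbeat goes stale, then fails over to the backup.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before declaring failure (illustrative)

class LoadBalancer:
    def __init__(self, active, backup):
        self.active, self.backup = active, backup
        self.last_beat = time.monotonic()

    def heartbeat(self):
        """Called periodically by the active server to signal liveness."""
        self.last_beat = time.monotonic()

    def target(self):
        """Return the server that should receive the next request."""
        if time.monotonic() - self.last_beat > HEARTBEAT_TIMEOUT:
            # Heartbeat is stale: remove the failed server, promote the backup.
            self.active, self.backup = self.backup, self.active
            self.last_beat = time.monotonic()
        return self.active

lb = LoadBalancer("web-1", "web-2")
print(lb.target())          # web-1 while heartbeats arrive
lb.last_beat -= 10          # simulate a missed heartbeat window
print(lb.target())          # web-2 after failover
```

Because no state is carried over, any in-flight requests on the failed server are simply lost; clients see the backup as a fresh server.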
This type of solution is implemented through the Linux-HA project. Linux-HA is an umbrella project over a set of HA open source building blocks for the development of HA cluster systems. It consists of a messaging layer, a set of resource agents (which provide standardized interfaces for cluster resources), a heartbeat daemon (to permit registration and notification of new or failed resources), and a cluster resource manager (to provide service orchestration).
Resources in this context refer to a range of entities, such as an instance of an Apache web server or an instance of a file system mounted on a device. The resource agent is commonly implemented as a shell script that exposes a set of operations for the resource (such as start, stop, and monitor). These operations provide a standardized interface to the resource manager. The Linux-HA project provides a repository for a large number of resource agents (see Resources for more details). Figure 2 provides a view of this software stack.
Figure 2. HA cluster software stack
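The resource-agent contract can be modeled with a small sketch. Real Linux-HA agents are shell scripts following the Open Cluster Framework (OCF) conventions; this hedged Python model only shows the shape of the interface the cluster resource manager consumes, using two of the standard OCF return codes:

```python
OCF_SUCCESS, OCF_NOT_RUNNING = 0, 7   # two standard OCF return codes

class DummyResourceAgent:
    """Illustrative stand-in for an OCF-style resource agent."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True           # a real agent would launch the service
        return OCF_SUCCESS

    def stop(self):
        self.running = False          # a real agent would shut the service down
        return OCF_SUCCESS

    def monitor(self):
        """Report status the way the resource manager polls for it."""
        return OCF_SUCCESS if self.running else OCF_NOT_RUNNING

agent = DummyResourceAgent()
print(agent.monitor())   # 7 (not running)
agent.start()
print(agent.monitor())   # 0 (running)
```

Because every agent answers start, stop, and monitor the same way, the resource manager can orchestrate an Apache instance, a mounted file system, or any other resource without knowing its internals.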
The Resources section also provides a list of the open source packages that can be used for the various cluster stack layers (such as messaging, management, and resources). You'll also find links to other useful open source packages, such as policy engines and other resource management daemons.
Although stateless failover works in a large number of usage models, it's certainly not ideal in cases where statefulness is required. For example, if an operating system or application maintains information relevant to the failover server (such as application state or operating system-level information such as network connections), a more complex method is required—one that provides a transparent and timely migration without the operating system or application being aware that it occurred.
One such method is implemented in the Xen hypervisor, following the work of a research project called Remus, and in the Kernel-based Virtual Machine (KVM), through the work of Kemari. Remus and Kemari define a stateful VM migration method that transparently migrates a VM from a failed server to a new server in a continuous fashion. Both approaches (Xen and KVM) rely on the existing capability of live migration, which I describe first before exploring stateful VM failover.
Live (or online) migration is a feature implemented in most hypervisors that permits a VM to be moved from one physical server to another without stopping it. From the perspective of the VM, the migration is transparent, with the only visible side effect being a small amount of latency for external communication. Note that live migration is not a feature that's restricted to platform virtualization (through a hypervisor). You can also find live migration in the context of operating system virtualization through open source products like OpenVZ. Operating system virtualization utilizes existing suspend and restore features.
The process of migration centers on packaging the suspended state of the VM (represented by its memory pages), moving those pages to a new host, and then resuming the VM's execution there. The phase in which the VM is suspended so that it can be moved is generally called stop-and-copy, because the VM is halted while its remaining pages are copied (a pre-copy phase may precede it). This phase results in a noticeable delay proportional to the number of pages (the dirty working set) that must move from the old host to the new one.
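The interplay between pre-copy and stop-and-copy can be simulated. This hedged sketch (the page counts and dirtying rate are invented for illustration) copies pages while the "running" VM keeps dirtying some of them; once the dirty set falls below a threshold, the VM is suspended for the final stop-and-copy pass, whose cost is the remaining dirty set:

```python
import random

random.seed(1)
TOTAL_PAGES, THRESHOLD = 1000, 50   # illustrative sizes, not hypervisor defaults

dirty = set(range(TOTAL_PAGES))     # initially, every page must be transferred
rounds = 0
while len(dirty) > THRESHOLD:
    copied = set(dirty)             # pre-copy: transfer the current dirty set
    dirty -= copied
    # The still-running VM re-dirties a shrinking fraction of its pages.
    dirty |= {random.randrange(TOTAL_PAGES) for _ in range(len(copied) // 4)}
    rounds += 1

# Stop-and-copy: suspend the VM and move whatever is still dirty.
downtime_pages = len(dirty)
print(rounds, downtime_pages)       # downtime scales with the final dirty set
```

The simulation shows why pre-copy converges: each round's new dirty set is smaller than the last, so the eventual suspension moves only a small residue of pages.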
Continuous live migration
The Remus project (from the University of British Columbia) took the idea of live migration and tweaked it to provide HA in the context of a VM. This was provided through the process of continuous migration to another server. The backup or shadow server is a continuous checkpoint of the active server, which is implemented through two primary changes to the traditional process of live migration.
The first change is related to the migration of memory pages between hosts. Recall that in live migration, pages are migrated while the VM continues to run (the pre-copy phase); when a threshold of dirty pages is reached, live migration enters the stop-and-copy phase, in which the VM is temporarily suspended so that all remaining dirty pages can be moved. In continuous live migration, the active host performs pre-copy continuously, taking a checkpoint at each stop-and-copy phase. In essence, the active hypervisor performs a live migration (with stop-and-copy) but keeps the VM running on the current active host. Each stop-and-copy phase thus produces a checkpoint. When a failure occurs, the backup hypervisor (on an independent physical server) uses the most recent checkpoint to restart the image (see Figure 3).
Figure 3. VM HA through continuous live migration
The second change relates to I/O in the active VM. Although memory is straightforward to manage, I/O synchronization is a more complex problem. Network packets are buffered until a checkpoint occurs; once the backup host indicates that the checkpoint is complete, the buffered packets are released, preserving the externally visible network state of the host. Storage is a more complex matter and is solved by simple mirroring on the backup host: when a block is written to the active disk, a copy of the block is also written to the backup host, and once a checkpoint occurs, the disk buffer is released.
Checkpoint frequency is configurable but documented to occur every 25 ms. Having the checkpoints occur at this high a frequency minimizes the number of dirty pages that must be transferred and also minimizes the buffering cost for network and storage frames.
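The checkpoint-and-release discipline described above can be sketched as follows. This is a hedged model of the Remus-style protocol, not Xen or KVM code: outbound packets are held until the backup acknowledges the checkpoint, so the outside world never observes state the backup doesn't have.

```python
class Backup:
    """Shadow host that applies checkpoints from the active VM."""

    def __init__(self):
        self.state = {}

    def apply_checkpoint(self, dirty_pages):
        self.state.update(dirty_pages)   # backup now mirrors the active VM
        return True                      # acknowledgment back to the active side

class ActiveVM:
    def __init__(self, backup):
        self.backup, self.memory, self.dirty, self.net_buffer = backup, {}, {}, []

    def write(self, page, value):
        self.memory[page] = value
        self.dirty[page] = value         # tracked for the next checkpoint

    def send(self, packet):
        self.net_buffer.append(packet)   # held until the checkpoint is acked

    def checkpoint(self):
        acked = self.backup.apply_checkpoint(self.dirty)
        self.dirty = {}
        if acked:
            released, self.net_buffer = self.net_buffer, []
            return released              # packets now safe to put on the wire

backup = Backup()
vm = ActiveVM(backup)
vm.write("page-1", "hello")
vm.send("pkt-1")                         # buffered, not yet externally visible
released = vm.checkpoint()
print(released, backup.state)            # ['pkt-1'] {'page-1': 'hello'}
```

If the active host dies before the checkpoint is acknowledged, the buffered packets were never sent, so the backup can resume from its last checkpoint without the network ever seeing inconsistent state.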
You can find continuous live migration today in open source hypervisors. Xen was the first target for these ideas from project Remus, but KVM also implements this functionality through the Kemari project. Both approaches operate on an unmodified operating system. You can learn more about them in the Resources section.
Although the approaches discussed here have made their way into the leading open source hypervisors, a plethora of work is going on in this area.
In 1995, Thomas Bressoud and Fred Schneider researched the implementation of fault tolerance in hypervisor environments through replication protocols. Before Remus, there was a project called SecondSite, which provided the foundation and the core ideas for Remus. SecondSite also found that compression of dirty pages could yield as much as an 80% improvement in bandwidth for a single VM replication.
But one of the most interesting approaches is a project called ExtraVirt, which focuses on managing transient processor faults through virtualization. ExtraVirt defines a hypervisor that sits above a multi-core processor, running replicas of each VM on separate cores. The hypervisor manages replication at the VM level and monitors the replicas' outputs (before they become externally visible) to detect, and then recover from, internal processor faults. The replicas are kept consistent through VM logging and replay. ExtraVirt is implemented as an extension to the Xen hypervisor.
ExtraVirt is a virtualization-based solution for fault tolerance, but other schemes exist for processor-level fault tolerance. Some of the interesting approaches include Software Implemented Fault Tolerance (SWIFT) and Error Detection by Duplicated Instructions (EDDI). EDDI duplicates an instruction stream so that the results of the two streams can be compared. If the results agree, no fault has occurred; if they disagree, a fault has been detected and is handled accordingly. The compiler is responsible for generating, and then intertwining, the duplicated instruction streams, making this a software-only solution. The SWIFT project takes a similar approach but implements a more efficient scheme by reclaiming unused instruction-level resources for the duplicated instructions.
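The EDDI idea of duplicated streams reduces to a simple pattern, sketched here at a much higher level than the compiler-generated instruction streams it actually uses: run each computation twice, compare, and flag any disagreement as a transient fault.

```python
def checked(op, *args, fault=None):
    """Run op twice (the 'duplicated instructions') and compare the results."""
    primary = op(*args)
    shadow = op(*args)
    if fault is not None:
        shadow = fault            # inject a bit flip into the shadow stream
    if primary != shadow:
        raise RuntimeError("transient fault detected")
    return primary

# Fault-free case: the duplicated streams agree, and the result flows on.
total = checked(lambda a, b: a + b, 2, 3)
print(total)                      # 5

# A flipped bit in the shadow stream is caught by the comparison.
try:
    checked(lambda a, b: a + b, 2, 3, fault=6)
except RuntimeError as e:
    print(e)                      # transient fault detected
```

In real EDDI, the duplication and comparisons are emitted and interleaved by the compiler, so the technique costs extra instructions but requires no hardware support.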
Xen and KVM continue to evolve, with new features addressing new markets like those requiring fault tolerance. This type of feature is ideal for cloud environments, as it provides yet another knob (or service level agreement) for enterprise-level features on commodity servers. Although live migration was an important feature for load balancing over a cluster of servers, failover built on live migration opens new applications within the cloud for HA-dependent applications.
- High availability with the Distributed Replicated Block Device (M. Tim Jones, developerWorks, August 2010) provides an introduction to one method for data mirroring. DRBD is an open source project that provides RAID-1-type functionality split between two servers and two storage devices.
- The Linux-HA project is a collection of projects that has implemented HA for Linux® and other platforms since 1999. The Linux-HA project is an umbrella over a number of HA building blocks for constructing HA cluster systems.
- The messaging layer of a cluster system provides messaging and membership capabilities that are necessary for understanding the presence (or lack) of a service. Examples of messaging layers include Heartbeat, Corosync, and OpenAIS. There also exists what is called Cluster Glue, which is fundamentally a set of miscellaneous components, including error reporting, libraries, and utilities.
- Resource agents act as standardized interfaces to entities in the cluster ecosystem. These can be software services such as an Apache web server or physical services such as an audible alarm. The link provides a list of the current resource agents for a wide variety of services.
- Pacemaker is a cluster resource manager that orchestrates the management activities of the cluster (starting and stopping services, monitoring, and so on). Pacemaker is a key part of the Linux-HA solution, and you can learn more about its architecture, including an introduction to other components such as the pengine policy engine, and various resource agents. You can also learn about the pacemaker CLI, its design goals, and command list.
- HA failover through continuous VM synchronization is implemented in Xen and KVM. You can learn more about their implementations at their project websites (Remus and Kemari).
- Other work around fault tolerance can be found in Hypervisor-based Fault Tolerance, SecondSite (the precursor to Remus), ExtraVirt (for managing transient processor failures through virtualization), and SWIFT (for software-specific fault tolerance through duplicated instruction streams).