High availability with the Distributed Replicated Block Device
High-availability storage with Linux and DRBD
The Distributed Replicated Block Device (DRBD) provides a networked version of data mirroring, classified under the redundant array of independent disks (RAID) taxonomy as RAID-1. Let's begin with a quick introduction to high availability (HA) and RAID, and then explore the architecture and use of the DRBD.
Introducing high availability
High availability is a system design principle for increased availability. Availability, or the measure of a system's operational continuity, is commonly defined as a percentage of uptime within the span of a year. For example, if a given system is available 99% of the time, then its downtime for a year is measured as 3.65 days. The value 99% is usually called two nines. Compare this to five nines (99.999%), and the maximum downtime falls to 5.26 minutes per year. That's quite a difference and requires careful design and high quality to achieve.
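The arithmetic behind those figures is simple enough to check directly. A quick sketch (the availability values are the ones quoted above):

```shell
# Downtime per year implied by an availability percentage:
# "two nines" (99%) versus "five nines" (99.999%).
for avail in 99 99.999; do
  awk -v a="$avail" 'BEGIN {
    down_min = (1 - a/100) * 365 * 24 * 60     # minutes of downtime per year
    printf "%.3f%% available -> %.2f days (%.2f minutes) down per year\n",
           a, down_min / (24 * 60), down_min
  }'
done
```

For 99%, this yields 3.65 days of downtime per year; for 99.999%, about 5.26 minutes.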
One of the most common implementations for HA is redundancy with failover. In this model, for example, you can define multiple paths to a given resource, with the available path being used and the redundant path used upon failure. Enterprise-class disk drives illustrate this concept, as they provide two ports of access (compared to one access port for consumer-grade drives).
As I write this, I'm sitting on a Boeing 757. Each wing includes its own jet engine. Although the engines are extremely reliable, one could fail, and the plane could continue to fly safely with that remaining single engine. That's HA (via redundancy) and applies to many applications and scenarios.
My first job was for a large defense company building geosynchronous communications satellites. At the core of these satellites was a radiation-hardened computing system that was responsible for command and telemetry (a satellite's user interface), power and thermal management, and pointing (otherwise known as keeping telephone conversations and television content flowing). For availability, this computing system was a redundant design, with two sets of processors and buses and the ability to switch between a master and a slave if the master was found to be unresponsive. To make a long story short, redundancy in systems design is a common technique to increase availability at the cost of additional hardware (and software).
Redundancy in storage
Not surprisingly, using redundancy in storage systems is also common, particularly in enterprise-class designs. It's so common that a standard approach—RAID—exists with a variety of underlying algorithms, each with different capabilities and characteristics.
RAID was first defined in 1987 at the University of California, Berkeley. Traditional RAID levels include RAID-0, which implements striping across disks for performance (but not redundancy), and RAID-1, which implements mirroring across two disks so that two copies of information exist. With RAID-1, a disk can fail, and information can still be acquired through the other copy. Other RAID levels include RAID-5, which includes block-level striping with distributed parity codes across disks, and RAID-6, which includes block-level striping with double distributed parity. Although RAID-5 can support failure of a single drive, RAID-6 can support two drive failures (though more capacity is consumed through parity information). RAID-1 is simple, but it's wasteful in terms of capacity utilization. RAID-5 and RAID-6 are more frugal with respect to storage capacity, but they typically require additional hardware processing to avoid burdening the processor with the parity calculations. As usual, trade-offs abound. Figure 1 provides a graphical summary of these RAID-0 and RAID-1 schemes.
Figure 1. Graphical summary of RAID schemes for levels 0 and 1
RAID technologies continue to evolve, with a number of so-called nonstandard techniques coming into play. These techniques include Oracle's RAID-Z scheme (which solves RAID-5's write-hole problem); NetApp's RAID-DP (for diagonal parity), which extends RAID-6; and IBM's RAID 1E (for enhanced), which implements both striping (RAID-0) and mirroring (RAID-1) over an odd number of disks. Numerous other traditional and nontraditional RAID schemes exist: See the links in Related topics for details.
Now, let's look at the basic operation of the DRBD prior to digging into the architecture. Figure 2 provides an overview of DRBD in the context of two independent servers that provide independent storage resources. One of the servers is commonly defined as the primary and the other secondary (typically as part of a clustering solution). Users access the DRBD block devices as a traditional local block device or as a storage area network or network-attached storage solution. The DRBD software provides synchronization between the primary and secondary servers for user-based Read and Write operations as well as other synchronization operations.
Figure 2. Basic DRBD model of operation
In the active/passive model, the primary node is used for Read and Write operations for all users. The secondary node is promoted to primary if the clustering solution detects that the primary node is down. Write operations occur through the primary node and are performed to the local storage and secondary storage simultaneously (see Figure 3). DRBD supports two modes for Write operations called fully synchronous and asynchronous.
In fully synchronous mode, Write operations must be safely on both nodes' storage before the Write transaction is acknowledged to the writer. In asynchronous mode, the Write transaction is acknowledged after the write data is stored on the local node's storage; the replication of the data to the peer node occurs in the background. Asynchronous mode is less safe, because a window exists for a failure to occur before data is replicated, but it is faster than fully synchronous mode, which is the safest mode for data protection. Although fully synchronous mode is recommended, asynchronous mode is useful in situations where replication occurs over longer distances (such as over the wide area network for geographic disaster recovery scenarios). Read operations are performed using local storage (unless the local disk has failed, at which point the secondary storage is accessed through the secondary node).
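In drbd.conf, these Write modes map to DRBD's replication protocols: protocol C is fully synchronous, and protocol A is asynchronous. A minimal sketch (the resource name r0 is assumed for illustration):

```
resource r0 {
  protocol C;    # fully synchronous: ack only after both disks have the data
  # protocol A;  # asynchronous: ack after the local disk has the data
  ...
}
```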
Figure 3. Read/Write operations with DRBD
DRBD can also support the active/active model, such that Read and Write operations can occur at both servers simultaneously in what's called the shared-disk mode. This mode relies on a shared-disk file system, such as the Global File System (GFS) or the Oracle Cluster File System version 2 (OCFS2), which includes distributed lock-management capabilities.
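In DRBD 8, dual-primary operation is enabled in the net section of drbd.conf. A sketch (resource name r0 assumed):

```
resource r0 {
  net {
    allow-two-primaries;  # permit both nodes to be primary simultaneously
  }
}
```

Dual-primary mode requires fully synchronous replication (protocol C) and a cluster-aware file system such as GFS or OCFS2 on top of the DRBD device.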
DRBD is split into two independent pieces: a kernel module that implements the DRBD behaviors and a set of user-space administration applications used to manage the DRBD disks (see Figure 4). The kernel module implements a driver for a virtual block device (which is replicated between a local disk and a remote disk across the network). As a virtual disk, DRBD provides a flexible model that a variety of applications can use (from file systems to applications that can rely on a raw disk, such as a database). The DRBD module implements an interface not only to the underlying block driver (as defined by the disk configuration item in drbd.conf) but also to the networking stack (whose endpoint is defined by an IP address and port number, also in drbd.conf).
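A minimal sketch of /etc/drbd.conf ties these pieces together; the resource name, host names, devices, and addresses below are assumptions for illustration:

```
resource r0 {
  protocol C;                      # fully synchronous replication
  on alpha {
    device    /dev/drbd0;          # virtual block device exported by DRBD
    disk      /dev/sdb1;           # underlying local block device
    address   192.168.1.10:7789;   # replication endpoint (IP address:port)
    meta-disk internal;            # metadata kept on the backing disk
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.1.11:7789;
    meta-disk internal;
  }
}
```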
Figure 4. DRBD in the Linux architecture
In user space, DRBD provides a set of utilities for managing replicated disks. You use the drbdsetup utility to configure the DRBD module in the Linux kernel and drbdmeta to manage DRBD's metadata structures. A wrapper utility that uses both of these tools is drbdadm: this high-level administration tool is the one most commonly used, grabbing its details from the DRBD configuration file in /etc/drbd.conf and acting as a front end to the lower-level utilities.
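With a configuration in place, a typical first-time bring-up with drbdadm looks something like the following sketch (resource name r0 assumed; the first two commands run on both nodes, the last on one node only):

```
drbdadm create-md r0   # write DRBD metadata to the backing disk (wraps drbdmeta)
drbdadm up r0          # attach the disk and connect to the peer (wraps drbdsetup)
drbdadm primary r0     # promote this node to primary
```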
Using the disk model, DRBD exports a special device (/dev/drbdX) that you can use just like a regular disk. Listing 1 illustrates building a file system and mounting the DRBD for use by the host (though it omits other necessary configuration steps, which are referenced in the Related topics section).
Listing 1. Building and mounting a file system on a primary DRBD disk
# mkfs.ext3 /dev/drbd0
# mkdir /mnt/drbd
# mount -t ext3 /dev/drbd0 /mnt/drbd
You can use the virtual disk that DRBD provides like any other disk, with the replication occurring transparently underneath. Now, take a look at some of the major features of DRBD, including its ability to self-heal.
DRBD major features
Although the idea of a replicated disk is conceptually simple (and its development relatively straightforward), there are inherent complexities in a robust implementation. For example, replicating blocks to a networked drive is fairly simple, but handling failures and transient outages (and the resulting synchronization of the drives) is where the real solution begins. This section describes the major features that DRBD provides, including the variety of failure models that DRBD supports.
Earlier, this article explored two methods for replicating data between nodes: fully synchronous and asynchronous. DRBD also supports a variation, called memory (or semi-) synchronous mode, that provides a bit more data protection than asynchronous mode at a slight cost in performance. In this mode, the Write operation is acknowledged after the data is stored on the local disk and mirrored to the peer node's memory. Because the data is mirrored to another node, just in volatile memory instead of on non-volatile disk, it's still possible to lose data (for example, if both nodes fail simultaneously), but failure of the primary node alone will not cause data loss, because the data has been replicated.
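This memory-synchronous behavior corresponds to DRBD's protocol B in drbd.conf. A sketch (resource name assumed):

```
resource r0 {
  protocol B;  # ack once data is on the local disk and in the peer's memory
  ...
}
```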
Online device verification
DRBD permits the verification of local and peer devices in an online fashion (while input/output continues). This verification confirms that the local and remote disks are replicas of one another, which could be a time-consuming operation. Rather than move the data itself between nodes to validate it, DRBD takes a much more efficient approach: to preserve bandwidth between the nodes (likely a constrained resource), it moves cryptographic digests (hashes) of the data instead. A node computes the hash of a block and transfers the much smaller signature to the peer node, which calculates its own hash and compares the two. If the hashes match, the blocks are properly replicated. If they differ, the out-of-date block is marked as out of sync, and subsequent synchronization brings it back into line.
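Online verification is enabled by naming a digest algorithm in the net section of drbd.conf; any algorithm the kernel crypto API offers can be used. A sketch (resource name r0 and algorithm choice assumed):

```
resource r0 {
  net {
    verify-alg sha1;  # digest used to compare local and peer blocks
  }
}
```

Running `drbdadm verify r0` on one node then starts the scan and marks any mismatched blocks out of sync.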
Communicating between nodes has the potential to introduce errors into the replicated data (either from a software or firmware bug or from any other error not detected by TCP/IP's checksum). To provide data integrity, DRBD calculates message integrity codes to accompany data moving between nodes. This allows the receiving node to validate its incoming data and request retransmission when an error is found. DRBD uses the Linux crypto application programming interface and is therefore flexible on the integrity algorithm used.
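The integrity code is selected the same way, again from the algorithms that the kernel crypto API provides. A sketch (resource name and algorithm choice assumed):

```
resource r0 {
  net {
    data-integrity-alg crc32c;  # checksum carried with each replication packet
  }
}
```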
DRBD can recover from a wide variety of errors, but one of the most insidious is the so-called "split brain" situation. In this error scenario, the communication link between the nodes fails, and both nodes believe that they are the primary node. While primary, each node permits Write operations without propagating them to the peer, which leads to inconsistent storage on the two nodes.
In most cases, split-brain recovery is performed manually, but DRBD provides several automatic methods for recovering from this situation. The recovery algorithm used depends on how the storage is actually used.
The simplest case for synchronizing storage after split brain is when one node saw no changes while the link was down: that node simply synchronizes from the peer that did change. Another simple approach is to discard the changes of the node that made the fewer modifications. This permits the node with the larger change set to continue but means that the other host's changes are lost.
The other two approaches discard changes based on the temporal states of the nodes. In one approach, changes are discarded from the node that switched to primary last; in the other, changes are discarded from the oldest primary (the node that switched to primary first). You can select each of these policies in the DRBD configuration file, but the right choice ultimately depends on the application using the storage and on whether data can be discarded or manual recovery is necessary.
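The automatic recovery policies are configured in the net section of drbd.conf, keyed by how many nodes were primary when the split occurred. A sketch (resource name r0 and policy choices assumed):

```
resource r0 {
  net {
    after-sb-0pri discard-zero-changes;  # no primaries: adopt the side that changed
    after-sb-1pri discard-secondary;     # one primary: drop the secondary's changes
    after-sb-2pri disconnect;            # two primaries: give up, recover manually
  }
}
```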
A key aspect of a replicated storage device is an efficient method for synchronizing data between nodes. Two of the schemes that DRBD uses are activity logs and the quick-sync bitmap. The activity log records blocks that were recently written and defines which blocks need to be synchronized after a failure is resolved. The quick-sync bitmap defines the blocks that are in sync (or out of sync) during a time of disconnection. When the nodes are reconnected, synchronization can use this bitmap to quickly make the nodes exact replicas of one another. Keeping this synchronization time short is important, because it represents the window during which the secondary disk is inconsistent.
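Both mechanisms are tunable. A sketch using DRBD 8.0-8.3 syntax, where these knobs live in the syncer section (the values shown are assumptions, not recommendations; a larger activity log means less metadata writing during normal operation but a longer resync after a primary crash):

```
resource r0 {
  syncer {
    al-extents 257;  # number of 4MB activity-log extents
    rate 40M;        # cap background resynchronization bandwidth
  }
}
```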
DRBD is a great asset if you're looking to increase the availability of your data, even on commodity hardware. It can be easily installed as a kernel module and configured using the available administration tools and wrappers. Even better, DRBD is open source, allowing you to tailor it to your needs (but check the DRBD road map first to see whether your need is in the works). DRBD supports a large number of useful options, so you can optimize it to uniquely fit your application.
- The DRBD website provides the latest information on DRBD, its current feature list, a road map, and a description of the technology. You can also find a list of DRBD papers and presentations. Although DRBD is part of the mainline kernel (since 2.6.33), you can grab the latest source tarball at LINBIT.
- High availability is a system property that ensures a degree of operation. This property typically involves redundancy as a way to avoid a single point of failure. Fault-tolerant system design is another important aspect for increasing availability.
- The concept of RAID was born at the University of California, Berkeley in 1987. RAID is defined by levels, which specify the storage architecture and characteristics of the protection. You can learn more about the original RAID concept in the seminal paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)."
- Ubuntu provides a useful page for configuring and using DRBD. This page illustrates configuration of DRBD on primary and secondary hosts as well as testing DRBD in a number of failure scenarios.
- DRBD is most useful in conjunction with clustering applications. Luckily, you can learn more about these applications and others (such as Pacemaker, Heartbeat, Logical Volume Manager, GFS, and OCFS2) and how they integrate with DRBD in the DRBD-enabled applications section of the DRBD manual.
- This article referenced two shared-disk file systems—namely, the GFS and the OCFS2. Both are cluster file systems that embody high performance and HA.