Maintaining maximum system uptime is increasingly critical to the success of on demand computing. Unfortunately, many off-the-shelf solutions for high availability (HA) are expensive and require expertise. This series of five articles offers a lower-cost alternative to achieving HA services using publicly available software.
The step-by-step procedures in this series show how to build a highly available Apache Web server, WebSphere® MQ queue manager, LoadLeveler cluster, WebSphere Application Server cluster, and DB2® Universal Database on Linux™. A systems administrator can learn to use and maintain this system with minimal time investment. The techniques described in this series also apply to any number of services on Linux.
To get the most out of this series, you should have a basic understanding of WebSphere MQ, WebSphere Application Server, IBM LoadLeveler, DB2 Universal Database, and high-availability clusters.
Using any software product in a business-critical or mission-critical environment requires that you consider availability, a measure of the ability of a system to do what it is supposed to do, even in the presence of crashes, equipment failures, and environmental mishaps. As more and more critical commercial applications move onto the Internet, providing highly available services becomes increasingly important.
This article highlights issues that you may encounter when implementing HA solutions. We'll review HA concepts, available HA software, hardware to use, and installation and configuration details about heartbeat (open source HA software for Linux) -- and we'll see how a Web server can be made highly available using heartbeat.
The test scenarios described in this series require the following hardware:
- Four systems that support Linux, with Ethernet network adapters
- One shared external SCSI hard drive (twin tail disk)
- One IBM serial null modem cable
In my setup, I used IBM eServer™ xSeries® 335 machines with 1 GB of RAM. For the shared disk, I used one of these machines as an NFS server. The software requirements for the complete setup are as follows, although for this article you need only Red Hat Enterprise Linux and heartbeat:
- Red Hat Enterprise Linux 3.0 (2.4.21-15.EL)
- heartbeat 1.2.2
- IBM Java 2 SDK 1.4.2
- WebSphere MQ for Linux 126.96.36.199 with Fix Pack 7
- LoadLeveler for Linux 3.2
- WebSphere Base Edition 5.1.1 for Linux with Cumulative Fix 1
- WebSphere ND 5.1 for Linux with Fixpack 1
- DB2 Universal Database Enterprise Server Edition 8.1 Linux
You can get the test scenarios by downloading the code package listed in the Download section below. Table 1 describes the directories in hahbcode.tar.gz.
Table 1. What's in the sample code package
| Directory | Contents |
| heartbeat | Sample configuration files for heartbeat |
| www | HTML files for testing HA for Apache Web Server |
| mq | Scripts and code for WebSphere MQ HA |
| loadl | The loadl file to start and stop LoadLeveler as a Linux service |
| was | Scripts and code for WebSphere Application Server HA |
| db2 | Scripts to check database availability, create a table, insert rows in table, and select rows from a table |
High availability concepts
High availability is the system management strategy of quickly restoring essential services in the event of system, component, or application failure. The goal is minimal service interruption rather than fault tolerance. The most common solution for a failure of a system performing critical business operations is to have another system waiting to assume the failed system's workload and continue business operations.
The term "cluster" has different meanings within the computing industry. Throughout this article, unless noted otherwise, cluster describes a heartbeat cluster, which is a collection of nodes and resources (such as disks and networks) that cooperate to provide high availability of services running within the cluster. If one of those machines should fail, the resources required to maintain business operations are transferred to another available machine in the cluster.
The two main cluster configurations are:
- Standby configuration: The most basic cluster configuration, in which one node performs work while the other node acts only as standby. The standby node does not perform work and is referred to as idle; this configuration is sometimes called cold standby. Such a configuration requires a high degree of hardware redundancy. This series of articles focuses on cold standby configuration.
- Takeover configuration: A more advanced configuration in which all nodes perform some kind of work, and critical work can be taken over in the event of a node failure. In a one-sided takeover configuration, a standby node performs some additional, non-critical, non-movable work. In a mutual takeover configuration, all nodes are performing highly available (movable) work. This series of articles does not address takeover configuration.
You must plan for several key items when setting up an HA cluster:
- The disks used to store the data must be connected by a private interconnect (serial cable) or LAN to the servers that make up the cluster.
- There must be a method for automatic detection of a failed resource. This is done by a software component referred to as a heartbeat monitor.
- There must be automatic transfer of resource ownership to one or more surviving cluster members upon failure.
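The detection rule behind a heartbeat monitor can be sketched in a few lines of shell. This is only an illustration of the idea, not how heartbeat itself is implemented: the function name peer_state and its arguments are hypothetical, and the two thresholds stand in for the kind of "late" and "dead" timeouts that an HA package lets you configure.

```shell
# Classify a peer node from the age of its last heartbeat.
# Illustrative only; the name and arguments are made up for this sketch.
#   $1 = time the last heartbeat arrived (seconds since the epoch)
#   $2 = current time (seconds since the epoch)
#   $3 = warning threshold in seconds
#   $4 = dead threshold in seconds
peer_state() {
    last_seen=$1
    now=$2
    warn=$3
    dead=$4
    age=$((now - last_seen))
    if [ "$age" -ge "$dead" ]; then
        echo dead     # a surviving node should take over the resources
    elif [ "$age" -ge "$warn" ]; then
        echo late     # worth a log warning, but no takeover yet
    else
        echo alive
    fi
}
```

For example, peer_state 100 115 10 60 reports a late peer, because 15 seconds have passed since the last heartbeat: more than the warning threshold, less than the dead threshold.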
Available HA software
Many currently available software packages perform heartbeat monitoring and resource takeover. Here is a list of available software for building high-availability clusters on various operating systems (see Resources for links):
- heartbeat (Linux)
- High Availability Cluster Multiprocessing - HACMP (AIX)
- IBM Tivoli System Automation for Multiplatforms (AIX, Linux)
- Legato AAM 5.1 (AIX, HP-UX, Solaris, Linux, Windows)
- SteelEye LifeKeeper (Linux, Windows)
- Veritas Cluster Server (AIX, HP-UX, Solaris, Linux, Windows)
This series describes the open source HA software heartbeat. However, you can apply the concepts you learn here to any of the above software systems.
High-Availability Linux project and heartbeat
The goal of the open source project called High-Availability Linux is to provide a clustering solution for Linux that promotes reliability, availability, and serviceability (RAS) through a community development effort. The Linux-HA project is widely used and is an important component in many interesting high-availability solutions.
Heartbeat is one of the publicly available packages at the Linux-HA project Web site. It provides the basic functions required by any HA system such as starting and stopping resources, monitoring the availability of the systems in the cluster, and transferring ownership of a shared IP address between nodes in the cluster. It monitors the health of a particular service (or services) through either a serial line or Ethernet interface or both. The current version supports a two-node configuration where special heartbeat "pings" are used to check the status and availability of a service. Heartbeat provides the foundations for far more complex scenarios than the ones described in this series of articles, such as active/active configurations, where both nodes work in parallel and perform load balancing.
For more information on heartbeat and projects where it is being used, visit the Linux-HA project Web site (see Resources for a link).
The test cluster configuration for these articles is shown in Figure 1. The setup consists of a pair of clustered servers (ha1 and ha2), both of which have access to a shared disk enclosure containing multiple physical disks; the servers are in cold standby mode. The application data needs to be on a shared device that both nodes can access. It can be a shared disk or a network file system. The device itself should be mirrored or have data protection to avoid data corruption. Such a configuration is frequently referred to as a shared disk cluster, but it is actually a shared-nothing architecture, as no disk is accessed by more than one node at a time.
Figure 1. Heartbeat cluster configuration in a production environment
For the test setup, I use NFS as the shared disk mechanism as shown in Figure 2, although I recommend using the option shown in Figure 1, especially in a production environment. A null modem cable connected between the serial ports of the two systems is used to transmit heartbeats between the two nodes.
Figure 2. Heartbeat cluster configuration using NFS for shared file system
Table 2 shows the configuration I used for both nodes. In your case, the host names and IP addresses should be known to either the DNS or the /etc/hosts files on both nodes.
Table 2. Test cluster configuration
| Role | Host name | IP address |
| Shared (cluster) IP | ha.haw2.ibm.com | 188.8.131.52 |
| Node 3 (not shown) | ha3.haw2.ibm.com | 184.108.40.206 |
Set up the serial connection
Use a null modem cable to connect the two nodes through their serial ports. Now test the serial connection, as follows:
On ha1 (receiver), type:
cat < /dev/ttyS0
On ha2 (sender) type:
echo "Serial Connection test" > /dev/ttyS0
You should see the text on the receiver node (ha1). If it works, change their roles and try again.
Set up NFS for a shared file system
As mentioned, I used NFS for shared data between nodes for the test setup.
- The node nfsha.haw2.ibm.com is used as an NFS server.
- The file system /ha is shared.
To get NFS up and running:
- Create a directory /ha on nfsha node.
- Edit the /etc/exports file. This file contains a list of entries; each entry indicates a volume that is shared and how it is shared. Listing 1 shows the relevant portion of the exports file for my setup.
Listing 1. exports file
...
/ha 220.127.116.11(rw,no_root_squash)
/ha 18.104.22.168(rw,no_root_squash)
/ha 22.214.171.124(rw,no_root_squash)
/ha 126.96.36.199(rw,no_root_squash)
/ha 188.8.131.52(rw,no_root_squash)
...
- Start the NFS services. If NFS is already running, run the command /usr/sbin/exportfs -ra to force nfsd to re-read the /etc/exports file.
- Add the file system /ha to your /etc/fstab file on both the HA nodes -- ha1 and ha2 -- the same way as local file systems. Listing 2 shows the relevant portion of the fstab file for my setup.
Listing 2. fstab file
...
nfsha.haw2.ibm.com:/ha /ha nfs noauto,rw,hard 0 0
...
Later on, we will configure heartbeat to mount this file system.
- Extract the code sample, hahbcode.tar.gz, on this file system using the commands shown in Listing 3. (First download the code sample from the Download section.)
Listing 3. Extract sample code
cd /ha
tar xvfz hahbcode.tar.gz
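Because heartbeat, not the boot scripts, is meant to mount /ha, the fstab entry on each HA node should carry the noauto option. A small check along the following lines can catch a node where that option was forgotten. This is a sketch I wrote for illustration; the function name fstab_is_noauto is hypothetical, and it only looks at nfs entries.

```shell
# Succeed if the fstab-format file $1 has an nfs entry for mount point $2
# whose option list includes "noauto" (so the boot scripts skip it and
# the cluster software can mount it instead).
fstab_is_noauto() {
    awk -v mp="$2" '
        $1 !~ /^#/ && $2 == mp && $3 == "nfs" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "noauto") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

On ha1 and ha2 you might run: fstab_is_noauto /etc/fstab /ha || echo "/ha is missing the noauto option"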
Download and install heartbeat
Download heartbeat using the link in Resources, then install it on both ha1 and ha2 machines by entering the commands in Listing 4 (in the order given).
Listing 4. Commands for installing heartbeat
rpm -ivh heartbeat-pils-1.2.2-8.rh.el.3.0.i386.rpm
rpm -ivh heartbeat-stonith-1.2.2-8.rh.el.3.0.i386.rpm
rpm -ivh heartbeat-1.2.2-8.rh.el.3.0.i386.rpm
You must configure three files to get heartbeat to work: authkeys, ha.cf, and haresources. I'll show you the specific configuration I used for this implementation; if you need more information, please refer to the heartbeat Web site and read their documentation (see Resources).
1. Configure /etc/ha.d/authkeys
This file determines your authentication keys for the cluster; the keys must be the same on both nodes. You can choose from three authentication schemes: crc, md5, or sha1. If your heartbeat runs over a secure network, such as the crossover cable in the example, you'll want to use crc. This is the cheapest method from a resources perspective. If the network is insecure, but you're either not very paranoid or concerned about minimizing CPU resources, use md5. Finally, if you want the best authentication without regard for CPU resources, use sha1, as it's the hardest to crack.
The format of the file is as follows:
<number> <authmethod> [<authkey>]
For the test setup I chose the crc scheme. Listing 5 shows the /etc/ha.d/authkeys file. Make sure its permissions are safe, such as 600.
Listing 5. authkeys file
auth 2
2 crc
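If you choose md5 or sha1 instead of crc, the file also needs a shared secret. The following is a minimal sketch of a helper that writes the file and locks down its permissions in one step. The function name write_authkeys is my own, the key index 1 is arbitrary (it only has to match between the auth line and the key line), and the secret is whatever string you decide to share between the nodes.

```shell
# Write a heartbeat authkeys file and restrict it to its owner.
#   $1 = output path (/etc/ha.d/authkeys on a real node)
#   $2 = method: crc, md5, or sha1
#   $3 = shared secret (needed for md5 and sha1, unused for crc)
write_authkeys() {
    file=$1
    method=$2
    key=${3:-}
    case $method in
        crc)
            line="1 crc" ;;
        md5|sha1)
            if [ -z "$key" ]; then
                echo "a key is required for $method" >&2
                return 1
            fi
            line="1 $method $key" ;;
        *)
            echo "unknown method: $method" >&2
            return 1 ;;
    esac
    # Create the file with a tight umask, then set mode 600 explicitly
    # in case the file already existed with looser permissions.
    ( umask 077; printf 'auth 1\n%s\n' "$line" > "$file" ) && chmod 600 "$file"
}
```

A call such as write_authkeys /etc/ha.d/authkeys sha1 "my shared secret" would be run once, and the resulting file copied unchanged to the other node.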
2. Configure /etc/ha.d/ha.cf
This file will be placed in the /etc/ha.d directory that is created after installation. It tells heartbeat what types of media paths to use and how to configure them. This file also defines the nodes in the cluster and the interfaces that heartbeat uses to verify whether or not a system is up. Listing 6 shows the relevant portion of the /etc/ha.d/ha.cf file for my setup.
Listing 6. ha.cf file
...
# File to write debug messages to
debugfile /var/log/ha-debug
#
#
# File to write other messages to
#
logfile /var/log/ha-log
#
#
# Facility to use for syslog()/logger
#
logfacility local0
#
#
# keepalive: how long between heartbeats?
#
keepalive 2
#
# deadtime: how long-to-declare-host-dead?
#
deadtime 60
#
# warntime: how long before issuing "late heartbeat" warning?
#
warntime 10
#
#
# Very first dead time (initdead)
#
initdead 120
#
...
# Baud rate for serial ports...
#
baud 19200
#
# serial serialportname ...
serial /dev/ttyS0
# auto_failback: determines whether a resource will
# automatically fail back to its "primary" node, or remain
# on whatever node is serving it until that node fails, or
# an administrator intervenes.
#
auto_failback on
#
...
#
# Tell what machines are in the cluster
# node nodename ... -- must match uname -n
node ha1.haw2.ibm.com
node ha2.haw2.ibm.com
#
# Less common options...
#
# Treats 10.10.10.254 as a pseudo-cluster-member
# Used together with ipfail below...
#
ping 184.108.40.206
# Processes started and stopped with heartbeat. Restarted unless
# they exit with rc=100
#
respawn hacluster /usr/lib/heartbeat/ipfail
...
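The timing values in ha.cf depend on one another, so a quick sanity check before starting heartbeat can save some debugging. The sketch below is my own illustration, not a tool that ships with heartbeat. It assumes the values are plain numbers of seconds, as in Listing 6, and it applies three rules of thumb: keepalive below warntime, warntime below deadtime, and initdead at least twice deadtime (the last one is, as I understand it, the guidance given in the comments of heartbeat's sample ha.cf). The function name check_hacf_timing is hypothetical.

```shell
# Check the timing directives in an ha.cf-format file.
#   $1 = path to the file (/etc/ha.d/ha.cf on a real node)
# Prints one line per problem and returns non-zero if any were found.
# Assumes values are whole seconds; a missing directive counts as 0.
check_hacf_timing() {
    awk '
        /^[ \t]*#/ { next }    # ignore comment lines
        $1 ~ /^(keepalive|warntime|deadtime|initdead)$/ { t[$1] = $2 + 0 }
        END {
            bad = 0
            if (t["keepalive"] >= t["warntime"]) {
                print "keepalive should be smaller than warntime"; bad = 1
            }
            if (t["warntime"] >= t["deadtime"]) {
                print "warntime should be smaller than deadtime"; bad = 1
            }
            if (t["initdead"] < 2 * t["deadtime"]) {
                print "initdead should be at least twice deadtime"; bad = 1
            }
            exit bad
        }
    ' "$1"
}
```

With the values from my setup (keepalive 2, warntime 10, deadtime 60, initdead 120), check_hacf_timing /etc/ha.d/ha.cf prints nothing and succeeds.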
3. Configure /etc/ha.d/haresources
This file describes the resources that are managed by heartbeat. The resources are basically just start/stop scripts much like the ones used for starting and stopping resources in /etc/rc.d/init.d. Note that heartbeat will look in /etc/rc.d/init.d and /etc/ha.d/resource.d for scripts. The script file httpd comes with heartbeat. Listing 7 shows my /etc/ha.d/haresources file:
Listing 7. haresources file
ha1.haw2.ibm.com 220.127.116.11 Filesystem::nfsha.haw2.ibm.com:/ha::/ha::nfs::rw,hard httpd
This file must be the same on both the nodes.
This line dictates that on startup:
- Have ha1 serve the IP 18.104.22.168
- Mount the NFS shared file system /ha
- Start Apache Web server
I will be adding more resources to this file in later articles. On shutdown, heartbeat will:
- Stop the Apache server
- Unmount the shared file system
- Give up the IP
This assumes that the command uname -n displays ha1.haw2.ibm.com; yours may well produce ha1, and if it does, use that instead.
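The ordering rule described above (resources start left to right and stop right to left) can be made concrete with a short helper that prints the resources from one haresources line in the order they would be handled. This is an illustrative sketch only; the function name resource_order is hypothetical, and it simply splits the line on whitespace without interpreting the resource arguments.

```shell
# Print the resources from one haresources line, one per line,
# in start order (left to right) or stop order (right to left).
#   $1 = "start" or "stop"
#   $2 = the haresources line; its first field is the preferred node
resource_order() {
    printf '%s\n' "$2" | awk -v mode="$1" '{
        if (mode == "start")
            for (i = 2; i <= NF; i++) print $i
        else
            for (i = NF; i >= 2; i--) print $i
    }'
}
```

Running resource_order stop "$(cat /etc/ha.d/haresources)" on a single-line file like mine lists httpd first, then the Filesystem resource, then the cluster IP address.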
Configure the Apache HTTP server for HA
In this step I will make a few changes to the Apache Web server setup so that it will serve files from the shared system and from filesystems local to the two machines ha1 and ha2. The index.html file (included with the code samples) will be served from the shared disk, and the hostname.html file will be served from a local file system on each of the machines ha1 and ha2. To implement HA for the Apache Web server:
- Log in as root.
- Create the directories /ha/www and /ha/www/html on the shared disk (/ha).
- Set appropriate permissions on the shared directories using commands shown below on the node
chmod 775 /ha/www
chmod 775 /ha/www/html
- On both the primary and backup machines, rename the html directory of the Apache Web server:
mv /var/www/html /var/www/htmllocal
- Create symbolic links to the shared directories using the following commands on both the machines:
ln -s /ha/www/html /var/www/html
- Copy the index.html file to the /ha/www/html directory on the node ha1:
cp /ha/hahbcode/www/index.html /var/www/html
You will have to change the cluster name in this file.
- Copy the hostname.html file to the local /var/www/htmllocal directory on both the machines:
cp /ha/hahbcode/www/hostname.html /var/www/htmllocal
Change the cluster name and the node name in this file.
- Create symbolic links to the hostname.html file on both the machines:
ln -s /var/www/htmllocal/hostname.html /ha/www/html/hostname.html
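Before testing, it can help to confirm that the symbolic links point where you expect. The following sketch defines a small checker and applies it to the two links created in the steps above. Both function names are my own, and the second check only makes sense on a node that currently has /ha mounted, since that link lives on the shared file system.

```shell
# Succeed if $1 is a symbolic link whose stored target is exactly $2.
check_link() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Report on the two links this article sets up; returns non-zero
# if either one is missing or points somewhere else.
verify_web_layout() {
    rc=0
    check_link /var/www/html /ha/www/html ||
        { echo "/var/www/html does not link to /ha/www/html"; rc=1; }
    check_link /ha/www/html/hostname.html /var/www/htmllocal/hostname.html ||
        { echo "hostname.html link is missing or wrong"; rc=1; }
    return $rc
}
```

Run verify_web_layout as root on each node; no output means both links are in place.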
Now you are ready to test the HA implementation.
Test HA for the Apache HTTP server
To test the high availability of the Web server:
- Start the heartbeat service on the primary and then on the backup node using this command:
If it fails, look in /var/log/messages to determine the reason and then correct it. After heartbeat starts successfully, you should see a new network interface with the cluster IP address that you configured in the haresources file. Once you've started heartbeat, take a peek at your log file (default is /var/log/ha-log) on the primary and make sure that it is doing the IP takeover and then starting the Apache Web server. Use the ps command to make sure the Web server daemons are running on the primary node. Heartbeat will not start any Web server processes on the backup; that happens only after the primary fails.
- Verify that the two Web pages are being served correctly on the ha1 node by pointing the
browser at the following URLs (yours will differ if you use a different host name):
Note that I am using the cluster address in the above URLs and not the address of the primary node.
The browser should display the following text for the first URL:
Hello!!! I am being served from a High Availability Cluster ha.haw2.ibm.com
The browser should display the following text for the second URL:
Hello!!! I am being served from a node ha1.haw2.ibm.com in a High Availability Cluster ha.haw2.ibm.com
- Simulate failover by simply stopping heartbeat on the primary system using the command shown
You should see all the Web server processes come up on the second machine in under a minute. If you do not, look in /var/log/messages to determine the problem and correct it.
- Verify that the two Web pages are being served correctly on the ha2 node by pointing the
browser at the following URLs:
The browser should display the following text for the first URL:
Hello!!! I am being served from a High Availability Cluster ha.haw2.ibm.com
The browser should display the following text for the second URL:
Hello!!! I am being served from a node ha2.haw2.ibm.com in a High Availability Cluster ha.haw2.ibm.com
Note that the node serving this page now is ha2.
- Restart the heartbeat service back on the primary. This should stop the Apache server processes on the secondary and start them on the primary. The primary should also take over the cluster IP.
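Instead of reloading the browser by hand during these steps, you can poll the hostname page from a third machine and watch the serving node change. The sketch below is an assumption-laden illustration: it relies on curl being installed, on the page containing the exact sentence shown earlier ("served from a node ... in a High Availability Cluster"), and on you supplying the URL of hostname.html at your own cluster address. The function names are hypothetical.

```shell
# Read page text on standard input and print the serving node name,
# or nothing if the expected sentence is not present.
node_from_page() {
    sed -n 's/.*served from a node \([^ ]*\) in a High Availability Cluster.*/\1/p'
}

# Poll URL $1 once per second and print a timestamped line each time
# the serving node changes ("none" means the page could not be read).
# Runs until interrupted with Ctrl-C.
watch_failover() {
    url=$1
    prev=
    while :; do
        node=$(curl -s --max-time 2 "$url" | node_from_page)
        node=${node:-none}
        [ "$node" = "$prev" ] || echo "$(date '+%H:%M:%S') now served by: $node"
        prev=$node
        sleep 1
    done
}
```

Start watch_failover with the URL of hostname.html on the cluster address, then stop and restart heartbeat on the primary. The gap between the "none" line and the next node name gives a rough measure of how long the failover took.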
Thus, by putting the Web pages on the shared disk, a secondary machine can serve them to a client in the event of failure of the primary machine. The failover is transparent to the client accessing the Web pages. This technique can be applied to serving CGI scripts as well.
I hope you will try this technique for setting up a very highly available Web server using inexpensive hardware and readily available software. In the next article in this series, you'll see how to build a highly available messaging queue manager using WebSphere MQ.
|Sample code package for this article||hahbcode.tar.gz||25 KB|
- Read the other articles in this series:
- Check out the High-Availability Linux project Web site for more information on heartbeat, including heartbeat success stories.
- You can download most of the software needed for this series of articles at these locations (note that not all of the downloads are free):
- Get more information on the IBM eServer 335 series.
- André Bonhôte shows how to build an HA NFS server in his article "Inner Pulse" (in PDF format) in the August 2003 issue of the European publication Linux Magazine.
- Find more information on the other high-availability solutions mentioned in this article:
- Learn about the features in DB2 Universal Database that provide high-availability capabilities in "An Overview of High Availability and Disaster Recovery for DB2 UDB" (developerWorks, April 2003).
- For a detailed discussion of availability and how to plan for and maintain it in an enterprise middleware environment, read "Planning for Availability in the Enterprise" (developerWorks, December 2003).
- Get more information on load balancing and failover support for Linux on POWER in the article "Creating a WebSphere Application Server V5 cluster" (developerWorks, January 2004).
- Find more resources for Linux developers in the developerWorks Linux zone.
- Get involved in the developerWorks community by participating in developerWorks blogs.
- Browse for books on these and other technical topics.
- Innovate your next Linux development project with IBM trial software, available for download directly from developerWorks.