DB2® supports two cluster managers available for the Solaris operating system: Sun Cluster and Veritas Cluster Server (VCS).
When a cluster manager controls a DB2 instance, disable automatic startup of the instance so that the cluster manager decides when to start it:

db2iauto -off InstName

where InstName is the login name of the instance.

The computer systems that host data services contain many distinct components, and each component has a "mean time between failures" (MTBF) associated with it. The MTBF is the average time that a component remains usable. The MTBF for a quality hard drive is on the order of one million hours (approximately 114 years). Although this seems like a long time, one out of 200 disks is likely to fail within a six-month period.
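Those figures can be sanity-checked with a short calculation. Assuming a constant failure rate (an exponential failure model, which is a simplifying assumption not stated above), the chance that a drive with a one-million-hour MTBF fails within six months works out to the same order of magnitude as the "1 out of 200" quoted:

```python
import math

MTBF_HOURS = 1_000_000        # quality hard drive, per the text
HOURS_PER_YEAR = 24 * 365     # ignoring leap years

# MTBF expressed in years: roughly 114.
mtbf_years = MTBF_HOURS / HOURS_PER_YEAR

# Probability of failure within six months under an exponential model:
# P(fail by t) = 1 - exp(-t / MTBF)
six_months = HOURS_PER_YEAR / 2
p_fail = 1 - math.exp(-six_months / MTBF_HOURS)

print(f"MTBF ~ {mtbf_years:.0f} years")
print(f"P(failure in 6 months) ~ {p_fail:.4%} (about 1 in {1 / p_fail:.0f})")
```

Under this model the result is roughly 1 in 230, consistent with the "1 out of 200" estimate in the text.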
Although there are a number of methods to increase the availability of a data service, the most common is an HA cluster. A cluster, when used for high availability, consists of two or more machines, a set of private network interfaces, one or more public network interfaces, and some shared disks. This special configuration allows a data service to be moved from one machine to another, so that the data service can continue providing access to its data. Moving a data service from one machine to another is called a failover, as illustrated in Figure 1.
The private network interfaces are used to send heartbeat messages, as well as control messages, among the machines in the cluster. The public network interfaces are used to communicate directly with clients of the HA cluster. The disks in an HA cluster are connected to two or more machines in the cluster, so that if one machine fails, another machine has access to them.
A data service running on an HA cluster has one or more logical public network interfaces and a set of disks associated with it. The clients of an HA data service connect via TCP/IP only to the logical network interfaces of the data service. If a failover occurs, the data service, along with its logical network interfaces and set of disks, is moved to another machine.
One of the benefits of an HA cluster is that a data service can recover without the aid of support staff, and it can do so at any time. Another benefit is redundancy. All of the parts in the cluster should be redundant, including the machines themselves. The cluster should be able to survive any single point of failure.
Even though highly available data services can be very different in nature, they have some common requirements. Clients of a highly available data service expect the network address and host name of the data service to remain the same, and expect to be able to make requests in the same way, regardless of which machine the data service is on.
Consider a web browser that is accessing a highly available web server. The request is issued with a URL (Uniform Resource Locator), which contains both a host name, and the path to a file on the web server. The browser expects both the host name and the path to remain the same after failover of the web server. If the browser is downloading a file from the web server, and the server is failed over, the browser will need to reissue the request.
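Reissuing the request amounts to client-side retry logic against the same host name. A sketch of that pattern follows; the `fetch` callable is a stand-in for whatever transport the client uses, not a real HTTP library call:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Reissue a request until it succeeds or retries are exhausted.

    `fetch` is any callable that returns a response or raises
    ConnectionError -- e.g. while the data service is failing over.
    """
    last_error = None
    for _ in range(retries):
        try:
            return fetch(url)
        except ConnectionError as err:
            last_error = err
            time.sleep(delay)   # give the failover time to complete
    raise last_error


# Simulate a server that is mid-failover for the first request.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("connection reset during failover")
    return f"contents of {url}"

print(fetch_with_retry(flaky_fetch, "http://ha-web/index.html", delay=0.01))
```

Because the host name and path are unchanged after failover, the retried request is byte-for-byte identical to the original one; the client needs no knowledge of which machine is now serving it.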
99.99% => service is down for (at most) 52.6 minutes / yr
99.999% => service is down for (at most) 5.26 minutes / yr
99.9999% => service is down for (at most) 31.5 seconds / yr
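These downtime figures follow directly from applying the unavailability fraction to a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(pct)
    if minutes >= 1:
        print(f"{pct}% -> {minutes:.2f} minutes / yr")
    else:
        print(f"{pct}% -> {minutes * 60:.1f} seconds / yr")
```

Each additional "nine" of availability cuts the allowable downtime by a factor of ten.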
When designing and testing an HA cluster:
Another way to increase the availability of a data service is fault tolerance. A fault-tolerant machine has all of its redundancy built in, and should be able to withstand a single failure of any part, including CPU and memory. Fault-tolerant machines are most often used in niche markets, and are usually expensive to implement. By contrast, an HA cluster with machines in different geographical locations has the added advantage of being able to recover from a disaster that affects only a subset of those locations.
An HA cluster is the most common solution to increase availability because it is scalable, easy to use, and relatively inexpensive to implement.