High availability and hardware availability

High availability is sometimes confused with simple hardware availability. Fault-tolerant, redundant systems (such as RAID) and dynamic switching technologies (such as DLPAR) provide recovery from certain hardware failures, but they do not provide the full scope of error detection and recovery required to keep a complex application highly available.

A modern, complex application requires access to all of these components:

  • Nodes (CPU, memory)
  • Network interfaces (including external devices in the network topology)
  • Disk or storage devices.

Recent surveys of the causes of downtime show that actual hardware failures account for only a small percentage of unplanned outages. Other contributing factors include:

  • Operator errors
  • Environmental problems
  • Application and operating system errors.

Reliable and recoverable hardware cannot protect against failures in all of these aspects of the configuration. Keeping these varied elements, and therefore the application, highly available requires:

  • Thorough planning of the physical and logical procedures for accessing and operating the resources on which the application depends. These procedures help to avoid failures in the first place.
  • A monitoring and recovery package that automates the detection of, and recovery from, errors.
  • A well-controlled process for maintaining the hardware and software aspects of the cluster configuration while keeping the application available.
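
To make the second requirement concrete, the following is a minimal sketch of a single detection-and-recovery pass over a set of resources. It is illustrative only and does not represent any real clustering product's interface: the names Resource, check, recover, and monitor_once are hypothetical, and the health probes are simulated with an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Resource:
    """A hypothetical monitored resource (node, network interface, disk, application)."""
    name: str
    check: Callable[[], bool]    # returns True when the resource is healthy
    recover: Callable[[], bool]  # attempts recovery; returns True on success


def monitor_once(resources: List[Resource]) -> List[Tuple[str, str]]:
    """Run one detection-and-recovery pass and report (name, status) pairs."""
    report = []
    for res in resources:
        if res.check():
            report.append((res.name, "ok"))
        elif res.recover() and res.check():
            # Recovery ran and a second health check confirms it worked.
            report.append((res.name, "recovered"))
        else:
            # Automated recovery failed; an operator must step in.
            report.append((res.name, "failed"))
    return report


if __name__ == "__main__":
    # Simulated state standing in for real health probes (an assumption).
    state = {"network": True, "app": False, "disk": False}

    def restart_app() -> bool:
        state["app"] = True  # pretend the restart succeeded
        return True

    resources = [
        Resource("network", lambda: state["network"], lambda: True),
        Resource("app", lambda: state["app"], restart_app),
        Resource("disk", lambda: state["disk"], lambda: False),
    ]
    print(monitor_once(resources))
```

In this sketch a resource counts as recovered only after a second health check succeeds, so a recovery action that reports success without fixing the problem is still flagged as failed. A real package would run such passes continuously and would cover operator, environmental, and software errors as well as hardware failures.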