How to choose between automated and manual transaction peer recovery

Your type of file system is the dominant factor in deciding which kind of transaction peer recovery to use. Different file systems have different behaviors, and the file locking behavior in particular is important when choosing between automated and manual peer recovery.

WebSphere® Application Server high availability (HA) support uses a heartbeat mechanism to determine whether servers are still running. Servers are considered failed if they stop responding to heartbeat requests. Some scenarios, such as system overloading and network partitioning (both explained later in this topic), can cause servers to stop responding to heartbeats even though the servers are still running. WebSphere Application Server uses file locking to prevent such events from causing concurrent access to transaction recovery logs, because access to a recovery log by more than one server can lead to loss of data integrity.

However, not all file systems provide the necessary file locking semantics, specifically that file locks are released when a server fails. For example, Network File System Version 4 (NFSv4) provides this release behavior, whereas Network File System Version 3 (NFSv3) does not.
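
To make the idea of an exclusive file lock concrete, the following is a minimal sketch that uses the standard Java NIO `FileChannel.tryLock` call. The class name `LockProbe` and the method `tryExclusiveLock` are hypothetical and exist only for illustration. This sketch is not the File System Locking Protocol Test and is not necessarily how WebSphere Application Server locks its recovery logs.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: reports whether an exclusive lock on a file can be obtained right now.
final class LockProbe {

    /** Returns true if an exclusive lock was granted (and then released), false if it is held elsewhere. */
    static boolean tryExclusiveLock(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();   // non-blocking; null if another process holds the lock
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } catch (OverlappingFileLockException e) {
            return false;                        // the lock is already held inside this JVM
        }
    }

    public static void main(String[] args) throws IOException {
        // With no argument, a temporary local file is used; pass a path to probe another file.
        Path file = args.length > 0 ? Paths.get(args[0]) : Files.createTempFile("tranlog", ".lck");
        System.out.println(file + " lockable: " + tryExclusiveLock(file));
    }
}
```

Running this sketch against a local file only demonstrates the locking call. It says nothing about whether a given shared file system releases locks when a host fails, which is the property that matters for peer recovery.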

You can test whether a shared file system can support the failover of transaction logs by running the File System Locking Protocol Test for WebSphere Application Server. To run the test see,

NFSv4 releases the locks that it holds on behalf of a host if that host fails, so peer recovery can occur automatically without restarting the failed hardware. Therefore, this version of NFS is better suited for use with automated peer recovery.

NFSv3 holds file locks on behalf of a failed host until that host restarts. In this context, the host is the physical machine running the application server that requested the lock. It is the restart of the host, not of the application server, that eventually causes the locks to be released.

To illustrate file locking on NFSv3, consider the behavior when a cluster member fails:
  1. Server H is running on host H and holds an exclusive file lock for its own recovery log files.
  2. Server P is running on host P and holds an exclusive file lock for its own recovery log files.
  3. Host H fails, taking server H with it. The NFS lock manager on the file server holds the locks that are granted to server H on its behalf.
  4. WebSphere Application Server triggers a peer recovery event in server P for server H.
  5. Server P attempts to gain an exclusive file lock for the recovery log of server H, but cannot do so because the lock is still held on behalf of server H. The peer recovery process is blocked.
  6. At an unspecified time, host H is restarted. The locks held on its behalf are released.
  7. The peer recovery process in server P is unblocked and granted the exclusive file locks that are needed to undertake peer recovery.
  8. Peer recovery takes place in server P for server H.
  9. Server H is restarted.
  10. If peer recovery is still in progress in server P, the recovery is halted.
  11. Server P releases the exclusive lock on the recovery logs and returns ownership of the recovery logs back to server H.
  12. Server H obtains the exclusive lock and can now undertake standard transaction logging.
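
The first seven steps above can be sketched as a toy model. The following Java example is an illustrative assumption only: the `LockManager` class, the host names, and the log names are hypothetical, and the model reduces the NFS lock manager to a single map. It contrasts a lock manager that keeps a failed host's locks until the host restarts (NFSv3-like) with one that releases them when the host fails (NFSv4-like).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the lock behavior described above; not a real NFS client or server.
final class NfsLockModel {

    /** Maps each recovery log name to the host that holds its exclusive lock. */
    static final class LockManager {
        private final Map<String, String> holders = new HashMap<>();
        private final boolean releaseOnHostFailure;   // true models NFSv4, false models NFSv3

        LockManager(boolean releaseOnHostFailure) {
            this.releaseOnHostFailure = releaseOnHostFailure;
        }

        /** Grants the lock if it is free or already held by the same host. */
        boolean tryLock(String log, String host) {
            String current = holders.putIfAbsent(log, host);
            return current == null || current.equals(host);
        }

        /** On host failure, locks are dropped only in the NFSv4-like mode. */
        void hostFailed(String host) {
            if (releaseOnHostFailure) {
                holders.values().removeIf(host::equals);
            }
        }

        /** A host restart always clears the locks held on its behalf. */
        void hostRestarted(String host) {
            holders.values().removeIf(host::equals);
        }
    }

    static String simulate(boolean releaseOnHostFailure) {
        LockManager nfs = new LockManager(releaseOnHostFailure);
        nfs.tryLock("logH", "hostH");                           // step 1
        nfs.tryLock("logP", "hostP");                           // step 2
        nfs.hostFailed("hostH");                                // step 3
        boolean beforeRestart = nfs.tryLock("logH", "hostP");   // steps 4 and 5
        nfs.hostRestarted("hostH");                             // step 6
        boolean afterRestart = nfs.tryLock("logH", "hostP");    // step 7
        return "beforeRestart=" + beforeRestart + ", afterRestart=" + afterRestart;
    }

    public static void main(String[] args) {
        System.out.println("NFSv3-like: " + simulate(false));  // beforeRestart=false, afterRestart=true
        System.out.println("NFSv4-like: " + simulate(true));   // beforeRestart=true, afterRestart=true
    }
}
```

In the NFSv3-like run, server P cannot obtain the lock until host H restarts, which is why automated peer recovery stalls in that configuration.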

Because of this behavior, you must disable file locking to use automated peer recovery on NFSv3. Disabling file locking can lead to concurrent access to recovery logs, so it is vital that you first protect your system from system overloading and network partitioning. Alternatively, you can configure manual peer recovery, where you prevent concurrent access by manually triggering peer recovery processing only for servers that have actually failed.

System overloading
System overloading occurs when a machine becomes so heavily loaded that response times are extremely poor and requests begin to time out. Several potential causes exist for such overloading, including:
  • The server is underpowered and cannot handle the workload.
  • The server received a temporary surge of requests.
  • Insufficient physical memory is available. As a result, the operating system is too busy paging to give the application server the required CPU time.
Network partitioning
Network partitioning occurs when a communications failure in a network results in two smaller networks that are independent and cannot contact each other.
Figure 1. Heartbeats in a system running normally, compared to heartbeats after the apparent server failures of system overloading and network partitioning

During normal running, two servers on the network exchange heartbeats. During system overloading, heartbeat operations time out, giving the appearance of a server failure. After network partitioning, each server is in a separate network and heartbeats cannot pass between them, also giving the appearance of a server failure.
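
The reason these cases look alike can be shown with a minimal sketch of timeout-based failure detection. This Java example is an assumption for illustration: the `HeartbeatCheck` class, the `judge` method, and the 10-second timeout are hypothetical and do not represent the actual heartbeat implementation or settings in WebSphere Application Server.

```java
import java.time.Duration;

// Hypothetical timeout-based detector: it sees only how long a peer has been silent.
final class HeartbeatCheck {

    enum Verdict { ALIVE, SUSPECTED_FAILED }

    /** A peer is suspected to have failed when no heartbeat has arrived within the timeout. */
    static Verdict judge(long lastHeartbeatMillis, long nowMillis, Duration timeout) {
        long silence = nowMillis - lastHeartbeatMillis;
        return silence > timeout.toMillis() ? Verdict.SUSPECTED_FAILED : Verdict.ALIVE;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(10);   // hypothetical value

        // Normal running: the last heartbeat arrived 2 seconds ago.
        System.out.println("normal:      " + judge(0, 2_000, timeout));
        // Overloaded but still running: the heartbeat reply is 30 seconds late.
        System.out.println("overloaded:  " + judge(0, 30_000, timeout));
        // Partitioned but still running: heartbeats cannot cross the network split.
        System.out.println("partitioned: " + judge(0, 60_000, timeout));
    }
}
```

The detector returns the same verdict for an overloaded server, a partitioned server, and a server that has really stopped, because silence is its only input. This is why a second safeguard, such as file locking or manual peer recovery, is needed to prevent concurrent access to the recovery logs.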