Recovery domain

Within IBM® i clusters technology, a recovery domain is a subset of cluster nodes that are grouped together in a cluster resource group (CRG) for a common purpose such as performing a recovery action or synchronizing events.

There are two basic recovery domain models that can be used in high availability environments. These models are based on the type of cluster resource group that is created and the roles that are defined in the recovery domain. The first is the primary-backup model, in which each node must be assigned a primary, backup, or replicate role. Device, application, and data CRGs support these role definitions. The roles are defined and managed within the recovery domain.

If a node has been defined as the primary access point for the resource, then other nodes provide backup if the primary node fails. Backup nodes are capable of being the access point for the resource. The backup nodes have a specified order, which determines which backup is first in line to become the primary should the existing primary fail. For primary-backup models, IBM i clusters automatically respond when a node fails or switches over, based on these role definitions. For example, if Node A, which is designated as the primary, fails, then Node B, which is defined as the first backup, becomes the new primary. Other nodes defined as backups are reordered accordingly.
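The failover behavior described above can be illustrated with a small model. This is only a sketch and does not use any IBM i API: the class name `RecoveryDomain`, its fields, and the node names are hypothetical. It also assumes that the failed primary is placed at the end of the backup order, which the text above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryDomain:
    """Hypothetical in-memory model of a primary-backup recovery domain."""
    primary: str
    backups: list[str] = field(default_factory=list)     # ordered, first backup first
    replicates: list[str] = field(default_factory=list)  # never promoted

    def fail_primary(self) -> str:
        """Promote the first backup to primary and return the failed node."""
        if not self.backups:
            raise RuntimeError("no backup node is available to take over")
        failed = self.primary
        # The first backup in the ordered list becomes the new primary.
        self.primary = self.backups.pop(0)
        # Assumption: the failed node rejoins the order as the last backup.
        self.backups.append(failed)
        return failed


domain = RecoveryDomain(primary="NODE_A",
                        backups=["NODE_B", "NODE_C"],
                        replicates=["NODE_R"])
domain.fail_primary()
print(domain.primary, domain.backups)
```

Replicate nodes are kept in a separate list so that they can never be selected during promotion.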

A replicate node is similar to a backup node but is not capable of being an access point for a resource (that is, it cannot become a primary). The most common use of a replicate node is in a data CRG, where the data could be made available on a replicate node for report generation, although that node would never become the primary node.

The second recovery domain model is the peer model. With the peer model, there is no ordered recovery domain, and nodes can be defined as either peer or replicate. Peer CRGs support these role definitions. If nodes are defined as peer, then all the nodes in the recovery domain are equal and can provide the access point for the resource. However, there is no specified order during an outage of a peer node. The recovery domain nodes are notified when other nodes fail or have outages, but because there is no automatic response to these events, an application must provide the actions for those events.
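The difference from the primary-backup model can be sketched as follows. This is an illustrative model only, not IBM i cluster code: `PeerRecoveryDomain`, `report_outage`, and the callback signature are hypothetical names chosen for the example. The model only notifies the surviving peers; any recovery action is left to the callback that the application supplies.

```python
from typing import Callable, Iterable


class PeerRecoveryDomain:
    """Hypothetical model of a peer recovery domain with no node ordering."""

    def __init__(self, peers: Iterable[str],
                 on_peer_outage: Callable[[str, str], None]) -> None:
        self.active_peers = set(peers)        # unordered: all peers are equal
        self.on_peer_outage = on_peer_outage  # supplied by the application

    def report_outage(self, failed_node: str) -> None:
        """Notify every surviving peer; take no automatic recovery action."""
        if failed_node not in self.active_peers:
            return
        self.active_peers.remove(failed_node)
        # Sorting only makes the notification order deterministic here.
        for survivor in sorted(self.active_peers):
            self.on_peer_outage(survivor, failed_node)


events: list[tuple[str, str]] = []
domain = PeerRecoveryDomain(["NODE_A", "NODE_B", "NODE_C"],
                            lambda survivor, failed: events.append((survivor, failed)))
domain.report_outage("NODE_B")
print(events)
```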

The four types of roles a node can have in a recovery domain are:

Primary
The cluster node that is the primary point of access for the cluster resource.
  • For a data CRG, the primary node contains the principal copy of a resource.
  • For an application CRG, the primary node is the system on which the application is currently running.
  • For a device CRG, the primary node is the current owner of the device resource.
  • For a peer CRG, the primary role is not supported.

If the primary node for a CRG fails, or a manual switchover is initiated, then the primary point of access for that CRG is moved to the first backup node.

Backup
The cluster node that will take over the role of primary access if the present primary node fails or a manual switchover is initiated.
  • For a data CRG, this cluster node contains a copy of that resource which is kept current with replication.
  • For a peer CRG, the backup node is not supported.

Replicate
A cluster node that has copies of cluster resources, but is unable to assume the role of primary or backup. Failover or switchover to a replicate node is not allowed. If you ever want a replicate node to become a primary, you must first change the role of the replicate node to that of a backup node.
  • For peer CRGs, nodes defined as replicate represent the inactive access point for cluster resources.

Peer
A cluster node that is not ordered and can be an active access point for cluster resources. When the CRG is started, every node defined as peer is an active access point.
  • For a peer CRG, the access point is controlled entirely by the management application and not the system. The peer role is only supported by the peer CRG.
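The role definitions above can be summarized as a lookup from CRG type to supported roles. The following is a minimal sketch for illustration; the names `SUPPORTED_ROLES` and `role_is_supported` are hypothetical and are not part of any IBM i interface.

```python
# Roles supported by each CRG type, as described in this topic.
SUPPORTED_ROLES = {
    "data": {"primary", "backup", "replicate"},
    "application": {"primary", "backup", "replicate"},
    "device": {"primary", "backup", "replicate"},
    "peer": {"peer", "replicate"},
}


def role_is_supported(crg_type: str, role: str) -> bool:
    """Return True if a node role is valid for the given CRG type."""
    # Unknown CRG types support no roles in this sketch.
    return role in SUPPORTED_ROLES.get(crg_type, set())


print(role_is_supported("data", "backup"))   # True
print(role_is_supported("peer", "primary"))  # False
```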