The Autonomic Ownership Takeover Manager

The Autonomic Ownership Takeover Manager (AOTM) is a method by which either Read-Only Takeover (ROT) or Write-Only Takeover (WOT) can be automatically enabled against a failed TS7700 Cluster through internal negotiation methods. AOTM is optionally configurable.

When communication between two or more TS7700 clusters is disrupted, the clusters (local and remote) are no longer able to negotiate ownership of virtual volumes. In this scenario, ownership takeover, or human intervention, is sometimes utilized to establish temporary access to data resources. Ownership takeover occurs when an operator, working with the knowledge that one cluster is in a failed state, physically intervenes to obtain permission to access that cluster's data. When network problems cause a communication failure, ownership takeover is not the correct solution to reestablish access.

One solution is an automated process (AOTM) that permits a local cluster to access data from a remote cluster if normal communication is interrupted. Communication between clusters can be interrupted by a failure of the Grid network or by the failure of a cluster. Because data can be lost or compromised if a local cluster is allowed to access data from a working remote cluster, AOTM automatically grants a local cluster permission to access data from a remote cluster only when normal communication between the clusters is disrupted and the local cluster can verify that the remote cluster is offline or otherwise not operating.
Note: Ownership takeover, including ROT, WOT, and AOTM, should only be enabled when the TS7700 Cluster in question has actually failed. If communication is only interrupted via a network failure, a takeover mode should not be enabled. AOTM attempts to distinguish between a cluster failure and a network failure.

Before AOTM intervenes to allow a working cluster access to data from a remote cluster, it must determine whether the remote cluster is inaccessible because the cluster itself has failed or because the Grid network has failed. AOTM does this by sending a status message across a network that connects the TS3000 System Consoles (TSSCs) associated with the clusters. This network is referred to as the TSSC Grid Network and is illustrated by the following figures.

Figure 1. Autonomic Ownership Takeover Manager configuration for four clusters
Four nodes connect to network, and to Master Consoles by second network. Master Consoles connect by third network.
Figure 2. Autonomic Ownership Takeover Manager configuration for four clusters in a hybrid grid
Four nodes connect to network, and to Master Consoles by second network. Master Consoles connect by third network.

If a working cluster in a TS7700 Grid is unable to process transactions with a remote cluster because communication with the remote cluster has been lost, the local cluster starts an AOTM grace period timer. When the grace period that you configured expires, the AOTM cluster outage detection process is initiated. The local cluster communicates with the local TSSC, which then forwards a request to a remote TSSC. The remote TSSC then attempts to communicate with the remote cluster. The configured takeover mode is enabled only when the request to the remote TSSC returns and confirms that the remote cluster has failed. When the takeover mode is enabled, data that is owned by the failed cluster can be accessed through that mode.
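The sequence above can be sketched in a few lines of Python. This is only an illustration of the described flow, not the TS7700 implementation: the names `AotmConfig`, `run_outage_detection`, `grid_link_up`, and `ask_remote_tssc` are hypothetical, and the recheck of the Grid link after the grace period is an assumption that the text does not state.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AotmConfig:
    """Hypothetical container for the user-configured AOTM settings."""
    grace_period_s: float  # grace period before outage detection starts
    takeover_mode: str     # configured mode, "ROT" or "WOT"


def run_outage_detection(
    config: AotmConfig,
    grid_link_up: Callable[[], bool],
    ask_remote_tssc: Callable[[], Optional[bool]],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the takeover mode to enable, or None if takeover stays disabled.

    grid_link_up    -- True if the local cluster can reach the remote cluster.
    ask_remote_tssc -- models local cluster -> local TSSC -> remote TSSC ->
                       remote cluster. Returns True if the remote TSSC agrees
                       that the cluster has failed, False if it reports the
                       cluster as reachable, None if the request did not return.
    """
    if grid_link_up():
        return None  # communication is intact, nothing to do

    # Communication lost: wait for the configured grace period.
    sleep(config.grace_period_s)

    # Assumption: if communication recovers during the grace period,
    # outage detection is not started.
    if grid_link_up():
        return None

    # Outage detection: takeover is enabled only on a positive confirmation.
    if ask_remote_tssc() is True:
        return config.takeover_mode
    return None
```

Passing the probes in as callables keeps the sketch independent of any real network code. For example, `run_outage_detection(AotmConfig(0.0, "WOT"), lambda: False, lambda: True)` returns `"WOT"`, whereas an unanswered TSSC request (`lambda: None`) leaves takeover disabled.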

Conditions required for takeover

When AOTM is enabled on multiple systems in a TS7700 Grid environment, takeover occurs when one TS7700 Cluster fails, provided that all TSSCs attached to the TS7700 Clusters remain in communication with one another.
Note: To complete the communication path between TS7700 Clusters, each TSSC system console must have a designated IP address.
Figure 3 illustrates AOTM as configured on a three-cluster Grid. In an optimal situation, WAN1 and WAN2 would reside on different physical networks. Any outage within WAN1 might appear as a TS7700 Cluster outage. This failure can be validated through WAN2 with the help of the TSSCs. If the secondary TSSC path shows the TS7700 Cluster to be available, then the failure can be isolated to the network. If both TSSCs show the TS7700 Cluster as unavailable, then it is safe to assume that the TS7700 Cluster is in a failed state.
Note: Each TS7700 Cluster requires a regionally local TSSC. However, clusters in a TS7700 Grid can share a TSSC, and AOTM is supported with a single, shared TSSC when the clusters are in close proximity.
Figure 3. Autonomic Ownership Takeover Manager configuration in a three-cluster Grid
WAN consists of three connected clusters and attached TSSCs. Second WAN also connects to each TSSC.
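The two-path reasoning described for Figure 3 amounts to a small classification. The following Python sketch is illustrative only: the `Diagnosis` enum and the `diagnose` and `takeover_is_safe` functions are hypothetical names, and treating an unreachable TSSC path as "undetermined" is an assumption based on the rule that takeover is not enabled when a TSSC is not present or accessible.

```python
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    HEALTHY = "no failure detected"
    NETWORK_FAILURE = "failure isolated to the network"
    CLUSTER_FAILURE = "cluster assumed to be in a failed state"
    UNDETERMINED = "TSSC path unavailable, cannot verify"


def diagnose(reachable_over_wan1: bool,
             available_via_tssc: Optional[bool]) -> Diagnosis:
    """Classify an apparent outage from two independent observations.

    reachable_over_wan1 -- result of the normal Grid path (WAN1).
    available_via_tssc  -- result of the secondary TSSC path (WAN2):
                           True/False if the TSSCs answered, None if the
                           TSSC path itself could not be used.
    """
    if reachable_over_wan1:
        return Diagnosis.HEALTHY
    if available_via_tssc is None:
        return Diagnosis.UNDETERMINED
    if available_via_tssc:
        # The cluster answers over WAN2, so only WAN1 is broken.
        return Diagnosis.NETWORK_FAILURE
    # Neither path can reach the cluster.
    return Diagnosis.CLUSTER_FAILURE


def takeover_is_safe(diagnosis: Diagnosis) -> bool:
    """Takeover is appropriate only for a verified cluster failure."""
    return diagnosis is Diagnosis.CLUSTER_FAILURE
```

Separating the diagnosis from the takeover decision makes the key point explicit: a network failure and an unverifiable outage both leave takeover disabled.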
Table 1 displays the conditions under which AOTM takes over a failed cluster.
Table 1. Conditions under which AOTM takes over a failed cluster
| Remote cluster appearance | Remote cluster actual state | Does present third peer recognize cluster as down? | Status of Grid links | Status of links between local TSSC and local cluster | Status of links between remote TSSC and remote cluster | Status of links between TSSCs | Notes |
|---|---|---|---|---|---|---|---|
| Down | Down | Yes | Not applicable | Connected | Connected | Connected | |
| Down | Down | Yes | Not applicable | Connected | Down | Connected | |
| Down | Online | Not present | Down | Connected | Down | Connected | Cluster is assumed down since last network path is down. |
| Down | Offline | Yes | Not applicable | Connected | Not applicable | Connected | |
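For readers who want to reason about Table 1 programmatically, its rows can be encoded as data. The sketch below is a literal transcription of the four rows, not a model of AOTM's internal logic; the `Scenario` type, its field names, and `listed_in_table_1` are hypothetical. The "Remote cluster appearance" column is omitted because it is "Down" in every row.

```python
from typing import NamedTuple

NA = "Not applicable"


class Scenario(NamedTuple):
    actual_state: str          # "Down", "Online", or "Offline"
    third_peer_sees_down: str  # "Yes", "No", or "Not present"
    grid_links: str
    local_tssc_link: str       # local TSSC <-> local cluster
    remote_tssc_link: str      # remote TSSC <-> remote cluster
    tssc_to_tssc_link: str     # links between the TSSCs


# The four rows of Table 1, in order.
TABLE_1 = frozenset({
    Scenario("Down", "Yes", NA, "Connected", "Connected", "Connected"),
    Scenario("Down", "Yes", NA, "Connected", "Down", "Connected"),
    Scenario("Online", "Not present", "Down", "Connected", "Down", "Connected"),
    Scenario("Offline", "Yes", NA, "Connected", NA, "Connected"),
})


def listed_in_table_1(scenario: Scenario) -> bool:
    """True if the scenario matches a row in which AOTM takes over."""
    return scenario in TABLE_1


# Check a property shared by every takeover row: both TSSC-side links
# that the local cluster depends on are connected.
always_connected = all(
    s.local_tssc_link == "Connected" and s.tssc_to_tssc_link == "Connected"
    for s in TABLE_1
)
```

For these four rows `always_connected` evaluates to `True`: in every takeover case the local cluster can reach its TSSC and the TSSCs can reach one another, which matches the requirement that the TSSCs remain in communication.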
Table 2 displays the conditions under which AOTM does not take over a failed cluster.
Table 2. Conditions under which AOTM does not take over a failed cluster
| Remote cluster appearance | Remote cluster actual state | Does present third peer recognize cluster as down? | Status of Grid links | Status of links between local TSSC and local cluster | Status of links between remote TSSC and remote cluster | Status of links between TSSCs | Notes |
|---|---|---|---|---|---|---|---|
| Down | Down | Yes/not present | Not applicable | Connected | Not applicable | Down | |
| Down | Down | Yes/not present | Not applicable | Down | Not applicable | Not applicable | |
| Down | Online | No | Down | Not applicable | Not applicable | Not applicable | Third cluster prevents takeover |
| Down | Online | Not present | Down | Connected | Down | Connected | Cluster is assumed down since last network path is down. |
| Down | Online | Not present | Down | Connected | Connected | Connected | |
| Down | Offline | Yes | Not applicable | Down | Not applicable | Down | Takeover not enabled if TSSC is not present or accessible |
| Down | Offline | Yes | Not applicable | Connected | Not applicable | Down | Takeover not enabled if TSSC is not present or accessible |

Differences between forms of takeover

AOTM is not the only form of ownership takeover that can be employed in the event of a system failure. The following descriptions of Service Ownership Takeover (SOT), Read-Only Takeover / Write-Only Takeover (ROT / WOT), and AOTM are provided to avoid confusion when discussing options for ownership takeover.
SOT
SOT is activated during normal operating conditions before a system is brought offline for upgrade, maintenance, or relocation. A TS7700 Cluster in the SOT state surrenders ownership of all its data, and other TS7700 Clusters in the Grid can access and mount its virtual volumes.
ROT / WOT
ROT / WOT is employed when a TS7700 Cluster is in the failed state and cannot be placed in SOT, or Service mode. Until ROT / WOT is established, the virtual volumes that belong to the failed TS7700 Cluster cannot be accessed or modified. You must use the TS7700 Management Interface from an active TS7700 Cluster in the Grid to establish ROT / WOT for the failed TS7700 Cluster.

AOTM
AOTM represents an automation of ROT / WOT. AOTM makes it unnecessary for a user to physically intervene through the TS7700 Management Interface to access data on a failed TS7700 Cluster. AOTM limits the amount of time that virtual volumes from a failed TS7700 Cluster are inaccessible and reduces opportunities for human error while establishing ROT / WOT.