Troubleshooting manual recovery for storage subsystems

PowerHA® SystemMirror® Enterprise Edition Version 7.1.2, or later, supports various storage subsystems that provide high availability for applications and services by monitoring for failures and recovering from them automatically. The storage subsystems use various replication technologies to replicate data between a primary data center and an auxiliary data center.

If the storage subsystem is online and available, PowerHA SystemMirror Enterprise Edition 7.1.2, or later, can automatically manage the replicated data during fallover and fallback. However, in the following scenarios PowerHA SystemMirror Enterprise Edition does not automatically manage the replicated data, and manual intervention is required:
  • PowerHA SystemMirror Enterprise Edition cannot determine the status of the storage subsystem, storage links, or device groups. In this scenario, PowerHA SystemMirror Enterprise Edition stops the cluster event processing and displays the corrective actions in the /var/hacmp/log/hacmp.out log file. To troubleshoot storage subsystem problems, review the information in the RECOMMENDED USER ACTIONS section in the /var/hacmp/log/hacmp.out log file.

    When the storage subsystem is brought back online, you must manually resume cluster event processing by selecting Problem Determination Tools > Recover PowerHA SystemMirror From Script Failure from the SMIT interface.

  • A fallover occurs for a partitioned cluster across different sites. The primary and auxiliary partitions each begin to write data to their own local storage subsystem. When the primary partition recovers and the storage links are brought back online, you must determine whether the data from the two sites can be merged or whether one site's data must replace the other site's data. In this scenario, you do not want PowerHA SystemMirror Enterprise Edition to use the automatic recovery function.
    To configure PowerHA SystemMirror Enterprise Edition to use manual recovery, complete the following steps:
    1. From the command line, enter smit sysmirror.
    2. In the SMIT interface, select Cluster Applications and Resources > Resources.
    3. Select the storage subsystem that you want to configure for manual recovery.
    4. From the Recovery Action field, select MANUAL.
  • If an outage affects all the mirror links between the source site and target site, IBM FlashSystem® A9000 or IBM® XIV® Storage System on the primary storage might not fail over to the secondary storage. In this scenario, the mirror consistency group relationship is still active, but the mirror_switch_roles command fails. If you want the mirror consistency group to fail over to the secondary storage, you must manually perform the following steps:
    1. Deactivate the mirror consistency group relationship on the primary storage by running the following command:
      mirror_deactivate -y cg=cgname
    2. On the secondary storage, change the role of the consistency group to Primary by running the following command:
      mirror_change_role -y cg=cgname role=Master
    3. On the primary storage, change the role of the consistency group to Slave by running the following command:
      mirror_change_role -y cg=cgname role=Slave
      Note: When you change the role of the consistency group on the secondary storage, the volume group on the secondary storage can be in the Varied ON state because the mirror volumes on the secondary storage are no longer in read-only mode. When you run the mirror_change_role command on the primary storage, a delay occurs because the I/O activity on the host is disrupted. To avoid the delay, stop the disk I/O activity on the host before you run the mirror_change_role command.
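The three manual failover steps above can be collected into a short dry-run helper. This is a minimal sketch, not an IBM-supplied tool: the function name print_manual_failover is hypothetical, and the script only prints the commands from the steps, labeled with the storage system on which each is run. It does not connect to any storage system, so how the commands are submitted to the storage CLI is left to the operator.

```shell
#!/bin/sh
# Dry run only: print the manual failover commands for one consistency
# group in the order given in the steps. Nothing is sent to the storage.
# print_manual_failover is a hypothetical helper name for this sketch.
print_manual_failover() {
    cg="$1"
    if [ -z "$cg" ]; then
        echo "usage: print_manual_failover <consistency-group-name>" >&2
        return 1
    fi
    # Step 1: deactivate the mirror relationship on the primary storage.
    echo "primary:   mirror_deactivate -y cg=$cg"
    # Step 2: change the role on the secondary storage.
    echo "secondary: mirror_change_role -y cg=$cg role=Master"
    # Step 3: change the role on the primary storage.
    echo "primary:   mirror_change_role -y cg=$cg role=Slave"
}

# Use the first script argument if given, else the placeholder "cgname".
print_manual_failover "${1:-cgname}"
```

Printing the commands before running them makes it easier to confirm the consistency group name and the order of operations, and to stop the disk I/O activity on the host before the final role change.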
Important: The input and output syntax of command line interface (CLI) commands uses the legacy terminology of "Master", "SMaster", and "Slave" volumes, which are referred to as "Primary", "Secondary", and "Tertiary" in all documentation except the CLI reference. This inconsistency is a necessary compromise, required to avoid changes to older CLI commands that are in customer use, and also to keep the CLI terminology consistent across the board. The new terminology helps emphasize the commonality between the more recent functions of Multi-site HA/DR and high availability (HyperSwap) and the disaster recovery functions (Synchronous and Asynchronous mirroring). It is used outside the CLI reference, where broader concepts can be explained.
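For the first scenario, where cluster event processing stops and the corrective actions are written to the log, the relevant section can be pulled out of a long log file with standard tools. The following is a minimal sketch that assumes the heading appears literally as RECOMMENDED USER ACTIONS in the log. The function name show_user_actions and the default of 20 trailing lines are illustrative choices, not documented values.

```shell
#!/bin/sh
# Print each "RECOMMENDED USER ACTIONS" heading and the lines after it,
# prefixed with the line number in the log.
# Assumptions: the heading text is matched literally, and 20 trailing
# lines is an arbitrary default, not a documented section length.
show_user_actions() {
    log="${1:-/var/hacmp/log/hacmp.out}"
    context="${2:-20}"
    awk -v n="$context" '
        /RECOMMENDED USER ACTIONS/ { left = n + 1 }   # heading plus n lines
        left > 0 { print NR ": " $0; left-- }
    ' "$log"
}
```

Calling show_user_actions with no arguments reads /var/hacmp/log/hacmp.out. A different log path and line count can be passed as the first and second arguments.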