Metro Mirror heartbeat
The heartbeat is a Metro Mirror function. When the Metro Mirror heartbeat is disabled, data consistency across multiple storage systems is not guaranteed if the Copy Services Manager management server cannot communicate with one or more storage systems. The problem occurs as a result of the Hardware Freeze Timeout Timer function within the storage system. If the controlling software loses its connection to a storage system, the Metro Mirror relationships that it controls remain established, and there is no way to freeze those pairs to create consistency across the multiple storage systems. When the freeze times out, dependent I/O is written to the target storage systems, which might corrupt data consistency. Freeze refers to the Metro Mirror (peer-to-peer remote copy [PPRC]) freeze function.
When determining whether to use the Metro Mirror heartbeat, analyze your business needs. Disabling the Metro Mirror heartbeat might result in data inconsistency. If you enable the Metro Mirror heartbeat and a freeze occurs, your applications will be unable to write during the freeze.
Metro Mirror heartbeat is disabled by default.
Metro Mirror heartbeat is not available for Metro Mirror with HyperSwap® or Metro Global Mirror with HyperSwap.
- Freeze event not detected by a disconnected storage system
- Consider a situation with four storage systems in a primary site and four in a secondary site. One of the four storage systems on the primary site loses the connection to the target site. This causes the affected storage system to prevent any writes from occurring for a period determined by the Freeze timeout timer. At the same time, the affected storage system loses communication with the controlling software and cannot communicate the Freeze event to the software.
Unaware of the problem, the controlling software does not issue the Freeze command to the remaining source storage systems. The freeze will stop dependent writes from being written to connected storage systems. However, once the Freeze times out and the long-busy is terminated, dependent write I/Os continue to be copied from the storage systems that did not receive the Freeze command. The Metro Mirror session remains in a state where one storage system has suspended copying while the other three storage systems are still copying data. This state causes inconsistent data on the target storage systems.
- Freeze event detected, but unable to propagate the Freeze command to all storage systems
- Consider a situation with four storage systems in a primary site and four in a secondary site. One of the four storage systems on the primary site loses the connection to the target site. This causes the affected storage system to issue long-busy to the applications for a period determined by the Freeze timeout timer. At the same time, one of the remaining three source systems loses communication with the controlling software.
The storage system that had an error writing to its target communicates the Freeze event to the controlling software. The controlling software issues the Freeze command to all but the disconnected storage system (the one that lost communication with the software). The long-busy stops dependent writes from being written to the connected storage systems.
However, once the Freeze times out on the frozen storage system and the long-busy is terminated, dependent write I/Os continue to the target storage system from the source storage system that lost communication and did not receive the Freeze command. The Metro Mirror session remains in a state where three storage systems have suspended copying and one storage system is still copying data. This state causes inconsistent data on the target storage systems.
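The two scenarios above can be traced with a small model. This is only an illustrative sketch: `SourceSystem`, `run_freeze`, `is_consistent`, and the scenario helpers are hypothetical names, not part of Copy Services Manager, and the model ignores timing (the long-busy period and the Freeze timeout) to show only the end state after the freeze has timed out.

```python
from dataclasses import dataclass


@dataclass
class SourceSystem:
    """Hypothetical model of one source storage system in a Metro Mirror session."""
    name: str
    link_to_target_ok: bool = True    # replication path to the secondary site
    link_to_manager_ok: bool = True   # connection to the controlling software
    copying: bool = True              # still mirroring writes to its target


def run_freeze(systems):
    """Apply the freeze logic described above, with no heartbeat in place."""
    # A system that cannot reach its target stops copying on its own.
    for s in systems:
        if not s.link_to_target_ok:
            s.copying = False

    # The software learns of the event only from a failed system
    # that can still communicate with it.
    event_detected = any(
        not s.link_to_target_ok and s.link_to_manager_ok for s in systems
    )

    # The Freeze command reaches only systems the software can still contact.
    if event_detected:
        for s in systems:
            if s.link_to_manager_ok:
                s.copying = False
    return event_detected


def is_consistent(systems):
    """Targets stay consistent only if every source is in the same copy state."""
    return len({s.copying for s in systems}) == 1


def scenario_1():
    """Failed system also loses contact with the controlling software."""
    systems = [SourceSystem(n) for n in "ABCD"]
    systems[0].link_to_target_ok = False
    systems[0].link_to_manager_ok = False
    return systems


def scenario_2():
    """Failed system reports the event, but another system is unreachable."""
    systems = [SourceSystem(n) for n in "ABCD"]
    systems[0].link_to_target_ok = False
    systems[1].link_to_manager_ok = False
    return systems


if __name__ == "__main__":
    for build in (scenario_1, scenario_2):
        systems = build()
        detected = run_freeze(systems)
        still_copying = [s.name for s in systems if s.copying]
        print(build.__name__, detected, still_copying, is_consistent(systems))
```

In the first scenario the model ends with one system suspended and three copying; in the second, three suspended and one copying. Both leave the sources in mixed states, which is the inconsistency described above.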
Before Copy Services Manager V3.1, if the controlling software within a Metro Mirror environment detected that a managed storage system lost its connection to its target, the controlling software stopped all the other source systems to ensure consistency across all the targets. However, if the controlling software lost communication with any of the source subsystems during the failure, it could not notify those storage systems of the freeze event or ensure data consistency. The Metro Mirror heartbeat helps to overcome this problem. In a high-availability configuration, the Metro Mirror heartbeat is continued by the standby server after the Takeover command is issued on the standby, enabling you to perform actions on the standby server without causing a freeze.
- The Metro Mirror heartbeat can cause a single point of failure: if an error occurs on just the management server and not the storage system, a freeze might occur.
- When the Metro Mirror heartbeat timeout occurs, the storage system remains in a long busy state for the duration of the LSS freeze timeout.
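The heartbeat behavior can be pictured as a watchdog on the storage system. The sketch below is an assumption-laden illustration, not the storage system's actual logic: the class name `HeartbeatWatchdog`, the state names, and the timeout values are all hypothetical. It assumes that a storage system freezes on its own when heartbeats stop arriving, stays in long busy for the freeze timeout, and afterward accepts writes again with copying suspended.

```python
class HeartbeatWatchdog:
    """Hypothetical storage-side view of the Metro Mirror heartbeat."""

    def __init__(self, heartbeat_timeout, freeze_timeout, now=0.0):
        self.heartbeat_timeout = heartbeat_timeout  # seconds allowed without a heartbeat
        self.freeze_timeout = freeze_timeout        # seconds the long busy lasts
        self.last_heartbeat = now
        self.frozen_at = None                       # time the freeze began, if any

    def receive_heartbeat(self, now):
        """Record a heartbeat from the management server."""
        self.last_heartbeat = now

    def state(self, now):
        """Return 'copying', 'long_busy', or 'suspended' at time `now`."""
        missed = now - self.last_heartbeat > self.heartbeat_timeout
        if self.frozen_at is None and missed:
            # The freeze starts at the moment the heartbeat timeout expired.
            self.frozen_at = self.last_heartbeat + self.heartbeat_timeout
        if self.frozen_at is None:
            return "copying"
        if now - self.frozen_at < self.freeze_timeout:
            return "long_busy"
        # Assumption: after the freeze timeout, writes resume but are not mirrored.
        return "suspended"


if __name__ == "__main__":
    # Arbitrary example values: 5-second heartbeat timeout, 60-second freeze timeout.
    wd = HeartbeatWatchdog(heartbeat_timeout=5, freeze_timeout=60)
    wd.receive_heartbeat(now=3)
    for t in (7, 10, 70):
        print(t, wd.state(now=t))
```

The sketch also shows the single point of failure noted above: the watchdog cannot tell a failed management server from a failed network path, so a management server error alone is enough to trigger the freeze.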
The Metro Mirror heartbeat is supported on storage systems connected through a TCP/IP (direct connect or HMC) connection. It is not supported on storage systems connected through a z/OS® connection. Enabling the Metro Mirror heartbeat with a z/OS connection does not fail; however, a warning message is displayed specifying that the Metro Mirror heartbeat function does not work unless you have an IP connection.
If the Metro Mirror heartbeat is enabled for storage systems that are connected through both a TCP/IP (either direct connect or HMC) connection and a z/OS connection, and the TCP/IP connection fails, Copy Services Manager suspends the Metro Mirror session because there is no heartbeat through the z/OS connection.
If the Metro Mirror heartbeat is enabled for storage systems that are connected through both a TCP/IP connection and a z/OS connection, and you remove all TCP/IP connections, Copy Services Manager suspends the Metro Mirror sessions, and the applications that use those volumes remain in an Extended Long Busy state until the storage system's internal timeout timer expires. To avoid the Extended Long Busy timeout, disable the Metro Mirror heartbeat for all Metro Mirror sessions before you remove the last TCP/IP connection.
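The connection rules above can be expressed as a simple pre-check to run before removing a connection. This is a minimal sketch of the decision logic only; `Conn`, `heartbeat_works`, and `safe_to_remove` are hypothetical names and do not correspond to any Copy Services Manager command or API.

```python
from enum import Enum


class Conn(Enum):
    """Connection types named in this topic."""
    DIRECT = "direct"   # TCP/IP direct connect
    HMC = "hmc"         # TCP/IP through an HMC
    ZOS = "zos"         # z/OS connection


IP_TYPES = {Conn.DIRECT, Conn.HMC}


def heartbeat_works(connections):
    """The heartbeat functions only when at least one TCP/IP connection exists."""
    return any(c in IP_TYPES for c in connections)


def safe_to_remove(connections, to_remove, heartbeat_enabled):
    """Return False if removing `to_remove` would drop the last TCP/IP
    connection while the heartbeat is still enabled."""
    remaining = list(connections)
    remaining.remove(to_remove)  # raises ValueError if the connection is not present
    if not heartbeat_enabled:
        return True
    losing_last_ip = heartbeat_works(connections) and not heartbeat_works(remaining)
    return not losing_last_ip


if __name__ == "__main__":
    conns = [Conn.HMC, Conn.ZOS]
    print(safe_to_remove(conns, Conn.HMC, heartbeat_enabled=True))
    print(safe_to_remove(conns, Conn.HMC, heartbeat_enabled=False))
```

When the check returns False, the safe order is the one given above: disable the heartbeat for all Metro Mirror sessions first, then remove the connection.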