4040 Replication suspended and a full resynchronization is required for one or more volume groups.

Explanation

Replication between the production and the recovery system is suspended. The suspended state occurs when errors exist in the replication configuration or replication is stopped intentionally. A full resynchronization of one or more volume groups is required.

User response

This event is logged when a system that uses policy-based replication is recovered after a severe system outage. After recovery, the system suspends replication to maintain the existing recovery point of replicated volume groups and prevent potential data loss from being replicated.

When replication is restarted after a system outage, a full resynchronization of the copies is required. The resynchronization process temporarily uses more capacity to maintain the current recovery point until a full resynchronization completes. Replication must be restarted for each volume group individually. You can choose to limit the number of volume groups that are resynchronizing at one time to manage the additional capacity required during the synchronization. Before you restart resynchronization, prioritize the order of the volume groups, monitor system capacity, and provide additional capacity, if necessary.

The event is automatically marked as fixed when no volume groups with suspended status remain on the system. To unsuspend and restart replication, first determine which system reports a replication status of suspended. Enter the following command on one of the systems defined in the replication policy:
lsvolumegroupreplication <volumegroup_id/name>
where volumegroup_id or name is the ID or name of the volume group. You can run this command on either the production or recovery system to display the status of both locations. For example, the following output shows a suspended production copy:
local_location 1
location1_system_name system1
location1_replication_mode production
location1_volumegroup_id 1
location1_status suspended

In this case, volume group 1 is reporting the suspended status on the production system.

If the recovery copy is suspended, the lsvolumegroupreplication command displays the following results:


local_location 2
location2_system_name system2
location2_replication_mode recovery
location2_volumegroup_id 5
location2_status suspended

In this case, volume group 5 on the recovery system is suspended.
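The check described above can be scripted. The following Python sketch is illustrative only: it assumes that the command output has already been captured as text and that it consists of the "field value" lines shown in the examples in this topic. The function names parse_fields and suspended_copies are hypothetical and are not part of the system CLI.

```python
import re


def parse_fields(output):
    """Split 'field value' lines of captured command output into a dict."""
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # field name, then the rest of the line
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def suspended_copies(fields):
    """Return one entry per location whose status field is 'suspended'."""
    found = []
    for key, value in fields.items():
        match = re.fullmatch(r"location(\d+)_status", key)
        if match and value == "suspended":
            n = match.group(1)
            found.append({
                "location": int(n),
                "system_name": fields.get(f"location{n}_system_name"),
                "mode": fields.get(f"location{n}_replication_mode"),
                "volumegroup_id": fields.get(f"location{n}_volumegroup_id"),
            })
    return found


# Sample text copied from the recovery-copy example in this topic.
sample = """local_location 2
location2_system_name system2
location2_replication_mode recovery
location2_volumegroup_id 5
location2_status suspended"""

for copy in suspended_copies(parse_fields(sample)):
    print(copy["system_name"], copy["mode"], "volume group", copy["volumegroup_id"])
```

The mode value of each entry (production or recovery) indicates which of the following procedures applies.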

Depending on the system that displays the suspended status, different user actions are required. Use the following information to resynchronize data:

If the recovery copy is suspended:
Issue the following command on the recovery system:
chvolumegroupreplication -unsuspend <volumegroup_id/name>

This command starts a resynchronization of the specified volume group. If multiple volume groups need resynchronization, you can limit the number of volume groups that resynchronize at one time to manage the additional capacity that is required during the resynchronization.
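One way to plan a staged resynchronization is to group the prioritized volume groups into batches and unsuspend one batch at a time. The following Python sketch only builds the command strings and does not run them. The batch size is a limit that you choose, not a system setting, and the function name plan_unsuspend_batches and the volume group names are hypothetical.

```python
def plan_unsuspend_batches(volume_groups, max_concurrent):
    """Group prioritized volume groups into batches of unsuspend commands.

    volume_groups is ordered from highest to lowest priority.
    max_concurrent is the operator-chosen limit on volume groups
    that resynchronize at one time.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    batches = []
    for start in range(0, len(volume_groups), max_concurrent):
        batch = volume_groups[start:start + max_concurrent]
        batches.append(
            [f"chvolumegroupreplication -unsuspend {vg}" for vg in batch]
        )
    return batches


# Hypothetical volume group names, highest priority first.
prioritized = ["vg_db", "vg_app", "vg_logs"]

for number, commands in enumerate(plan_unsuspend_batches(prioritized, 2), start=1):
    print(f"Batch {number}:")
    for command in commands:
        print(f"  {command}")
```

Before you start the next batch, confirm that the previous batch finished resynchronizing and that sufficient capacity is available.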

If the production copy is suspended:

A suspended production copy indicates either a planned or unplanned outage at the production location. Depending on when the outage occurred and the state of the data, you might lose data if the outage lasts longer than your Recovery Point Objective (RPO).

After the outage is resolved, verify whether the data on the production volumes is consistent and not corrupted.

If the data is consistent, enter the following command on the production system:
chvolumegroupreplication -unsuspend <volumegroup_id/name>

If the data is not consistent, complete the steps that apply to how replication is managed for the volume group:

For volume groups where replication is managed using external software:

If replication for a volume group is managed by external orchestration software, such as VMware Site Recovery Manager (SRM), use the appropriate workflow in that application to fail over to the recovery copy.

Verify that the data is usable on the recovery volume groups.
Note: If data is inconsistent with the production copy, determine whether the data loss is acceptable for your current RPO.
Run the following command on the production system to allow replication to be restarted:
chvolumegroupreplication -unsuspend <volumegroup_id/name>

Use the appropriate application workflow to restart replication.

After the data is resynchronized on the original production system, you can change the direction of the replication back to the original configuration by using the appropriate application workflow.

For volume groups where replication is managed using the native storage system:
On the recovery system, enter the following command:
chvolumegroupreplication -mode independent <volumegroup_id/name>

This command fails over to the recovery system and the recovery volume groups. Hosts are able to access the volumes while the volume group is in independent mode.

Verify that the data is usable on the recovery volume groups.
Note: If data is inconsistent with the production copy, determine whether the data loss is acceptable for your current RPO.

If you are satisfied with the data, run the following command on the production system:

chvolumegroupreplication -unsuspend <volumegroup_id/name>

Run the following command on the recovery system to restart replication using this copy as the production copy:

chvolumegroupreplication -mode production <volumegroup_id/name>
This command makes the recovery system the new production system. The data is replicated back to the original production system.
After the data is resynchronized on the original production system, you can change the direction of replication back to the original configuration. To do so, enable access to the recovery copy on the original production system, and then restart replication with that copy as the production copy. You can complete these actions in the management GUI on the original production system, or by entering the following commands on the original production system:
chvolumegroupreplication -mode independent <volumegroup_id/name>
chvolumegroupreplication -mode production <volumegroup_id/name>
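
The procedures in this topic can be summarized as a decision sketch. The following Python example is illustrative only: it returns the ordered steps as text and does not run any commands. The function name resync_steps and its parameters are hypothetical. The sketch uses the -unsuspend spelling from the first unsuspend commands in this topic and assumes that the same volume group name is valid on both systems. If the names differ, substitute the ID or name that applies on each system. The steps for reversing the replication direction afterward are not included.

```python
CMD = "chvolumegroupreplication"
VERIFY = "verify that the data is usable on the recovery volume group"


def resync_steps(suspended_copy, vg, data_consistent=True,
                 external_orchestration=False):
    """Return ordered (where, action) steps for one suspended volume group.

    suspended_copy: "production" or "recovery", from the status output.
    data_consistent: result of the manual check of the production volumes.
    external_orchestration: True if software such as VMware SRM manages
    replication for the volume group.
    """
    if suspended_copy == "recovery":
        return [("recovery", f"{CMD} -unsuspend {vg}")]
    if suspended_copy != "production":
        raise ValueError("suspended_copy must be 'production' or 'recovery'")
    if data_consistent:
        return [("production", f"{CMD} -unsuspend {vg}")]
    if external_orchestration:
        return [
            ("external software", "fail over to the recovery copy"),
            ("manual", VERIFY),
            ("production", f"{CMD} -unsuspend {vg}"),
            ("external software", "restart replication"),
        ]
    return [
        ("recovery", f"{CMD} -mode independent {vg}"),
        ("manual", VERIFY),
        ("production", f"{CMD} -unsuspend {vg}"),
        ("recovery", f"{CMD} -mode production {vg}"),
    ]


# Hypothetical volume group name; production copy suspended, data inconsistent.
for where, action in resync_steps("production", "vg_db", data_consistent=False):
    print(f"{where}: {action}")
```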