The recover system procedure recovers the entire storage system if the system state is lost from
all control enclosure node canisters. The procedure re-creates the storage system by using saved
configuration data and is also known as Tier 3 (T3) recovery. The saved configuration data is in the
active quorum disk and the latest JSON configuration backup file. For IBM Storage Virtualize for
Public Cloud, the saved configuration data is in the system disk and the latest JSON configuration
backup file. The recovery might not be able to restore all volume data.
CAUTION:
If the system encounters a state where:
- No nodes are active
Do not attempt to initiate a node rescue. Contact IBM® Support. If you start the system recovery
procedure while in this specific state, loss of the JSON configuration backup files can result.
If the system encounters a state where:
- No nodes are active
- One or more nodes have node errors that require a node rescue, node canister replacement, or
node firmware re-installation.
Do not attempt system recovery. Contact IBM Remote Technical Support. If you start the system
recovery procedure while in this specific state, loss of the JSON backup of the block volume
storage configuration can result.
Attention:
- Run service actions only when directed by the fix procedures. If used inappropriately, service
actions can cause loss of access to data or even data loss. Before you attempt to recover a system,
investigate the cause of the failure and attempt to resolve those issues by using other fix
procedures. Read and understand all of the instructions before you complete any action.
- The recovery procedure can take several hours if the system uses large-capacity devices as
quorum devices.
- If there are offline arrays after you run the recovery procedure, contact IBM Support.
Do not attempt the recover system procedure unless the following conditions are met:
- All of the conditions are met in When to run the recover system procedure.
- All hardware errors are fixed. See Fix hardware errors.
- All node canisters have candidate status. Otherwise, see step 1.
- All node canisters must be at the same level of code that the storage system had before the
system failure. If any node canisters were modified or replaced, use the service assistant to verify
the levels of code and, where necessary, to reinstall the level of code so that it matches the level
that is running on the other node canisters in the system. For more information, see Removing system information for nodes with error code 550 or error code 578 using the service assistant.
- If the system was using IP quorum for T3 metadata, verify that all the IP quorum applications
are running.
- If the system recovery occurs during a non-disruptive system migration,
recovery of system data is dependent on the point in the migration process when the system recovery
action occurred. For more information, see Verifying migration volumes after a system recovery.
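The candidate-status condition in the list above can be checked against captured output from
sainfo lsservicenodes. The following is a minimal sketch, not an IBM-supplied tool. It assumes that
the first output line is a header, that the first field of each row is the panel name, and that each
row contains one status word (Active, Service, Candidate, or Starting). The function name
non_candidate_nodes and the sample rows are hypothetical; compare them with the output of your own
system before you rely on the result.

```shell
#!/bin/sh
# Sketch: list every node that is NOT in candidate status.
# Input: text in the style of `sainfo lsservicenodes` on stdin.
# Assumptions (not confirmed by the source): line 1 is a header, field 1 is the
# panel name, and one field per row is a status word.
non_candidate_nodes() {
  awk 'NR > 1 && NF > 0 {
         status = "unknown"
         # Scan the row for the first field that looks like a node status.
         for (i = 2; i <= NF; i++)
           if ($i ~ /^(Active|Service|Candidate|Starting)$/) { status = $i; break }
         if (status != "Candidate") print $1, status
       }'
}

# Hypothetical sample output, for illustration only.
sample_output='panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-1 1 node1 local Candidate
01-2 2 node2 partner Service 550'

printf '%s\n' "$sample_output" | non_candidate_nodes
# prints: 01-2 Service
```

An empty result suggests that every listed node reports candidate status. Any line that is printed
names a node that still needs attention before the recovery is attempted.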
The system recovery procedure is one of several tasks that must be completed. The following list
is an overview of the tasks and the order in which they must be completed:
- Preparing for system recovery:
- Review the information about when to run the recover system procedure.
- Fix your hardware errors and make sure that all nodes in the system are shown in service
assistant or in the output from sainfo lsservicenodes.
- Remove the system information for node canisters with error code 550 or error code 578 by using
the service assistant, but only if the recommended user responses for these node errors are
followed. See Removing system information for nodes with error code 550 or error code 578 using the service assistant.
- For Virtual Volumes (VVols), shut down the services for any instances of Spectrum Control Base
that are connecting to the system. Use the Spectrum Control Base command service
ibm_spectrum_control stop.
- Remove hot spare nodes from the system and set them into candidate mode before you start the
recovery process. Run the following CLI command to remove the node from the system:
satask leavecluster -force spare-node-panel-name
After the node returns in service mode, run the following CLI command to set it into candidate mode:
satask stopservice spare-node-panel-name
- For IBM Storage Virtualize for Public Cloud on Amazon Web Services (AWS), if you run the recovery
from a non-configuration node, the system ID might change, and the system is then unable to detect
the Amazon Elastic Block Store (EBS) volumes that it originally managed. In this case, delete the
EBS tags first, and then start the T3 recovery.
- For IBM Storage Virtualize for Public Cloud on Microsoft Azure, if you run the recovery from a
non-configuration node, the cluster ID might change, and the system is then unable to detect the
Azure disks that it originally managed. In this case, delete the Azure disk tags first, and then
start the T3 recovery.
- For IBM Storage Virtualize for Public Cloud on Amazon Web Services (AWS), before you run the
satask t3recovery -prepare and svcconfig restore CLI commands, delete the tag values of
IBM-SV-cluster-id and IBM-SV-cluster-name, and detach the EBS volumes that are managed by the system
from the AWS console. Continue with the T3 procedure after this action is complete.
- For IBM Storage Virtualize for Public Cloud on Microsoft Azure, before you run the
satask t3recovery -prepare and svcconfig restore CLI commands, delete the Azure disk tag values of
IBM-SV-cluster-id and IBM-SV-cluster-name, and detach the Azure disk volumes that are managed by the
cluster from the Microsoft Azure console. Continue with the T3 procedure after this action is
complete.
- Running the system recovery. After you prepare the system for recovery and meet all the
preconditions, run the system recovery.
Note: Run the procedure on one system in a fabric at a time.
Do not run the procedure on different node canisters in the same system. This restriction also
applies to remote systems.
- Completing actions to get your environment operational.
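The hot spare preparation step in the list above can be sketched as a dry run that only prints the
commands in the required order, so that the sequence can be reviewed before anything is entered on
the system. This is an illustrative sketch and not part of the documented procedure. The function
name print_spare_prep_commands and the panel name 01-2 are hypothetical, and the script does not
contact a system or run any satask command.

```shell
#!/bin/sh
# Sketch: print, for each hot spare panel name, the two commands from the
# preparation step in the order the procedure gives them.
# Nothing is executed; the output is for review only.
print_spare_prep_commands() {
  for panel in "$@"; do
    # Step 1: remove the hot spare node from the system.
    printf 'satask leavecluster -force %s\n' "$panel"
    # Reminder: the second command applies only after the node is in service mode.
    printf '# wait until %s returns in service mode, then:\n' "$panel"
    # Step 2: set the node into candidate mode.
    printf 'satask stopservice %s\n' "$panel"
  done
}

# Hypothetical panel name, for illustration only.
print_spare_prep_commands 01-2
```

Passing several panel names prints one block of three lines per hot spare, which keeps the
leavecluster and stopservice commands paired for each node.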