Question & Answer
Question
How do you fail over a SPU or SFI?
Answer
This document assumes you are familiar with the normal operations of an NPS system in terms of SPU states and roles and system states.
Replacing a failed SPU
These instructions assume the following:
- The system state has been Online and is not being nzstopped. (Check by running the nzstate command. Valid SYSTEM states are Initialized, Online, Pause*, Fail*, Synch*.)
- The number of SPUs reported by nzstats is correct (28, 56, 112, 224, 448, or 672). If not, determine which devices have been removed from the system and delete the removed devices.
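The two assumptions above can be sketched as small shell helpers. This is only an illustration: the function names are hypothetical, and how you extract the state name and SPU count from the nzstate and nzstats output is left to the operator, because the exact output format is not described here.

```shell
#!/bin/sh
# Hypothetical helpers; they do not call any nz* command themselves.

# SPU totals listed in this document as correct.
VALID_SPU_COUNTS="28 56 112 224 448 672"

# Return 0 if $1 is one of the expected SPU totals.
is_valid_spu_count() {
    for n in $VALID_SPU_COUNTS; do
        [ "$1" = "$n" ] && return 0
    done
    return 1
}

# Return 0 if $1 is a system state this document lists as valid
# (Initialized, Online, Pause*, Fail*, Synch*).
is_valid_system_state() {
    case "$1" in
        Initialized|Online|Pause*|Fail*|Synch*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_valid_spu_count 56 && is_valid_system_state Online` succeeds, while a count of 55 fails the check and points to a removed device that still needs to be deleted.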
If the preceding assumptions are true, follow these directions to replace the failed SPU and enable it as a spare on the system. In these instructions, the hardware ID is <abcd> and its IP address is <A.B.C.D>.
1. Identify the IP address (A.B.C.D) of the SPU that is not functioning:
- nzinventory | grep abcd
2. Run /nz/support/bin/CheckSPU -ip A.B.C.D to obtain information about the SPU failure. If multiple SPUs are down, you can invoke it as /nz/support/bin/CheckSPU allFails. This investigates the failed device(s) and, if nzevents is set up properly, sends e-mail back to support. (The man page for CheckSPU is in /nz/support/cat/CheckSPU.8.txt.)
3. If the e-mail is not sent, attach the following output file to the ticket you have opened:
- /nz/support/hwlogs/CheckSPU-<uname>/CheckSPU-<uname>-YYMMDD.spa.slot.tgz
4. Fail over the SPU:
- nzsystem pause
- nzspu failover -id abcd
- nzsystem resume
5. Physically remove the device from the system and insert the replacement SPU.
6. After removing the device from the system, remove the SPU from the system catalog:
- nzspu delete -id abcd
7. It will take 1-2 minutes for the SPU to boot and be recognized by the system as Mismatched Initialized.
8. Issue nzinventory | grep A.B.C.D to verify the removed SPU <abcd> is no longer listed and that the replacement SPU is recognized by the system as Mismatched Initialized. (The system manager log will also log the discovery of a new SPU.)
9. Activate the newly inserted SPU:
- nzspu activate -ip A.B.C.D
This will set the role of the new SPU to Spare Initialized, making it available.
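As a reading aid, the SPU steps above can be collected into a dry-run checklist. The sketch below only prints the commands named in this document for a given hardware ID and IP address; it runs nothing. The function name spu_replace_plan is hypothetical, and the commented line marks the point where the physical swap takes place.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the procedure's commands, runs none of them.
# $1 = hardware ID of the failed SPU, $2 = its IP address.
spu_replace_plan() {
    hwid=$1
    ip=$2
    cat <<EOF
nzinventory | grep $hwid
/nz/support/bin/CheckSPU -ip $ip
nzsystem pause
nzspu failover -id $hwid
nzsystem resume
# physically remove the failed SPU and insert the replacement, then:
nzspu delete -id $hwid
nzinventory | grep $ip
nzspu activate -ip $ip
EOF
}
```

Calling `spu_replace_plan abcd A.B.C.D` prints the sequence with the placeholders used in this document, which can be reviewed before any command is typed on the host.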
Replacing a failed SFI
The following procedure should be followed when replacing a failed SFI in the system.
1. Stop the system (nzstop or nzsystem stop).
2. Remove power from the SPA containing the SFI.
3. Physically remove the SFI from the system.
4. Replace the failed SFI.
5. Reconnect the power and network cables (one network cable in port 1; for an HA system, refer to another SPA).
6. Start the system (nzstart).
7. Verify that the new SFI has been detected and has a role of Primary and a state of Up. (Use nzinventory show to do this.)
8. Determine the ID of the SFI (<xxxx>) in the Down state. It should have the same slot assignment as the new SFI. (Use nzinventory to do this.)
9. Run nzsfi delete -id xxxx, where xxxx is the ID of the SFI in the Down state. (Note that nzsfi returns a confusing and usually incorrect error message saying that it cannot delete the device. Ignore it.)
10. Run nzinventory show to verify that the removed SFI (<xxxx>) is no longer listed.
Note: The delete subcommand for nzspu and nzsfi requires that the user specify the device with the -id <hwId> option. The other designations do not uniquely identify the specific hardware component, only its current or last known physical location in the system.
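The note above can be illustrated with a small wrapper that refuses to build a delete command unless a hardware ID is supplied. This is a sketch under assumptions: build_delete_cmd is a hypothetical name, and it only prints the nzspu or nzsfi command shown in this document instead of executing it.

```shell
#!/bin/sh
# Hypothetical helper: prints a delete command keyed on the hardware ID.
# $1 = device kind ("spu" or "sfi"), $2 = hardware ID.
build_delete_cmd() {
    kind=$1
    hwid=$2
    # The delete subcommand needs -id <hwId>; reject a missing ID.
    if [ -z "$hwid" ]; then
        echo "hardware ID required" >&2
        return 1
    fi
    case "$kind" in
        spu|sfi) echo "nz$kind delete -id $hwid" ;;
        *) echo "unknown device kind: $kind" >&2; return 1 ;;
    esac
}
```

For example, `build_delete_cmd sfi xxxx` prints `nzsfi delete -id xxxx`, while calling it without an ID fails instead of producing a command that identifies the device by location only.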
Historical Number
NZ374926
Document Information
Modified date:
17 October 2019
UID
swg21575026