Troubleshooting
Problem
The SPU
Symptom
Issues reported in nzhealthcheck
Issues reported by Advanced Management Module (AMM)
Issues reported in sysmgr.log
Cause
Hardware issues
Diagnosing The Problem
Check nzhealthcheck report
[nz]$ nzhealthcheck
Check the SPU health via AMM
[nz]$ ssh mm00x health -l 2
Review sysmgr.log for errors logged against the SPU
[nz]$ grep -i 'hwid=<SPU_id>' /nz/kit/log/sysmgr/sysmgr.log
Report failures to IBM support.
Collect service data from AMM
[nz]$ /nz/kit/bin/adm/ibm_amm --loc=spax.mm --service_data=ibm_amm_x_output.tgz
where 'x' is the Snippet Processing Array (SPA) number.
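As a small illustration of the 'x' substitution above, the following sketch derives the SPA number from a location string such as spa1.spu5 and prints the resulting ibm_amm command without running it. The helper name amm_service_cmd is hypothetical and not part of the NPS kit; it assumes locations always start with "spa" followed by the SPA number.

```shell
# Hypothetical helper (not part of /nz/kit): print, but do not run, the
# AMM service-data command for the SPA that owns a given location.
amm_service_cmd() {
  amm_loc="$1"                 # e.g. spa1.spu5
  amm_spa="${amm_loc%%.*}"     # keep text before the first dot -> spa1
  amm_num="${amm_spa#spa}"     # strip the "spa" prefix          -> 1
  # Reject input that has no "spa" prefix or a non-numeric SPA number.
  if [ "$amm_spa" = "$amm_num" ]; then
    echo "cannot parse SPA number from '$amm_loc'" >&2
    return 1
  fi
  case "$amm_num" in
    ''|*[!0-9]*)
      echo "cannot parse SPA number from '$amm_loc'" >&2
      return 1
      ;;
  esac
  echo "/nz/kit/bin/adm/ibm_amm --loc=${amm_spa}.mm --service_data=ibm_amm_${amm_num}_output.tgz"
}
```

For example, amm_service_cmd spa1.spu5 prints the command with --loc=spa1.mm and ibm_amm_1_output.tgz, which can be reviewed before it is run as the nz user.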
Resolving The Problem
To fail over the SPU, run the following command:
[nz]$ nzhw failover -id <SPU_id>
Example: failing over spu0105 (location spa1.spu5, HW ID 1329)
[nz@nzhost~]$ nzhw -type spu
Description HW ID Location  Role   State  Security
----------- ----- --------- ------ ------ --------
SPU         1317  spa1.spu1 Active Online N/A
SPU         1323  spa1.spu7 Active Online N/A
SPU         1326  spa1.spu3 Active Online N/A
SPU         1329  spa1.spu5 Active Online N/A
[nz]$ nzhw failover -id 1329
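The example reads the HW ID for spa1.spu5 off the nzhw listing by eye. A minimal sketch of the same lookup with awk follows. The helper name spu_hwid_for is hypothetical, and it assumes the whitespace-separated column layout shown above (Description, HW ID, Location, ...).

```shell
# Hypothetical helper: read `nzhw -type spu` output on stdin and print the
# HW ID of the SPU at the given location. Exits non-zero if it is not found.
spu_hwid_for() {
  awk -v loc="$1" '
    # Data rows: field 1 is the description, 2 the HW ID, 3 the location.
    $1 == "SPU" && $3 == loc { print $2; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# On an NPS host this would be used as:
#   nzhw -type spu | spu_hwid_for spa1.spu5
```

The printed ID is what nzhw failover -id expects. Checking the exit status before passing the value on avoids calling the failover command with an empty ID.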
After the failover, the SPU and its associated DACs are listed in the output of nzhw -issues:
[nz@nzhost~]$ nzhw -issues
Description HW ID Location       Role     State   Security
----------- ----- -------------- -------- ------- --------
SPU         1329  spa1.spu5      Failed   Stopped N/A
DAC         1330  spa1.spu5.dac1 Inactive Ok      N/A
DAC         1331  spa1.spu5.dac2 Inactive Ok      N/A
[nz@nzhost~]$
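The check above can also be scripted. The sketch below confirms that a given location appears with the role Failed and counts the DAC rows beneath it. The helper name check_failover is hypothetical, and it assumes the nzhw -issues column layout shown in the example.

```shell
# Hypothetical helper: read `nzhw -issues` output on stdin and confirm that
# the SPU at the given location is listed with role "Failed".
check_failover() {
  awk -v loc="$1" '
    $1 == "SPU" && $3 == loc && $4 == "Failed" { failed = 1 }
    # DAC locations look like <spu location>.dacN, so match on that prefix.
    $1 == "DAC" && index($3, loc ".") == 1     { dacs++ }
    END {
      if (!failed) { print loc ": not listed as Failed"; exit 1 }
      print loc ": Failed, " (dacs + 0) " associated DAC(s) listed"
    }
  '
}

# On an NPS host this would be used as:
#   nzhw -issues | check_failover spa1.spu5
```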
When an SPU is failed over, the NPS system moves through the following system states:
system state change from 'Online' to 'Pausing Now'
system state change from 'Pausing Now' to 'Discovering'
system state change from 'Discovering' to 'Initializing'
system state change from 'Initializing' to 'Initialized'
system state change from 'Initialized' to 'Going Pre-Online'
system state change from 'Going Pre-Online' to 'Resuming'
system state change from 'Resuming' to 'Online'
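To confirm that a failover went through the full sequence above, the transition messages can be checked as a chain: each "from" state should equal the previous "to" state, and the chain should start and end at Online. The sketch below does this with awk. The helper name check_state_chain is hypothetical, and it assumes the messages appear in the log in the form shown above, with no other single quotes earlier on the line.

```shell
# Hypothetical helper: read log lines on stdin and verify that the
# "system state change" messages form an unbroken chain Online -> ... -> Online.
check_state_chain() {
  # Split on single quotes so that field 2 is the old state and field 4 the new one.
  awk -F"'" '
    /system state change from/ {
      if (n++ > 0 && $2 != prev) {
        print "gap: expected " prev ", saw " $2
        bad = 1
      }
      if (n == 1) first = $2
      prev = $4
    }
    END {
      if (n == 0 || bad || first != "Online" || prev != "Online") exit 1
      print "complete: " n " transitions, back to Online"
    }
  '
}

# Possible use on an NPS host (log path taken from the diagnosis steps):
#   grep 'system state change' /nz/kit/log/sysmgr/sysmgr.log | tail -7 | check_state_chain
```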
After the rebalancing of the data slices is completed, the NPS system comes online.
Run nzstate to confirm that the NPS system is Online after the failover.
Important: The failover aborts any running queries, and the system needs about 15 minutes to come back Online.
Aborted queries report the message "ERROR: Transaction rolled back due to restart or failover" in /nz/kit/log/postgres/pg.log and in open nzsql sessions.
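Because the system can take about 15 minutes to return to Online, it may be convenient to poll the state rather than rerun nzstate by hand. The following is a minimal sketch. The helper name wait_for_online and its defaults are hypothetical, and it assumes that nzstate prints the state in single quotes (for example, System state is 'Online'.), which should be verified on the target release.

```shell
# Hypothetical helper: poll a state command until it reports 'Online'.
#   $1 = command that prints the system state (default: nzstate)
#   $2 = timeout in seconds (default: 900, roughly 15 minutes)
#   $3 = polling interval in seconds (default: 30)
wait_for_online() {
  wfo_cmd="${1:-nzstate}"
  wfo_timeout="${2:-900}"
  wfo_interval="${3:-30}"
  wfo_elapsed=0
  while [ "$wfo_elapsed" -le "$wfo_timeout" ]; do
    # Match the quoted state so that 'Going Pre-Online' does not count as Online.
    if $wfo_cmd 2>/dev/null | grep -q "'Online'"; then
      echo "system is Online after ~${wfo_elapsed}s"
      return 0
    fi
    sleep "$wfo_interval"
    wfo_elapsed=$((wfo_elapsed + wfo_interval))
  done
  echo "system not Online after ${wfo_timeout}s" >&2
  return 1
}
```

Called with no arguments as the nz user, it polls nzstate every 30 seconds for up to 15 minutes. A non-zero return is a prompt to review sysmgr.log rather than proof of a fault.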
Document Information
Modified date:
17 October 2019
UID
swg21993339