IBM Support

Fail an unhealthy Snippet Processing Unit (SPU / S-blade) in a PureData System for Analytics environment (after the SPU has been diagnosed as unhealthy by Support)

Troubleshooting


Problem

The identified SPU has reported errors, and the recommendation is to fail it proactively.

Symptom


Issues reported in nzhealthcheck
Issues reported by Advanced Management Module (AMM)
Issues reported in sysmgr.log

Cause

Hardware issues

Diagnosing The Problem

Check the nzhealthcheck report
[nz]$ nzhealthcheck
 
Check the SPU health via AMM
[nz]$ ssh mm00x health -l 2
 
Review sysmgr.log for errors reported against the SPU
[nz]$ grep -i 'hwid=<SPU_id>' /nz/kit/log/sysmgr/sysmgr.log
 
Report any failures to IBM Support.
 
Collect service data from the AMM
[nz]$ /nz/kit/bin/adm/ibm_amm --loc=spax.mm --service_data=ibm_amm_x_output.tgz
where 'x' is the Snippet Processing Array (SPA) number.
 

Resolving The Problem

To fail over the SPU, run the following command:
 
[nz]$ nzhw failover -id <SPU_id>

Example: failing spu0105 (location spa1.spu5, HW ID 1329)

[nz@nzhost~]$ nzhw -type spu
Description HW ID Location  Role   State  Security
----------- ----- --------- ------ ------ --------
SPU         1317  spa1.spu1 Active Online N/A
SPU         1323  spa1.spu7 Active Online N/A
SPU         1326  spa1.spu3 Active Online N/A
SPU         1329  spa1.spu5 Active Online N/A

[nz]$ nzhw failover -id 1329
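
The HW ID passed to nzhw failover has to be read from the nzhw -type spu listing. As a minimal sketch, the lookup by location could be scripted as follows. The function name spu_hwid_by_location is hypothetical, and the sketch assumes the column order shown in the example above (Description, HW ID, Location, Role, State, Security), with one device per line.

```shell
# Hypothetical helper: print the HW ID of the SPU at a given location.
# Reads `nzhw -type spu` output on stdin and assumes the column order
# shown in the example (Description, HW ID, Location, ...).
spu_hwid_by_location() {
    awk -v loc="$1" '$1 == "SPU" && $3 == loc { print $2 }'
}

# Usage on an NPS host (not executed here):
#   nzhw -type spu | spu_hwid_by_location spa1.spu5
```

Verify the returned ID against the listing before running the failover, because failing the wrong SPU aborts running queries for no benefit.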

After the failover, the SPU and its associated DACs are listed in the output of nzhw -issues:

[nz@nzhost~]$ nzhw -issues
Description HW ID Location       Role     State   Security
----------- ----- -------------- -------- ------- --------
SPU         1329  spa1.spu5      Failed   Stopped N/A
DAC         1330  spa1.spu5.dac1 Inactive Ok      N/A
DAC         1331  spa1.spu5.dac2 Inactive Ok      N/A
[nz@nzhost~]$
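
The check that the SPU now appears with the Failed role can also be scripted. The following is a sketch only: spu_is_failed is a hypothetical name, and it assumes that each device in the nzhw -issues output occupies a single line with the role in the fourth whitespace-separated field.

```shell
# Hypothetical helper: exit 0 if the given HW ID is listed as an SPU
# with role "Failed", otherwise exit 1.
# Reads `nzhw -issues` output on stdin; assumes one device per line.
spu_is_failed() {
    awk -v id="$1" '
        $1 == "SPU" && $2 == id && $4 == "Failed" { found = 1 }
        END { exit !found }'
}

# Usage on an NPS host (not executed here):
#   nzhw -issues | spu_is_failed 1329 && echo "SPU 1329 is failed"
```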


When an SPU is failed, the NPS system moves through the following system states:

system state change from 'Online' to 'Pausing Now'
system state change from 'Pausing Now' to 'Discovering'
system state change from 'Discovering' to 'Initializing'
system state change from 'Initializing' to 'Initialized'
system state change from 'Initialized' to 'Going Pre-Online'
system state change from 'Going Pre-Online' to 'Resuming'
system state change from 'Resuming' to 'Online'
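
To follow this sequence without reading every line, the most recent target state can be extracted from the messages. This is a minimal sketch with a hypothetical function name, last_system_state. It assumes the messages have exactly the wording listed above, and that they are written to sysmgr.log, which is an assumption rather than something this document states.

```shell
# Hypothetical helper: print the state named after "to" in the last
# "system state change from 'X' to 'Y'" line read from stdin.
last_system_state() {
    sed -n "s/.*system state change from '[^']*' to '\([^']*\)'.*/\1/p" | tail -n 1
}

# Usage (log location is an assumption; not executed here):
#   last_system_state < /nz/kit/log/sysmgr/sysmgr.log
```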

After the data slices have been rebalanced, the NPS system comes online.

Run nzstate to confirm that the NPS system is Online after the failover.


Important: Running queries are aborted by the failover, and the system needs about 15 minutes to return to the Online state.
The message "ERROR: Transaction rolled back due to restart or failover" appears in /nz/kit/log/postgres/pg.log and in open nzsql sessions.
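
To gauge how many transactions were affected, the rollback messages in pg.log can be counted. The sketch below uses a hypothetical function name, count_failover_rollbacks, and matches only the message text quoted above. It counts matching lines across the whole input, so older failovers recorded in the same log are included.

```shell
# Hypothetical helper: count lines on stdin that report a transaction
# rolled back because of a restart or failover.
count_failover_rollbacks() {
    grep -c 'Transaction rolled back due to restart or failover'
}

# Usage (not executed here):
#   count_failover_rollbacks < /nz/kit/log/postgres/pg.log
```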

 


Document Information

Modified date:
17 October 2019

UID

swg21993339