IBM Support

PureData for Analytics reports multiple failed components within a single SPA

Troubleshooting


Problem

The customer receives multiple alerts about failed or unreachable components within a single SPA while the system remains online.

Symptom

Monitoring commands return output similar to the following:
nzhw -issues

Description HW ID Location    Role   State
----------- ----- ----------- ------ -----------
MM          1234  spa1.mm1    Active Warning
PowerSupply 1235  spa1.pwr3   Active Missing
EthSw       1236  spa1.ethsw1 Active Unreachable
EthSw       1237  spa1.ethsw2 Active Unreachable
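
To check whether the reported issues really are confined to one SPA, the rows can be grouped by the SPA prefix of the Location column. The following is a minimal sketch, not an IBM-supplied tool: the function name count_issues_per_spa is hypothetical, and it assumes the column layout shown above, with a single-word Description so that Location is the third field of each data row.

```shell
#!/bin/bash
# Count nzhw issue rows per SPA.
# Assumption: the third whitespace-separated field of each data row is the
# location (for example spa1.pwr3). Header and separator rows do not match
# the pattern and are skipped.
count_issues_per_spa() {
    awk '$3 ~ /^spa[0-9]+\./ {
             split($3, loc, ".")      # loc[1] is the SPA name, e.g. "spa1"
             count[loc[1]]++
         }
         END {
             for (spa in count) print spa, count[spa]
         }' | sort
}

# Intended use on the host (shown as a comment, not executed here):
#   nzhw -issues | count_issues_per_spa
```

A single output line (one SPA carrying all the issues) matches the situation described in this document; several lines suggest a different or broader problem.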


ssh mm001 health -l a

system> health -l a -f
system: Critical
mm[1] : OK
mm[2] : Critical
Media Tray 1 hardware failure.
Power module 1 or 2 is required to power blades in power domain 1.
Insufficient chassis power to support redundancy
Power module 3 or 4 is required to power blades in power domain 2.
Chassis temperature device is unavailable. Cooling capacity set to maximum.
blade[1] : Non-Critical
(SN#YK11509CW2SA) Blade incompatible with I/O module configuration
blade[3] : Non-Critical
(SN#YK115001Y1YN) Blade incompatible with I/O module configuration
blade[5] : Non-Critical
(SN#YK11509CW2P5) Blade incompatible with I/O module configuration
blade[7] : Non-Critical
(SN#YK11509CW2GW) Blade incompatible with I/O module configuration
blade[9] : Non-Critical
(SN#YK105002GFM1) Blade incompatible with I/O module configuration
blade[11] : Non-Critical
(SN#YK115001Y1SR) Blade incompatible with I/O module configuration
power[1] : Critical
Power module 1 communication failure
power[2] : Critical
Power module 2 communication failure
power[3] : Critical
Power module 3 communication failure
power[4] : Critical
Power module 4 communication failure
blower[1] : OK
blower[2] : OK
switch[1] : Critical
I/O module 1 fault
I/O module 1 incompatible with blade configuration
switch[2] : Critical
I/O module 2 fault
I/O module 2 incompatible with blade configuration
switch[3] : Critical
I/O module 3 fault
I/O module 3 incompatible with blade configuration
switch[4] : Critical
I/O module 4 fault
I/O module 4 incompatible with blade configuration

Replacing or reseating the AMM or media tray, or replacing the midplane, does not improve the situation.
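
The health output above is long. A short filter can reduce it to the components reported as Critical. This is a sketch under assumptions: the function name list_critical is hypothetical, and it relies on the "name : status" line format shown in the sample, treating lines without exactly one colon as detail text.

```shell
#!/bin/bash
# Print the name of every component whose status is exactly "Critical".
# "Non-Critical" and "OK" entries and free-text detail lines are ignored.
list_critical() {
    awk -F':' '
        NF == 2 {
            name = $1; status = $2
            # Trim surrounding blanks and a possible trailing carriage return.
            gsub(/^[ \t]+|[ \t\r]+$/, "", name)
            gsub(/^[ \t]+|[ \t\r]+$/, "", status)
            if (status == "Critical") print name
        }'
}

# Intended use (shown as a comment, not executed here):
#   ssh mm001 health -l a | list_critical
```

If all four power modules and all I/O modules appear in the result at the same time, as in the sample, the pattern matches the one described in this document.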

Cause

A possible cause of this behaviour is the presence of two different power supply unit models in the same chassis:

Model 1)
Part no.: 69Y5815
FRU no.: 69Y5816

Model 2)
Part no.: 39Y7408
FRU no.: 39Y7409

Diagnosing The Problem

Verify the power supply unit type in the H-Chassis by running the following command:
for i in {1..4}; do ssh mm0XX info -T "power[$i]" | grep -i "FRU no"; done;

where XX in mm0XX is the number of the affected SPA.

Sample command and output:
for i in {1..4}; do ssh mm001 info -T "power[$i]" | grep -i "FRU no"; done;

FRU no.: 39Y7409
FRU no.: 39Y7409
FRU no.: 39Y7409
FRU no.: 39Y7409
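
The four FRU lines can also be summarized automatically, which helps when several chassis have to be checked. The sketch below is illustrative only: the function name summarize_psu_frus and the UNIFORM / MIXED / NONE labels are hypothetical, and it assumes the "FRU no.: <number>" line format shown in the sample output.

```shell
#!/bin/bash
# Read "FRU no.: <number>" lines on stdin and report whether the power
# supplies share one FRU number or several different ones.
summarize_psu_frus() {
    local frus n
    # Keep only the FRU numbers, stripped of blanks, one distinct value per line.
    frus=$(awk -F':' 'tolower($1) ~ /fru no/ {
                          gsub(/[ \t\r]/, "", $2); print $2
                      }' | sort -u)
    # Count non-empty lines; grep exits non-zero when the count is 0.
    n=$(printf '%s\n' "$frus" | grep -c . || true)
    if [ "$n" -gt 1 ]; then
        echo "MIXED: $(printf '%s\n' "$frus" | paste -sd ' ' -)"
    elif [ "$n" -eq 1 ]; then
        echo "UNIFORM: $frus"
    else
        echo "NONE"
    fi
}

# Intended use (shown as a comment, not executed here):
#   for i in {1..4}; do ssh mm001 info -T "power[$i]"; done | summarize_psu_frus
```

A MIXED result corresponds to the two-model situation listed under Cause. A UNIFORM result of 39Y7409 corresponds to the sample output above.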

Resolving The Problem

If communication errors still occur after the AMM and/or media tray has been replaced, replace the power supply units of FRU 39Y7409 with units of FRU 69Y5816.


Document Information

Modified date:
17 October 2019

UID

swg21686304