IBM Support

When Parts Fail: Impacts to your IIAS system when hardware components fail and when they are serviced

Flashes (Alerts)


Abstract

IIAS was designed with high availability in mind. Because hardware components can occasionally fail, it is important to understand the impact of a failure on the overall IIAS system, both when the failure occurs and when the failed component is serviced.

Content

The table below lists all serviceable components in the IBM Integrated Analytics System (IIAS), the impact to the system when each component fails, the recovery that is attempted after the failure, and how the system is returned to normal operation after service has been performed. IIAS has been designed with few single points of failure and with significant redundancy. The two "SubSystem Online" columns (color-coded in the original table) indicate whether that subsystem (node, storage array, network switch) remains available after component failure and during service. In the case of the network switches (management, fabric or Fibre Channel), the "subsystem" is the network as a whole, made up of the two like switches.
If a component failure within one of the system's server nodes makes that node unavailable, the database will restart with (N-1) nodes. On larger systems, spare nodes are allocated, so that even these failures (beyond the first rack) result in only a minimal interruption for the restart.
| SubSystem | Component | SubSystem Online after Failure | SubSystem Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| Node | Disk | Yes | Yes | performance impact during rebuild after replacement | rebuild after replacement | automatic when rebuild completes | |
| Node | Fan | Yes | Yes | reduced cooling | none | automatic | |
| Node | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Node | Battery | Yes | No | none unless power is lost | none | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | PCIe-Fabric+Mgmt | Yes | No | performance impact until repair | alternate path recovery | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | PCIe-SAN | Yes | No | performance impact until repair | alternate path recovery | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | DIMM | Yes | No | performance impact until repair | node fails, reboots with less memory | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Processor Module | Yes | No | performance impact until repair | node fails, reboots with less CPU | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | SAS HBA | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Motherboard | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Anchor card | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Flash 900 | Battery | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Flash 900 | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Flash 900 | Canister Fan | Yes | Yes | reduced cooling | remaining fans increase RPMs | automatic | |
| Flash 900 | Flash Module | Yes | Yes | performance impact during rebuild | auto rebuild to spare | replacement module becomes spare | |
| Flash 900-AE2 | Canister | Yes | No | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4001 appliances |
| Flash 900-AE2 | Interface Card | Yes | No | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4001 appliances |
| Flash 900-AE3 | Canister | Yes | Yes | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4002 appliances |
| Flash 900-AE3 | Interface Card | Yes | Yes | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4002 appliances |
| Flash 900 | Power Interposer | Maybe | No | repair requires system outage | may or may not cause array outage | ap stop (if needed), repair, verify, ap start | |
| Flash 900 | Midplane | Maybe | No | repair requires system outage | may or may not cause array outage | ap stop (if needed), repair, verify, ap start | |
| Flash 900 | Front Panel LED | Yes | No | repair requires system outage | none | ap stop (if needed), repair, verify, ap start | |
| Storwize V5000 | Disk Drive | Yes | Yes | Performance impact until rebuild to spare completes | Auto rebuild to spare | Replacement HDD becomes spare | |
| Storwize V5000 | Power Supply | Yes | Yes | No interrupt (N+1) | none | automatic | |
| Storwize V5000 | Battery Pack | Yes | Yes | No interrupt (N+1) | none | automatic | Inside canister |
| Storwize V5000 | CMOS Coin Battery | Yes | Yes | No interrupt (N+1) | none | automatic | Inside canister |
| Storwize V5000 | Canister | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | |
| Storwize V5000 | Canister Memory | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | Inside canister |
| Storwize V5000 | Interface Card | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | Inside canister |
| Storwize V5000 | Interface SFP | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | |
| Storwize V5000 | Enclosure midplane | No | No | Loss of access to specific cool storage until repair is complete | None possible | Bring V5000 back online | (*) will cause any queries which include cool storage tables to fail |
| Fabric Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| Fabric Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Fabric Switch | Fan | Yes | Yes | no interrupt (N+1) | none | automatic | |
| SAN Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| SAN Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Mgmt Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| Mgmt Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| RPC | | Yes | Yes | no interrupt (N+1) | none | replace, customize config, verify | |
| Display | | Yes | Yes | no impact unless appliance mgmt required | none | replace, customize config, verify | |
| Terminal Server | | Yes | Yes | no impact unless appliance mgmt required | none | replace, customize config, verify | |
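When planning a service call, the two "SubSystem Online" columns in the table above can be treated as a simple lookup. The following is a minimal Python sketch that encodes a handful of rows copied from the table and filters them. The names `ComponentImpact`, `needs_service_window` and `survives_failure` are hypothetical and are not part of any IIAS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentImpact:
    """One row of the table, reduced to the two availability columns."""
    subsystem: str
    component: str
    online_after_failure: str   # "Yes", "No" or "Maybe", as in the table
    online_during_service: str  # "Yes" or "No", as in the table


# A small sample of rows copied from the table above; not the full list.
ROWS = [
    ComponentImpact("Node", "Disk", "Yes", "Yes"),
    ComponentImpact("Node", "DIMM", "Yes", "No"),
    ComponentImpact("Node", "Motherboard", "No", "No"),
    ComponentImpact("Flash 900", "Flash Module", "Yes", "Yes"),
    ComponentImpact("Flash 900", "Midplane", "Maybe", "No"),
    ComponentImpact("Fabric Switch", "Power Supply", "Yes", "Yes"),
]


def needs_service_window(rows):
    """Rows whose subsystem is not online while the part is being serviced."""
    return [r for r in rows if r.online_during_service != "Yes"]


def survives_failure(rows):
    """Rows whose subsystem stays online when the part fails ("Yes" only)."""
    return [r for r in rows if r.online_after_failure == "Yes"]


if __name__ == "__main__":
    for r in needs_service_window(ROWS):
        print(f"{r.subsystem} / {r.component}: subsystem is offline during service")
```

Treating "Maybe" as not guaranteed online is a conservative choice made for this sketch; the table itself only says that the failure may or may not cause an array outage.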
Only two components would result in a full IIAS system outage beyond a database stop and restart: the power interposer and the midplane of the FS900 flash array. Both are mechanical components with no electrical elements and extremely low failure rates, which contribute to the 99.999% availability characteristic of the FS900 array. Even then, some failure modes of these parts do NOT take the array offline; however, the array (and the system) would have to be offline for service.
Other than these two components in the FS900 arrays, the overall system impacts of any hardware failure or replacement are potential performance degradation or brief interruptions due to application (database) restarts during failure or service. For the most part, the system is designed to keep operating after a hardware failure.
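To put the 99.999% availability figure quoted above for the FS900 array into perspective, it can be converted into a downtime budget. The sketch below is plain arithmetic in Python, not a measured result; the function name `max_downtime_minutes` is hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year, in minutes, implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # 99.999% availability allows roughly 5.26 minutes of downtime per year.
    print(round(max_downtime_minutes(99.999), 2))
```

This applies to the array's availability characteristic only; it says nothing about planned service windows for the rest of the system.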


Document Information

Modified date:
26 September 2022

UID

ibm11102227