
When Parts Fail: Impacts to your Cloud Pak for Data System (ICPDS) when hardware components fail and when they are serviced

White Papers


Abstract

The IBM Cloud Pak for Data System (ICPDS) was designed with High Availability in mind. But because hardware components can fail from time to time, it is important to understand the impact those failures have on the overall ICPDS system, both when the failure occurs and when the failed component is serviced.

Content

Objective

The table below lists all serviceable components in the IBM Cloud Pak for Data System (ICPDS), the impact to the system when each component fails, the recovery that is attempted after the failure, and how the system is returned to normal operation after service has been performed.  ICPDS has been designed with few single points of failure and with significant redundancy.  Additionally, the software applications you might deploy on this system have redundancy and failover built into them.  The two availability columns in the table ("System Online after Failure" and "System Online during Service") indicate whether the system remains online after a component failure and/or while that component is being serviced.
 
In the case of the network switches (management or fabric), the "subsystem" is viewed as that network as a whole, composed of the two like switches.

Environment

For some server component failures, the impact depends largely on the specifics of the application environment.  Certain component failures within a server node cause that node to become unavailable automatically; in other scenarios, the node is unavailable only during the time of service.  In these instances, if the node in question is running as a Netezza (NPS) server node, the customer might choose either to shut down the database for the service procedure, or to remove the one server from the Netezza cluster, keeping the database operational but incurring delays for reconfiguration both before and after the service is conducted.
Lenovo Enclosure components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| Enclosure | Power Supply | Yes | Yes | None (N+1) | None | Automatic | |
| Enclosure | Fan - 60A | Yes | Yes | Reduced cooling | Other fans increase RPMs | Automatic | |
| Enclosure | Fan - 80A | Yes | Yes | Reduced cooling | Other fans increase RPMs | Automatic | |
| Enclosure | SMM Module | Yes | Yes | Loss of enclosure management, monitoring and control | None | Automatic | |
| Enclosure | EIOM | Yes | No | Depends on enclosure personality: ICP4D control enclosures require a full system outage; ICP4D worker enclosures also (for now) require a system outage; an expansion enclosure running NPS SPUs only requires NPS to be shut down | None; either the system or the specific application (NPS) must be shut down to service an enclosure | Configuration and application dependent; should be automatic in most cases once the serviced enclosure's nodes are ENABLED. For NPS, the Netezza database must be started (nzstart) once the enclosure is returned to service | |
| Enclosure | x16 Shuttle | Maybe | No | | | | |
| Enclosure | x16 PIOR-Right Assembly | Maybe | No | | | | |
| Enclosure | x16 PIOR-Left Assembly | Maybe | No | | | | |
| Enclosure | Main Chassis | Maybe | No | | | | |
| Enclosure | Cable Management Arm | Maybe | No | | | | |
| Enclosure | PCIe cassette | Maybe | No | | | | |
| Enclosure | Rail Kit | Maybe | No | | | | |

Lenovo SD530 Node components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| Node | NVMe Data Drive | Yes | Yes | Loss of data replica | In case of an NPS node data drive failure, NPS will automatically regenerate the data on another node's spare data slice | | |
| Node | SFP adapter | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| Node | Ethernet Cable | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| Node | Chassis Line Cord | Yes | Yes | None (N+1 power) | None | None | |
| Node | Planar | Yes | Maybe | Configuration dependent; most applications and pods should be deployed in a clustered configuration, so loss of a single node would not cause application failure, although there might be performance degradation | In most cases, Kubernetes will fail over all pods and applications running on the failed (or disabled) node, and the system will continue to run. The only exception would be an application running on a single node without backup | Configuration and application dependent; should be automatic in most cases once the serviced node is ENABLED | For NPS (SPU) nodes, the failed node may continue to run, but that node must be powered off for service. In these instances, the customer may choose to stop the Netezza application for the duration of service, or instead to remove the one SPU from the database cluster (requiring regens to run at the start and the end of the replacement procedure) |
| Node | PCIe FPGA / Network card | Yes | Maybe | | | | |
| Node | CPU | Yes | Maybe | | | | |
| Node | CPU Heat Sink | Yes | Maybe | | | | |
| Node | DIMM | Yes | Maybe | | | | |
| Node | NVMe Backplane | Yes | Maybe | | | | |
| Node | M.2 Controller | Yes | Maybe | | | | |
| Node | M.2 Flash Module | Yes | Maybe | | | | |
| Node | M.2 Retainer | Yes | Maybe | | | | |
| Node | KVM Module | Yes | Maybe | | | | |
| Node | Internal NVMe cable | Yes | Maybe | | | | |
| Node | Internal NVMe cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Node cover | Yes | Yes | None | None | None | |
| Node | Node Air baffle | Yes | Yes | | | | |
| Node | Fan Door Cover | Yes | Yes | | | | |
| Node | Blank Filler | Yes | Yes | | | | |
| Node | KVM Dongle | Yes | Yes | | | | |
| Node | Label, NVMe x4 SSL | Yes | Yes | | | | |
| Node | Label, OEM Node label | Yes | Yes | | | | |

Mellanox (Edge-Core) Management Switch components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| MgtSw | 3454-A3C | Yes | Yes | Loss of management, monitoring and control capabilities only | | | The management switch can be swapped concurrently |
| MgtSw | Power Supply with Fans | Yes | Yes | None (N+1) | None | None | |
| MgtSw | Air Duct | Yes | Yes | None | None | None | |

Mellanox Fabric Switch components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| FabSw | Single 3454-B8C Switch | No | No | No connectivity | None | Automatic | |
| FabSw | HA switch configurations | Yes | Yes | | | | HA switch configurations will allow concurrent fabric switch replacement with no loss of system availability (although performance may be impacted) |
| FabSw | Power Supply | Yes | Yes | None (N+1) | None | None | |
| FabSw | Fan | | | | | | |
| FabSw | 25Gb Transceiver | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| FabSw | 10Gb Transceiver | Yes | Yes | | | | |
| FabSw | Air Duct | Yes | Yes | None | None | None | |

(Optional) Power Distribution Units (PDUs)

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| PDU | Power Distribution Unit | Yes | Yes | None (N+1) | None | None | |
| PDU | PDU Line Cord | | | | | | |

Additional Information

Most components have built-in redundancy or some level of HA and failover.  The only notable exceptions are:
  1. For base systems only - a single high-speed Fabric switch failure.
    • Should the switch fail, or need to be replaced, the system will be completely unavailable.
    • This single-switch configuration is only supported for the base configuration (two compute enclosures for Lenovo-based systems, four compute servers for Dell).  Once the system is expanded, switch pairs are required, both for availability and for bandwidth.
  2. Lenovo enclosure failures
    • Most active (electronic) components at the Lenovo enclosure level (power supplies, fans, system management module) are fully redundant, hot pluggable, or both.  Most other enclosure-level components (with the exception of the EIOM) are purely mechanical parts with extremely low failure rates.  Replacing any of these components requires the enclosure to be disabled, which in turn may or may not disrupt system operation, depending on what was running on that enclosure's servers.
All other components of the ICPDS system are redundant or configured in such a way that the system can continue to operate (even at diminished capacity) should a failure occur.  Detection, reporting and notifications of hardware issues are also designed to minimize impact to your ongoing operations.
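For teams scripting their own service-window planning, the availability columns and the base-system exception above can be encoded in a few lines. This is purely an illustrative sketch, not a supported ICPDS API: the component keys, the `base_system` flag, and the function name are invented for this example, and `None` stands in for the table's workload-dependent "Maybe" entries.

```python
# Illustrative sketch only: encode the "System Online after Failure" /
# "System Online during Service" columns for a few representative components.
# All names here are invented for the example; None means "Maybe"
# (workload-dependent), per the table above.

AVAILABILITY = {
    # component: (online_after_failure, online_during_service)
    "enclosure_power_supply": (True, True),   # N+1 redundant, hot pluggable
    "enclosure_eiom":         (True, False),  # service requires enclosure down
    "fabric_switch":          (True, True),   # assumes an HA switch pair
    "node_planar":            (True, None),   # "Maybe": depends on workload
}

def system_online(component, during_service=False, base_system=False):
    """Return True/False, or None when the outcome is workload-dependent."""
    # Exception from "Additional Information": a base system has only one
    # fabric switch, so its failure (or replacement) takes the system down.
    if component == "fabric_switch" and base_system:
        return False
    after_failure, during = AVAILABILITY[component]
    return during if during_service else after_failure

assert system_online("enclosure_power_supply") is True
assert system_online("fabric_switch", base_system=True) is False
assert system_online("enclosure_eiom", during_service=True) is False
```

A real tool would drive such a lookup from the full component table rather than the handful of entries sketched here.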


Document Information

Modified date:
20 May 2020

UID

ibm16210524