
When Parts Fail: Impacts to your Cloud Pak for Data System (ICPDS) when hardware components fail and when they are serviced

White Papers


Abstract

The IBM Cloud Pak for Data System (ICPDS) was designed with High Availability in mind. But because hardware components can fail from time to time, it is important to understand the impact those failures have on the overall ICPDS system, both when the failure occurs and when the failed component is serviced.

Content

Objective

The table below lists all serviceable components in the IBM Cloud Pak for Data System (ICPDS), the impact to the system when each component fails, the recovery that is attempted after the failure, and how the system is returned to normal operation after service has been performed.  ICPDS has been designed with few single points of failure and with significant redundancy.  Additionally, the software applications you might deploy on this system have redundancy and failover built into them.  The two availability columns in the table ("System Online after Failure" and "System Online during Service") indicate whether the system remains online after a component failure and/or while that component is being serviced.
 
In the case of the network switches (management or fabric), the "subsystem" is viewed as that network as a whole, composed of the two like switches.

Environment

For some server component failures, the impact depends largely on the specifics of the application environment.  Certain component failures within a server node cause that node to become unavailable automatically; in other scenarios, the node is unavailable only during the time of service.  In these instances, if the node in question is running as a Netezza (NPS) server node, the customer might choose either to shut down the database for the service procedure, or to remove the one server from the Netezza cluster, keeping the database operational but incurring delays for reconfiguration both before and after the service is conducted.
Lenovo Enclosure components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| Enclosure | Power Supply | Yes | Yes | None (N+1) | None | Automatic | |
| Enclosure | Fan - 60A | Yes | Yes | Reduced cooling | Other fans increase RPMs | Automatic | |
| Enclosure | Fan - 80A | Yes | Yes | Reduced cooling | Other fans increase RPMs | Automatic | |
| Enclosure | SMM Module | Yes | Yes | Loss of enclosure management, monitoring and control | None | Automatic | |
| Enclosure | EIOM | Yes | No | Depends on enclosure personality: ICP4D control enclosures require a full system outage; ICP4D worker enclosures also (for now) require a system outage; an expansion enclosure running NPS SPUs only requires NPS to be shut down | None; either the system or the specific application (NPS) must be shut down to service an enclosure | Configuration and application dependent; should be automatic in most cases once the serviced enclosure's nodes are ENABLED. For NPS, the Netezza database must be started (nzstart) once the enclosure is returned to service | |
| Enclosure | x16 Shuttle | Maybe | No | | | | |
| Enclosure | x16 PIOR-Right Assembly | Maybe | No | | | | |
| Enclosure | x16 PIOR-Left Assembly | Maybe | No | | | | |
| Enclosure | Main Chassis | Maybe | No | | | | |
| Enclosure | Cable Management Arm | Maybe | No | | | | |
| Enclosure | PCIe cassette | Maybe | No | | | | |
| Enclosure | Rail Kit | Maybe | No | | | | |

Lenovo SD530 Node components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| Node | NVMe Data Drive | Yes | Yes | Loss of data replica | In case of an NPS node data drive failure, NPS will automatically regenerate the data on another node's spare data slice | | |
| Node | SFP adapter | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| Node | Ethernet Cable | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| Node | Chassis Line Cord | Yes | Yes | None (N+1 power) | None | None | |
| Node | Planar | Yes | Maybe | Configuration dependent; most applications and pods should be deployed in a clustered configuration, so loss of a single node would not cause application failure, although there might be performance degradation | In most cases, Kubernetes will fail over all pods and applications running on the failed (or disabled) node, and the system will continue to run. The only exception would be an application running on a single node without backup | Configuration and application dependent; should be automatic in most cases once the serviced node is ENABLED | For NPS (SPU) nodes, the failed node may continue to run, but that node must be powered off for service. In these instances, the customer may choose to stop the Netezza application for the duration of service, or instead to remove the one SPU from the database cluster (requiring regens to run at the start and the end of the replacement procedure) |
| Node | PCIe FPGA / Network card | Yes | Maybe | | | | |
| Node | CPU | Yes | Maybe | | | | |
| Node | CPU Heat Sink | Yes | Maybe | | | | |
| Node | DIMM | Yes | Maybe | | | | |
| Node | NVMe Backplane | Yes | Maybe | | | | |
| Node | M.2 Controller | Yes | Maybe | | | | |
| Node | M.2 Flash Module | Yes | Maybe | | | | |
| Node | M.2 Retainer | Yes | Maybe | | | | |
| Node | KVM Module | Yes | Maybe | | | | |
| Node | Internal NVMe cable | Yes | Maybe | | | | |
| Node | Internal NVMe cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Internal Cable | Yes | Maybe | | | | |
| Node | Node cover | Yes | Yes | None | None | None | |
| Node | Node Air baffle | Yes | Yes | | | | |
| Node | Fan Door Cover | Yes | Yes | | | | |
| Node | Blank Filler | Yes | Yes | | | | |
| Node | KVM Dongle | Yes | Yes | | | | |
| Node | Label, NVMe x4 SSL | Yes | Yes | | | | |
| Node | Label, OEM Node label | Yes | Yes | | | | |

Mellanox (Edge-Core) Management Switch components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| MgtSw | 3454-A3C | Yes | Yes | Loss of management, monitoring and control capabilities only | | | The management switch can be swapped concurrently |
| MgtSw | Power Supply with Fans | Yes | Yes | None (N+1) | None | None | |
| MgtSw | Air Duct | Yes | Yes | None | None | None | |

Mellanox Fabric Switch components

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| FabSw | Single 3454-B8C Switch | No | No | No connectivity | None | Automatic | |
| FabSw | HA switch configurations | Yes | Yes | | | | HA switch configurations will allow concurrent fabric switch replacement with no loss of system availability (although performance may be impacted) |
| FabSw | Power Supply | Yes | Yes | None (N+1) | None | None | |
| FabSw | Fan | | | | | | |
| FabSw | 25Gb Transceiver | Yes | Yes | Possible loss of bandwidth | Alternate path recovery | Automatic | |
| FabSw | 10Gb Transceiver | Yes | Yes | | | | |
| FabSw | Air Duct | Yes | Yes | None | None | None | |

(Optional) Power Distribution Units (PDUs)

| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
|---|---|---|---|---|---|---|---|
| PDU | Power Distribution Unit | Yes | Yes | None (N+1) | None | None | |
| PDU | PDU Line Cord | | | | | | |

Additional Information

Most components have built-in redundancy or some level of HA and failover.  The only notable exceptions are:
  1. For base systems only - a single high-speed Fabric switch failure.
    • Should the switch fail, or need to be replaced, the system will be completely unavailable.
    • This single-switch configuration is only supported for the base configuration (two compute enclosures for Lenovo-based systems, four compute servers for Dell).  Once the system is expanded, switch pairs are required, both for availability and for bandwidth.
  2. Lenovo enclosure failures
    • Most active (electronic) components at the Lenovo enclosure level (power supplies, fans, system management module) are fully redundant, hot pluggable, or both.  Most other enclosure-level components (with the exception of the EIOM) are purely mechanical parts with extremely low failure rates.  Replacing any of these components requires the enclosure to be disabled, which in turn may or may not disrupt system operation, depending on what was running on that enclosure's servers.
All other components of the ICPDS system are redundant or configured in such a way that the system can continue to operate (even at diminished capacity) should a failure occur.  Detection, reporting and notifications of hardware issues are also designed to minimize impact to your ongoing operations.
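For teams scripting their own service-window planning, the availability columns and the base-system exception above can be encoded in a few lines. This is purely an illustrative sketch, not a supported ICPDS API: the component keys, the `base_system` flag, and the function name are invented for this example, and `None` stands in for the table's workload-dependent "Maybe" entries.

```python
# Illustrative sketch only: encode the "System Online after Failure" /
# "System Online during Service" columns for a few representative components.
# All names here are invented for the example; None means "Maybe"
# (workload-dependent), per the table above.

AVAILABILITY = {
    # component: (online_after_failure, online_during_service)
    "enclosure_power_supply": (True, True),   # N+1 redundant, hot pluggable
    "enclosure_eiom":         (True, False),  # service requires enclosure down
    "fabric_switch":          (True, True),   # assumes an HA switch pair
    "node_planar":            (True, None),   # "Maybe": depends on workload
}

def system_online(component, during_service=False, base_system=False):
    """Return True/False, or None when the outcome is workload-dependent."""
    # Exception from "Additional Information": a base system has only one
    # fabric switch, so its failure (or replacement) takes the system down.
    if component == "fabric_switch" and base_system:
        return False
    after_failure, during = AVAILABILITY[component]
    return during if during_service else after_failure

assert system_online("enclosure_power_supply") is True
assert system_online("fabric_switch", base_system=True) is False
assert system_online("enclosure_eiom", during_service=True) is False
```

A real tool would drive such a lookup from the full component table rather than the handful of entries sketched here.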


Document Information

Modified date:
20 May 2020

UID

ibm16210524