Abstract
The IBM Cloud Pak for Data System (ICPDS) was designed with high availability in mind. But because hardware components can occasionally fail, it is important to understand the impact those failures have on the overall system, both when the failure occurs and when the failed component is serviced.
Content
Objective
The table below lists all serviceable components in the IBM Cloud Pak for Data System (ICPDS), the impact to the system when each component fails, the recovery attempted after the failure, and how the system is returned to normal operation after service is performed. ICPDS is designed with few single points of failure and with significant redundancy. Additionally, the software applications you might deploy on this system have redundancy and failover built into them. The two "System Online" columns in the table below indicate whether the system remains online after a component failure and/or during component service.
In the case of the network switches (management or fabric), the "subsystem" is the network itself, comprising the two like switches.
Environment
For some server component failures, the impact depends largely on the specifics of the application environment. Certain component failures within a server node cause that node to become unavailable immediately; in other scenarios, the node is unavailable only during service. In these instances, if the node in question is running as a Netezza (NPS) server node, the customer might choose to simply shut down the database for the service procedure, or to remove the one server from the Netezza cluster, keeping the database operational but incurring delays for reconfiguration both before and after the service is performed.
| Subsystem | Component | System Online after Failure | System Online during Service | Impact | Recovery | Return to Operation | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Lenovo Enclosure components** | | | | | | | |
| Enclosure | Power Supply | Yes | Yes | None (N+1) | None | Automatic | |
| Enclosure | Fan - 60A | Yes | Yes | Reduced cooling | Other fans increase RPMs | Automatic | ||
| Enclosure | Fan - 80A | Yes | Yes | |||||
| Enclosure | SMM Module | Yes | Yes | Loss of enclosure management, monitoring and control | None | Automatic | ||
| Enclosure | EIOM | Yes | No | Depends on enclosure personality: ICP4D control enclosures require a full system outage; ICP4D worker enclosures also (for now) require a system outage; an expansion enclosure running NPS SPUs requires only that NPS be brought down | None; either the entire system or the specific application (NPS) must be shut down to service an enclosure | Configuration and application dependent; should be automatic in most cases once the serviced enclosure's nodes are ENABLED; for NPS, the Netezza database must be restarted (nzstart) once the enclosure is returned to service | |
| Enclosure | X16 Shuttle | Maybe | No | |||||
| Enclosure | x16 PIOR-Right Assembly | Maybe | No | |||||
| Enclosure | x16 PIOR-Left Assembly | Maybe | No | |||||
| Enclosure | Main Chassis Enclosure | Maybe | No | |||||
| Enclosure | Cable Management Arm | Maybe | No | |||||
| Enclosure | PCIe cassette | Maybe | No | |||||
| Enclosure | Rail Kit | Maybe | No | |||||
| **Lenovo SD530 Node components** | | | | | | | |
| Node | NVMe Data Drive | Yes | Yes | Loss of data replica | For an NPS node data drive failure, NPS will automatically regenerate the data on another node's spare data slice | | |
| Node | SFP adapter | Yes | Yes | possible loss of bandwidth | Alternate Path Recovery | Automatic | ||
| Node | Ethernet Cable | Yes | Yes | possible loss of bandwidth | Alternate Path Recovery | Automatic | ||
| Node | Chassis Line Cord | Yes | Yes | None (N+1 power) | None | None | ||
| Node | Planar | Yes | Maybe | Configuration dependent: most applications and pods should be deployed in a clustered configuration, so loss of a single node would not cause application failure, although there might be performance degradation | In most cases, Kubernetes will fail over all pods and applications running on the failed (or disabled) node, and the system will continue to run; the only exception is an application running on a single node without backup | Configuration and application dependent; should be automatic in most cases once the serviced node is ENABLED | For NPS (SPU) nodes, the failed node may continue to run, but it must be powered off for service. In these instances, the customer may choose to stop the Netezza application for the duration of service, or instead to remove the one SPU from the database cluster (requiring regens to run at the start and the end of the replacement procedure). |
| Node | PCIe FPGA / Network card | Yes | Maybe | |||||
| Node | CPU | Yes | Maybe | |||||
| Node | CPU Heat Sink | Yes | Maybe | |||||
| Node | DIMM | Yes | Maybe | |||||
| Node | NVMe Backplane | Yes | Maybe | |||||
| Node | M.2 Controller | Yes | Maybe | |||||
| Node | M.2 Flash Module | Yes | Maybe | |||||
| Node | M.2 Retainer | Yes | Maybe | |||||
| Node | KVM Module | Yes | Maybe | |||||
| Node | Internal NVMe cable | Yes | Maybe | |||||
| Node | Internal NVMe cable | Yes | Maybe | |||||
| Node | Internal Cable | Yes | Maybe | |||||
| Node | Internal Cable | Yes | Maybe | |||||
| Node | Internal Cable | Yes | Maybe | |||||
| Node | Internal Cable | Yes | Maybe | |||||
| Node | Node cover | Yes | Yes | None | None | None | ||
| Node | Node Air baffle | Yes | Yes | |||||
| Node | Fan Door Cover | Yes | Yes | |||||
| Node | Blank Filler | Yes | Yes | |||||
| Node | KVM Dongle | Yes | Yes | |||||
| Node | Label, NVMe x4 SSL | Yes | Yes | |||||
| Node | Label, OEM Node label | Yes | Yes | |||||
| **Mellanox (Edge-Core) Management Switch components** | | | | | | | |
| MgtSw | 3454-A3C | Yes | Yes | Mgmt switch can be swapped concurrently, with loss of management, monitoring and control capabilities only | ||||
| MgtSw | Power Supply with Fans | Yes | Yes | None (N+1) | None | None | ||
| MgtSw | Air Duct | Yes | Yes | None | None | None | ||
| **Mellanox Fabric Switch components** | | | | | | | |
| FabSw | Single 3454-B8C Switch | No | No | No connectivity | None | Automatic | ||
| FabSw | HA switch configuration | Yes | Yes | HA switch configurations will allow concurrent fabric switch replacement with no loss of system availability (although performance may be impacted) | | | |
| FabSw | Power Supply | Yes | Yes | None (N+1) | None | None | ||
| FabSw | Fan | |||||||
| FabSw | 25Gb Transceiver | Yes | Yes | possible loss of bandwidth | Alternate Path Recovery | Automatic | |
| FabSw | 10Gb Transceiver | Yes | Yes | |||||
| FabSw | Air Duct | Yes | Yes | None | None | None | ||
| **(Optional) Power Distribution Units (PDUs)** | | | | | | | |
| PDU | Power Distribution Unit | Yes | Yes | None (N+1) | None | None | ||
| PDU | PDU Line Cord | |||||||
Additional Information
Most components have back-up redundancy or some level of HA and failover. The only notable exceptions are:
- For base systems only: a single high-speed fabric switch failure.
- Should the switch fail, or need to be replaced, the system will be completely unavailable.
- This single-switch configuration is supported only for the base configuration (two compute enclosures for Lenovo-based systems, four compute servers for Dell). Once the system is expanded, switch pairs are required, both for availability and bandwidth.
- Lenovo enclosure failures
- Most active (electronic) components at the Lenovo enclosure level (power supplies, fans, systems management module) are either fully redundant and/or hot pluggable. Most of the other enclosure-level components (with the exception of the EIOM) are purely mechanical parts with extremely low failure rates. Replacing any of these components requires the enclosure to be disabled, which in turn may or may not disrupt system operation, depending on what was running on that enclosure's servers.
All other components of the ICPDS system are redundant or configured in such a way that the system can continue to operate (even at diminished capacity) should a failure occur. Detection, reporting and notifications of hardware issues are also designed to minimize impact to your ongoing operations.
Document Information
Modified date:
20 May 2020
UID
ibm16210524