Flashes (Alerts)
Abstract
IIAS was designed with high availability in mind. However, because any computer hardware component can fail now and then, it is important to understand the impact those failures have on the overall IIAS system, both when the failure occurs and when the failed component is serviced.
Content
The table below lists all serviceable components in the IBM Integrated Analytics System (IIAS), the impact on the system when each component fails, the recovery that is attempted after the failure, and how the system is returned to normal operation after service has been performed. IIAS has been designed with significant redundancy and few single points of failure. The two columns "SubSystem Online after Failure" and "SubSystem Online during Service" indicate whether that subsystem (node, storage array, network switch) remains available after a component failure and while the component is being serviced. In the case of the network switches (management, fabric or Fibre Channel), the "subsystem" is viewed as that network, comprising the two like switches.
If a component failure within one of the system's server nodes makes that node unavailable, the database restarts with (N-1) nodes. On larger systems, spare nodes are allocated, so that even these failures (beyond the first rack) result in only a minimal interruption for the restart.
| SubSystem | Component | SubSystem Online after Failure | SubSystem Online during Service | Impact | Recovery | Return to Operation | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Node | Disk | Yes | Yes | performance impact during rebuild after replacement | rebuild after replacement | automatic when rebuild completes | |
| Node | Fan | Yes | Yes | reduced cooling | none | automatic | |
| Node | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Node | Battery | Yes | No | none unless power is lost | none | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | PCIe-Fabric+Mgmt | Yes | No | performance impact until repair | alternate path recovery | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | PCIe-SAN | Yes | No | performance impact until repair | alternate path recovery | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | DIMM | Yes | No | performance impact until repair | node fails, reboots with less memory | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Processor Module | Yes | No | performance impact until repair | node fails, reboots with less CPU | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | SAS HBA | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Motherboard | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Node | Anchor card | No | No | cluster runs with N-1 nodes | node fails, stays down until repair | repair, restart, verify and re-enable | customer to rebalance at their convenience |
| Flash 900 | Battery | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Flash 900 | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Flash 900 | Canister Fan | Yes | Yes | reduced cooling | remaining fans increase RPMs | automatic | |
| Flash 900 | Flash Module | Yes | Yes | performance impact during rebuild | auto rebuild to spare | replacement module becomes spare | |
| Flash 900-AE2 | Canister | Yes | No | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4001 appliances |
| Flash 900-AE2 | Interface Card | Yes | No | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4001 appliances |
| Flash 900-AE3 | Canister | Yes | Yes | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4002 appliances |
| Flash 900-AE3 | Interface Card | Yes | Yes | performance impact until repair | alternate path recovery | re-enablement part of guided replacement | for M4002 appliances |
| Flash 900 | Power Interposer | Maybe | No | repair requires system outage | may or may not cause array outage | ap stop (if needed), repair, verify, ap start | |
| Flash 900 | Midplane | Maybe | No | repair requires system outage | may or may not cause array outage | ap stop (if needed), repair, verify, ap start | |
| Flash 900 | Front Panel LED | Yes | No | repair requires system outage | none | ap stop (if needed), repair, verify, ap start | |
| Storwize V5000 | Disk Drive | Yes | Yes | Performance impact until rebuild to spare completes | Auto rebuild to spare | Replacement HDD becomes spare | |
| Storwize V5000 | Power Supply | Yes | Yes | No interrupt (N+1) | none | automatic | |
| Storwize V5000 | Battery Pack | Yes | Yes | No interrupt (N+1) | none | automatic | Inside canister |
| Storwize V5000 | CMOS Coin Battery | Yes | Yes | No interrupt (N+1) | none | automatic | Inside canister |
| Storwize V5000 | Canister | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | |
| Storwize V5000 | Canister Memory | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | Inside canister |
| Storwize V5000 | Interface Card | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | Inside canister |
| Storwize V5000 | Interface SFP | Yes | Yes | Performance impact until repair | alternate path recovery | Re-enablement part of guided replacement | |
| Storwize V5000 | Enclosure midplane | No | No | Loss of access to specific cool storage until repair is complete | None possible | Bring V5000 back online | (*) will cause any queries which include cool storage tables to fail |
| Fabric Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| Fabric Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Fabric Switch | Fan | Yes | Yes | no interrupt (N+1) | none | automatic | |
| SAN Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| SAN Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| Mgmt Switch | Switch | Yes | Yes | performance impact until repair | alternate path recovery | replace, customize config, verify | |
| Mgmt Switch | Power Supply | Yes | Yes | no interrupt (N+1) | none | automatic | |
| RPC | | Yes | Yes | no interrupt (N+1) | none | replace, customize config, verify | |
| Display | | Yes | Yes | no impact unless appliance mgmt required | none | replace, customize config, verify | |
| Terminal Server | | Yes | Yes | no impact unless appliance mgmt required | none | replace, customize config, verify | |
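The two availability columns can also be read programmatically, for example when planning a maintenance window. The following is a minimal sketch in Python, not an IBM tool: the `Row` type, the `ROWS` list and the function name are hypothetical, and only a handful of rows are copied from the table above for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One table row, reduced to the two availability columns (hypothetical type)."""
    subsystem: str
    component: str
    online_after_failure: str   # "Yes", "No" or "Maybe", as written in the table
    online_during_service: str  # "Yes" or "No", as written in the table


# A few rows copied from the table above; this is not the full list.
ROWS = [
    Row("Node", "Disk", "Yes", "Yes"),
    Row("Node", "DIMM", "Yes", "No"),
    Row("Node", "Motherboard", "No", "No"),
    Row("Flash 900", "Flash Module", "Yes", "Yes"),
    Row("Flash 900", "Midplane", "Maybe", "No"),
    Row("Storwize V5000", "Enclosure midplane", "No", "No"),
    Row("Fabric Switch", "Switch", "Yes", "Yes"),
]


def offline_during_service(rows):
    """Return (subsystem, component) pairs whose subsystem is not online during service."""
    return [
        (r.subsystem, r.component)
        for r in rows
        if r.online_during_service != "Yes"
    ]


if __name__ == "__main__":
    for subsystem, component in offline_during_service(ROWS):
        print(f"{subsystem}: {component}")
```

Treating "Maybe" as a separate string value, rather than forcing it into a boolean, keeps the ambiguity the table expresses for the Flash 900 power interposer and midplane.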
Only two components would result in a full IIAS system outage beyond a database stop and restart: the power interposer and the midplane of the FS900 flash array. Both are mechanical components with no electrical elements and extremely low failure rates, which contribute to the 99.999% availability characteristic of the FS900 array. Even then, these parts can fail in ways that do NOT cause the array to go offline; however, the array (and the system) would have to be offline for service.
Other than these two components in the FS900 arrays, the overall system impacts of any hardware failure and/or replacement are potential performance degradations or brief interruptions due to application (database) restarts during failure and/or service. For the most part, the system is designed to continue operating after a hardware failure.
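To put the 99.999% figure in perspective, it can be converted into an expected downtime budget. The sketch below is illustrative arithmetic only; it assumes the percentage is measured over a calendar year of 365.25 days, which the source does not state, and the function name is hypothetical.

```python
# Assumption: availability is measured over an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime in minutes per year."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    minutes = downtime_minutes_per_year(99.999)
    print(f"99.999% availability allows about {minutes:.2f} minutes of downtime per year")
```

Under that assumption, "five nines" corresponds to a little over five minutes of downtime per year.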
Document Information
Modified date:
26 September 2022
UID
ibm11102227