Handling a node failure

As a storage administrator, you might experience a whole node failing within the storage cluster. Handling a node failure is similar to handling a disk failure, except that instead of Ceph recovering placement groups (PGs) for only one disk, all PGs on the disks within that node must be recovered. Ceph detects that all of the OSDs on the node are down and automatically starts the recovery process, known as self-healing.
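
To observe the cluster while it self-heals, you can run the following commands from a Cephadm shell on an admin node. This is a minimal sketch; host01 is a placeholder hostname, not a name from this document.

  [ceph: root@host01 /]# ceph osd tree
  [ceph: root@host01 /]# ceph status
  [ceph: root@host01 /]# ceph health detail

The ceph osd tree output shows the OSDs on the failed node marked as down, and ceph status reports the degraded and recovering placement groups while Ceph backfills the data onto the surviving OSDs.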

There are three node failure scenarios:
  • Replacing the node by using the root and Ceph OSD disks from the failed node.
  • Replacing the node by reinstalling the operating system and using the Ceph OSD disks from the failed node.
  • Replacing the node by reinstalling the operating system and using all new Ceph OSD disks.
For a high-level workflow for each node replacement scenario, see Workflow for replacing a node.
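
If you plan to reuse the Ceph OSD disks from the failed node, a commonly used precaution (an assumption here, not a step from any specific workflow in this document) is to set the noout flag so that the down OSDs are not marked out and Ceph does not start rebalancing data while the node is being replaced. A minimal sketch, again using host01 as a placeholder hostname:

  [ceph: root@host01 /]# ceph osd set noout

After the replacement node and its OSDs rejoin the cluster, clear the flag so that normal recovery can finish:

  [ceph: root@host01 /]# ceph osd unset noout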

Prerequisites

  • A running IBM Storage Ceph cluster.
  • A failed node.