Simulating a node failure

To simulate a hard node failure, power off the node and reinstall the operating system.

Prerequisites

  • A running IBM Storage Ceph cluster.

  • Root-level access to all nodes in the storage cluster.

Procedure

  1. Check the storage cluster’s capacity to understand the impact of removing the node:

    Example

    [ceph: root@host01 /]# ceph df
    [ceph: root@host01 /]# rados df
    [ceph: root@host01 /]# ceph osd df
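  2. Optionally, prevent OSDs from being marked out and disable scrubbing, so that the storage cluster does not start rebalancing data while the node is down: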

    Example

    [ceph: root@host01 /]# ceph osd set noout
    [ceph: root@host01 /]# ceph osd set noscrub
    [ceph: root@host01 /]# ceph osd set nodeep-scrub
  3. Shut down the node.

  4. If you are changing the host name, remove the node from the CRUSH map:

    Example

    [ceph: root@host01 /]# ceph osd crush rm host03
  5. Check the status of the storage cluster:

    Example

    [ceph: root@host01 /]# ceph -s
  6. Reinstall the operating system on the node.

  7. Add the reinstalled node back to the storage cluster.

  8. If you set the noout, noscrub, and nodeep-scrub flags earlier, unset them so that normal data placement and scrubbing resume:

    Example

    [ceph: root@host01 /]# ceph osd unset noout
    [ceph: root@host01 /]# ceph osd unset noscrub
    [ceph: root@host01 /]# ceph osd unset nodeep-scrub
  9. Check Ceph’s health:

    Example

    [ceph: root@host01 /]# ceph -s
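
The flag handling and the final health check in the procedure above can be wrapped in small shell helpers. The following is a minimal sketch, not part of the product: the function names maint_flags and wait_for_health_ok, the retry count, and the delay are illustrative assumptions. It relies only on the ceph osd set, ceph osd unset, and ceph health commands, and assumes it runs where the ceph CLI is available (for example, inside the cephadm shell).

```shell
#!/bin/sh
# Hypothetical helpers; the names and defaults are illustrative, not part of Ceph.

# Flags that the procedure sets before shutdown and unsets afterwards.
MAINT_FLAGS="noout noscrub nodeep-scrub"

# Set or unset every maintenance flag.
# Usage: maint_flags set | maint_flags unset
maint_flags() {
    action="$1"
    case "$action" in
        set|unset) ;;
        *) echo "usage: maint_flags set|unset" >&2; return 2 ;;
    esac
    for flag in $MAINT_FLAGS; do
        ceph osd "$action" "$flag" || return 1
    done
}

# Poll `ceph health` until it reports HEALTH_OK or the retries run out.
# Usage: wait_for_health_ok [retries] [delay_seconds]
wait_for_health_ok() {
    retries="${1:-30}"
    delay="${2:-10}"
    i=0
    while [ "$i" -lt "$retries" ]; do
        status="$(ceph health 2>/dev/null)"
        case "$status" in
            HEALTH_OK*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A possible use is to run maint_flags set before shutting down the node, then maint_flags unset followed by wait_for_health_ok once the node has been added back. A non-zero return from wait_for_health_ok only means that HEALTH_OK was not seen within the retry budget; inspect ceph -s for details in that case.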