Recovering an HDFS Transparency cluster

Learn how to bring an HDFS Transparency cluster back online.

A disk failure or another unforeseen storage issue can cause IBM Storage Scale file systems to be unmounted. In such cases, HDFS Transparency automatically shuts itself down, and workloads might report an exception. When the IBM Storage Scale cluster is functioning again, follow this recovery procedure to bring the HDFS Transparency cluster back online.

  1. Shut down the HDFS Transparency NameNodes and DataNodes by using the following commands.
    # mmces node suspend --stop -N <NameNode1_HOST>,<NameNode2_HOST>
    # mmhdfs hdfs-dn stop

    Use the mmces node suspend command to stop the NameNodes. This command is required to ensure that the root directory that is shared with CES is unlocked.
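
    One way to verify that the NameNodes are suspended is to list the CES nodes; suspended nodes are flagged in the output.

    # mmces node list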

  2. To retrieve the mount points that HDFS Transparency uses for the IBM Storage Scale file systems, run the mmhdfs config get command.

    For example:

    # mmhdfs config get gpfs-site.xml -k gpfs.mnt.dir
    gpfs.mnt.dir=/gpfs1,/gpfs2
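
    If you want to capture the mount points in a shell variable for later use, a minimal sketch, assuming the single key=value output format shown above, is:

    # MNT_DIRS=$(mmhdfs config get gpfs-site.xml -k gpfs.mnt.dir | grep '^gpfs.mnt.dir=' | cut -d= -f2)
    # echo $MNT_DIRS
    /gpfs1,/gpfs2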
  3. If a secondary file system is configured, unmount it first. Then, unmount the primary file system.

    For example:

    # mmumount gpfs2 -a
    # mmumount gpfs1 -a
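
    To confirm that the file systems are no longer mounted anywhere in the cluster, you can query each one with mmlsmount; a fully unmounted file system is reported as mounted on 0 nodes.

    # mmlsmount gpfs1
    # mmlsmount gpfs2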
  4. Check the status of HDFS Transparency to ensure that all the NameNodes and DataNodes are down.
    # mmhdfs hdfs status
  5. Remount the IBM Storage Scale file systems.
    # mmmount gpfs1 -a
    # mmmount gpfs2 -a

    Make sure that all the file systems are successfully mounted. Use the mmlsmount and mount commands to verify.
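
    For example, mmlsmount all -L lists the nodes where each file system is mounted, and mount -t gpfs lists the GPFS mounts on the local node.

    # mmlsmount all -L
    # mount -t gpfs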

  6. Start the HDFS Transparency NameNodes and DataNodes.
    # mmces node resume --start -N <NameNode1_HOST>,<NameNode2_HOST>
    # mmhdfs hdfs-dn start
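
    To verify that the NameNodes are resumed, you can check the CES node and service state, for example:

    # mmces node list
    # mmces state show -a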
  7. Check the status of HDFS Transparency to confirm that all the NameNodes and DataNodes are running.
    # mmhdfs hdfs status
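
    After all the NameNodes and DataNodes report as running, you can optionally run a small smoke test from a node that has a configured Hadoop client to confirm that HDFS Transparency is serving requests, for example:

    # hdfs dfs -ls /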