Simultaneous restart of worker nodes causes GlusterFS to fail

When you restart all the worker nodes at the same time, GlusterFS does not start.

Causes

When all the worker nodes restart at the same time, the GlusterFS bricks shut down uncleanly and do not come back online. As a result, the heketidbstorage volume shows as offline, the Heketi container cannot mount it, and the Heketi pod fails to start.

Resolving the problem

Get the GlusterFS pod information by running the following command:

kubectl -n kube-system get pod | grep gluster

Following is an example of the command output:

glusterfs-36nd0 1/1 Running 4 7d
glusterfs-3m5ql 1/1 Running 3 7d
glusterfs-tc279 1/1 Running 16 7d
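
To script the per-pod steps, the pod names can be pulled out of this listing. The following is a minimal sketch and not part of the documented procedure. The function name gluster_pod_names is hypothetical, and the sketch assumes that the pod names begin with glusterfs- as in the example output.

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and prints
# only the names of pods whose name starts with "glusterfs-".
gluster_pod_names() {
  awk '$1 ~ /^glusterfs-/ { print $1 }'
}

# Example use (assumes kubectl is configured for the cluster):
#   kubectl -n kube-system get pod | gluster_pod_names
```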

Complete the following steps for all the GlusterFS pods:

  1. Log in to the GlusterFS pod:

     kubectl -n kube-system exec -it <POD ID> bash
    

    Following is an example of the command and its output:

     root@BPILICPMSTR001:~/cluster# kubectl -n kube-system exec -it glusterfs-36nd0 bash
     [root@bpilicpwrk001 /]#
    
  2. Check the status of the GlusterFS volume on the pod:

     gluster volume status
    

    Following is an example of the command and its output:

     [root@bpilicpwrk001 /]# gluster volume status
     Status of volume: heketidbstorage
     Gluster process                             TCP Port  RDMA Port  Online  Pid
     ------------------------------------------------------------------------------
     Brick 10.10.25.49:/var/lib/heketi/mounts/vg
     _22bbf0fbb483f9c170774d83081c3420/brick_2fb
     3a10c7eafb8bed375829e8aaf782a/brick         49153     0          Y       5858
     Brick 10.10.25.51:/var/lib/heketi/mounts/vg
     _118f22bc13626321606280ea1d79fdc3/brick_649
     4a3b077c38667f07a59197efabea7/brick         49153     0          Y       5318
     Brick 10.10.25.50:/var/lib/heketi/mounts/vg
     _d4d4f2e86c08f571befe7fc272dc4aae/brick_dc9
     416bf4d88e45ff4d0061c08ef5b19/brick         49153     0          Y       5441
     Self-heal Daemon on localhost               N/A       N/A        Y       5878
     Self-heal Daemon on 10.10.25.50             N/A       N/A        Y       5461
     Self-heal Daemon on 10.10.25.51             N/A       N/A        Y       5338

     Task Status of Volume heketidbstorage
     ------------------------------------------------------------------------------
     There are no active volume tasks
    
     [root@bpilicpwrk001 /]#
    

    If any brick of heketidbstorage is down, that is, it shows N in the Online column, restart the bricks by running the following commands:

     gluster volume stop heketidbstorage
    
     gluster volume start heketidbstorage force
    
  3. Verify the Heketi pod status:

     kubectl -n kube-system get pod | grep heketi
    

    The output is similar to the following example, with the Heketi pod in the Running state:

     heketi-402978595-pjnd7 1/1 Running 0 2h
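
The brick check in step 2 can be sketched as a small shell helper, which might be convenient when several pods must be inspected. This is an illustration under assumptions and not part of the documented procedure. The function name count_offline_bricks is hypothetical, and the parsing assumes the column layout shown in the example output, where the last line of each brick entry ends with the TCP port, RDMA port, Online flag, and Pid.

```shell
# Hypothetical helper: reads `gluster volume status` output on stdin and
# prints how many brick entries have N in the Online column.
count_offline_bricks() {
  awk '
    # A brick entry starts with the word "Brick"; long paths wrap onto
    # the following lines.
    $1 == "Brick" { in_brick = 1 }
    # The closing line of an entry carries the four status columns, so it
    # has at least five fields; the Online flag is the second-to-last one.
    in_brick && NF >= 5 {
      if ($(NF - 1) == "N") offline++
      in_brick = 0
    }
    END { print offline + 0 }
  '
}

# Example use (assumes kubectl is configured; <POD ID> is a GlusterFS pod
# name from the first command in this topic):
#   kubectl -n kube-system exec <POD ID> -- gluster volume status heketidbstorage | count_offline_bricks
```

A result of 0 means that every brick of the volume is online. Any other value indicates that the stop and forced start of heketidbstorage in step 2 are needed.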