
GPFS CES node failover failure

2019-05-28T20:40:26Z | ces



We had a production outage at a customer site due to a CES service-down event. Four CES servers mount two GPFS filesystems (test01 and prod01); prod01 contains the CES root directory. All four CES nodes entered the "node failed" state due to an issue with the test01 filesystem, and as a result the NFS services stopped running. The prod01 filesystem (with the CES root) remained online the entire time.


Can someone confirm the statement below?


It now seems confirmed that a single NFS-exporting filesystem being unmounted cluster-wide will trigger a CES service-down event, even if the other exported filesystems are still alive. In other words, a CES node requires all NFS-exporting filesystems to be mounted.
Below is the official RCA I can share:

Since xxxx_dmz01 was unmounted cluster-wide due to lost disk access, CES IP address failover could never complete successfully until this filesystem was remounted on the CES nodes.

In summary, it is by design that the failure of a single exported filesystem on a CES node will take down the CES service, even if the other exported filesystems are still alive.
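For anyone chasing the same symptom, the cluster-wide mount state and the CES node/address state can be cross-checked with the standard Spectrum Scale commands. A sketch (run on any node in the cluster; the filesystem names test01/prod01 are from this post, and the output shapes described in the comments are assumptions based on a healthy cluster):

# Show, per filesystem, which nodes have it mounted. A filesystem
# unmounted cluster-wide (as test01 was here) lists no mounting nodes.
mmlsmount all -L

# Show the CES service state on every protocol node; in this outage
# the nodes would have reported a failed state here.
mmces state show -a

# Show which node currently hosts each CES IP address; while failover
# cannot complete, addresses remain unassigned.
mmces address list

Comparing mmlsmount against mmces state show makes the dependency visible: as soon as any NFS-exported filesystem drops out of the mount list cluster-wide, the CES nodes serving it go unhealthy even though the other filesystems (prod01, with the CES root) stay mounted.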


GPFS cluster information


  GPFS cluster name:

  GPFS cluster id:           3253305039493079400


Cluster Export Services global parameters


  Shared root directory:                /xxxx_DATASTORE/.ces

  Enabled Services:                     NFS

  Log level:                            0

  Address distribution policy:          even-coverage


 Node  Daemon node name            IP address       CES IP address list

    9                              x.x.x.123        x.x.x.125
   10                              x.x.x.121        x.x.x.122