adubrovsky

GPFS CES node failover failure

‏2019-05-28T20:40:26Z | ces

Hello,

 

We had a production outage at a customer's site caused by a CES service-down event. Four CES servers mount two GPFS filesystems (test01 and prod01); prod01 holds the ces-root directory. All four CES nodes went into the "node failed" state because of an issue with the test01 filesystem, and as a result the NFS services stopped running. The prod01 filesystem (with ces-root) remained online the whole time.

 

Can someone confirm the statement below?

 

 It is now solidly confirmed that unmounting one NFS-exporting filesystem cluster-wide will trigger a CES service-down event, even if the other exporting filesystems are still alive. That is, a CES node requires all NFS-exporting filesystems to be mounted.
Below is the official RCA I can provide:

Since xxxx_dmz01 was unmounted cluster-wide due to lost disk access, the CES IP address failover could never complete successfully until this filesystem was remounted on the CES nodes.

In summary, it is by design that the failure of a single exporting filesystem on a CES node will take down the CES service, even if the other exporting filesystems are still alive.
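
To make the question concrete, this is how I read the statement above, written as a small Python sketch. The function name and the set-based check are my own illustration and an assumption about the behaviour, not anything taken from the GPFS code or documentation:

```python
def ces_node_healthy(exported_filesystems, mounted_filesystems):
    """Hypothetical model of the rule in the RCA: a CES node counts as
    healthy only if every NFS-exporting filesystem is mounted on it."""
    return set(exported_filesystems) <= set(mounted_filesystems)


# Both filesystems carry NFS exports in our setup.
exports = {"test01", "prod01"}

# test01 is unmounted, prod01 (with ces-root) is still mounted.
print(ces_node_healthy(exports, {"prod01"}))            # False
# Both filesystems mounted.
print(ces_node_healthy(exports, {"test01", "prod01"}))  # True
```

If this reading is right, the state of prod01 and ces-root alone is not enough to keep a CES node out of the "node failed" state.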

 

GPFS cluster information
========================
  GPFS cluster name:         dmz_datastore.gpfs.xxxxdmz.net
  GPFS cluster id:           3253305039493079400

Cluster Export Services global parameters
-----------------------------------------
  Shared root directory:                /xxxx_DATASTORE/.ces
  Enabled Services:                     NFS
  Log level:                            0
  Address distribution policy:          even-coverage

 Node  Daemon node name            IP address       CES IP address list
-----------------------------------------------------------------------
   7   dk37.gpfs.xxxxdmz.net       10.6.14.17       10.6.14.16 10.6.14.22
   8   dk38.gpfs.xxxxdmz.net       10.6.14.19       10.6.14.18 10.6.14.20
   9   dk39.gpfs.xxxxdmz.net       10.6.14.21       x.x.x..123 x.x.x..125
  10   dk36.gpfs.xxxxdmz.net       10.6.14.15       x.x.x..121 x.x.x..122