3 replies. Latest post: 2013-05-01T20:16:17Z by davesmo
SystemAdmin

Pinned topic: Lost membership in cluster

2009-01-07T02:46:34Z
Hi All,
I have a mixed GPFS cluster with 10 AIX nodes and 8 Linux on PPC (JS21) nodes.

The problem: when I run an application on the Linux nodes one by one, GPFS is unmounted on three of the nodes. The other five nodes work correctly. I find the same message on each failed node: "Lost membership in cluster. Unmounting file systems."

This is the mmfs log from one of the failed nodes:

Tue Jan 6 16:02:06 CST 2009 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /usr/lpp/mmfs/bin
Loading modules from /usr/lpp/mmfs/bin
Module Size Used by
mmfslinux 279440 1 mmfs
tracedev 28592 2 mmfs,mmfslinux
Removing old /var/mmfs/tmp files:
Tue Jan 6 16:02:07.708 2009: mmfsd initializing. {Version: 3.1.0.15 Built: Sep 25 2007 19:59:04} ...
Tue Jan 6 16:02:08.729 2009: Connecting to 192.168.13.101 f01n01-data
Tue Jan 6 16:02:08.730 2009: Connected to 192.168.13.101 f01n01-data
Tue Jan 6 16:02:08.731 2009: Connecting to 192.168.13.102 f01n02-data
Tue Jan 6 16:02:08.730 2009: Connected to 192.168.13.102 f01n02-data
Tue Jan 6 16:02:08.741 2009: mmfsd ready
Tue Jan 6 16:02:08 CST 2009: mmcommon mmfsup invoked
Tue Jan 6 16:02:15.577 2009: Command: mount gpfslv
Tue Jan 6 16:03:56.938 2009: Command: err 0: mount gpfslv
Tue Jan 6 16:11:20.297 2009: Close connection to 192.168.13.102 f01n02-data
Tue Jan 6 16:11:20.297 2009: Close connection to 192.168.13.101 f01n01-data
Tue Jan 6 16:11:30.491 2009: Lost membership in cluster gpfs_cluster.f01n01-data. Unmounting file systems.
Tue Jan 6 16:11:31.118 2009: Connecting to 192.168.13.101 f01n01-data
Tue Jan 6 16:11:31.119 2009: Connected to 192.168.13.101 f01n01-data
Tue Jan 6 16:11:31.120 2009: Connecting to 192.168.13.102 f01n02-data
Tue Jan 6 16:11:31.121 2009: Connected to 192.168.13.102 f01n02-data
Tue Jan 6 16:13:13.304 2009: Remounted gpfslv
Tue Jan 6 16:13:13.305 2009: mmfsd ready
Tue Jan 6 16:13:13 CST 2009: mmcommon mmfsup invoked

f01n01-data and f01n02-data are the NSD servers in the GPFS cluster.

In /var/log/messages:

Jan 6 16:09:12 b01n03 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
Jan 6 16:11:30 b01n03 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=6492723: Reason code 668 Failure Reason Lost membership in cluster gpfs_cluster.f01n01-data. Unmounting file systems.
Jan 6 16:11:30 b01n03 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=6492723:
Jan 6 16:12:29 b01n03 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0

What's the meaning of "Reason code 668"? Any suggestions on how to check this issue?
Thanks!
Updated on 2009-01-09T03:39:31Z by SystemAdmin
  • dlmcnabb

    Re: Lost membership in cluster

    2009-01-07T06:05:27Z, in response to SystemAdmin
    The 668 is just the error code for "Lost Membership".

    The applications must be paging so heavily that when GPFS wants to send its lease renewal RPC to the Cluster Manager, 5 seconds before leaseDuration runs out, the daemon cannot get paged in before the lease expires; the DMS (Deadman Switch) timer then pops 2/3rds of leaseRecoveryWait later. If leaseRecoveryWait is the default 35 seconds, membership is lost about 23 seconds later.

    If you do not need fast failure detection, you can increase leaseRecoveryWait to 120 seconds, giving the client 80 seconds to get paged in.

    You should also increase minMissedPingTimeout to 120 so that the Cluster Manager will wait up to 120 seconds after leaseDuration before declaring a node out of the cluster.
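
    A minimal sketch of applying these tunables with mmchconfig is below (the values are the ones suggested above; whether a daemon restart is needed for them to take effect can vary by release, so verify the active values with mmlsconfig):

    # Show current values (parameters appear in mmlsconfig only if explicitly set).
    mmlsconfig | grep -iE 'leaseRecoveryWait|minMissedPingTimeout'

    # Relax failure detection; mmchconfig applies the change cluster-wide by default.
    mmchconfig leaseRecoveryWait=120
    mmchconfig minMissedPingTimeout=120

    # Confirm the new settings.
    mmlsconfig | grep -iE 'leaseRecoveryWait|minMissedPingTimeout'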

    Consider fixing the application so it can run in the memory available instead of paging the system to death. Or get more memory. Or reduce the GPFS pinned pagepool memory so there is more available for the applications.
    • SystemAdmin

      Re: Lost membership in cluster

      2009-01-09T03:39:31Z, in response to dlmcnabb
      Yes, after changing these two parameters, the application can run on those three nodes, but the performance is very poor. On the other five nodes the application completes in about 5 minutes, but on these three nodes it took about 17 minutes.

      There was no swapping in or out while the application was running, and the system still had about 5 GB of free memory.

      root@b01n03 tmp# vmstat 5
      procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
       r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa
       3  0      0 5175424  21852 2370840    0    0     4   172  255  239 12  1 86  1
       3  1      0 5175424  21856 2370836    0    0     0     6   16   56 75  0 25  0
       3  0      0 5175424  21860 2370832    0    0     0    14   25   60 75  0 25  0
       3  0      0 5175424  21860 2370832    0    0     0    25   25   68 75  0 25  0
       3  0      0 5175424  21860 2370832    0    0     0     0   15   52 75  0 25  0
       3  0      0 5175424  21860 2370832    0    0     0     0   20   61 75  0 25  0

      I did a copy test on this node; f01n01-data and f01n02-data are the NSD servers:

      root@b01n03 tmp# time scp f01n01-data:/gpfs/tmp/test /tmp/
      real 3m14.168s
      user 0m45.349s
      sys 0m29.911s

      root@b01n03 tmp# time scp f01n02-data:/gpfs/tmp/test /tmp/
      real 3m12.921s
      user 0m45.572s
      sys 0m30.313s

      root@b01n03 tmp# time cp /gpfs/tmp/test /tmp/
      real 12m44.622s
      user 0m0.176s
      sys 0m15.963s

      Copying the file from GPFS to local disk took a very long time. How can I investigate this problem? Thanks.
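
      One way to narrow this down is to time the GPFS read and the local-disk write separately. A sketch with standard tools (the scratch file name and block size are arbitrary choices; adjust count to roughly match the size of the test file):

      # Time the GPFS read alone, discarding the data.
      time dd if=/gpfs/tmp/test of=/dev/null bs=1M

      # Time a local-disk write alone on /tmp.
      time dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024
      rm -f /tmp/ddtest

      If the GPFS read by itself is already slow, the problem is on the GPFS or network side of this node; if only the combined cp is slow, look at the local disk behind /tmp.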
      • davesmo

        Re: Lost membership in cluster

        2013-05-01T20:16:17Z, in response to SystemAdmin

        I was facing a similar issue: reason code 668, failure reason "Lost membership in cluster".

        I found out that one of the private interfaces was running at 1000/Full and the other at 100/Full.

        Check the interface speed settings on each system.

        In my case, the 668 issue was entirely caused by the private connections, on the same network, talking at different speeds.
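
        A quick way to compare this on Linux is ethtool; a sketch (eth0 and the node names are placeholders for the actual private interface and hosts):

        # Show the negotiated speed and duplex of the private interface.
        ethtool eth0 | grep -E 'Speed|Duplex'

        # Repeat across the cluster with a simple ssh loop.
        for n in node1 node2 node3; do
            echo "== $n =="
            ssh "$n" "ethtool eth0 | grep -E 'Speed|Duplex'"
        done

        Every node on the same private network should report the same speed and full duplex.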

        Updated on 2013-05-24T20:30:44Z by davesmo