Topic
  • 6 replies
  • Latest Post - 2009-11-25T07:43:37Z by Theeraph
Theeraph
Theeraph
110 Posts

Pinned topic How to diagnose GPFS split brain?

2009-11-17T16:07:54Z |
Env: AIX 5.3, GPFS 3.1, Oracle 10g RAC
.
The system had been implemented and running for some time without any problem.
.
Then the customer upgraded Oracle RAC, and since they have more workload, they also added more CPU and memory.
.
We believe we had solved some GPFS split-brain situations that were caused by:
.
AIX thrashing due to not enough memory...
.
Network-related things, e.g. network cable, network connection, network speed setting, network switch setting, network MTU...
.
But we still have the split-brain situation!
.
We are sure it is not caused by AIX thrashing (it still happens even when a node has just started up and there is not much workload yet), and since we have fixed the network speed to 1000 Full and verified everything listed above, it should not be caused by the network either...
.
1. Are there any other probable causes of a GPFS split brain?
.
2. How can we debug them? (E.g., are there any trace hooks we should enable in GPFS so that we can see in more detail why it happens?)
.
Thank you very much,
Theeraphong
Updated on 2009-11-25T07:43:37Z by Theeraph
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-17T17:10:27Z
    I don't understand what you mean by "split brain".

    My definition is that when the network breaks, the two nodes cannot decide which one should be the remaining live node. This is solved in GPFS by requiring a tiebreaker disk for the two-node case. Both nodes must have access to the disk and use it to check which node is the cluster manager. The other node then leaves the cluster. If the old cluster manager was really dead, the challenge on the tiebreaker disk would not be responded to, and the other node takes over as cluster manager. The cluster will stay this way until the network is fixed.

    If one or both nodes are overloaded with paging, this can also look like a network break, because GPFS cannot get itself paged in to renew its lease in time. The only way to fix this is to get enough memory or cut back on the application load so that the system is not swamped by paging activity.
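
    For example, to confirm or rule out paging pressure on AIX while the problem is happening, basic checks such as these can be run on each node (a minimal sketch using base AIX commands; the interval and count are just examples):
    ---
    # Watch the pi/po (page-in / page-out per second) columns; sustained
    # non-zero values mean the node is paging:
    vmstat 5 6
    # Show paging-space utilisation per device:
    lsps -a
    ---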

    You can make GPFS a little more tolerant of the overload by increasing the GPFS minMissedPingTimeout. However, this will prevent fast failover in the case of a real node failure, since a node will not be expelled until at least minMissedPingTimeout has passed. So it all depends on what the customer needs:
    1) fast failover, in which case you may get false node reboots, or
    2) slow failover, to keep things going as long as possible even in the face of system overload or network glitches (a tuning sketch follows below).
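
    As a rough illustration of that knob (a sketch only; minMissedPingTimeout is an undocumented tunable, and the value below is a placeholder rather than a recommendation, so verify it against your GPFS level first):
    ---
    # Show the current configuration, including any overrides already set:
    mmlsconfig
    # Raise the tolerance; the value is in seconds, and larger values
    # mean slower failover when a node really dies:
    mmchconfig minMissedPingTimeout=60
    ---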
  • Theeraph
    Theeraph
    110 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-18T15:34:44Z
    • dlmcnabb
    • 2009-11-17T17:10:27Z
    Dan,
    .
    1. Your definition of split brain is what I meant.
    .
    In short, the GPFS interconnect somehow breaks, so the nodes use the tiebreaker disks for the challenge and response. Since the cluster manager is still alive, after a while the non-cluster-manager node is expelled from the cluster.
    .
    Its status becomes 'arbitrating' and it is forced to unmount the file system. After a while, RAC detects this and reboots the node...
    .
    ! The only way to fix this is to get enough memory or cut back on the application load so that the system is not swamped by paging activity.
    .
    2. Yes, we had this before due to not enough memory (the Trial CoD memory had expired, and when the node was powered off and on, the CoD memory was not available, so thrashing happened).
    .
    But now the system has the Trial CoD memory back, and the problem still happens?!?
    .
    3. Are there any other causes of this split brain (besides network problems and too much paging)? It looks like neither is the cause of our current problem...
    .
    Also, it would be nice if we could trace in detail why this split brain happens, with some hook in the GPFS trace...
    .
    ! You can make GPFS a little more tolerant of the overload by increasing the GPFS minMissedPingTimeout.
    .
    4. I think we can use 'mmchconfig minMissedPingTimeout=nn' to change it...
    .
    If I change this, do I have to stop and start GPFS?
    .
    5. Here are the ping related parameters:
    ---
    pingPeriod 2
    totalPingTimeout 120
    minMissedPingTimeout 8
    maxMissedPingTimeout 60
    ---
    .
    So which value should I use for minMissedPingTimeout? (30 sec? Is 60 sec, which equals maxMissedPingTimeout, too much?)
    .
    Or should I change minMissedPingTimeout to 90 and
    maxMissedPingTimeout to 120? (Will I need to change totalPingTimeout too?)
    .
    Thank you very much for your kind advice,
    Theeraphong
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-18T16:55:07Z
    • Theeraph
    • 2009-11-18T15:34:44Z
    Change minMissedPingTimeout=90
    and make maxMissedPingTimeout the same or larger.

    You have to restart each GPFS daemon to enable the settings. This can be done serially. Be sure Oracle is stopped first, so it does not reboot the machine when the filesystem disappears.
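
    A minimal sketch of that sequence, assuming the mmchconfig syntax discussed above (how Oracle is stopped depends on the RAC setup, so that step is left as a comment):
    ---
    # Run once, from any node in the cluster:
    mmchconfig minMissedPingTimeout=90
    mmchconfig maxMissedPingTimeout=120

    # Then on each node in turn, one node at a time:
    #   (first stop Oracle RAC on this node, by your site's procedure)
    mmshutdown          # stop the GPFS daemon on this node only
    mmstartup           # restart it so it picks up the new settings
    mmgetstate -a       # confirm all nodes are 'active' before moving on
    ---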
  • Theeraph
    Theeraph
    110 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-20T03:26:02Z
    • dlmcnabb
    • 2009-11-18T16:55:07Z
    Dan,
    .
    Thank you for your advice...
    .
    I tried changing our internal system:
    • minMissedPingTimeout from 3 to 90
    • maxMissedPingTimeout from 60 to 120
    .
    Then stop and start GPFS.
    .
    GPFS starts successfully and the fs is mounted. However, I noticed that commands like mmgetstate -La and mmlsconfig seem very slow to respond, and after a while everything in the system is very slow to respond... (It seems that the system hangs, since it does not respond to any command I enter...)
    .
    What happened? How can I solve this? (I am glad that I tested it on our internal system first!)
    .
    Thank you very much,
    Theeraphong
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-24T15:37:33Z
    • Theeraph
    • 2009-11-20T03:26:02Z
    Please gather "mmfsadm dump waiters" from all nodes when it is sluggish. Without that it is very hard to diagnose.
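
    One simple way to collect them (a sketch; node1/node2 are placeholder hostnames, and any working remote shell between the nodes will do):
    ---
    # While the cluster is sluggish, capture the waiters from every node,
    # with a timestamp, and repeat a few times:
    for node in node1 node2; do
        ssh $node 'date; /usr/lpp/mmfs/bin/mmfsadm dump waiters' \
            >> /tmp/waiters.$node.out
    done
    ---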
  • Theeraph
    Theeraph
    110 Posts

    Re: How to diagnose GPFS split brain?

    2009-11-25T07:43:37Z
    • dlmcnabb
    • 2009-11-24T15:37:33Z
    Hi,
    .
    After a while the system performance came back to normal...
    .
    OK, so the next time we have slow performance on GPFS, I will always collect "mmfsadm dump waiters" on all nodes...
    .
    Thank you very much,
    Theeraphong