IBM Support

IV97240: CAA: SLOW GOSSIP TRANSMISSION ON BOOT MAY CAUSE PARTITIONED CLUSTER

APAR status

  • Closed as program error.

Error description

  • After rebooting, one of the cluster nodes was not able
    to join the cluster environment.
    - lscluster on each node did not show the other node as
      UP.
    - After the system reboot, the syslog.caa log file showed
      a delay of over 2 minutes in receiving the first
      multicast gossip packet.
      When this delay occurs, the node creates its own
      cluster, ignoring the other node, which is already up.
      This leads to a split-brain / partitioned cluster
      in the CAA environment.
    

Local fix

  • n/a
    

Problem summary

  • If a node is rebooted and, due to network issues, fails to
    receive a gossip packet from the other UP nodes within twice
    the node_down_delay, it could form a cluster by itself,
    causing a split-brain (partitioned cluster).
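
The timing condition in the summary can be illustrated with a small sketch. This is not CAA code: the function name joins_alone and the sample values are hypothetical, and the only facts taken from this APAR are the "twice node_down_delay" window and the observed delay of over 2 minutes.

```python
from typing import Optional


def joins_alone(first_gossip_delay_s: Optional[float],
                node_down_delay_s: float) -> bool:
    """Model of the pre-fix behaviour described in the summary.

    first_gossip_delay_s: seconds from boot until the first gossip
        packet arrived, or None if none arrived (hypothetical input).
    node_down_delay_s: the node_down_delay tunable, in seconds.
    """
    # The rebooted node waits twice node_down_delay for a gossip packet.
    wait_window_s = 2 * node_down_delay_s
    if first_gossip_delay_s is None:
        return True
    # A packet arriving after the window is too late: the node has
    # already formed its own cluster.
    return first_gossip_delay_s > wait_window_s


# Illustrative values only: a 130 s delay against an assumed 10 s tunable.
print(joins_alone(130.0, 10.0))
print(joins_alone(5.0, 10.0))
```

With these assumed values the first call models the failing case (gossip arrives well after the window) and the second models a normal join.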
    

Problem conclusion

  • There is a gate at which all initial clusterwide lock requests
    should consider the count of nodes heartbeating to the
    repository in addition to those gossiping over the network.
    There was a hole in this gate, and the fix closes it.
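
One possible reading of the gate described above is sketched below. The actual CAA implementation is not shown in this APAR, so every name here (PeerView, visible_peers, may_proceed_alone) is hypothetical, and the rule "proceed alone only when no peer is visible on either channel" is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerView:
    """Peers this node can currently see (hypothetical structure)."""
    gossiping_over_network: frozenset
    heartbeating_to_repository: frozenset


def visible_peers(view: PeerView) -> frozenset:
    # Count peers seen on either channel, not only network gossip.
    return view.gossiping_over_network | view.heartbeating_to_repository


def may_proceed_alone(view: PeerView) -> bool:
    """Assumed gate: an initial clusterwide lock request may treat this
    node as the only member only if no peer is visible at all."""
    return len(visible_peers(view)) == 0


# Gossip is delayed, but the other node is heartbeating to the repository.
delayed_gossip = PeerView(frozenset(), frozenset({"nodeB"}))
print(may_proceed_alone(delayed_gossip))
```

In this sketch, a node whose gossip is delayed still sees its peer through the repository heartbeat, so it does not go on to form a separate cluster.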
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV97240

  • Reported component name

    AIX V7.1

  • Reported component ID

    5765H4000

  • Reported release

    710

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-06-15

  • Closed date

    2017-06-15

  • Last modified date

    2017-10-13

  • APAR is sysrouted FROM one or more of the following:

    IV97148

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    AIX V7.1

  • Fixed component ID

    5765H4000

Applicable component levels

  • R710 PSY U872165

       UP17/10/13 I 1000

Document Information

Modified date:
18 April 2022