IBM Support

IV97265: CAA: SLOW GOSSIP TRANSMISSION ON BOOT MAY CAUSE PARTITIONED CLUST

APPLIES TO AIX 7100-04 17/09/29 PTF PECHANGE

APAR status

  • Closed as program error.

Error description

  • **************************************************************
    * USERS AFFECTED:
    * Systems running the AIX 7100-04 Technology Level
    * with bos.cluster.rte at the 7.1.4.30 or 7.1.4.31 level.
    **************************************************************
    * ERROR DESCRIPTION:
    * After rebooting a node in either a PowerHA or VIOS SSP cluster
    * using CAA, there is a chance that the node may create its own
    * cluster, causing a split-brain / partitioned cluster in the
    * CAA environment.
    *
    * This is more likely to be seen if the network is slow and
    * there is a delay in gossip packets being received by the
    * rebooted node.
    *
    * The effect of a split-brain / partitioned cluster can vary,
    * but in the worst cases PowerHA may react by bringing
    * resources online on multiple nodes at the same time, and
    * VIOS SSP may see the pool go down on one or more nodes.
    **************************************************************
    * RECOMMENDATION:
    * Install APAR IV97265.
    * Until the fix is available, an interim fix can be
    * downloaded from either
    * ftp://aix.software.ibm.com/aix/ifixes/iv97265/
    * or
    * https://aix.software.ibm.com/aix/ifixes/iv97265/
    * Installation of the ifix requires a reboot.
    **************************************************************
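
    As a rough illustration of the USERS AFFECTED criteria above, the
    following Python sketch tests whether a given bos.cluster.rte level is
    one of the two levels this APAR lists. The function name and the
    constant are hypothetical helpers, not part of any IBM tool. The level
    string is assumed to have been read by hand, for example from the
    output of "lslpp -L bos.cluster.rte" on the node. The check does not
    detect whether the interim fix is already installed.

```python
# Levels of bos.cluster.rte that this APAR lists as affected.
AFFECTED_LEVELS = {"7.1.4.30", "7.1.4.31"}


def is_affected_level(level: str) -> bool:
    """Return True if the given bos.cluster.rte level is listed as affected.

    Hypothetical helper: it only compares against the two levels named in
    the APAR text and knows nothing about installed interim fixes.
    """
    return level.strip() in AFFECTED_LEVELS


if __name__ == "__main__":
    # Example levels; replace with the level reported on the node.
    for level in ("7.1.4.30", "7.1.4.31", "7.1.4.32"):
        print(level, "affected" if is_affected_level(level) else "not listed")
```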
    

Local fix

  • n/a
    

Problem summary

  • PROBLEM SUMMARY:
    After rebooting a node in either a PowerHA or VIOS SSP
    cluster using CAA, there is a chance that the node may
    create its own cluster, causing a split-brain / partitioned
    cluster in the CAA environment.
    This is more likely to be seen if the network is slow and
    there is a delay in gossip packets being received by the
    rebooted node.
    The effect of a split-brain / partitioned cluster can vary,
    but in the worst cases PowerHA may react by bringing
    resources online on multiple nodes at the same time, and
    VIOS SSP may see the pool go down on one or more nodes.
    

Problem conclusion

  • There is a gate at which all initial cluster-wide lock
    requests should consider the count of nodes heartbeating to
    the repository in addition to those gossiping over the
    network. There was a hole in this gate, and the fix closes it.
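
    The idea behind the gate can be pictured with a toy model. This is
    not the CAA implementation, which this APAR does not describe. All
    names below are hypothetical, and the rule "act alone only when no
    other node is visible on either channel" is an assumption made purely
    for illustration. The sketch shows why counting repository
    heartbeats alongside network gossip matters when gossip is delayed.

```python
def visible_node_count(gossiping: set, heartbeating: set) -> int:
    """Count distinct peer nodes seen over the network or via the repository."""
    return len(gossiping | heartbeating)


def should_act_alone(gossiping: set, heartbeating: set) -> bool:
    """Toy gate (assumed rule): proceed alone only if no peer is visible
    on either channel."""
    return visible_node_count(gossiping, heartbeating) == 0


def should_act_alone_gossip_only(gossiping: set) -> bool:
    """Toy model of the 'hole': a decision that looks at gossip alone."""
    return len(gossiping) == 0


if __name__ == "__main__":
    # Rebooted node: gossip from nodeB has not arrived yet, but nodeB is
    # still heartbeating to the repository.
    gossiping, heartbeating = set(), {"nodeB"}
    print("gossip only:", should_act_alone_gossip_only(gossiping))
    print("both channels:", should_act_alone(gossiping, heartbeating))
```

    In this model the gossip-only check lets the rebooted node proceed
    as if it were alone, while the check that also counts repository
    heartbeats does not.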
    

Temporary fix

  •   *********
      * HIPER *
      *********
    

Comments

APAR Information

  • APAR number

    IV97265

  • Reported component name

    AIX V7.1

  • Reported component ID

    5765H4000

  • Reported release

    710

  • Status

    CLOSED PER

  • PE

    YesPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-06-16

  • Closed date

    2017-06-16

  • Last modified date

    2017-11-07

  • APAR is sysrouted FROM one or more of the following:

    IV97148

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    AIX V7.1

  • Fixed component ID

    5765H4000

Applicable component levels

  • R710 PSY U873626

       UP17/09/22 I 1000

Document Information

Modified date:
18 April 2022