IBM Support

IT24388: MQ FOR NONSTOP SERVER, REINITIALISATION OF MULTIPLE SLAVE REPOSITORY MANAGERS IN PARALLEL RESULTS IN DEADLOCK.

APAR status

  • Closed as program error.

Error description

  • MQ for NonStop Server: reinitialisation of multiple slave
    repository managers in parallel results in a deadlock. FDC
    RM220005 is logged together with AMQ9511 and AMQ9448.
    
    When the cache state of a slave repository manager is
    inconsistent with that of the master, the slave performs an
    automatic full reinitialisation. During initialisation the
    repository managers (repmans) perform a handshake with each
    other.
    
    When multiple slave repository managers reinitialise in
    parallel, this handshake results in a deadlock that prevents
    the slave repository managers from processing further updates.
    
    Restarting a particular repman invalidates the cluster cache on
    its CPU. The cluster cache is then unavailable, because the
    freshly started repman also hangs in the deadlock.
    

Local fix

  • The deadlock can be resolved by identifying the repmans
    participating in the deadlock using pstate open information:
    
       26 \CS3.$X12PC:6937755593                  Process       0
    0
               Current operations: Writeread
               Sync depth at open time was 0.
               Options at open time was x4000.
               Access mode is Read/Write Shared
    
    Usually all slaves but one are waiting on a writeread to one
    particular slave, which in turn waits for one of the others.
    
    To resolve the deadlock, stop the slaves that are waiting for
    that same slave one by one, and stop the slave the others are
    waiting for last.
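
The stop order described above can be sketched in a few lines of Python. This is only an illustration, not an IBM tool: the `waits_for` mapping is assumed to be transcribed by hand from the pstate open information, and the process names are hypothetical.

```python
from collections import Counter


def stop_order(waits_for):
    """Return the order in which to stop the deadlocked slave repmans.

    waits_for maps each slave process name to the slave it is blocked
    on in a writeread (assumption: collected manually from pstate).
    """
    if not waits_for:
        return []
    # The slave that most of the others are waiting for is stopped last.
    hub, _ = Counter(waits_for.values()).most_common(1)[0]
    waiters = sorted(p for p, target in waits_for.items() if target == hub)
    return waiters + [hub]


# Hypothetical process names, for illustration only.
example = {"$RM01": "$RM03", "$RM02": "$RM03", "$RM03": "$RM01"}
print(stop_order(example))  # ['$RM01', '$RM02', '$RM03']
```

The sketch only orders the names; stopping the processes is still done manually, one at a time.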
    

Problem summary

  • Every slave amqrrmfa process introduces itself on
    initialization to the master and to the other slave amqrrmfa
    processes. When multiple slave instances reinitialize in
    parallel, each one tries to introduce itself to all the others.
    As the other slaves are also initializing, they do not process
    these requests and do not reply until they have completed
    initialization, so none of them can complete.
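
The circular wait in this summary can be modelled as a wait-for graph. The following Python sketch is a simplified model, not the amqrrmfa implementation: it assumes each initializing slave is blocked on exactly one peer at a time and identifies slaves by hypothetical CPU numbers.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from start and return the cycle, if any.

    waits_for maps a slave (here: its CPU number) to the peer whose
    reply it is currently blocked on.
    """
    seen = []
    node = start
    while node in waits_for:
        if node in seen:
            # The chain has returned to a slave already visited.
            return seen[seen.index(node):]
        seen.append(node)
        node = waits_for[node]
    return []  # The chain ended at a slave that is not blocked.


# Three slaves reinitializing in parallel (hypothetical CPU numbers).
print(find_cycle({0: 1, 1: 2, 2: 0}, start=0))  # [0, 1, 2]
# If the slave on CPU 2 has finished initializing, there is no cycle.
print(find_cycle({0: 1, 1: 2}, start=0))  # []
```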
    

Problem conclusion

  • The code was changed to implement a timeout that applies when
    a slave tries to handshake with another slave running on a CPU
    with a larger ordinal number than its own.
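
The ordering rule in this conclusion can be illustrated with a short Python sketch. It is a model under stated assumptions, not the product code: the timeout value is hypothetical, and the APAR does not say what a slave does after the timeout expires.

```python
def handshake_timeout(own_cpu, peer_cpu, timeout_seconds=5.0):
    """Return the wait limit for a handshake, or None for an unbounded wait.

    timeout_seconds is a hypothetical value; the APAR does not state one.
    """
    return timeout_seconds if peer_cpu > own_cpu else None


def unbounded_waits(cpus):
    """List the (waiter, peer) pairs that may still wait without a limit."""
    return [
        (a, b)
        for a in cpus
        for b in cpus
        if a != b and handshake_timeout(a, b) is None
    ]


pairs = unbounded_waits([0, 1, 2, 3])
# Every remaining unbounded wait points from a higher CPU to a lower one.
print(all(waiter > peer for waiter, peer in pairs))  # True
```

Because every wait without a limit points toward a lower CPU number, such waits alone cannot form a cycle, so a circular wait can no longer persist indefinitely.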
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT24388

  • Reported component name

    WEBS MQ NSS ITA

  • Reported component ID

    5724A3902

  • Reported release

    531

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-03-15

  • Closed date

    2018-07-13

  • Last modified date

    2018-07-13

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WEBS MQ NSS ITA

  • Fixed component ID

    5724A3902

Applicable component levels

Document Information

Modified date:
13 July 2018