IBM Support

PI59634: MQ TAKING A LONG TIME FOR A BACKWARD RECOVERY AFTER A CRASH IN THE Z/OS SYSTEM.

A fix is available

APAR status

  • Closed as program error.

Error description

  • The customer experienced a crash in their z/OS system
    (not related to MQ).  During the startup of MQ,
    the backward recovery process was taking a long time
    to complete.
    .
    The root cause of the slow recovery process was a
    combination of two aspects of the work which
    needed recovering, which caused recovery processing
    to go through logic whose performance degrades as
    the volume of data to recover increases.
    .
    The first aspect is the volume of data which needs
    to be processed to recover MQ.
    In this particular system, there was a significant
    amount of work which had been performed since the
    last checkpoint, which meant that a large log range
    needed to be read to complete the recovery.
    The checkpoint frequency is controlled by the size
    of the MQ log datasets (a checkpoint is taken each
    time a log is filled) and the LOGLOAD parameter
    which controls how much logging can be done before a
    checkpoint is taken. In this customer scenario,
    LOGLOAD was set to 9,000,000. This value is large
    enough that it is unlikely to be reached before a
    log dataset fills.
    .
    The second aspect of the workload was the use of
    XA transaction coordination.
    It is the recovery logic related to the handling
    of this style of UOW which has the performance
    problem.
    .
    The issue that this combination exposed was
    that the recovery processing builds a list of
    XA transactions it has seen during the forward
    recovery phase, and for some log record types
    it scans this list to see if the record can be
    associated with a corresponding XA transaction.
    
    As the list grows, the processing required to
    traverse the list for each log record increases.
    Where a large number of log records need to be
    processed and they contain a large number of XA
    transactions, this increasing CPU cost causes
    the recovery processing to become slower as
    the processing of the log proceeds.
    
    
    Additional Symptom(s) Search Keyword(s):
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All users of WebSphere MQ for z/OS Version 8 *
    *                 Release 0 Modification 0.                    *
    ****************************************************************
    * PROBLEM DESCRIPTION: After abnormal queue manager            *
    *                      termination, the queue manager can take *
    *                      a long time to restart when using XA.   *
    *                      While restarting, the queue manager     *
    *                      shows high CPU usage.                   *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    During queue manager restart, MQ processes all MQ log records
    since the last checkpoint while recovering the state of
    transactions performed before the abnormal termination. For each
    XA transaction found, an XTE control block is created and
    chained from the XIT - this chain is searched for each control
    log record processed in order to determine if the log record
    relates to a known XA transaction. As the number of XA
    transactions performed since the last checkpoint increases, the
    time to perform this search increases greatly, leading to high
    CPU usage in CSQMLCUR, and extended delays during queue manager
    restart.
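
The growth in search time described above can be illustrated with a small cost model. The following Python sketch is an illustration only, not MQ code: the function name, the list used to stand in for the XTE chain, and the figure of three log records per transaction are all assumptions chosen for the example. It counts how many chain entries would be visited if every XA transaction seen since the last checkpoint stays on the chain and each related log record is matched by scanning from the head of the chain.

```python
# Hypothetical cost model of the restart-time chain scan.
# Names and workload figures are illustrative, not MQ internals.

def entries_visited(num_transactions, records_per_transaction=3):
    """Count chain entries visited when completed transactions stay chained."""
    chain = []      # stands in for the XTEs chained from the XIT
    visited = 0
    for txn_id in range(num_transactions):
        chain.append(txn_id)    # a new XA transaction is seen in the log
        for _ in range(records_per_transaction):
            # Assumed worst case: the record belongs to the newest entry,
            # so a scan from the head walks the whole chain to find it.
            visited += len(chain)
    return visited

small = entries_visited(1_000)
large = entries_visited(10_000)
print(small, large, round(large / small, 1))
```

In this model a tenfold increase in XA transactions produces roughly a hundredfold increase in entries visited, which is consistent with restart becoming slower as log processing proceeds.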
    

Problem conclusion

  • CSQMLCUR is changed to remove XTEs for XA transactions that have
    completed commit or abort processing earlier in queue manager
    restart processing, so that the XIT chain only contains XTEs for
    active XA units of work. This greatly reduces the number of XTEs
    chained, and consequently reduces the time required to scan the
    chain.
    000Y
    CSQMLCUR
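
The effect of removing completed XTEs can be sketched with a simplified model. This Python example is an illustration under stated assumptions, not the CSQMLCUR logic: the event format, the function name, and the workload of 5,000 short transactions are hypothetical. It replays begin and end events for XA transactions and reports the longest chain observed, with and without removal of completed entries.

```python
# Hypothetical model of pruning completed XA transactions from the chain.
# The names below are illustrative and are not taken from CSQMLCUR.

def peak_chain_length(events, prune_completed):
    """Replay (kind, txn_id) events and return the longest chain observed."""
    chain = []      # stands in for the XTEs chained from the XIT
    peak = 0
    for kind, txn_id in events:
        if kind == "begin":
            chain.append(txn_id)
        elif kind == "end" and prune_completed:
            # Commit or abort processing has completed, so drop the entry
            # and keep later scans limited to active units of work.
            chain.remove(txn_id)
        peak = max(peak, len(chain))
    return peak

# Assumed workload: each transaction completes before the next one begins.
events = []
for txn_id in range(5_000):
    events.append(("begin", txn_id))
    events.append(("end", txn_id))

unpruned_peak = peak_chain_length(events, prune_completed=False)
pruned_peak = peak_chain_length(events, prune_completed=True)
print(unpruned_peak, pruned_peak)
```

With this assumed workload the unpruned chain grows to one entry per transaction, while the pruned chain never holds more than the single active transaction, so each scan stays short regardless of how much log is processed.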
    

Temporary fix

  • *********
    * HIPER *
    *********
    

Comments

APAR Information

  • APAR number

    PI59634

  • Reported component name

    WMQ Z/OS 8

  • Reported component ID

    5655W9700

  • Reported release

    000

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2016-03-23

  • Closed date

    2016-04-26

  • Last modified date

    2016-06-02

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UI37315

Modules/Macros

  • CSQMLCUR
    

Fix information

  • Fixed component name

    WMQ Z/OS 8

  • Fixed component ID

    5655W9700

Applicable component levels

  • R000 PSY UI37315

       UP16/05/26 P F605

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
02 June 2016