IBM Support

PH40206: QUEUE MANAGER APPEARS TO BE IN A HUNG STATE / ALLIED ADDRESS SPACES HANG WHEN TERMINATED / MQ LATCHES FOR SHUNTING HELD

A fix is available

APAR status

  • Closed as program error.

Error description

  • Queue manager appears to be in a hung state.
    The STOP QMGR MODE(FORCE) and STOP CHINIT commands were accepted
    by the queue manager but not executed.
    
    The performance problem is a result of very inefficient I/O
    when shunting a UR from archive logs located on tape.
    
    The problem will occur when:
    
    - A long-running unit of work needs to be shunted.
    
    - Part of the log range required for shunting is only available
    in an archive log.
    
    - The archive log is a tape.
    
    
    In these circumstances, the shunt processing will result in a
    very inefficient pattern of reading data from the archive log.
    Every time we encounter a log record that spans a block
    boundary, we need to re-read a block we have already passed.
    This is done by reopening the archive log and reading from the
    end back to the block we're interested in. For a large log,
    this means that the queue manager will repeatedly read the end
    portion of the log as it has to keep traversing back to the
    current point.
    
    In testing, the IBM MQ Development z/OS Service team used a
    unit of work containing 300 puts of messages with 100,000 bytes
    of data each.
    
    Shunt processing for this unit of work took over 2 minutes,
    with continuous I/O.
    
    The nature of the issue means that as the size of the data
    being read increases, the workload increases in two ways:
    
    - For each additional block we need to process, we have to
    restart reading the log another time.
    
    - As processing moves further from the end of the log, each
    reread has to move through more data to get back to the point
    we were processing at.
    
    
    From this, we would expect the processing time to grow with
    the square of the size of the data range. In the customer's
    case there was a much larger log range to be processed, which
    would explain why the shunting was still running after hours
    of processing.
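
The growth pattern described above can be illustrated with a small model. The following Python sketch is not IBM MQ code; it only counts block reads under simplifying assumptions: every block boundary is crossed by a spanning log record, each reread starts at the end of the archive log and reads back to the current block, and the block size of 28672 bytes is an assumed value chosen for illustration. The function names are hypothetical.

```python
def blocks_for(data_bytes: int, block_size: int) -> int:
    """Number of blocks needed to hold data_bytes (ceiling division)."""
    return -(-data_bytes // block_size)


def reread_cost(num_blocks: int) -> int:
    """Total blocks read by rereads while processing num_blocks blocks.

    Assumption: a record spans every block boundary, and each reread
    reads from the end of the log back to the current block.
    """
    total = 0
    for current in range(1, num_blocks):
        # Distance from the end of the log back to block `current`.
        total += num_blocks - current
    return total


# Test scenario from the APAR: 300 puts of 100,000 bytes each.
ASSUMED_BLOCK_SIZE = 28672  # illustrative assumption, not taken from the APAR
n = blocks_for(300 * 100_000, ASSUMED_BLOCK_SIZE)
print(n, reread_cost(n))
# Doubling the data range roughly quadruples the reread work.
print(reread_cost(2 * n) / reread_cost(n))
```

In this model the total is n(n-1)/2 block reads for n blocks, which is the quadratic behaviour the description predicts.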
    .
    Additional keywords:
    latch
    

Local fix

  • The problem can be avoided by ensuring that the shunt task
    never needs to read from an archive log on tape. There are
    several ways this could be achieved:
    
    - Add extra active logs. Shunting will be triggered after a UR
    has been active for three checkpoints. At the latest, this
    means that it will start at the end of the third active log
    after the UR started. With 4 active logs, that means that there
    is a risk that the 4th log fills before shunting has completed
    and the first log gets reused. If you add more logs, this gives
    more head-room for shunting to finish before that happens.
    
    - Change the archive log configuration to use DASD, rather than
    tape. This would avoid the scenario completely, as different
    read logic is used for accessing data from DASD datasets.
    
    - Reduce the LOGLOAD setting to make checkpointing occur more
    frequently. This option may be easier to implement in the
    short-term than the other two. If LOGLOAD is lowered so that MQ
    is taking checkpoints part-way through an active log, rather
    than just at the end of the log, that would mean that
    long-running URs would be shunted earlier. While this may
    result in more instances of shunting, it would significantly
    reduce the risk of a shunt process requiring data from the
    archive logs, as the shunt would start while the UR only covers
    1 or 2 logs.
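
The reasoning behind these options can be sketched as a rough headroom estimate. The Python below is a simplified model, not a statement of how IBM MQ schedules checkpoints: it assumes checkpoints are evenly spaced with one at every log switch, that shunting starts at the third checkpoint after the UR begins, and that the UR starts at the worst possible point in a log. The function and parameter names are hypothetical, and checkpoints_per_log stands in for the effect of lowering LOGLOAD.

```python
def logs_spanned_at_shunt(checkpoints_per_log: int) -> int:
    """Worst-case number of active logs a UR covers when shunting starts."""
    if checkpoints_per_log < 1:
        raise ValueError("checkpoints_per_log must be at least 1")
    # Worst case: the UR starts in the final checkpoint interval of a log,
    # so the third checkpoint falls (checkpoints_per_log + 2) intervals
    # from the start of that log. Ceiling division converts to whole logs.
    return -(-(checkpoints_per_log + 2) // checkpoints_per_log)


def headroom_logs(active_logs: int, checkpoints_per_log: int) -> int:
    """Active logs still unused by the UR when shunting starts."""
    return active_logs - logs_spanned_at_shunt(checkpoints_per_log)


for logs, per_log in [(4, 1), (6, 1), (4, 2)]:
    print(logs, per_log, logs_spanned_at_shunt(per_log),
          headroom_logs(logs, per_log))
```

Under these assumptions, 4 active logs with one checkpoint per log leave a single log of headroom, while adding active logs or checkpointing part-way through each log increases it.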
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All users of IBM MQ for z/OS Version 9       *
    *                 Release 1 Modification 0 and Release 2       *
    *                 Modification 0.                              *
    ****************************************************************
    * PROBLEM DESCRIPTION: Very busy QMGRs which offload active    *
    *                      logs to tape data sets may experience   *
    *                      performance problems when using low     *
    *                      numbers of active logs.                 *
    *                                                              *
    *                      This may manifest as delays to          *
    *                      recoverable actions, in addition to     *
    *                      large amounts of I/O to archive log     *
    *                      data sets.                              *
    ****************************************************************
    Log data from long-running units of work (UOW) which span
    multiple log data sets will be shunted forward into more recent
    logs. Depending on the amount of shunting required, and the
    number of active log data sets, the shunting task may be
    required to read from archive log data sets.
    
    The processing used to read from tape archive log data sets is
    extremely inefficient when reading large log records, and
    results in the tape being rewound many times.
    

Problem conclusion

  • The logic for shunting log data from tape archive log data sets
    has been improved. This improves the performance of reading
    large log records from tape archive logs in certain scenarios.
    

Temporary fix

  • 
    

Comments

  • 
    

APAR Information

  • APAR number

    PH40206

  • Reported component name

    IBM MQ Z/OS V9

  • Reported component ID

    5655MQ900

  • Reported release

    100

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-08-29

  • Closed date

    2022-04-01

  • Last modified date

    2022-05-03

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UI79981 UI79982

Modules/Macros

  • CSQJR103 CSQRSHUN
    

Fix information

  • Fixed component name

    IBM MQ Z/OS V9

  • Fixed component ID

    5655MQ900

Applicable component levels

  • R100 PSY UI79982

       UP22/04/15 P F204

  • R200 PSY UI79981

       UP22/04/13 P F204

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
07 June 2022