IBM Support

PK24575: LOOP OR HANG WHEN PERFORMING RECOVERY OF 20 AREAS

APAR status

  • Closed as program error.

Error description

  • When recovery of 20 DEDB areas was performed using DRF PITR,
    the processing CPU time exceeded 14 hours; the job was judged
    to be in a hang or loop and was canceled.
    This problem can also occur for Full Function DBs. It occurs
    when large amounts of log data are processed, causing data to
    be spilled to dataspaces after all buffers have been filled.
    This is a data integrity issue.
    Additional symptoms: Fixes problem reported in R210 APAR
    PK30937
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All users of IMS Database Recovery Facility  *
    *                 Version 3 Release 1 running recovery to any  *
    *                 prior point in time PITR.                    *
    ****************************************************************
    * PROBLEM DESCRIPTION: While running a PITR recovery, the DRF  *
    *                      master address space runs into a loop   *
    *                      or hang condition.                      *
    ****************************************************************
    * RECOMMENDATION: INSTALL CORRECTIVE SERVICE FOR APAR/PTF      *
    ****************************************************************
    PITR results in a hang or endless loop.  This occurs when the
    amount of log data is a high enough percentage of private
    storage that it needs to be moved to data spaces.
    
    As part of the problem determination for the originally
    reported problem, it was discovered that under some high load
    circumstances, log data is lost as part of PITR processing.
    In addition, it was discovered that SDEP log records were not
    processed correctly and SDEP information was lost, resulting in
    a data integrity problem.  The SDEP problem occurs when updates
    to the SDEPs cause them to wrap.
    
    As part of testing the fix, an ABENDS40D was encountered
    because of a shortage of available private storage.
    

Problem conclusion

  • AIDS: RIDS/UTIL RIDS/DBS DBS/UTIL
      DEP: NONE
      GEN:
    
    *** END IMS KEYWORDS ***
    The initial problem reported by the customer, a hang or
    endless loop during large recoveries, turned out to be several
    different problems.  While testing the fix, additional problems
    were encountered.  All of them are fixed as documented below.
    
    First, buffer contention caused a hang and is fixed by
    separating the buffers into two pools: one for log read and
    one for buffer send to the subordinate address spaces.  A loop
    is fixed by using awes from the awe pool instead of local
    module storage.
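
    The buffer contention hang can be pictured with a small model.
    The sketch below is illustrative only and is written in Python
    rather than the product's own language. The BufferPool class,
    its methods, and the buffer counts are hypothetical; only the
    names bufferread and buffersend come from the FRXCON change
    listed in this APAR. It shows how one shared pool lets log
    reading take every buffer and starve the sender, while two
    pools keep a reserve for each use.

```python
from collections import deque


class BufferPool:
    """Fixed-size pool of reusable buffers (hypothetical model, not DRF code)."""

    def __init__(self, name, count, size=4096):
        self.name = name
        self._free = deque(bytearray(size) for _ in range(count))

    def try_get(self):
        """Return a free buffer, or None when the pool is exhausted."""
        return self._free.popleft() if self._free else None

    def release(self, buf):
        """Return a buffer to the pool."""
        self._free.append(buf)

    def free_count(self):
        return len(self._free)


# One shared pool: log reading drains every buffer, so the sender gets none.
shared = BufferPool("shared", 4)
held_by_reader = [shared.try_get() for _ in range(4)]
send_buf_shared = shared.try_get()  # None: nothing left for buffer send

# Two pools: log reading can exhaust only its own pool.
read_pool = BufferPool("bufferread", 2)
send_pool = BufferPool("buffersend", 2)
held_by_reader_split = [read_pool.try_get() for _ in range(2)]
send_buf_split = send_pool.try_get()  # still succeeds
```

    In the shared case the sender cannot make progress until the
    reader releases a buffer, and the reader may itself be waiting
    on the sender, which is one way such a hang can arise.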
    
    The missing data is fixed in three major ways.  First, spill
    management is fixed to always return the data spilled, not
    extraneous data due to residual data in the token on a spill.
    If the token on a spill is non-zero, the spill manager
    interprets the request as a retrieve, and the spilled data is
    lost because the caller does not expect to get data back from
    a spill request.  Second, buffers were lost when awes in local
    storage were enqueued and the thread did not wait.  When
    control returned from the module enqueuing the awe, the awe
    storage was reused.  The awe field was cleared at times,
    causing the enqueued buffer to be lost.
    Third, fast path SDEP processing is fixed.  PK11200 added
    support for the LOG5957 record, but the support was not
    complete: it did not use FRXRRR to copy the LOG5957 from the
    input buffer to the buffer sent to the subordinate address
    space.  Also, processing the LOG5957 depended on a UOR token,
    but the LOG5957 does not contain a UOR token.  As a result, the
    LOG5957 was sent to random subordinate address spaces and not
    applied to the intended area.
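
    The residual token problem can be illustrated with a minimal
    model. This is a sketch under stated assumptions, not the
    FRXWSPL0 or FRXWSPM0 interface, which this APAR does not
    describe. The SpillManager class, the request block layout,
    and the spill helper are all hypothetical. The only behavior
    taken from the text is that a non-zero token makes the spill
    manager treat the request as a retrieve, and that clearing the
    token before a spill avoids the loss.

```python
class SpillManager:
    """Hypothetical spill manager: token 0 means spill, non-zero means retrieve."""

    def __init__(self):
        self._store = {}
        self._next_token = 1

    def request(self, token, data=None):
        if token != 0:
            # Treated as a retrieve; any data passed in is ignored.
            return ("retrieved", self._store.pop(token, None))
        token = self._next_token
        self._next_token += 1
        self._store[token] = data
        return ("spilled", token)

    def stored_count(self):
        return len(self._store)


def spill(manager, request_block, data, clear_token):
    """Issue a spill request, optionally clearing a residual token first."""
    if clear_token:
        request_block["token"] = 0  # the fix: never spill with a stale token
    return manager.request(request_block["token"], data)


mgr = SpillManager()
stale_block = {"token": 7}  # residual value left over from earlier use

# Without clearing, the spill is misread as a retrieve and the data is dropped.
lost = spill(mgr, stale_block, b"log data", clear_token=False)

# With the token cleared, the data is stored and a new token is returned.
kept = spill(mgr, stale_block, b"log data", clear_token=True)
```

    In this model the first call stores nothing, so the caller's
    data is gone even though no error is reported.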
    
    The ABENDS40D is fixed by separating the buffer pool into two
    pools: one for read and one for buffer send to subordinate
    address spaces.  It is also fixed by sending all log data
    through FRXQBUF0 and FRXLMRG0 so that FRXUORM0 processes the
    data in order.  This way, the UOR related storage can be
    released when end of UOR notification is encountered.  The hash
    table is reduced from a maximum of five levels to one hash
    table.  This significantly reduces the private storage
    utilization for extremely large recoveries.
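
    The storage benefit of in-order processing can be sketched as
    follows. This is an illustrative model only: the record tuples,
    the process_in_order function, and the use of a Python dict as
    the single-level hash table are assumptions and do not reflect
    the FRXUORM0 or FRXURHS data structures. It shows pending
    updates held per UOR token and released as soon as the end of
    UOR notification for that token arrives.

```python
def process_in_order(records):
    """Group updates by UOR token and release each group at end of UOR.

    `records` is an iterable of (uor_token, kind, payload) tuples, where
    kind is "update" or "end". Returns (completed, pending, peak), where
    peak is the largest number of updates held at any one time.
    """
    pending = {}    # single-level table keyed by UOR token
    completed = []
    peak = 0
    for uor, kind, payload in records:
        if kind == "update":
            pending.setdefault(uor, []).append(payload)
        elif kind == "end":
            # End of UOR: hand off the updates and free the table entry.
            completed.append((uor, pending.pop(uor, [])))
        peak = max(peak, sum(len(v) for v in pending.values()))
    return completed, pending, peak


log = [
    ("A", "update", 1),
    ("B", "update", 2),
    ("A", "update", 3),
    ("A", "end", None),
    ("B", "end", None),
]
completed, pending, peak = process_in_order(log)
```

    Because entries are removed at each end of UOR, the amount
    held at any moment is bounded by the UORs still in flight
    rather than by the total size of the log.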
    
    The timestamp was never expected to contain hex zeroes, so no
    check for timestamp validity was done.  Some vendor products
    may place x'F0' in the timestamp, so validity checking is added
    with the fix.
    
    The code is changed in the following parts:
    
    Parameter change for multiple buffer types
    FRXCAMG0, FRXCBDM0, FRXICLI0, FRXICTL0, FRXLMRG0, FRXMNP,
    FRXMSTR1, FRXPDIR0, FRXPDIS0, FRXPDSR0, FRXPDSS0, FRXPSDR0,
    FRXPSDS0, FRXQBUF0, FRXRBUF0, FRXRCTL0, FRXRDTH0, FRXUORM0,
    FRXHBUF0, FRXBDMG0, FRXMTC
    
    Dump formatter recompile
    FRXADF00, FRXADM10, FRXADM20
    
    FRXBDCB0  Add rvur as a fixed length control block for
              performance enhancement on get/release rvur storage
    
    FRXBDMG0 and FRXMTC are changed to add MSG FRD2892I.
    
    FRXCON    define buffersend and bufferread
    FRXGFST   support added to release buffers from multiple pool
              types
    FRXHBUF0  Add support for multiple buffer pool types
    FRXLMRG0  fix intermittent hang on end of read when end of log
              read is not propagated to unit of recovery manager due
              to timing window.  Add diagnostic count on buffer
              release.
    FRXMINI0 and FRXMSTR0
              move upper limit of buffer percentage of private
              storage to half instead of three quarters to avoid
              storage shortage and ABENDS40D
    FRXMSTR0  Clear storage before reusing to avoid endless loop
    FRXPDSS0  Process no-op notification from FRXUORM0 at end of
              data
    FRXQBUF0  send end of log data notification to FRXUORM0 if no
              log data sets are to be read for recovery to avoid
              hang.
              Separate the OLR buffer logic from the non-OLR logic.
    FRXRBUF0  Add diagnostic buffer counts.
              FRXRBUF0 is also modified to check the time stamp for
              a type x06 log record.  If the time stamp is zero,
              MSG FRD2892I is issued and recovery is abnormally
              terminated with ABENDU385 RSN00A.
    
    FRXRCTL0  Simplify the buffer freed process
    FRXRDTH0  Use awe from awe storage pool instead of local
              storage. Add diagnostic count for buffers.
    FRXRLRA0  Add logic to track uor token for 5950, 5937, and
              5938.
    FRXRRR    Add diagnostic count for each log record
    FRXRVCS   Add olr indicator flag
    FRXRVDL   Add data space free space diagnostic information
    FRXRVGB   Add diagnostic count fields for buffer and record
              counting
    FRXRVQB   Add support for separate olr code path in FRXQBUF0
    FRXRVUR   Add support for spill and add diagnostic fields for
              uor tracking
    FRXURHS   Reduce number of hash table levels to 1
    FRXUORM0  Add support to spill log data buffers on input if
              recovery running low on private storage. Fix lost
              buffer problem when spilling data.  Fix support
              for SDEPs (LOG5957) and copy the LOG5957 to the
              output buffers via FRXRRR calls instead of MVCL.
              Fix hang and endless loop on buffer pool storage
              contention by supporting multiple buffer pool types.
    FRXWSPL0  If request is to spill data, clear the remote token
              to avoid ABENDU0385 - RSN 0015 in FRXWSPL0
    FRXWSPM0  Add free space diagnostic field and tracking
    -
    DOCUMENTATION CHANGE FOR APAR PK24575
    THIS MAINTENANCE IS BEING HELD SO YOU WILL BE
    AWARE OF DOCUMENTATION CHANGE TO MANUAL(S):
    SC18940700
    -
    THE FOLLOWING TEXT DESCRIBES THE DOC CHANGE:
    A change has been made in IBM IMS Database Recovery Facility
    for z/OS, User's Guide and Reference, Version 3 Release 1,
    at page 103, chapter 7: Messages and Codes of IMS Database
    Recovery Facility.
    
    FRD2892I     reason IN LOG RECORD seqnum DETECTED IN dsname
    
    Explanation: Invalid record contents are detected for the log
    data set with the dsname during database data set recovery by
    the IMS Database Recovery Facility.  The message destination is
    the z/OS system console and the IMS master terminal. If the
    message is issued in batch mode, the message destination is the
    z/OS system console. The message is followed by an ABEND 385-00A.
    
    reason:      Identifies the problem and is one of the following:
                 Invalid time stamp
    seqnum:      The sequence number that identifies the log record
                 in the log data set. It can be used to determine
                 which record is bad.
    dsname:      The data set from which the log record was read.
    User Action: Examine the log record identified in the message
                 within the log data set listed in the message.
                 Use the IMS DFSLOG06 macro mapping of the log
                 record to determine the offset to the ACPRILOG
                 field. If this time stamp is zero, determine
                 if anything in your environment interacts with the
                 IMS Logger component initialization or termination
                 processing. If not, report this problem to IBM.
                 In any case, use the appropriate tool or procedure
                 to place the prilog time for the subsystem or batch
                 job which created the log in the ACPRILOG field of
                 the log record. Refer to the appropriate IMS
                 documentation for the format of the prilog time
                 stamp for the 06 log record.  Make sure the
                 06 log records at the beginning and end of the log
                 data set have the time stamp provided.
    
    System Action: The IMS Database Recovery Facility address space
                 terminates.
    
    Module:      FRXRBUF0
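
    The zero time stamp check described for message FRD2892I can
    be sketched as follows. This is a minimal illustration, not
    the FRXRBUF0 logic. The RecoveryAbend exception, the function
    name, the eight-byte time stamp length, the hexadecimal
    rendering of seqnum, and the exact casing of the reason text
    are assumptions. The caller is assumed to have already located
    the ACPRILOG field using the DFSLOG06 mapping. Only the message
    layout and the zero test come from the text above; whether
    other values such as x'F0' are also rejected is not stated and
    is not modeled here.

```python
class RecoveryAbend(Exception):
    """Hypothetical stand-in for the abnormal termination after FRD2892I."""


def check_x06_timestamp(timestamp: bytes, seqnum: int, dsname: str) -> None:
    """Raise RecoveryAbend if the time stamp of a type x06 record is all zero.

    `timestamp` holds the raw bytes of the time stamp field. An empty
    value is treated as invalid as well.
    """
    if not any(timestamp):
        message = (
            f"FRD2892I INVALID TIME STAMP IN LOG RECORD {seqnum:08X} "
            f"DETECTED IN {dsname}"
        )
        raise RecoveryAbend(message)


# A record with a populated time stamp passes without error.
check_x06_timestamp(bytes([0x20, 0x06, 0x12, 0x7F, 0, 0, 0, 0]), 1, "IMS.LOG01")

# A record with a zero time stamp is rejected.
try:
    check_x06_timestamp(bytes(8), 0x1A, "IMS.LOG01")
    rejected = None
except RecoveryAbend as exc:
    rejected = str(exc)
```

    The data set name IMS.LOG01 and the sample bytes are made up
    for the example.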
    

Temporary fix

  • *********
    * HIPER *
    *********
    

Comments

  • ***** PE07/09/26 FIX IN ERROR. SEE APAR PK52492 FOR DESCRIPTION
    

APAR Information

  • APAR number

    PK24575

  • Reported component name

    IMS DB RECOVERY

  • Reported component ID

    5655I4400

  • Reported release

    310

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2006-05-08

  • Closed date

    2006-10-26

  • Last modified date

    2007-10-24

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UK19165

Modules/Macros

  • FRXADF00 FRXADM10 FRXADM20 FRXAWEX  FRXBDCB0
    FRXBDMG0 FRXCAMG0 FRXCBDM0 FRXCON   FRXDCB   FRXDDRF  FRXEDRF0
    FRXGFST  FRXGRPT0 FRXHBUF0 FRXICLI0 FRXICTL0 FRXLMRG0 FRXMINI0
    FRXMINI1 FRXMNP   FRXMSTR0 FRXMSTR1 FRXMTC   FRXPDIR0 FRXPDIS0
    FRXPDSR0 FRXPDSS0 FRXPSDR0 FRXPSDS0 FRXQBUF0 FRXRBUF0 FRXRCTL0
    FRXRDTH0 FRXRLRA0 FRXRRR   FRXRVCS  FRXRVDL  FRXRVGB  FRXRVQB
    FRXRVUR  FRXUORM0 FRXURHS  FRXVSTA0 FRXWSPL0 FRXWSPM0
    

Publications Referenced
SC18940700    

Fix information

  • Fixed component name

    IMS DB RECOVERY

  • Fixed component ID

    5655I4400

Applicable component levels

  • R310 PSY UK19165

       UP06/10/28 P F610

Document Information

Modified date:
24 October 2007