PK24575: LOOP OR HANG WHEN PERFORMING RECOVERY OF 20 AREAS

APAR status

Closed as program error.

Error description

When performing recovery of 20 DEDB areas
using DRF PITR, the processing CPU time exceeded 14 hours and
the problem was judged to be a hang or loop and was canceled.
This problem can occur for Full Function DBs. The problem occurs
when processing lots of log data, causing data to be spilled to
dataspaces after all buffers have been used/filled. This is a
data integrity issue.
Additional symptoms: Fixes problem reported in R210 APAR
PK30937

Local fix

Problem summary

****************************************************************
* USERS AFFECTED: All users of IMS Database Recovery Facility  *
*                 Version 3 Release 1 running recovery to any  *
*                 prior point in time PITR.                    *
****************************************************************
* PROBLEM DESCRIPTION: While running a PITR recovery, the DRF  *
*                      master address space runs into a loop   *
*                      or hang condition.                      *
****************************************************************
* RECOMMENDATION: INSTALL CORRECTIVE SERVICE FOR APAR/PTF      *
****************************************************************
PITR results in a hang or endless loop.  This occurs when the
amount of log data is a high enough percentage of private
storage that it needs to be moved to data spaces.

As part of the problem determination for the originally
reported problem, it was discovered that under some high load
circumstances, log data is lost as part of PITR processing.
In addition, it was discovered that SDEP log records were not
processed correctly and SDEP information was lost resulting in
a data integrity problem.  The SDEP problem occurs when updates
to the SDEPs cause them to wrap.

As part of testing the fix, an ABENDS40D was encountered because
of a shortage of available private storage.

Problem conclusion

AIDS: RIDS/UTIL RIDS/DBS DBS/UTIL
  DEP: NONE
  GEN:

*** END IMS KEYWORDS ***
The initial problem reported by the customer of a hang or
endless loop during large recoveries is fixed in the following
ways.  As part of testing the fix, multiple problems were
encountered and fixed as documented below.

The hang and endless loop ended up being several different
problems.  They are fixed in the following ways.

First, buffer contention caused a hang and is fixed by
separating the buffers into two pools.  One for log read and
one for buffer send to the subordinate address spaces.  A loop
is fixed by using awes from the awe pool instead of local
module storage.
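
The effect of splitting the pool can be illustrated with a minimal
sketch.  This is not the DRF implementation: the BufferPool class,
the buffer counts, and the buffer size are hypothetical, and only
the pool names bufferread and buffersend are taken from the FRXCON
entry in this APAR.

```python
from collections import deque


class BufferPool:
    """Fixed-size pool of reusable buffers (hypothetical model, not DRF code)."""

    def __init__(self, name, count, size):
        self.name = name
        self._free = deque(bytearray(size) for _ in range(count))

    def get(self):
        # Return a free buffer, or None when the pool is exhausted.
        return self._free.popleft() if self._free else None

    def release(self, buf):
        # Hand a buffer back to the pool for reuse.
        self._free.append(buf)

    def available(self):
        return len(self._free)


# One pool per purpose, so log reads cannot starve the send path.
# Counts and sizes are illustrative assumptions.
read_pool = BufferPool("bufferread", count=4, size=4096)
send_pool = BufferPool("buffersend", count=4, size=4096)

# Simulate the log reader taking every read buffer.
held = [read_pool.get() for _ in range(4)]
assert read_pool.get() is None  # the read pool is exhausted

# The send path still obtains a buffer because its pool is separate.
out = send_pool.get()
```

With a single shared pool, the last step would find no free buffer
and the sender would wait on buffers that the reader cannot release
until the sender makes progress.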

The missing data is fixed in three major ways.  First, spill
management is fixed to always return the data spilled, not
extraneous data due to residual data in the token on a spill.
If the token on a spill is non-zero, the spill manager
interprets the request as a retrieve and the spilled data is
lost because the caller does not expect to have data back from
a spill request.  The other way buffers were lost was through
awe's in local storage being enqueued and the thread not
waiting.  When control returned from the module enqueuing the
awe, the awe storage was reused.  The awe field was cleared at
times, so the buffer being enqueued was lost.
The last is for fast path SDEP processing.  PK11200 added
support for the LOG5957 record.  The support is not complete
and did not use FRXRRR to copy the LOG5957 from the input
buffer to the buffer sent to the subordinate address space.
Also, processing the LOG5957 depended on a UOR token but the
LOG5957 does not contain a UOR token.  The LOG5957 is sent to
random subordinate address spaces and not applied to the area
intended.
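
The spill token problem described above can be sketched as follows.
This is a toy model under stated assumptions: the SpillManager
class, the control block dictionary, and the token values are
hypothetical.  Only the rule that a non-zero token turns a spill
request into a retrieve, and the fix of clearing the token before a
spill, come from this APAR.

```python
class SpillManager:
    """Toy spill manager: token 0 means spill, non-zero means retrieve."""

    def __init__(self):
        self._store = {}
        self._next_token = 1

    def request(self, token, data=None):
        if token != 0:
            # Non-zero token: treated as a retrieve. Any data passed in
            # with the request is ignored, which is how spilled data was lost.
            return token, self._store.pop(token)
        # Zero token: spill the data and hand back a new token.
        new_token = self._next_token
        self._next_token += 1
        self._store[new_token] = data
        return new_token, None


def spill(manager, control_block, data):
    # Clear any residual token first so the request is not misread
    # as a retrieve (the approach described for FRXWSPL0).
    control_block["token"] = 0
    control_block["token"], _ = manager.request(control_block["token"], data)
    return control_block["token"]


mgr = SpillManager()
cb = {"token": 0}            # hypothetical control block
t1 = spill(mgr, cb, b"first")
# cb["token"] now holds the residual non-zero token t1.
t2 = spill(mgr, cb, b"second")

_, first = mgr.request(t1)   # retrieve by token
_, second = mgr.request(t2)
```

Without the clearing step in spill(), the second call would pass the
residual token, the manager would return the first buffer, and the
second buffer would never be stored.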

The ABENDS40D is fixed by separating the buffer pool into two
pools.  One for read and one for buffer send to subordinate
address spaces.  It is also fixed by sending all log data
through FRXQBUF0 and FRXLMRG0 to have FRXUORM0 process the data
in order.  This way, the UOR related storage can be released
when end of UOR notification is encountered.  The hash table is
reduced from a maximum of five levels to one hash table.  This
significantly reduces the private storage utilization for
extremely large recoveries.
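
A minimal sketch of the storage pattern described above, assuming a
single-level table keyed by UOR token whose entries are released at
end of UOR.  The UorTable class, its method names, and the token and
record values are hypothetical and do not reflect the FRXUORM0 or
FRXURHS internals.

```python
class UorTable:
    """Single-level table of in-flight units of recovery (illustrative only)."""

    def __init__(self):
        # One hash table: UOR token -> log records seen so far, in order.
        self._by_token = {}

    def add(self, token, record):
        # Records arrive in log order, so appending preserves that order.
        self._by_token.setdefault(token, []).append(record)

    def end_of_uor(self, token):
        # Remove the entry and return its records; the table no longer
        # holds any storage for this UOR afterwards.
        return self._by_token.pop(token, [])

    def active(self):
        return len(self._by_token)


table = UorTable()
table.add("UOR1", "update A")
table.add("UOR2", "update B")
table.add("UOR1", "update C")

# End-of-UOR notification for UOR1: its records are handed off and freed.
done = table.end_of_uor("UOR1")
```

Because all records pass through one ordered path, the end-of-UOR
notification is seen only after every record of that UOR, which is
what makes the early release safe in this sketch.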

The timestamp was never expected to have hex zeroes, so a check
for timestamp validity was not done. Some vendor products may place
x'F0' in the timestamp so validity checking is done with the
fix.
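
The zero time stamp check described for FRXRBUF0 in the parts list
can be sketched as below.  The function name, the exception class,
the 8-byte field length, and the data set name are assumptions made
for illustration.  The message layout follows the FRD2892I text in
the documentation change for this APAR.

```python
class InvalidLogRecord(Exception):
    """Stands in for message FRD2892I followed by abnormal termination."""


def check_timestamp(timestamp: bytes, seqnum: int, dsname: str) -> None:
    # A time stamp consisting only of x'00' bytes is treated as invalid.
    if not any(timestamp):
        raise InvalidLogRecord(
            f"FRD2892I Invalid time stamp IN LOG RECORD {seqnum} "
            f"DETECTED IN {dsname}"
        )


# A non-zero time stamp passes silently (the value is arbitrary).
check_timestamp(b"\x20\x06\x12\x90\x00\x00\x00\x00", 1, "HYPOTHET.LOG.DS")

# An all-zero time stamp is rejected (field length of 8 is an assumption).
try:
    check_timestamp(bytes(8), 42, "HYPOTHET.LOG.DS")
    message = None
except InvalidLogRecord as err:
    message = str(err)
```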

The code is changed in the following parts:

Parameter change for multiple buffer types
FRXCAMG0, FRXCBDM0, FRXICLI0, FRXICTL0, FRXLMRG0, FRXMNP,
FRXMSTR1, FRXPDIR0, FRXPDIS0, FRXPDSR0, FRXPDSS0, FRXPSDR0,
FRXPSDS0, FRXQBUF0, FRXRBUF0, FRXRCTL0, FRXRDTH0, FRXUORM0,
FRXHBUF0, FRXBDMG0, FRXMTC

Dump formatter recompile
FRXADF00, FRXADM10, FRXADM20

FRXBDCB0  Add rvur as a fixed length control block for
          performance enhancement on get/release rvur storage

FRXBDMG0 and FRXMTC are changed to add MSGFRD2892I.

FRXCON    define buffersend and bufferread
FRXGFST   support added to release buffers from multiple pool
          types
FRXHBUF0  Add support for multiple buffer pool types
FRXLMRG0  fix intermittent hang on end of read when end of log
          read is not propagated to unit of recovery manager due
          to timing window.  Add diagnostic count on buffer
          release.
FRXMINI0 and FRXMSTR0
          move upper limit of buffer percentage of private
          storage to half instead of three quarters to avoid
          storage shortage and ABENDS40D
FRXMSTR0  Clear storage before reusing to avoid endless loop
FRXPDSS0  Process no-op notification from FRXUORM0 at end of
          data
FRXQBUF0  send end of log data notification to FRXUORM0 if no
          log data sets are to be read for recovery to avoid
          hang.
          Separate the OLR buffer logic from the non-OLR logic.
FRXRBUF0  Add diagnostic buffer counts.
          FRXRBUF0 is also modified to check the time stamp for
          a type x06 log record.  If the time stamp is zero,
          MSG FRD2892I is issued and recovery is abnormally
          terminated with ABENDU385 RSN00A.

FRXRCTL0  Simplify the buffer freed process
FRXRDTH0  Use awe from awe storage pool instead of local
          storage. Add diagnostic count for buffers.
FRXRLRA0  Add logic to track uor token for 5950, 5937, and
          5938.
FRXRRR    Add diagnostic count for each log record
FRXRVCS   Add olr indicator flag
FRXRVDL   Add data space free space diagnostic information
FRXRVGB   Add diagnostic count fields for buffer and record
          counting
FRXRVQB   Add support for separate olr code path in FRXQBUF0
FRXRVUR   Add support for spill and add diagnostic fields for
          uor tracking
FRXURHS   Reduce number of hash table levels to 1
FRXUORM0  Add support to spill log data buffers on input if
          recovery running low on private storage. Fix lost
          buffer problem when spilling data.  Fix support
          for SDEPs (LOG5957) and copy the LOG5957 to the
          output buffers via FRXRRR calls instead of MVCL.
          Fix hang and endless loop on buffer pool storage
          contention by supporting multiple buffer pool types.
FRXWSPL0  If request is to spill data, clear the remote token
          to avoid ABENDU0385 - RSN 0015 in FRXWSPL0
FRXWSPM0  Add free space diagnostic field and tracking
-
DOCUMENTATION CHANGE FOR APAR PK24575
THIS MAINTENANCE IS BEING HELD SO YOU WILL BE
AWARE OF DOCUMENTATION CHANGE TO MANUAL(S):
SC18940700
-
THE FOLLOWING TEXT DESCRIBES THE DOC CHANGE:
A change has been made in IBM IMS Database Recovery Facility
for z/OS, User's Guide and Reference, Version 3 Release 1,
at page 103, chapter 7: Messages and Codes of IMS Database
Recovery Facility.

FRD2892I     reason IN LOG RECORD seqnum DETECTED IN dsname

Explanation: Invalid record contents are detected for the log
data set with the dsname during database data set recovery by
the IMS Database Recovery Facility.  The message destination is
the z/OS system console and the IMS master terminal. If the
message is issued in batch mode, the message destination is the
z/OS system console. The message is followed by an ABEND 385-00A.

reason:      Identifies the problem and is one of the following:
             Invalid time stamp
seqnum:      The sequence number that identifies the log record
             in the log data set. It can be used to determine
             which record is bad.
dsname:      The data set from which the log record was read.
User Action: Examine the log record identified in the message
             within the log data set listed in the message.
             Use the IMS DFSLOG06 macro mapping of the log
             record to determine the offset to the ACPRILOG
             field. If this time stamp is zero, determine
             if anything in your environment interacts with the
             IMS Logger component initialization or termination
             processing. If not, report this problem to IBM.
             In any case, use the appropriate tool or procedure
             to place the prilog time for the subsystem or batch
             job which created the log in the ACPRILOG field of
             the log record. Refer to the appropriate IMS
             documentation  for the format of the prilog time
             stamp for the 06 log record.  Make sure the
             06 log records at the beginning and end of the log
             data set have the time stamp provided.

System Action: The IMS Database Recovery Facility address space
             terminates.

Module:      FRXRBUF0

Temporary fix

```
*********
* HIPER *
*********
```

Comments

**** PE07/09/26 FIX IN ERROR. SEE APAR PK52492  FOR DESCRIPTION

APAR Information

APAR number
PK24575
Reported component name
IMS DB RECOVERY
Reported component ID
5655I4400
Reported release
310
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2006-05-08
Closed date
2006-10-26
Last modified date
2007-10-24

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

UK19165

Modules/Macros

FRXADF00 FRXADM10 FRXADM20 FRXAWEX  FRXBDCB0
FRXBDMG0 FRXCAMG0 FRXCBDM0 FRXCON   FRXDCB   FRXDDRF  FRXEDRF0
FRXGFST  FRXGRPT0 FRXHBUF0 FRXICLI0 FRXICTL0 FRXLMRG0 FRXMINI0
FRXMINI1 FRXMNP   FRXMSTR0 FRXMSTR1 FRXMTC   FRXPDIR0 FRXPDIS0
FRXPDSR0 FRXPDSS0 FRXPSDR0 FRXPSDS0 FRXQBUF0 FRXRBUF0 FRXRCTL0
FRXRDTH0 FRXRLRA0 FRXRRR   FRXRVCS  FRXRVDL  FRXRVGB  FRXRVQB
FRXRVUR  FRXUORM0 FRXURHS  FRXVSTA0 FRXWSPL0 FRXWSPM0

*Publications Referenced*
SC18940700

Fix information

Fixed component name
IMS DB RECOVERY
Fixed component ID
5655I4400

Applicable component levels

R310 PSY UK19165
UP06/10/28 P F610

Document Information

Modified date:
24 October 2007
