PM82843: DBRC PRA EXCESSIVE RETRY MITIGATION ENHANCEMENT

A fix is available

APAR status

Closed as new function.

Error description

When running with Parallel RECON Access (PRA), RLS timeouts or
deadlocks can occur and repeat seemingly endlessly
when lock promotions are needed to update a RECON record. This
can severely impact IMS and utility processing.

Local fix

Problem summary

****************************************************************
* USERS AFFECTED: All users of IMS Version 13 with Parallel    *
*                 RECON Access enabled.                        *
****************************************************************
* PROBLEM DESCRIPTION: Excessive RECON contention can occur    *
*                      when a large number of DBRC requests    *
*                      doing similar actions are processed in  *
*                      parallel. This can result in excessive  *
*                      MSGDSP1184W messages and elongate the   *
*                      processing time.                        *
****************************************************************
* RECOMMENDATION: INSTALL CORRECTIVE SERVICE FOR APAR/PTF      *
****************************************************************
When PRA is enabled, the current DBRC logic issues a VSAM GET
that obtains a shared lock. If it then needs to update the
record, it issues a subsequent GET for update, which obtains an
exclusive lock, and then writes the changed record.
If several DBRC instances make similar requests in this manner
and try to update the same record, deadlocks or timeouts can
occur because each holds a shared lock and each is trying to
obtain an exclusive lock for update. At this point the victims
retry their processing. If the number of instances involved is
large enough and the timing is just right, this can result in
endless (or seemingly endless) retries.
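
The lock promotion problem described above can be illustrated
with a small model. This is only a sketch under stated
assumptions: RecordLock, promotion_deadlocks and exclusive_first
are hypothetical names, and the model ignores how RLS actually
queues waiters and picks victims.

```python
from typing import List, Optional, Set


class RecordLock:
    """Toy lock for one record: many sharers, or one exclusive owner."""

    def __init__(self) -> None:
        self.sharers: Set[str] = set()
        self.owner: Optional[str] = None

    def get_shared(self, who: str) -> bool:
        if self.owner not in (None, who):
            return False  # someone else holds the record exclusively
        self.sharers.add(who)
        return True

    def get_exclusive(self, who: str) -> bool:
        other_sharers = self.sharers - {who}
        if other_sharers or self.owner not in (None, who):
            return False  # the requester would have to wait
        self.owner = who
        return True

    def release(self, who: str) -> None:
        self.sharers.discard(who)
        if self.owner == who:
            self.owner = None


def promotion_deadlocks(requesters: List[str]) -> bool:
    """Everyone takes a shared lock first, then everyone tries to promote.

    Returns True when no requester can be granted the exclusive lock,
    which is the deadlock/timeout situation described in the text.
    """
    lock = RecordLock()
    for who in requesters:
        lock.get_shared(who)
    granted = [lock.get_exclusive(who) for who in requesters]
    return not any(granted)


def exclusive_first(requesters: List[str]) -> List[str]:
    """Each requester asks for the exclusive lock up front, one at a time."""
    lock = RecordLock()
    completed = []
    for who in requesters:
        if lock.get_exclusive(who):  # nobody else holds the record here
            completed.append(who)    # the record update would happen here
            lock.release(who)        # the next requester can now proceed
    return completed
```

In this model, two requesters that both hold the shared lock
block each other on promotion, while requesters that ask for
the exclusive lock first simply complete in turn.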

This SPE detects this condition and attempts to mitigate it by
altering the first request from a GET NUP,CRE to a GET UPD so
that the initial attempt is for an exclusive lock.

Problem conclusion

Temporary fix

Comments

These types of deadlocks could be eliminated if DBRC always did
the initial GET with update intent to get an exclusive lock. In
that case other DBRCs would wait behind the exclusive lock. But
getting an exclusive lock when a record will not be updated
can cause more contention and extra TVS logging, so getting all
records exclusive would be detrimental. Also, DBRC request logic
does not always know whether a given record will be updated at
the time it does its initial GET.

To handle this, the approach is to add logic to track which
records are updated by a given request, but only if the request
has hit a retryable error such as a timeout or deadlock. If the
request hits a retryable error a second time, the logic then
gets an EX lock for any record that was updated, or for which
an update was attempted, on any prior iteration that hit a
retryable error.

As an example, let's say a request makes the following calls:
GET RECA, GET RECB, GET RECC, GET RECD, UPD RECB, UPD RECD
On the first attempt, a timeout occurs updating RECD.
On the second attempt, a timeout occurs updating RECB.

The VSAM requests would be:

First attempt:
GET NUP CRE RECA
GET NUP CRE RECB
GET NUP CRE RECC
GET NUP CRE RECD
GET UPD RECB
PUT RECB
GET UPD RECD   ---timeout -> retry

Second attempt:
GET NUP CRE RECA
GET NUP CRE RECB
GET NUP CRE RECC
GET NUP CRE RECD
GET UPD RECB   (save fact that RECB update attempted)
               ---timeout -> retry

Third attempt:
GET NUP CRE RECA
GET UPD RECB   <since we know it will get updated>
GET NUP CRE RECC
GET NUP CRE RECD
GET UPD RECB
PUT RECB
GET UPD RECD
PUT RECD
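
The example can be replayed with a short simulation. This is a
sketch only, not the DBRC implementation: run_request and its
arguments are hypothetical, and a timeout is injected per
attempt by naming the record whose update fails.

```python
from typing import List, Optional, Tuple


def run_request(calls: List[Tuple[str, str]],
                timeouts: List[Optional[str]]) -> List[List[str]]:
    """Replay a request and return the VSAM requests issued per attempt.

    calls    -- (op, key) pairs, op is "GET" or "UPD"
    timeouts -- per attempt, the key whose update times out, or None
    """
    attempted_updates = set()  # filled only while the request is retrying
    attempts = []
    for attempt_no, fail_key in enumerate(timeouts):
        retrying = attempt_no > 0
        trace = []
        failed = False
        for op, key in calls:
            if op == "GET":
                # Ask for the exclusive lock up front only for tracked keys.
                if key in attempted_updates:
                    trace.append(f"GET UPD {key}")
                else:
                    trace.append(f"GET NUP CRE {key}")
                continue
            if retrying:
                attempted_updates.add(key)  # remember the update attempt
            trace.append(f"GET UPD {key}")
            if key == fail_key:
                failed = True               # timeout -> retry the request
                break
            trace.append(f"PUT {key}")
        attempts.append(trace)
        if not failed:
            break
    return attempts


calls = [("GET", "RECA"), ("GET", "RECB"), ("GET", "RECC"),
         ("GET", "RECD"), ("UPD", "RECB"), ("UPD", "RECD")]
attempts = run_request(calls, ["RECD", "RECB", None])
```

With these inputs, the third entry of attempts lists the same
requests as the final attempt in the example: RECB is read with
GET UPD from the start, while RECD is still read with GET NUP
CRE because its update was never reached while tracking was
active.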


The concept is to limit overhead until we know we are hitting
contention repeatedly, at which point the extra tracking of
updates is done and the higher-level (exclusive) lock is
obtained on subsequent GETs for that record until the request
works. Note that this will not prevent all deadlocks. Also,
because some DBRC locate processing does a GET KGE or GET BWD to
find a record greater or less than a user-specified value, it
cannot know ahead of time which record key will be returned, so
it cannot check whether that record will eventually be updated.
In this case we continue to get the shared lock during the
locate.
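
The decision described in this paragraph can be sketched as a
small function. The names lock_mode_for_locate, "DIRECT", "KGE"
and "BWD" are hypothetical labels chosen for illustration; only
the rule itself comes from the text above.

```python
def lock_mode_for_locate(locate_kind, key, retrying, attempted_updates):
    """Pick the lock mode for the initial GET of a locate.

    locate_kind       -- "DIRECT" when the record key is known before the
                         GET; "KGE" or "BWD" when the key is only known
                         after the record is returned
    key               -- requested key (a search argument for KGE/BWD)
    retrying          -- True once the request has hit a retryable error
    attempted_updates -- keys whose update was attempted while retrying
    """
    key_known_in_advance = locate_kind == "DIRECT"
    if retrying and key_known_in_advance and key in attempted_updates:
        return "EXCLUSIVE"
    # First attempt, untracked key, or a locate whose result key
    # cannot be predicted: stay with the shared lock.
    return "SHARED"
```

For example, a retried direct locate of a tracked key returns
"EXCLUSIVE", while a KGE locate with the same search argument
still returns "SHARED".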

Code changes:
DSPURI00:
 -In DSPURILC for a direct locate of a record w/o a timestamp:
    If we are retrying a request, check if we attempted to
    update the record in a prior attempt, and if so set the RPL
    options to UPD instead of NUP.  For direct locates of
    records with a timestamp in the key, the routine GETWithCRE
    is used.
 -The only other locate type that we can check for update is a
  CLASS FIRST type of locate, as that also calls GETwithCRE.
 -In DSPURICH: Add call to AddKeytoList to track the update
  of the passed record if the request is being retried
 -In DSPURIDL: Add call to AddKeytoList to track the update
  of the passed record if the request is being retried

 - Add new routine: AddKeytoList - This routine will create
   the initial hash table if one does not exist.  It then
   checks if the key is already in the table.  If not, it adds
   it to a table that holds all the updated keys. The hash table
   entry is updated to point to the new entry, or the last
   synonym is updated to point to the new entry.
   Note: if we are unable to get storage for more keys, the
   table will be marked full.

 - Add new routine: EntryExists - This routine takes a key and
   checks if it is already in the hash table tracking records
   that were updated.  If the table is marked full, ALL
   records will be considered as in the table.
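
The behavior of the two routines can be sketched as follows.
This is an illustration under stated assumptions, not the
module's code: the class UpdateKeyTable, its bucket count and
its capacity limit (which stands in for a failed storage
request) are hypothetical, each Python list plays the role of a
synonym chain, and resetting the full indicator on clear is
also an assumption.

```python
class UpdateKeyTable:
    """Tracks keys of records whose update was attempted during a retry."""

    def __init__(self, buckets=8, capacity=64):
        self.nbuckets = buckets
        self.capacity = capacity   # stands in for available storage
        self.buckets = None        # hash table is created on first add
        self.count = 0
        self.full = False

    def _chain(self, key):
        return self.buckets[hash(key) % self.nbuckets]

    def add_key(self, key):
        """Counterpart of AddKeytoList."""
        if self.buckets is None:
            self.buckets = [[] for _ in range(self.nbuckets)]
        if self.full or self.entry_exists(key):
            return
        if self.count >= self.capacity:
            self.full = True       # no storage for more keys
            return
        # New key becomes the bucket's first entry or its last synonym.
        self._chain(key).append(key)
        self.count += 1

    def entry_exists(self, key):
        """Counterpart of EntryExists."""
        if self.full:
            return True            # full table: treat every key as tracked
        if self.buckets is None:
            return False
        return key in self._chain(key)

    def clear(self):
        """Counterpart of ClearUpdHT; clearing 'full' is an assumption."""
        if self.buckets is not None:
            self.buckets = [[] for _ in range(self.nbuckets)]
        self.count = 0
        self.full = False
```

Treating a full table as containing every key errs on the side
of exclusive locks: a retried request may lock more records
than necessary, but it never misses a record that it had
previously tried to update.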


DSPBRQ00:  Call to clear hash table routine to clear the entries
  in the update tracking table if the request completed without
  the need for retry.

DSPCRTR0:  Call to clear hash table routine to clear the entries
  in the update tracking table if the request completed without
  the need for retry.

DSPCRTR1: Add subroutine ClearUpdHT which just clears all
  entries in the update hash table

DSPDSS01: Add code to release the storage used for the key
  update hash table

DSPPRAB: Add prab_retry_updhtbl and prab_retry_updrqst

DSPURX00: Clear some PRAB_RETRY fields when request will not
  be retried. Add call to ClearUpdHT if the command being
  processed does not need to be retried.
  Add ClearUpdHT routine to clear the hash table entries.

DSPEF06F: Recompile for DSPPRAB change.
DSPEF00F: Recompile for DSPPRAB change.
DSPEF02F: Recompile for DSPPRAB change.
DSPEF0AF: Recompile for DSPPRAB change.

APAR Information

APAR number
PM82843
Reported component name
IMS V13
Reported component ID
5635A0400
Reported release
300
Status
CLOSED UR1
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2013-02-14
Closed date
2013-03-26
Last modified date
2013-10-04

APAR is sysrouted FROM one or more of the following:

PM82842
APAR is sysrouted TO one or more of the following:

UK92905

Modules/Macros

   DSPBRQ00 DSPCRTR0 DSPDSS01 DSPEF0AF DSPEF00F
DSPEF02F DSPEF06F DSPURI00 DSPURX00

Fix information

Fixed component name
IMS V13
Fixed component ID
5635A0400

Applicable component levels

R300 PSY UK92905
UP13/03/30 P F303

Fix is available

Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.


Document Information

Modified date:
14 December 2020
