Initial procedure for the WAIT/LOOP keywords

While the documentation needed for a WAIT/LOOP might appear excessive, this failure typically does not produce any documentation specific to its occurrence. Determining its cause often requires considerable examination.

  1. Determine the scope of the problem and the users and environments that are affected.
  2. If the problem occurs in a sharing environment and involves more than one member of the data sharing group, do the following:
    • Issue the command
      -DISPLAY GROUP
      to get information about the status of the data sharing group:
      • Status of each member: ACTIVE, QUIESCED, or FAILED.
      • Lock structure and SCA storage: the percentage of lock entries in use and the percentage of SCA in use indicates whether a resource availability problem might be the reason for the hang.
    • Issue the IRLM status command:
      F irlmproc,STATUS
      on each member of the data sharing group to determine which IRLMs are responsive. If any IRLM does not respond, it might indicate that a wait on a lock is causing the hang.

      In some cases, Db2 uses the IRLM notify service to send XCF messages between data sharing members. The sender of the message might be waiting indefinitely for the receivers to respond.

      The message from the command F irlmproc,STATUS does not indicate if a system is suspended on a notify message. The buffer manager drain service, data definition language, and data set extend services use IRLM notify.

      If the hung user is draining a page set, the user task might be suspended waiting for the claim count to reach zero. Issue the command:
      -DISPLAY DATABASE(dbname) CLAIMERS
      to determine which users across the data sharing group hold a claim on the page set. If claimers exist, the drainer will be resumed when the claim count reaches zero, or the drainer will time out.
  3. If Db2 'hangs' during distributed processing, see Diagnosing distributed data facility (DDF) failures for procedures to follow.
  4. Collect information about active Db2 threads. You can do this by issuing the Db2 command -DISPLAY THREAD TYPE(ACTIVE), or by formatting your console dump using DSNWDMP with the option DS=1. For information about DSNWDMP, see Format dumps by using the DSNWDMP statement. Review the output generated and the data in the following three fields. The field labels in -DISPLAY THREAD and DSNWDMP output are shown in parentheses.
    • Status (ST or first part of Status):
      For each thread, these status codes are possible:
      N
      The thread is in either identify or sign-on status.
      ND
      The thread is in either IDENTIFY or SIGNON status. The thread is not currently associated with any TCB.
      QT
      The create thread request is queued. The associated allied task is in a wait state.
      T
      An allied, non-distributed thread has been established.
      TD
      An allied thread was established (plan allocated). The thread is currently not associated with any TCB.
      PT
      The thread is a parallel task.
      TR
      The thread (an allied distributed thread) is requesting data from another database management system.
      RA
      The thread (database access agent) is performing a remote access on behalf of a request from another DBMS.
      RN
      A distributed thread is performing a remote access on behalf of a request from a partner location. The thread was suspended because Db2 must first connect to the partner location. The Db2 command DISPLAY LOCATION(*) shows conversation activity for this Db2 system conversation (SYSCON-O) service task.
      RQ
      A distributed thread is performing a remote access on behalf of a request from another location. The thread was suspended because the maximum number of active database access threads (as described by the MAX REMOTE ACTIVE value of the DSN6SYSP macro in the Db2 startup parameter, usually DSNZPARM) was reached. Database access agents (DBAAs) are queued until a slot becomes available. Consider increasing the MAX REMOTE ACTIVE value.
      SP
      A thread is executing within a stored procedure.
      SW
      A thread is waiting for a stored procedure to be scheduled.
      TN
      An allied thread was distributed to access data at another Db2 location, but was suspended because Db2 system conversations have not been established. The Db2 command DISPLAY LOCATION(*) shows conversation activity for this Db2 system conversation (SYSCON-O) service task.
      QD
      The thread is queued for deferred termination because the associated allied task terminated. The allied task is placed in a WAIT state if this Db2 thread is the last (or only) one for the address space.
      D
      The thread is being terminated because the associated allied task terminated. The allied task is placed in a WAIT state if this Db2 thread is the last (or only) one for the address space.

      If a QT status is reported on the connection identifier (job name for batch), you are in a WAIT for a thread to become available. If you receive an excessive number of QT status codes, the limit specified for Db2 subsystem parameter CTHREAD can be increased. CTHREAD corresponds to field MAX USERS on installation panel DSNTIPE.

      If after repeating the -DISPLAY THREAD command several times, different threads have the D status and the ones previously having a D have terminated, then there is NO wait or loop. It is simply taking a long time to terminate all threads. Since deferred termination is serialized across all threads whose allied tasks have abended, and since abend processing must be performed, the total length of elapsed time can be excessive.

      However, if after considerable time, the same thread still has a status of D, a wait or loop is the probable cause. Software or hardware monitors can be used to see if one or more of the Db2 address spaces is consuming processor or I/O resources. (The user's allied task has been in a wait state and cannot be using any processor or I/O resources.) If the processor utilization is high, then a loop should be assumed. If processor utilization is not high and there is ongoing I/O activity to the Db2 database and logging data sets, then wait a little longer. The allied thread is probably still in abend processing. If the processor and I/O utilization are low, it is safe to assume that Db2 is in an endless wait.

      If the allied thread with the status of “D” that appears to be waiting or looping is distributed, that is, the message, “ACCESSING DATA AT location” appears on the display thread command, attempt to terminate the thread by issuing the ‘VTAM VARY NET, TERM' command. If this does not solve the problem, continue with the following sections.

      After confirming a wait or loop condition, take an z/OS® console dump as described in the previous section, Guidelines for good operational procedures. After the dump is complete, canceling one of the Db2 address spaces is likely to clear the problem. Canceling the allied address space of the hung thread or issuing -STOP DB2 has no effect.

    • Db2 activity indicator (A or second part of Status):

      If the activity indicator is on (if column A contains an asterisk), the thread's allied task is executing in one of the Db2 address spaces. If the activity indicator is blank, the thread's allied task is not executing in a Db2 address space.

    • Wraparound request counter (REQ or Req):

      If several -DISPLAY THREAD commands are issued and the request counter is incrementing, the thread is not in a WAIT but may be in a LOOP.

      If the counter is incrementing and the activity indicator is changing from blank to asterisk, the loop is NOT in Db2 code, but in the application. The combined state of the activity indicator and the REQ counter indicate that control of the task is continuously transferring between Db2 and the application.

      If the activity indicator consistently contains an asterisk and the REQ counter is not incrementing, it is safe to assume there is a wait or loop in Db2 code.

  5. Verify that the problem is not in an application process.
    • If there is no asterisk in the Db2 activity indicator (column A), the thread is not active and the problem is in the application.
    • If the problem is in the application, determine where the processor cycles are being used. The z/OS DISPLAY ACTIVE command or other tools can help accomplish this. This can help to determine if the application is in a wait or a loop. Applications that fail to check SQL return codes properly after each SQL call might go into a loop, making it appear that Db2 is looping.
  6. Issue the Db2 command -DISPLAY DATABASE to determine if any Db2 resources are unavailable to the transactions involved in the wait or loop.
  7. If Db2 resources are restricted because of utilities, issue the Db2 -DISPLAY UTILITY command to determine which utilities are operating.

    Utilities holding exclusive use of Db2 objects during normal processing include REORG (except when SHRLEVEL CHANGE is specified), LOAD, REBUILD, RECOVER, CHECK, MODIFY, and REPAIR.

  8. Check for long-running jobs and for SPUFI users running with AUTOCOMMIT=NO. This determines if needed resources are being held.
  9. Determine if the Db2 catalog is being updated.
    • The catalog can be updated through BIND, DDL, and certain utilities, including COPY, RUNSTATS, and STOSPACE.
    • Many updates can lock out other users, causing what appears to be a wait. The problem disappears when the catalog updates are completed.
  10. Check for transactions using both DL/I and SQL. These transactions can wait for extended periods if IRLM has a large deadlock timeout specified and if a deadlock occurs over DL/I and SQL.
  11. If a user is holding needed resources, wait for termination of that user, or cancel the user and request a dump. Reassess the scope of the problem.
  12. If only one user is affected, determine if any indexes have been dropped that would extend the application's execution time.
  13. If you receive no response from Db2 commands, Db2 is in a wait state. A problem can have occurred during -STOP DB2 processing or abnormal subsystem termination.
  14. Review the list below to determine the appropriate procedure to follow next, based upon the environments that are affected. The first procedure has the largest scope.