APAR status
Closed as program error.
Error description
This error involves multiple CQSs. The reported case was 4 CQSs in a sysplex, but the error is not specific to that number.
Problem: All 4 CQSs in a 4-way sysplex abended with U0100-04.
Environment: All CQSs run with an overflow structure defined and DUPLEX(ENABLED).
Sequence of events:
. All 4 CQSs are running.
. The primary structure hits the overflow threshold value.
. The IXLALTER to expand the primary structure size fails (the structure has reached its maxSize).
. The Overflow Threshold Process begins.
. The overflow structure is allocated/initialized.
. Since the overflow structure is defined as DUPLEX(ENABLED), as soon as it is allocated and other CQSs connect to it, the DUPLEX process starts.
. All connectors that connected to the overflow structure receive the Structure Temporarily Unavailable Event.
. All CQSs quiesce the structure.
. Meanwhile, the STE1 thread receives the OVERFLOW IXLUSYNC #3 to complete the Overflow Threshold phase 2.
. At label OFTC1400 (in CQSSTE10), the GETLATCH call fails because the structure quiesce latch is being held by the STE2 thread (the Structure Temporarily Unavailable Event).
. Thus all CQSs abend at label OFTC1500 in CQSSTE10.
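The latch conflict in the sequence above can be modeled with a small sketch. Everything here is illustrative: the Latch class, the function names, and the assumption that the overflow path makes a conditional latch request with no fallback are hypothetical stand-ins, not CQS internals.

```python
class Latch:
    """Toy exclusive latch (hypothetical model, not the CQS latch service)."""

    def __init__(self, name):
        self.name = name
        self.holder = None

    def try_get(self, requester):
        # Conditional request: succeed only if nobody holds the latch.
        if self.holder is None:
            self.holder = requester
            return True
        return False

    def release(self, requester):
        if self.holder == requester:
            self.holder = None


class Abend(Exception):
    """Stands in for an abnormal termination such as U0100-04."""


def overflow_phase2(latch):
    # Assumed behavior: the overflow thread (STE1) treats a failed
    # latch request as an unrecoverable error.
    if not latch.try_get("STE1"):
        raise Abend("U0100-04")
    latch.release("STE1")
    return "phase 2 complete"


def run_scenario(duplex_event_first):
    latch = Latch("structure quiesce")
    if duplex_event_first:
        # The Structure Temporarily Unavailable handler (STE2) quiesces
        # the structure first and keeps the latch while duplexing starts.
        latch.try_get("STE2")
    try:
        return overflow_phase2(latch)
    except Abend as err:
        return f"ABEND {err}"


if __name__ == "__main__":
    print(run_scenario(duplex_event_first=False))
    print(run_scenario(duplex_event_first=True))
```

The model only shows the ordering dependency: the overflow phase succeeds when it reaches the latch first and fails when the duplexing event handler already holds it.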
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: IMS V9 CQS users of shared queues            *
*                 overflow structure and duplexing.            *
****************************************************************
* PROBLEM DESCRIPTION: When multiple CQSs go into overflow     *
*                      mode and connect to an overflow         *
*                      structure that is defined with          *
*                      DUPLEX(ALLOWED) or DUPLEX(ENABLED)      *
*                      and duplexing is being started          *
*                      while overflow mode is being            *
*                      established, CQS abends with            *
*                      ABENDU0100-00000004.                    *
*                                                              *
*                      Multiple CQSs starting up at the        *
*                      same time hang with MSGIXL040E          *
*                      DUPLEXING CANNOT CONTINUE for the       *
*                      overflow structure.                     *
****************************************************************
* RECOMMENDATION: INSTALL CORRECTIVE SERVICE FOR APAR/PTF      *
****************************************************************
Multiple CQSs exist in an IMSplex. An overflow structure is defined to the CFRM policy as DUPLEX(ALLOWED) or DUPLEX(ENABLED).
A CQS attempts to put a message on the shared queues and detects that the overflow threshold has been reached. The CQS overflow master initiates overflow threshold phase 1 by issuing IXLUSYNC #1 to tell all the CQSs to quiesce the structure. Once all of the CQSs respond, the overflow master selects queues for overflow and then issues IXLUSYNC #2 to tell all the CQSs (including the master) to connect to the overflow structure. After all the CQSs connect to the overflow structure, the master starts moving queues to the overflow structure.
For a DUPLEX(ENABLED) overflow structure, the Structure Temporarily Unavailable event can come in at any time to notify CQS to quiesce the structure while duplexing is being established. For a DUPLEX(ALLOWED) overflow structure, a SETXCF START,REBUILD,DUPLEX command can come in at any time to establish duplexing.
The abend occurs when the Structure Temporarily Unavailable event arrives between overflow phases, after overflow has resumed the structure, so that CQS is able to quiesce the structure to establish duplexing. When the next overflow phase attempts to quiesce the structure and cannot, CQS abends with ABENDU0100-00000004 in CQSSTE10 at offset X'418'. This means overflow could not get the structure quiesce latch.
Additional problem: Multiple CQSs start up at the same time. Call the first CQS1 and the second CQS2. CQS initialization attempts to connect to the overflow structure to ensure that the overflow structure is defined correctly. The overflow structure is defined as DUPLEX(ENABLED). CQS1 locks the primary structure and proceeds with initialization. The other CQSs wait for the primary structure lock.
Once CQS1 connects to the overflow structure, z/OS notifies CQS1 with the "structure temporarily unavailable" event, in order to quiesce the structure to establish duplexing. CQS1 attempts to quiesce the structure before responding to the event, but waits because CQS1 initialization holds the structure quiesce latch.
After CQS1 determines that the primary and overflow structures don't need to be rebuilt, CQS1 releases the primary structure lock. This allows CQS2 to get the primary structure lock and proceed with its initialization. CQS2 attempts to connect to the overflow structure, but the connect fails with IXLRSNCODECONNPREVENTED. CQS2 waits for an ENF 35 event while holding the structure lock. When CQS1 initialization tries to lock the primary structure again to read and write control list entries, CQS1 waits because CQS2 has the structure locked.
The result is a circular wait: CQS2 is waiting for a "duplexing established" ENF 35 event while holding the primary structure lock; the CQS1 event handler, which must respond before duplexing can be established, is waiting for the structure quiesce latch that CQS1 initialization holds; and CQS1 initialization is waiting for the primary structure lock that CQS2 holds.
CQS1 and CQS2 are deadlocked and CQS initialization hangs.
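The circular wait described above can be checked mechanically with a wait-for graph. The following is a minimal sketch: the node names and the find_cycle helper are illustrative assumptions that paraphrase the description, not CQS or z/OS diagnostics.

```python
def find_cycle(waits_for):
    """Return the nodes of one cycle in a wait-for graph, or None."""
    visiting = []   # current depth-first path
    done = set()    # nodes already proven cycle-free

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Each key waits for something only the listed party can provide.
WAITS_FOR = {
    # CQS2 holds the primary structure lock and waits for duplexing,
    # which needs the CQS1 event handler to respond.
    "CQS2 init": ["CQS1 event handler"],
    # The handler waits for the quiesce latch held by CQS1 init.
    "CQS1 event handler": ["CQS1 init"],
    # CQS1 init waits for the primary structure lock held by CQS2.
    "CQS1 init": ["CQS2 init"],
}

if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
```

Removing any single edge from the graph breaks the cycle, which is why a change to only one of the three waiters is enough to avoid the hang.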
Problem conclusion
AIDS: RIDS/SYS RIDS/CNTRL SYS CNTRL GEN: KEYWORDS: SYSPLEXSQ POSTREQ PM04096 *** END IMS KEYWORDS ***
CQSSTE20 is changed in the "structure temporarily unavailable" event logic. If CQS is initializing or establishing overflow, CQS skips getting the terminate and structure quiesce latches. This allows duplexing to be established as fast as possible and prevents the deadlock. Duplexing is established quickly because the overflow structure has just been allocated and is empty.
z/OS doesn't require CQS to quiesce the structure for this system-managed duplex rebuild process, because the system defers any incoming requests until the structure is available again. IBM recommends that the structure be quiesced, because this minimizes the system resources required to quiesce operations and improves system performance. Since duplexing is established very quickly when no latches are gotten, it is likely to be established before CQS initialization or overflow threshold processing needs to access the overflow structure.
In the "structure available" event logic, CQS skips releasing the structure quiesce and terminate latches if they aren't held, and skips resuming the structure if it wasn't quiesced.
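The changed event logic can be sketched as two handlers that share a small state record. This is a hypothetical model only: the state fields, the function names, the logged step names, and the ordering of resume versus latch release are assumptions for illustration and do not reflect the actual CQSSTE20 code.

```python
def new_state(initializing=False, establishing_overflow=False):
    """Build a toy per-structure state record (all field names assumed)."""
    return {
        "initializing": initializing,
        "establishing_overflow": establishing_overflow,
        "latches_held": False,
        "quiesced": False,
    }


def on_structure_temporarily_unavailable(state, log):
    # During initialization or overflow setup, skip the latches (and,
    # by assumption, the quiesce) so duplexing is not held up.
    if state["initializing"] or state["establishing_overflow"]:
        log.append("skip latches and quiesce")
        return
    log.append("get terminate latch")
    log.append("get structure quiesce latch")
    state["latches_held"] = True
    log.append("quiesce structure")
    state["quiesced"] = True


def on_structure_available(state, log):
    # Undo only what the earlier handler actually did.
    if state["quiesced"]:
        log.append("resume structure")
        state["quiesced"] = False
    if state["latches_held"]:
        log.append("release structure quiesce latch")
        log.append("release terminate latch")
        state["latches_held"] = False


if __name__ == "__main__":
    for state in (new_state(initializing=True), new_state()):
        log = []
        on_structure_temporarily_unavailable(state, log)
        on_structure_available(state, log)
        print(log)
```

Tracking what was actually acquired, rather than assuming the normal path ran, is what lets the "structure available" handler be safe in both cases.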
Temporary fix
*********
* HIPER *
*********
Comments
REPINNED RP10/02/16 (ATXT) TO ADD POSTREQ PM04096 INFO.
**** PE10/02/16 PTF IN ERROR. SEE APAR PM04096 FOR DESCRIPTION
**** PE10/02/16 FIX IN ERROR. SEE APAR PM04096 FOR DESCRIPTION
APAR Information
APAR number
PK65285
Reported component name
IMS V9
Reported component ID
5655J3800
Reported release
900
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2008-04-29
Closed date
2008-11-07
Last modified date
2010-03-18
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
PK73227 UK41442
Modules/Macros
CQSFM020 CQSSTE20 CQSSTRUC
Fix information
Fixed component name
IMS V9
Fixed component ID
5655J3800
Applicable component levels
R900 PSY UK41442
UP08/11/13 P F811
Document Information
Modified date:
18 March 2010