A fix is available
APAR status
Closed as program error.
Error description
The customer observed that the false contention rate increased after the PTF for OA60394 was rolled out to all LPARs in the sysplex. The false contention rate (i.e., false contention count / total contention count) rose to 12-19%, whereas it had been below 2% before. The lock structure itself had also been enlarged to hold double the number of locks it held before. This condition is one of the problems related to the symptom described in OA60854. In this incident, the ISGLOCK structure for GRS was the structure of interest.

The analysis was based on the SYSXES CTRACE collected on all members, together with the dumps captured by the following SLIP:

SLIP SET,IF,RA=(6.10%+8C?+9C?+1C?+24?+FAE),DATA=(3R,GT,000007D0),ASID SVCD,JL=GRS,SDATA=(XESDATA,COUPLE,GRSQ,RGN,SQA,CSA,TRT,NUC,SUM),

We could see that the ENQ/DEQ requests for a single resource with exclusive interest were going to the GLM of the LTE on a remote LPAR. The remote LPAR had no other resource under that LTE. Ideally, the GLM role for this LTE should be de-escalated once the last resource under the LTE is DEQ'ed. But the pacing indicator was reset after the DEQ, so the actual de-escalation was delayed. Higher rates of false contention can also be observed whenever an LTE is being globally managed with no lock resources and a new request for the same LTE is processed by the global manager when no contention exists.

While implementing the fix, we also recognized that global lock cleanup processing did NOT reset the global byte for the LTE in the structure once there was no resource under the LTE. In other words, once the LTE is placed into the deferred de-escalate state, global lock cleanup processing will NOT clear the GLM ID in the structure. That could leave many more LTEs in the structure with a dirty global byte, which eventually triggers much more CHASE processing on all connectors when they try to update a resource under such an LTE with an outdated GLM connector ID.
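For reference, the false contention rate quoted in the error description is simply the false contention count divided by the total contention count. A minimal sketch of that arithmetic follows; the function name and the counter values are hypothetical and chosen only to reproduce the "below 2%" and "19%" figures mentioned above, not taken from any dump or RMF report.

```python
def false_contention_rate(false_contentions: int, total_contentions: int) -> float:
    """Return false contention as a percentage of total contention.

    Returns 0.0 when no contention was recorded, to avoid dividing by zero.
    """
    if total_contentions == 0:
        return 0.0
    return 100.0 * false_contentions / total_contentions


# Hypothetical counter values, for illustration only.
rate_before = false_contention_rate(150, 10_000)    # 1.5 (below the 2% baseline)
rate_after = false_contention_rate(1_900, 10_000)   # 19.0 (top of the observed range)
```

Comparing the rate for the same lock structure over equivalent intervals before and after a PTF is applied is one way to spot the kind of increase described here.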
PE INFORMATION:

USERS AFFECTED:
For false contention rate increase: All users with either OA60394 HBB77B0 PTF UJ04727 or HBB77C0 PTF UJ04728 installed, supporting connectors to lock structures. APAR OA60394 corrected a performance impact due to a higher false contention rate for a lock structure that resulted from the fix for APAR OA59122 (HBB77B0 PTF UJ03932 and HBB77C0 PTF UJ03933).
For excessive chase processing when an LTE is globally managed with no lock resources: All users with OA59122 PTF UJ03932 or UJ03933 installed, supporting connectors to lock structures. APAR OA59122 addressed a problem where locking connectors become message-isolated when a burst of IXLLOCK requests releases a large number of globally managed resources, causing the XES global management of those resources to end suddenly.

USER IMPACT:
For false contention rate increase: APAR OA60394 did not completely fix the problem it reported. APAR OA60394 reduced much of the performance impact caused by a higher rate of false contention encountered when XES lock resources are no longer in contention and XES global management of the resources does not end in a timely manner. However, some performance impact due to a higher rate of false contention for lock structures may still be encountered when a single exclusive resource being globally managed by another connector is no longer in contention. It may also be encountered when an LTE is being globally managed with no lock resources and a new request for the same LTE is processed by the global manager when no contention exists. While spikes of false contention rates from 1-19% have been observed due to this problem, no severe impact has been encountered from it.
For excessive chase processing when an LTE is globally managed with no lock resources: APAR OA59122 fixed the problem it reported but introduced a new problem. APAR OA59122 better handles bursts of IXLLOCK requests that cause global management of a large number of resources to end. This avoids connectors becoming message-isolated (and the associated MSGIXC638I and MSGIXC645E being issued) and avoids exploiter delays (and the associated ABENDS026 RSN08118001 being issued) due to that processing. However, it introduces some performance impact by triggering chase processing on all connectors when they try to update the resource being managed for LTEs that were improperly cleaned up during connector termination. During chase processing, the lock structure exploiter is unable to access the LTE. Excessive chase processing may result in other problems such as those described for APAR OA61788.
Local fix
BYPASS/CIRCUMVENTION: No local fix.
RECOVERY ACTION: Running with the fix for OA60394 should not be a severely impactful condition. Apply the ++APAR for OA61848, when it becomes available, for the final fix.
The ++APAR packages mentioned on 08/10/2021 were removed because the fix package did NOT include the fix for the global lock cleanup change. 08/24/2021 POK NCH
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* Users who are running in a Parallel Sysplex environment      *
* making use of coupling facility (CF) lock structures for     *
* sysplex-scope serialization functions with APAR OA59122 or   *
* OA60394 installed.                                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* High rates of false contention may be encountered due to     *
* mismanagement of an internal XES indication when a connector *
* other than the global lock manager (GLM) is the only one     *
* with an exclusive interest in a resource that is no longer   *
* in contention. The high false contention rate may also be    *
* encountered whenever a lock table entry (LTE) is being       *
* globally managed with no lock resources and a new request    *
* for the same LTE is processed by the global manager when no  *
* lock contention exists. Additionally, peer survivor recovery *
* processing for a terminating connector with LTEs globally    *
* managed with no lock resources can fail to properly clean up *
* the LTE. This can result in a performance problem during     *
* request processing as XES chase processing corrects the LTE  *
* with a proper GLM.                                           *
*                                                              *
* SYSPLEXDS                                                    *
****************************************************************
* RECOMMENDATION:                                              *
* This PTF will not be fully effective on the system it is     *
* being applied until the PTF(s) for this APAR are applied to  *
* all systems in the sysplex.                                  *
* An IPL is required to activate this fix on each system of    *
* the SYSPLEX, however, a rolling IPL is sufficient to         *
* accomplish the activation.                                   *
****************************************************************

With APAR OA59122, LTEs being globally managed with no lock resources may encounter false contention when a new request for the same LTE is processed by the global manager even though no lock contention exists any longer. This contributes to a higher rate of false contention. Peer survivor recovery processing for a terminating connector with LTEs in this state may also fail to clean up the LTE properly. This can subsequently cause excessive chase processing for the LTE in order to correct it.

Due to an error in the management of an internal XES indication with APAR OA60394, a high rate of false contention may be encountered when a connector other than the GLM is the only one left with an exclusive interest in a resource that is no longer in contention.
Problem conclusion
Logic added to reduce high rates of false contention by:
1. No longer reporting requests for LTEs being globally managed
   with no lock resources that are not in contention.
2. Correcting management of internal XES indication when a
   connector other than the GLM is the only one left with an
   exclusive interest in a resource that is no longer in
   contention.
Logic added to peer survivor recovery processing to properly
clean up LTEs being globally managed with no lock resources.

+--- PUBLICATIONS AFFECTED -----------------------------------+
|                                                             |
| o z/OS MVS System Messages, Vol 10 (IXC-IZP),               |
|   SA38-0677-xx                                              |
+-------------------------------------------------------------+

In z/OS MVS System Messages, Vol 10 (IXC-IZP), MSGIXL016I is
updated to include new Additional Status Information, and
include other missing information:

  IXL016I CONNECTOR conname TO [NEW] STRUCTURE strname
          [DISCONNECTING | TERMINATING]: JOB jobname ASID asid
          trigger.
|         [statusinfo]

  Explanation
  Connector disconnect or termination processing is being
  performed by XES.

  In the message text:

  conname
    The name of the CF structure connection.
| NEW
|   Indicates processing for the rebuild connection.
|   When not included indicates processing for the
|   original or only connection.
  strname
    The name of the CF structure.
| DISCONNECTING
|   Indicates the connector is disconnecting.
| TERMINATING
|   Indicates the connector is terminating.
  ...
  trigger
    One of the following:
    ...
    REQUESTED DISCONNECT WITH LOCK RESOURCES
      The connector used IXLDISC REASON=NORMAL or IXLDISC
      REASON=DELETESTR to disconnect from the structure, but
      the lock structure connector held lock resources.
|     For connections to a lock structure with RECORD=YES,
|     the disconnect is treated as if the connector had used
|     IXLDISC REASON=FAILURE. Note in particular that if
|     IXLDISC REASON=DELETESTR was used in this case and the
|     structure has a disposition of keep, the structure will
|     persist even if it has no connectors.
    ...
|   REQUESTED DISCONNECT REASON=DELETESTR
|     The connector used IXLDISC REASON=DELETESTR to
|     disconnect from the structure.
| statusinfo
|   Status information which may include:
|
|   ADDITIONAL STATUS INFORMATION:
|     This line is issued whenever there is additional status
|     information to be displayed. It is followed by the
|     following line:
|
|     - MANAGING CONTENTION WITH NO LOCK RESOURCES
|       The lock structure connector is managing contention
|       with no lock resources currently held. Lock table
|       cleanup by peer connectors is required.
  ...
  System action
  The system continues processing.
| When z/OS is running on a server that supports recovery
| process boosts, coupling facility disconnect processing takes
| advantage of System Recovery Boost's recovery process boost
| support to expedite lock structure connector cleanup
| processing performed by the system and active connected
| users. When a connected user is disconnected from a lock
| structure while holding lock resources, managing contention
| with no lock resources currently held or is disconnected
| implicitly as the result of task termination, address space
| termination or a system being removed from the sysplex, a
| System Recovery Boost will be requested on the systems where
| active connected users receive the Disconnected or Failed
| Connection event.
  ...

+--- PUBLICATIONS AFFECTED -----------------------------------+
|                                                             |
| o z/OS MVS System Codes,                                    |
|   SA38-0665-xx                                              |
+-------------------------------------------------------------+

Change reason code 0A0D0101 shown under Operator response for
026 System completion code to 0A0Dxxxx.
+--- PUBLICATIONS AFFECTED -----------------------------------+
|                                                             |
| o z/OS MVS Programming: Sysplex Services Guide,             |
|   SA23-1400-xx                                              |
+-------------------------------------------------------------+

Under Sysplex Services for Data Sharing (XES) Connection
Services make the following updates/corrections:

Under Disconnecting from a Coupling Facility Structure,
Persistence Considerations, Handling Resources for a
Disconnection:

Update description for handling resources for a disconnection
from a lock structure:

  ...
  -Lock structure
  ...
| When z/OS is running on a server that supports recovery
  process boosts, coupling facility disconnect processing takes
  advantage of System Recovery Boost's recovery process boost
  support to expedite lock structure connector cleanup
  processing performed by the system and active connected
  users. When a connected user is disconnected from a lock
| structure while holding lock resources, managing contention
| with no lock resources currently held or is disconnected
| implicitly as the result of task termination, address space
  termination or a system being removed from the sysplex, a
  System Recovery Boost will be requested on the systems where
  active connected users receive the Disconnected or Failed
  Connection event.

Under Disconnecting from a Coupling Facility Structure,
Persistence Considerations, Disconnection to Delete the
Structure:

Update note:

| Note: If a connected user with RECORD=YES disconnects with
  REASON=DELETESTR on IXLDISC while still holding locks, XES
  treats the disconnect as if the user had specified
  REASON=FAILURE on IXLDISC.

Under Structure Concepts, Identifying Connection States:

Update note:

  *An IXLDISC REASON=NORMAL or REASON=DELETESTR request by a
| connection with RECORD=YES which owns resources in a lock
  structure will be converted to an IXLDISC REASON=FAILURE
  request.

Under Structure Concepts, Understanding Connection Persistence
and Structure Persistence, Connection Persistence:

Update last paragraph:

  If the connection terminates normally (disconnect with
  REASON=NORMAL or REASON=DELETESTR), the persistence attribute
  for the connection does not apply, and so the connection
| becomes not defined. However, if a connector with RECORD=YES
  to a lock structure disconnects with REASON=NORMAL or
  REASON=DELETESTR while still owning resources associated with
  the lock structure, XES converts the reason to
  REASON=FAILURE.
Temporary fix
*********
* HIPER *
*********
Comments
APAR Information
APAR number
OA61848
Reported component name
CROSS SYS.EXT.S
Reported component ID
5752SCIXL
Reported release
7B0
Status
CLOSED PER
PE
YesPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-07-28
Closed date
2022-02-09
Last modified date
2022-03-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
UJ07740 UJ07741 UJ07742
Modules/Macros
IXLR1GLB IXLR1GLU IXLF1TFT IXLC1DCN IXLR2SSF IXCL2GAT IXLR2SSD IXLF1TX2 IXLF1TF5 IXLM1MST IXLR1GLC IXLI1SIN
| SA380677XX | SA380665XX | SA231400XX |
Fix information
Fixed component name
CROSS SYS.EXT.S
Fixed component ID
5752SCIXL
Applicable component levels
R7D0 PSY UJ07742
UP22/02/23 P F202
R7C0 PSY UJ07741
UP22/02/23 P F202
R7B0 PSY UJ07740
UP22/02/23 P F202
Fix is available
Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.
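Because the recommendation states that the fix is not fully effective until the PTF(s) are applied to all systems in the sysplex, it can help to check which systems still lack the PTF for their component level. The sketch below is illustrative only: the level-to-PTF table is taken from the applicable component levels listed above, while the function name, the system names, and the inventory layout are hypothetical. How the list of applied PTFs is obtained is outside the scope of the sketch.

```python
from typing import Dict, List, Set, Tuple

# PTFs for APAR OA61848 by component level, as listed in this document.
PTF_BY_LEVEL: Dict[str, str] = {
    "R7B0": "UJ07740",
    "R7C0": "UJ07741",
    "R7D0": "UJ07742",
}


def systems_missing_fix(systems: Dict[str, Tuple[str, Set[str]]]) -> List[str]:
    """Return the names of systems that do not have the OA61848 PTF applied.

    `systems` maps a system name to (component level, set of applied PTFs).
    A system at a level not covered by the table is reported as missing.
    """
    missing = []
    for name, (level, applied) in sorted(systems.items()):
        ptf = PTF_BY_LEVEL.get(level)
        if ptf is None or ptf not in applied:
            missing.append(name)
    return missing


# Hypothetical sysplex inventory, for illustration only.
sysplex = {
    "SYSA": ("R7C0", {"UJ04728", "UJ07741"}),
    "SYSB": ("R7C0", {"UJ04728"}),
    "SYSC": ("R7D0", {"UJ07742"}),
}
still_missing = systems_missing_fix(sysplex)  # SYSB lacks UJ07741
```

Note that having the PTF applied is not the same as having it active: per the recommendation, an IPL (a rolling IPL is sufficient) is required on each system to activate the fix.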
Document Information
Modified date:
02 March 2022