A fix is available
APAR status
Closed as program error.
Error description
System hangs can result during machine check recovery due to incorrect tests for processor malfunctions in spinlock manager.
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: All users of z/VM.                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
****************************************************************
* RECOMMENDATION: APPLY PTF                                    *
****************************************************************
This error can only occur if a non-repressible machine check or processor checkstop occurs while a guest is running. The cause of the hang is in a code path that only runs when a processor malfunction condition occurs. The handling of the machine check for processor x'0A' was done properly and that processor was successfully restarted. The cause of the hang was indirectly related.

The sequence of events leading to the system hang on the Vary Proc Lock (HCPRCCVA) involves multiple functions and actions by multiple processors, as follows:

1. Sequence of events in processor x'11's spinlock manager:
During an attempt to acquire the Scheduler Lock exclusive, the HCPSXL MALFCHK subroutine checks to see if the processor last observed holding the lock has malfunctioned. In this case, although processor x'11' thought that processor x'0A' held a share of the Scheduler Lock when it malfunctioned, the machine check old PSW shows that processor x'0A' was running a guest in SIE at the time of the machine check and therefore it would not have held a share of the Scheduler Lock. The spinlock data structures confirm this. If HCPSXL had done the lock hold check again after determining the processor had malfunctioned, it would have seen that the processor no longer held the lock and finished its own lock obtain. This missing lock test is the problem. For this problem to occur, this was the sequence of events:

  PROC_X'11'               PROC_X'0A'
  wants lock exclusive     acquires lock share
  sees x'0A's share        releases lock share
                           runs guest
                           gets machine check
                           marked as malfunctioning
  sees x'0A' malfunction
  Goto HCPMPRPR

2. Sequence of events on processor x'11' after HCPMPRPR call:
In this case, processor x'11' general registers indicate that it had gotten to HCPMPR label TERMSELF and set PFXDOWNR=PFXDOWN before calling HCPSGPIN because processor x'11's PFXRCVFG.PFXMALFW is set. It can also be seen that HCPSGPIN got control and called the SGPTERM subroutine. This path is taken to terminate a processor when it believes the system is terminating. This code got control because of the problem in [1], where it incorrectly identified the situation as a fatal condition. If instead there was truly a machine check on a processor that held a spinlock, then the processor receiving the machine check would have requested the system to terminate with ABENDMCH005. The fact that HCPMPRPR placed a processor in a disabled wait state while the rest of the system continued to run is not a defect in HCPMPRPR. Rather, HCPMPR is a victim of incorrect processing in its caller. Fixing the problem in [1] will remove the conditions that led to this incorrect state.

3. Sequence of events in Monitor MRPRCMFC D5 R13 processing:
In preparation for generating these records, HCPMNPDM loops requesting each processor to extract its CPUMF counters. The Vary Proc Lock (HCPRCCVA) is acquired at the beginning and held continuously while the request is processed on each of the processors. The processing passes control from one processor to the next using SIGP Emergency Signals by HCPSGPNC/HCPSGPPK. Before the request is passed to the next processor, the processor state is checked to be sure it is online and that there is no error when its state is tested using SIGP Sense. In this case HCPSGPNC decided that processor x'11' was running properly because SIGP Sense returned CC=0 and it saw PFXTYPE=PFXTYSLV, which is a valid state for a non-Master. However, the EMSBK stacked on processor x'11' for this function was never processed, so the processing for Monitor stopped here.
Sequence [2] caused the Vary Proc Lock hang because processor x'11' went into a wait state but did not do any of the formal processor deconfiguration processing. That caused all other processors to treat processor x'11' as if it were online and able to respond to requests, though it could not because it was in a disabled wait state. The processing in HCPSGPNC is another case of a victim. Fixing the problem in [1] will remove the conditions that led to this incorrect state.

The problem in [1] can only occur if a processor checkstop or machine check occurs and the timing of the handling of spinlock requests by HCPSXL is such that the sequence of events in [1] can occur. While the likelihood of that is rather low, this is the cause of the hang condition and should be fixed. The invalid states in sequences [2] and [3] are really cases of victims of the problem in [1], so no change is needed for them.
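
The timing window in sequence [1] can be illustrated with a small model. The following Python sketch is not CP code; the names `Cpu`, `SharedLock`, and `replay_timeline` are hypothetical and only replay the order of events described above to show how an observation of the lock holder can go stale before the malfunction flag is examined.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Hypothetical stand-in for a processor's state."""
    addr: int
    malfunctioning: bool = False


@dataclass
class SharedLock:
    """Hypothetical stand-in for a lock that can be held shared."""
    sharers: set = field(default_factory=set)  # addresses holding a share


def replay_timeline():
    """Replay the event order from sequence [1] in a single thread."""
    lock = SharedLock()
    cpu_0a = Cpu(0x0A)

    # x'0A' acquires a share of the lock.
    lock.sharers.add(cpu_0a.addr)

    # x'11' wants the lock exclusive and records the holder it sees.
    last_observed_holder = cpu_0a.addr if cpu_0a.addr in lock.sharers else None

    # x'0A' releases its share, runs a guest, and takes a machine check.
    lock.sharers.discard(cpu_0a.addr)
    cpu_0a.malfunctioning = True

    # x'11' now sees the malfunction and trusts its earlier observation.
    stale_verdict = (last_observed_holder == cpu_0a.addr
                     and cpu_0a.malfunctioning)

    # What a fresh look at the lock would have shown.
    actually_held = cpu_0a.addr in lock.sharers
    return stale_verdict, actually_held


if __name__ == "__main__":
    print(replay_timeline())
```

In this model the stale observation says a malfunctioning processor holds the lock, while the lock itself shows no share held by x'0A'.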
Problem conclusion
This defect is corrected by modifying the malfunction check routine to reinspect the lock state when it finds that the processor it last observed holding the lock is marked as malfunctioning.
Code changes were made to:
- HCPSXL - MALFCHK subroutine was modified to retest the lock state if the processor was observed as malfunctioning.
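
As an illustration only, the shape of the corrected check might be sketched as below. This is a simplified Python model, not the HCPSXL implementation; `Verdict`, `malfchk`, and the `retest` parameter are assumed names chosen for the example, and the real routine operates on CP spinlock data structures that are not shown in this APAR.

```python
from enum import Enum


class Verdict(Enum):
    KEEP_SPINNING = "holder is healthy, continue waiting"
    RETRY_OBTAIN = "holder no longer holds the lock, retry the obtain"
    TERMINATE = "malfunctioning processor holds the lock"


def malfchk(sharers, malfunctioning, last_seen_holder, retest=True):
    """Decide what a waiting processor should do (hypothetical model).

    sharers          -- set of processor addresses currently holding the lock
    malfunctioning   -- set of processor addresses marked as malfunctioning
    last_seen_holder -- address the waiter last observed holding the lock
    retest           -- True models the corrected behavior, False the old one
    """
    if last_seen_holder not in malfunctioning:
        return Verdict.KEEP_SPINNING
    # Corrected behavior: look at the lock again before declaring a
    # fatal condition, because the earlier observation may be stale.
    if retest and last_seen_holder not in sharers:
        return Verdict.RETRY_OBTAIN
    return Verdict.TERMINATE


if __name__ == "__main__":
    # x'0A' has released its share and is marked as malfunctioning.
    print(malfchk(set(), {0x0A}, 0x0A, retest=False))
    print(malfchk(set(), {0x0A}, 0x0A, retest=True))
```

With `retest=False` the model reproduces the incorrect fatal verdict from sequence [1]; with `retest=True` the waiter retries its own lock obtain, and the fatal path is taken only when the malfunctioning processor still holds the lock.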
Temporary fix
*********
* HIPER *
*********
FOR RELEASE VM/ESA CP/ESA R640 :
PREREQ: VM65988 CO-REQ: NONE IF-REQ: NONE
FOR RELEASE VM/ESA CP/ESA R710 :
PREREQ: NONE CO-REQ: NONE IF-REQ: NONE
Comments
APAR Information
APAR number
VM65690
Reported component name
VM CP
Reported component ID
568411202
Reported release
640
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2018-10-03
Closed date
2018-11-02
Last modified date
2019-09-30
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
UM35289 UM35324
Modules/Macros
HCPSXL
Fix information
Fixed component name
VM CP
Fixed component ID
568411202
Applicable component levels
Fix is available
Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.
Document Information
Modified date:
30 September 2019