VM65690: Z/VM HANG DUE TO ERRORS IN MACHINE CHECK RECOVERY

A fix is available

APAR status

Closed as program error.

Error description

System hangs can result during machine check recovery due to
incorrect tests for processor malfunctions in the spinlock
manager.

Local fix

Problem summary

****************************************************************
* USERS AFFECTED: All users of z/VM.                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
****************************************************************
* RECOMMENDATION: APPLY PTF                                    *
****************************************************************
This error can only occur if a non-repressible machine check
or processor checkstop occurs while a guest is running.
The cause of the hang is in a code path that only runs when a
processor malfunction condition occurs.

The handling of the machine check for processor x'0A' was
done properly and that processor was successfully restarted.
The cause of the hang was indirectly related.

The sequence of events leading to the system hang on the Vary
Proc Lock (HCPRCCVA) involves multiple functions and actions
by multiple processors, as follows:

1. Sequence of events in processor x'11's spinlock manager:
   During an attempt to acquire the Scheduler Lock exclusive,
   the HCPSXL MALFCHK subroutine checks to see if the
   processor last observed holding the lock has malfunctioned.

   In this case, although processor x'11' thought that
   processor x'0A' held a share of the Scheduler Lock when it
   malfunctioned, the machine check old PSW shows that
   processor x'0A' was running a guest in SIE at the time of
   the machine check and therefore it would not have held a
   share of the Scheduler Lock. The spinlock data structures
   confirm this.

   If HCPSXL had repeated the lock hold check after determining
   that the processor had malfunctioned, it would have seen
   that the processor no longer held the lock and would have
   finished its own lock obtain. This missing lock test is the
   problem.

   For this problem to occur, this was the sequence of events:
      PROC_X'11'               PROC_X'0A'
      wants lock exclusive     acquires lock share
      sees x'0A's share
                               releases lock share
                               runs guest
                               gets machine check
                               marked as malfunctioning
      sees x'0A' malfunction
      Goto HCPMPRPR

2. Sequence of events on processor x'11' after HCPMPRPR call:
   In this case, processor x'11' general registers indicate
   that it had gotten to HCPMPR label TERMSELF and set
   PFXDOWNR=PFXDOWN before calling HCPSGPIN because processor
   x'11's PFXRCVFG.PFXMALFW is set.

   It can also be seen that HCPSGPIN got control and called the
   SGPTERM subroutine.  This path is taken to terminate a
   processor when it believes the system is terminating.
   This code got control because of the problem in [1] where it
   incorrectly identified the situation as a fatal condition.

   If instead a machine check had truly occurred on a processor
   that held a spinlock, the processor receiving the machine
   check would have requested the system to terminate with
   ABENDMCH005.

   The fact that HCPMPRPR placed a processor in a disabled wait
   state while the rest of the system continued to run is not a
   defect in HCPMPRPR.  Rather, HCPMPR is a victim of incorrect
   processing in its caller.  Fixing the problem in [1] will
   remove the conditions that led to this incorrect state.

3. Sequence of events in Monitor MRPRCMFC D5 R13 processing:
   In preparation for generating these records, HCPMNPDM loops
   requesting each processor to extract its CPUMF counters.
   The Vary Proc Lock (HCPRCCVA) is acquired at the beginning
   and held continuously while the request is processed on
   each of the processors. The processing passes control from
   one processor to the next using SIGP Emergency Signals by
   HCPSGPNC/HCPSGPPK. Before the request is passed to the next
   processor, the processor state is checked to be sure it is
   online and that there is no error when its state is tested
   using SIGP Sense.

   In this case HCPSGPNC decided that processor x'11' was
   running properly because SIGP Sense returned CC=0 and
   PFXTYPE=PFXTYSLV, which is a valid state for a non-Master.
   However, the EMSBK stacked on processor x'11' for this
   function was never processed, so the processing for
   Monitor stopped here.

   Sequence [2] caused the Vary Processor Lock hang because
   processor x'11' went into a wait state but did not do any of
   the formal processor deconfiguration processing. That caused
   all other processors to treat processor x'11' as if it were
   online and able to respond to requests, although it could
   not because it was in a disabled wait state.

   The processing in HCPSGPNC is another case of a victim.
   Fixing the problem in [1] will remove the conditions that
   led to this incorrect state.
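
The hang in sequence [3] can be illustrated with a small model.
This is only a sketch under stated assumptions, not z/VM code:
the Cpu class, its fields, looks_healthy() and relay() are all
hypothetical names, and the real hand-off between processors
uses SIGP Emergency Signals rather than a simple loop. It shows
only why a processor that sits in a disabled wait state without
being deconfigured passes the status check yet never processes
the request stacked on it.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Hypothetical stand-in for one processor's visible state."""
    addr: int
    online: bool = True           # never formally deconfigured
    sense_cc: int = 0             # condition code a status probe would report
    disabled_wait: bool = False   # processor is stopped in a disabled wait
    pending: list = field(default_factory=list)  # stacked requests


def looks_healthy(cpu):
    """Model of the pre-dispatch check: online and no error on sense."""
    return cpu.online and cpu.sense_cc == 0


def relay(cpus, request="extract CPUMF counters"):
    """Pass one request through the processors in order.

    Returns (lock_released, visited). The caller is assumed to hold a
    lock for the whole relay; it is only released if every processor
    that looked healthy actually processed the request.
    """
    visited = []
    for cpu in cpus:
        if not looks_healthy(cpu):
            continue                  # skipped: visibly offline or in error
        cpu.pending.append(request)   # request is stacked on the target
        if cpu.disabled_wait:
            # Target looked healthy but can never run the request,
            # so the relay stalls here with the lock still held.
            return False, visited
        cpu.pending.pop()             # target processed the request
        visited.append(cpu.addr)
    return True, visited


cpus = [Cpu(0x0A), Cpu(0x11, disabled_wait=True), Cpu(0x12)]
released, visited = relay(cpus)
```

In this model the relay stops at the processor in disabled wait
and the lock is never released, which mirrors the Vary Proc Lock
hang described above.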

The problem in [1] can only occur if a processor checkstop or
machine check occurs and the timing of the handling of spinlock
requests by HCPSXL is such that the sequence of events in [1]
can occur. Although that timing is unlikely, it is the cause of
the hang condition and should be fixed.

The invalid states in sequences [2] and [3] are consequences of
the problem in [1], so no change is needed for them.

Problem conclusion

This defect is corrected by modifying the malfunction check
routine to reinspect the lock state when it finds that the
processor it last observed holding the lock is marked as
malfunctioning.

Code changes were made to:
 - HCPSXL - MALFCHK subroutine was modified to retest the
            lock state if the processor was observed as
            malfunctioning.
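
The retest described above can be sketched as follows. This is
a minimal illustration, not the HCPSXL code: SharedLock,
malfchk_before_fix() and malfchk_after_fix() and their return
values are hypothetical names, and the action taken when the
malfunctioning processor really does still hold a share is an
assumption. The replay follows the sequence of events listed
in [1].

```python
from dataclasses import dataclass, field


@dataclass
class SharedLock:
    """Hypothetical model of a lock: the set of CPUs holding a share."""
    share_holders: set = field(default_factory=set)


def malfchk_before_fix(lock, malfunctioning, last_seen_holder):
    """Decision without a retest: trusts the stale observation."""
    if last_seen_holder in malfunctioning:
        return "treat_as_fatal"
    return "keep_spinning"


def malfchk_after_fix(lock, malfunctioning, last_seen_holder):
    """Decision with a retest of the current lock state."""
    if last_seen_holder in malfunctioning:
        if last_seen_holder in lock.share_holders:
            return "treat_as_fatal"   # assumed action: holder really failed
        return "retry_obtain"         # observation was stale; lock is free
    return "keep_spinning"


# Replay of the timeline in [1].
lock = SharedLock()
malfunctioning = set()

lock.share_holders.add(0x0A)      # x'0A' acquires a lock share
last_seen = 0x0A                  # x'11' sees x'0A's share
lock.share_holders.discard(0x0A)  # x'0A' releases its share, runs guest
malfunctioning.add(0x0A)          # x'0A' gets machine check, is marked

before = malfchk_before_fix(lock, malfunctioning, last_seen)
after = malfchk_after_fix(lock, malfunctioning, last_seen)
```

With the retest, the stale observation of x'0A's share no longer
leads to the fatal path, because the lock is seen to be free.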

Temporary fix

*********
* HIPER *
*********
FOR RELEASE VM/ESA CP/ESA R640 :
PREREQ: VM65988
CO-REQ: NONE
IF-REQ: NONE
FOR RELEASE VM/ESA CP/ESA R710 :
PREREQ: NONE
CO-REQ: NONE
IF-REQ: NONE

Comments

APAR Information

APAR number
VM65690
Reported component name
VM CP
Reported component ID
568411202
Reported release
640
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2018-10-03
Closed date
2018-11-02
Last modified date
2019-09-30

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

UM35289 UM35324

Modules/Macros

```
HCPSXL
```

Fix information

Fixed component name
VM CP
Fixed component ID
568411202

Applicable component levels

R640 PSY UM35289
UP18/11/14 P 1901 ¢
R710 PSY UM35324
UP18/11/14 P 1901 ¢

Fix is available

Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
30 September 2019
