IBM Support

VM65690: Z/VM HANG DUE TO ERRORS IN MACHINE CHECK RECOVERY

A fix is available

APAR status

  • Closed as program error.

Error description

  • System hangs can result during machine check recovery due to
    incorrect tests for processor malfunctions in spinlock manager.
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All users of z/VM.                           *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    ****************************************************************
    * RECOMMENDATION: APPLY PTF                                    *
    ****************************************************************
    This error can only occur if a non-repressible machine check
    or processor checkstop occurs while a guest is running.
    The cause of the hang is in a code path that only runs when a
    processor malfunction condition occurs.
    
    The handling of the machine check for processor x'0A' was
    done properly and that processor was successfully restarted.
    The cause of the hang was indirectly related.
    
    The sequence of events leading to the system hang on the Vary
    Proc Lock (HCPRCCVA) involves multiple functions and actions
    by multiple processors as follows:
    
    1. Sequence of events in processor x'11's spinlock manager:
       During an attempt to acquire the Scheduler Lock exclusive,
       the HCPSXL MALFCHK subroutine checks to see if the
       processor last observed holding the lock has malfunctioned.
    
       In this case, although processor x'11' thought that
       processor x'0A' held a share of the Scheduler Lock when it
       malfunctioned, the machine check old PSW shows that
       processor x'0A' was running a guest in SIE at the time of
       the machine check and therefore it would not have held a
       share of the Scheduler Lock. The spinlock data structures
       confirm this.
    
       If HCPSXL had done the lock hold check again after
       determining the processor had malfunctioned, it would have
       seen that the processor no longer held the lock and would
       have finished its own lock obtain. This missing lock test is
       the problem.
    
       For this problem to occur, this was the sequence of events:
          PROC_X'11'               PROC_X'0A'
          wants lock exclusive     acquires lock share
          sees x'0A's share
                                   releases lock share
                                   runs guest
                                   gets machine check
                                   marked as malfunctioning
          sees x'0A' malfunction
          goes to HCPMPRPR
    
    2. Sequence of events on processor x'11' after HCPMPRPR call:
       In this case, processor x'11' general registers indicate
       that it had gotten to HCPMPR label TERMSELF and set
       PFXDOWNR=PFXDOWN before calling HCPSGPIN because processor
       x'11's PFXRCVFG.PFXMALFW is set.
    
       It can also be seen that HCPSGPIN got control and called the
       SGPTERM subroutine.  This path is taken to terminate a
       processor when it believes the system is terminating.
       This code got control because of the problem in [1], where
       HCPSXL incorrectly identified the situation as a fatal
       condition.
    
       If instead there was truly a machine check on a processor
       that held a spinlock then the processor receiving the
       machine check would have requested the system to terminate
       with ABENDMCH005.
    
       The fact that HCPMPRPR placed a processor in a disabled wait
       state while the rest of the system continued to run is not a
       defect in HCPMPRPR.  Rather HCPMPR is a victim of incorrect
       processing in its caller.  Fixing the problem in [1] will
       remove the conditions that led to this incorrect state.
    
    3. Sequence of events in Monitor MRPRCMFC D5 R13 processing:
       In preparation for generating these records, HCPMNPDM loops
       requesting each processor to extract its CPUMF counters.
       The Vary Proc Lock (HCPRCCVA) is acquired at the beginning
       and held continuously while the request is processed on
       each of the processors. The processing passes control from
       one processor to the next using SIGP Emergency Signals by
       HCPSGPNC/HCPSGPPK. Before the request is passed to the next
       processor, the processor state is checked to be sure it is
       online and that there is no error when its state is tested
       using SIGP Sense.
    
       In this case HCPSGPNC decided that processor x'11' was
       running properly because SIGP Sense returned CC=0 and
       PFXTYPE=PFXTYSLV, which is a valid state for a non-Master
       processor. However, the EMSBK stacked on processor x'11' for
       this function was never processed, so Monitor processing
       stopped here.
    
       Sequence [2] caused the Vary Processor Lock hang because
       processor x'11' went into a wait state but did not do any of
       the formal processor deconfiguration processing. As a result,
       all other processors treated processor x'11' as if it were
       online and able to respond to requests, although it could
       not because it was in a disabled wait state.
    
       The processing in HCPSGPNC is another case of a victim.
       Fixing the problem in [1] will remove the conditions that
       led to this incorrect state.
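
The stall described in [3] can be pictured with a small Python model. This is only an illustrative sketch, not CP code: the class `Cpu`, its fields, and the function `relay` are hypothetical names, and the model assumes that the status test reports success for a processor that sits in a disabled wait without having been deconfigured.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Toy processor record; all fields are hypothetical."""
    addr: int
    online: bool = True           # never formally deconfigured
    disabled_wait: bool = False   # no longer takes interrupts
    pending: list = field(default_factory=list)

    def sense_ok(self) -> bool:
        # Assumption: the status test does not reveal the disabled wait.
        return True


def relay(cpus):
    """Pass one request from CPU to CPU while one lock stays held.

    Returns (addresses that processed the request, lock still held?).
    """
    lock_held = True
    done = []
    for cpu in cpus:
        if not (cpu.online and cpu.sense_ok()):
            continue              # skip processors that look unusable
        cpu.pending.append("extract counters")
        if cpu.disabled_wait:
            # The request is stacked but never processed, so the relay
            # stops here and the lock is never released.
            return done, lock_held
        cpu.pending.pop()
        done.append(cpu.addr)
    lock_held = False             # normal completion releases the lock
    return done, lock_held
```

With processors x'0A', x'11' (in a disabled wait) and x'12', the model stops at x'11' with the lock still held, which mirrors the hang on the Vary Proc Lock described above.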
    
    The problem in [1] can occur only if a processor checkstop or
    machine check coincides with the timing of spinlock request
    handling by HCPSXL shown in the sequence of events in [1].
    Although the likelihood of that is rather low, it is the
    cause of the hang condition and should be fixed.
    
    The invalid states in sequences [2] and [3] are really cases of
    victims of the problem in [1] so no change is needed for them.
    

Problem conclusion

  • This defect is corrected by modifying the malfunction check
    routine to reinspect the lock state when it finds that the
    processor it last observed holding the lock is marked as
    malfunctioning.
    
    Code changes were made to:
     - HCPSXL - MALFCHK subroutine was modified to retest the
                lock state if the processor was observed as
                malfunctioning.
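
The effect of the added retest can be sketched in Python by replaying the sequence of events from [1]. This is a simplified model under stated assumptions, not the HCPSXL implementation: `Lock`, `Cpu`, both `malfchk_*` functions, and the result strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Lock:
    """Toy lock that records which processor addresses hold a share."""
    sharers: set = field(default_factory=set)


@dataclass
class Cpu:
    addr: int
    malfunctioning: bool = False


def malfchk_without_retest(observed_holder: Cpu) -> str:
    """Trusts the earlier observation that the processor held the lock."""
    if observed_holder.malfunctioning:
        return "fatal_path"
    return "keep_spinning"


def malfchk_with_retest(lock: Lock, observed_holder: Cpu) -> str:
    """Reinspects the lock state before treating the case as fatal."""
    if observed_holder.malfunctioning:
        if observed_holder.addr in lock.sharers:
            return "fatal_path"      # holder really failed with the lock
        return "retry_obtain"        # observation was stale
    return "keep_spinning"


def replay():
    """Replay the interleaving from [1] and return both verdicts."""
    lock = Lock()
    cpu_0a = Cpu(0x0A)
    lock.sharers.add(cpu_0a.addr)      # x'0A' acquires a lock share
    observed = cpu_0a                  # x'11' sees x'0A's share
    lock.sharers.discard(cpu_0a.addr)  # x'0A' releases the lock share
    cpu_0a.malfunctioning = True       # guest runs, machine check, marked
    return (malfchk_without_retest(observed),
            malfchk_with_retest(lock, observed))
```

In this model `replay()` yields the fatal path for the check without a retest and a retried obtain for the check with one, because the second variant looks at the current share set instead of the stale observation.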
    

Temporary fix

  • *********
    * HIPER *
    *********
    FOR RELEASE VM/ESA CP/ESA R640 :
    PREREQ: VM65988
    CO-REQ: NONE
    IF-REQ: NONE
    FOR RELEASE VM/ESA CP/ESA R710 :
    PREREQ: NONE
    CO-REQ: NONE
    IF-REQ: NONE
    

Comments

APAR Information

  • APAR number

    VM65690

  • Reported component name

    VM CP

  • Reported component ID

    568411202

  • Reported release

    640

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-10-03

  • Closed date

    2018-11-02

  • Last modified date

    2019-09-30

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UM35289 UM35324

Modules/Macros

  • HCPSXL
    

Fix information

  • Fixed component name

    VM CP

  • Fixed component ID

    568411202

Applicable component levels

  • R640 PSY UM35289

       UP18/11/14 P 1901

  • R710 PSY UM35324

       UP18/11/14 P 1901

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
30 September 2019