IBM Support

VM65971: UNRESPONSIVE MASTER PROCESSOR LOOPING BUT NO ABEND

A fix is available

APAR status

  • Closed as program error.

Error description

  • Unresponsive processor detection runs on the Master so it can't
    detect a looping Master processor. The system hangs as a result.
    

Local fix

  • If a system hang is observed, use SYSTEM RESTART rather than
    SNAPDUMP to obtain an ABENDSVC002 dump and restart the system.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All Users of z/VM                            *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    ****************************************************************
    * RECOMMENDATION: APPLY PTF                                    *
    ****************************************************************
    The master processor becomes unresponsive, which causes the
    entire system to become unresponsive and stop doing
    productive work. The most common cause of an unresponsive
    Master processor is a continuous loop, in which case the
    Master processor shows very high utilization. The system
    appears hung and is unlikely to be able to process commands.
    
    Further evidence of this problem is the lack of MSHCPMPG9152E
    messages to report unresponsive non-Master processors. The
    existing unresponsive processor detection function runs on the
    master processor so it is unable to detect that the master
    processor has become unresponsive.
    
    If the OPERATOR continues to receive message MSHCPMPG9152E,
    then the Master processor is responsive.  The text of the
    MSHCPMPG9152E message indicates which non-Master processor
    is unresponsive.
    
    The problem that this APAR addresses can only occur if there
    is another prior error that causes the Master processor to
    become unresponsive. A variety of conditions could cause this
    but typically the processor is in a spin loop in CP code so
    that it never returns to the dispatcher to do other work and
    never opens interrupt windows.
    

Problem conclusion

  • A mechanism for non-Master processors to detect an unresponsive
    Master processor has been added to areas of the system likely
    to execute when the Master is looping. If an unresponsive
    Master processor is detected, an ABENDMCW002 dump is generated.
    To provide a way to disable the detection of unresponsive
    processors on second level test systems, a new operand on the
    FEATURES statement in the system configuration file is added:
    UNRESPONSIVE_PROCESSOR_DETECTION.
    
    All processors repeatedly mark themselves as responsive in
    PFXDETUP, which is tested by both the existing and the new
    unresponsive processor checks.  The checking mechanism sets
    PFXDETUP to indicate that the processor has been tested for
    responsiveness.  A processor that is operating normally
    resets PFXDETUP in the Dispatcher to a state indicating that
    it is responsive.  A looping processor does not execute the
    frequent code path that resets PFXDETUP, so it appears
    unresponsive when the checking code tests PFXDETUP again.
    When a processor does not reset PFXDETUP for longer than
    allowed, it is considered unresponsive and an ABENDMCW002
    dump is generated.
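
The handshake described above can be pictured with a small simulation. This is only an illustrative sketch, not CP code: the APAR text does not describe the layout of PFXDETUP or the allowed time limit, so the two-state flag, the `Processor` class, the `check` function and the 5-second limit are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class DetectState(Enum):
    RESPONSIVE = "responsive"  # written by the processor itself
    TESTED = "tested"          # written by the checking code


@dataclass
class Processor:
    name: str
    state: DetectState = DetectState.RESPONSIVE
    tested_since: Optional[float] = None  # when the checker first saw no reset

    def dispatcher_pass(self) -> None:
        """Frequent code path: the processor marks itself responsive."""
        self.state = DetectState.RESPONSIVE
        self.tested_since = None


def check(proc: Processor, now: float, limit: float) -> bool:
    """Return True if proc looks unresponsive at time `now` (hypothetical)."""
    if proc.state is DetectState.RESPONSIVE:
        # The processor reset the flag since the last check: mark it tested.
        proc.state = DetectState.TESTED
        proc.tested_since = now
        return False
    # Flag is still TESTED, so the processor has not run its reset path.
    return (now - proc.tested_since) > limit


def simulate() -> List[Tuple[bool, bool]]:
    """The master loops (never resets); the alternate runs normally."""
    master, alternate = Processor("master"), Processor("alternate")
    limit = 5.0  # assumed value, not taken from the APAR
    results = []
    for now in (0.0, 2.0, 4.0, 6.0):
        alternate.dispatcher_pass()  # the looping master never does this
        results.append((check(master, now, limit), check(alternate, now, limit)))
    return results
```

In this sketch the alternate processor is never flagged because it resets its state before every check, while the looping master is flagged only on the last check, once the flag has stayed unreset for longer than the assumed limit.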
    
    HCPDSP checks whether the Master is responsive on entry to
    wait state.  HCPDSP, HCPSXL and HCPSYN also check the
    responsiveness of the Master in the code that checks for
    processor malfunctions while spinning to acquire a spinlock.
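
The placement of the check inside spin loops can be sketched as follows. This is a conceptual illustration only, not the HCPSXL or HCPSYN logic: the `spin_acquire` function, the `check_every` interval and the `MasterUnresponsive` exception (standing in for the path that ends in the ABENDMCW002 dump) are assumptions made for the example.

```python
from typing import Callable


class MasterUnresponsive(Exception):
    """Hypothetical stand-in for the path that leads to ABENDMCW002."""


def spin_acquire(
    try_lock: Callable[[], bool],
    master_is_unresponsive: Callable[[], bool],
    check_every: int = 1000,  # assumed interval, not from the APAR
) -> int:
    """Spin until try_lock() succeeds; return the number of failed attempts.

    Every `check_every` failed attempts, the spinning (non-Master)
    processor also asks whether the Master still looks responsive.
    """
    spins = 0
    while not try_lock():
        spins += 1
        if spins % check_every == 0 and master_is_unresponsive():
            raise MasterUnresponsive(f"master unresponsive after {spins} spins")
    return spins
```

The idea this is meant to convey is that a processor stuck waiting on a lock is already executing a loop, so that loop is a natural place to notice that the Master has stopped responding.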
    
    This APAR provides FFDC (first-failure data capture) and
    availability improvements.  The ABENDMCW002 dump for the
    unresponsive Master processor condition provides diagnostic
    data used to determine the reason the Master processor was
    unresponsive.  The system is re-IPLed as soon as the dump
    generation completes which greatly reduces the length of time
    the system is unavailable compared to the hang condition that
    required manual intervention for recovery actions.
    
    The detection is only applicable in a multiprocessor partition
    when there is at least one alternate processor in addition to
    the master processor.
    
    The documentation for system configuration FEATURES statement
    in the following books will be changed to include information
    about the new UNRESPONSIVE_PROCESSOR_DETECTION operand.
    - SC24-6178-14 z/VM V640 "CP Planning and Administration".
    - SC24-6271-03 z/VM V710 "CP Planning and Administration".
    
       FEATURES DISABLE UNRESPONSIVE_PROCESSOR_DETECTION
    
         specifies that detection of unresponsive processors will
         not occur on second level systems.  A common cause of an
         unresponsive processor is that it is looping continuously
         and is no longer doing productive work.  If the master
         processor is unresponsive, the system will appear hung and
         is unlikely to be able to process commands.  Disabling
         unresponsive processor detection prevents CP from
         recognizing an unresponsive master or non-master processor
         and from initiating normal error recovery, which could be
         to restart the unresponsive processor or to abend the
         system.
    
         This option is intended for use in diagnostic situations
         when running as a second level system where CP might be
         stopped for long periods of time when using the CP TRACE
         command or similar facilities to debug code within the CP
         nucleus or a CPXLOADed nucleus extension.  Virtual
         processors that aren't being traced could run enough to
         recognize that the traced processor is unresponsive.
         Specifying DISABLE UNRESPONSIVE_PROCESSOR_DETECTION only
         affects CP when it is running second level; when running
         first level, this specification is ignored.
    
       FEATURES ENABLE UNRESPONSIVE_PROCESSOR_DETECTION
    
         specifies that detection of unresponsive processors
         occurs.  A common cause of an unresponsive processor is
         that it is looping continuously and is no longer doing
         productive work.  If the master processor is unresponsive,
         the system will appear hung and is unlikely to be able to
         process commands.  Detection of unresponsive processors
         is the default and allows CP to initiate normal error
         recovery, which could be to restart the unresponsive
         processor or could be to abend the system.
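
The combined effect of the two operands can be summarized as a small decision function. This is a sketch of the rules as stated in this APAR (DISABLE is honored only second level, detection is the default, and the Master check needs at least one alternate processor); the function and parameter names are hypothetical and do not correspond to any CP interface.

```python
def detection_enabled(second_level: bool, disable_specified: bool) -> bool:
    """Is unresponsive processor detection in effect?

    First level: always on, because DISABLE is ignored there.
    Second level: on unless FEATURES DISABLE
    UNRESPONSIVE_PROCESSOR_DETECTION was specified.
    """
    if not second_level:
        return True
    return not disable_specified


def master_check_applies(
    second_level: bool, disable_specified: bool, processor_count: int
) -> bool:
    """The Master check needs detection on and at least one alternate."""
    has_alternate = processor_count >= 2
    return has_alternate and detection_enabled(second_level, disable_specified)
```

For example, `detection_enabled(False, True)` is `True` because the DISABLE specification is ignored first level, while `master_check_applies(False, False, 1)` is `False` because a uniprocessor partition has no alternate processor to observe the Master.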
    
    *** Warning *** When enabling or disabling unresponsive
    processor detection in the z/VM system configuration file, add
    a separate FEATURES statement on a new line rather than
    including UNRESPONSIVE_PROCESSOR_DETECTION on an existing
    FEATURES line.  This avoids problems with existing FEATURES
    statements when IPLing CPLOAD MODULEs that do not include
    this APAR.
    
    The documentation for Abend MCW002 in the following books
    will be changed.
      GC24-6177-12 z/VM V640 "CP Messages and Codes" book.
      GC24-6270-03 z/VM V710 "CP Messages and Codes" book.
    
     MCW002
    
     Explanation
     Either the master processor or an alternate processor was
     discovered to be unresponsive.  In the case when the master
     processor is unresponsive, no recovery can be attempted so
     the system was terminated.  In the case of an unresponsive
     alternate processor, a reset was done but the processor was
     doing system work when it was reset, so recovery was not
     possible.  This abend can occur if running first level, or
     if running second level and unresponsive processor detection
     is enabled.
    
     User response
     See z/VM: Diagnosis Guide for information on gathering the
     documentation you need to assist IBM in diagnosing the
     problem; then contact your IBM Support Center personnel.
    

Temporary fix

  • *********
    * HIPER *
    *********
    FOR RELEASE VM/ESA CP/ESA R640 :
    PREREQ: VM65776 VM65988 VM66248
    CO-REQ: NONE
    IF-REQ: NONE
    FOR RELEASE VM/ESA CP/ESA R710 :
    PREREQ: VM66265 VM66283
    CO-REQ: NONE
    IF-REQ: NONE
    

Comments

APAR Information

  • APAR number

    VM65971

  • Reported component name

    VM CP

  • Reported component ID

    568411202

  • Reported release

    630

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-01-18

  • Closed date

    2019-09-25

  • Last modified date

    2021-06-29

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UM35459 UM35460

Modules/Macros

  • HCPCKMST HCPDSP   HCPMPC   HCPOM2   HCPSRMBK HCPSXL   HCPSYN
    HCPSYS   HCPSYSCM HCPZSC
    

Publications Referenced
GC24-6177-12 SC24-6178-14 GC24-6270-03 SC24-6271-03

Fix information

  • Fixed component name

    VM CP

  • Fixed component ID

    568411202

Applicable component levels

  • R640 PSY UM35459

       UP19/09/30 P 2001 ¢

  • R710 PSY UM35460

       UP19/09/30 P 2101 ¢

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
30 June 2021