A fix is available
APAR status
Closed as program error.
Error description
Unresponsive processor detection runs on the Master processor, so it cannot detect a looping Master processor. When the Master loops, the system hangs.
Local fix
If a system hang is observed, use SYSTEM RESTART rather than SNAPDUMP. SYSTEM RESTART produces an ABENDSVC002 dump and restarts the system.
Problem summary
****************************************************************
* USERS AFFECTED: All Users of z/VM                            *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
****************************************************************
* RECOMMENDATION: APPLY PTF                                    *
****************************************************************
The Master processor becomes unresponsive, which causes the entire system to become unresponsive and stop doing productive work. The most common cause of an unresponsive Master processor is that it is looping continuously, so the Master processor shows very high utilization. The system appears hung and is unlikely to be able to process commands.

Further evidence of this problem is the lack of MSHCPMPG9152E messages reporting unresponsive non-Master processors. The existing unresponsive processor detection function runs on the Master processor, so it is unable to detect that the Master processor itself has become unresponsive. If the OPERATOR continues to receive message MSHCPMPG9152E, then the Master processor is responsive. The text of the MSHCPMPG9152E message indicates which non-Master processor is unresponsive.

The problem that this APAR addresses can occur only if a prior error causes the Master processor to become unresponsive. A variety of conditions could cause this, but typically the processor is in a spin loop in CP code, so it never returns to the dispatcher to do other work and never opens interrupt windows.
Problem conclusion
A mechanism for non-Master processors to detect an unresponsive Master processor has been added to areas of the system likely to execute when the Master is looping. If an unresponsive Master processor is detected, an ABENDMCW002 dump is generated. To provide a way to disable the detection of unresponsive processors on second level test systems, a new operand is added to the FEATURES statement in the system configuration file: UNRESPONSIVE_PROCESSOR_DETECTION.

All processors repeatedly mark themselves as responsive in PFXDETUP, which is tested by both the existing and the new unresponsive processor checks. The checking mechanism sets PFXDETUP to indicate that the processor has been tested for responsiveness. Processors that are operating normally reset PFXDETUP in the Dispatcher to a state indicating that they are responsive. A looping processor does not execute the frequent code path that resets PFXDETUP, so it appears unresponsive when the checking code tests PFXDETUP again. When a processor does not reset PFXDETUP for longer than allowed, it is considered unresponsive and an ABENDMCW002 dump is generated.

HCPDSP checks whether the Master is responsive on entry to wait state. HCPDSP, HCPSXL and HCPSYN check the responsiveness of the Master in the code that checks for processor malfunctions while looping to acquire a spinlock.

This APAR provides FFDC (first-failure data capture) and availability improvements. The ABENDMCW002 dump for the unresponsive Master processor condition provides diagnostic data used to determine why the Master processor was unresponsive. The system is re-IPLed as soon as dump generation completes, which greatly reduces the length of time the system is unavailable compared to the hang condition, which required manual intervention for recovery.

The detection applies only in a multiprocessor partition, where there is at least one alternate processor in addition to the Master processor.
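The PFXDETUP handshake described above can be pictured with a small simulation. This is only an illustrative sketch, not CP code: the names `DetectState`, `Processor`, `check` and `ticks_until_detected`, the use of logical "ticks", and the `allowed_ticks` threshold are all assumptions made for the example. The APAR does not state the actual field values or the time limit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DetectState(Enum):
    RESPONSIVE = "responsive"  # written by the processor itself (dispatcher path)
    TESTED = "tested"          # written by the checker after it inspects the flag


@dataclass
class Processor:
    name: str
    looping: bool = False                       # True models a CPU stuck in a spin loop
    flag: DetectState = DetectState.RESPONSIVE  # hypothetical stand-in for PFXDETUP
    ticks_unreset: int = 0                      # how long the flag has stayed TESTED

    def dispatcher_pass(self) -> None:
        """A healthy processor re-marks itself responsive; a looping one never does."""
        if not self.looping:
            self.flag = DetectState.RESPONSIVE


def check(target: Processor, allowed_ticks: int) -> bool:
    """Run by another processor. Returns True if target is judged unresponsive."""
    if target.flag is DetectState.RESPONSIVE:
        # Target reset the flag since the last check: mark it tested and start over.
        target.flag = DetectState.TESTED
        target.ticks_unreset = 0
        return False
    # Flag is still TESTED: the target has not passed through its dispatcher.
    target.ticks_unreset += 1
    return target.ticks_unreset > allowed_ticks


def ticks_until_detected(loop_starts_at: int, total_ticks: int,
                         allowed_ticks: int = 3) -> Optional[int]:
    """Simulate a Master that starts looping at a given tick.

    Returns the tick at which a non-Master checker declares it unresponsive,
    or None if that never happens within total_ticks.
    """
    master = Processor("master")
    for tick in range(total_ticks):
        if tick >= loop_starts_at:
            master.looping = True
        master.dispatcher_pass()
        if check(master, allowed_ticks):
            return tick  # in the real system this is where ABENDMCW002 would be taken
    return None
```

In this model a Master that starts looping at tick 5 with `allowed_ticks=3` is flagged at tick 8, while a Master that never loops is never flagged. The point of the sketch is that the check has to run on a processor other than the one being tested, which is what the APAR adds for the Master.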
The documentation for the system configuration FEATURES statement in the following books will be changed to include information about the new UNRESPONSIVE_PROCESSOR_DETECTION operand:
- SC24-6178-14 z/VM V640 "CP Planning and Administration"
- SC24-6271-03 z/VM V710 "CP Planning and Administration"

FEATURES DISABLE UNRESPONSIVE_PROCESSOR_DETECTION
specifies that detection of unresponsive processors will not occur on second level systems. A common cause of an unresponsive processor is that it is looping continuously and is no longer doing productive work. If the master processor is unresponsive, the system will appear hung and is unlikely to be able to process commands. Disabling unresponsive processor detection prevents CP from recognizing an unresponsive master or non-master processor and from initiating normal error recovery, which could be to restart the unresponsive processor or to abend the system. This option is intended for diagnostic situations when running as a second level system, where CP might be stopped for long periods of time while the CP TRACE command or similar facilities are used to debug code within the CP nucleus or a CPXLOADed nucleus extension. Virtual processors that are not being traced could run long enough to recognize that the traced processor is unresponsive. Specifying DISABLE UNRESPONSIVE_PROCESSOR_DETECTION affects CP only when it is running second level; when CP is running first level, this specification is ignored.

FEATURES ENABLE UNRESPONSIVE_PROCESSOR_DETECTION
specifies that detection of unresponsive processors occurs. A common cause of an unresponsive processor is that it is looping continuously and is no longer doing productive work. If the master processor is unresponsive, the system will appear hung and is unlikely to be able to process commands. Detection of unresponsive processors is the default and allows CP to initiate normal error recovery, which could be to restart the unresponsive processor or to abend the system.
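The interaction between the operand and the level at which CP runs can be summarized as a small truth function. This is a sketch of the rule as described above, not an IBM interface: the function name `detection_active` and its parameters are hypothetical.

```python
from typing import Optional


def detection_active(second_level: bool, setting: Optional[str] = None) -> bool:
    """Return True if unresponsive processor detection would be in effect.

    second_level: True when CP runs as a guest of another z/VM system.
    setting: "ENABLE", "DISABLE", or None when the operand is not specified
             (detection is the default).
    """
    if setting not in (None, "ENABLE", "DISABLE"):
        raise ValueError("setting must be 'ENABLE', 'DISABLE' or None")
    # DISABLE is honored only second level; first level it is ignored.
    return not (second_level and setting == "DISABLE")
```

Under this reading, the only combination that turns detection off is DISABLE on a second level system; every other combination leaves it on.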
*** Warning ***
When enabling or disabling unresponsive processor detection in the z/VM system configuration file, add a separate FEATURES statement on a new line rather than including UNRESPONSIVE_PROCESSOR_DETECTION on an existing FEATURES line. This avoids problems with existing FEATURES statements when IPLing CPLOAD modules that do not include this APAR.

The documentation for abend MCW002 in the following books will be changed:
- GC24-6177-12 z/VM V640 "CP Messages and Codes"
- GC24-6270-03 z/VM V710 "CP Messages and Codes"

MCW002
Explanation: Either the master processor or an alternate processor was discovered to be unresponsive. When the master processor is unresponsive, no recovery can be attempted, so the system was terminated. When an alternate processor is unresponsive, a reset was done, but the processor was doing system work when it was reset, so recovery was not possible. This abend can occur if running first level, or if running second level with unresponsive processor detection enabled.
User response: See z/VM: Diagnosis Guide for information on gathering the documentation you need to assist IBM in diagnosing the problem; then contact your IBM Support Center personnel.
Temporary fix
*********
* HIPER *
*********
FOR RELEASE VM/ESA CP/ESA R640 :
PREREQ: VM65776 VM65988 VM66248
CO-REQ: NONE
IF-REQ: NONE
FOR RELEASE VM/ESA CP/ESA R710 :
PREREQ: VM66265 VM66283
CO-REQ: NONE
IF-REQ: NONE
Comments
APAR Information
APAR number
VM65971
Reported component name
VM CP
Reported component ID
568411202
Reported release
630
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2017-01-18
Closed date
2019-09-25
Last modified date
2021-06-29
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
UM35459 UM35460
Modules/Macros
HCPCKMST HCPDSP HCPMPC HCPOM2 HCPSRMBK HCPSXL HCPSYN HCPSYS HCPSYSCM HCPZSC
GC24617712 | SC24617814 | GC24627003 | SC24627103 |
Fix information
Fixed component name
VM CP
Fixed component ID
568411202
Applicable component levels
Fix is available
Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.
Document Information
Modified date:
30 June 2021