Deallocation error log entries

Three different error log messages are associated with CPU deallocation.

The following are examples.

errpt short format - summary
The following is an example of entries displayed by the errpt command (without options):
# errpt
IDENTIFIER      TIMESTAMP       T    C    RESOURCE_NAME    DESCRIPTION
804E987A        1008161399      I    O    proc4            CPU DEALLOCATED
8470267F        1008161299      T    S    proc4            CPU DEALLOCATION ABORTED
1B963892        1008160299      P    H    proc4            CPU FAILURE PREDICTED
#
  • If processor deallocation is enabled, a CPU FAILURE PREDICTED message is always followed by either a CPU DEALLOCATED message or a CPU DEALLOCATION ABORTED message.
  • If processor deallocation is not enabled, only the CPU FAILURE PREDICTED message is logged. Enabling processor deallocation any time after one or more CPU FAILURE PREDICTED messages have been logged initiates the deallocation process and results in a success or failure error log entry, as described above, for each processor reported failing.
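
The pairing rule above can be checked mechanically. The following is a minimal Python sketch, assuming the summary output has already been captured as text. The function names parse_summary and unresolved_predictions are hypothetical, and the sketch matches entries by resource name only, ignoring timestamps.

```python
# Sample lines copied from the errpt summary example above.
SAMPLE = """\
IDENTIFIER      TIMESTAMP       T    C    RESOURCE_NAME    DESCRIPTION
804E987A        1008161399      I    O    proc4            CPU DEALLOCATED
8470267F        1008161299      T    S    proc4            CPU DEALLOCATION ABORTED
1B963892        1008160299      P    H    proc4            CPU FAILURE PREDICTED
"""

PREDICTED = "CPU FAILURE PREDICTED"
FOLLOW_UPS = {"CPU DEALLOCATED", "CPU DEALLOCATION ABORTED"}


def parse_summary(text):
    """Split each summary line into its six columns (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 5)  # DESCRIPTION may contain spaces
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip the header, shell prompts, and blank lines
        ident, stamp, etype, eclass, resource, desc = fields
        entries.append({
            "identifier": ident,
            "timestamp": stamp,
            "type": etype,
            "class": eclass,
            "resource": resource,
            "description": desc.strip(),
        })
    return entries


def unresolved_predictions(entries):
    """Return resources with a predicted failure but no follow-up entry."""
    predicted = {e["resource"] for e in entries if e["description"] == PREDICTED}
    resolved = {e["resource"] for e in entries if e["description"] in FOLLOW_UPS}
    return sorted(predicted - resolved)


print(unresolved_predictions(parse_summary(SAMPLE)))  # []
```

An empty list means every processor reported as failing also has a success or abort entry. A non-empty list is what you would expect to see when processor deallocation is not enabled.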
errpt long format - detailed description
The following is the form of output obtained with errpt -a:
  • CPU_FAIL_PREDICTED

    Error description: Predictive Processor Failure

    This error indicates that the hardware detected that a processor has a high probability of failing in the near future. It is always logged, whether or not processor deallocation is enabled.

    DETAIL DATA: Physical processor number, location

    Example error log entry - long form
    	LABEL:			CPU_FAIL_PREDICTED
    	IDENTIFIER:		1655419A
    
    	Date/Time:		Thu Sep 30 13:42:11
    	Sequence Number:	53
    	Machine Id:		00002F0E4C00
    	Node Id:		auntbea
    	Class:			H
    	Type:			PEND
    	Resource Name:		proc25
    	Resource Class:		processor
    	Resource Type:		proc_rspc
    	Location:		00-25
    
    	Description
    	CPU FAILURE PREDICTED
    
    	Probable Causes
    	CPU FAILURE
    
    	Failure Causes
    	CPU FAILURE
    
    		Recommended Actions
    		ENSURE CPU GARD MODE IS ENABLED
    		RUN SYSTEM DIAGNOSTICS.
    
    	Detail Data
    	PROBLEM DATA
    	0144	1000	0000	003A	8E00	9100	1842	1100	1999	0930	4019
    	0000	0000	0000	0000	0000
    	0000	0000	0000	0000	0000	0000	0000	0000	4942	4D00	5531
    	2E31	2D50	312D	4332	0000
    	0002	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000
    	0000	0000	0000	0000	0000
    	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000
    	0000	0000	0000	0000	0000
    	...	...	...	...	...
    
  • CPU_DEALLOC_SUCCESS

    Error Description: A processor has been successfully deallocated after detection of a predictive processor failure. This message is logged when processor deallocation is enabled, and when the CPU has been successfully deallocated.

    DETAIL DATA: Logical CPU number of deallocated processor.

    Example: error log entry - long form:
    	LABEL:			CPU_DEALLOC_SUCCESS
    	IDENTIFIER:		804E987A
    
    	Date/Time:		Thu Sep 30 13:44:13
    	Sequence Number:	63
    	Machine Id:		00002F0E4C00
    	Node Id:		auntbea
    	Class:			O
    	Type:			INFO
    	Resource Name:		proc24
    
    	Description
    	CPU DEALLOCATED
    
    
    		Recommended Actions
    		MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    
    	Detail Data
    	LOGICAL DEALLOCATED CPU NUMBER
    
    		0
    In this example, proc24 was successfully deallocated and was logical CPU 0 when the failure occurred.
  • CPU_DEALLOC_FAIL

    Error Description: A processor deallocation, due to a predictive processor failure, was not successful. This message is logged when CPU deallocation is enabled, and when the CPU has not been successfully deallocated.

    DETAIL DATA: Reason code, logical CPU number, and additional information that depends on the type of failure.

    The reason code is a numeric hexadecimal value. The possible reason codes are:
    Reason code   Description
    2             One or more processes or threads remain bound to the last logical CPU. In this case, the detail data gives the PIDs of the offending processes.
    3             A registered driver or kernel extension returned an error when notified. In this case, the detail data field contains the name of the offending driver or kernel extension (ASCII encoded).
    4             Deallocating the processor would leave the machine with fewer than two available CPUs. The operating system does not deallocate more than N-2 processors on an N-way machine. This avoids confusing applications or kernel extensions that use the total number of available processors to determine whether they are running on a uniprocessor (UP) system, where it is safe to skip the use of multiprocessor locks, or on a symmetric multiprocessor (SMP) system.
    200 (0xC8)    Processor deallocation is disabled (the ODM attribute cpuguard has a value of disable). You normally do not see this error unless you start ha_star manually.
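
The reason codes listed above can be looked up from the DEALLOCATION ABORTED CAUSE field of a long-format entry. The following is a minimal Python sketch. The function name describe_cause and the shortened descriptions are illustrative, and the sketch assumes the field is always printed as space-separated hexadecimal groups, as in the examples in this section.

```python
# Reason codes from the table above, keyed by their hexadecimal value.
REASONS = {
    0x2: "processes or threads remain bound to the last logical CPU",
    0x3: "a registered driver or kernel extension returned an error",
    0x4: "deallocation would leave fewer than two available CPUs",
    0xC8: "processor deallocation is disabled (cpuguard = disable)",
}


def describe_cause(cause_field):
    """Convert a field such as '0000 0003' to (code, description)."""
    # Assumption: the groups form one hexadecimal number when joined.
    code = int(cause_field.replace(" ", ""), 16)
    return code, REASONS.get(code, "unknown reason code")


print(describe_cause("0000 0003"))
```

Codes that are not in the table are reported as unknown rather than guessed.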

    Examples: error log entries - long format

    Example 1:
    	LABEL:			CPU_DEALLOC_ABORTED
    	IDENTIFIER:		8470267F
    	Date/Time:		Thu Sep 30 13:41:10
    	Sequence Number:	50
    	Machine Id:		00002F0E4C00
    	Node Id:		auntbea
    	Class:			S
    	Type:			TEMP
    	Resource Name:		proc26
    
    Description
    CPU DEALLOCATION ABORTED
    
    Probable Causes
    SOFTWARE PROGRAM
    
    Failure Causes
    SOFTWARE PROGRAM
    
    	Recommended Actions
    	MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    	SEE USER DOCUMENTATION FOR CPU GARD
    
    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0003
    DEALLOCATION ABORTED DATA
    6676 6861 6568 3200
    In this example, the deallocation for proc26 failed. The reason code 3 means that a kernel extension returned an error to the kernel notification routine. The DEALLOCATION ABORTED DATA above spells fvhaeh2, which is the name the extension used when registering with the kernel.
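
The name can be recovered from the hexadecimal data with a short script. The following is a minimal Python sketch. The function name decode_aborted_name is hypothetical, and the sketch assumes the name is ASCII and ends at the first zero byte, as it does in this example.

```python
def decode_aborted_name(data_field):
    """Decode DEALLOCATION ABORTED DATA for reason code 3 into a name."""
    raw = bytes.fromhex(data_field.replace(" ", ""))
    # Assumption: the name is terminated by the first NUL byte.
    return raw.split(b"\x00", 1)[0].decode("ascii")


print(decode_aborted_name("6676 6861 6568 3200"))  # fvhaeh2
```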
    Example 2:
    	LABEL:			CPU_DEALLOC_ABORTED
    	IDENTIFIER:		8470267F
    	Date/Time:		Thu Sep 30 14:00:22
    	Sequence Number:	71
    	Machine Id:		00002F0E4C00
    	Node Id:		auntbea
    	Class:			S
    	Type:			TEMP
    	Resource Name:		proc19
    
    Description
    CPU DEALLOCATION ABORTED
    
    Probable Causes
    SOFTWARE PROGRAM
    
    Failure Causes
    SOFTWARE PROGRAM
    
    	Recommended Actions
    	MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    	SEE USER DOCUMENTATION FOR CPU GARD
    
    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0002
    DEALLOCATION ABORTED DATA
    0000 0000 0000 4F4A
    In this example, the deallocation for proc19 failed. The reason code 2 indicates that one or more threads were bound to the last logical processor and did not unbind after receiving the SIGCPUFAIL signal. The DEALLOCATION ABORTED DATA shows that these threads belonged to process 0x4F4A.

    Options of the ps command (-o THREAD, -o BND) allow you to list all threads or processes along with the number of the CPU they are bound to, when applicable.
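
Because the process ID is logged in hexadecimal, it can help to convert it to decimal before searching process listings. The following is a minimal Python sketch. The function name decode_aborted_pid is hypothetical, and the sketch assumes the data field holds a single process ID, as in Example 2. The layout used when several processes are reported is not shown in this section.

```python
def decode_aborted_pid(data_field):
    """Convert DEALLOCATION ABORTED DATA for reason code 2 to a decimal PID."""
    # Assumption: the whole field is one hexadecimal number.
    return int(data_field.replace(" ", ""), 16)


pid = decode_aborted_pid("0000 0000 0000 4F4A")
print(pid)  # 20298
```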

    Example 3:
    	LABEL:			CPU_DEALLOC_ABORTED
    	IDENTIFIER:		8470267F
    
    	Date/Time:		Thu Sep 30 14:37:34
    	Sequence Number:	106
    	Machine Id:		00002F0E4C00
    	Node Id:		auntbea
    	Class:			S
    	Type:			TEMP
    	Resource Name:		proc2
    
    Description
    CPU DEALLOCATION ABORTED
    
    Probable Causes
    SOFTWARE PROGRAM
    
    Failure Causes
    SOFTWARE PROGRAM
    
    	Recommended Actions
    	MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    	SEE USER DOCUMENTATION FOR CPU GARD
    
    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0004
    DEALLOCATION ABORTED DATA
    0000 0000 0000 0000
    In this example, the deallocation of proc2 failed because there were two or fewer active processors at the time of failure (reason code 4).
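
The limit behind reason code 4 reduces to simple arithmetic. The following is a minimal Python sketch of the N-2 rule as described in this section. It is an illustration only, not the operating system's actual check, and the names can_deallocate and max_deallocations are hypothetical.

```python
MIN_AVAILABLE_CPUS = 2  # at least two CPUs must remain available


def can_deallocate(available_cpus):
    """True if removing one CPU still leaves at least two available."""
    return available_cpus - 1 >= MIN_AVAILABLE_CPUS


def max_deallocations(n_way):
    """At most N-2 processors can be deallocated on an N-way machine."""
    return max(n_way - MIN_AVAILABLE_CPUS, 0)


print(can_deallocate(2))     # False, corresponds to reason code 4
print(can_deallocate(3))     # True
print(max_deallocations(4))  # 2
```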