Diagnosing a host reboot with a restart light
A host is rebooted during runtime and a restart light occurs. This topic explains how to identify the restart light and how to diagnose the situations that might have caused the reboot.
Diagnosis
To identify a restart light that occurred because of a host reboot, look in the db2diag log file for an entry similar to the following one:
2009-11-02-22.56.30.416270-240 I6733A457 LEVEL: Event
PID : 1093874 TID : 1 KTID : 2461779
PROC : db2star2
INSTANCE: NODE : 001
HOSTNAME: hostC
EDUID : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
1
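To line up the db2diag entry above with entries in the operating system error log, it can help to turn its timestamp into a comparable value. The following is a minimal sketch, assuming that the trailing signed number in the timestamp (-240 in the entry above) is the offset from UTC in minutes. The function name parse_db2diag_timestamp is hypothetical and is not part of any Db2 tool.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the leading timestamp of a db2diag entry, for example
# "2009-11-02-22.56.30.416270-240". The trailing signed number is
# ASSUMED to be the UTC offset in minutes.
_DB2DIAG_TS = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})([+-]\d+)"
)


def parse_db2diag_timestamp(first_line):
    """Return a timezone-aware datetime for the first line of a db2diag entry."""
    match = _DB2DIAG_TS.match(first_line.strip())
    if match is None:
        raise ValueError("no db2diag timestamp at start of line")
    local_time = datetime.strptime(match.group(1), "%Y-%m-%d-%H.%M.%S.%f")
    offset = timezone(timedelta(minutes=int(match.group(2))))
    return local_time.replace(tzinfo=offset)


if __name__ == "__main__":
    line = "2009-11-02-22.56.30.416270-240 I6733A457 LEVEL: Event"
    print(parse_db2diag_timestamp(line).isoformat())
```

The resulting value can then be compared with the Date/Time field of an error log entry, keeping in mind that the two logs might report time in different time zones.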
A message might be displayed in the AIX® error log showing a reboot at about the time of the db2diag log file entry shown previously. Run errpt -a to access the AIX error log. The following three scenarios are possible reasons for this occurrence:
- A user-initiated host shutdown and reboot has occurred.
To determine whether this situation occurred, look in the AIX errpt log for an entry similar to the following one:
LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6
Date/Time: Mon Nov 2 22:56:28 EST 2009
Sequence Number: 11878
Machine Id: 00057742D900
Node Id: coralpib17
Class: S
Type: TEMP
WPAR: Global
Resource Name: SYSPROC
Description
SYSTEM SHUTDOWN BY USER
Probable Causes
SYSTEM SHUTDOWN
Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
In this example, a user has initiated the reboot.
- Tivoli® SA MP/RSCT initiated the reboot.
To determine whether this situation occurred, look in the AIX errpt log for an entry similar to the following one:
LABEL: TS_CRITICAL_CLNT_ER
IDENTIFIER: 75FA8C75
Date/Time: Tue Mar 31 10:58:49 EDT 2009
Sequence Number: 358837
Machine Id: 0006DA8AD700
Node Id: coralp08
Class: S
Type: PERM
WPAR: Global
Resource Name: cthats
Description
Critical client blocked/exited
Probable Causes
Group Services daemon was blocked too long or exited
Failure Causes
Group Services daemon blocked: resource contention
Group Services daemon blocked: protocol problems
Group Services daemon exited: internal failure
Group Services daemon exited: critical client failure
Recommended Actions
Group Services daemon blocked: reduce system load
Group Services daemon exited: diagnose Group Services
Detail Data
DETECTING MODULE
rsct,monitor.C,1.124.1.3,5520
ERROR ID
6plcyp/dyWo7/lSx/p3k37....................
REFERENCE CODE
Critical client - program name
hagsd
Failure Code
BLOCKED
Action
NODE REBOOT
In this example, RSCT has rebooted the host to protect critical resources in the cluster.
- A kernel panic caused a reboot.
Reviewing a KERNEL_PANIC message in the AIX errpt log, or the message written before it, can help identify the underlying trigger of a kernel panic. If the LABEL, PANIC STRING, or Detail Data fields in the message contain MMFS_* (for example, MMFS_GENERIC or MMFS_PHOENIX), this can indicate that GPFS is the trigger. Similarly, if any of the fields contain TS_* or RSCT*, this can indicate that Tivoli SA MP is the trigger. To determine whether this situation occurred, look in the AIX errpt log for an entry similar to the following one:
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63
Date/Time: Mon May 26 08:02:03 EDT 2008
Sequence Number: 976
Machine Id: 0006DA8AD700
Node Id: coralpib08
Class: S
Type: TEMP
Resource Name: PANIC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
For further details, see the Related Reference.
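The three scenarios above can be sketched as a small classifier over saved errpt -a output. This is an illustrative sketch only, not an IBM tool: the function names are hypothetical, the input is assumed to have one field per line with each entry starting at a LABEL: line, and only the labels shown in the examples above (REBOOT_ID, TS_CRITICAL_CLNT_ER, KERNEL_PANIC) are recognized. It inspects each entry on its own and does not look at the message written before a KERNEL_PANIC entry.

```python
import re


def split_errpt_entries(report):
    """Split errpt -a text into entries; each entry is assumed to start at 'LABEL:'."""
    entries, current = [], None
    for line in report.splitlines():
        if line.startswith("LABEL:"):
            if current is not None:
                entries.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        entries.append("\n".join(current))
    return entries


def classify_entry(entry):
    """Map one errpt entry to one of the reboot scenarios described above."""
    match = re.search(r"^LABEL:\s*(\S+)", entry, re.MULTILINE)
    label = match.group(1) if match else ""
    if label == "REBOOT_ID":
        return "user-initiated reboot"
    if label == "TS_CRITICAL_CLNT_ER" and "NODE REBOOT" in entry:
        return "reboot initiated by Tivoli SA MP/RSCT"
    if label == "KERNEL_PANIC":
        # MMFS_* anywhere in the entry can indicate GPFS as the trigger.
        if re.search(r"\bMMFS_\w+", entry):
            return "kernel panic (possible GPFS trigger)"
        # TS_* or RSCT* can indicate Tivoli SA MP as the trigger.
        if re.search(r"\b(?:TS_\w+|RSCT\w*)", entry):
            return "kernel panic (possible Tivoli SA MP trigger)"
        return "kernel panic"
    return "unclassified"


if __name__ == "__main__":
    import sys

    # Usage (assumed workflow): errpt -a > errpt.txt, then pass errpt.txt here.
    with open(sys.argv[1]) as handle:
        for item in split_errpt_entries(handle.read()):
            print(item.splitlines()[0], "->", classify_entry(item))
```

A result of "unclassified" means only that the entry does not match one of the example labels; the entry should still be read in full.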
Troubleshooting
If the affected host is online, run the db2instance -list command. If the output shows that the member is in the WAITING_FOR_FAILBACK state, look for alerts in the output. Check any alerts; you might have to clear an alert before the member can fail back to its home host. If the member still does not fail back, see "A member cannot restart on the home host after a successful restart light".
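The check described above can be sketched as a scan of saved db2instance -list output. This is a minimal sketch under stated assumptions: the output is assumed to be a whitespace-separated table whose header row includes STATE and ALERT columns, and the function name members_waiting_for_failback is hypothetical. Adjust the column names if the output on your system differs.

```python
def members_waiting_for_failback(listing):
    """Return the rows of db2instance -list output whose STATE is WAITING_FOR_FAILBACK.

    Each row is returned as a dict keyed by the column names of the most
    recent header row (ASSUMED to contain both STATE and ALERT).
    """
    header = None
    waiting = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "STATE" in fields and "ALERT" in fields:
            header = fields  # a new table starts; remember its columns
            continue
        if header is None:
            continue
        row = dict(zip(header, fields))
        if row.get("STATE") == "WAITING_FOR_FAILBACK":
            waiting.append(row)
    return waiting


if __name__ == "__main__":
    import sys

    # Usage (assumed workflow): db2instance -list > instance.txt
    with open(sys.argv[1]) as handle:
        for row in members_waiting_for_failback(handle.read()):
            print(row)
```

If a returned row also reports an alert, review and clear that alert first, as described above.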