Diagnosing a host reboot with a restart light

A host was rebooted during runtime and a restart light occurred. This topic describes how to identify and diagnose the situations that might have caused the reboot.

Diagnosis

This section describes how to identify a restart light that has occurred due to a host reboot.

A message in the db2diag log file shows a restart light event, for example:
2009-11-02-22.56.30.416270-240 I6733A457   LEVEL: Event
PID     : 1093874              TID  : 1    KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 001
HOSTNAME: hostC
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
1
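The check for this message can be sketched in Python. This is a minimal sketch, not a supported tool: it assumes that each db2diag record starts with a timestamp line in the format of the example above and that the restart light event carries the same MESSAGE text as the example. The function name find_restart_light_events is hypothetical.

```python
import re

# Assumption: each db2diag record begins with a timestamp such as
# 2009-11-02-22.56.30.416270-240, as in the example entry above.
RECORD_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d+)\s", re.MULTILINE
)

# Assumption: the event uses the MESSAGE text shown in the example entry.
RESTART_LIGHT_MESSAGE = "Idle process taken over by member"


def find_restart_light_events(log_text):
    """Return (timestamp, hostname) for each record with the restart light message."""
    events = []
    starts = list(RECORD_START.finditer(log_text))
    for i, match in enumerate(starts):
        # A record runs from its timestamp line to the next timestamp line.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(log_text)
        record = log_text[match.start():end]
        if RESTART_LIGHT_MESSAGE in record:
            host = re.search(r"^HOSTNAME:\s*(\S+)", record, re.MULTILINE)
            events.append((match.group(1), host.group(1) if host else None))
    return events
```

The returned timestamps give the time window to look for in the AIX error log.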
A message in the AIX® error log might show a reboot at about the time of the db2diag log file entry shown previously. Run errpt -a to access the AIX error log. The following three scenarios are possible reasons for the reboot:
  • A user-initiated host shutdown and reboot has occurred.

    To determine whether this situation occurred, look in the AIX errpt log for an entry similar to the following one:

    LABEL:           REBOOT_ID
    IDENTIFIER:      2BFA76F6
    
    Date/Time:       Mon Nov  2 22:56:28 EST 2009
    Sequence Number: 11878
    Machine Id:      00057742D900
    Node Id:         coralpib17
    Class:           S
    Type:            TEMP
    WPAR:            Global
    Resource Name:   SYSPROC
    
    Description
    SYSTEM SHUTDOWN BY USER
    
    Probable Causes
    SYSTEM SHUTDOWN
    
    Detail Data
    USER ID
               0
    0=SOFT IPL 1=HALT 2=TIME REBOOT
               0
    TIME TO REBOOT (FOR TIMED REBOOT ONLY)
               0

    In this example, a user has initiated the reboot.

  • Tivoli® SA MP/RSCT initiated the reboot.

    To determine whether this situation occurred, look in the AIX errpt log for an entry similar to the following one:

    LABEL:           TS_CRITICAL_CLNT_ER
    IDENTIFIER:      75FA8C75
    
    Date/Time:       Tue Mar 31 10:58:49 EDT 2009
    Sequence Number: 358837
    Machine Id:      0006DA8AD700
    Node Id:         coralp08
    Class:           S
    Type:            PERM
    WPAR:            Global
    Resource Name:   cthats
    
    Description
    Critical client blocked/exited
    
    Probable Causes
    Group Services daemon was blocked too long or exited
    
    Failure Causes
    Group Services daemon blocked: resource contention
    Group Services daemon blocked: protocol problems
    Group Services daemon exited: internal failure
    Group Services daemon exited: critical client failure
    
    
            Recommended Actions
            Group Services daemon blocked: reduce system load
            Group Services daemon exited: diagnose Group Services
    
    Detail Data
    DETECTING MODULE
    rsct,monitor.C,1.124.1.3,5520
    ERROR ID
    6plcyp/dyWo7/lSx/p3k37....................
    REFERENCE CODE
    
    Critical client - program name
    hagsd
    Failure Code
    BLOCKED
    Action
    NODE REBOOT
    

    In this example, RSCT has rebooted the host to protect critical resources in the cluster.

  • A kernel panic caused a reboot.

    Reviewing the KERNEL_PANIC message in the AIX errpt log, or the message written before it, can help identify the underlying trigger of a kernel panic. If the LABEL, PANIC STRING, or Detail Data fields in the message contain MMFS_* (for example, MMFS_GENERIC or MMFS_PHOENIX), GPFS might be the trigger. Similarly, if any of the fields contain TS_* or RSCT*, Tivoli SA MP might be the trigger. To determine whether a kernel panic occurred, look in the AIX errpt log for an entry similar to the following one:

    LABEL:           KERNEL_PANIC
    IDENTIFIER:      225E3B63
    
    Date/Time:       Mon May 26 08:02:03 EDT 2008
    Sequence Number: 976
    Machine Id:      0006DA8AD700
    Node Id:         coralpib08
    Class:           S
    Type:            TEMP
    Resource Name:   PANIC
    
    Description
    SOFTWARE PROGRAM ABNORMALLY TERMINATED
    
            Recommended Actions
            PERFORM PROBLEM DETERMINATION PROCEDURES
    
    
    For further details, see the Related Reference.
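The three scenarios above can be told apart by the LABEL field of the errpt entry. The following Python sketch illustrates that check on saved errpt -a output. It is a rough sketch under assumptions: entries are split at each LABEL: line, the label-to-cause table covers only the three example labels shown in this topic, and all function names are hypothetical.

```python
import re

# Labels taken from the example errpt entries in this topic.
REBOOT_CAUSES = {
    "REBOOT_ID": "user-initiated shutdown and reboot",
    "TS_CRITICAL_CLNT_ER": "reboot initiated by Tivoli SA MP/RSCT",
    "KERNEL_PANIC": "kernel panic",
}


def split_entries(errpt_text):
    """Split errpt -a output into entries, each beginning at a LABEL: line."""
    parts = re.split(r"(?m)^(?=[ \t]*LABEL:)", errpt_text)
    return [part for part in parts if "LABEL:" in part]


def entry_label(entry):
    """Return the LABEL value of one entry, or None if it has none."""
    match = re.search(r"^[ \t]*LABEL:[ \t]*(\S+)", entry, re.MULTILINE)
    return match.group(1) if match else None


def classify_entry(entry):
    """Map an entry to one of the three reboot causes, or None if unrelated."""
    return REBOOT_CAUSES.get(entry_label(entry))


def panic_trigger_hint(*entries):
    """Suggest a possible kernel panic trigger from the given entry texts.

    Pass the KERNEL_PANIC entry and, optionally, the entry written before it.
    """
    text = "\n".join(entries)
    if re.search(r"\bMMFS_\w+", text):
        return "GPFS"
    if re.search(r"\b(?:TS_\w+|RSCT\w*)", text):
        return "Tivoli SA MP"
    return None
```

A result of None from panic_trigger_hint means only that none of the listed patterns were found; it does not rule out GPFS or Tivoli SA MP as the trigger.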

Troubleshooting

If the affected host is online, run the db2instance -list command. If the output shows that the member is in the WAITING_FOR_FAILBACK state, look for alerts in the output. Check any alerts; you might have to clear an alert before the member can fail back to its home host. If the member still does not fail back, see A member cannot restart on the home host after a successful restart light.
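
The WAITING_FOR_FAILBACK check can be sketched in Python against saved db2instance -list output. This sketch rests on an assumption about the output layout that should be verified on your system: the member table has a whitespace-separated header row that starts with ID and includes STATE and ALERT columns, and each member row has one value per column. The function name members_waiting_for_failback is hypothetical.

```python
def members_waiting_for_failback(list_output):
    """Return (member_id, alert) for each member in WAITING_FOR_FAILBACK state.

    Assumption: the member table header starts with ID and contains STATE
    and ALERT columns, with whitespace-separated values in every row.
    """
    header = None
    waiting = []
    for line in list_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ID" and "STATE" in fields:
            header = fields  # remember the column order of the member table
            continue
        # Only rows with one value per header column are treated as members.
        if header and len(fields) == len(header):
            row = dict(zip(header, fields))
            if row["STATE"] == "WAITING_FOR_FAILBACK":
                waiting.append((row["ID"], row.get("ALERT")))
    return waiting
```

Any member returned with an alert value of YES is a candidate for having an alert that must be cleared before failback.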