IBM Spectrum Scale filesystem outage with a kernel panic

A kernel panic that was triggered by IBM Spectrum Scale has occurred on a member host. The trigger recurs sporadically.

Symptoms

The output of the db2instance -list command includes a pending failback operation, as shown in the following example:
ID        TYPE             STATE                 HOME_HOST    CURRENT_HOST    ALERT   PARTITION_NUMBER   LOGICAL_PORT    NETNAME
--        ----             -----                 ---------    ------------    -----   ----------------   ------------    -------
0         MEMBER           WAITING_FOR_FAILBACK  hostA        hostB           NO                     0              1    hostB-ib0
1         MEMBER           STARTED               hostB        hostB           NO                     0              0    hostB-ib0
2         MEMBER           STARTED               hostC        hostC           NO                     0              0    hostC-ib0
128       CF               PRIMARY               hostD        hostD           NO                     -              0    hostD-ib0
129       CF               PEER                  hostE        hostE           NO                     -              0    hostE-ib0

HOSTNAME              STATE      INSTANCE_STOPPED   ALERT
--------              -----      ----------------   -----
hostA                 INACTIVE   NO                 YES
hostB                 ACTIVE     NO                 NO
hostC                 ACTIVE     NO                 NO
hostD                 ACTIVE     NO                 NO
hostE                 ACTIVE     NO                 NO
In the previous example, hostA has a state of INACTIVE, and its ALERT field is marked as YES. This output of the db2instance -list command is seen when hostA is offline or rebooting. Because hostA, the home host for member 0, is offline, member 0 has failed over to hostB. Member 0 is now waiting to fail back to its home host, as indicated by the WAITING_FOR_FAILBACK state. After hostA is rebooted from the panic, member 0 will fail back to hostA.
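
The check described above can be sketched as a small script. This is a minimal illustration, not an IBM-provided tool: it assumes that the db2instance -list output has already been captured as text, and the function names parse_member_rows and members_waiting_for_failback are hypothetical. The column order is taken from the example output shown earlier.

```python
# Sketch: find members that are waiting to fail back to their home host.
# Assumption: `output` holds text captured from `db2instance -list`,
# laid out like the example in this topic.

def parse_member_rows(output):
    """Return one dict per MEMBER or CF row; other lines are skipped."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Member and CF rows start with a numeric ID followed by the type.
        if len(fields) >= 6 and fields[0].isdigit() and fields[1] in ("MEMBER", "CF"):
            rows.append({
                "id": int(fields[0]),
                "type": fields[1],
                "state": fields[2],
                "home_host": fields[3],
                "current_host": fields[4],
                "alert": fields[5],
            })
    return rows


def members_waiting_for_failback(output):
    """Return the rows whose STATE column is WAITING_FOR_FAILBACK."""
    return [r for r in parse_member_rows(output) if r["state"] == "WAITING_FOR_FAILBACK"]


SAMPLE = """\
ID   TYPE    STATE                 HOME_HOST  CURRENT_HOST  ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
--   ----    -----                 ---------  ------------  -----  ----------------  ------------  -------
0    MEMBER  WAITING_FOR_FAILBACK  hostA      hostB         NO     0                 1             hostB-ib0
1    MEMBER  STARTED               hostB      hostB         NO     0                 0             hostB-ib0
2    MEMBER  STARTED               hostC      hostC         NO     0                 0             hostC-ib0
128  CF      PRIMARY               hostD      hostD         NO     -                 0             hostD-ib0
129  CF      PEER                  hostE      hostE         NO     -                 0             hostE-ib0

HOSTNAME  STATE     INSTANCE_STOPPED  ALERT
--------  -----     ----------------  -----
hostA     INACTIVE  NO                YES
hostB     ACTIVE    NO                NO
"""

for row in members_waiting_for_failback(SAMPLE):
    print(f"member {row['id']}: home {row['home_host']}, currently on {row['current_host']}")
# prints "member 0: home hostA, currently on hostB"
```

Splitting on whitespace is enough here because none of the columns in the example contain spaces. A different output layout would need a different parser.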

Diagnosis

When you check the db2diag log file, you can find many log entries that indicate that a restart light operation has occurred, as shown in the following example:
2009-08-27-23.37.52.416270-240 I6733A457            LEVEL: Event
PID     : 1093874              TID  : 1             KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
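
To count how often a restart light operation has occurred, the db2diag log can be scanned for the message text shown in the previous entry. The following sketch assumes that each log record begins with a timestamp line in the format shown above and that the log content has already been read into a string. The function names split_records and restart_light_events are hypothetical.

```python
import re

# Assumption: every db2diag record starts with a timestamp such as
# 2009-08-27-23.37.52.416270-240, as in the example entry in this topic.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}")
MARKER = "Idle process taken over by member"


def split_records(text):
    """Group log lines into records, one list of lines per record."""
    records, current = [], []
    for line in text.splitlines():
        if RECORD_START.match(line):
            if current:
                records.append(current)
            current = [line]
        elif current:
            current.append(line)
    if current:
        records.append(current)
    return records


def restart_light_events(text):
    """Return (timestamp, hostname) for each record whose MESSAGE has MARKER."""
    events = []
    for record in split_records(text):
        if not any(l.startswith("MESSAGE") and MARKER in l for l in record):
            continue
        timestamp = record[0].split()[0]
        hostname = ""
        for l in record:
            if l.startswith("HOSTNAME"):
                hostname = l.split(":", 1)[1].strip()
        events.append((timestamp, hostname))
    return events


SAMPLE_LOG = """\
2009-08-27-23.37.52.416270-240 I6733A457            LEVEL: Event
PID     : 1093874              TID  : 1             KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
"""

print(restart_light_events(SAMPLE_LOG))
# prints [('2009-08-27-23.37.52.416270-240', 'hostB')]
```

Many such events reported for the same guest host over a short period are consistent with the recurring failover that this topic describes.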
Another way to diagnose this type of problem is to check the system log. Run the OS command errpt -a to view the contents of the AIX® errpt system log. In the AIX errpt log, you might see log entries similar to the following example, which is for hostA:
LABEL:          KERNEL_PANIC
IDENTIFIER:     225E3B63

Date/Time:       Mon May 26 08:02:03 EDT 2008
Sequence Number: 976
Machine Id:      0006DA8AD700
Node Id:         hostA
Class:           S
Type:            TEMP
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING
5.1: xmemout succeeded rc=d

PANIC STRING
kx.C:2024:0:0:04A53FA8::advObjP == ofP->advLkObjP

If you see a KERNEL_PANIC log entry as shown in the previous example, the system reboot might be due to an operating system kernel panic that was triggered by a problem in the IBM Spectrum Scale subsystem. A kernel panic and system reboot can result when excessive processor usage or heavy paging prevents the IBM Spectrum Scale daemons from receiving enough system resources to perform critical tasks. If you experience IBM Spectrum Scale filesystem outages that are related to kernel panics, resolve the underlying processor usage or paging issues first. If you cannot resolve the underlying issues, run the db2support command for the database with the -s parameter to collect diagnostic information, and contact IBM Technical Support.
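
The errpt check described in this section can also be scripted. The following sketch assumes that the errpt -a output has been captured as text and that each entry begins with a LABEL: line, as in the example entry for hostA. The function names parse_errpt_entries and kernel_panics are hypothetical.

```python
# Sketch: pick KERNEL_PANIC entries out of captured `errpt -a` output.
# Assumption: each entry starts with a "LABEL:" line and uses the field
# names shown in the example entry in this topic.

def parse_errpt_entries(text):
    """Return one dict per entry with label, node, date, and panic string."""
    entries = []
    current = None
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("LABEL:"):
            current = {"label": stripped.split(":", 1)[1].strip()}
            entries.append(current)
        elif current is None:
            continue  # ignore anything before the first entry
        elif stripped.startswith("Node Id:"):
            current["node"] = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Date/Time:"):
            current["date"] = stripped.split(":", 1)[1].strip()
        elif stripped == "PANIC STRING":
            # The panic string is the next non-empty line.
            following = (l.strip() for l in lines[i + 1:] if l.strip())
            current["panic_string"] = next(following, "")
    return entries


def kernel_panics(text):
    """Return only the entries whose label is KERNEL_PANIC."""
    return [e for e in parse_errpt_entries(text) if e["label"] == "KERNEL_PANIC"]


SAMPLE_ERRPT = """\
LABEL:          KERNEL_PANIC
IDENTIFIER:     225E3B63

Date/Time:       Mon May 26 08:02:03 EDT 2008
Sequence Number: 976
Machine Id:      0006DA8AD700
Node Id:         hostA
Class:           S
Type:            TEMP
Resource Name:   PANIC

Detail Data
ASSERT STRING
5.1: xmemout succeeded rc=d

PANIC STRING
kx.C:2024:0:0:04A53FA8::advObjP == ofP->advLkObjP
"""

for entry in kernel_panics(SAMPLE_ERRPT):
    print(f"{entry['node']} panicked at {entry['date']}: {entry['panic_string']}")
# prints "hostA panicked at Mon May 26 08:02:03 EDT 2008: kx.C:2024:0:0:04A53FA8::advObjP == ofP->advLkObjP"
```

Comparing the Date/Time values of successive KERNEL_PANIC entries with processor and paging statistics from the same periods can help confirm whether resource pressure preceded each panic.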