Disk failure causing Db2 cluster file system failure

In this scenario, the Db2 cluster file system, which is based on GPFS, fails on one member. The failure causes all shared file systems (including db2dump) to be unmounted for that member, and GPFS stays down.

Symptoms

Note: The db2dump directory is lost only if it is on the GPFS file system. If the diagnostic data location has been set to something else (for example, by using diagpath), this GPFS failure does not affect it.
The following sample output from the db2instance -list command shows an environment with three members and two cluster caching facilities, where there has been a host alert:
db2instance -list
ID        TYPE             STATE                HOME_HOST       CURRENT_HOST    ALERT   PARTITION_NUMBER        LOGICAL_PORT    NETNAME
--        ----             -----                ---------       ------------    -----   ----------------        ------------    -------
0         MEMBER           WAITING_FOR_FAILBACK hostA           hostB           NO                     0                   1    hostB-ib0
1         MEMBER           STARTED              hostB           hostB           NO                     0                   0    hostB-ib0
2         MEMBER           STARTED              hostC           hostC           NO                     0                   0    hostC-ib0
128       CF               PRIMARY              hostD           hostD           NO                     -                   0    hostD-ib0
129       CF               PEER                 hostE           hostE           NO                     -                   0    hostE-ib0
	
HOSTNAME              STATE      INSTANCE_STOPPED ALERT
--------              -----      ---------------- -----
hostA                 ACTIVE     NO               YES
hostB                 ACTIVE     NO               NO
hostC                 ACTIVE     NO               NO
hostD                 ACTIVE     NO               NO
hostE                 ACTIVE     NO               NO
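
The two symptoms visible in this output (a host with ALERT set to YES, and a member whose CURRENT_HOST differs from its HOME_HOST) can be picked out with a small filter. The following is a minimal sketch that assumes the column layout shown above; the function names hosts_with_alert and members_off_home_host are hypothetical helpers, not part of Db2.

```shell
# Hypothetical helpers; they only parse text in the layout shown above.

# Print the names of hosts whose ALERT column is YES.
# Reads `db2instance -list` output on stdin.
hosts_with_alert() {
  awk '
    $1 == "HOSTNAME"        { in_hosts = 1; next }  # start of the host table
    in_hosts && $1 ~ /^-+$/ { next }                # skip the separator row
    in_hosts && NF >= 4 && $4 == "YES" { print $1 }
  '
}

# Print ID, STATE, HOME_HOST and CURRENT_HOST for every member that is
# not running on its home host. Reads `db2instance -list` output on stdin.
members_off_home_host() {
  awk '$2 == "MEMBER" && $4 != $5 { print $1, $3, $4, $5 }'
}

# Usage (not run here):
#   db2instance -list | hosts_with_alert
#   db2instance -list | members_off_home_host
```

With the sample output above, the first filter would report hostA and the second would report member 0, which is in WAITING_FOR_FAILBACK state on hostB.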

Diagnosis / resolution

This scenario can be identified through GPFS error messages:
  • Check the alert message by running db2cluster -cm -list -alert, for example:
    db2cluster -cm -list -alert
    The host "hostA.torolab.ibm.com" is not able to access the following file systems: 
    "/db2cfs/db2inst1/sqllib/db2dump".  
    Check the disk connections and mount the file system. 
    See the Db2 documentation for more details. 
    This alert must be cleared manually via the command 
    db2cluster -clear -alert -host hostA.torolab.ibm.com
    While the file system is offline, the db2 members on this 
    host will be in restart light mode on other systems and will 
    be WAITING_FOR_FAILBACK
    
  • Confirm that you can access the file systems in question by using the ls or cd operating system commands.
    ls /db2cfs/db2inst1/sqllib/db2dump
    cd /db2cfs/db2inst1/sqllib/db2dump
    If the file systems are inaccessible or offline, these commands will return a message indicating that the directory does not exist or is not available.
  • If the ls or cd command shows that the file system is inaccessible, confirm whether the file systems are considered mounted on the problematic host.
    • Using this example scenario, on hostA run:
      mount |grep sqllib
      If the /db2cfs/db2inst1/sqllib file system is not mounted, it is not shown in the result set.
    • To mount the file systems in question, run the command db2cluster -cfs -mount -filesystem fs_name
  • If you require further information about the failure, check the db2diag log file at the diagpath location and look for relevant messages to ascertain the problem that led to the restart light. There might be a db2diag log record corresponding to the time of the restart, for example:
    2009-08-27-23.37.52.416270-240 I6733A457            LEVEL: Event
    PID     : 1093874              TID  : 1             KTID : 2461779
    PROC    : db2star2
    INSTANCE:                      NODE : 000
    HOSTNAME: hostB
    EDUID   : 1
    FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
    MESSAGE : Idle process taken over by member
    DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
    996
    DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
    0
    
See Restart events that might occur in Db2 pureScale environments for more information about the various restart messages including the Restart Light message.
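
To locate such records without reading the whole log, the db2diag log text can be filtered for the message shown in the sample record. The following is a minimal sketch that assumes records are separated by blank lines and that the message text matches the sample above; restart_light_records is a hypothetical helper name, and the log file name in the usage comment is an assumption.

```shell
# Hypothetical helper; it only parses db2diag log text supplied on stdin.
# Prints the first line (timestamp, record ID, level) of every record
# whose text contains the restart light message from the sample above.
restart_light_records() {
  awk '
    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per blank-line-separated block
    /Idle process taken over by member/ { print $1 }
  '
}

# Usage (not run here; the file name under diagpath is an assumption):
#   restart_light_records < db2diag.log
```

The printed timestamps can then be compared with the times of entries in the system log.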
  • Another source of reference for diagnosing problems is the system log. Run the operating system command errpt -a to view the contents of the AIX® errpt system log. In this example scenario, the AIX errpt log from hostA contains MMFS_* messages (for example, MMFS_GENERIC or MMFS_PHOENIX) just before the time of the aforementioned Restart Light message, with text that is similar to the following text:
    message: "GPFS: 6027-752 Lost membership in cluster hostA. Unmounting file systems.
  • Check the disks, disk connections, and fibre channel cards. The root cause in this example scenario was faulty interconnects between the host and the SAN.
  • For IBM Technical Support to analyze the diagnostic data, obtain a db2support package by running db2support output_directory -d database_name on each member in the cluster. Follow the instructions at 'Submitting diagnostic information to IBM Technical Support for problem determination' to upload data to IBM Technical Support.
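
The mount check and the system log check from the list above can be sketched as two small filters. This is an illustration only: is_listed_as_mounted and gpfs_membership_loss are hypothetical helper names, the mount point is the one from this example scenario, and the message ID is the one quoted from the errpt log above.

```shell
# Hypothetical helpers; they only parse text supplied on stdin.

# Succeeds (exit status 0) if the mount point given as $1 appears as a
# whole field in `mount` output read from stdin, and fails otherwise.
is_listed_as_mounted() {
  awk -v mp="$1" '
    { for (i = 1; i <= NF; i++) if ($i == mp) found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Prints the lines of `errpt -a` output (stdin) that carry the GPFS
# lost-membership message ID quoted in this scenario.
gpfs_membership_loss() {
  grep -F '6027-752'
}

# Usage on the affected host (not run here):
#   mount | is_listed_as_mounted /db2cfs/db2inst1/sqllib || echo "not mounted"
#   errpt -a | gpfs_membership_loss
```

Matching the mount point as a whole field avoids a false positive from a longer path that merely contains the same string, which a plain grep for sqllib would not distinguish.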