Disk failure causing Db2 cluster file system failure
In this scenario, the Db2 cluster file system, which is based on GPFS, fails on one member. The failure causes all shared file systems (including db2dump) to be unmounted for that member, and GPFS stays down.
Symptoms
Note: The db2dump directory is lost only if it is on the GPFS file system. If the diagnostic directory has been set to another location (for example, by using diagpath), this GPFS failure does not affect it.
The following sample output from the db2instance -list command shows an environment with three members and two cluster caching facilities, where there has been a host alert:
db2instance -list
ID  TYPE   STATE                HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----                --------- ------------ ----- ---------------- ------------ -------
0   MEMBER WAITING_FOR_FAILBACK hostA     hostB        NO    0                1            hostB-ib0
1   MEMBER STARTED              hostB     hostB        NO    0                0            hostB-ib0
2   MEMBER STARTED              hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY              hostD     hostD        NO    -                0            hostD-ib0
129 CF     PEER                 hostE     hostE        NO    -                0            hostE-ib0

HOSTNAME STATE  INSTANCE_STOPPED ALERT
-------- -----  ---------------- -----
hostA    ACTIVE NO               YES
hostB    ACTIVE NO               NO
hostC    ACTIVE NO               NO
hostD    ACTIVE NO               NO
hostE    ACTIVE NO               NO
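
On a larger cluster, it can help to filter this output rather than read it by eye. The following is a minimal sketch, assuming the column layout of the sample above (STATE is the third field of a member row, and host rows have exactly four fields ending in ALERT). The function names are hypothetical and are not part of Db2.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Db2) that filter `db2instance -list`
# output read from stdin. They assume the column layout shown above.

# Print the IDs of members that are in the WAITING_FOR_FAILBACK state.
members_waiting_for_failback() {
    awk '$2 == "MEMBER" && $3 == "WAITING_FOR_FAILBACK" { print $1 }'
}

# Print the names of hosts whose ALERT column is YES.
# Host rows have exactly four fields: HOSTNAME STATE INSTANCE_STOPPED ALERT.
hosts_with_alert() {
    awk 'NF == 4 && $1 != "HOSTNAME" && $4 == "YES" { print $1 }'
}

# Usage on a cluster host:
#   db2instance -list | members_waiting_for_failback
#   db2instance -list | hosts_with_alert
```

For the sample output above, the first filter reports member 0 and the second reports hostA, which matches the symptoms of this scenario: the member is running in restart light mode on another host, and its home host carries the alert.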
Diagnosis / resolution
This scenario can be identified through GPFS error messages.
- Check the alert message by running db2cluster -cm -list -alert, for example:
db2cluster -cm -list -alert
The host "hostA.torolab.ibm.com" is not able to access the following file systems: "/db2cfs/db2inst1/sqllib/db2dump". Check the disk connections and mount the file system. See the Db2 documentation for more details.
This alert must be cleared manually via the command
db2cluster -clear -alert -host hostA.torolab.ibm.com
While the file system is offline, the db2 members on this host will be in restart light mode on other systems and will be WAITING_FOR_FAILBACK
- Confirm that you can access the file systems in question by using the ls or cd operating system commands:
ls /db2cfs/db2inst1/sqllib/db2dump
cd /db2cfs/db2inst1/sqllib/db2dump
If the file systems are inaccessible or offline, these commands return a message indicating that the directory does not exist or is not available.
- If the ls or cd commands show that the file system is inaccessible, confirm whether the file systems are considered mounted on the problematic host.
- Using this example scenario, on hostA run
mount |grep sqllib
If it is not mounted, the /db2cfs/db2inst1/sqllib file system is not shown in the result set.
- To mount the file systems in question, run the command db2cluster -cfs -mount -filesystem fs_name.
- Using this example scenario, run this command on hostA, replacing fs_name with the name of the file system that is not mounted.
- Check the db2diag log file at the diagpath location. If you require further information about the failure, look for relevant messages to ascertain the problem leading to the restart light. There might be a db2diag log record corresponding to the time of the restart, for example:
2009-08-27-23.37.52.416270-240 I6733A457          LEVEL: Event
PID     : 1093874              TID  : 1           KTID : 2461779
PROC    : db2star2
INSTANCE:                      NODE : 000
HOSTNAME: hostB
EDUID   : 1
FUNCTION: Db2, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
- Another source of reference for diagnosing problems is the system log. Run the operating system command errpt -a to view the contents of the AIX® errpt system log. In this example scenario, the AIX errpt log from hostA shows, just before the time of the aforementioned restart light message, MMFS_* messages (for example, MMFS_GENERIC or MMFS_PHOENIX) with text that is similar to the following text:
message: "GPFS: 6027-752 Lost membership in cluster hostA. Unmounting file systems."
- Check the disks, disk connections, and fibre channel cards. The root cause in this example scenario was faulty interconnects between the host and the SAN.
- For IBM Technical Support to analyze the diagnostic data, obtain a db2support package by running db2support output_directory -d database_name on each member in the cluster. Follow the instructions at 'Submitting diagnostic information to IBM Technical Support for problem determination' to upload data to IBM Technical Support.
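
The mount check and the system log check from the steps above can be sketched as small shell filters. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the mount point appears as its own whitespace-delimited field in the mount output and that the GPFS message number quoted above appears verbatim in the system log text.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Db2 or GPFS.

# Succeeds if the mount listing on stdin contains the mount point $1 as a
# whole field. Unlike `grep sqllib`, this does not match longer paths.
is_mounted() {
    awk -v target="$1" '
        { for (i = 1; i <= NF; i++) if ($i == target) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Prints the lines of a system log on stdin that carry the GPFS
# "lost membership" message number quoted in this scenario.
lost_membership_messages() {
    grep 'GPFS: 6027-752'
}

# Usage on the affected host:
#   mount | is_mounted /db2cfs/db2inst1/sqllib || echo "not mounted"
#   errpt -a | lost_membership_messages
```

Matching the mount point as a whole field avoids false positives from other paths that merely contain the string sqllib. Both helpers read from stdin, so they can also be run against saved command output.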