A member cannot restart on the home host after a successful restart light
After a host failure, the member fails over to a guest host in restart light mode successfully, but then cannot fail back, and you are unable to restart the member on the home host manually.
Why failback might not succeed
- If the host is still down
- Check for hardware failure or power loss.
- If /var is full
- Db2® cannot start on the current host and attempts to restart in restart light mode on another host.
Note: Failback might not occur in this situation because /var would still be full. Ensure that there is plenty of free space; at least 3 GB is recommended.
- The cluster file system's log files are stored in /var/adm/ras. Any core files relating to the cluster manager (if core dumping is enabled) are written to /var/ct/<domain>/run/mc/*/*. Check this path to see if core files are there.
- Old files in these paths can be cleaned up, along with any old system logs, to free space.
- Increase the disk space for the /var file system to have at least 3 GB free space.
- Start GPFS by running db2cluster -cfs -start -host <failedHost>
- Run db2cluster -cm -list -alert to list the alert
- Run db2cluster -cm -clear -alert to clear the alert
- If there is still a problem, run db2support <output directory> -d <database name> -s and contact IBM Technical Support.
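The 3 GB free-space recommendation above can be checked with a small script before restarting. This is a sketch, not a Db2 utility; has_free_kb is a hypothetical helper name, and 3 GB is expressed as 3145728 KB.

```shell
#!/bin/sh
# Sketch: verify that /var has at least 3 GB free before restarting Db2.
# has_free_kb is a hypothetical helper, not part of the Db2 tooling.
has_free_kb() {
  # $1 = path, $2 = required free space in KB
  avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$2" ]
}

if has_free_kb /var 3145728; then   # 3 GB = 3 * 1024 * 1024 KB
  echo "/var has at least 3 GB free"
else
  echo "/var is below the recommended 3 GB of free space"
fi
```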
- If the host is up, but you cannot attach to or restart the member because of a communication failure (no communication through RDMA)
- Perform the following diagnosis steps to determine if a communication failure through RDMA is the reason why the failback is not succeeding.
- Ping between the failed host and the cluster caching facility hosts.
- Run lsdev -C | grep -i <ib | roce> to verify that the RDMA components are in the Available state.
- Use ibstat -v to check the RDMA state. Verify that the port is active and that the link is up.
- Check the cfdump.out*, cfdiag*.log, and mgmnt_lwd_log files, and any core files, to see if there are failures from the cluster caching facilities not starting up. If there are failures, run db2instance -list; it will show the primary in a state other than STARTED and the secondary in a state other than PEER.
- If cfdump.out shows no initialization or object information, then it is likely that the cluster caching facility did not start successfully.
- If cfdump.out contains that information, then the cluster caching facility started successfully at some point.
- Check the physical IB or RoCE network cable connections
- Perform an RDMA ping across the cluster by running db2cluster -verify -req -rdma_ping
- If you cannot communicate, run db2support <output directory> -d <database name> -s and contact IBM Technical Support.
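The diagnosis steps above can be sketched as one script. This is a sketch under assumptions: CF_HOST is a placeholder for a cluster caching facility host name, and run_step is a hypothetical wrapper that skips tools not installed on the current host (for example, lsdev and ibstat exist only where the RDMA stack is present).

```shell
#!/bin/sh
# Sketch of the RDMA diagnosis sequence. run_step is a hypothetical
# wrapper, not a Db2 command: it runs a tool only if it is installed.
CF_HOST=${CF_HOST:-cfhost1}   # placeholder CF host name

run_step() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped: $1 not found on this host"
  fi
}

run_step ping -c 3 "$CF_HOST"                  # basic IP reachability
run_step lsdev -C | grep -Ei 'ib|roce'         # adapters Available (AIX)
run_step ibstat -v                             # port active, link up
run_step db2cluster -verify -req -rdma_ping    # RDMA-level ping
```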
- If the host is up, but the sys log shows a fibre channel card problem, a disk error, or a SAN cable connection problem (causing the connection to the GPFS disk to fail on the host)
Note: The sys log referenced here is simply the log file that contains system event information. The location of this file differs for each supported operating system. For example, on AIX systems, you can inspect the system log by running errpt; on Linux systems, run journalctl; on Windows, open the Event Viewer under Administrative Tools, expand Windows Logs, and select the System option. See Disk failure causing Db2 cluster file system failure or if GPFS does not get remounted.
- Run db2cluster -cfs -list -host -state.
- Run mount | grep mmfs to see if any results show filesystem type=mmfs.
- Check connections, cards, and disks, then restart GPFS by running db2cluster -cfs -start -host <failedHost>.
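The mount check above can be wrapped in a small helper. This is a sketch; gpfs_mounted is a hypothetical function name that scans a mount listing for the GPFS file system type (mmfs).

```shell
#!/bin/sh
# Sketch: check a mount listing for a GPFS (mmfs) entry.
# gpfs_mounted is a hypothetical helper, not part of db2cluster.
gpfs_mounted() {
  echo "$1" | grep -q 'mmfs'
}

if gpfs_mounted "$(mount)"; then
  echo "GPFS file system is mounted on this host"
else
  echo "GPFS file system is not mounted; check connections and restart GPFS"
fi
```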
Why the host fails
- To determine why the host failed:
- Check for hardware failure or power loss.
- See Diagnosing a host reboot with a restart light for steps to diagnose the host failure on hostA.
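When diagnosing a host failure, the first stop is the system log described in the note earlier in this topic. As a sketch, the per-platform tools can be summarized in a small helper; syslog_cmd is a hypothetical function mapping an operating system name (as reported by uname -s) to the log-inspection tool.

```shell
#!/bin/sh
# Sketch: map an OS name to the tool used to inspect its system log,
# per the note earlier in this topic. syslog_cmd is hypothetical.
syslog_cmd() {
  case "$1" in
    AIX)   echo "errpt" ;;        # AIX error report
    Linux) echo "journalctl" ;;   # systemd journal
    *)     echo "unknown" ;;      # e.g. Windows uses the Event Viewer GUI
  esac
}

echo "On this host, inspect the system log with: $(syslog_cmd "$(uname -s)")"
```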