PowerHA Problem Determination
Here are a few notes I've made on some of the more useful day-to-day commands for looking at problems in your PowerHA (HACMP) clusters. If you're running an older version of HACMP, such as 5.4 and below, some of the paths may differ. Most of this should be correct for 5.5 and 6.1, but you will need to do some revision and testing for 7.1.
It's normally a good idea to get the status of the cluster first and try to work out how serious the problem is. Depending on the state of the cluster and its nodes, some of these commands need to be run on an active node or, in the case of a fail-over, on the standby node:
/usr/es/sbin/cluster/utilities/clfindres - shows current status of resource groups.
/usr/es/sbin/cluster/clstat - shows cluster status and sub-state in real time, needs clinfo to be running.
These two commands show you the current status of the cluster, but you can get more information with:
lssrc -ls clstrmgrES - shows the cluster manager state
lssrc -ls topsvcs - shows heartbeat information
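If you want to check the cluster manager state from a script rather than by eye, something like the following sketch may help. It assumes the lssrc output contains a line starting with "Current state:" (for example ST_STABLE), which you should confirm on your own level of PowerHA; the function name clmgr_state is made up for this example.

```shell
# Hypothetical helper: read `lssrc -ls clstrmgrES` output on stdin and
# print the value of the "Current state:" line (assumed output format).
# Returns non-zero if no such line is found.
clmgr_state() {
    awk -F': *' '
        /^Current state:/ { print $2; found = 1; exit }
        END { if (!found) exit 1 }'
}

# Usage on a cluster node:
#   lssrc -ls clstrmgrES | clmgr_state
```

Reading from stdin keeps the helper easy to try against saved output from another node.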
If you want even more details about the cluster then the following commands should be able to help you further:
/usr/es/sbin/cluster/utilities/cltopinfo - show current topology status and some information about the cluster.
/usr/es/sbin/cluster/utilities/cldisp - shows cluster information such as monitor and retry intervals.
/usr/es/sbin/cluster/utilities/cllscf - list the network configuration of an HACMP cluster.
/usr/es/sbin/cluster/utilities/cllsif - show network interface information.
/usr/es/sbin/cluster/utilities/clRGinfo -p -t - show current RG state.
/usr/es/sbin/cluster/utilities/clRGinfo -m - shows RG monitor status.
/usr/es/sbin/cluster/utilities/clshowres - shows short resource group information.
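When a problem is in progress it can be handy to capture several of these in one go. The following is only a sketch: the function name ha_snapshot and the choice of commands are mine, the utilities path is the one used in these notes, and anything missing on your release is simply skipped.

```shell
# Hypothetical helper: run a fixed list of PowerHA status utilities from a
# given directory and print their output under simple headers.
# The default path is the one used in these notes; adjust for your release.
ha_snapshot() {
    util_dir=${1:-/usr/es/sbin/cluster/utilities}
    for cmd in cltopinfo cllsif clshowres "clRGinfo -p -t" "clRGinfo -m"; do
        name=${cmd%% *}              # command name without its flags
        echo "==== $cmd ===="
        if [ -x "$util_dir/$name" ]; then
            # $cmd is left unquoted on purpose so the flags are split out
            "$util_dir"/$cmd 2>&1
        else
            echo "skipped: $util_dir/$name not found or not executable"
        fi
    done
}

# Usage on a cluster node:
#   ha_snapshot > /tmp/ha_snapshot.$(date +%Y%m%d%H%M%S)
```

Keeping the output in a dated file gives you something to compare against once the cluster is stable again.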
Cluster Logs -
Once you have an idea of the status of the cluster you can start looking at the log files to try to determine the problem. A good place to start is the cluster.log:
/usr/es/adm/cluster.log - I tend to filter this as follows so that I can get an idea of the events that have occurred, while also looking out for 'config_too_long'.
# grep ' EVENT ' /usr/es/adm/cluster.log | more - the output should look something like this:
Jan 13 18:51:29 <node1> HACMP for AIX: EVENT COMPLETED: node_up <node1> 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: node_up_complete <node1>
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT START: start_server cluster_AS
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: start_server cluster_AS 0
Jan 13 18:51:30 <node1> HACMP for AIX: EVENT COMPLETED: node_up_complete <node1> 0
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: node_down <node1>
Jan 13 18:59:16 <node1> HACMP for AIX: EVENT START: stop_server cluster_AS
Jan 13 19:00:34 <node1> HACMP for AIX: EVENT COMPLETED: stop_server cluster_AS 0
Jan 13 19:01:04 <node1> HACMP for AIX: EVENT START: release_service_addr
From these log entries you can see which events have completed and which have failed. Generally you are looking for 'config_too_long', EVENT FAILED, or events that have not completed: every EVENT START should be followed at some point in the log by a matching EVENT COMPLETED, and a missing one indicates a problem. Once you have picked out a problem, you can use the event name and timestamp as a key to look further into this log or into hacmp.out, which lists the same events in much more detail and should give you more clues as to where to look next. Depending on what you find, the problem may be RSCT related or in the HA core. If it relates to the HA core, you can look further into logs such as the following:
/var/hacmp/log/clstrmgr.debug - clstrmgrES activity, even more detail than hacmp.out, good place to look if events are missing from other logs.
/var/hacmp/log/cspoc.log - the C-SPOC log for the node on which the command was issued. It contains time-stamped, formatted messages generated by HACMP C-SPOC commands.
/var/hacmp/clcomd/clcomd.log - cluster communication daemon log.
/var/hacmp/clverify/clverify.log - cluster verification log, good place to look to make sure the cluster was working before the problem.
/var/hacmp/adm/history/cluster.mmddyyyy - It contains time-stamped, formatted messages generated by HACMP scripts.
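Checking cluster.log by eye for starts without a matching completed gets tedious on a busy cluster, so here is a rough sketch of how that pairing could be scripted. It assumes the "EVENT START: <name> <args>" and "EVENT COMPLETED: <name> <args> <rc>" layout shown in the sample output earlier; the function name check_events is made up, and lines mentioning EVENT FAILED or config_too_long are just echoed as they are rather than parsed.

```shell
# Hypothetical helper: read cluster.log lines on stdin and report
#   - any line mentioning EVENT FAILED or config_too_long
#   - every EVENT START that has no later matching EVENT COMPLETED
check_events() {
    awk '
    /EVENT FAILED|config_too_long/ { print "PROBLEM: " $0 }
    /EVENT START: / {
        key = $0; sub(/.*EVENT START: /, "", key)
        pending[key]++
        if (!(key in seen)) { seen[key] = 1; order[++n] = key }
    }
    /EVENT COMPLETED: / {
        key = $0; sub(/.*EVENT COMPLETED: /, "", key)
        sub(/ +[0-9]+ *$/, "", key)        # drop the trailing return code
        if (pending[key] > 0) pending[key]--
    }
    END {
        # report in the order the events were first seen
        for (i = 1; i <= n; i++)
            if (pending[order[i]] > 0) print "NOT COMPLETED: " order[i]
    }'
}

# Usage:
#   check_events < /usr/es/adm/cluster.log
```

Bear in mind that an event still in progress when you run this will also show up as not completed, so check the timestamps before drawing conclusions.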
RSCT (Reliable Scalable Cluster Technology)
The RSCT logs are stored in /var/ha/log. RSCT handles site-to-site communication and heart-beating, which makes these logs a good place to look if you are having problems with the fail-over adapters or the heartbeat (HB) disks. The logs are broken up into a number of related areas:
grpsvcs.* - Group Services log; HACMP cluster communication log.
nim.topsvcs.<interface>* - Network Interface Module Topology Services log; deals with specific interface communication and login. There will be one of these for each cluster interface on each node, plus the disk/serial heartbeat devices.
nmDiag.nim.topsvcs.<interface>* - Network Module Diagnostic message; this relates only to the network devices.
topsvcs.* - Topology Services log; a summary and some further detail of all the network topology activity occurring within the cluster, a good place to look for the status of the adapters and the cluster.
topsvcs.default - Daily Topology Services log, which is run at the beginning of the day to confirm the topsvcs status.
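With several generations of each log in the directory, it helps to find the newest file in each family before you start reading. A small sketch, assuming the prefixes listed above and /var/ha/log as the directory; the function name newest_rsct_logs is made up for this example.

```shell
# Hypothetical helper: for each RSCT log family named in these notes, print
# the most recently modified file in the log directory (default /var/ha/log).
newest_rsct_logs() {
    log_dir=${1:-/var/ha/log}
    for prefix in grpsvcs topsvcs nim.topsvcs nmDiag.nim.topsvcs; do
        # ls -t sorts newest first; -d stops ls descending into directories
        newest=$(ls -dt "$log_dir/$prefix".* 2>/dev/null | head -n 1)
        echo "$prefix: ${newest:-none found}"
    done
}

# Usage:
#   newest_rsct_logs
#   newest_rsct_logs /path/to/a/copy/of/the/logs
```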