A fix is available
APAR status
Closed as program error.
Error description
**************************************************************
* USERS AFFECTED:
* Systems running the AIX 7200-00 Technology Level
* with bos.cluster.rte below the 7.2.0.2 level.
**************************************************************
* PROBLEM DESCRIPTION:
* After reboot of one node, the CAA cluster state
* may be inconsistent in a cluster using multicast
* communication mode, if there is an issue with
* multicast communication, but unicast communication
* is working.
* 'lscluster -m' of node1:
* ------------------------
* Calling node query for all nodes...
* Node query number of nodes examined: 2
*
* Node name: node1
* Cluster shorthand id for node: 1
* ...
* State of node: UP NODE_LOCAL
* ...
* Node name: node2
* Cluster shorthand id for node: 2
* ...
* State of node: DOWN
* ...
* 'lscluster -m' of node2:
* ------------------------
* Calling node query for all nodes...
* Node query number of nodes examined: 2
*
* Node name: node1
* Cluster shorthand id for node: 1
* ...
* State of node: UP
* ...
* Node name: node2
* Cluster shorthand id for node: 2
* ...
* State of node: UP NODE_LOCAL
* ...
* In the above example node2 was the last node, which
* has been rebooted.
* syslog.caa of node1 looks like:
* -------------------------------
* ...
* <timestamp> node1 caa:info unix: kcluster_lock.c
* count_active_nodes 200 num_nodes_active 2
* *up_node_cnt 1 db_node_cnt 1
* <timestamp> node1 caa:err|error unix:
* kcluster_clusterwide.c
* kcluster_clusterwide 841 clusterwide query
* node timeout: cmd = 0x20, from node id = 2
* ...
* <timestamp> node1 caa:err|error unix:
* kcluster_clusterwide.c
* kcluster_clusterwide 841 clusterwide query
* node timeout: cmd = 0x20, from node id = 2
* ...
* syslog.caa of node2 looks like:
* -------------------------------
* ...
* <timestamp> node2 caa:info unix: kcluster_syscalls.c
* _xcluster_create 2614
* Clusterwide locking services are starting.
* ...
* <timestamp> node2 caa:info unix: kcluster_lock.c
* count_active_nodes 200 num_nodes_active 2
* *up_node_cnt 0 db_node_cnt 1
* <timestamp> node2 caa:info unix: kcluster_lock.c
* wait_on_node_bringup 255 All nodes are active.
* ...
* <timestamp> node2 caa:info unix: kcluster_lock.c
* count_active_nodes 200 num_nodes_active 2
* *up_node_cnt 0 db_node_cnt 1
* <timestamp> node2 caa:info unix: kcluster_lock.c
* xcluster_lock 607 xcluster_lock: lock
* 2 acquired, num_nodes_active: 2
* <timestamp> node2 caa:info unix: kcluster_lock.c
* xcluster_lock 608 xcluster_lock: nodes
* which responded: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
* ...
* <timestamp> node2 caa:info cluster[2490836]: caa_config.c
* cl_th_sock 5317 258 Node node1
* is DOWN, and we are not trying to JOIN it or STOP it.
* Skipping.
* ...
**************************************************************
* RECOMMENDATION:
* Install APAR IV82651.
* Prior to fix availability, an interim fix is available from
* either
* ftp://aix.software.ibm.com/aix/ifixes/iv82651/
* https://aix.software.ibm.com/aix/ifixes/iv82651/
* The ifix can be installed using Live Update (LU).
* If LU is not used, installation of the ifix requires a
* reboot.
**************************************************************
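The symptom above is that the two nodes report different states for the same node. As a rough illustration only, the following Python sketch compares saved 'lscluster -m' output from each node and reports any node whose state differs between views. It assumes the output contains the "Node name:" and "State of node:" lines shown in the description; the function names and the way the output is collected are hypothetical and not part of AIX or CAA.

```python
import re

def parse_node_states(lscluster_output):
    """Map node name -> state (UP/DOWN/...) from saved 'lscluster -m' text.

    Assumes the 'Node name:' and 'State of node:' lines quoted in this APAR.
    """
    states = {}
    current = None
    for raw in lscluster_output.splitlines():
        # Tolerate the leading '*' used in the APAR text box.
        line = raw.strip().lstrip("*").strip()
        name = re.match(r"Node name:\s*(\S+)", line)
        if name:
            current = name.group(1)
            continue
        state = re.match(r"State of node:\s*(\S+)", line)
        if state and current is not None:
            # Keep only the first token, dropping flags such as NODE_LOCAL.
            states[current] = state.group(1)
    return states

def inconsistent_views(views):
    """views: {reporting node: {node: state}}.

    Returns {node: {reporting node: state}} for nodes whose state differs
    between reporting nodes.
    """
    all_nodes = set()
    for view in views.values():
        all_nodes.update(view)
    result = {}
    for node in sorted(all_nodes):
        seen = {rep: view[node] for rep, view in views.items() if node in view}
        if len(set(seen.values())) > 1:
            result[node] = seen
    return result

# Sample text modelled on the output quoted in the problem description.
NODE1_VIEW = """
Node name: node1
State of node: UP NODE_LOCAL
Node name: node2
State of node: DOWN
"""
NODE2_VIEW = """
Node name: node1
State of node: UP
Node name: node2
State of node: UP NODE_LOCAL
"""

views = {
    "node1": parse_node_states(NODE1_VIEW),
    "node2": parse_node_states(NODE2_VIEW),
}
mismatches = inconsistent_views(views)
print(mismatches)
```

For the sample above, only node2 is reported, because node1 sees it as DOWN while node2 sees itself as UP.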
Local fix
Use unicast communication mode.
Problem summary
**************************************************************
* USERS AFFECTED:
* Systems running the AIX 7200-00 Technology Level
* with bos.cluster.rte below the 7.2.0.2 level.
**************************************************************
* PROBLEM DESCRIPTION:
* After reboot of one node, the CAA cluster state
* may be inconsistent in a cluster using multicast
* communication mode, if there is an issue with
* multicast communication, but unicast communication
* is working.
* 'lscluster -m' of node1:
* ------------------------
* Calling node query for all nodes...
* Node query number of nodes examined: 2
*
* Node name: node1
* Cluster shorthand id for node: 1
* ...
* State of node: UP NODE_LOCAL
* ...
* Node name: node2
* Cluster shorthand id for node: 2
Problem conclusion
With the fix, if it is known how many nodes are heartbeating to the repository, CAA does not attempt to acquire clusterwide locks until the number of gossiping nodes equals that number.
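The condition described above can be pictured as a simple gate. The Python sketch below is illustrative only and is not the CAA kernel code; the function and parameter names are hypothetical, and treating an unknown heartbeat count as "no gating" is an assumption made for the example.

```python
def may_acquire_clusterwide_lock(heartbeating_nodes, gossiping_nodes):
    """Return True if a clusterwide lock attempt should proceed.

    heartbeating_nodes: number of nodes known to be heartbeating to the
        repository, or None if that number is not known (assumption: in
        that case the gate does not apply).
    gossiping_nodes: number of nodes currently gossiping.
    """
    if heartbeating_nodes is None:
        return True
    # Wait until the gossiping count equals the heartbeating count.
    return gossiping_nodes == heartbeating_nodes

# Two nodes heartbeat to the repository, but only one is gossiping:
# the lock attempt is deferred.
print(may_acquire_clusterwide_lock(2, 1))
# Both nodes are gossiping: the lock attempt may proceed.
print(may_acquire_clusterwide_lock(2, 2))
```

In the scenario from the error description, such a gate would keep the rebooted node from taking clusterwide locks while it still cannot exchange gossip with its peer.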
Temporary fix
*********
* HIPER *
*********
Comments
APAR Information
APAR number
IV82651
Reported component name
AIX V7.2
Reported component ID
5765CD200
Reported release
720
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Submitted date
2016-03-14
Closed date
2016-03-14
Last modified date
2016-11-09
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
AIX V7.2
Fixed component ID
5765CD200
Applicable component levels
R720 PSY U870356
UP16/05/04 I 1000
PTF to Fileset Mapping
U870356 bos.cluster.rte 7.2.0.2
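As a convenience, the fileset level can be compared against the fixed level from the mapping above. The Python sketch below is a minimal example that assumes the installed level of bos.cluster.rte has already been obtained as a dotted string (for instance from 'lslpp -L bos.cluster.rte'); the function names are hypothetical. It checks only the fileset level, not whether the system is at the AIX 7200-00 Technology Level.

```python
FIXED_LEVEL = "7.2.0.2"  # bos.cluster.rte level delivered by PTF U870356

def parse_level(level):
    """Turn a dotted level such as '7.2.0.2' into a tuple of integers."""
    return tuple(int(part) for part in level.strip().split("."))

def is_below_fixed_level(installed_level, fixed_level=FIXED_LEVEL):
    """True if the installed fileset level is lower than the fixed level."""
    return parse_level(installed_level) < parse_level(fixed_level)

for level in ("7.2.0.0", "7.2.0.1", "7.2.0.2"):
    status = "below fixed level" if is_below_fixed_level(level) else "at or above fixed level"
    print(level, status)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes with multi-digit components, for example 7.2.0.10 versus 7.2.0.2.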
Document Information
Modified date:
09 November 2016