Network and switch issues

This section describes solutions for potential network and switch issues.

Unexpected network interface failure in switched networks

Problem

Unexpected network interface failures can occur in PowerHA SystemMirror configurations that use switched networks if the networks and the switches are configured incorrectly.

Solution

Take care to configure your switches and networks correctly.

Verifying multicast communication

Problem

By default, PowerHA SystemMirror uses unicast communications for heartbeat. For cluster communication, you can optionally configure a multicast address, or have CAA select the multicast address automatically, if your network is configured to support multicast communication. If you use multicast communication, do not create a cluster until you verify that multicast packets can be sent successfully between all nodes that are part of the cluster.

Solution

To test end-to-end multicast communication for all nodes used to create the cluster on your network, run the mping command to send and receive packets between nodes.

If you are running PowerHA SystemMirror Version 7.1.1, or later, you cannot create a cluster if the mping command fails. If the mping command fails, your network is not set up correctly for multicast communication. If so, review the documentation for your switches and routers to enable multicast communication.

You can run the mping command with a specific multicast address; otherwise, the command uses a default multicast address. You must use the multicast addresses that are used for creating the cluster as input for the mping command.
Note: The mping command uses the interface that has the default route. To use the mping command for testing multicast communication on a different interface that does not have the default route, you must temporarily add a static route with the required interface to the multicast IP address.

The following example shows a success case and a failure case for the mping command, where Node A is the receiver and Node B is the sender.

Success case:
Receiver

root@nodeA:/# mping -r -R -c 5
mping version 1.1
Listening on 227.1.1.1/4098:

Replying to mping from 9.3.207.195 (nodeB.aus.stglabs.ibm.com) bytes=32 seqno=0 ttl=1
Replying to mping from 9.3.207.195 (nodeB.aus.stglabs.ibm.com) bytes=32 seqno=1 ttl=1
Replying to mping from 9.3.207.195 (nodeB.aus.stglabs.ibm.com) bytes=32 seqno=2 ttl=1
Replying to mping from 9.3.207.195 (nodeB.aus.stglabs.ibm.com) bytes=32 seqno=3 ttl=1
Replying to mping from 9.3.207.195 (nodeB.aus.stglabs.ibm.com) bytes=32 seqno=4 ttl=1

Sender

root@nodeB:/# mping -R -s -c 5
mping version 1.1
mpinging 227.1.1.1/4098 with ttl=1:

32 bytes from 9.3.207.190 (nodeA.aus.stglabs.ibm.com) seqno=0 ttl=1 time=0.985 ms
32 bytes from 9.3.207.190 (nodeA.aus.stglabs.ibm.com) seqno=1 ttl=1 time=0.958 ms
32 bytes from 9.3.207.190 (nodeA.aus.stglabs.ibm.com) seqno=2 ttl=1 time=0.998 ms
32 bytes from 9.3.207.190 (nodeA.aus.stglabs.ibm.com) seqno=3 ttl=1 time=0.863 ms
32 bytes from 9.3.207.190 (nodeA.aus.stglabs.ibm.com) seqno=4 ttl=1 time=0.903 ms

--- 227.1.1.1 mping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.863/0.941/0.998 ms

Failure case:
Receiver

root@nodeA:/# mping -r -R -c 5 -6
mping version 1.1
Listening on ff05::7F01:0101/4098:

Replying to mping from fe80::18ae:19ff:fe72:1a15 bytes=48 seqno=0 ttl=1
Replying to mping from fe80::18ae:19ff:fe72:1a15 bytes=48 seqno=1 ttl=1
Replying to mping from fe80::18ae:19ff:fe72:1a15 bytes=48 seqno=2 ttl=1
Replying to mping from fe80::18ae:19ff:fe72:1a15 bytes=48 seqno=3 ttl=1
Replying to mping from fe80::18ae:19ff:fe72:1a15 bytes=48 seqno=4 ttl=1

Sender

root@nodeB:/# mping -R -s -c 5 -6
mping version 1.1
mpinging ff05::7F01:0101/4098 with ttl=1:


--- ff05::7F01:0101 mping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.000/0.000/0.000 ms
Note: To verify a result, check only the sender side of the mping command, and note the percentage of packet loss. To verify that multicast works on a network, run the mping tests with each node acting as both the sender and the receiver. Typically, the non-verbose output provides the necessary information. The verbose output that the -v flag produces is easy to misinterpret without a good knowledge of the internals of the program. You can also check the return code from the sender side of the mping command: it returns 0 on success and 255 if an error occurs.
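
The sender-side check in the preceding note can be scripted. The following is a minimal sketch, not part of the product: the function names mping_loss_percent and mping_verdict are hypothetical, the summary line is assumed to have the format shown in the examples above, and treating any packet loss as a failure is a conservative assumption.

```shell
# Extract the packet-loss percentage from mping sender output read on stdin.
# Prints the number (for example 0 or 100); prints nothing if no summary line is found.
mping_loss_percent() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Combine the sender's return code and the loss percentage into a verdict.
# $1 = return code of the sender-side mping command, $2 = packet-loss percentage.
mping_verdict() {
    if [ "$1" -eq 0 ] && [ "$2" = "0" ]; then
        echo "multicast OK"
    else
        echo "multicast FAILED (rc=$1, loss=${2:-unknown}%)"
    fi
}

# Possible use on the sender node (not executed here):
#   out=$(mping -R -s -c 5); rc=$?
#   mping_verdict "$rc" "$(printf '%s\n' "$out" | mping_loss_percent)"
```

Run the check once in each direction, so that every node is tested as the sender.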

Cluster Aware AIX (CAA) selects a default multicast address if you do not specify one when you create the cluster. The default multicast address is formed by a logical OR of the value 228.0.0.0 with the low 24 bits of the IP address of the node. For example, if the IP address is 9.3.199.45, the default multicast address is 228.3.199.45.
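
To predict which address to pass to the mping command, the OR rule can be worked through in the shell. This is an illustrative sketch only: the function name default_multicast_addr is hypothetical, and the input is assumed to be a dotted-decimal IPv4 address without leading zeros.

```shell
# Derive the default multicast address: 228.0.0.0 OR-ed with the low 24 bits
# of the node's IPv4 address.
default_multicast_addr() {
    # Split the dotted-decimal address into its four octets.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Pack the four octets into one 32-bit integer.
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # Keep the low 24 bits and OR in 228.0.0.0.
    mc=$(( (ip & 0xFFFFFF) | (228 << 24) ))
    # Unpack the result into dotted-decimal notation.
    echo "$(( (mc >> 24) & 255 )).$(( (mc >> 16) & 255 )).$(( (mc >> 8) & 255 )).$(( mc & 255 ))"
}

default_multicast_addr 9.3.199.45    # prints 228.3.199.45
```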

Internet Protocol version 6 (IPv6) addresses are supported by PowerHA SystemMirror Version 7.1.2, or later. When IPv6 addresses are configured in the cluster, Cluster Aware AIX (CAA) activates heartbeat for the IPv6 addresses with an IPv6 multicast address. You must verify that the IPv6 connections in your environment can communicate with multicast addresses.

To verify that IPv6 multicast communications are configured correctly in your environment, run the mping command with the -6 option. With this option, the command tests IPv6 multicast communication with the default IPv6 multicast address. To test a specific IPv6 multicast address, run the mping command with the -a option followed by the address. You do not need to specify the -6 option when you use the -a option, because the mping command automatically determines the family of the address that is passed with the -a option.
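
The choice between the -6 and -a options can be summarized in a small helper that only prints the command line to run. This is a sketch under assumptions: the function name mping_command is hypothetical, and combining -a with the -r, -s, -R, and -c flags from the earlier examples is assumed to be valid.

```shell
# Print the mping invocation for one role.
# $1 = receiver or sender
# $2 = empty (default IPv4 address), -6 (default IPv6 address), or a multicast address
mping_command() {
    case $1 in
        receiver) role="-r" ;;
        sender)   role="-s" ;;
        *) echo "usage: mping_command receiver|sender [-6|ADDRESS]" >&2; return 1 ;;
    esac
    case $2 in
        "") echo "mping $role -R -c 5" ;;
        -6) echo "mping $role -R -c 5 -6" ;;
        # An explicit address carries its own family, so -6 is not added.
        *)  echo "mping $role -R -c 5 -a $2" ;;
    esac
}

mping_command receiver -6
mping_command sender ff05::7F01:0101
```

Start the receiver command on one node before you run the sender command on the other node.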

Persisting IPv6 addresses during system restart

Problem

Internet Protocol version 6 (IPv6) is designed for dynamic configuration, as is the AIX operating system. IPv6 addresses do not persist across a system reboot.

Solution

To configure IPv6 addresses after a reboot, you can manually run the autoconf6 command. Alternatively, PowerHA SystemMirror runs the autoconf6 command automatically before it starts cluster services.

To configure the autoconf6 command to run automatically for the AIX operating system, complete the following steps to change the /etc/rc.tcpip file:
  1. Uncomment the following lines to run the autoconf6 command:
    # Start up autoconf6 process
    start /usr/sbin/autoconf6
    Note: You can specify individual interfaces with the -i flag. For example:
    # Start up autoconf6 process
    start /usr/sbin/autoconf6 "" "-i en1"
  2. Uncomment the following lines to start the ndpd daemons:
    # Start up ndpd-host daemon
    start /usr/sbin/ndpd-host "$src_running"
    
    # Start up the ndpd-router daemon
    start /usr/sbin/ndpd-router "$src_running"
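
After you edit the file, you can confirm that the start lines are no longer commented out. The following sketch assumes only the line format shown in the steps above; the function name rc_start_enabled is hypothetical.

```shell
# Report whether a "start <program>" line is active (not commented out) in an rc file.
# $1 = path to the rc file, $2 = full path of the program
rc_start_enabled() {
    if grep -Eq "^[[:space:]]*start[[:space:]]+$2([[:space:]]|\$)" "$1"; then
        echo "enabled: $2"
    else
        echo "not enabled: $2"
    fi
}

# Possible use on a cluster node (not executed here):
#   for prog in /usr/sbin/autoconf6 /usr/sbin/ndpd-host /usr/sbin/ndpd-router; do
#       rc_start_enabled /etc/rc.tcpip "$prog"
#   done
```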

Cluster nodes cannot communicate

Problem

If your configuration has two or more nodes that are connected by a single network, you might experience a partitioned cluster. A partitioned cluster occurs when cluster nodes cannot communicate. In normal circumstances, a service network interface failure on a node causes the Cluster Manager to recognize and handle a swap_adapter event, where the service IP label or IP address is replaced with another IP label or IP address. Heartbeats are exchanged by way of shared disks. However, there is a chance that the node becomes isolated from the cluster. Although the Cluster Managers on other nodes are aware of the attempted swap_adapter event, they cannot communicate with the now isolated (partitioned) node because no communication path exists.

Solution

Ensure that the network is configured for no single point of failure.

Distributed SMIT causes unpredictable results

Problem

Using the AIX utility DSMIT for operations other than starting or stopping PowerHA SystemMirror cluster services can cause unpredictable results.

Solution

DSMIT manages the operation of networked IBM® System p™ processors. It includes the logic necessary to control execution of AIX commands on all networked nodes. To avoid a conflict with PowerHA SystemMirror, use DSMIT only to start and stop PowerHA SystemMirror cluster services.

Recovering from PCI hot plug NIC failure

Problem

If an unrecoverable error causes a PCI hot-replacement process to fail, the NIC might be left in an unconfigured state and the node might be left in maintenance mode. The PCI slot that holds the NIC, or the new NIC itself, might be damaged.

Solution

User intervention is required to get the node back in fully working order.

IP label for PowerHA SystemMirror disconnected from AIX interface

Problem

When you define network interfaces to the cluster configuration by entering or selecting an IP label, PowerHA SystemMirror discovers the associated AIX network interface name. PowerHA SystemMirror expects this relationship to remain unchanged. If you change the AIX network interface name after you configure and synchronize the cluster, PowerHA SystemMirror does not function correctly.

Solution

If this problem occurs, you can reset the network interface name from the SMIT PowerHA SystemMirror System Management (C-SPOC) panel.

Packets lost during data transmission

Problem

If data is intermittently lost during transmission, it is possible that the maximum transmission unit (MTU) is set to different sizes on different nodes. For example, if Node A sends 8 KB packets to Node B, which can accept only 1.5 KB packets, Node B assumes that the message is complete and the data might be lost.

Solution

Run the cluster verification utility to ensure that all of the network interface cards on all cluster nodes on the same network have the same setting for MTU size. If the MTU size is inconsistent across the network, an error is displayed, and you can determine which nodes to adjust.
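
The comparison that the verification utility makes can be illustrated with a short script. This sketch is not the utility itself: the function name mtu_consistency and the input format (one "node interface mtu" record per line, collected by whatever means you prefer) are assumptions made for the example.

```shell
# Read "node interface mtu" records on stdin and report whether all MTU values agree.
# Returns 0 when the values are consistent and 1 when they differ.
mtu_consistency() {
    awk '
        NF >= 3 {
            seen[$3]++                        # count records per MTU value
            line[NR] = $1 " " $2 " mtu=" $3   # remember each record for the report
            n = NR
        }
        END {
            distinct = 0
            for (m in seen) distinct++
            if (distinct <= 1) { print "MTU consistent"; exit 0 }
            print "MTU mismatch:"
            for (i = 1; i <= n; i++) if (i in line) print "  " line[i]
            exit 1
        }
    '
}

# Example with placeholder values:
#   printf 'nodeA en0 9000\nnodeB en0 1500\n' | mtu_consistency
```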

Troubleshooting multicast

Problem

Use the mping command to test whether your nodes can send and receive multicast packets. If the mping command fails, you must identify the problem in your network environment.

Solution

To troubleshoot multicast problems in your network, review the following guidelines:
  • Review the documentation for the switches that are used for multicast communication.
  • Disable Internet Group Management Protocol (IGMP) snooping on the switches that are used for multicast communication.
    Note: If your network infrastructure does not allow IGMP snooping to be disabled permanently, you might be able to troubleshoot problems by temporarily disabling snooping on the switches, and then adding more network components one at a time.
  • Eliminate any cascaded switches between the nodes in the cluster. In other words, have only a single switch between the nodes in the cluster.

Troubleshooting unicast

Problem

By default, PowerHA SystemMirror uses unicast socket-based communications between nodes in the cluster.

If you are having problems with unicast communications, follow general network troubleshooting procedures. For example:

  • Use the ifconfig and netstat commands to verify the IP address configuration and routing.
  • Use the ping and traceroute commands to verify that nodes and adapters can communicate.

If the problem cannot be identified, use the iptrace command to trace low-level packet activity.
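
The basic checks can be collected into a list for one peer node. The sketch below only prints the commands so that you can review them before you run them; the function name unicast_check_commands and the specific flags (-a, -rn, -c 3) are illustrative choices, not requirements of PowerHA SystemMirror.

```shell
# Print the basic unicast checks for one peer node.
# $1 = host name or IP address of the peer node (placeholder)
unicast_check_commands() {
    echo "ifconfig -a"      # IP address configuration of every interface
    echo "netstat -rn"      # routing table, numeric output
    echo "ping -c 3 $1"     # basic reachability of the peer
    echo "traceroute $1"    # path that packets take to the peer
}

unicast_check_commands nodeB
```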

Troubleshooting virtual local area networks

Problem

To troubleshoot VLAN interfaces that are defined to PowerHA SystemMirror and to detect an interface failure, treat these interfaces as interfaces that are defined on single-adapter networks.

In particular, list the network interfaces that belong to a VLAN in the ping_client_list variable in the /usr/es/sbin/cluster/etc/clinfo.rc script and run the clinfo command. Whenever a cluster event occurs, clinfo monitors and detects a failure of the listed network interfaces. Due to the nature of virtual local area networks, other mechanisms to detect the failure of network interfaces are not effective.
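
A fragment of the script might look like the following sketch. The addresses are placeholders, and the uppercase spelling of the variable is an assumption about the shipped script; use the spelling that appears in your copy of clinfo.rc.

```shell
# Fragment of /usr/es/sbin/cluster/etc/clinfo.rc (placeholder addresses).
# List the addresses of the interfaces that belong to the VLAN, separated by spaces.
PING_CLIENT_LIST="192.0.2.11 192.0.2.12"
```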