IBM Support

EtherChannel failover will not fail back to primary in certain environments

Troubleshooting


Problem

The EtherChannel can fail over to the secondary (backup) adapter, but it will not fail back to the primary.

Symptom

In certain network environments, such as those using Cisco ACI SDN switches,
a Network Interface Backup (NIB) EtherChannel may be unable to recover to the
primary adapter, even when the failback is forced.

Cause

The issue is related to how the ACI fabric handles broadcast traffic, and ARP broadcast requests in particular. By default, the fabric does not flood ARP requests to all bridge domain members. Instead, it handles ARP broadcasts as unicast packets and sends them only to the correct endpoint. ACI does this to reduce the overhead of broadcast traffic across the fabric. In this case, however, ARP is being used only to generate return packets as a test of network availability, so this behavior can prevent the failback from completing: because its request is never returned as expected, the interface being failed back to continues to see an inactive connection. This is also why failover occurs when the cable is pulled, yet no traffic is seen on the NIC while it is not active.
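As a rough illustration of the forwarding behavior described above, the following Python sketch models a bridge domain as a set of ports and shows which ports see an ARP request in each mode. The function name, the port labels, and the simplified forwarding rule are assumptions made for illustration only; this is not ACI code.

```python
def deliver_arp_request(ports, ingress, target_port, arp_flooding):
    """Return the set of ports that see an ARP request sent from `ingress`.

    Toy model: `target_port` is where the fabric has learned the endpoint
    that the ARP request is asking about.
    """
    if arp_flooding:
        # Flooded: every member port except the sender's own sees the frame.
        return {p for p in ports if p != ingress}
    # Not flooded: the request is forwarded only toward the learned endpoint.
    return {target_port} if target_port != ingress else set()


ports = {"primary", "backup", "gateway"}

# The backup adapter is active and sends an ARP request for the gateway.
print(sorted(deliver_arp_request(ports, "backup", "gateway", arp_flooding=False)))
# ['gateway']
print(sorted(deliver_arp_request(ports, "backup", "gateway", arp_flooding=True)))
# ['gateway', 'primary']
```

In this model the primary port sees the request only when flooding is enabled, which matches the symptom of an idle primary port receiving no packets at all.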

Diagnosing The Problem

The unique aspect of AIX NIB is the way a failback to the primary (main) adapter
is done. A failback to primary remains pending, and is not completed, until the driver
receives at least one packet on the inactive primary adapter port. To ensure this, the
active backup port sends out ARP broadcast packets, although any packet arriving on the
primary adapter port is sufficient to complete the transition back to an active primary.
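The pending-failback rule can be pictured as a small state machine. The sketch below is a toy model under stated assumptions: the class name, method names, and port labels are hypothetical, and it does not represent the actual AIX driver implementation.

```python
class NibChannel:
    """Toy model of NIB failover and failback; not the AIX driver."""

    def __init__(self):
        self.active = "primary"
        self.failback_pending = False

    def primary_link_down(self):
        # Losing the primary link triggers an immediate failover.
        self.active = "backup"
        self.failback_pending = False

    def primary_link_up(self):
        # The primary link is back, but the failback only becomes pending.
        if self.active == "backup":
            self.failback_pending = True

    def packet_received(self, port):
        # Only a packet seen on the primary port completes the failback.
        if port == "primary" and self.failback_pending:
            self.active = "primary"
            self.failback_pending = False


chan = NibChannel()
chan.primary_link_down()
chan.primary_link_up()
chan.packet_received("backup")
print(chan.active, chan.failback_pending)   # backup True
chan.packet_received("primary")
print(chan.active, chan.failback_pending)   # primary False
```

If no packet ever reaches the primary port, the model stays in the pending state indefinitely.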

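The problem in the customer environment is that while the backup adapter is active, the
primary adapter does not receive a single packet, not even broadcast or stray gateway
control packets. The failback to primary therefore stays pending indefinitely. Notice in
the before and after statistics below that the packet counters for ent0 do not change
(transmit is the left column, receive is the right). Here ent0 is the primary adapter,
ent4 is the backup, and ent8 is the EtherChannel device the statistics were collected on.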

entstat_ent8.before:ETHERNET STATISTICS (ent8) :
entstat_ent8.before:Packets: 203379057                           Packets: 260198994
entstat_ent8.before:Packets Dropped: 0                            Packets Dropped: 1
entstat_ent8.before:ETHERNET STATISTICS (ent0) :
entstat_ent8.before:Packets: 197435248                           Packets: 252710262 <<<<
entstat_ent8.before:Packets Dropped: 0                            Packets Dropped: 0
entstat_ent8.before:ETHERNET STATISTICS (ent4) :
entstat_ent8.before:Packets: 5943819                              Packets: 7488742
entstat_ent8.before:Packets Dropped: 0                            Packets Dropped: 1

entstat_ent8.after:ETHERNET STATISTICS (ent8) :
entstat_ent8.after:Packets: 203606270                           Packets: 260476841
entstat_ent8.after:Packets Dropped: 0                            Packets Dropped: 1
entstat_ent8.after:ETHERNET STATISTICS (ent0) :
entstat_ent8.after:Packets: 197435248                           Packets: 252710262 <<<< no change!
entstat_ent8.after:Packets Dropped: 0                            Packets Dropped: 0
entstat_ent8.after:ETHERNET STATISTICS (ent4) :
entstat_ent8.after:Packets: 6171048                              Packets: 7766607
entstat_ent8.after:Packets Dropped: 0                            Packets Dropped: 1
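The comparison above can also be done programmatically. The following Python sketch parses two captures in the layout shown and lists the adapters whose receive packet counter did not change. The function names are hypothetical, and the sketch assumes the standard entstat layout with transmit counters in the left column and receive counters in the right.

```python
import re

HEADER = re.compile(r"ETHERNET STATISTICS \((\w+)\)")
PACKETS = re.compile(r"Packets: (\d+)\s+Packets: (\d+)")


def parse_counters(text):
    """Map adapter name -> (transmit packets, receive packets)."""
    counters = {}
    adapter = None
    for line in text.splitlines():
        match = HEADER.search(line)
        if match:
            adapter = match.group(1)
            continue
        match = PACKETS.search(line)
        # Keep only the first "Packets:" line after each adapter header.
        if match and adapter is not None and adapter not in counters:
            counters[adapter] = (int(match.group(1)), int(match.group(2)))
    return counters


def silent_adapters(before, after):
    """Adapters whose receive packet count is identical in both captures."""
    b, a = parse_counters(before), parse_counters(after)
    return sorted(name for name in b if name in a and a[name][1] == b[name][1])


BEFORE = """\
entstat_ent8.before:ETHERNET STATISTICS (ent8) :
entstat_ent8.before:Packets: 203379057      Packets: 260198994
entstat_ent8.before:ETHERNET STATISTICS (ent0) :
entstat_ent8.before:Packets: 197435248      Packets: 252710262
entstat_ent8.before:ETHERNET STATISTICS (ent4) :
entstat_ent8.before:Packets: 5943819        Packets: 7488742
"""

AFTER = """\
entstat_ent8.after:ETHERNET STATISTICS (ent8) :
entstat_ent8.after:Packets: 203606270      Packets: 260476841
entstat_ent8.after:ETHERNET STATISTICS (ent0) :
entstat_ent8.after:Packets: 197435248      Packets: 252710262
entstat_ent8.after:ETHERNET STATISTICS (ent4) :
entstat_ent8.after:Packets: 6171048        Packets: 7766607
"""

print(silent_adapters(BEFORE, AFTER))  # ['ent0']
```

An adapter reported here while the link is up and the other adapter is active is a sign that no traffic, including ARP broadcasts, is reaching that port.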

Resolving The Problem

To allow the failback to complete, ask the customer to enable unknown unicast flooding and ARP flooding for this bridge domain.
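For network teams that manage ACI through the APIC REST API, the change might be expressed as a payload along the following lines. This is a sketch under assumptions, not a verified procedure: it assumes the bridge domain object class `fvBD` with the attributes `arpFlood` and `unkMacUcastAct`, the tenant and bridge domain names are hypothetical placeholders, and the exact attributes should be checked against the documentation for the APIC version in use. The code only builds the request path and body; it does not contact any controller.

```python
import json


def bridge_domain_flood_payload(tenant, bd_name):
    """Build a request path and body that enable flooding on a bridge domain.

    Assumes the APIC object model names fvBD, arpFlood and unkMacUcastAct;
    verify these against your APIC version before use.
    """
    dn = f"uni/tn-{tenant}/BD-{bd_name}"
    payload = {
        "fvBD": {
            "attributes": {
                "dn": dn,
                "arpFlood": "yes",           # flood ARP requests in the bridge domain
                "unkMacUcastAct": "flood",   # flood unknown unicast frames
            }
        }
    }
    return f"/api/mo/{dn}.json", payload


# "ExampleTenant" and "ExampleBD" are placeholder names.
path, body = bridge_domain_flood_payload("ExampleTenant", "ExampleBD")
print(path)
print(json.dumps(body, indent=2))
```

The same two settings can typically be changed in the APIC GUI on the bridge domain itself, which may be the simpler route for a one-time change.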


Document Information

Modified date:
15 September 2021

UID

isg3T1026143