Topic
  • 4 replies
  • Latest Post - 2010-04-26T17:12:53Z by Gian_Samuele
gsamuele
gsamuele
1 Post

Pinned topic Problems Bonding - ibmveth LPAR Linux P550

2010-03-30T12:58:21Z
Hi everybody,

I'm writing because I have a problem with some Linux LPARs on a P550 system.

I have 2 P550 servers with one VIO Server each and 2 Virtual Switches: Ethernet1 and Ethernet2. Each P550 server is connected to two physical Cisco switches (SWC1 and SWC2) through 6 ports: 3 ports in a port channel on SWC1 and the other 3 ports in a port channel on SWC2. The VIO Server on each P550 server has the following network configuration (a rough sketch of the VIOS commands follows the list):

1.- Two Virtual Ethernet Adapters: one connected to the Virtual Switch Ethernet1 (VEA1) and the other connected to the Virtual Switch Ethernet2 (VEA2).
2.- One Link Aggregation (LNAGG1) over the 3 physical ports attached to the physical Cisco switch SWC1.
3.- One Link Aggregation (LNAGG2) over the 3 physical ports attached to the physical Cisco switch SWC2.
4.- Two SEAs: one bridging VEA1 and LNAGG1 (SEA1) and the other bridging VEA2 and LNAGG2 (SEA2).
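
For reference, the SEAs were created on each VIO Server with commands roughly like these (the entX device names and the PVID below are only illustrative, not the real ones on my systems; the lnagg mode depends on whether the Cisco side is a static EtherChannel or LACP):

$ mkvdev -lnagg ent0,ent1,ent2 -attr mode=8023ad    # LNAGG1 -> 3 ports on SWC1 (mode=standard for a static EtherChannel)
$ mkvdev -lnagg ent3,ent4,ent5 -attr mode=8023ad    # LNAGG2 -> 3 ports on SWC2
$ mkvdev -sea ent9 -vadapter ent7 -default ent7 -defaultid 1     # SEA1 = LNAGG1 (ent9) + VEA1 (ent7)
$ mkvdev -sea ent10 -vadapter ent8 -default ent8 -defaultid 1    # SEA2 = LNAGG2 (ent10) + VEA2 (ent8)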

Each Linux LPAR has two virtual ethernet adapters: one connected to the Virtual Switch Ethernet1 and the other connected to the Virtual Switch Ethernet2. These two adapters are configured in an active/backup bond (mode 1), using ARP monitoring (arping the default gateway) as the link-monitoring mechanism. The LPARs run Red Hat Enterprise Linux 5, and I am experiencing the following problem:
when I have more than one Linux LPAR on the same VLAN and I shut down the physical Cisco switch (e.g. SWC1) that carries the active interface of the bond, bonding does not detect the failure and I can no longer reach any of the systems on that VLAN. I also noticed that if I bring the failed interface down manually (e.g. ifconfig eth0 down) on just one of the Linux LPARs, the bond switches to the backup interface on all the systems belonging to that VLAN, and I'm back in business with normal network connectivity, using the backup interface as the active one.

I mention this because I also have LPARs with AIX 6.1 on the same systems, and those work very well with this kind of configuration, so I suspect the problem could be related to the ibmveth module on the Linux systems.

I'm aware that this solution is not recommended, but I followed the documentation on the IBM developerWorks wiki (http://www.ibm.com/developerworks/wikis/display/LinuxP/Bonding+configuration), and my configuration matches it (more or less). The only difference I noticed is that the documentation specifies version 1.05 of the ibmveth module, while my distribution, even with the latest kernel, only provides ibmveth version 1.03. I tried to find the version mentioned in the documentation without success, which is why I'm posting on this forum.
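
In case it helps, the bonding configuration on the LPARs currently looks roughly like this (the addresses and the ARP interval below are illustrative placeholders, not my real values):

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 arp_interval=1000 arp_ip_target=192.168.1.1

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none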

If somebody could provide me with any help on this problem, I'd really appreciate it... thanks in advance!

Best regards,
Gianfranco Samuele
Euroinformatica s.r.l.
Abruzzi - Italy
P.S.: because of your name, I was tempted to write in Spanish to explain the situation more clearly, but I hope you understand my problem even though I wrote in English :P ... Thanks!!
Updated on 2010-04-26T17:12:53Z by Gian_Samuele
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: Problems Bonding - ibmveth LPAR Linux P550

    2010-04-02T19:14:41Z
    when I have more than one Linux LPAR on the same VLAN and I shut down the physical Cisco switch (e.g. SWC1) that carries the active interface of the bond, bonding does not detect the failure and I can no longer reach any of the systems on that VLAN. I also noticed that if I bring the failed interface down manually (e.g. ifconfig eth0 down) on just one of the Linux LPARs, the bond switches to the backup interface on all the systems belonging to that VLAN, and I'm back in business with normal network connectivity, using the backup interface as the active one.

    This suggests to me that the two instances of bonding (both running the ARP monitor) are tricking one another into believing that the arp_ip_target is reachable, when in fact it is not. This is a relatively common problem when running multiple instances of the ARP monitor behind a common choke point (the virtual switches, in your case).

    The reason this happens is that, by default, the ARP monitor only checks that the slaves have sent and received traffic (any traffic). Validating that traffic is to or from a particular bond requires enabling the "arp_validate" option to bonding, e.g., "arp_validate=all" in the options (the same place that the mode, et al, are set, most likely in the ifcfg-bond0 file, but perhaps elsewhere depending on your configuration and distro).
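
    A minimal sketch of what that looks like on RHEL 5 (bond0, the interval and the gateway address are placeholders; depending on the initscripts version the options go either in BONDING_OPTS in ifcfg-bond0 or on the options line in /etc/modprobe.conf):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=<gateway-ip> arp_validate=all"

    # or, equivalently, in /etc/modprobe.conf
    options bond0 mode=active-backup arp_interval=1000 arp_ip_target=<gateway-ip> arp_validate=all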

    Now, there's a catch. Two, actually.

    First, your distro might not be recent enough to have arp_validate. I know that RHEL 5 U2 does (I just checked the kernel), but I don't have an older kernel handy to see when it was added.

    Second, assuming that your kernel has arp_validate, if you are configuring VLANs on the linux host atop the bond, arp_validate doesn't work for VLAN destinations until very recent kernels (RHEL 5 U5). If you're hiding the VLANs within the switches, this doesn't matter; this only matters if you're configuring the VLANs atop bonding.

    -J
  • Gian_Samuele
    Gian_Samuele
    3 Posts

    Re: Problems Bonding - ibmveth LPAR Linux P550

    2010-04-07T09:21:18Z
    Hi Jay,

    Thanks for your reply. I'll try the option arp_validate=all in the bonding module and
    I'll let you know how it goes.

    I don't think I'll have any problem with the kernel, since I upgraded it to one of the
    latest versions available in the Red Hat Network for RHEL 5 (2.6.18-164.9.1.el5),
    so the arp_validate option should be fully supported.
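
    Just as a sanity check, I suppose I can confirm the option is there on this kernel with something like (I believe both of these work on RHEL 5):

    $ modinfo bonding | grep -i arp_validate
    $ cat /sys/class/net/bond0/bonding/arp_validate    # once bond0 is up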

    Best regards,
    Gianfranco Samuele
  • Gian_Samuele
    Gian_Samuele
    3 Posts

    Re: Problems Bonding - ibmveth LPAR Linux P550

    2010-04-07T09:28:11Z
    Hi everybody,

    There's a new stable version of the kernel available (2.6.18-194.el5 - 03/17/2010) which resolves some
    issues with bonding:

    net niu: fix deadlock when using bonding (Andy Gospodarek) 547943
    net bonding: fix alb mode locking regression (Andy Gospodarek) 533496
    net fixup problems with vlans and bonding (Andy Gospodarek) 526976
    net bonding: allow arp_ip_targets on separate vlan from bond device (Andy Gospodarek) 526976

    I'll give it a try and let you know if it helps with my problem...
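
    For anyone following along, the upgrade itself is just the usual RHN route (standard commands, nothing specific to this problem):

    $ yum update kernel
    $ shutdown -r now
    $ uname -r    # should report 2.6.18-194.el5 afterwards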

    Best regards,
    Gianfranco Samuele
  • Gian_Samuele
    Gian_Samuele
    3 Posts

    Re: Problems Bonding - ibmveth LPAR Linux P550

    2010-04-26T17:12:53Z
    Hi everybody,

    It works!!!

    I upgraded the kernel to the latest stable version (2.6.18-194.el5 - 03/17/2010)
    and used the option arp_validate=all in the bonding configuration, and now the bond
    switches to the backup NIC when the active one goes down, even with all the other
    servers on the same VLAN powered on.
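
    For anyone who wants to check the failover themselves, I simply watched the bond state while the switch was powered off (standard bonding /proc interface):

    $ watch -n1 'grep "Currently Active Slave" /proc/net/bonding/bond0'
    # power off SWC1 (or pull its uplinks) and the active slave should flip from eth0 to eth1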

    Thanks Jay for your help and tips...
    Gianfranco Samuele