Details Matter, episode #349523, Part Deux
By David Ross
In the previous entry, I began detailing the odyssey of getting a wireless network up and running for our hands-on lab systems. We made it to Saturday, and our connections were still not stable.
When I arrived in the lab room Saturday morning, I found that a rogue SSID had appeared on the network. It could well have been there on Friday; I may simply not have refreshed my network view to see it. The networking team was fighting a fire of their own elsewhere, so I was left with one other proctor to figure out what was going on. We went one row at a time, turning off the wireless radios and rescanning the network. Of course, we were about 90% of the way through the process before we shut off the offending radio, on some laptop or laptops, and knew which row had the bad SSID configuration.
So, we did what we could and waited. Around lunchtime, the network team finally was free and they began their investigations from their end while we tested in the room. They were looking at MAC addresses and behavior, making changes, and so on. Around 2:00, I told them that we simply had to have the network stable and working, or they could bring “15 miles of wire and 24 switches”. They made further changes, and about 2:30 they reset the radios and configurations on the individual access points (IAPs).
About 3:30 PM, we were still not stable, and we were running out of time. Some laptops were connected and staying connected, and others were not. So the network team (who were awesome, by the way) returned to the room. Joe managed to calm me down, and we discussed what was happening. He explained that we had laptops in the room that had connected and stayed actively connected for over 2 hours. Other laptops in the room they had watched drop and reconnect every five to seven minutes. Joe assured me we had a very clean, optimal network defined, and that we had no radio-to-radio interference and no discernible noise sources; in other words, the network configuration was NOT the issue.
Some of the things we investigated include:
Drivers? All laptops were built from the same original base OS master, so if it was a driver issue, it would affect ALL or NONE of the laptops, in theory.
Chipsets? We chose 15 laptops around the room, and all were using identical network chips, so the odds were low that we had many different radio chips.
Location? If the problem laptops were all in one area, it would be a radio or IAP issue, but instead, it was distributed across the room.
Testing before Las Vegas? I had used some of the laptops on my home 802.11N network for days, downloading code and such and never had a dropped connection. We had connected several masters to the IBM public network in an IBM facility with no problems. One proctor had taken one to a McDonald's to test VPN connectivity and had no issues.
How much bandwidth was being used? Not much, as we were only doing HTTP and remote desktop calls and not much else.
When did drops occur? Generally, it would happen after the machine was idle for a time, but I had also experienced dropped connections in the middle of executing scripts that involved a small download of about 5MB to update some files.
If dropped, did the laptop reconnect? Most laptops attempted to reconnect, but many would time out before doing so.
In other words, I had no single thing I could point to as the issue, which was the cause of my frustration. We did find that there was a 'feature' of the IAP that would drop a mobile connection when its throughput fell to a certain threshold, and it was possible that it was dropping non-mobile connections as well. The network team disabled this feature, just to be safe, but we were still dropping.
Enter another smart person, Chris. Since we had determined that the network was not the issue, and the hardware was identical, he began asking other questions. Did we have the latest Linux driver? Was it stable? We began investigating, and found that our particular driver did indeed have some possible issues with 802.11N networks. In fact, one characteristic was that in some cases, the radio began 'hopping' from IAP to IAP. This would not have shown at home, since I only had one IAP. In the office, there is typically only one IAP in range, so once a connection is established, the laptop never sees another. Ditto McDonald's. At the venue, however, we were within range of anywhere from 6 to over a dozen IAPs, all broadcasting the same SSID, and it was possible that some of the laptops were hopping (why some did not remains a mystery to me). Furthermore, this behavior would only appear at a venue like ours, with many IAPs in range, and even then you would only notice it if you were looking for it.
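For anyone who wants to check for this kind of hopping, here is a minimal Python sketch of the idea. It is not what we ran in the lab. It assumes you have already sampled which access point (BSSID) a laptop is associated with at regular intervals, for example by polling a tool such as `iw dev <interface> link` on Linux. The function name `find_roams`, the sample format, and the BSSID values are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# One sample: (seconds since start, BSSID of the associated access point).
# A BSSID of None means the laptop was not associated at that moment.
Sample = Tuple[float, Optional[str]]
Roam = Tuple[float, str, str]


def find_roams(samples: Iterable[Sample]) -> List[Roam]:
    """Return (time, old_bssid, new_bssid) for every hop between access points."""
    roams: List[Roam] = []
    last_bssid: Optional[str] = None
    for timestamp, bssid in samples:
        if bssid is None:
            # A dropped connection is not a hop; remember the last known AP.
            continue
        if last_bssid is not None and bssid != last_bssid:
            roams.append((timestamp, last_bssid, bssid))
        last_bssid = bssid
    return roams


# Hypothetical samples taken once a minute from a single laptop.
example_samples: List[Sample] = [
    (0, "aa:aa:aa:aa:aa:01"),
    (60, "aa:aa:aa:aa:aa:01"),
    (120, None),
    (180, "aa:aa:aa:aa:aa:02"),
    (240, "aa:aa:aa:aa:aa:01"),
]
example_roams = find_roams(example_samples)
```

A laptop that stays on one IAP produces an empty list. A laptop that hops shows an entry each time the BSSID changes, which would at least separate "dropping" from "hopping" when many IAPs share one SSID.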
So, we came up with a plan. I told Joe that I would prefer a stable 2 Mbps over a flaky 300 Mbps. We switched the lab network to 802.11A since we did not have high bandwidth requirements, disabled 802.11N in the kernel module, and waited. This took some time to switch and test, and about 7:30 PM we had confirmed that every laptop was connecting, and most had stayed connected for at least 30 minutes. Those that did drop were able to reconnect almost immediately. While not completely satisfied, I believed it was usable for our purposes, and indeed it was. We survived the week with the occasional dropped connection, but overall, it was satisfactory.
One last note: we had some older Windows XP virtual images that connected to remote systems either with a VPN or a telnet session (for mainframe-based labs). These would drop and reconnect multiple times, while the host laptop would stay connected. Replacing the images with a newer version of Windows resolved the problem; the root cause was probably that the old images were back-level on service packs or drivers.
The point? Test all you can, but understand that some environments are impossible to predict or test until you have the actual hardware in place. Try to plan for all contingencies – the 802.11N issue, for example, is something I probably should have looked into before we finalized our master image. In the meantime, I have added to my understanding with lessons learned for next year.
David Ross is a senior Technical Enablement Specialist with IBM Corporation in the Tivoli Cloud Enablement group. David specializes in Tivoli Service Automation Manager and IBM SmartCloud Provisioning. He joined IBM in 2000.