This week I started working on a new case for a customer. I'm trying to diagnose repeated error messages logged by an IBM SVC cluster that indicate problems communicating with the back-end storage being virtualized by the SVC. These messages generally point to SAN congestion. The customer has Cisco MDS 9513 switches installed. They're older switches, but not all that uncommon. What is uncommon is finding the switches at NX-OS version 5.X.X. I see downlevel firmware regularly, but this one is particularly egregious: this revision is several years out of date. Later versions of code contain numerous bug fixes, both from Cisco and from the associated upstream Linux security updates that get incorporated into NX-OS. Also, while NX-OS versions don't officially go out of support, any new bugs identified won't be fixed, as this version is no longer being actively developed.
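As context for how far behind a 5.X.X release is, NX-OS versions follow a major.minor(maintenance) pattern, so a downlevel check is a simple numeric comparison. Here's a minimal sketch; the version strings are illustrative, and the letter suffixes some releases carry (e.g. "8a") are deliberately ignored for simplicity:

```python
import re

def parse_nxos_version(version):
    """Parse an NX-OS version string like '6.2(13)' into comparable
    numeric parts (major, minor, maintenance). Any letter suffix inside
    the parentheses, as in '5.2(8a)', is ignored in this sketch."""
    match = re.match(r"(\d+)\.(\d+)\((\d+)", version)
    if not match:
        raise ValueError(f"unrecognized NX-OS version string: {version}")
    return tuple(int(part) for part in match.groups())

# Hypothetical comparison: a downlevel release vs. the revision the
# post mentions as introducing better diagnostic data.
running = "5.2(8)"
baseline = "6.2(13)"
if parse_nxos_version(running) < parse_nxos_version(baseline):
    print(f"{running} predates {baseline}: upgrade recommended")
```

This is only a version-string comparison, not an upgrade-path check; actual NX-OS upgrades between major releases have hardware and stepping-stone requirements that a numeric compare can't capture.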
This level of firmware merits further investigation. Looking deeper into the switches, I found this partial module list:
Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
6    48     1/2/4 Gbps FC Module                DS-X9148           ok
7    48     1/2/4 Gbps FC Module                DS-X9148           ok
8    48     1/2/4 Gbps FC Module                DS-X9148           ok
9    48     1/2/4 Gbps FC Module                DS-X9148           ok
These modules are older than the firmware on the switches, and support for them ended three years ago. If this customer has a problem with them (or with the switches they are installed in) and the problem is traced back to the modules, there is not much IBM Support can do. Worse, if a problem is traced to a bug in the firmware, the customer can't upgrade to something more current because these old, unsupported modules are still in the switches. This limits IBM's ability to provide support: the hardware is no longer supported, and much of the diagnostic data we would want to look at was not introduced until the next major revision of NX-OS, v6.2(13). That revision also added options and improvements to lower thresholds and timeout values, increasing the frequency of some logging for performance issues.
I could see several 2 Gb devices attached to these modules, which is probably why they are still installed. I could also see some of these slow devices zoned to the SVC, which is connected to the SAN at 8 Gbps. This violates the best practice of not zoning devices together when their port speeds differ by more than 2x: a 2 Gb device should not be zoned to an 8 Gb one, a 4 Gb device should not be zoned to a 16 Gb one, and so on. The slow device will turn into a slow-drain device sooner rather than later. I suspect this is the customer's problem, but I can't confirm it because of the lack of data due to the age of the hardware and firmware.
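The 2x rule above is easy to express as a check over a zone's member speeds. Here's a minimal sketch; the zone name, device names, and speeds are all hypothetical examples, not the customer's actual configuration:

```python
def speed_mismatch(speed_a_gbps, speed_b_gbps, max_ratio=2.0):
    """Return True if two zoned ports violate the 2x speed guideline,
    i.e. the faster link is more than max_ratio times the slower one."""
    fast = max(speed_a_gbps, speed_b_gbps)
    slow = min(speed_a_gbps, speed_b_gbps)
    return fast / slow > max_ratio

# Hypothetical zone: an SVC node port at 8 Gbps zoned with three hosts.
zone = [("svc_node1_p1", 8), ("host_a", 2), ("host_b", 4), ("host_c", 8)]

svc_name, svc_speed = zone[0]
for name, speed in zone[1:]:
    # host_a at 2 Gb exceeds the 2x ratio against 8 Gb; host_b at 4 Gb
    # is exactly 2x, which the guideline still permits.
    if speed_mismatch(svc_speed, speed):
        print(f"WARNING: {name} at {speed} Gb zoned to "
              f"{svc_name} at {svc_speed} Gb (slow-drain candidate)")
```

In practice you'd pull the negotiated port speeds and zone membership from the fabric rather than hard-coding them, but the ratio test itself is this simple.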
The recommendations I gave this customer:
1. Move the applications on those slow servers to servers with a 4 Gb or (ideally) 8 Gb connection to the SAN on the newer modules in the switches. This will allow those modules to be decommissioned and moves the environment toward a best-practice configuration.
2. Decommission those old modules, and upgrade them if the port density is needed. This will allow firmware upgrades, which are beneficial for all the reasons noted above.
3. Start planning a refresh of the switches themselves. While the switch chassis will be supported for some time yet, they have already been end-of-life for a few years.