Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) They are hardly SAN defects and 2) They need to be solved as quickly as possible, because they really have an impact. If just a switch crashed or an ISL dropped dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems on the other hand can have a high impact on your applications across the whole data center without a concerning message in the logs, if your systems are not well prepared for it. Beside of the preparation steps I pointed out here there is a tool in Brocade's FabricOS especially for performance problems: The bottleneck monitor or short:
If a performance problem is escalated to the technical support the next thing most probably happening is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32 bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
Beside the wait time the whole handling of the data collections including gathering and uploading them to the support takes precious time. And then the support has to process and analyze them. After all these hours of continously repeating telephone calls you get from management and internal and/or external customers, the support guy hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware to a slow drain device, you would now start to involve the admins and/or support for the particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x and some major limitations. But from v6.4x it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the fabric watch license (pre-loaded on many of the IBM-branded switches and directors) you can do that already for a long time. But the bottleneckmon offers a bit more convenience and brings it in the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short if you see a latency bottleneck reported against an F-Port you most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically pressures back to affect wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average framesize is much smaller than expected. Or QoS is configured on the link but you don't have QoS-Zones prioritizing your I/O. This could have a huge negative impact! Another reason could be a mis-configured longdistance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core-policy which detects credit starvation situations was pretty much pre-defined before v7.0 you're now able to configure it in the minutest details. We are still testing that out more in detail - for the moment I recommend to use the defaults.
So how to use it?
At first: I highly recommend to update your switches to the latest supported v6.4x code if possible. It's much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck was found. Pretty straightforward. If you are not sure you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
And if there are messages now?
Well, there is still the chance, that there are actually situations of buffer credit starvation the default-configured bottleneckmon can't see. However as you read an introduction here, I assume you just open a case at the IBM support.
You'll Never Walk Alone! :o)
*)Depending on country-specific policies and maintenance contracts a performance analysis as described above could be a charged service in your region.