Brocade Fabric OS v7.1 in the eyes of a troubleshooter
seb_ 060000QVK2 Visits (10438)
To like working in tech support, you have to be the most optimistic guy around. You have to be even more optimistic about the product you support than the sales guy trying to sell it. Why? Because the product can be as fantastic as possible - jam-packed with jaw-dropping features - as a tech support guy you will only witness the bugs. However, the bugs are not what's annoying me. Well, at least most of them. :o) Every software necessarily has bugs. They are my job, the reason of its mere existence. What's really annoying me is, when I know that there is a problem, but the RAS package is just not good enough to enable me troubleshooting it.
Therefore, I was pleasantly surprised when I read the release notes of the Fabric OS v7.1 codestream. There are a lot of tweaks and features that make the life of a troubleshooter easier. And it's not only about finding problems, it's about preventing them, too. So here is just a first selection of what I like:
Can I trust the counters?
"FOS v7.1 has been enhanced to display the time when port statistics were last cleared." says the release note. This sounds trivial, but it's essential for the troubleshooting of many problem types like performance problems, physical problems and so on. Times when we had to go through the CLI history - in the hope that the counters were cleared via CLI after a proper login - seem to be over now.
Link Reset Type in the fabriclog
A small enhancement, but a time-saving one. To get a time-based overview about the state-changes of the ports, you usually have a look into the fabriclog. But there you often only see that there were link resets. The interesting thing would be to find out who initiated them - the local port or the remote one. The LR_IN and LR_OUT counters in portshow were an insufficient source of information here as they show only absolute numbers. In Fabric OS v7.1 they type is simply part of the message and you see it at a glance.
For many admins the best practice to replace an SFP is to disable the port, then replace the SFP and afterwards re-enable the port again. I know many people who did this and I felt always uncomfortable to tell them, "Rip it out while it runs, otherwise the switch won't recognize it correctly." But that's the way it is before v7.1: If the port is not running while you replace an SFP, it might not notice that for example the 4G LW SFP that was in there before is now an 8G SW SFP. Beside of any ugly additional bug that was possible based on that later on, the behavior itself was a pain. In v7.1 you don't have to care for that. Sfpshow will show you the correct information. Additionally sfpshow will also tell you when the last automatic polling of the SFP's serial data took place.
Honest long distance
If you read SAN Myths Uncovered 2: The LD mode (Brocade) on my blog before, you know that the whole long distance stuff in Brocade switches is a little bit... let's say "optimistic". For long distance ISLs (other than long distance end-device connections) you only configure the length of the connection and the switch calculates the necessary amount of buffers. But as it does that by using the maximum frame size, you'll end up with a buffer shortage for basically all real-world use cases. In Fabric OS v7.1 new functions take account of this fact. The command portbuffershow (by the way a mandatory candidate for every data collection) will show you the average frame size now. So sooner or later I can mothball my article about How to determine the average frame size. And this value then can be used to optimize the buffer settings in the completely overhauled portcfglongdistance command. Now it will calculate the buffers based on your average frame size. Furthermore, it allows you to configure the absolute number of buffers yourself if you want. You don't need to tell your switch anymore that a distance is 200km just to assign enough buffers to span 60km with your real-world average frame size being far less than the maximum one. It's that kind of clarity that prevents misconceptions and evitable performance problems.
This is not an exhaustive list of all the good new things. There are definitely more good features in direction of RAS like enhancements for credit recovery, Diagnostic Ports, FDMI, Edge Hold Time, FCIP and many others. In my eyes they'll make the platform even more robust and after all, it will hopefully give me a little more time to write more blog articles in the future. :o)
Oh wait... is this the call to update to v7.1 immediately?
Well, no, it's not. It's just an outlook for the things to come. Better plan your updates carefully. You know, it's just a blog article by the most optimistic guy around... ;o)