I claim that in 2012 performance problems will keep their place among the most frequent and most severe problems in the SAN. In many cases the client's users actually notice a performance impact, so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are cases that don't look like performance problems at first but turn out to have the same causes. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or, even worse, failing recoveries - these could all be "performance problems in disguise".
When I analyze the data and find that a slow drain device or congestion is the real cause of the problem, I write down my findings and try to give the client some hints about possible next steps, for example by pointing to my earlier blog article, "How to deal with slow drain devices".
Do you know what the nasty part is?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to point the finger at another vendor's product. Of course I usually explain what it is and what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the nearest search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, where there are some questions about the topic. Considering the typical substance of posts in public forums, I would then check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing from the SNIA dictionary, too. And he's right: nothing!
So why is that?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association, while the term "slow drain device" does not appear a single time on the entire SNIA website (according to its built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible, and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (as used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think that a term is not widely used or important enough to appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers quickly enough and therefore don't allow the connected port to regain its buffer credits in time.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
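To make the two causes a bit more tangible, here is a minimal sketch in Python. The function names and all bandwidth figures are made up for illustration; they do not come from any vendor tool. It only shows the arithmetic behind the definition: a link is oversubscribed when the ports that can send through it add up to more bandwidth than the link has, and it is congested when the load actually offered exceeds what the link can carry.

```python
def oversubscription_ratio(edge_bandwidths_gbps, link_bandwidth_gbps):
    """Ratio of the bandwidth that could be directed at a link to the link's own bandwidth."""
    return sum(edge_bandwidths_gbps) / link_bandwidth_gbps


def is_congested(offered_load_gbps, usable_bandwidth_gbps):
    """True if the workload for the link exceeds its usable bandwidth."""
    return offered_load_gbps > usable_bandwidth_gbps


# Hypothetical example: six 8 Gbps host ports share one 16 Gbps ISL.
print(oversubscription_ratio([8] * 6, 16))   # 3.0
# The same ISL is offered 18.5 Gbps of traffic at a given moment.
print(is_congested(18.5, 16.0))              # True
```

An oversubscribed link is not congested as long as the hosts don't all send at once; the ratio only describes the risk, while the second check describes the actual situation.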
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames are held in the sending device, where they block buffers, and eventually have to be dropped if they can't be sent within a certain time (usually 500ms).
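The mechanism can be played through with a very small simulation. This is only a sketch under simplifying assumptions of my own: time advances in 1 ms steps, the receiver returns each credit a fixed time after the frame was sent, and every frame that has waited for the hold time is dropped. The function name, the parameters and the numbers in the example are hypothetical.

```python
from collections import deque


def simulate_port(credits, drain_ms, arrival_ms, duration_ms, hold_ms=500):
    """Simulate one transmitting port in 1 ms steps.

    credits     -- buffer credits granted by the receiving port
    drain_ms    -- time the receiver needs to free a buffer and return the credit
    arrival_ms  -- a new frame is queued for sending every arrival_ms
    hold_ms     -- frames waiting this long in the sender are dropped
    Returns (frames sent, frames dropped, milliseconds spent starved).
    """
    available = credits
    returning = deque()   # points in time at which outstanding credits come back
    waiting = deque()     # enqueue times of frames held in the sender
    sent = dropped = starved_ms = 0

    for now in range(duration_ms):
        # The receiver has freed buffers: credits are returned to the sender.
        while returning and returning[0] <= now:
            returning.popleft()
            available += 1
        # A new frame arrives at the sender.
        if now % arrival_ms == 0:
            waiting.append(now)
        # Frames that could not be sent within the hold time are dropped.
        while waiting and now - waiting[0] >= hold_ms:
            waiting.popleft()
            dropped += 1
        # Send as long as there are frames and credits.
        while waiting and available > 0:
            waiting.popleft()
            available -= 1
            returning.append(now + drain_ms)
            sent += 1
        # Frames are waiting but no credit is left: buffer credit starvation.
        if waiting and available == 0:
            starved_ms += 1

    return sent, dropped, starved_ms


# A receiver that returns credits after 1 ms keeps up with one frame per ms.
print(simulate_port(credits=8, drain_ms=1, arrival_ms=1, duration_ms=2000))   # (2000, 0, 0)
# A receiver that needs 40 ms per buffer starves the sender and causes drops.
print(simulate_port(credits=8, drain_ms=40, arrival_ms=1, duration_ms=2000))
```

With the slow receiver the port spends most of the time without credits, the queue in the sender grows, and after a while the oldest frames reach the hold time and are discarded.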
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
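How an "apparently unrelated" device gets hit can be sketched like this. The model is deliberately crude and entirely my own assumption: every link on the way to the slow drain device is treated as starved, and every flow to another destination that crosses one of these links counts as a victim. Device names, flow names and the function name are hypothetical.

```python
def affected_flows(flows, slow_device):
    """Return the flows that suffer from back pressure caused by slow_device.

    flows -- dict mapping a flow name to its path, a list of hops from
             the source to the destination
    """
    # Every link that carries frames towards the slow drain device backs up.
    starved_links = set()
    for path in flows.values():
        if path[-1] == slow_device:
            starved_links.update(zip(path, path[1:]))

    # Flows to other destinations that share such a link are the victims.
    victims = set()
    for name, path in flows.items():
        links = set(zip(path, path[1:]))
        if path[-1] != slow_device and links & starved_links:
            victims.add(name)
    return victims


flows = {
    "backup":   ["host_a", "switch_1", "switch_2", "tape_drive"],
    "database": ["host_b", "switch_1", "switch_2", "disk_array"],
    "local":    ["host_c", "switch_1", "host_d"],
}
print(affected_flows(flows, slow_device="tape_drive"))   # {'database'}
```

The database flow has nothing to do with the tape drive, but it shares the ISL between the two switches with the backup flow and therefore waits for the same credits. The local flow stays on switch_1 and is not touched in this simple model.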
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner (e.g. because of buffer credit starvation or congestion).
Bottlenecks increase latency or even cause frame drops and upper-level error recovery.