How to fight back pressure against an appliance in a core edge fabric
seb_ 060000QVK2 Comments (4) Visits (10675)
A slow drain device often has a huge impact on the performance of many other devices in a SAN environment. That happens, because they block resources in a fabric other devices use as well. The main example for such a resource are ISLs, particularly the Virtual Channel(s) within those ISLs that are used to reach the slow drain device. But as soon as you have an appliance in the SAN, this could turn into such a blocked resource as well.
Disclaimer: There are several definitions and types of appliances. Within this article an appliance is a device "in the middle" between the hosts and the storages with a specific task such as a compression, encryption, virtualization or deduplication appliance. While I had the SAN Volume Controller (SVC) in mind while I wrote this, it applies to many other products matching this definition. The common thing is that the performance they can provide is to some degree dependent on their destination device's performance.
Fortunately many of the fabrics I saw over the recent years were designed using a core-edge approach. If the device is in the communication path of many of the devices in a SAN it's best practice to attach it directly to the core. But a slow drain device can still block it. This is how it happens:
In this sketch the appliance sends data towards a slow drain device. It will not be able to process the incoming frames quickly enough - they pile up in its HBA's ingress buffers (1). As the appliance is still sending frames but the edge switch cannot send them further to the slow drain device, they also pile up in the ingress buffer of the ISL port of the edge switch (2). Now this could already impair the performance of the other host connected to the same edge switch like the slow drain device - if the frames towards it use the same VC. Some microseconds later the same might happen to the frames from the appliance entering the core (3). They pile up there as well and as soon as that happens, this so called back pressure reaches the appliance itself then. As there are no VCs on the F-to-N-port connection used to attach the appliance to the core, the chance is high that the appliance cannot send any frames out to the SAN anymore - no matter to which destination (4).
Well, that means you just turned your appliance into a slow drain device itself! The performance of the whole environment is heavily impaired now:
In step (5) the frames from the other hosts towards the appliance pile up in the core as well and then the back pressure spreads further to the hosts connected to the edge switches as well (6).
Worst case, hmm?
After the ASIC hold time is reached (usually 500ms) the switches will begin to drop frames to free up buffers again. But as all switches have the same ASIC hold time, you'll end up in the situation that while the edge switch reach these 500ms first, the core switch will start to drop the frames likewise before the buffer credit replenishment information (VC_RDY) from the edge switches arrive. So not only the frames from the communication with the initial slow drain device will be dropped, but most of the others down the path as well. As as the appliance itself turned into a slow drain device, the same might happen to the frames piled up because of that, too.
So what to do against it?
The first thing is: give the F-ports of the appliance as much buffers as possible. Prio 1 should be that it should be able to send its frames out into the fabric, so the chances are higher that when the frames of the open I/O against the slow drain device are out there, there could be still some buffer credits left to send stuff to other devices. For clustered appliances like the SVC it's even more important, because they use these ports for their cluster-internal communication as well. Blocked ports could result in cluster segmentation then (SVC: single nodes rebooting due to "Lease expiry"). To assign more buffers to the switch port (= more buffer credits for the port of the appliance), use
portcfgfportbuffers --enable [slot/]port buffers
Update: Please keep in mind, that adding more buffers to an F-Port is of course disruptive for the link!
To check how much buffers are available, you can check
But in many cases this is not enough. Some time ago, Brocade released Fabric Resiliency Best Practices with some good advises. In my opinion every SAN admin with Brocade gear should have read it. It recommends:
While Fabric Watch is used more and more and especially in the FICON world - but also for open systems - I see some of our customers using port fencing, I hardly see anyone utilizing the Edge Hold Time feature. For a situation as described above it could really improve the situation for the appliance and the other hosts dramatically. It can be set to any value between 100ms and 500ms. It was introduced in FOS v6.3.1b. So if you expect hosts connected to an edge switch to behave slow draining in certain situations, in my opinion the Edge Hold Time of that switch should be set as low as possible. Of course it's always depending on your environment and how likely it is to be impaired by a slow drain device, but 100ms is a long time in a SAN. If you also have some legacy devices connected to these edge switches, check if a decreased hold time could be a problem for them.
It can be enabled and configured using the "configure" command, where it can be found in:
configure Not all options will be available on an enabled switch. To disable the switch, use the "switchDisable" command. Configure... Fabric parameters (yes, y, no, n): [no] yes Configure edge hold time (yes, y, no, n): [yes] Edge hold time: (100..500) 
You don't need to disable the switch to change the Edge Hold Time and as one of the fabric parameters it will be included in a configupload.
As it seems to be used very seldom in the field I would like to get some feedback if you actually used it. Please give me a hint if and in which situation it helped you. Thanks!
But don't forget: The most important thing is to get rid of the slow draining behavior!