One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow draining device). Why is it so ugly? Well, if a full fabric or a full data center goes down - due to a fire, for example - that's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. A performance degradation due to a slow drain device, on the other hand, is not so obvious - at least not to most hosts, operators, or automatic failover mechanisms. Frames are dropped seemingly at random; paths fail, but with the next TUR (Test Unit Ready) they seem to work again, only to fail again minutes later. Error recovery hurts performance, and the worst part: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN, you might have found it using the bottleneckmon, or you noticed frame discards due to timeouts on the TX side of a device port. If you have a Cisco SAN, you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. In any case, let's imagine the culprit of the fabric-wide congestion has already been identified. But what now?
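If you want to keep an eye on this yourself on a Brocade fabric, the counter tim_txcrd_z (time a port spent at zero TX credit) from portstatsshow is a good starting point. Here is a minimal Python sketch that scans a saved dump of portstatsshow output for all ports and flags suspicious ones; the "port: N" header layout and the threshold are assumptions for illustration, not a ready-made tool:

```python
# Minimal sketch: scan a saved Brocade "portstatsshow" dump for ports whose
# tim_txcrd_z counter (time spent at zero TX credit) is suspiciously high.
# The "port: N" header layout of the dump and the threshold are assumptions.
import re
import sys

THRESHOLD = 1_000_000  # assumed: tune to your polling interval and fabric

def scan(path: str) -> None:
    port = "?"
    with open(path) as f:
        for line in f:
            m = re.match(r"port:\s*(\d+)", line)  # assumed per-port header
            if m:
                port = m.group(1)
            m = re.search(r"tim_txcrd_z\s+(\d+)", line)
            if m and int(m.group(1)) > THRESHOLD:
                print(f"port {port}: tim_txcrd_z={m.group(1)} -> slow drain candidate")

if __name__ == "__main__":
    scan(sys.argv[1])
```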
The following checklist should help you think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list is exhaustive, and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device:
- Is this the latest supported HBA firmware? (A small inventory sketch follows this list.)
- Are the drivers / filesets up-to-date and matching?
- Any newer multipath driver out there?
- Check the release notes of all available firmware / driver versions for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you find a bugfix in a newer and supported version, it is worth testing.
- If you find a bugfix in a newer but unsupported version, contact the support teams of the connected devices to get it supported, or at least to learn when it will be.
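On a Linux host you can gather the raw data for these checks from sysfs. A minimal sketch, assuming the standard fc_host transport class is present; whether symbolic_name actually contains the firmware level depends on the HBA driver:

```python
# Minimal sketch: inventory FC HBAs on a Linux host via sysfs so the
# firmware/driver levels can be compared against the vendor's support matrix.
# The fc_host class and its port_name/symbolic_name attributes are standard;
# the exact content of symbolic_name varies by vendor and driver.
from pathlib import Path

def read(p: Path) -> str:
    try:
        return p.read_text().strip()
    except OSError:
        return "n/a"

for host in sorted(Path("/sys/class/fc_host").glob("host*")):
    wwpn = read(host / "port_name")
    # symbolic_name usually embeds driver and firmware versions
    # (format varies by vendor/driver)
    symbolic = read(host / "symbolic_name")
    print(f"{host.name}: WWPN={wwpn} {symbolic}")
```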
Check the configuration:
- Is it configured according to available best practices? (For IBM products, often a Redbook is available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better to have them all at the same line rate.
- Queue depth: would decreasing it help, so there are fewer concurrent I/Os? (See the sketch after this list.)
- Is the load balanced over the available paths? Check your multipath policies!
- Check the number of buffers. Can it be modified? (The direction depends on the type of the problem.)
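On Linux, the queue depth can be inspected and changed per disk through sysfs. A minimal sketch; the target depth of 16 is just an example value, not a recommendation - derive yours from the vendor's guidance:

```python
# Minimal sketch: list (and optionally lower) the queue depth of SCSI disks
# on a Linux host. /sys/block/<dev>/device/queue_depth is the standard knob;
# the target value of 16 is an assumed example, not a recommendation.
from pathlib import Path

NEW_DEPTH = 16   # assumed example value
APPLY = False    # set True to actually write the new depth (needs root)

for qd_file in sorted(Path("/sys/block").glob("sd*/device/queue_depth")):
    dev = qd_file.parts[3]  # e.g. "sda"
    current = int(qd_file.read_text())
    print(f"{dev}: queue_depth={current}")
    if APPLY and current > NEW_DEPTH:
        qd_file.write_text(str(NEW_DEPTH))  # takes effect immediately
```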
Check the workload:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better separate them.
- Too much workload at the same time? Jobs all starting concurrently? Better distribute them over time (see the sketch after this list).
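Distributing jobs is often just a matter of computing staggered start times instead of one common one. A tiny sketch with made-up job names and an assumed four-hour window:

```python
# Minimal sketch: spread N batch jobs evenly over a time window instead of
# starting them all at once. The job names and the window are made up for
# illustration; the point is simply to avoid synchronized I/O bursts.
from datetime import datetime, timedelta

jobs = ["backup_db1", "backup_db2", "backup_fileshare", "report_batch"]
window_start = datetime(2013, 1, 1, 22, 0)  # assumed window: 22:00-02:00
window = timedelta(hours=4)

step = window / len(jobs)
for i, job in enumerate(jobs):
    start = window_start + i * step
    print(f"{start:%H:%M}  {job}")
```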
Check the concept:
- Multi-type virtualized traffic over the same HBA? One VM doing tape access sharing a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
Check the logs of this device for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical errors. If you have bit errors on the link, the switch may miss R_RDY primitives (responsible for increasing the sender's buffer credit counter again after the recipient has processed a frame and freed up a buffer).
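Why are bit errors so nasty for credit flow control? Because a corrupted R_RDY means the credit it would have returned is simply gone until the next link reset. The following toy model (not a protocol implementation - real R_RDYs return asynchronously) shows how even a small loss rate bleeds the credit pool dry:

```python
# Toy model of buffer-to-buffer credit flow control: the sender may only
# transmit while it holds credits; each frame consumes one, each R_RDY from
# the receiver returns one. If a bit error corrupts an R_RDY, that credit
# is lost until a link reset, so the usable credit pool shrinks over time.
import random

CREDITS = 8            # initial B2B credits granted at login
RDY_LOSS_RATE = 0.01   # assumed probability that an R_RDY is corrupted

credits = CREDITS
sent = stalled = 0
for _ in range(100_000):
    if credits == 0:
        stalled += 1
        continue              # sender must wait: no credit, no frame
    credits -= 1              # frame consumes a credit
    sent += 1
    if random.random() > RDY_LOSS_RATE:
        credits += 1          # R_RDY arrived intact, credit returned
    # else: R_RDY lost to a bit error; credit gone until link reset

print(f"frames sent: {sent}, cycles stalled: {stalled}, credits left: {credits}")
```

With 8 credits and a 1% loss rate, the sender typically gets a few hundred frames out before it stalls for good - on a real link, this is the point where link resets or frame drops kick in.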
Use granular zoning (initiator-based zoning, or better, 1:1 zones) to minimize the impact of RSCNs. (A device that has to query the name server again and again has less time to process frames.)
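Generating 1:1 zones for many devices is tedious by hand but trivial to script. A minimal sketch that emits Brocade zonecreate commands; the WWPNs and the z_&lt;initiator&gt;_&lt;target&gt; naming scheme are made-up examples:

```python
# Minimal sketch: generate 1:1 (single-initiator / single-target) zone
# definitions as Brocade CLI commands. The WWPNs and the naming scheme are
# made-up examples; zonecreate itself is standard FOS syntax.
initiators = {"host1_p0": "10:00:00:05:1e:aa:bb:01"}
targets = {
    "array1_p0": "50:05:07:68:01:40:aa:01",
    "array1_p1": "50:05:07:68:01:40:aa:02",
}

for iname, iwwpn in initiators.items():
    for tname, twwpn in targets.items():
        zone = f"z_{iname}_{tname}"
        print(f'zonecreate "{zone}", "{iwwpn}; {twwpn}"')
```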
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (Fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (Fan-in problem)
- Is it possible to add more HBAs / FC adapters? Maybe on other buses?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain device behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storage systems may have similar features.
- Brocade features like Traffic Isolation Zones, QoS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: use port fencing or an automated script to kick marauding ports out of the SAN (a sketch of such a script follows this list).
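For the "automated script" variant, the skeleton below polls the c3 discard counter of the offending port and fences it once a threshold is crossed. portdisable and porterrshow are real FOS commands, but the hostname, port, threshold and column index are assumptions, and the parsing is illustrative only - verify it against your FOS release before trusting it with anything drastic:

```python
# Minimal sketch of an automated port-fencing script: poll a discard counter
# and disable the offending switch port once it crosses a threshold.
# Fencing a port is drastic: only automate it if you trust the counter source.
import subprocess
import time

SWITCH = "san-switch-01"   # assumed switch hostname
PORT = 42                  # the identified slow drain port
THRESHOLD = 500            # assumed: tolerated c3 discards per poll
DISC_C3_COL = 10           # assumed column of "disc c3" in porterrshow;
                           # verify against your FOS release before use

def get_c3_discards(switch: str, port: int) -> int:
    """Fetch porterrshow over SSH and pick out the c3 discard counter."""
    out = subprocess.run(["ssh", f"admin@{switch}", "porterrshow"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == f"{port}:":
            return int(fields[DISC_C3_COL])
    return 0

while True:
    if get_c3_discards(SWITCH, PORT) > THRESHOLD:
        subprocess.run(["ssh", f"admin@{SWITCH}", f"portdisable {PORT}"])
        print(f"fenced port {PORT} on {SWITCH}")
        break
    time.sleep(60)
```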
The list above is just a collection of things I have already seen in problem cases. Having said this, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinion and in more reasons or ways to deal with them!