Why inter-node traffic across ISLs should be avoided
seb_ 060000QVK2 Comments (9) Visits (8802)
If you use a SAN Volume Controller it usually is the linchpin of your SAN. Except for the FICON and tape related stuff everything is connected to it. It is the single host for all your storage arrays and the single storage for all your host systems. Because of this crucial role the SVC has some special requirements regarding your SAN design. The rules can be seen in the manuals or in the SVC infocenter (just search for "SAN fabric"). One of these rules is "In dual-core designs, zoning must be used to prevent the SAN Volume Controller from using paths that cross between the two core switches.".
I made this sketch to illustrate that. As you see it's not a complete fabric, but just the devices I want to write about. Sorry for the poor quality, my sketching-kungfu is a bit outdated :o)
This is just one of two fabrics. The both SVC nodes are connected to the both core switches. The edge switch is connected to both core switches and beside of the SVC business let's assume there is a host connected to the edge switch using a tape library connected to the cores. There would be other edge switches, more hosts and of course storage arrays as well. Now the rule says that the SVC node ports are only allowed to see each other locally - therefore on the same switch.
So why is that so?
Of course you could say that this is the support statement and if you want to use a SAN Volume Controller you just have to stick to that. But from time to time I see customers with dual-core fabrics who don't follow that rule. Of course initially when the SVC was integrated into the fabric, the rule was followed because it was most probably done by a business partner or an IBM architect according to the rules and best practice. But later then after months or years - maybe even the SAN admin changed - new hosts were put into the fabric and in an initiator-based zoning approach, each adapter was zoned to all its SVC ports in the fabric. Et voilà! The rule is infringed. The SVC node ports see each other over the edge switch again and the inter-node traffic passes 2 ISLs instead of none.
What is inter-node communication?
Beside of the mirroring of the write cache within an I/O group there is a system to keep the cluster state alive. It includes a so called lease which passes all nodes of a cluster (up to 8 nodes in 4 I/O groups) in a certain time to ensure that communication is possible. These so called lease cycles start again and again and they do even overlap so if one lease is dropped somehow and the next cycle finishes in time, everything is still fine. The lease frames will be passed from node to node within the cluster several times. But if there are severe problems in the SAN the cluster has to trigger the necessary actions to keep the majority of the nodes alive. Such an action would be to warm-start the least responsive node or subset of nodes. You will read "Lease Expiry" in your error log. In a worst case scenario where the traffic is heavily impacted to a degree that the inter-node communication is not possible at all, it might happen that all nodes do a reboot and if the impact stays in the SAN they might do that again and won't be able to serve the hosts.
The result - BIG TROUBLE!
Just as a small disclaimer to prevent FUD (Fear, Uncertainty and Doubt): This is not a design weakness of the SVC or something like that. All devices in a SAN are vulnerable to the risk I want to describe. In addition from all the error handling behavior of the SVC as I know it the SVC seems to be designed to rather allow an access loss than to allow data corruption. It is still the last resort but it's better than actually loose data.
Back to the dual-core design. The following sketch just shows that with the wrong zoning, the lease could take the detour over the edge switch instead of going directly from node 1 to node 2 via core 1 or core 2. It would pass 2 ISLs.
Why should I care?
There are several technical reasons why ISLs should be avoided for that kind of traffic but from SAN support point of view I consider this one as the mose important: slow drain devices! Imagine one day the host would act as a slow drain device for any reason. The tape would send its frames to the host passing the cores and the edge switch. As the host is not able to cope with the incoming frames now, it would not free up its internal buffers in a timely manner and would not send permission to send more frames (R_RDYs) to the switch quickly enough. The frames pile up in the edge switch and congest its buffers. The congestion back-pressures to the cores and finally to the tape drive. As the frames wait within the ASICs some of them will eventually hit the ASIC hold-time of 500ms and get dropped. This causes error recovery and based on the intensity of the slow drain device behavior it would kill the tape job. Bad enough?
But hey! The SVC needs these ISLs!
And that's were it gets ugly. In the sketch above the ISL between the core 1 and the edge switch will become a bottleneck not only for that tape related traffic but for the SVC inter-node communication as well. It will not only cause performance problems (due to the disturbed write cache mirroring) but also could lead to the situation that the frames from several SVC lease cycles in a row would be delayed massively or even dropped causing lease expiries resulting in node reboots.
That's why keeping an eye on the proper zoning for the SVC is so important and that's the reason for that rule.
Just as a short anecdote related to that: Some years ago I had a customer with a large cluster where not the drop of leases but the massive delay of them caused the problem. As every single pass of the lease from one node to the next was only just within the time-out values the subset of nodes that was really impaired by the congestion saw no reason to back out and reboot but as the overall time-out for the lease cycles was reached at a certain point in time, the wrong (because healthy) nodes rebooted then and the impaired ones were kept alive. Not so good... As far as I know some changes were done in the SVC code later to improve its error handling in such situations but the rule is as valid as ever:
Avoid inter-node traffic across ISLs!