Is under-utilization of a SAN bad? And what have slow drain devices to do with that?
seb_
There is an interesting discussion going on in the LinkedIn group "The Storage Group". The question is "What is the REAL cost of Fibre Channel?". To my surprise, the participants fairly quickly came to the conclusion that the problem is over-provisioning, or rather under-utilization. My personal opinion was:
"I would like to come back to the over-provisioning / under-utilization part. Being a tech support guy, I think a bit differently about that. State of the art is 16G FC now, but of course I see the majority of customers being on 8G or even 4G. Eventually they will move to higher speeds. Not because all of them really need the higher speed, but because those are simply the switches and HBAs being sold and marketed at the moment. The "speed race" is driven mostly by the vendors and by the customers who really need that line rate. But is it bad for the others? I don't think so. A 16G switch is not really 2x the price of an 8G switch or 4x the price of a 4G one. In fact I see prices falling on a per-port basis while functionality increases. And then you stand there with your host X. It has a demand of, let's say, 200MB/s in total, and you connected it to 2 redundant fabrics running at 8G, 1 port per fabric.
That makes: 200MB/s of demand versus 1600MB/s available. WOW! YOU ARE TOTALLY UNDER-UTILIZED! Shame on you!
Well, not really. Actually it's good to have redundancy. You know that. First of all, "real" redundancy means you are at least 50% under-utilized per se. On top of that comes the higher line rate, which made no difference in price compared to the lower one. That means it is normal that you end up over-provisioned and under-utilized.
In fact things start to get ugly if you really run all your links near 100%. I have started to see that scenario more often recently, when customers put VMs on ESX hosts without really knowing their I/O demand. Many of these setups work until the next outage (SFPs _WILL_ break some day, a software bug could crash a switch, etc.), and then you see that you have no real redundancy, because the surviving links cannot carry the full load. On the other hand, many of these ESX hosts with many VMs running different, unknown workloads tend to turn into slow drain devices as soon as the I/O peaks of certain VMs coincide. At that point at the latest you notice that under-utilization of a network is not really a bad thing :o)"
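The arithmetic from the host X example can be written down as a small Python sketch. The per-port figures are the usual rule-of-thumb values (800MB/s for 8G, matching the 1600MB/s quoted above for two ports); the function names are made up for illustration, and the 1400MB/s case is a hypothetical heavily loaded host, not a measured one.

```python
# Rule-of-thumb FC throughput per port and direction in MB/s (assumed values).
FC_MBPS = {4: 400, 8: 800, 16: 1600}


def utilization(demand_mbps, speed_gbit, fabrics, ports_per_fabric=1):
    """Fraction of the total available bandwidth the host actually demands."""
    capacity = FC_MBPS[speed_gbit] * fabrics * ports_per_fabric
    return demand_mbps / capacity


def survives_fabric_loss(demand_mbps, speed_gbit, fabrics, ports_per_fabric=1):
    """True if the demand still fits after one complete fabric has failed."""
    remaining = FC_MBPS[speed_gbit] * (fabrics - 1) * ports_per_fabric
    return demand_mbps <= remaining


# Host X: 200MB/s demand, 2 fabrics at 8G, 1 port per fabric.
print(utilization(200, 8, 2))            # 0.125
print(survives_fabric_loss(200, 8, 2))   # True
# Hypothetical host that "uses its links well": no redundancy left.
print(survives_fabric_loss(1400, 8, 2))  # False
```

With two fabrics, anything above 50% total utilization fails the second check, which is the "at least 50% under-utilized per se" point.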
What bugs me most these days are ESX hosts turning into slow drain devices. Nobody really seems to know the demand of their VMs, and the internal statistics of ESX seem to be very limited in that respect. If you look at the port of a slow drain device, it will most probably still look under-utilized from a bandwidth perspective, because the missing buffer credits plus the error recovery keep the plain MB/s numbers down. But in fact the port is completely saturated. In addition, the frames eventually dropped in the SAN lead to timeouts within the slow draining host as well. In the end it looks like: "My ESX is far away from utilizing its link completely, but the SAN is bad! We have timeouts!".
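Why a credit-starved port can look idle in MB/s terms can be illustrated with a deliberately simplified toy model: if a device only hands back each buffer credit after some delay, throughput is capped by credits x frame size / delay, regardless of the line rate. The credit count and delays below are hypothetical numbers chosen for illustration, and the model ignores error recovery entirely.

```python
def credit_limited_mbps(line_rate_mbps, credits, frame_bytes, credit_return_s):
    """Upper bound on throughput (MB/s) when every buffer credit is only
    returned credit_return_s seconds after the frame was sent (toy model)."""
    credit_bound = credits * frame_bytes / credit_return_s / 1e6
    return min(line_rate_mbps, credit_bound)


FRAME_BYTES = 2112  # maximum FC data field size per frame

# Healthy device: credits come back quickly, the line rate is the limit.
healthy = credit_limited_mbps(800, credits=8, frame_bytes=FRAME_BYTES,
                              credit_return_s=10e-6)
# Slow drain device: credits are held much longer (hypothetical 200 us).
slow = credit_limited_mbps(800, credits=8, frame_bytes=FRAME_BYTES,
                           credit_return_s=200e-6)

print(f"healthy: {healthy:.0f} MB/s, slow drain: {slow:.0f} MB/s")
print(f"apparent utilization of the slow port: {slow / 800:.0%}")
```

In this model the slow port reports only around a tenth of its line rate, yet the sender is out of credits almost all the time, so from the fabric's point of view the port is saturated.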
So what's the demand?
Some customers have the luxury (should this really be considered a luxury?) of having a VirtualWisdom probe installed that constantly monitors the exact performance values in real time. Archie Hendryx shows some of the things you can see there in practice in his whitepaper "Destroying the Myths surrounding Fibre Channel SAN". But if you don't have such gear and you don't know the demand, it might be worth having an additional ESX host for testing. It does not need to be the biggest machine, don't worry. Every day you take another candidate out of your pool of VMs with unknown I/O bandwidth (or CPU / memory / etc.) demand and move it to that test server with vMotion. Relatively undisturbed by other VMs (at least within the ESX host), you can then measure all its performance values for 24 hours and, provided no error recovery or external congestion takes place, these are the real demands of that VM. Only based on these demands do you really know which VMs can be placed together on the same bare metal. Only then do you have a chance to actually improve the under-utilization in a controlled manner without slamming your SAN into the realms of chaos. The approach seems very simple and straightforward to me, but I see nobody doing this. So what's my error in reasoning, dear reader?
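Once the 24-hour measurements exist, the placement decision itself is simple arithmetic. The following Python sketch assumes the measurements are available as per-VM lists of MB/s samples taken at the same intervals of the day; the VM names, sample values, and function names are all hypothetical. It sums the profiles interval by interval, so peaks that coincide count fully, and compares the combined peak against what is left when one fabric is down.

```python
def combined_peak(samples_by_vm):
    """Peak of the summed per-interval throughput (MB/s) of all VMs."""
    return max(sum(interval) for interval in zip(*samples_by_vm.values()))


def can_share_host(samples_by_vm, link_mbps, fabrics=2):
    """True if the combined peak still fits with one fabric failed."""
    surviving = link_mbps * (fabrics - 1)
    return combined_peak(samples_by_vm) <= surviving


# Hypothetical measurements, one sample per interval (real data would
# have many more samples per day).
measured = {
    "vm_a": [50, 400, 60, 40],
    "vm_b": [30, 500, 20, 10],
    "vm_c": [20, 30, 300, 20],
}

print(combined_peak(measured))        # 930: vm_a and vm_b peak together
print(can_share_host(measured, 800))  # False on 2 x 8G

without_b = {vm: s for vm, s in measured.items() if vm != "vm_b"}
print(can_share_host(without_b, 800))  # True
```

Averages would hide exactly the coinciding peaks that turn a host into a slow drain device, which is why the sketch works on the summed time series rather than on per-VM mean values.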
(Thanks to Harout S Hedeshian for the picture.)