How to handle a congestion bottleneck
seb_
I check the referrers of this blog from time to time to learn where my readers come from. For many of them I cannot actually see it, because "bridge" pages are often used, for example by the social networking sites. But a fair number come from searches on Google and other search engines, and some search queries repeat very often. Maybe I will write more articles about the others - because hey, this seems to be the stuff you're coming for :o) - but this time it's about congestion bottlenecks.
Congestion bottlenecks are, besides latency bottlenecks, one of the two things the Brocade bottleneckmon can detect. With the default setup, and if you enabled bottleneckmon with alerting, you will be alerted whenever 80% of all seconds within a 5-minute interval showed 95% link utilization. That is a big number! Of course you can also modify the setup to be more aggressive, or to spare you some messages in an environment that is usually "under fire"...
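To make the default threshold more tangible, here is a minimal Python sketch of the rule as described above. It is not how bottleneckmon is implemented; the function name, the per-second sample list and the choice to count a second as "busy" at 95% utilization or more are assumptions made for illustration.

```python
def congestion_alert(per_second_utilization, busy_level=0.95, alert_fraction=0.80):
    """Return True if enough seconds in the window were 'busy'.

    per_second_utilization: one value between 0.0 and 1.0 per second of the
    window (hypothetical input, e.g. 300 values for a 5-minute interval).
    busy_level / alert_fraction: the 95% and 80% figures from the text.
    """
    if not per_second_utilization:
        return False
    busy_seconds = sum(1 for u in per_second_utilization if u >= busy_level)
    return busy_seconds / len(per_second_utilization) >= alert_fraction


# 250 of 300 seconds near line rate: above the 80% share, so an alert.
heavy_window = [0.97] * 250 + [0.40] * 50
# 200 of 300 seconds near line rate: below the 80% share, so no alert.
lighter_window = [0.97] * 200 + [0.40] * 100

print(congestion_alert(heavy_window))
print(congestion_alert(lighter_window))
```

Note that the second window would not raise an alert although the link ran near its limit for two thirds of the interval, which is why the default is such a big number.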
But I encourage you to take it seriously!
In my opinion, a healthy SAN should NEVER have congestion bottlenecks. By "healthy" I mean, of course, the time of normal operation. Not the time when you have an incident and part of the redundancy is gone because, for example, the second fabric has a problem or one of two ISLs between two switches had to be disabled... I wrote an article last year about that and I think it fits well within the topic.
Rule of thumb: Link utilization should not exceed 50%.
And of course it should not stay at 50% just because you configured too few buffers! The setup of the link should always allow it to carry up to 100% of the workload that is physically possible. Otherwise you will have no real redundancy again!
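The reasoning behind the 50% rule can be sketched with a little arithmetic. The Python snippet below assumes links of equal capacity and assumes the traffic of a failed link is spread evenly over the surviving ones; both are simplifications, and the function name is made up for this example.

```python
def utilization_after_failure(link_utilizations, failed_index):
    """Return the utilization of each surviving link after one link fails.

    link_utilizations: utilization per link as a fraction of its capacity.
    Assumption: equal-capacity links, even redistribution of the lost
    link's traffic across the survivors.
    """
    survivors = [u for i, u in enumerate(link_utilizations) if i != failed_index]
    if not survivors:
        raise ValueError("no surviving link to take over the traffic")
    extra_per_link = link_utilizations[failed_index] / len(survivors)
    return [u + extra_per_link for u in survivors]


# Two ISLs at 50% each: the survivor ends up at its physical limit, but fits.
print(utilization_after_failure([0.5, 0.5], failed_index=0))
# Two ISLs at 60% each: the survivor would need more than it can carry.
print(utilization_after_failure([0.6, 0.6], failed_index=0))
```

Any value above 1.0 in the result means the remaining link cannot carry the combined workload, so the redundancy exists only on paper.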
But how do I handle them now that I have them?
So you see these [AN-1004] messages in the error log and you know the port. What now? This is more about your SAN concept than defects or software features. The congestion bottleneck happens because the utilization of a link approaches its physical capabilities. Here are some ideas:
And often forgotten:
In many cases the congestion bottlenecks will be observed only at specific times. Usually the devices in your SAN don't have the same workload all the time. There is a time when people sleep, a time when people come to work and switch on their VDI'ed PCs, a time when the backups run and a time when big batch jobs run. Proper planning and scheduling is mandatory in today's data centers! Don't let the big workloads run at the same time. Spread them across the 24 hours you have. The same is true for the course of the week, the month, the quarter, the year.
Very few environments are totally undersized for the average mix of workloads - but how the demand of each component of this mix develops over time is at the heart of your storage environment's performance!
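The effect of spreading workloads over the day can be illustrated with a small Python sketch. The job names, start times and demand figures below are invented for the example, and demand is expressed as a fraction of one link's capacity; in a real environment these numbers would come from your own performance data.

```python
def hourly_demand(jobs):
    """Sum the demand of all jobs for each hour of the day.

    jobs: list of (name, start_hour, duration_hours, demand_fraction),
    all values hypothetical. Jobs running past midnight wrap around.
    """
    load = [0.0] * 24
    for _name, start, duration, demand in jobs:
        for offset in range(duration):
            load[(start + offset) % 24] += demand
    return load


def overloaded_hours(jobs, limit=1.0):
    """Return the hours in which the summed demand exceeds the limit."""
    return [hour for hour, load in enumerate(hourly_demand(jobs)) if load > limit]


# Backup and batch job overlap between 02:00 and 04:00.
overlapping = [("backup", 1, 3, 0.7), ("batch", 2, 3, 0.6)]
# Same jobs, but the batch job starts after the backup has finished.
spread_out = [("backup", 1, 3, 0.7), ("batch", 4, 3, 0.6)]

print(overloaded_hours(overlapping))
print(overloaded_hours(spread_out))
```

The total amount of work is identical in both schedules; only the overlap differs, and that overlap is what shows up as a congestion bottleneck.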
If you need help to better manage your workloads, I'm sure your local IBM Sales rep or IBM business partner can put you in contact with the right performance expert to work these things out for your specific situation.