Is your storage environment robust enough?
seb_
There it is again: one of those supp
Working in storage support, I often come across small misconfigurations or other conditions that are minor in themselves (like stuck VCs) but still cause problems. Keeping in mind that - in most cases - a storage environment is designed with dual-fabric redundancy, maintenance actions should not really be a problem if every step is done in a controlled manner and not exactly during peak workload times. So resetting just one of, let's say, 8 ISLs, doing a re-login with 1 of 4 HBA ports of a host, or any other action that takes nothing offline but drops a few frames in one of the fabrics should not really be an issue. Yet I still often hear the following statement:
What a killer phrase... Not a single I/O error! How do you answer that? Not a single I/O error? Would every hiccup in the storage environment cause an application to crash, or at least harm the user experience in an unacceptable way? Well, this is actually not a showstopper - quite the contrary. This is where the work really starts, because a statement like that is a cry for help. If you say something like that about your applications, you most probably feel like a tightrope walker over a pit of spears. Be assured: that's not normal. It's not the way it should be, and it's certainly not a condition you should keep running in for long.
Never touch a running system?
This can be so wrong. Maybe your environment is running at the moment, but simply not looking at it will not make it any better. This is not Schrödinger's cat that could still be alive if you just don't open the box. Hope should never be your operational mode. Because errors will happen. SFPs will wear out. Workloads will change. Migrations will happen. Maintenance activities will take place. Firmware updates have to be done. Hardware parts will fail and software will never be bug-free.
I don't want to scare you; I'm just telling you what will happen - even if you have had no bigger problems so far. There will be I/O errors. They are inevitable. And you won't be prepared for them. Most probably they will happen when you least expect or want them: at 3am on a Sunday or in the middle of the quarter-end accounting workload.
A storage environment can be a complex system with many different hardware platforms, operating systems and workloads involved. It's full of dependencies, requirements, compatibility issues and often different vendors and support providers. And it's nothing short of an interdisciplinary challenge to find the perfect setup: from the application through the middleware and the operating systems, including clustering and virtualization, down through the I/O drivers and the Fibre Channel stack, the HBAs, the SAN and storage virtualization, through HA, DR, encryption, data reduction and RAID or RAID-less data protection methods, until the information finally ends up on the HDD or flash chip. If all these components are set up properly to mesh like cogwheels, a small physical error or a few dropped frames should not be noticeable at all. That is exactly what all the redundancies, error recovery methods and time-out values are for.
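To make that last point a bit more tangible, here is a minimal sketch of how a multipath layer can absorb a few dropped frames on one path so that the application never sees an I/O error. It is a toy simulation, not how any real multipath driver is implemented: the classes `Path` and `MultipathDevice`, the path names and the failure counter are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


class PathError(Exception):
    """A single path failed to deliver an I/O (e.g. after a dropped frame)."""


@dataclass
class Path:
    """Hypothetical model of one host path (HBA port into one fabric)."""
    name: str
    # Number of upcoming I/Os on this path that will fail (simulated transient fault).
    failures_pending: int = 0

    def submit(self, io_id: int) -> str:
        if self.failures_pending > 0:
            self.failures_pending -= 1
            raise PathError(f"{self.name}: I/O {io_id} timed out")
        return f"I/O {io_id} completed via {self.name}"


@dataclass
class MultipathDevice:
    """Hypothetical multipath layer: round-robin with retry on the remaining paths."""
    paths: list
    path_errors: int = 0  # errors absorbed below the application

    def submit(self, io_id: int) -> str:
        count = len(self.paths)
        for offset in range(count):
            # Start at the round-robin path, then fall through to the others.
            path = self.paths[(io_id + offset) % count]
            try:
                return path.submit(io_id)
            except PathError:
                self.path_errors += 1
        # Only when every path has failed does the application see an error.
        raise OSError(f"I/O {io_id} failed on all {count} paths")


def run_demo():
    """Send 16 I/Os over 4 paths while one path drops its next 3 I/Os."""
    device = MultipathDevice([
        Path("fabricA-hba0", failures_pending=3),  # the path being "maintained"
        Path("fabricA-hba1"),
        Path("fabricB-hba0"),
        Path("fabricB-hba1"),
    ])
    results = [device.submit(io_id) for io_id in range(16)]
    return results, device.path_errors


if __name__ == "__main__":
    results, path_errors = run_demo()
    print(f"{len(results)} I/Os completed, {path_errors} path errors absorbed")
```

In this sketch the three failures on the first path are retried on a neighbouring path and stay invisible to the caller; an `OSError` only reaches the "application" if all paths fail for the same I/O. In a real environment the same idea depends on the time-out and retry settings of every layer fitting together, which is exactly the tuning work described above.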
For Brocade fabrics there is a nice comprehensive document to assist you in hardening your SAN:
(For Cisco I don't know of a similar current document. If you know one, I would be happy to add it here.)
If you feel you could use a helping hand with that, the Storageneers are prepared to support you. Not just the ones contributing to this blog, but all the IBM storage experts all over the world. If even the thought of a single I/O error sends shivers down your spine, engage with your IBM sales rep as soon as possible and allow us to get you out of that misery.
Turn your storage environment from a sword of Damocles back into what it should be: the solid foundation of your business operations!
(You are not sure how to approach IBM properly to get the support you need? Send me a mail (seb