If you work in technical support, generally speaking your job is to fix what's broken. But working in SAN support is mostly about solving complex problems. The SAN connects everything with everything in the storage world, and often that's a lot. Oh yes, there are well-planned and "troubleshooting-friendly" environments out there, managed by top-skilled administrators using state-of-the-art tools, with enough time between daily routine and important projects to spot problems before they even have an impact on the applications. At least I believe that these things exist, but most of the time I have not seen even a part of it. There are excellent multi-tenancy-capable products out there, maintained by a single part-time admin or by an operator some thousand miles away who monitors the environments of a dozen clients. And when there is a problem, this poor guy gets called by all the angry people who rely on a working IT, up to the C-levels. Then he opens a case with his SAN vendor.
Let's switch to the support guy. He takes the new case and reads: "Massive problem, SCSI error!" Yes, most of the time there is just a statement like this. That's okay for a start, because the so-called "Request Receipt Center" just creates cases administratively (OMG, is that even a word in the English language?). The first level of support, the so-called Frontend, will then call you and ask you about the problem. And they will (hopefully) put the information into a pattern called "EDANT" to have it in a structured form and to be able to hand it over to others (horizontally for shift changes or vertically for escalation). This first call (sometimes 2..n) is crucial because the most important thing is to actually understand the problem. That sounds trivial, but it's not. In fact, the whole problem determination will fail, or at least lag significantly, if this set of information is incomplete or contains false statements.
I know you will be under pressure. I know you have a thousand other things to do. I know some sales guy probably promised you "Our excellent support will solve all problems - if there'd ever be one - just by hearing the tone of your voice for 1.4 seconds!". But again, enabling the support guy to actually understand your problem is the most important thing, and you can hugely accelerate that process by preparing the information using the EDANT pattern.
So what's this EDANT pattern exactly? I have to admit, we stole it from the software guys. You will notice that by the wording. EDANT means:
E is for Environment. You (hopefully) know your environment, and maybe you have described it to IBMers several times before; maybe an IBM architect even designed it. But to be honest, IBMers don't share a collective consciousness like the Borg :o), and besides, things change. So what's needed is a good description of the environment as it relates to the current problem. This includes, among other things:
- A layout with the related switches and devices and the ports used to connect them.
- The machine/model information of the related switches, hosts, storage systems, etc.
- The firmware/OS/driver levels of all components.
- Time offsets between the clocks of the components. (Better use NTP!)
- If you use SAN extenders, describe them. Do you use CWDM/DWDM/TDM? Distance? Type? Vendor? Cards? Versions? Transparency? Do you use FCIP? Bandwidth? Quality?
- Additional specialities: Is any interop stuff going on? Is this a test SAN? Is this pre-production? Is this designed without redundancy? Stuff like this...
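The time-offset item is the easiest one to put into numbers. Here is a minimal Python sketch, assuming you read the clock of every component at the same moment as one reference clock. The component names and timestamps are made up for illustration.

```python
from datetime import datetime


def clock_offsets(reference, readings):
    """Return each component's clock offset in seconds relative to the reference.

    A positive value means the component's clock is ahead of the reference.
    """
    return {name: (ts - reference).total_seconds() for name, ts in readings.items()}


# Hypothetical readings, all taken at the same moment as the reference clock.
reference = datetime(2024, 1, 15, 10, 0, 0)
readings = {
    "switch_a": datetime(2024, 1, 15, 10, 0, 2),
    "switch_b": datetime(2024, 1, 15, 9, 58, 30),
    "host_1": datetime(2024, 1, 15, 10, 0, 0),
}

offsets = clock_offsets(reference, readings)
for name, seconds in sorted(offsets.items()):
    print(f"{name}: {seconds:+.0f} s")
```

With offsets like these written down, the support can line up log entries from different devices even if NTP is not in use.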
D is for Description. Please describe your problem as precisely and as comprehensively as possible.
- When did it start?
- What happened?
- Where can you notice it?
- What do the switches report?
- What do the other devices report?
- What was being done when the problem happened?
- What is the impact?
A is for Actions Done. Opening a case is most probably not the first thing you do when the phones begin to ring. By the time a case reaches me, "someone" has already done "something". Maybe you have a plan for situations like this. Maybe someone demands "Do things!". Maybe you switched off "culprit candidates". All this should be documented as accurately as possible. With time stamps! And of course with results. Everything that changed in the environment since the problem occurred is worth mentioning, including counter resets. Do as much as possible from the CLI (Command Line Interface) and use session logging. Precious!
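If you have no session log for some of the steps, even a tiny action log helps. The following is a minimal Python sketch of such a log, not a tool the support expects you to use. The `ActionLog` class and the example actions are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class ActionLog:
    """Collects (timestamp, action, result) entries for the 'Actions Done' section."""

    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, action: str, result: str, when: Optional[datetime] = None) -> None:
        # Default to the current UTC time if no explicit timestamp is given.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, action, result))

    def render(self) -> str:
        # Sort by timestamp so the support sees the steps in the order they happened.
        lines = []
        for when, action, result in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{when.isoformat(timespec='seconds')} | {action} | {result}")
        return "\n".join(lines)


# Hypothetical example entries, deliberately recorded out of order.
log = ActionLog()
log.record("Reset error counters on switch_a", "counters cleared",
           when=datetime(2024, 1, 15, 10, 20, tzinfo=timezone.utc))
log.record("Disabled port 3 on switch_a", "errors on host_1 stopped",
           when=datetime(2024, 1, 15, 10, 5, tzinfo=timezone.utc))

rendered = log.render()
print(rendered)
```

Each entry keeps the time stamp, the action and its result together, which covers the three things asked for above.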
N is for Next Actions. This section is for everything you already plan (maintenance windows, replacements, recovery actions, internal and external deadlines) and for everything you expect from the support. The second point is not trivial, either. Of course you want the support to solve the problem. But what is most important? Do you need a workaround first, to get things working again? Do you need an RCA (Root Cause Analysis) the next day? Does the problem have to be solved overnight, and will a contact person be available to provide data and further info? State your expectations to get the right help.
T is for Test Case. Okay, this one is clearly from software support. It covers the data collections plus any additional data and a description of them, like the session logs mentioned above. Screenshots, performance data or scripts belong here too. Usually the support offers a way to upload all the stuff. Please be aware that IBM, for example, doesn't keep data collections from cases until the end of days. So if you uploaded something for another, already closed case 6 months ago, it's most probably gone.
Using this pattern to structure the info should avoid any communication-based delays. It may sound like a lot of stuff in the beginning. But it's definitely worth it.
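If you want to prepare the information before the first call, the five sections can be kept in a simple template. Here is a minimal Python sketch under the assumption that plain free text per section is enough. The `EdantReport` class and the example content are made up for illustration and are not an official IBM format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EdantReport:
    """One free-text field per EDANT section, in the order of the acronym."""

    environment: str = ""
    description: str = ""
    actions_done: str = ""
    next_actions: str = ""
    test_case: str = ""

    def missing_sections(self) -> List[str]:
        # A section counts as missing if it is empty or only whitespace.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        # Turn field names like "actions_done" into headings like "Actions Done".
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").title()
            body = getattr(self, f.name).strip() or "(not provided yet)"
            parts.append(f"{title}:\n{body}")
        return "\n\n".join(parts)


# Hypothetical, partly filled report.
report = EdantReport(
    environment="Two fabrics, host_1 attached to switch_a and switch_b.",
    description="SCSI errors on host_1 since 09:40, one application affected.",
    actions_done="10:05 disabled port 3 on switch_a, errors stopped.",
)

missing = report.missing_sections()
print(report.render())
print("Still missing:", ", ".join(missing))
```

Checking for missing sections before opening the case is a cheap way to find out what the Frontend would otherwise have to ask for.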