How to avoid support data amnesia
seb_
There are two major categories of problems usually reported to technical support:
1. Ongoing or recurring problems
2. One-time problems that happened in the past
So what to do?
It's not that difficult. For the ongoing or recurring problems of category 1, it's all about setting a baseline and gathering the right data at the right time. You don't want the investigation to be flawed by high-volume error messages that have nothing to do with the problem. Often there are error counters that increased at some point in the past without any relation to the problem you have right now. False indications and misguided troubleshooting are time-consuming and, in the worst case, lead to wrong assumptions and therefore to wrong and possibly even harmful action plans. You certainly don't want that.
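To illustrate the baseline idea, here is a minimal sketch in Python. It assumes you already have two snapshots of error counters as simple name/value pairs, one taken as the baseline and one taken after the problem showed up again. The counter names and values are made up for illustration and do not come from any specific product.

```python
def counter_deltas(baseline, current):
    """Return only the counters that increased since the baseline snapshot."""
    deltas = {}
    for name, value in current.items():
        # A counter missing from the baseline is treated as starting at 0.
        increase = value - baseline.get(name, 0)
        if increase > 0:
            deltas[name] = increase
    return deltas


# Hypothetical counter snapshots (names and values are illustrative only).
baseline = {"port0.crc_err": 1200, "port0.enc_out": 35, "port1.crc_err": 0}
current = {"port0.crc_err": 1200, "port0.enc_out": 41, "port1.crc_err": 3}

print(counter_deltas(baseline, current))
# → {'port0.enc_out': 6, 'port1.crc_err': 3}
```

The large but unchanged port0.crc_err value drops out of the result. It increased sometime in the past and says nothing about the problem you have right now.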
But for category 2 problems, deleting the messages from the past and clearing the current error counters would be a big problem. While this is often the first reaction of an admin, or even the first action suggested by a support person, it will void any root cause analysis. If you wipe the slate clean, what would be left to investigate? Therefore it's most important to gather as much data as possible before destroying what's needed to find out what happened. And deliberately clearing the counters is not the only way to destroy it.
So the best approach is to gather data from all devices and components that are related to the problem, starting with the device reporting the problem and then the ones connected to it.
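The collection order described above (reporting device first, then whatever is connected to it) can be sketched as a breadth-first walk over the connections. This is only an illustration: the topology map and the device names are hypothetical, and in practice you would build such a map from your own documentation or fabric data.

```python
from collections import deque


def collection_order(topology, reporting_device):
    """List devices starting at the reporting one, then moving outwards hop by hop."""
    order = [reporting_device]
    seen = {reporting_device}
    queue = deque([reporting_device])
    while queue:
        device = queue.popleft()
        for neighbor in topology.get(device, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical topology: each device maps to the devices it is connected to.
topology = {
    "host1": ["switchA"],
    "switchA": ["host1", "svc_cluster"],
    "svc_cluster": ["switchA", "switchB"],
    "switchB": ["svc_cluster", "backend1"],
    "backend1": ["switchB"],
}

print(collection_order(topology, "host1"))
# → ['host1', 'switchA', 'svc_cluster', 'switchB', 'backend1']
```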
But even if no action takes place, evidence will be lost over time. Many of these one-time problems recover by themselves. Nobody really did anything and still it works again. Those are the cases where I get data from today to find the reason for a problem from two weeks ago. Nobody cleared any counter or error log, and still no root cause can be found anymore.
So while you don't see much in the normal error log, the target may still have logged something internally. Take the IBM SAN Volume Controller (SVC) for example. Towards its virtualized backend storage systems the SVC acts as an initiator. You'll find lots of information about error recovery that took place against them (if there was any). But you'll hardly find anything regarding the hosts. That's where the SVC is the target. And still it's important to gather its data, as early as possible and as much as possible. These internal logs wrap quite quickly, but if you gather them in time, chances are good that they still contain the timeframe of the problem. For SVC (and the whole Storwize family) it's usually in the livedumps (a.k.a. statesaves), so better create new ones. Other products usually have extended data collections, too.
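Whether a wrapping log is still useful comes down to a simple check: does the span between its oldest and newest entry still include the time of the problem? A minimal sketch, assuming you can read those two timestamps from the collected data. The function name and all dates are hypothetical.

```python
from datetime import datetime


def log_covers_incident(oldest_entry, newest_entry, incident_start, incident_end):
    """Return True if the collected log still spans the whole incident window."""
    return oldest_entry <= incident_start and incident_end <= newest_entry


# Hypothetical case: data gathered two weeks after the problem occurred.
oldest = datetime(2024, 3, 14, 8, 0)
newest = datetime(2024, 3, 15, 9, 30)
incident_start = datetime(2024, 3, 1, 22, 10)
incident_end = datetime(2024, 3, 1, 22, 25)

print(log_covers_incident(oldest, newest, incident_start, incident_end))
# → False
```

In this example the log has already wrapped past the incident, which is exactly the situation where no root cause can be found anymore.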
So always keep in mind, for everything that happened in the past: gather the data before you actually do something. Or:
Before you jump, save a dump!