In a previous blog, I talked about the importance of collecting and providing diagnostic data. For the Tivoli System Automation for Multiplatforms (TSAMP) product, this means running its automated data collection utility called "getsadata".
However, there are details about a problem situation that cannot be obtained by any tool, script, or set of commands. The most obvious is the problem description itself. So what does a good problem description entail?
Well, a timeline, for one. When Support staff have to dig into the log and trace data, a timeline allows us to get to the most relevant messages much more quickly. Consider that at least one of the core trace files we use can contain thousands of lines of trace messages covering only a few seconds of time. A clear timeline means a quicker turn-around for you. On the flip side, if your original problem description is misinterpreted and you haven't given us a clear timeline of events, we might start looking at the wrong time period in the log and trace data ... ultimately this could result in an analysis that doesn't make sense to you, because it's not relevant to the problem you are focused on ... bottom line, time wasted for all of us. So: timeline, timeline, timeline, and don't forget timeline.
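To sketch what I mean, a useful timeline can be as simple as a few timestamped lines. (The hostnames, resource group name, and times below are made up for illustration; yours will obviously differ.)

```
2018-03-06 09:32 EST  appRG Online on node1; saved lssam output
2018-03-06 09:41 EST  node1 rebooted unexpectedly
2018-03-06 09:43 EST  expected failover to node2 did not happen; appRG still not Online
2018-03-06 09:55 EST  manually restarted the application on node2; Online again by 09:58
```

Even four lines like this tells us exactly which windows of log and trace data to focus on, and in what order things happened.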
A common theme in the problem descriptions we see is incorrect use of terminology. This is not a criticism. It is the reality of many of our customers being thrown into the deep end, supporting a solution built on a product they don't have much experience with and don't have time to attend formal education for. Where this becomes a problem is in how Support interprets what is really being described or asked. So, to alleviate this, the single most important piece of supporting information you can provide with your problem description is the output of 'lssam -nocolor'. You would be surprised how many times we have been able to explain a situation and answer a client's question without looking at any log or trace data, just by checking what 'lssam' has captured. But for this to be useful, you need to remember to save off the output of 'lssam' at the time you're observing the problem you need help with. The output of 'lssam' is just a snapshot of one brief instant in time.
What can 'lssam' tell Support? Firstly, it shows us what resources TSAMP is managing and how they are grouped. It tells us which nodes (servers) these resources can run on or are running on. Of course, the primary reason 'lssam' exists is to show you the operational state (OpState) of each resource (online, offline, pending online|offline, failed offline, and so on), and this is certainly valuable information for Support staff, particularly if you're looking for guidance on what to do next as part of recovery efforts. But again, the OpState information is only valuable if it reflects the states that you observed during the problem period. Make it a habit to run 'date >> lssam.out; lssam -nocolor >> lssam.out' whenever you see something unusual or something you *think* you may want to follow up with Support about.
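If you want to make that habit harder to forget, the one-liner can be wrapped in a tiny helper. This is a minimal sketch of my own (the 'snapshot' function name and 'lssam.out' file name are illustrative, not part of TSAMP):

```shell
#!/bin/sh
# Minimal sketch: append a timestamped snapshot of a command's output to a
# running log file, so every 'lssam' capture carries its own timestamp.
# 'snapshot' and the log file name are illustrative choices, not TSAMP features.
snapshot() {
  out_file="$1"; shift
  {
    date                                      # timestamp first, for the timeline
    "$@"                                      # the command whose output we capture
    echo "--------------------------------"   # separator between snapshots
  } >> "$out_file"
}

# Intended use on a node where TSAMP is installed:
#   snapshot lssam.out lssam -nocolor
```

One file then accumulates a dated history of OpStates over the whole problem period, instead of a single undated snapshot.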
Other useful hints:
1) If you're referring to a server as the primary, the standby, or the failover server, please attach hostnames (nodenames) to them in your problem description, since the concepts of primary, standby, etc. are meaningless to TSAMP and therefore meaningless to TSAMP Support staff.
2) If you're referring to an application that failed to start or didn't fail over, and so on, then please tell us the resource name for that application. The TSAMP product can be used to make practically any application highly available, so there is a good chance we won't know what application you are referring to unless you tell us the "resource" name within the TSAMP automation policy that is associated with your application ... this also goes back to what we would see in the output of 'lssam'.
3) Say an application (or resource) failed to start and you don't have 'lssam' output that shows this, perhaps because recovery efforts were already performed. If you still want the root cause determined, then you need to provide more details about the failed start attempt: on what node(s) did the resource fail to start, and when did the failed start occur? What did you do to try to make it start (start the domain, change a resource group's Nominal state to online, etc.)? How did you recover, assuming it's not currently down/offline?
4) Often, telling us what you expected in addition to what you observed can help us understand what you're reporting as a problem.
5) Then there are the classic questions, like: has this ever worked? And what has changed recently (yes, I know, nothing was changed)?
Finally, as I said in a previous blog about diagnostic data collection, providing a detailed problem description at the time you open a PMR will result in a quicker answer and/or resolution steps from the TSAMP Support staff. Note that the electronic Service Request (SR) webpage (https://www-947.ibm.com/support/servicerequest/Home.action) is the ideal method for opening new PMRs (and even updating existing ones), as it lets you control the problem description and gives you an immediate opportunity to upload supporting data ... you definitely cannot rely on the call center phone operators to accurately enter a problem description you dictate over the phone ... most of those problem descriptions, I find, are useless, to be blunt.