What is needed to troubleshoot an unexpected reboot in a clustered environment?
Gareth Holl
The simple answer is RSCT (the "cluster" software) trace data from key daemons (sub-systems) like IBM.ConfigRM, cthags, cthats, and possibly even RMCd. IBM Support would use this trace data to confirm whether or not it was RSCT that forced the reboot, and if so, then why.
However, collecting these trace files before they wrap is not always that simple. The trace files for all the RSCT and TSAMP core daemons and Resource Managers are set up with a fixed size and are written First-In, First-Out (FIFO), so the oldest entries are overwritten first. Some of these trace files are very busy, and therefore historical trace data can be lost rather quickly.
Let's say there was an incident overnight, at around 2am, that resulted in one of the clustered servers being forced to reboot by the cluster software in its effort to protect itself or critical resources. You arrive at work at, say, 8am and find that your application is now running on the standby server. Firstly, let's celebrate the fact that TSAMP (and RSCT) did its job by keeping your application highly available. But now you want to find out why there was a failover to the standby server. You find that the primary server was rebooted around 2am. Unfortunately, in the 6+ hours that passed before you found out about this incident, the vital trace data could already have been overwritten.
There are a couple of ways to deal with this kind of situation. For one, you could set up trace spooling, which is a means of saving trace files to a separate directory before the active trace files are overwritten. However, the configuration of trace spooling can be tricky, almost an art, and the spooled trace data still has to be collected manually before it is eventually cleared to make way for new historical data. If you want more information on trace spooling, you could start here:
The alternative is to automatically run the TSAMP/RSCT diagnostic data collector (getsadata) before the trace data is lost. In other words, collect the data soon after the problem event, without any user intervention. This is quick and painless: it simply involves creating a crontab entry that runs at startup. Refer to the following URL for setup details:
The output tar file from running getsadata is placed in the /tmp directory by default. However, you can add the -outdir option to the getsadata command within the crontab if you want the output to be stored elsewhere. This could be important if /tmp is automatically cleared after each reboot.
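To make the crontab idea above a little more concrete, here is a minimal sketch of what such an entry might look like. It is not taken from the setup document referred to above. The getsadata install path and the output directory are assumptions chosen for illustration, and the @reboot keyword is a cron extension (found in Vixie cron and cronie on Linux) that may not be available in every cron implementation. Only the -outdir option comes from the text above; check the getsadata usage on your own system for any other options you may need.

```shell
#!/bin/sh
# Sketch only: print an @reboot crontab line that runs getsadata at startup.
# ASSUMPTION: getsadata lives at this path on your system (verify locally).
GETSADATA=/usr/sbin/rsct/bin/getsadata
# ASSUMPTION: a hypothetical directory that is not cleared at reboot.
OUTDIR=/var/tmp/getsadata

# Build the crontab line. @reboot is a cron extension, not POSIX.
cron_entry() {
    printf '@reboot mkdir -p %s && %s -outdir %s\n' \
        "$OUTDIR" "$GETSADATA" "$OUTDIR"
}

# Show the line so it can be reviewed before it is installed.
cron_entry
```

The script only prints the line and changes nothing. Once you are happy with it, one common way to append it to root's crontab is to pipe the output of "crontab -l" followed by the new line back into "crontab -". Keeping the output directory outside /tmp follows the point made above about /tmp possibly being cleared at reboot.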