How to diagnose a problem with the Db2 pureScale Feature

When you encounter a problem, it is important to define and isolate the problem accurately before you attempt to resolve it. Use the diagnostic steps in this topic to guide yourself through the problem-determination process with the troubleshooting tools that are included with the Db2 pureScale Feature. In many cases, you might be able to resolve the problem yourself.

The troubleshooting documentation covers the following major areas:

- Obtaining diagnostic information, both logs and traces
- Installing, getting started with, and uninstalling the Db2 pureScale Feature
- Component failures, such as host or member failures, including how to identify and resolve states, alerts, and restart conditions
- Communication failures (RDMA)
- Cluster file system problems and failures (GPFS)
- Cluster manager software failures
- Problem scenarios where you should call IBM Service

Is it a current problem or a recurring problem?

After you have installed the Db2 pureScale Feature and it is up and running, problems broadly fall into one of two categories: problems that affect your server right now, and problems that occurred at some point in the past but from which the Db2 pureScale Feature probably recovered automatically.

Problems that affect your server right now: These problems typically exhibit one or several readily apparent symptoms that affect your server in some way. For example, you might receive user reports of a performance slowdown, or you might observe a loss of capacity in the system yourself. Overall, the system continues to be available. Such symptoms are indicative of a problem with a member or a cluster caching facility (CF), which you can investigate directly.
If you observe a complete system outage, there are several possible culprits, but the first one you should check is whether there is at least one CF up and running to provide essential infrastructure services to the instance. Without a CF, there is no component available to handle locking and caching for members, and the instance cannot run.
Problems that occurred at some point in the past: These problems typically do not appear to affect your server now, but you might have some indication that there was a problem in the past. For example, you might see some unexplained log entries that point to something going on in your system that should be diagnosed. The highly available and fault-tolerant nature of the Db2 pureScale Feature can mask some problems, because the instance continues to recover from most component failures without requiring your intervention, even if a component fails repeatedly. Effectively, you need to monitor for intermittent or recurring problems over time to determine whether there is a hidden problem on your server that keeps getting fixed by itself only to recur later on, and then resolve the underlying cause.
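
As a minimal sketch of what monitoring for recurring problems can look like, the following Python example counts how often each component appears in a list of past failure events and flags the components that failed repeatedly within a look-back period. The event tuples, component names, and threshold are hypothetical and serve only as an illustration; how you extract such events from your own diagnostic data is not shown.

```python
from collections import Counter
from datetime import datetime, timedelta

def recurring_components(events, now, lookback, threshold):
    """Return components with at least `threshold` events inside the look-back period.

    events: iterable of (timestamp, component) tuples (hypothetical format).
    """
    cutoff = now - lookback
    counts = Counter(comp for ts, comp in events if cutoff <= ts <= now)
    return sorted(comp for comp, n in counts.items() if n >= threshold)

# Hypothetical events, for illustration only
events = [
    (datetime(2024, 3, 1, 2, 15), "member 1"),
    (datetime(2024, 3, 3, 2, 20), "member 1"),
    (datetime(2024, 3, 5, 2, 10), "member 1"),
    (datetime(2024, 3, 4, 11, 0), "CF 128"),
]
flagged = recurring_components(
    events, now=datetime(2024, 3, 6), lookback=timedelta(days=7), threshold=3
)
print(flagged)  # prints ['member 1']
```

A component that is flagged this way might be recovering automatically each time, which is exactly the kind of hidden problem that is worth diagnosing further.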

Diagnostic information you should look at

You need to understand when and why the problem occurs, based on your problem definition and the diagnostic data available to you. As you go through the diagnostic data, look for events or alerts that correlate the symptoms in your problem definition with a possible cause.

The Db2 pureScale-specific commands you use for troubleshooting can be run from any member or cluster caching facility host to obtain instance-wide information. Even if the database is unavailable or the instance is stopped, reporting of some data is often still possible, as long as the Db2 pureScale cluster manager component is available on the host where you issue the command.

The recommended sequence of diagnostic steps for the Db2 pureScale Feature is:

Issue the db2instance -list command (Linux only) to identify current issues affecting the instance. Check for alerts that affect host computers first, then check the CFs and the members for alerts. You can find the current state of members, cluster caching facilities, and hosts in the output, which will indicate whether alerts, errors or recovery-related states exist.
Issue the db2cluster -cm -list -alert command (AIX only) or the db2cm -list -alert command (Linux only) to check for recommended actions for these alerts. The output of the db2cluster command or the db2cm command tells you what alerts exist, what the effects of the alert conditions are, and the suggested actions to resolve them. Some alerts require administrators to manually reset the alert field using the db2cluster -cm -clear -alert (AIX) or db2cm -clear-alert (Linux) command. If the alerts persist or if you are unable to resolve them, engage Db2 support.
Check the diagnostic logs available to you for any hints that might indicate a possible cause for the problem. If you know roughly when the problem started, you can narrow down your search by checking log entries with a corresponding timestamp.
- Check the DIAGPATH file path for any diagnostic files produced and check the FODC directories
- Check the CF_DIAGPATH file path for any cluster caching facility diagnostic data produced
- Check the db2diag.log diagnostic log file for recent diagnostic information and error messages
- Check the notification log for recent messages that might indicate the starting point of the problem
- Look for log entries that correlate with the time of failure
  - On AIX® environments, configure syslog and also look for entries in the output of errpt -a
  - On Linux environments, look for messages in /var/log/messages
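
If you know roughly when the problem started, narrowing the log search to a time window can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: it assumes that each record in the log begins with a header line whose timestamp has the form 2024-03-01-14.00.00.000000-300, it ignores the time zone offset, and the function names and sample records are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumption: each record starts with a header line whose timestamp looks like
# 2024-03-01-14.00.00.000000-300 (date, time, fraction, time zone offset).
HEADER_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2})\.\d+")

def header_time(line):
    """Return the timestamp of a record header line, or None for other lines."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d-%H.%M.%S")

def records_near(lines, failure_time, window):
    """Return the records (lists of lines) whose header falls within the window."""
    result, current, keep = [], [], False
    for line in lines:
        ts = header_time(line)
        if ts is not None:
            # A new header closes the previous record.
            if keep and current:
                result.append(current)
            current = [line]
            keep = abs(ts - failure_time) <= window
        elif current:
            current.append(line)  # continuation line of the current record
    if keep and current:
        result.append(current)
    return result

# Hypothetical sample records, for illustration only
sample = [
    "2024-03-01-14.00.00.000000-300 I1000E400 LEVEL: Error",
    "MESSAGE : first sample record",
    "",
    "2024-03-01-16.30.00.000000-300 I1400E400 LEVEL: Warning",
    "MESSAGE : second sample record",
]
matches = records_near(sample, datetime(2024, 3, 1, 14, 5), timedelta(minutes=10))
print(len(matches))  # prints 1
```

To apply the same idea to a real file, you could pass the lines of the file (with trailing newlines stripped) instead of the sample list, and widen or narrow the window until the entries line up with the time of failure.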

Next steps

The result of stepping through the available diagnostic information will determine what troubleshooting scenarios you should look at next. Once you have narrowed down the scope of the problem, you can navigate through the Db2 pureScale troubleshooting documentation to find the context that most likely applies to your problem definition. Often, you will find two types of troubleshooting content: very context-specific answers to frequently asked troubleshooting questions (FAQs), and more comprehensive troubleshooting scenarios that show you how to interpret the diagnostic data and resolve the problem.

For example, if the diagnostic information you looked at shows that a member has an alert and is waiting to fail back to its home host after having failed over to a guest host for recovery purposes (an operation known as a restart light), you can locate the associated failure scenario in the troubleshooting documentation by looking at the Db2 pureScale instance operation scenarios, and then at the subsection for a member or host with an alert. Not all possible failure scenarios are covered, but many are.