Determine if a cluster problem exists
Start here to diagnose your cluster problems.
At times, your cluster might not seem to be operating correctly. When you suspect a problem, use the following steps to determine whether a problem exists and what its nature is.
- Determine if clustering is active on your system.
To determine if cluster resource services are active, look for the two jobs, QCSTCTL and QCSTCRGM, in the list of system jobs. If these jobs are active, then cluster resource services are active. You can use the Work Management function in IBM® Navigator for i to view jobs, or use the WRKACTJOB (Work with Active Jobs) command. You can also use the DSPCLUINF (Display Cluster Information) command to view status information for the cluster.
- Additional jobs for cluster resource services may also be active. The Cluster jobs topic provides information about how cluster resource services jobs are formatted.
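If you have exported the names of the active jobs to a script, the check described above can be automated. The following is a minimal sketch, assuming the job names are already available as a list of strings; how that list is obtained is not covered here, and the function name `missing_cluster_jobs` is hypothetical.

```python
# Job names taken from the text above; both must be active for
# cluster resource services to be considered active.
REQUIRED_CLUSTER_JOBS = ("QCSTCTL", "QCSTCRGM")


def missing_cluster_jobs(active_job_names):
    """Return the required cluster jobs that are absent from the given names.

    active_job_names is assumed to be an iterable of job-name strings,
    for example copied from a WRKACTJOB listing. Names are compared
    case-insensitively after trimming whitespace.
    """
    active = {name.strip().upper() for name in active_job_names}
    return [job for job in REQUIRED_CLUSTER_JOBS if job not in active]
```

An empty result means both jobs were found. Any names that are returned identify the jobs to investigate.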
- Determine the cause of a CPFBB26 message.
This error can mean that either the CRG job is not active or the cluster is not active. Use the DSPCLUINF (Display Cluster Information) command to determine if the node is active. If the node is not active, start the cluster node. If it is active, also check the CRG to determine whether the CRG has problems.
Message . . . . : Cluster Resource Services not active or not responding.
Cause . . . . . : Cluster Resource Services is either not active or cannot respond to this request because a resource is unavailable or damaged.
Look for the CRG job in the list of system jobs. You can use the Work Management function in IBM Navigator for i to view jobs, or use the WRKACTJOB (Work with Active Jobs) command. You can also use the DSPCRGINF (Display CRG Information) command to view status information for a specific CRG by specifying the CRG name in the command. If the CRG job is not active, look at the CRG job log to determine why it ended. Once the problem is fixed, you can restart the CRG job with the CHGCLURCY (Change Cluster Recovery) command or by ending and restarting clustering on that node.
- Look for messages indicating a problem.
- Ensure that you can review all messages associated with a cluster command by pressing F10, which toggles between "Include detailed messages" and "Exclude detailed messages". Choose to include all detailed messages and review them to determine if other actions are necessary.
- Look for inquiry messages in QSYSOPR that are waiting for a response.
- Look for error messages in QSYSOPR that indicate a cluster problem. Generally, these are in the CPFBB00 to CPFBBFF range.
- Display the history log (DSPLOG CL command) for messages that indicate a cluster problem. Generally, these are in the CPFBB00 to CPFBBFF range.
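When you scan a large set of message IDs, it helps to remember that the CPFBB00 to CPFBBFF range is hexadecimal, so IDs such as CPFBB4A or CPFBBFF belong to it. The following is a minimal sketch of a filter for that range, assuming the message IDs have already been extracted as strings; the function names are hypothetical.

```python
import re

# Cluster messages are generally in the range CPFBB00 to CPFBBFF.
# The last two characters are hexadecimal digits (0-9, A-F).
_CLUSTER_MSG_ID = re.compile(r"CPFBB[0-9A-F]{2}")


def is_cluster_message(msg_id: str) -> bool:
    """Return True if msg_id falls in the CPFBB00 to CPFBBFF range."""
    return _CLUSTER_MSG_ID.fullmatch(msg_id.strip().upper()) is not None


def cluster_messages(msg_ids):
    """Keep only the message IDs that fall in the cluster range."""
    return [m for m in msg_ids if is_cluster_message(m)]
```

A message ID outside this range can still be relevant to a cluster problem. The filter only narrows the list to the IDs that are generally cluster specific.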
- Look at job logs for the cluster jobs for severe errors.
These jobs are initially set with a logging level of (4 0 *SECLVL) so that you can see the necessary error messages. Ensure that these jobs and the exit program jobs have the logging level set appropriately. If clustering is not active, you can still look for spooled files for the cluster jobs and exit program jobs.
- If you suspect a hang condition, look at the call stacks of the cluster jobs.
Determine if any program is in a DEQW (dequeue wait) state. If so, check the call stack of each thread and see if any of them have getSpecialMsg in the call stack.
- Check for cluster vertical Licensed Internal Code (VLIC) log entries.
These log entries have a 4800 major code.
- Use the NETSTAT command to determine if there are any abnormalities in your communications environment.
NETSTAT returns information about the status of Internet Protocol network routes, interfaces, TCP connections, and UDP ports on your system.
- Use NETSTAT Option 1 (Work with TCP/IP interface status) to ensure that the IP addresses chosen for clustering show an 'Active' status. Also ensure that the LOOPBACK address (127.0.0.1) is active.
- Use NETSTAT Option 3 (Work with TCP/IP Connection Status) to display the port numbers (F14). Local port 5550 should be in a 'Listen' state. This port must be opened using the STRTCPSVR *INETD command; a QTOGINTD (User QTCP) job in the Active Jobs list shows that this has been done. If clustering is started on a node, local port 5551 must be opened and be in a '*UDP' state. If clustering is not started, port 5551 must not be opened; if it is, it prevents clustering from starting successfully on that node.
- Use PING to check whether there is a communications problem. If you try to start a cluster node and there is a communications problem, you may receive an internal clustering error (CPFBB46). However, PING does not work between IPv4 and IPv6 addresses, or if a firewall is blocking it.
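As a complement to the NETSTAT and PING checks, you can probe from another machine whether a node accepts TCP connections on port 5550. The following is a minimal sketch, assuming Python is available on the machine that runs the probe; the function name is hypothetical, and the host name in the usage comment is a placeholder. It applies only to the TCP listener on port 5550. Port 5551 is a UDP port and cannot be checked this way.

```python
import socket


def tcp_port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    A False result means the connection was refused, timed out, or the
    host could not be resolved or reached. It does not say which.
    """
    try:
        # create_connection resolves the host and tries each address
        # (IPv4 or IPv6) until one connects or all fail.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage; replace the host name with one of your cluster nodes:
# tcp_port_accepts_connections("node1.example.com", 5550)
```

A failed probe does not distinguish between a port that is not listening and a firewall that blocks the traffic, so treat the result as a hint and confirm it with NETSTAT on the node itself.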