Cluster: Best Practice to Recover From Cluster Hang

Troubleshooting

Problem

This document provides basic steps to try to recover from a possible hang in the cluster.

Symptom

Cluster commands or menu options appear to hang.

Resolving The Problem

As a best practice, if a user runs into a cluster hang, the following sequence of steps is recommended to attempt to get out of the hang situation:

*************************************************************************************************************************
The OVERALL, VERY FIRST step before proceeding should be to use QMGTOOLS to gather cluster data, or to work with support to gather the necessary debug documentation.
Refer to the following document: "PowerHA: How To Use QMGTOOLS MustGather Data Capture Tool (for High Availability Only)"
*************************************************************************************************************************

IMPORTANT - Run CHGCLURCY on the hung node. End the jobs in this order: CRG jobs first, then QCSTCRGM, and finally QCSTCTL if needed.

1. First run CHGCLURCY CLUSTER(*) CRG(CRG) NODE(*) ACTION(*END) for any CRG job that may be hanging (where CRG is the name of an actual CRG or Admin Domain). If the CRG job has not ended within 35 seconds, the command issues an end job for the CRG job. The command waits up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end. CHGCLURCY can be called again to end the CRG job, but doing so will most likely have no effect.

*NOTE - If the hang affects only a particular CRG job, there is no need to continue ending the rest of the cluster jobs (QCSTCRGM or QCSTCTL). The CRG job that was ended in step 1 can be restarted with CHGCLURCY *STRCRGJOB. If the goal is to end everything and restart cleanly on that node, proceed to step 2 to end QCSTCRGM.

2. Once the CRG job or jobs have ended, run CHGCLURCY CLUSTER(*) CRG(QCSTCRGM) NODE(*) ACTION(*END) to try to end the QCSTCRGM job. If the QCSTCRGM job has not ended within 35 seconds, the command issues an end job for the QCSTCRGM job. The command waits up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end. CHGCLURCY can be called again to end the QCSTCRGM job, but doing so will most likely have no effect.

*NOTE - If everything goes well, ending the QCSTCRGM job also ends the QCSTCTL job (and vice versa, ending the QCSTCTL job also ends the QCSTCRGM job).

3. If the QCSTCRGM job ends and the QCSTCTL job is still running, run CHGCLURCY CLUSTER(*) CRG(QCSTCTL) NODE(*) ACTION(*END) to try to end the QCSTCTL job. If the QCSTCTL job has not ended within 35 seconds, the command issues an end job for the QCSTCTL job. The command waits up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end. CHGCLURCY can be called again to end the QCSTCTL job, but doing so will most likely have no effect.
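
The ending order in steps 1 through 3 can be sketched as a small helper that, given the cluster jobs still running on the hung node, returns the next CHGCLURCY command to issue. This is an illustrative Python sketch only. The function names, the `running` set, and the `end_all` flag are hypothetical, and the sketch only builds the command text shown in the steps above. It does not run anything on IBM i or check job status itself.

```python
# Illustrative sketch: choose the next CHGCLURCY *END command in the
# documented order (CRG jobs first, then QCSTCRGM, then QCSTCTL).
# All names here are hypothetical helpers, not part of any IBM API.

SYSTEM_JOBS = ("QCSTCRGM", "QCSTCTL")  # ended only after the CRG jobs


def end_command(job_name):
    """Build the command text from the steps above for one job name."""
    return f"CHGCLURCY CLUSTER(*) CRG({job_name}) NODE(*) ACTION(*END)"


def next_end_command(running, crg_names, end_all=True):
    """Return the next command to issue, or None if nothing is left.

    running   -- set of job names still active on the hung node
    crg_names -- names of the CRGs or Admin Domains on that node
    end_all   -- False when only a CRG job is hung, so QCSTCRGM and
                 QCSTCTL are left running (see the note after step 1)
    """
    # Step 1: any CRG job that is still running is ended first.
    for name in crg_names:
        if name in running:
            return end_command(name)
    if not end_all:
        return None
    # Steps 2 and 3: QCSTCRGM, then QCSTCTL only if it is still running.
    for name in SYSTEM_JOBS:
        if name in running:
            return end_command(name)
    return None
```

Calling `next_end_command` again after each command, with an updated `running` set, walks through the steps in order. Because ending QCSTCRGM normally ends QCSTCTL as well, the QCSTCTL command is returned only if that job is still listed as running.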

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Example API calls to end system cluster jobs for rare scenarios

Ending CRG jobs example:
CALL PGM(QWCCTLSJ) PARM('*END' 'CRG' 'QSYS' 'nnnnnn') - where CRG = name of the actual CRG job and nnnnnn = actual job number.

Ending QCSTCRGM example:
CALL PGM(QWCCTLSJ) PARM('*END' 'QCSTCRGM' 'QSYS' 'nnnnnn') - where nnnnnn = actual job number.

Ending QCSTCTL example:
CALL PGM(QWCCTLSJ) PARM('*END' 'QCSTCTL' 'QSYS' 'nnnnnn') - where nnnnnn = actual job number.
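
The three example calls differ only in the job name and job number, so they can be generated from one template. The Python sketch below is a hypothetical convenience for building the call text. The function name is invented for illustration, and the six-digit check simply mirrors the nnnnnn placeholder used in the examples. It does not invoke QWCCTLSJ.

```python
import re


def qwcctlsj_end_call(job_name, job_number, user="QSYS"):
    """Build the CALL text from the examples above for one cluster job.

    job_name   -- CRG name, QCSTCRGM, or QCSTCTL
    job_number -- the actual job number as a six-digit string (nnnnnn)
    user       -- job user; the examples above all use QSYS
    """
    # Reject anything that does not match the nnnnnn placeholder format.
    if not re.fullmatch(r"\d{6}", job_number):
        raise ValueError("job number must be exactly six digits")
    return (
        f"CALL PGM(QWCCTLSJ) "
        f"PARM('*END' '{job_name}' '{user}' '{job_number}')"
    )
```

Passing the job number as a string preserves any leading zeros.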

***If the goal is to get out of the hang by ending and restarting all nodes, then after completing the steps above, ensure that the QCSTCTL job is in OUTQ status (use WRKJOB QCSTCTL) on all nodes before attempting to start the nodes again.***

Related Information

How to obtain and install QMGTOOLS

How to use QMGTOOLS for High Availability data capture

Product: IBM i. Platform: IBM i. Category: High Availability->Cluster. Version: 7.4.0 and future releases.
