Troubleshooting
Problem
This document provides basic steps for trying to recover from a possible hang in the cluster.
Symptom
Cluster commands or menu options appear to hang.
Resolving The Problem
As a best practice, if you run into a cluster hang, the following sequence of steps is recommended to try to get out of the hang:
*************************************************************************************************************************
The OVERALL, VERY FIRST step before proceeding should be to run QMGTOOLS to gather cluster data, or to work with support to gather the necessary debug docs.
Refer to the following document: "PowerHA: How To Use QMGTOOLS MustGather Data Capture Tool (for High Availability Only)"
*************************************************************************************************************************
IMPORTANT - Run CHGCLURCY on the hung node. End the jobs in this order: CRG jobs first, then QCSTCRGM, and finally QCSTCTL if needed.
1. Run CHGCLURCY CLUSTER(*) CRG(CRG) NODE(*) ACTION(*END) for any CRG job that may be hung (where CRG is the name of an actual CRG or Admin Domain). If the CRG job has not ended within 35 seconds, the command will issue an end job on the CRG job. The command will wait up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end.
CHGCLURCY can be called again to end the CRG job, but will most likely not have any effect.
*NOTE - If the hang affects only a particular CRG job, there is no need to continue ending the rest of the cluster jobs (QCSTCRGM or QCSTCTL). The CRG job that was ended in step 1 can be restarted with CHGCLURCY *STRCRGJOB. If you want to end everything and restart clean on that node, proceed to step 2 to end QCSTCRGM.
2. Once the CRG job or jobs have ended, run CHGCLURCY CLUSTER(*) CRG(QCSTCRGM) NODE(*) ACTION(*END) to try to end the QCSTCRGM job. If the QCSTCRGM job has not ended within 35 seconds, the command will issue an end job on the QCSTCRGM job. The command will wait up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end. CHGCLURCY can be called again to end the QCSTCRGM job, but will most likely not have any effect.
*NOTE - If everything goes well, ending the QCSTCRGM job also ends the QCSTCTL job (and vice versa, ending the QCSTCTL job also ends the QCSTCRGM job).
3. If the QCSTCRGM job ends and the QCSTCTL job is still running, then run CHGCLURCY CLUSTER(*) CRG(QCSTCTL) NODE(*) ACTION(*END) to try and end the QCSTCTL job. If the QCSTCTL job has not ended within 35 seconds, the command will issue an end job on the QCSTCTL job. The command will wait up to 35 seconds for the job to end. If the job has not ended within that time, the command returns with a message that the job did not end. CHGCLURCY can be called again to end the QCSTCTL job, but will most likely not have any effect.
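The ordering in steps 1 to 3 can be sketched as a small Python helper that only builds the command strings; it does not run anything on the system. The function name end_sequence and the end_all flag are hypothetical, introduced here for illustration. The CHGCLURCY syntax is copied from the steps above.

```python
def end_sequence(crg_names, end_all=False):
    """Return CHGCLURCY command strings in the order given in steps 1 to 3.

    crg_names: names of the CRG (or Admin Domain) jobs that appear hung.
    end_all:   hypothetical flag; when True, also list the QCSTCRGM and
               QCSTCTL commands for a clean restart of the node.
    """
    template = "CHGCLURCY CLUSTER(*) CRG({name}) NODE(*) ACTION(*END)"

    # Step 1: CRG jobs are always ended first.
    commands = [template.format(name=name) for name in crg_names]

    if end_all:
        # Step 2: QCSTCRGM, only after the CRG jobs have ended.
        commands.append(template.format(name="QCSTCRGM"))
        # Step 3: QCSTCTL, needed only if it is still running after step 2.
        commands.append(template.format(name="QCSTCTL"))

    return commands


if __name__ == "__main__":
    for command in end_sequence(["CRG1"], end_all=True):
        print(command)
```

If only one CRG job is hung, calling end_sequence with end_all left at False yields just the step 1 command, which matches the note that the remaining cluster jobs do not need to be ended in that case.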
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Example API calls to end system cluster jobs for rare scenarios
Ending CRG jobs example:
CALL PGM(QWCCTLSJ) PARM('*END' 'CRG' 'QSYS' 'nnnnnn') - where nnnnnn = actual job number.
Ending QCSTCRGM example:
CALL PGM(QWCCTLSJ) PARM('*END' 'QCSTCRGM' 'QSYS' 'nnnnnn') - where nnnnnn = actual job number.
Ending QCSTCTL example:
CALL PGM(QWCCTLSJ) PARM('*END' 'QCSTCTL' 'QSYS' 'nnnnnn') - where nnnnnn = actual job number.
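Because the job number has to be filled in by hand, a small Python helper can reduce typing mistakes when preparing these calls. This is a minimal sketch that only formats the string shown in the examples above. The function name qwcctlsj_end_call is hypothetical, and the six-digit check reflects the nnnnnn placeholder used in the examples.

```python
import re


def qwcctlsj_end_call(job_name, job_number, user="QSYS"):
    """Format a QWCCTLSJ *END call like the examples above.

    job_name:   CRG name, QCSTCRGM, or QCSTCTL.
    job_number: the actual job number as a six-digit string (nnnnnn).
    user:       job user; QSYS in all of the examples above.
    """
    # Reject anything that is not exactly six digits, such as a leftover
    # 'nnnnnn' placeholder or a truncated number.
    if not re.fullmatch(r"\d{6}", job_number):
        raise ValueError("job number must be exactly six digits")

    return (
        f"CALL PGM(QWCCTLSJ) PARM('*END' '{job_name}' '{user}' '{job_number}')"
    )


if __name__ == "__main__":
    print(qwcctlsj_end_call("QCSTCTL", "123456"))
```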
***If the goal is to get out of the hang by ending and restarting all nodes, then after completing the steps above, ensure that the QCSTCTL job is in OUTQ status (use WRKJOB QCSTCTL) on all nodes before attempting to start the nodes again.***
Document Information
Modified date:
30 August 2024
UID
nas8N1019731