We had a tiny problem with clustering on one of our test systems, and I thought it was worth listing the steps you need to take when looking at clustering problems.
The problem was that one remote queue manager knew about a queue, but another remote queue manager did not.
We used the checklist below to track down setup problems:
- Went into SDSF and found the joblog of the "problem" queue manager
- Used SE to edit the joblog
- X ALL to exclude (hide) all of the lines
- F CSQX ALL to show all of the CSQX messages
- Looked at the CSQX messages and noted the problems
For example, on the "problem" queue manager we spotted:
CSQX038E MQPA CSQXREPO Unable to put message to SYSTEM.CLUSTER.REPOSITORY.
CSQX448E MQPA CSQXREPO Repository manager stopping because of errors. Restart in 600 seconds.
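If you have saved the joblog to a text file, the same "hide everything, then show only CSQX messages" idea can be sketched off-host. This is only a rough illustration in Python, not a tool we used. The function name find_csqx_messages is made up for this example, and it assumes a message ID looks like CSQX, three digits, then a severity letter (for example E for error), as in the two messages above.

```python
import re

# Assumed shape of a message ID: "CSQX" + three digits + one severity letter.
# This deliberately does not match module names such as CSQXREPO.
CSQX_ID = re.compile(r"\bCSQX\d{3}([A-Z])\b")


def find_csqx_messages(joblog_text, severity=None):
    """Return (message_id, line) pairs for joblog lines holding a CSQX message.

    If severity is given (for example "E"), keep only messages whose ID
    ends in that letter.
    """
    found = []
    for line in joblog_text.splitlines():
        match = CSQX_ID.search(line)
        if match is None:
            continue  # equivalent of leaving the line excluded
        if severity is not None and match.group(1) != severity:
            continue
        found.append((match.group(0), line.strip()))
    return found


# The two messages from this post, plus a placeholder for any other joblog line.
sample_joblog = """\
(any other joblog line)
CSQX038E MQPA CSQXREPO Unable to put message to SYSTEM.CLUSTER.REPOSITORY.
CSQX448E MQPA CSQXREPO Repository manager stopping because of errors. Restart in 600 seconds.
"""

for message_id, line in find_csqx_messages(sample_joblog, severity="E"):
    print(message_id, "|", line)
```

Filtering on the severity letter is a quick way to get to the errors first, which is where we found our clue.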
We also saw cluster channels stop and go into retry. Clustering updates from other systems could not reach the "problem" queue manager because the channels were not working.
The put to SYSTEM.CLUSTER.REPOSITORY had failed because the page set was full. We fixed the page set full problem, the channels started, and soon the clusters were OK.
The lesson we learned was to check for simple problems before assuming the problem is a deep one.
We also explored:
- Why the page set filled up: we had lots of temporary queues defined on the page set. These should have been defined on "application" page sets.
- Why the page set did not expand: it was set to no expand, by someone who did not understand the setting.