We've had a couple of conversations with customers about what to do when things go wrong in production. One person said the ops people used to have a red book (a bit coffee stained) with procedures for the operators when things go wrong - but she said that this book has not been mentioned for a few years. This then led to the question - what is the best way of providing this information for the operators?
Afterwards we had a discussion and we thought the best way may be to have a wiki. (This may be a bit radical to some of us old z/OS hands). At Hursley we have used https://www.mediawiki.org/wiki/MediaWiki within the change team - there are others available.
A wiki allows updates from many people, and may allow you to track changes to pages. It displays the output in your favorite web browser, and has a search capability.
A wiki may be better than a collection of .doc files, because with separate files you may need to search all of them to find what you want. Having one big .doc file makes it hard to update - and your documentation needs to be easy to update.
If you have a wiki - remember to print off the instructions on what to do if you have a power outage - as if there is no power, you may not be able to look at the wiki to see what to do!
What is the point of the run book?
This "book" is meant to contain information for the operations staff, and be updated by the applications programmers, MQ sysprogs, and other staff.
It contains information about the resources a business application uses, and what to do when problems occur. For example, have a page for the Payroll application:
Application:PAYROLL - business critical!
The PA* transactions running on
CICSPA1, CICSPA2 and CICSPA3
QM: MQ1 and MQ2
- Payroll_input - shared queue
- Payroll_output - shared queue
- PAYROLL_TRIGGER shared queue
Other MQ objects
The application sends messages to other queue managers, using XMITQ LINUX1 and LINUX2
If there are problems with the PAYROLL application, call the applications people on 0800 PAYROLL APP (don't phone the MQ sysprogs).
What can go wrong?
If the PAYROLL application does not seem to be working:
Check the queues to the remote systems. (Then have a link to a common wiki page such as)
Checking transmission queues
Use the +cpf DIS Q(..) curdepth to display the current depth.
Wait for a second, repeat the command and see if the curdepth is decreasing.
What channel is using this queue? Use +cpf DIS CHL(*) WHERE(XMITQ,EQ,name) to find out.
Use DIS CHS(..) and check the channel is active, processing messages etc, last message sent.
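The "display, wait, display again" check above could be scripted. Here is a minimal Python sketch of the comparison, not part of any MQ tooling. The names depth_trend and check_queue are made up for illustration, and get_depth is a hypothetical callable you would supply yourself (for example a wrapper around whatever you use to issue DIS Q(..) CURDEPTH and pull out the number).

```python
import time
from typing import Callable


def depth_trend(first: int, second: int) -> str:
    """Classify two CURDEPTH samples taken a short time apart."""
    if second == 0:
        return "empty"
    if second < first:
        return "draining"
    if second > first:
        return "growing"
    return "unchanged"


def check_queue(get_depth: Callable[[], int], wait_seconds: float = 1.0) -> str:
    """Sample the depth twice, wait_seconds apart, and report the trend.

    get_depth is a hypothetical callable supplied by the caller; it is
    assumed to return the current depth of one transmission queue.
    """
    first = get_depth()
    time.sleep(wait_seconds)
    second = get_depth()
    return depth_trend(first, second)


if __name__ == "__main__":
    # Stand-in for a real depth lookup: returns canned samples in order.
    samples = iter([500, 420])
    print(check_queue(lambda: next(samples), wait_seconds=0))
```

A result of "growing" or "unchanged" with messages on the queue is the cue to go and look at the channel status, as described above. Two samples only give a hint: a busy queue can show the same depth twice while messages are still flowing.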
How is it used?
The ops should be able to use this wiki to display information about the applications and help them fix the important problems first.
- If there are two problems: Queue1 has 100,000 messages and Queue2 has 2 messages - which do you fix first? The answer may be Queue2, which has the funds transfers of over $1 billion.
- If the ops need to stop a CICS region - what is impacted?
- This Linux server over here - needs to be rebooted - what uses it? Ahh Payroll uses it - and we are meant to get paid today - so not a good time to reboot it.
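The questions above are really lookups against the run book pages. As a rough Python sketch of the idea (a wiki search does the same job by hand): the RunBookPage structure, the priority scale, the REPORTS application and the priorities given to Queue1 and Queue2 are all invented for illustration. Only the PAYROLL resource names come from the example page.

```python
from dataclasses import dataclass, field


@dataclass
class RunBookPage:
    application: str
    priority: int  # 1 = fix first; this scale is an assumption
    resources: set = field(default_factory=set)


PAGES = [
    RunBookPage("PAYROLL", 1,
                {"CICSPA1", "CICSPA2", "CICSPA3", "MQ1", "MQ2",
                 "LINUX1", "LINUX2"}),
    # Invented second application, just to make the lookup interesting.
    RunBookPage("REPORTS", 3, {"CICSPA3", "MQ2"}),
]


def impacted_by(resource, pages):
    """Return the applications that use a resource, most critical first."""
    hits = [p for p in pages if resource in p.resources]
    return [p.application for p in sorted(hits, key=lambda p: p.priority)]


def fix_order(depths, queue_priority):
    """Order problem queues by business priority, then by depth.

    depths maps queue name to current depth; queue_priority maps queue
    name to a priority (1 = fix first). Both are supplied by the caller.
    """
    return sorted(depths, key=lambda q: (queue_priority[q], -depths[q]))


if __name__ == "__main__":
    print(impacted_by("CICSPA3", PAGES))
    print(fix_order({"Queue1": 100_000, "Queue2": 2},
                    {"Queue1": 3, "Queue2": 1}))
```

The point of the sketch is that the depth alone does not decide the order. The business priority recorded on the page comes first, which is why Queue2 with its 2 messages is listed ahead of Queue1.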
Of course you could continue with your coffee stained print out from 5 years ago.
Have any of you used a wiki for this, or do you have a better idea? If so, please let me know at firstname.lastname@example.org