I was asked to review a customer's application because it was using a lot of CPU, caused by its MQ usage. When I asked for some basic overview documentation there was a long silence, and eventually someone said "We don't think we have any. What do you think we need?" I came up with the little list below. It is not complete, but it should help you get started. It does not go into the application and look at coding standards - it is just architecture stuff.
Charts/pictures (up to date) of
- Geographic layout eg how many sites - and distance between them
- Systems layout - and where the application fits - eg the green boxes are where application 1 runs. There are z/Linux boxes here, z/OS in the middle, and some x86 servers over there. The red boxes are where application 2 runs
- Connections between systems - including multiple paths
- MQ servers
- Databases (and tables) - especially if they are not on the machine
- What facilities are shared between applications, and how applications are isolated. So if application 1 goes wild - will it impact application 2?
- Mind maps (eg freemind) may be useful for this
- Workload figures - for example the number of transactions/messages/database updates per hour over a day, the peak hour per month, and the CPU used
- Expected growth of application usage eg flat or 20% growth per month
- Size of data sent over the network - is it 100 bytes or 100 MB per transaction?
- If you suddenly had double the activity tomorrow - what would happen? How would you know?
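The growth question can be turned into simple arithmetic. The sketch below is only an illustration: the function name and the example figures are made up, and it assumes load grows by a fixed percentage each month, which real workloads rarely do exactly.

```python
from typing import Optional


def months_until_exceeded(current_load: float, capacity: float,
                          monthly_growth: float) -> Optional[int]:
    """Return how many whole months pass before load first exceeds capacity.

    monthly_growth is a fraction, eg 0.20 for 20% growth per month.
    Returns None if load would never exceed capacity (flat or negative growth).
    """
    if current_load > capacity:
        return 0
    if monthly_growth <= 0:
        return None
    months = 0
    load = current_load
    while load <= capacity:
        load *= 1 + monthly_growth  # compound growth, one month at a time
        months += 1
    return months


# Illustrative figures: 1000 transactions/hour today, capacity of 2000.
print(months_until_exceeded(1000, 2000, 0.20))  # prints 4
```

With these made-up numbers, 20% growth per month means the "double activity" case arrives in about four months, not tomorrow, but it does arrive.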
What single points of failure (SPOFs) have you identified?
- What actions do you take if you hit them - eg restart over there. This is a half hour outage
- What testing/checking do you do to see if you have other SPOFs? For example
- Sysplex not enabled
- MQ shared queue not being used - so if the LPAR goes down - messages are not processed
- Only one application processing an input queue
- One connection/path to a remote system. No back end availability
- Single DNS
- Single instance of remote database - no failover support
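One way to make the SPOF check repeatable is to keep a small inventory of components and flag anything with a single instance or no failover. This is a minimal sketch: the `Component` fields and the inventory entries are hypothetical, and a real inventory would come from your own configuration data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    instances: int      # how many independent copies exist
    has_failover: bool  # can another copy take over the work?


def find_spofs(components: List[Component]) -> List[str]:
    """Return the names of components with one instance or no failover."""
    return [c.name for c in components
            if c.instances < 2 or not c.has_failover]


# Hypothetical inventory, based on the examples in the list above.
inventory = [
    Component("DNS", 1, False),
    Component("input queue consumer", 1, False),
    Component("queue manager", 2, True),
    Component("remote database", 1, False),
]
print(find_spofs(inventory))
# prints ['DNS', 'input queue consumer', 'remote database']
```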
What monitoring do you have in each component?
- Eg queue depth, response time, reporting of time-outs
- What do you do for each one?
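Pairing each monitored value with a threshold and an action answers both questions at once. The sketch below assumes the metrics have already been collected into a dictionary. The metric names, limits and actions are invented for illustration and are not recommendations.

```python
# Illustrative thresholds and actions; real values depend on the application.
CHECKS = {
    "queue_depth": (5000, "find out why the consumer is slow or stopped"),
    "response_time_ms": (2000, "check the back-end systems and the network"),
    "timeouts_per_minute": (10, "check the remote system is reachable"),
}


def actions_needed(sample):
    """Return (metric, action) pairs for each metric above its threshold."""
    needed = []
    for metric, (limit, action) in CHECKS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            needed.append((metric, action))
    return needed


# A made-up sample where only the queue depth is over its limit.
sample = {"queue_depth": 7200, "response_time_ms": 350, "timeouts_per_minute": 0}
for metric, action in actions_needed(sample):
    print(f"{metric}: {action}")
```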
Operations 'book' - may be a document or file.
- If queue X fills up - do the operations people know which application is impacted - and how important this application is to the business?
- If channel Y stops - what is the action?
- If application reports table Y has no space or other problem - what happens?
- Do they have a prioritized list of actions, eg do customer transactions before internal applications?
- Is this book up to date?
- Is it updated for new or changed applications?
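If the operations book is kept as data rather than free text, it is easier to look things up and to keep up to date. The following is a sketch only: the entries, application names, priorities and actions are all hypothetical, and the resource names simply reuse the X and Y placeholders from the questions above.

```python
# Hypothetical operations-book entries; every name and value is illustrative.
OPERATIONS_BOOK = [
    {"event": "queue full", "resource": "QUEUE.X",
     "application": "customer-payments", "priority": 1,
     "action": "check the consuming application is running"},
    {"event": "channel stopped", "resource": "CHANNEL.Y",
     "application": "internal-reports", "priority": 3,
     "action": "check the network, then restart the channel"},
    {"event": "table problem", "resource": "TABLE.Y",
     "application": "customer-orders", "priority": 2,
     "action": "call the database on-call contact"},
]


def find_entry(event, resource):
    """Return the book entry for this event and resource, or None."""
    for entry in OPERATIONS_BOOK:
        if entry["event"] == event and entry["resource"] == resource:
            return entry
    return None


def work_order(entries):
    """Sort entries so the most important (lowest priority number) is first."""
    return sorted(entries, key=lambda e: e["priority"])


entry = find_entry("queue full", "QUEUE.X")
print(entry["application"], "-", entry["action"])
```

A lookup that returns None is itself useful information: it means the book has no entry for that problem and needs updating.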
Do applications report problems?
- Applications need to report any problems they find with enough information to identify the application program, where in the program, what the problem is and any error code.
- Applications that just quietly die with no information are very hard to diagnose.
- Problems may include
- MQ queue full
- No MQ message found
- Unable to access database table
- Unable to insert into table
- No response received (application time-out)
- Security problem
- Invalid data received.
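As a rough illustration of "enough information", the sketch below writes one structured line per problem using Python's standard logging module. The function name, field names and the example application are assumptions for illustration. The reason code 2053 is the MQ reason code for a full queue (MQRC_Q_FULL), used here only as an example value.

```python
import json
import logging

logger = logging.getLogger("app.problems")


def report_problem(application, location, problem, error_code, **details):
    """Log one problem with the four items the text asks for, plus any extras."""
    record = {
        "application": application,  # which application program
        "location": location,        # where in the program
        "problem": problem,          # what the problem is
        "error_code": error_code,    # any error code
        "details": details,          # eg queue or table name
    }
    logger.error(json.dumps(record, sort_keys=True))
    return record


# Hypothetical usage: a put to a reply queue failed because the queue is full.
report_problem("payments", "send_reply", "MQ queue full", 2053,
               queue="REPLY.QUEUE")
```

The point of the fixed fields is that whoever reads the log can identify the application and the failing step without needing the source code.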