What to investigate when analyzing performance

Always start by looking at the overall system before you decide that you have a specific CICS® problem. Check total processor usage, direct access storage device activity, and paging. Performance degradation is often due to application growth that is not matched by corresponding increases in hardware resources. If so, solve the hardware resource problem first. You might still need to follow on with a plan for multiple regions.

Information from at least three levels is required:

CICS: Examine the CICS interval or end-of-day statistics for exceptions, queues, and other symptoms that suggest overloads on specific resources. A shorter reporting period can isolate a problem. Consider software and hardware resources; for example, utilization of VSAM strings or database threads, files, and TP lines. Check runtime messages that are sent to the console and to transient data destinations, such as CSMT and CSTL, for persistent application problems and network errors.
Use tools such as the CICS Explorer® and RMF to monitor the online system and identify activity that correlates with periods of bad performance. Collect CICS monitoring facility history and analyze it, using tools such as CICS Performance Analyzer or IBM Z® Decision Support, to identify performance and resource usage exceptions and trends. For example, note processor-intensive transactions that perform little or no I/O. These transactions can monopolize the processor, causing erratic response in other transactions with more normally balanced activity profiles. These transactions might be candidates for isolation in another CICS region.
MVS: Use SMF data to discover any relationships between periods of poor CICS performance and other concurrent activity in the MVS system. Use RMF data to identify overloaded devices and paths. Monitor CICS region paging rates to make sure that there is sufficient real storage to support the configuration.
Network: The proportion of response time that is spent in the system is small compared with transmission delays and queuing in the network. Use tools such as Tivoli® NetView for z/OS® to identify problems and overloads in the network. Without automatic tools, you depend on users' subjective reports that performance has deteriorated.

In CICS, a performance problem is either a poor response time or an unexpected and unexplained high use of resources. In general, you must look at the system in some detail to see why tasks are progressing slowly through the system, or why a certain resource is being used heavily.

The best way of looking at detailed CICS behavior is by using CICS auxiliary trace. Be aware that auxiliary trace, though the best approach, can worsen existing poor performance while it is in use. The approach is to get a picture of task activity first, listing only the task traces, and then to focus on particular activities: specific tasks, or a specific time interval. For example, for a response time problem, you might want to look at the detailed traces of one task that is observed to be slow. There are a number of possible reasons for a slow task; for example, the tasks might be trying to do too much work for the system, the system might be real-storage constrained, or many of the CICS tasks might be waiting because there is contention for a particular function.

Information sources to help analyze performance

Any performance measurement tool, including statistics and the CICS monitoring facility, can potentially help in diagnosing problems. Consider each performance tool as usable in some degree for each purpose: monitoring, single-transaction measurement, and problem determination. CICS statistics can reveal heavy use of a particular resource. For example, you might find a large allocation of temporary storage in main storage, a high number of storage control requests per task (perhaps 50 or 100), or high program use counts that imply heavy use of program control LINK.

Both statistics and CICS monitoring might show exceptional VSAM shared resource conditions arising in the CICS run. Statistics can show waits on strings, waits for VSAM shared resources, waits for storage in GETMAIN requests, and other waits. The waits also generate CICS monitoring facility exception class records.

While these conditions are also evident in CICS auxiliary trace, they might not be obvious, and the other information sources are useful in directing the investigation of the trace data. In addition, you can gain useful data from the investigation of CICS outages. If there is a series of outages, investigate common links between the outages.

The QR TCB CPU Dispatch Ratio

A TCB CPU Dispatch Ratio is the accumulated CPU time as a fraction of accumulated dispatch time, expressed as a percentage. In a CICS environment this ratio is only of value for the QR TCB and is meaningless for other TCBs. The QR TCB CPU Dispatch Ratio is an indicator of how much processor resource is assigned to the QR TCB by the operating system and hardware, when compared to the amount of processor resource requested by the CICS dispatcher.

For a given interval, a high ratio indicates that when CICS dispatched a task on the QR TCB, processor resource was made available by the operating system and hardware almost without interruption until the CICS task had completed. In this case, the CPU time is closely correlated with the overall elapsed time (the CICS dispatch time).

A low ratio indicates that despite CICS requesting processor resource, a combination of the operating system, hardware, or both resulted in frequent or long delays waiting for a physical processor. In this case, the CPU time is significantly smaller than the overall elapsed time.
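The definition above can be sketched as a small calculation. This is only an illustration: the function name and the sample figures are hypothetical, and the inputs stand for the accumulated QR TCB CPU time and dispatch time taken from whichever report you use.

```python
def cpu_dispatch_ratio(cpu_seconds: float, dispatch_seconds: float) -> float:
    """Return accumulated CPU time as a percentage of accumulated dispatch time."""
    if dispatch_seconds <= 0:
        raise ValueError("dispatch time must be greater than zero")
    return 100.0 * cpu_seconds / dispatch_seconds


# Hypothetical figures for two intervals with the same dispatch time.
# High ratio: CPU time closely tracks the dispatch (elapsed) time.
high = cpu_dispatch_ratio(cpu_seconds=57.0, dispatch_seconds=60.0)
# Low ratio: CPU time is much smaller than the dispatch time, which
# suggests that the QR TCB spent time waiting for a physical processor.
low = cpu_dispatch_ratio(cpu_seconds=24.0, dispatch_seconds=60.0)

print(f"high: {high:.1f}%  low: {low:.1f}%")  # high: 95.0%  low: 40.0%
```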

The QR TCB CPU Dispatch Ratio is reported in a number of ways:

Using a system memory dump:
The CICS IPCS system memory dump formatter DS keyword gives the ratio in the Dispatcher Statistics - CICS TCB Mode Statistics section.
Using CICS statistics SMF records:
The CICS statistics utility program, DFHSTUP, gives the ratio in the Dispatcher Statistics section.
Using the sample statistics program, DFH0STAT:
The sample program gives the ratio in the Dispatcher TCB Modes section.
Using the CICS message DFHDS0102:
DFHDS0102 messages regularly report the QR TCB CPU Dispatch Ratio and can be used as an indicator of a potential shortage of CPU resource. The default interval for DFHDS0102 messages is five minutes, but the interval can be set by specifying the system initialization parameter INITPARM=(DFHQRCPU='nn') where nn is a number of minutes in the range 01 - 59.

The QR TCB CPU Dispatch Ratio is calculated by using the CICS dispatcher statistics. CICS accumulates the amount of CPU time that is used by the TCB every time the QR TCB is dispatched. When using a system memory dump, CICS statistics SMF records, or the sample statistics program, the ratio is calculated using the CPU and dispatch time accumulated since statistics were last reset. When using DFHDS0102, the ratio is the average utilization of the QR TCB since the last time the message was issued.
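The difference between a ratio accumulated since the last statistics reset and a ratio for the most recent interval can be illustrated with a short sketch. The `Snapshot` type, the function names, and the numbers are hypothetical; the sketch assumes only that you have two readings of the accumulated QR TCB CPU and dispatch times.

```python
from typing import NamedTuple


class Snapshot(NamedTuple):
    cpu_seconds: float       # QR TCB CPU time accumulated since the last reset
    dispatch_seconds: float  # QR TCB dispatch time accumulated since the last reset


def ratio_since_reset(snapshot: Snapshot) -> float:
    """Ratio over everything accumulated since statistics were last reset."""
    return 100.0 * snapshot.cpu_seconds / snapshot.dispatch_seconds


def ratio_between(earlier: Snapshot, later: Snapshot) -> float:
    """Ratio for the period between two snapshots, using the deltas only."""
    cpu_delta = later.cpu_seconds - earlier.cpu_seconds
    dispatch_delta = later.dispatch_seconds - earlier.dispatch_seconds
    if dispatch_delta <= 0:
        raise ValueError("no dispatch time accumulated between snapshots")
    return 100.0 * cpu_delta / dispatch_delta


# Hypothetical readings taken some minutes apart.
earlier = Snapshot(cpu_seconds=90.0, dispatch_seconds=100.0)
later = Snapshot(cpu_seconds=120.0, dispatch_seconds=160.0)

print(ratio_since_reset(later))       # 75.0, averaged over the whole period
print(ratio_between(earlier, later))  # 50.0, the most recent period only
```

In this made-up case the since-reset figure still looks moderate while the most recent period has already dropped, which is why the reporting period matters when you compare values from different sources.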

As best practice, collect a series of measurements to determine the usual range of the QR TCB CPU Dispatch Ratio for your typical production environment, and use these measurements to recognize whether this ratio is showing a change from normal behavior.
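One possible way to turn a series of measurements into a usual range is sketched below. The mean plus or minus two standard deviations rule, the function names, and the sample values are assumptions chosen for illustration, not a CICS recommendation; a percentile band or a simple minimum and maximum might suit your environment better.

```python
import statistics


def usual_range(samples: list[float], k: float = 2.0) -> tuple[float, float]:
    """Return (low, high) as the mean plus or minus k sample standard deviations."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)  # needs at least two samples
    return mean - k * spread, mean + k * spread


def is_unusual(value: float, samples: list[float], k: float = 2.0) -> bool:
    """True if a new ratio falls outside the usual range of the baseline."""
    low, high = usual_range(samples, k)
    return value < low or value > high


# Hypothetical baseline of QR TCB CPU Dispatch Ratio values (percent),
# for example one value per reporting interval during normal production.
baseline = [88.0, 90.0, 92.0, 89.0, 91.0]

print(is_unusual(90.5, baseline))  # False: within the usual range
print(is_unusual(72.0, baseline))  # True: a change from normal behavior
```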

Common reasons for a low ratio

Within a busy system it is normal for CICS work to queue for processor resource, so a dispatch ratio of less than 100% is acceptable. However, a CICS region might suffer performance problems, such as poor transaction response times, if this ratio falls to a low value. A low value for the QR TCB CPU Dispatch Ratio is typically less than 70%. Any of the following conditions can cause a low ratio:

The LPAR is busy. The CICS region is competing with other address spaces for CPU and the operating system cannot allocate processor resource when requested.
The LPAR fair share is reached or capped. The operating system has dispatched the CICS QR TCB onto a logical processor, but the hardware cannot dispatch the logical processor onto a physical processor.
CICS is subject to capped resources in the LPAR. The LPAR may not be fully utilized, but operating system controls have restricted the amount of processor resource available to the CICS region.
Application code is issuing non-CICS API requests (for example, MVS macro requests) that block the QR TCB until the request completes.
Excessive system paging is taking place.

The following conditions can also cause a low ratio; however, they are considered normal situations and do not require further investigation:

CICS system initialization is in progress. A low ratio is observed immediately after control is given to CICS and is considered normal, because CICS uses many MVS system services during initialization, all of which are processed by the QR TCB.
The region is a Terminal-Owning Region with the system initialization parameter HPO set to YES. In this case, VTAM® is subtasking the arriving work onto SRBs and the only CPU that is being used is for routing work elsewhere.
When non-threadsafe applications in the region access VSAM RLS files, VSAM completes the file access request on an SRB, and the CPU consumed is not accumulated by the QR TCB dispatcher statistics. See the description of the RLSCPUT field in Performance data in group DFHFILE for details.

The QR TCB Dispatch / Interval ratio

The QR TCB Dispatch / Interval ratio is a way to describe and measure how busy a CICS region's QR TCB is. This is important because a common cause of CICS transaction response time problems is a QR TCB that is too busy. To understand the ratio, it helps to have a little background on the QR TCB and the CICS dispatcher.

When no CICS transactions are ready to run on the QR TCB, the CICS dispatcher puts the QR TCB into an MVS wait. That is wait time. When one or more transactions are ready to run on the QR TCB, the QR TCB wakes up out of its MVS wait and the CICS dispatcher gives control of the QR TCB to the ready transactions, one at a time. That is dispatch time. When a transaction returns control of the QR TCB to the CICS dispatcher, and there is no transaction ready to run on the QR TCB, the CICS dispatcher puts the QR TCB back into an MVS wait. The QR TCB is now in wait time again.

In any given interval of time, the QR TCB spends part of that interval in dispatch time and part in wait time. The ratio of dispatch time to the interval of time, called the QR TCB Dispatch / Interval ratio, is a measure of how busy the QR TCB is. For example, if the QR TCB has a total of 3 minutes of dispatch time and 2 minutes of wait time in a 5-minute interval, the QR TCB Dispatch / Interval ratio for that interval is 60%, calculated as 3 minutes of dispatch time divided by the 5 minutes of interval time. The QR TCB is 60% saturated in that interval.
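The arithmetic in this example can be written as a short sketch. The function name is hypothetical; the only assumption is that dispatch time and interval time are expressed in the same unit.

```python
def dispatch_interval_ratio(dispatch_seconds: float, interval_seconds: float) -> float:
    """Return QR TCB dispatch time as a percentage of the interval length."""
    if interval_seconds <= 0:
        raise ValueError("interval must be greater than zero")
    if not 0 <= dispatch_seconds <= interval_seconds:
        raise ValueError("dispatch time must lie between zero and the interval length")
    return 100.0 * dispatch_seconds / interval_seconds


# The example from the text: 3 minutes of dispatch time and 2 minutes of
# wait time in a 5-minute interval.
dispatch = 3 * 60
wait = 2 * 60
ratio = dispatch_interval_ratio(dispatch, dispatch + wait)
print(f"{ratio:.0f}% saturated")  # 60% saturated
```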

If the QR TCB is 100% saturated for an interval, that means that the QR TCB is very busy. Whenever a CICS transaction gives control of the QR TCB back to the CICS dispatcher, there is always another transaction ready to run. The CICS dispatcher never puts the QR TCB into a no-work MVS wait because there is always another transaction waiting to be given control of the QR TCB by the CICS dispatcher.

If the QR TCB Dispatch / Interval ratio is close to 0% for an interval, that means that the QR TCB is not busy at all. There is rarely a transaction ready to run on the QR TCB during that interval. The QR TCB is in its no-work MVS wait for most of the interval.

It is important to monitor the QR TCB Dispatch / Interval ratio. If the QR TCB becomes too saturated, it becomes a bottleneck that causes transaction response times to increase. As the QR TCB Dispatch / Interval ratio approaches 90% and higher, there are more and more occasions when many transactions are ready to run on the QR TCB at the same time. Only one transaction at a time runs on the QR TCB, while the other transactions wait. A CICS region whose QR TCB Dispatch / Interval ratio is too high is likely to experience transaction response time problems and might benefit from splitting workload across separate AORs.

An ideal range for the QR TCB Dispatch / Interval ratio of a CICS region is 50% or less. At that level of saturation, the QR TCB will not be a bottleneck, and there is more room for this CICS region to absorb a spike in workload, or to run during a period of heavy CPU contention without pushing the QR TCB Dispatch / Interval ratio into the 90% or higher range.
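The two thresholds mentioned in this section, 50% or less as the ideal range and roughly 90% or higher as the range where the QR TCB becomes a bottleneck, can be applied in a simple check such as the following sketch. The function name and the band labels are hypothetical, and the middle band is an assumption, because the text does not name the range between the two thresholds.

```python
def classify_qr_saturation(ratio_percent: float) -> str:
    """Map a QR TCB Dispatch / Interval ratio to an illustrative band."""
    if not 0.0 <= ratio_percent <= 100.0:
        raise ValueError("ratio must be between 0 and 100")
    if ratio_percent <= 50.0:
        return "ideal"              # room to absorb a workload spike
    if ratio_percent >= 90.0:
        return "likely bottleneck"  # transactions queue for the QR TCB
    return "reduced headroom"       # assumed label for the band in between


for sample in (35.0, 60.0, 93.0):
    print(sample, classify_qr_saturation(sample))
```

Treat the cut-off values as starting points and adjust them to the usual behavior of your own regions.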

The QR TCB saturation (the QR TCB Dispatch / Interval ratio), along with the QR TCB CPU Dispatch Ratio, is reported in CICS message DFHDS0102. DFHDS0102 messages regularly report the QR TCB CPU Dispatch Ratio and can be used as an indicator of a potential shortage of CPU resource.