Using the Performance Toolkit to Monitor
Analysis of a performance problem requires the ability to distinguish excessive values of specific performance indicators from healthy ones. Unfortunately there are very few generally accepted simple thresholds upon which to base this decision, because many of these thresholds depend on the size (storage, CPU power) and I/O configuration of the system being analyzed. In other words, do not rely too much on any rules of thumb (they may not be applicable to your system or the current problem) but compare some important performance indicators on your system and check them for significant changes at different response time levels.
The C1ES values (class 1 elapsed time slice) needed to tell you when performance was bad, along with many other performance values which usually give some indication as to the reason for the bad performance, can be viewed using any of the following displays:
- REDISP screen (command 'REDISP' or PF 2). It allows you to view up to the whole of the current day's history data at once, but you will have to detect any correlations between different variables yourself.
- REDHIST screen (command 'REDHIST fn ft fm'), for viewing previous days' HISTLOG and HISTSUM files. It contains values for all the performance variables in the REDISP display, and also from most of the other 'by time' logs available for realtime monitoring.
- Variable Correlation Coefficients screen (see the CORREL subcommand in the z/VM: Performance Toolkit Reference). It shows the correlation coefficients for all the performance variables in a HISTLOG or HISTSUM file, for a selected base variable. Correlation coefficients indicate how closely a variable's values follow changes in the base variable's values. Values close to 1 indicate good correlation, that is, there appears to be a close relationship between the variables, and this can provide pointers to the probable origin of performance problems.
- Graphical CPU History Data Displays (commands PLOTSUM, GRAPHSUM, PLOTDET, and GRAPHDET), which show how the selected variables changed with time.
- Performance Variable Correlation Displays (commands PLOTVAR and GRAPHVAR). They allow you to correlate any of the performance variables with each other, and the resulting graphics are usually an excellent base for analyzing performance problems.
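To make the idea of a correlation coefficient concrete, the following Python sketch computes a coefficient between C1ES and a few other columns, using the twelve sample intervals from the REDISP example in this section. It is an illustration only: this section does not state which formula the CORREL subcommand uses, so Pearson's coefficient is an assumption here, and the function and variable names are hypothetical.

```python
from math import sqrt

# Sample intervals copied from the REDISP example in this section
# (09:15-09:21 and 09:46-09:54).
c1es = [1.17, 1.10, 1.16, 1.16, 1.10, 1.04, 1.78, 1.83, 1.76, 1.75, 1.96, 1.83]
columns = {
    "ACT": [213, 218, 206, 212, 200, 205, 310, 270, 290, 289, 282, 256],
    "%PQ": [18, 9, 17, 4, 5, 20, 19, 12, 18, 30, 11, 32],
    "%IQ": [22, 31, 27, 50, 52, 50, 17, 17, 13, 12, 28, 20],
}

def pearson(xs, ys):
    """Pearson correlation coefficient (an assumed formula, not taken from CORREL)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# C1ES is the base variable; every other column is compared against it.
coefficients = {name: pearson(c1es, values) for name, values in columns.items()}

# List the strongest relationships first, regardless of sign.
for name, r in sorted(coefficients.items(), key=lambda item: -abs(item[1])):
    print(f"{name:>4} vs C1ES: {r:+.2f}")
```

With this data, ACT tracks C1ES closely, %PQ shows a weaker positive relationship, and %IQ moves in the opposite direction. Twelve samples are far too few for firm conclusions, so treat the result as a pointer only.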
Any of these displays can help you detect the reason for the problem at hand, so you can select the one you feel most familiar with. You will find examples of variable correlation displays later in this chapter. Here we will start with the REDISPLAY screen, which shows all the original raw data.
FCX101 CPU nnnn SER nnnnn Interval 09:12:19 - 09:54:27 Perf. Monitor
TIME >LOGN ACT TR-T NT-T C1ES TR-Q NT-Q TR/S NT/S %PQ %IQ %LD %EL Q1 Qx Q1L
09:15 1295 213 0.07 1.47 1.17 .8 14.3 11.2 9.7 18 22 6 0 13 17 2
09:16 1294 218 0.06 1.67 1.10 .8 16.4 12.0 9.7 9 31 23 0 11 16 6
09:17 1296 206 0.06 1.69 1.16 .6 15.8 11.0 9.3 17 27 13 0 11 27 3
09:18 1291 212 0.07 1.71 1.16 .7 14.4 10.3 8.4 4 50 7 0 14 13 2
09:20 1288 200 0.06 1.42 1.10 .7 13.9 11.2 9.7 5 52 18 0 5 12 2
09:21 1292 205 0.07 2.97 1.04 .7 24.4 10.0 8.2 20 50 6 0 4 13 1
..... .... ... .... .... .... ... .... .... ... .. .. .. .. .. .. ..
..... .... ... .... .... .... ... .... .... ... .. .. .. .. .. .. ..
09:46 1060 310 0.10 2.96 1.78 1.4 33.6 13.7 11.3 19 17 8 0 22 29 3
09:47 1067 270 0.11 5.22 1.83 1.1 47.1 10.6 9.0 12 17 6 0 10 35 1
09:49 1079 290 0.09 4.03 1.76 1.1 38.8 11.4 9.6 18 13 5 0 18 34 3
09:50 1083 289 0.10 5.39 1.75 1.0 50.1 10.6 9.2 30 12 11 0 20 32 5
09:52 1093 282 0.11 4.86 1.96 1.3 47.0 11.0 9.6 11 28 4 0 15 35 2
09:54 1101 256 0.10 2.88 1.83 1.1 26.4 10.7 9.1 32 20 7 0 14 25 3
- Check the class 1 elapsed time slice shown under heading C1ES. Look for periods of good response times and of bad response times. In the example above, good response times should be below 1.2 seconds for the system shown. So, the good, or at least acceptable, response times are shown in the 9:15 - 9:21 time span and the bad, or unsatisfactory, response times are shown in the 9:46 - 9:54 time span.
  Note: you will have to establish your own thresholds for good or bad response times (the value depends very much on the 'job mix' which is active on the system).
- Try to locate the problem area by looking for other columns where high and low values are closely related to high and low response times.
- Shift the display window to the right (enter the command TRANSACT or press PF11 until the transaction-related fields are shown) and check for a probable paging problem by looking at column %PQ, which shows the percentage of in-queue users found in page wait at the end of each interval. It must be related to C1ES if paging really is the problem, but do not forget that all wait state values (%PQ, %IQ, and %EL) are just samples and not averages for the whole interval, so it would not be reasonable to expect perfect correlation. Look at column PG/S (page rate) if you suspect a paging problem. (It should also correlate to C1ES.) Skip to section Analyzing a Paging Bottleneck if paging seems to be the problem.

FCX101 CPU nnnn SER nnnnn Interval 09:12:19 - 09:54:27 Perf. Monitor
TIME CPU %CP %EM %WT %SY - - IO/S VIO/S PG/S XPG/S DIAG PRIV LOGN ACT
09:15 382 135 247 18 28 . . 397 230 549 ... 1468 1295 213 226
09:16 386 125 261 14 25 . . 341 189 483 ... 1227 1294 218 260
09:17 393 125 268 7 27 . . 408 206 540 ... 1219 1296 206 257
09:18 395 117 278 5 24 . . 329 194 478 ... 1199 1291 212 214
09:20 338 122 216 62 24 . . 318 179 452 ... 1430 1288 200 228
09:21 377 143 234 23 28 . . 327 188 411 ... 1822 1292 205 232
..... ... ... ... .. . . . .. .. .. ... .... .... ... 234
..... ... ... ... .. . . . .. .. .. ... .... .... ... 236
09:46 399 161 238 1 37 . . 509 294 824 ... 1328 1060 310 229
09:47 396 142 254 4 32 . . 483 284 693 ... 1298 1067 270 228
09:49 400 141 259 0 32 . . 555 295 665 ... 1233 1079 290 230
09:50 397 151 246 3 34 . . 510 280 738 ... 1365 1083 289 218
09:52 399 154 245 1 37 . . 493 296 779 ... 1297 1093 282 210
09:54 398 141 257 2 31 . . 441 257 641 ... 1292 1101 256 213

- Check for a storage problem. A severe storage problem could be indicated by users ending up in the eligible list, resulting in any number other than zero in column %EL. This column shows the percentage of in-queue users found in resource wait, and if your CPU is not being overloaded there is a good chance the resource being waited for is storage. Skip to section Analyzing a Storage Problem if you suspect some of your users end up in storage wait.
- Look for an I/O problem by checking the values in column %IQ, which show the percentage of in-queue users found in I/O wait at the end of each interval. %IQ should be related to C1ES if you really have a general I/O bottleneck. However, there can be variations here. The I/O problem may be restricted to a group of users which depend on the same overloaded disk volume, while all others do not experience it. %IQ might then not indicate this problem very clearly. Look also at column IO/S (total I/O rate), which should correlate with C1ES and %IQ too. Skip to section Analyzing an I/O Bottleneck if general I/O seems to be at fault.
  Note that a certain number of users in I/O wait is usually much less of a problem than the same number of users in page wait: I/O wait is part of normal processing and will not disappear even on very lightly loaded systems. Page wait, on the other hand, will always increase response times. What %IQ value is 'normal' depends mainly on the job mix being run on any given system. Use the values as a standard which you find on your system while it is fairly busy with acceptable response times.
- Finally, check for a CPU problem by verifying the values in column CPU (total CPU busy percentage). If they are at the system's maximum much of the time, and if the sum of Q1+Qx is high while the values in columns %PQ and %IQ are low, then skip to section Analyzing an Overloaded CPU. This seems to have been the bottleneck in the example shown: The %IQ value is definitely lower for the 'bad' measurements, and, although we find higher values also for the %PQ values, CPU load is already at its maximum, so we do not yet see much of the effects of a developing paging bottleneck.
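The comparison described in the steps above can be sketched in Python: split the intervals into 'good' and 'bad' by a C1ES threshold and compare the average of each indicator across the two groups. The values are copied from the two REDISP examples; the 1.2-second threshold is the one quoted for this example system and would need replacing on yours. The function and variable names are hypothetical, and a difference in averages is only a pointer, not a diagnosis.

```python
from statistics import mean

# (TIME, C1ES, %PQ, %IQ, CPU) per interval, from the REDISP examples above.
samples = [
    ("09:15", 1.17, 18, 22, 382),
    ("09:16", 1.10, 9, 31, 386),
    ("09:17", 1.16, 17, 27, 393),
    ("09:18", 1.16, 4, 50, 395),
    ("09:20", 1.10, 5, 52, 338),
    ("09:21", 1.04, 20, 50, 377),
    ("09:46", 1.78, 19, 17, 399),
    ("09:47", 1.83, 12, 17, 396),
    ("09:49", 1.76, 18, 13, 400),
    ("09:50", 1.75, 30, 12, 397),
    ("09:52", 1.96, 11, 28, 399),
    ("09:54", 1.83, 32, 20, 398),
]

C1ES_THRESHOLD = 1.2  # seconds; specific to the example system, not a general rule

def compare_groups(rows, threshold):
    """Return the mean of each indicator for good and bad response time intervals."""
    groups = {"good": [], "bad": []}
    for row in rows:
        groups["good" if row[1] < threshold else "bad"].append(row)
    names = ("%PQ", "%IQ", "CPU")  # stored at tuple positions 2, 3, 4
    return {
        label: {name: mean(r[i + 2] for r in members) for i, name in enumerate(names)}
        for label, members in groups.items()
        if members  # skip a group that has no intervals
    }

result = compare_groups(samples, C1ES_THRESHOLD)
for label, averages in result.items():
    print(label, {name: round(value, 1) for name, value in averages.items()})
```

For this data the bad intervals show a lower average %IQ, a higher average %PQ, and a CPU value close to its maximum, which matches the reading given in the last step.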
In the above guidelines, we have always taken the wait states of
in-queue users as indicators for system bottlenecks. The REDISPLAY
screen shows these values together with the internal system response
time C1ES and other system load figures, which is why we have used
it. The %PQ and %IQ values shown
there are not averages but represent just single samples, and they
will usually vary quite a bit. Have a look at the user wait state
log screen (command 'USTLOG') for a more reliable and more detailed
overview of user wait states, based this time on the CP monitor
high-frequency sampling technique.
FCX135 CPU nnnn SER nnnnn Interval 07:26:00 - 11:06:00 Perf. Monitor
<-SVM and->
End Time %ACT %RUN %CPU %LDG %PGW %IOW %SIM %TIW %CFW %TI %EL %DM %IOA %OTH
>>Mean>> 4 3 4 0 16 0 8 22 24 4 0 16 3 0
09:01:00 4 2 3 0 23 0 9 24 25 4 0 10 4 0
09:06:00 5 2 3 0 28 0 10 22 21 3 0 11 4 0
09:11:00 4 2 3 0 26 0 9 24 20 3 0 13 3 0
09:16:00 3 2 2 0 34 0 8 23 15 4 0 11 3 0
09:21:00 4 3 4 0 27 0 8 24 18 3 0 13 3 0
09:26:00 3 3 4 0 32 0 10 23 16 3 0 9 3 0
09:31:00 3 2 2 0 38 0 9 22 15 3 0 9 3 0
09:36:00 2 4 3 0 35 0 8 21 16 3 0 10 3 0
09:41:00 2 4 3 0 31 0 10 18 18 4 0 12 4 0
09:46:00 3 3 4 0 36 0 9 17 19 3 0 9 3 0
09:51:00 3 4 3 0 34 0 8 20 20 3 0 8 2 0
09:56:00 2 3 1 0 39 0 7 19 21 2 0 8 2 0
........ .. .. .. .. .. .. .. .. .. .. .. .. .. ..
FCX114 CPU nnnn SER nnnnn Interval 11:16:00 - 11:21:00 Perf. Monitor
. ____ . . . . . . . . . . . . .
<-SVM and->
Userid %ACT %RUN %CPU %LDG %PGW %IOW %SIM %TIW %CFW %TI %EL %DM %IOA %OTH
>System< 3 4 4 0 31 0 10 23 14 3 0 9 2 0
VSCS 100 8 3 0 7 0 2 42 0 37 0 0 0 0
VTAM 100 1 4 0 5 0 2 14 0 60 0 0 14 0
FTPQH 100 0 0 0 24 0 0 0 0 5 0 71 0 0
RSCS 99 1 3 0 15 0 32 0 8 0 0 0 42 0
SDTRACK 88 8 6 0 36 0 42 4 0 0 0 0 5 0
TOOLS 78 0 1 0 32 0 60 0 7 0 0 0 0 0
RACFVM 68 7 2 0 8 0 6 64 2 0 0 0 10 0
CHCAL 43 2 13 0 14 0 18 51 2 0 0 0 0 0
FTPDS2 43 0 0 0 33 0 0 4 0 6 0 57 0 0
........ .. .. .. .. .. .. .. .. .. .. .. .. .. ..
Look for the following in these displays:
- CPU wait states in column %CPU as an indicator for a CPU bottleneck
- Page wait states in column %PGW as an indicator for a paging bottleneck
- I/O wait states in columns %IOW and %SIM as indicators for an I/O bottleneck
Because the values in both these screens are based on 'high-frequency'
sampling data from the CP monitor, they are much more reliable than
the information in the REDISP output (which is based on single samples).
However, you must be aware that I/O bottlenecks will usually not show
up as high %IOW values: All diagnose I/O activity
by CMS machines will be shown as 'instruction simulation wait' in
the %SIM column, and not as %IOW.
The above example is from a system with a paging problem, as indicated by the high page wait percentages.
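As a rough illustration of how the three indicators can be compared, the Python sketch below takes the >System< row of the USTAT example and reports which of the CPU, paging, and I/O wait indicators is largest. Adding %IOW and %SIM together follows the remark about diagnose I/O above; picking the largest value is a simplification assumed for this sketch, not a rule from the Performance Toolkit, and the names are hypothetical.

```python
# >System< row from the USTAT example above (percent of samples).
system_row = {"%CPU": 4, "%PGW": 31, "%IOW": 0, "%SIM": 10}

def wait_indicators(row):
    """Group the wait state columns into the three bottleneck indicators."""
    return {
        "CPU": row["%CPU"],
        "paging": row["%PGW"],
        # Diagnose I/O by CMS machines appears as %SIM, so count both columns.
        "I/O": row["%IOW"] + row["%SIM"],
    }

def dominant_wait(row):
    """Return the name of the largest indicator (a simplification, see lead-in)."""
    indicators = wait_indicators(row)
    return max(indicators, key=indicators.get)

print(dominant_wait(system_row))  # paging
```

The result agrees with the paging problem noted for this example. On a real system, compare the values against what you see while response times are acceptable rather than relying on the largest value alone.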

Concentrate on the last bar marked System Average.
Like the >System< values of the USTAT display
it shows the average non-dormant wait states for all users, and it
should give you reliable pointers to potential problem areas (the
example is not from the same machine as the previous USTAT display).