Using the Performance Toolkit to Monitor

Analysis of a performance problem requires the ability to distinguish excessive values of specific performance indicators from healthy ones. Unfortunately, there are very few generally accepted simple thresholds on which to base this decision, because most of them depend on the size (storage, CPU power) and I/O configuration of the system being analyzed. In other words, do not rely too heavily on rules of thumb (they may not apply to your system or to the current problem); instead, compare the important performance indicators on your system and check them for significant changes at different response time levels.

To do this, you will need the history data saved for the REDISPLAY screen or, even better, the performance data saved in performance history files on disk. The necessary data will be available only if you have previously entered the command FC MONCOLL PERFLOG ON (for history files on disk), or at least the command FC MONCOLL ON (to activate permanent data collection), and if the performance monitor has already been running for some time. The data available in the redisplay buffer and in the performance history files include the internal system response time values C1ES, needed to tell you when performance was bad, and also many other performance values which usually give some indication of the reason for the bad performance. The data can be viewed using any of the following displays:
  • REDISP screen (command 'REDISP' or PF 2). It shows at most the current day's history data, but you will have to detect any correlations between different variables yourself.
  • REDHIST screen (command 'REDHIST fn ft fm'), for viewing previous days' HISTLOG and HISTSUM files. It contains values for all the performance variables in the REDISP display, and also from most of the other 'by time' logs available for realtime monitoring.
  • Variable Correlation Coefficients screen (see the CORREL subcommand in the z/VM: Performance Toolkit Reference). It shows the correlation coefficients for all the performance variables in a HISTLOG or HISTSUM file, for a selected base variable. Correlation coefficients indicate how closely a variable's values follow changes in the base variable's values. Values close to 1 indicate good correlation, that is, a close relationship between the variables, and this can provide pointers to the probable origin of performance problems. (A short sketch of this statistic follows this list.)
  • Graphical CPU History Data Displays (commands PLOTSUM, GRAPHSUM, PLOTDET, and GRAPHDET) which show how the selected variables changed with time.
  • Performance Variable Correlation Displays (commands PLOTVAR and GRAPHVAR). They allow you to correlate any of the performance variables with each other, and the resulting graphics are usually an excellent base for analyzing performance problems.
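
If you want to see what lies behind these correlation coefficients, the following sketch computes the same statistic (the Pearson correlation coefficient) outside the Performance Toolkit. It is illustrative only: the correlation function is not a Toolkit facility, and the sample values are made-up figures of the kind shown in the REDISP example later in this section, not data read from a real HISTLOG or HISTSUM file.

  # Illustrative sketch only -- not part of the Performance Toolkit.
  # Pearson correlation coefficient, the statistic shown on the
  # Variable Correlation Coefficients screen.
  from math import sqrt

  def correlation(base, other):
      """Correlation between two equally long lists of interval samples."""
      n = len(base)
      mean_b = sum(base) / n
      mean_o = sum(other) / n
      cov = sum((b - mean_b) * (o - mean_o) for b, o in zip(base, other))
      var_b = sum((b - mean_b) ** 2 for b in base)
      var_o = sum((o - mean_o) ** 2 for o in other)
      return cov / sqrt(var_b * var_o)

  # Made-up per-interval samples: C1ES as the base variable, page rate as the
  # variable suspected of causing the response time problem.
  c1es = [1.17, 1.10, 1.16, 1.16, 1.78, 1.83, 1.76, 1.96]
  pg_s = [549, 483, 540, 478, 824, 693, 665, 779]
  print(correlation(c1es, pg_s))   # a result close to 1 means the variables move together

A coefficient close to 1 for a variable such as the page rate would point to paging as the probable origin of the problem, just as a high value on the CORREL screen does.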

Any of these displays can help you detect the reason for the problem at hand, so select the one you feel most familiar with. You will find examples of variable correlation displays later in this chapter. Here we start with the REDISPLAY screen, which shows all the original raw data.

Either switch to the REDISPLAY screen (command REDISP or PF 2) or print the contents of the REDISPLAY buffer using the command PRINT REDISP, followed by PRINT OFF.
 FCX101      CPU nnnn  SER nnnnn  Interval 09:12:19 - 09:54:27    Perf. Monitor

 TIME >LOGN  ACT TR-T NT-T C1ES TR-Q NT-Q TR/S NT/S %PQ %IQ %LD %EL  Q1  Qx Q1L
 09:15 1295  213 0.07 1.47 1.17   .8 14.3 11.2  9.7  18  22   6   0  13  17   2
 09:16 1294  218 0.06 1.67 1.10   .8 16.4 12.0  9.7   9  31  23   0  11  16   6
 09:17 1296  206 0.06 1.69 1.16   .6 15.8 11.0  9.3  17  27  13   0  11  27   3
 09:18 1291  212 0.07 1.71 1.16   .7 14.4 10.3  8.4   4  50   7   0  14  13   2
 09:20 1288  200 0.06 1.42 1.10   .7 13.9 11.2  9.7   5  52  18   0   5  12   2
 09:21 1292  205 0.07 2.97 1.04   .7 24.4 10.0  8.2  20  50   6   0   4  13   1
 ..... ....  ... .... .... ....  ... .... ....  ...  ..  ..  ..  ..  ..  ..  ..
 ..... ....  ... .... .... ....  ... .... ....  ...  ..  ..  ..  ..  ..  ..  ..
 09:46 1060  310 0.10 2.96 1.78  1.4 33.6 13.7 11.3  19  17   8   0  22  29   3
 09:47 1067  270 0.11 5.22 1.83  1.1 47.1 10.6  9.0  12  17   6   0  10  35   1
 09:49 1079  290 0.09 4.03 1.76  1.1 38.8 11.4  9.6  18  13   5   0  18  34   3
 09:50 1083  289 0.10 5.39 1.75  1.0 50.1 10.6  9.2  30  12  11   0  20  32   5
 09:52 1093  282 0.11 4.86 1.96  1.3 47.0 11.0  9.6  11  28   4   0  15  35   2
 09:54 1101  256 0.10 2.88 1.83  1.1 26.4 10.7  9.1  32  20   7   0  14  25   3
Then, on the output received:
  • Check the class 1 elapsed time slice shown under heading C1ES. Look for periods of good response times and of bad response times. In the example above, good response times should be below 1.2 seconds for the system shown, so the good, or at least acceptable, response times are those in the 09:15 - 09:21 time span and the bad, or unsatisfactory, ones are those in the 09:46 - 09:54 time span.

    Note: You will have to establish your own thresholds for good or bad response times (the value depends very much on the 'job mix' which is active on the system).

  • Try to locate the problem area by looking for other columns where high and low values are closely related to high and low response times. (A small sketch of such a good/bad comparison follows this list.)
    • Shift the display window to the right (enter the command TRANSACT or press PF11 until the transaction-related fields are shown) and check for a probable paging problem by looking at column %PQ which shows the percentage of in-queue users found in page wait at the end of each interval. It must be related to C1ES if paging really is the problem, but do not forget that all wait state values (%PQ, %IQ, and %EL) are just samples and not averages for the whole interval, so it would not be reasonable to expect perfect correlation. Look at column PG/S (page rate) if you suspect a paging problem. (It should also correlate to C1ES.) Skip to section Analyzing a Paging Bottleneck if paging seems to be the problem.
       FCX101      CPU nnnn  SER nnnnn  Interval 09:12:19 - 09:54:27    Perf. Monitor
      
       TIME   CPU  %CP  %EM  %WT  %SY -  -  IO/S VIO/S PG/S XPG/S DIAG PRIV LOGN ACT
       09:15  382  135  247   18   28 .  .   397   230  549   ... 1468 1295  213 226
       09:16  386  125  261   14   25 .  .   341   189  483   ... 1227 1294  218 260
       09:17  393  125  268    7   27 .  .   408   206  540   ... 1219 1296  206 257
       09:18  395  117  278    5   24 .  .   329   194  478   ... 1199 1291  212 214
       09:20  338  122  216   62   24 .  .   318   179  452   ... 1430 1288  200 228
       09:21  377  143  234   23   28 .  .   327   188  411   ... 1822 1292  205 232
       .....  ...  ...  ...   ..   .  .  .   ..    ..   ..    ... .... ....  ... 234
       .....  ...  ...  ...   ..   .  .  .   ..    ..   ..    ... .... ....  ... 236
       09:46  399  161  238    1   37 .  .   509   294  824   ... 1328 1060  310 229
       09:47  396  142  254    4   32 .  .   483   284  693   ... 1298 1067  270 228
       09:49  400  141  259    0   32 .  .   555   295  665   ... 1233 1079  290 230
       09:50  397  151  246    3   34 .  .   510   280  738   ... 1365 1083  289 218
       09:52  399  154  245    1   37 .  .   493   296  779   ... 1297 1093  282 210
       09:54  398  141  257    2   31 .  .   441   257  641   ... 1292 1101  256 213
    • Check for a storage problem. A severe storage problem could be indicated by users ending up in the eligible list, resulting in any number other than zero in column %EL. This column shows the percentage of in-queue users found in resource wait, and if your CPU is not being overloaded there is a good chance the resource being waited for is storage. Skip to section Analyzing a Storage Problem if you suspect some of your users end up in storage wait.
    • Look for an I/O problem by checking the values in column %IQ, which show the percentage of in-queue users found in I/O wait at the end of each interval. %IQ should be related to C1ES if you really have a general I/O bottleneck. However, there can be variations here: the I/O problem may be restricted to a group of users that depend on the same overloaded disk volume, while all others do not experience it, and %IQ might then not indicate the problem very clearly. Also look at column IO/S (total I/O rate), which should correlate with both C1ES and %IQ. Skip to section Analyzing an I/O Bottleneck if general I/O seems to be at fault.

      Note that a certain number of users in I/O wait is usually much less of a problem than the same number of users in page wait: I/O wait is part of normal processing and will not disappear even on very lightly loaded systems. Page wait, on the other hand, will always increase response times. What %IQ value is 'normal' depends mainly on the job mix being run on any given system; use as a standard the values that you find on your own system while it is fairly busy but response times are still acceptable.

    • Finally, check for a CPU problem by looking at the values in column CPU (total CPU busy percentage). If they are at the system's maximum much of the time, and if the sum of Q1 + Qx is high while the values in columns %PQ and %IQ are low, then skip to section Analyzing an Overloaded CPU. This seems to have been the bottleneck in the example shown: the %IQ value is definitely lower for the 'bad' measurements, and, although the %PQ values are also somewhat higher, CPU load is already at its maximum, so we do not yet see much of the effect of a developing paging bottleneck.
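
The same good/bad comparison can be made outside the Toolkit once the interval data is available, for example after printing the REDISPLAY buffer. The sketch below is again only an illustration, not a Toolkit function: it assumes a few intervals have already been parsed into tuples, and it uses the 1.2-second C1ES limit from this example as the site-specific threshold.

  # Illustrative sketch only: compare indicator averages for 'good' and 'bad'
  # intervals, split by a site-specific C1ES threshold (1.2 s in this example).
  samples = [
      # (C1ES, %PQ, %IQ, CPU, PG/S) -- values taken from the sample screens above
      (1.17, 18, 22, 382, 549),
      (1.10,  9, 31, 386, 483),
      (1.16, 17, 27, 393, 540),
      (1.78, 19, 17, 399, 824),
      (1.83, 12, 17, 396, 693),
      (1.96, 11, 28, 399, 779),
  ]
  THRESHOLD = 1.2  # establish your own limit from your history data

  def mean(values):
      return sum(values) / len(values)

  good = [s for s in samples if s[0] < THRESHOLD]
  bad = [s for s in samples if s[0] >= THRESHOLD]
  for i, name in enumerate(("%PQ", "%IQ", "CPU", "PG/S"), start=1):
      print(name, "good:", round(mean([s[i] for s in good]), 1),
            "bad:", round(mean([s[i] for s in bad]), 1))

For the values above, %IQ drops and PG/S rises in the bad intervals while the CPU value stays near its maximum, which matches the interpretation given in the last bullet.
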
You will also find all of these system load fields on the general CPU screen, but the REDISPLAY screen is better suited for a general overview: the many measurements shown on the same screen let you see how other key system load values correlate with interactive response time, and this should give you the necessary pointers for proceeding.

In the above guidelines, we have always taken the wait states of in-queue users as indicators for system bottlenecks. The REDISPLAY screen shows these values together with the internal system response time C1ES and other system load figures, which is why we have used it. The %PQ and %IQ values shown there are not averages but single samples, and they will usually vary quite a bit. Have a look at the user wait state log screen (command 'USTLOG') for a more reliable and more detailed overview of user wait states, based this time on the CP monitor's high-frequency sampling technique.

 FCX135      CPU nnnn  SER nnnnn  Interval 07:26:00 - 11:06:00    Perf. Monitor
                                                          <-SVM and->
 End Time   %ACT  %RUN %CPU %LDG %PGW %IOW %SIM %TIW %CFW %TI %EL %DM %IOA %OTH
 >>Mean>>      4     3    4    0   16    0    8   22   24   4   0  16    3    0
 09:01:00      4     2    3    0   23    0    9   24   25   4   0  10    4    0
 09:06:00      5     2    3    0   28    0   10   22   21   3   0  11    4    0
 09:11:00      4     2    3    0   26    0    9   24   20   3   0  13    3    0
 09:16:00      3     2    2    0   34    0    8   23   15   4   0  11    3    0
 09:21:00      4     3    4    0   27    0    8   24   18   3   0  13    3    0
 09:26:00      3     3    4    0   32    0   10   23   16   3   0   9    3    0
 09:31:00      3     2    2    0   38    0    9   22   15   3   0   9    3    0
 09:36:00      2     4    3    0   35    0    8   21   16   3   0  10    3    0
 09:41:00      2     4    3    0   31    0   10   18   18   4   0  12    4    0
 09:46:00      3     3    4    0   36    0    9   17   19   3   0   9    3    0
 09:51:00      3     4    3    0   34    0    8   20   20   3   0   8    2    0
 09:56:00      2     3    1    0   39    0    7   19   21   2   0   8    2    0
 ........     ..    ..   ..   ..   ..   ..   ..   ..   ..  ..  ..  ..   ..   ..

The display shows user wait states by time, similar to the REDISPLAY screen. If you want to see that same information by user, select the user wait state screen (command USTAT):
 FCX114      CPU nnnn  SER nnnnn  Interval 11:16:00 - 11:21:00    Perf. Monitor
 .          ____     .    .    .    .    .    .    .    .   .   .   .    .    .
                                                          <-SVM and->
 Userid     %ACT  %RUN %CPU %LDG %PGW %IOW %SIM %TIW %CFW %TI %EL %DM %IOA %OTH
 >System<      3     4    4    0   31    0   10   23   14   3   0   9    2    0
 VSCS        100     8    3    0    7    0    2   42    0  37   0   0    0    0
 VTAM        100     1    4    0    5    0    2   14    0  60   0   0   14    0
 FTPQH       100     0    0    0   24    0    0    0    0   5   0  71    0    0
 RSCS         99     1    3    0   15    0   32    0    8   0   0   0   42    0
 SDTRACK      88     8    6    0   36    0   42    4    0   0   0   0    5    0
 TOOLS        78     0    1    0   32    0   60    0    7   0   0   0    0    0
 RACFVM       68     7    2    0    8    0    6   64    2   0   0   0   10    0
 CHCAL        43     2   13    0   14    0   18   51    2   0   0   0    0    0
 FTPDS2       43     0    0    0   33    0    0    4    0   6   0  57    0    0
 ........     ..    ..   ..   ..   ..   ..   ..   ..   ..  ..  ..  ..   ..   ..
Evaluation of the wait state figures in these screens is similar to that made for the REDISPLAY screen; a short illustrative sketch of these checks follows below. Look for frequent:
  • CPU wait states in column %CPU as an indicator for a CPU bottleneck
  • Page wait states in column %PGW as an indicator for a paging bottleneck
  • I/O wait states in columns %IOW and %SIM as indicators for an I/O bottleneck

Because the values in both these screens are based on 'high-frequency' sampling data from the CP monitor, they are much more reliable than the information in the REDISP output (which is based on single samples). However, you must be aware that I/O bottlenecks will usually not show up as high %IOW values: All diagnose I/O activity by CMS machines will be shown as 'instruction simulation wait' in the %SIM column, and not as %IOW.

The above example is from a system with a paging problem, as indicated by the high page wait percentages.
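
To make the checks listed above concrete, here is one more illustrative sketch: a hypothetical helper (not a Toolkit facility) that applies them to a single USTLOG or USTAT row. The cut-off values are arbitrary examples; substitute the levels you consider normal on your own system.

  # Illustrative sketch only -- a hypothetical helper, not a Toolkit function.
  def likely_bottleneck(row):
      """row: wait-state percentages from one USTLOG/USTAT line."""
      hints = []
      # Arbitrary example cut-offs; use your own 'normal' levels instead.
      if row.get("%CPU", 0) > 10:
          hints.append("CPU (frequent CPU wait)")
      if row.get("%PGW", 0) > 10:
          hints.append("paging (frequent page wait)")
      # CMS diagnose I/O shows up as instruction simulation wait (%SIM), not %IOW.
      if row.get("%IOW", 0) + row.get("%SIM", 0) > 20:
          hints.append("I/O (frequent I/O or instruction simulation wait)")
      return hints or ["no obvious bottleneck in this sample"]

  # The >System< line of the USTAT example above:
  print(likely_bottleneck({"%CPU": 4, "%PGW": 31, "%IOW": 0, "%SIM": 10}))

For the >System< values shown above, this flags paging, which matches the conclusion drawn from the high page wait percentages.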

You can also use the graphic user status display (command USTATG) if you have GDDM and a display terminal with graphics capability:
 Sample graphic user wait state display (USTATG command)

Concentrate on the last bar, marked System Average. Like the >System< values of the USTAT display, it shows the average non-dormant wait states for all users, and it should give you reliable pointers to potential problem areas (the example is not from the same machine as the previous USTAT display).