README for nmon for AIX version 11e
This file is a few quick notes on what is new in this latest nmon for AIX 5 release
to get you started and using the new features.
The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.
Summary:
- New Simpler Online Look
- Volume Group Disk Stats
- Paging Space Stats
- Command Summary
- Error messages go out to stderr
- Network stats fixed overflow problems.
- Network errors and auto off
- Protect against WLM library AIX memory leaks if WLM off
- AIX 5.2 Low ML problem work around attempt
- CPU Stats on POWER5 with AIX 5.3 and Shared Processor LPARs = SPLPAR.
- Making nmon numbers match AIX commands
- JFS2 numclient and maxclient numbers
- Wide windows in X Windows makes the columns wider in the Top view
- WLM works again
- nmon spots the VIO Server and reports the VIO Server version
In detail:
1) New Simple Online Look
This is the first thing you may notice on startup online. I have removed the unpopular ping-pong mode, where data that could not be shown was delayed and the primary and secondary screens were displayed on alternate screen updates.
The new look uses fewer boxes so it does not waste so much screen space. There is only one line between different data types and that line is also the BANNER line. Internally it uses curses pads instead of curses windows, which makes updates to dynamically sized windows much simpler - this reduced the code by 300 lines.
When there is more data than can be displayed on the screen, nmon will now simply output a message at the bottom of the screen as a warning that some data
may not be seen. This addresses my worry that some people may not realise they can't see all the data and make poor tuning decisions based on misinformation.
2) Volume Group Disk Stats
This outputs the stats for disks, grouped together by their volume groups. Many benchmark people ask what use this is, but in production many disks
are grouped this way initially by their use, and additional disks are perhaps added later as a whole new volume group. This allows users to see if disk I/O is
balanced across the volume groups.
For example:
| Volume-Group-I/O
|Name Disks AvgBusy Read|WriteKB/s TotalMB/s xfers/s BlockSizeKB |
|vg01 3 0.0% 0.0|0.0 0.0 0.0 0.0 |
|rootvg 1 0.0% 0.0|0.0 0.0 0.0 0.0 |
|not available 1 0.0% 0.0|0.0 0.0 0.0 0.0 |
| VGs= 3 TOTALS 5 0.0% 0.0|0.0 0.0 0.0 |
Notes:
- "Not available" is my CDROM as it is not mounted.
- You can also get a name of "None" for disks like RDAC drives.
- Don't blame me - this is what AIX returned in the libperfstat API.
Activation:
- Online hit V
- File collection add -V to the command line
3) Paging Space Stats
We have had paging stats for a long time, but these stats describe each paging space in terms of name, volume group, size, size in use, disk I/O queued up and status.
On my workstation:
|Paging-Space-Statistics
| Volume-Group PagingSpace-Name Type LPs MB Used IOpending |
| rootvg hd6 LV 4 512 0% 0 Active Auto |
| rootvg paging00 LV 8 1024 0% 0 Active Auto |
| VGs= 3 TOTALS 5 0.0% 0.0|0.0 0.0 0.0 |
Activation:
- Online hit P
- File collection add -P
4) Command Summary
This feature lets you see the stats for all the processes with the same command name added together and reported. This gives a new way to see which commands are consuming resources on the machine such as CPU, memory or character I/O, along with the number of processes.
Note: as nmon is adding up these numbers ALL processes are included.
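As a rough illustration of the idea only (this is not the nmon source), the adding up by command name can be sketched in Python. The field names command, cpu_pct and mem_kb and the sample values are made up for the example.

```python
from collections import defaultdict


def summarise_by_command(processes):
    """Add up per-process stats for every process sharing a command name."""
    totals = defaultdict(lambda: {"count": 0, "cpu_pct": 0.0, "mem_kb": 0})
    for proc in processes:
        entry = totals[proc["command"]]
        entry["count"] += 1                 # number of processes with this name
        entry["cpu_pct"] += proc["cpu_pct"]
        entry["mem_kb"] += proc["mem_kb"]
    return dict(totals)


# Hypothetical sample data, not real nmon output.
sample = [
    {"command": "ksh", "cpu_pct": 0.5, "mem_kb": 600},
    {"command": "ksh", "cpu_pct": 0.0, "mem_kb": 580},
    {"command": "java", "cpu_pct": 12.0, "mem_kb": 90000},
]
summary = summarise_by_command(sample)
for name, stats in sorted(summary.items()):
    print(name, stats["count"], stats["cpu_pct"], stats["mem_kb"])
```

Note the idle ksh is still counted, which matches the point above that ALL processes are included in the totals.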
In file collection mode nmon uses a threshold to remove processes that are not using CPU, to reduce the lines of output. For example, my workstation has 136
processes and I am the only user !! Out of which only a dozen at maximum are worth monitoring. If nmon output all processes this would be, say, 128*500 lines
of pointless output, which is enough to break Excel at the 64,000 line barrier.
Warning: this option can also break Excel if you are not careful.
If you look online you will see the number of different process names in the third line "Total=NNN". If you collect these stats and it breaks Excel,
you can remove these new lines by grepping out the SUMMARY lines.
This can be useful as I found I have 19 ksh and 3 Java programs running and needed to investigate further.
Stephen has added support for this Top Summary data.
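If you prefer a script to grep, removing the SUMMARY lines might look like the sketch below. It behaves like "grep -v SUMMARY", i.e. it drops any line containing that text. The sample lines are invented placeholders and not the real nmon file layout.

```python
def drop_summary_lines(lines, tag="SUMMARY"):
    """Keep every line that does not contain the tag (like grep -v)."""
    return [line for line in lines if tag not in line]


# Hypothetical capture file content, for illustration only.
captured = [
    "TOP,example process line",
    "SUMMARY,example summary line",
    "CPU01,example cpu line",
]
kept = drop_summary_lines(captured)
print("\n".join(kept))
```

For a real file, read the lines in, filter them and write the result to a new file so the original capture is kept.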
Activation:
- Online hit T and then hit 6
- File collection add -Y
Note: DO NOT USE the -t or -T option with -Y
5) Error messages go out to stderr
This change might catch some people out. nmon tries very hard to report errors, which can result from ML changes to the kernel and libraries and from other environmental problems. These were previously reported to stdout, where in curses mode they could get lost and in file capture mode they could end up as unexpected output or go missing. For online use this means you can redirect the errors to a file for reporting them, and in file capture mode cron will return the errors in email unless you have redirected them elsewhere.
6) Network stats fixed overflow problems.
The AIX kernel internal stats keep the bytes in and out in unsigned 32 bit integers. Unfortunately, networks are getting really fast these days. The latest 10 Gbit/second adapters move roughly 1 GByte per second, which will overflow a 32 bit unsigned integer (maximum is about 4 billion) every 4 seconds. A 1 Gbit adapter overflows every 40 seconds. Online this will not be noticed, but when capturing to file it will be hit regularly, and in the past nmon output zero on overflow.
This is a long standing problem in AIX/UNIX but requires ALL device drivers to be updated before the restriction can be lifted.
Note the libperfstat library returns unsigned long longs but the numbers still overflow at 32 bits. nmon now works around this problem by capturing network stats every second, spotting the overflows as they happen and adding the missing stats. This may cause increased nmon CPU use but there is really no option.
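The work around can be sketched as follows. This is a minimal illustration in Python, not the nmon C code, and it assumes the counter is read often enough that it wraps at most once between two readings (which is why sampling every second matters). The function names are invented for the example.

```python
WRAP = 2 ** 32  # a 32 bit unsigned counter wraps back to zero here


def counter_delta(previous, current):
    """Bytes moved between two readings, allowing for one wrap."""
    return (current - previous) % WRAP


def total_bytes(readings):
    """Add up the deltas across a series of once-a-second readings."""
    return sum(counter_delta(a, b) for a, b in zip(readings, readings[1:]))


def seconds_to_wrap(bytes_per_second):
    """How long a 32 bit byte counter lasts at a given rate."""
    return WRAP / bytes_per_second


# The counter wraps between the first and second reading.
print(total_bytes([4294967000, 704, 1704]))
print(seconds_to_wrap(1_000_000_000))  # roughly 1 GByte per second
print(seconds_to_wrap(100_000_000))    # roughly a tenth of that
```

The last two lines give about 4.3 and 43 seconds, in line with the rough "every 4 seconds" and "every 40 seconds" figures above.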
7) Network errors auto off
While knowing you have network errors is VERY important, it gets boring having zero stats every time when you have the network running sweetly. Online now
nmon will show you the network errors, but if you have no errors for three successive updates it will switch off the error stats. If you later have errors they will appear again.
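The behaviour described can be pictured as a small state machine. The sketch below is only my illustration of the rule "three successive clean updates hides the stats, any error brings them back". The class and attribute names are made up.

```python
class ErrorPanel:
    """Tracks whether the network error stats should be on screen."""

    QUIET_LIMIT = 3  # successive error-free updates before switching off

    def __init__(self):
        self.visible = True
        self.quiet_updates = 0

    def update(self, error_count):
        """Feed in the errors seen in this update; returns visibility."""
        if error_count > 0:
            self.quiet_updates = 0
            self.visible = True       # errors are back, show them again
        else:
            self.quiet_updates += 1
            if self.quiet_updates >= self.QUIET_LIMIT:
                self.visible = False  # nothing to report, save the space
        return self.visible


panel = ErrorPanel()
history = [panel.update(n) for n in [0, 0, 0, 2, 0]]
print(history)
```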
8) Collecting WLM stats when WLM is off problems
Some releases of AIX leak memory badly if you request WLM stats but have WLM switched off. It is a really bad leak, something like 15 to 20 MB at each failed attempt to initialise the WLM library access. This means that the nmon process grows in size and somewhere between 150 and 200 snapshots the nmon process is 256 MB in size and crashes as it can't malloc any more memory. Thankfully, nmon is not compiled in such a way as to access more than 256 MB of memory!! This was encountered by users using the same nmon parameters for data collection on lots of machines (some with WLM on and some with it off).
In this release, we have a "three attempts and you're out" policy. This will also switch off the stats online.
This may be fixed in newer AIX maintenance releases but many people do not use the latest ML available!!
9) AIX 5.2 Low ML work around attempt
Some of the lower Maintenance Levels (ML) of AIX 5.2 have backward compatibility issues in the performance tools area that were fixed in later MLs. This causes the nmon ASSERT errors where libperfstat refuses to work with code that worked happily on AIX 5.2 ML02. The best fix is to upgrade to the latest AIX 5.2 ML, but you need to be careful and make sure you have the latest firmware on the machine and have AIX support check out your levels before doing so.
I also realise many sites will not or can not upgrade due to build level standards and update scheduling. In an attempt to work around this there are two versions of nmon for AIX 5.2:
- nmon_aix52ml2 For GA/Gold, ML01 and ML02 machines
- nmon_aix52ml5 for machines with ML05 or above.
This leaves AIX 5.2 ML03 and ML04 - I have not got examples of these, but the ML05 version may work, or you can use the nmon for AIX 5.1 that I know will work, i.e. nmon_aix51, but there may be a few stats missing as AIX 5.1 does not support all the features.
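The choice of binary on AIX 5.2 can be summarised as a small lookup. This is just a sketch of the rules above in Python. The function name is invented and the ML number is assumed to be already known as an integer (0 for GA/Gold).

```python
def nmon_candidates_for_aix52(maintenance_level):
    """Return the nmon binaries to try, in order, for an AIX 5.2 ML."""
    if maintenance_level < 0:
        raise ValueError("maintenance level cannot be negative")
    if maintenance_level <= 2:
        return ["nmon_aix52ml2"]          # GA/Gold, ML01 and ML02
    if maintenance_level >= 5:
        return ["nmon_aix52ml5"]          # ML05 or above
    # ML03 and ML04 are untested: the ML05 build may work, otherwise
    # fall back to the AIX 5.1 build (a few stats may be missing).
    return ["nmon_aix52ml5", "nmon_aix51"]


for ml in (0, 2, 3, 4, 5):
    print(ml, nmon_candidates_for_aix52(ml))
```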
10) CPU Stats on POWER5 with AIX 5.3 and Shared Processor LPARs = SPLPAR.
The CPU utilisation numbers in this case are meaningless. We all know that we have to use the physical CPU stats in this case and not Utilisation - right !!!
The nmon10 numbers were collected from libperfstat but rarely agreed with AIX tools like mpstat and lparstat. I have been thinking and working on these a lot for this release. For SPLPAR we now get 2 lines of output in addition to the CPU util, as below:
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 0.267-----|
|Logical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 0.0 0.0 0.0 100.0| > ||
| 3 0.0 0.0 0.0 100.0|> ||
| 6 50.3 0.0 0.0 49.7|UUUUUUUUUUUUUUUUUUUUUUUUU >||
| 7 51.5 0.0 0.0 48.5|UUUUUUUUUUUUUUUUUUUUUUUUU >|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
| EC 52.6 0.2 0.0 0.6|UUUUUUUUUUUUUUUUUUUUUUUUUU-----------------------||
| VP 13.2 0.1 0.0 0.2|UUUUUU-------------------------------------------||
|EC= 53.5% VP= 13.4% +--Capped---|------------|-----------100% VP=2 CPU+|
So this SPLPAR has an Entitled Capacity (EC) of 0.5 and is using 0.267 CPUs, or 53% of the EC. It is capped, but if it were not it could go over 100% of Entitlement. A more important number is the percentage of the uncapped maximum. This is limited by the virtual processors: in this case 2 physical CPUs, shown here as 4 logical CPUs because SMT=on. So the important numbers are 53% of EC and VP=13% (needs uncapping to get to 100%).
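The two percentages are simple ratios of the used physical CPU. The sketch below shows the arithmetic only and assumes the UsedCPU, EntitledCPU and VP values are read off the screen above. The function name is made up, and nmon's own figures can differ slightly in the last digit because the displayed UsedCPU is rounded.

```python
def splpar_percentages(used_cpu, entitled_cpu, virtual_processors):
    """Return (percent of Entitled Capacity, percent of Virtual Processors)."""
    ec_pct = 100.0 * used_cpu / entitled_cpu
    vp_pct = 100.0 * used_cpu / virtual_processors
    return ec_pct, vp_pct


# Values from the screen above: EntitledCPU=0.50, UsedCPU=0.267, VP=2
ec_pct, vp_pct = splpar_percentages(0.267, 0.50, 2)
print(f"EC={ec_pct:.1f}% VP={vp_pct:.1f}%")
```

This gives roughly 53% of EC and 13% of VP, the same picture as the EC= and VP= line on the screen.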
There is now a new option to look at the physical CPU use by hitting #
Then you get the below:
PHYSICAL
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 0.267-----|
|Physical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 2.3 0.0 0.0 0.1|U--------->--------------------------------------||
| 3 0.0 0.0 0.0 0.3|>------------------------------------------------||
| 6 11.7 0.1 0.0 0.1|UUUUU------------------------------------------->||
| 7 11.9 0.0 0.0 0.1|UUUUU-------------------------------------------->|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
| EC 51.6 0.3 0.0 1.1|UUUUUUUUUUUUUUUUUUUUUUUUU-----------------------i||
| VP 12.9 0.1 0.0 0.3|UUUUUU-------------------------------------------||
|EC= 53.0% VP= 13.2% +--Capped---|------------|-----------100% VP=2 CPU+|
Note here we now have:
- U=User as normal
- s=System as normal
- W=wait for I/O as normal
- i=idle, normally shown as blanks on the screen
- -= some other LPAR has "my" CPU !!
Here you see that processors 6 and 7 are actually one physical CPU and it is busy 11.7% and 11.9% on the two threads. Together this is roughly 25% of the time
and this is half of the EC.
Note: in Physical or PURR mode (i.e. hitting #) if you add the two threads' percentages they can never exceed 100%. This is because the PURR is incremented
as each instruction is scheduled for each thread i.e. it is a ratio of how busy they are.
This view gives you a much better idea of how much CPU power the SPLPAR is actually using.
If you run a single threaded job you get something like this:
| 6 60.7 0.1 0.0 0.3|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU------------------->|
| 7 3.0 0.0 0.0 6.3|Uiii--------------------------------------------->|
Here the first thread is running 60% of the time and the other thread is running 3 percent, plus the AIX idle loop for 6.3%. We can't yield this CPU time back to the hypervisor as we are still using the other thread.
Advanced Look and nmon on a SPLPAR
Below is a series of screens that show an overworked SPLPAR. Again 2 Physical CPUs, VP=2, EC=0.5 and it is Capped with SMT=on.
Note: there are some small fluctuations in workload as it runs, so ignore small changes in CPU use below.
Logical CPU stats
Below we see the SPLPAR looking OK with 88% Logical CPU utilisation. This should be read as "while actually on the physical CPU, it is running a user process 88% of the time". But something is very odd - it is only using one CPU; the two threads on the first CPU (numbers 2 and 3) are not even used! This is due to some very clever tricks being used by AIX. The SPLPAR has an Entitlement of 0.5 CPUs - this can be handled easily by one CPU. In fact it is more efficient to just use the one CPU, as then it is 100% local memory cache. So what AIX does is "fold" away the one Virtual CPU and just run on the one remaining CPU. We can also see here that we are near the Entitlement, "EC=93.1%", and this was actually often 100%.
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 0.465-----|
|Logical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 0.0 0.0 0.0 100.0| > ||
| 3 0.0 0.0 0.0 100.0|> ||
| 6 88.0 0.0 0.0 12.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU >||
| 7 88.5 0.0 0.0 11.5|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU >|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
| EC 92.0 0.2 0.0 0.9|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU---i||
| VP 23.0 0.1 0.0 0.2|UUUUUUUUUUU--------------------------------------||
|EC= 93.1% VP= 23.3% +--Capped---|------------|-----------100% VP=2 CPU+|
Now we have hit # and are looking at the Physical CPU use. Now we see that instead of 88% the SPLPAR is running 25% of the time on each of the two threads of one CPU. These two threads make one CPU, so that CPU is running this LPAR 50% of the time. So how much faster can it run if we UnCap the SPLPAR? This is the famous latent demand problem - there is no way of really knowing without actually trying it. But we can do this safely: as VP=2, this SPLPAR can't use more than two CPUs regardless of the number of CPUs in the machine.
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 0.500-----|
|Physical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 0.0 0.0 0.0 0.0|------------>------------------------------------||
| 3 0.0 0.0 0.0 0.0|>------------------------------------------------||
| 6 24.9 0.1 0.0 0.0|UUUUUUUUUUUU----------------------------------->-||
| 7 25.0 0.0 0.0 0.0|UUUUUUUUUUUU------------------------------------->|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
| EC 99.7 0.2 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU||
| VP 24.9 0.1 0.0 0.0|UUUUUUUUUUUU-------------------------------------||
|EC= 100.0% VP= 25.0% +--Capped---|------------|-----------100% VP=2 CPU+|
Below we have UnCapped the SPLPAR and wow, things have changed. From what looked like a happily running 88%, we are now running and taking 1.97 CPUs worth of CPU time - we are running four times faster. In this example, the two CPUs were not in use for other work. If they were, the SPLPAR would get less CPU time, shared out using the weighting factors. Remember the numbers for the two threads on a single CPU come from the PURR, show the split of work between the threads and combined add up to 100%. Now the EC becomes EC+ to highlight going over the Entitlement. Now we are using 394% of the Entitlement = approximately four times. Now we are taking 98.6% of the Virtual Processors (VP) and the only way to go faster still is to increase the VP count. This would allow more CPUs to be used in this SPLPAR.
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 1.972-----|
|Physical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 61.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU>------------------||
| 3 36.5 0.0 0.0 2.4|UUUUUUUUUUUUUUUUUU>------------------------------||
| 6 44.5 0.1 0.0 0.9|UUUUUUUUUUUUUUUUUUUUUU------------------------->-||
| 7 51.6 0.0 0.0 0.2|UUUUUUUUUUUUUUUUUUUUUUUUU------------------------>|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
|EC+ 98.2 0.1 0.0 1.8|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU||
| VP 96.8 0.1 0.0 1.8|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU-||
|EC= 394.5% VP= 98.6% +--No Cap---|------------|-----------100% VP=2 CPU+|
If we now look at the Logical view again we see only two differences from the original Logical view.
- both CPUs are in use - CPU 2 and 3 are being utilised
- The VP numbers are much higher
This highlights how the Logical CPU utilisation numbers can be very misleading.
| CPU-Utilisation-Small-View -----------EntitledCPU= 0.50 UsedCPU= 2.000-----|
|Logical CPUs 0----------25-----------50----------75----------100|
|CPU User% Sys% Wait% Idle%| | | | ||
| 2 100.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU>|
| 3 74.5 0.0 0.0 25.5|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU > ||
| 6 81.0 0.0 0.0 19.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU > ||
| 7 100.0 0.0 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU>|
|EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+|
|EC+ 97.9 0.1 0.0 2.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUi||
| VP 97.9 0.1 0.0 2.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUi||
|EC= 399.9% VP= 100.0% +--No Cap---|------------|-----------100% VP=2 CPU+|
11) Making nmon numbers match AIX commands
Some effort has gone into making the nmon numbers match AIX commands (like lparstat and mpstat), for shared processor stats in particular. You still have to be very careful that you are comparing similar numbers, as many AIX commands that historically report CPU numbers are now ambiguous in what they report, i.e. are they reporting Physical CPU, Virtual CPU or Logical CPU numbers? It is not clear in the AIX manuals.
nmon 11e takes this further by following the bizarre AIX fudging of the numbers.
The nmon 11d CPU_EC_USE and CPU_VP_USE spreadsheet sections have gone. I spotted that the graphs were identical - only the axis changes - so these are now in the LPAR sheet and graphed there. The notes below now apply to the LPAR sheet graphs.
In addition there are two new graphs when collecting data to a file with Shared Processor LPARs (requires POWER5 and AIX 5.3 or Linux):
- Compares the Physical CPU use with the Entitled Capacity (EC) - if the LPAR is uncapped and the Virtual Processors is set to the maximum allowed by the Entitled Capacity (i.e. ten times) then this can peak at 1000%. If uncapped and the VP is lower then this will reduce the maximum Physical CPU use and so these graphs. If capped then ~100% is the maximum. This should be useful in determining if you have the EC set correctly. If the LPAR CPU use is always well under the EC then you may consider reducing the EC. If the LPAR CPU use is always very much higher than the EC this might be working as expected (the LPAR is using spare resources to boost performance) or you may think that it is getting far more than necessary and reduce this via the weighting factor, by reducing the VP count or even by capping the LPAR.
- Compares the Physical CPU use with the number of Virtual Processors (VP) - the VP is the absolute maximum number of CPUs an LPAR can reach. For example, an EC of 3.5 and a VP of 5 means the LPAR can peak at no more than 5 whole CPUs. If the LPAR is always up at the VP count and you want it to go faster or hit higher peaks you can increase the VPs. With AIX 5.3 ML03 or higher, AIX will fold unused Virtual Processors so there is no harm in having too many. With Linux or older AIX versions, you may want to reduce the VP if they are never used, to increase CPU efficiency.
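To get a feel for where the first graph can top out, the ceiling can be worked out from the EC, the VP count and the capped setting. This is a rough sketch of the arithmetic described above and not taken from nmon. The function name is invented.

```python
def max_percent_of_ec(entitled_cpu, virtual_processors, capped):
    """Highest Physical CPU use, as a percentage of EC, the LPAR can reach."""
    if capped:
        return 100.0  # a capped LPAR cannot go over its Entitlement
    # Uncapped: the Virtual Processor count is the absolute limit.
    return 100.0 * virtual_processors / entitled_cpu


print(max_percent_of_ec(0.5, 5, capped=False))   # VP at ten times the EC
print(max_percent_of_ec(0.5, 2, capped=False))   # the SPLPAR used earlier
print(max_percent_of_ec(3.5, 5, capped=False))   # EC 3.5 with VP 5
print(max_percent_of_ec(0.5, 2, capped=True))
```

The second case gives 400%, which is why the uncapped example earlier peaks at just under 400% of EC with VP=2.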
12) JFS2 numclient and maxclient
These are now included and are important for tuning JFS2.
In nmon version 11d these are reported correctly at the end of the MEMUSE section as percentages of memory.
13) Wide windows in X Windows makes the columns wider in the Top view
If you make your nmon window wider than 119 characters this will also make the columns wider in the TOP view.
This is a work around for when you have very large processes and the process memory grows to GBs, so that the numbers no longer fit and stop lining up.
14) WLM works again - sorry about that
nmon 11c and 11d had problems displaying WLM that are now fixed.
This was due to version control in the AIX headers, where you can request the WLM version 2 API (see /usr/include/sys/wlm.h) and you get back gibberish!
This may cause problems on AIX 5.2 - I compile for ML2 and ML5 - nmon works fine on my systems 
15) nmon spots the VIO Server and reports the VIO Server version
On the welcome screen it will say something like: 1.2.1.4-FP-7.4 - Virtual I/O Server and the same on the resources screen (hit r).
And when saved to file there is a new line something like this: AAA,VIOS,1.2.1.4-FP-7.4
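If you post-process the capture files yourself, picking out this line could look like the sketch below. It assumes only the AAA,VIOS,version layout shown above. The function name and the other sample line are made up.

```python
def find_vios_version(lines):
    """Return the VIO Server version from an nmon capture, or None."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) >= 3 and fields[0] == "AAA" and fields[1] == "VIOS":
            return fields[2]
    return None  # not a VIO Server, or an older nmon


# The second line is the example above; the first is a placeholder.
capture = [
    "AAA,example,placeholder",
    "AAA,VIOS,1.2.1.4-FP-7.4",
]
version = find_vios_version(capture)
print(version)
```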
I hope these improvements will make nmon even better, thanks Nigel Griffiths