nmon FAQ
Added by nagger, last edited by nagger on Jul 13, 2009

Frequently Asked Questions

The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.

Summary of the questions:

  • Question 1: Which nmon for my version of AIX or Linux?
  • Question 2: nmon crashes shortly after starting a data capture, please send me the next version?
  • Question 3: I have a problem with nmon running on AIX 4.0.3 (or any really old AIX versions)?
  • Question 4: All I get is "nmon not found"?
  • Question 5: Can you add the monitoring tape drive on AIX?
  • Question 6: Can I get the adapters stats from other tools?
  • Question 7: When I start nmon 9 on a system where it used to run fine I now get an error message?
  • Question 8: What is the most reported error for nmon?
  • Question 9: Can you add the monitoring of process priority?
  • Question 10: on AIX, nmon 9 does not run, please fix?
  • Question 11: Can I decide the filename it saves data to?
  • Question 12: What is the default output filename?
  • Question 13: I want nmon output piped into a further command, how?
  • Question 14: Why do you support all these old unsupported AIX versions?
  • Question 15: What if I want support?
  • Question 16: Why don't you add a Java front end to nmon and get graphical output?
  • Question 17: The command line options don't seem to work right for file capture?
  • Question 18: What is paging to a filesystem?
  • Question 19: Where can I get nmon and further information?
  • Question 20: nmon crashes after about 200 snapshots on AIX?
  • Question 21: TOP process stats get switched on when I request Asynchronous I/O stats?
  • Question 23: nmon2rrd fails, please fix it?
  • Question 24: NANQ and INF?
  • Question 25: nmon and AIX commands do not agree?
  • Question 26: nmon reports more than 100% for a process - clearly it is wrong?
  • Question 27: On AIX the disk adapters are wrong?
  • Question 28: on AIX the adapter busy goes over 100%. That is impossible surely?
  • Question 29: What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?
  • Question 30: What about nmon for Windows?
  • Question 31: Seeing double the number of CPUs?
  • Question 32: 0509-036 Cannot load program /usr/lib/drivers/nfs_kdes.ext ?
  • Question 33: Hello, I am new to UNIX and want to tune AIX, what do you recommend?
  • Question 34: CPU wait is too high, how can I reduce it?
  • Question 35: On AIX, free memory is near zero, how do I free more memory?
  • Question 36: How can I set numperm better?
  • Question 37: What format is the nmon output file?
  • Question 38: I have collected once a second for 8 hours but I can't get the Analyser to work?
  • Question 39: nmon does not work on my Linux machine!!
  • Question 40: When do we get nmon 10 for Linux?
  • Question 41: The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?
  • Question 42: I have 2400 disks (small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?
  • Question 43: Adapter stats and IOADAPT are not saved to the nmon file and seem to be missing with AIX 5.1?
  • Question 44: What is CharIO (a column of the TOP processes stats)?
  • Question 45: On Linux the disk stats are all doubled?
  • Question 46: On AIX the disks seem to be mostly on the first adapter?
  • Question 47: On nmon for Linux the CPU Wait for IO number is zero or odd?
  • Question 48: On nmon for Linux the paging details are missing and the PAGE lines for the capture to file are missing.
  • Question 49: I want to collect data every second and then see weekly and monthly reports. How?
  • Question 50: nmon will not start on AIX 5.1 due to a libperfstat error?
  • Question 51: How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?
  • Question 52: The Disk Busy stats are missing on AIX
  • Question 53: Sort order problems with massive nmon output files.
  • Question 54: AIX 5.3 updated but then nmon gives "Illegal instruction(coredump)"
  • Question 54: AIX 5.3 updated but then nmon gives "Assert Failure"
  • Question 55: On AIX 5.3 ML6, nmon output files contain zeros, missing CPU stats, corrupt ZZZ lines and "nfs" strings found in the stats
  • Question 56: Does nmon capture point in time stats or averages?
  • Question 57: Why is the Process memory percentage zero? (same for System and User percent)
  • Question 100: When will nmon collect data from lots of machines or LPARs?
  • Question 101: When will nmon collect data like "topas -C"?

Question 1: Which nmon for my version of AIX or Linux?

Answer: This nmon release works on the following AIX and Linux versions. The nmon filename to use is given for each:

Operating System      Processor     nmon File             Stable nmon Version  Latest Beta Version
AIX 4.anything        POWERPC       nmon_aixXXX           9f                   no new versions
AIX 5.anything        POWERPC       nmon12d_aixXXX        12d for AIX
Linux SUSE SLES 8     Intel or AMD  nmon_x86_sles8        11c for Linux
Linux SUSE SLES 9     Intel or AMD  nmon_x86_sles9        11c for Linux
Linux Fedora3         Intel or AMD  nmon_x86_fedora3      11c for Linux
Linux Red Hat 9       Intel or AMD  nmon_x86_redhat9      11c for Linux
Linux Red Hat EL2.1   Intel or AMD  nmon_x86_rhel2        11c for Linux
Linux Red Hat EL3     Intel or AMD  nmon_x86_rhel3        11c for Linux
Linux Red Hat EL4     Intel or AMD  nmon_x86_rhel4        11c for Linux
Linux Knoppix 4       Intel or AMD  nmon_x86_knoppix4     11c for Linux
Linux Debian 3        Intel or AMD  nmon_x86_debian3      11c for Linux
Linux SUSE SLES 8     Mainframe     nmon_mainframe_sles8  11c for Linux
Linux SUSE SLES 9     Mainframe     nmon_mainframe_sles9  11c for Linux
Linux Redhat          Mainframe     nmon_mainframe_rhel   - on request
Linux Red Hat EL 3    POWERPC       nmon_power_rhel3      11c for Linux
Linux Red Hat EL 4    POWERPC       nmon_power_rhel4      11c for Linux
Linux SUSE SLES 9     POWERPC       nmon_power_sles9      11c for Linux

Question 2: nmon crashes shortly after starting a data capture, please send me the next version?

Answer: When you are capturing data to a file, the nmon tool disconnects from the shell to ensure that it continues running even if you log out. This means that nmon can appear to crash but it is still running in the background. Use: ps -ef | grep nmon to see the process still running.

Question 3: I have a problem with nmon running on AIX 4.0.3 (or any really old AIX versions)?

Answer: Hard luck
I will actively help get AIX 5 bugs fixed but older versions are very much less interesting. In particular, on AIX 4.1.5 the TOP processes display does not work but I am not going to fix it unless someone offers me a bribe in hard currency.

Question 4: All I get is "nmon not found"?

Answer: First check it is executable (this gets switched off by FTP). Second, if you are the root user, you have to name the executable directly with the full path name or (if in the current working directory) ./nmon or put it into a directory in your $PATH. Many people on AIX use /usr/local/bin and make sure the root user includes this in their $PATH.

Question 5: Can you add the monitoring tape drive on AIX?

Answer: No - the data is not available. The best you can do is to watch the disks and guess what the tape is doing. The adapter statistics only add up the attached disks - so they do not help. You can guess at the tape drive I/O rates by looking at the disk I/O rates - after all, this is where the data is coming from - but it is only approximate and does not account for memory caching of data.

I have a little campaign running to get this tape stats feature available in AIX. Please, complain to your AIX support by raising a PMR - only by popular demand can this get high enough priority for the AIX developers to add this feature that we have been requesting for years! If you really want to "wind them up" say that you think Solaris now has tape stats.

One word of caution - if you are using a tape management system that does serverless backup, i.e. the data is transferred directly from client machines to the tape drives over fibre channel, then the tape management system's AIX operating system never actually touches the data - so this can never be recorded by nmon.

There may be tape system supplied tools or APIs for getting tape drive stats. If you come across these please let Nigel know. We could use these to generate nmon style data that can be merged into the nmon data for analysis using the nmon external data collector features.

The same is true for Linux - unless you know the /proc file to find tape stats. In which case let Nigel know ASAP.

Question 6: Can I get the adapters stats from other tools?

Answer: Not in AIX 4 - there are no adapter stats in AIX. They are now available in AIX 5 via the libperfstat library, so programmers can get this information - but be warned that this is data derived from the connected disks (NOT tape drives) because there are no real adapter stats.

Question 7: When I start nmon 9 on a system where it used to run fine I now get an error message?

The error is something about "lslpp" on AIX 5.1 from about ML03 onwards, or WLM stats go missing after upgrading to AIX 5.2 ML5. Can you fix nmon?
Answer: These are bugs in AIX and not nmon - there are fixes available. Please report these problems to your AIX support channel and not me. nmon 10 has also been back ported to AIX 5.1 and AIX 5.2, has code to work around these bugs and can be used instead of nmon9a.

Question 8: What is the most reported error for nmon?

Answer: See previous question - these AIX bugs cover 70% of nmon complaints.

Question 9: Can you add the monitoring of process priority?

Answer: This is only available from AIX 5.1 onwards.

Question 10: on AIX, nmon 9 does not run, please fix?

With reports like:
read error: No such device or address
nmon file=nmon.c line=1278 version=XXX
Answer: 95% of the time it is because AIX was upgraded or a maintenance level added but the AIX system was not rebooted. It is very easy to miss the "You must reboot" message in the gallons of installp output. The reboot is required because the AIX kernel image has been updated and the reboot is the only way to activate the new /unix file. nmon reads the /unix file to find kernel data structure addresses but if the /unix file does not match what is actually running, you get this message.
You can also get really weird effects, if you have messed up LIBPATH.

Question 11: Can I decide the filename it saves data to?

Answer: Use nmon -h and check out the -F <file> option

Question 12: What is the default output filename?

<hostname><Year><Month><Day><Hours><Minute>.nmon
Notes:

  • This has been very carefully chosen so that a directory of nmon files will sort in machine and then time order. So you can find the data file you want in a simple way.
  • Many people needlessly make up their own names via scripts and date commands - a pointless waste of time.
  • One side effect is that, if two nmon captures are started in the same minute they might use the same filename, so stagger the start up by 61 seconds.
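Why the default name sorts so conveniently can be shown with a minimal Python sketch. It assumes a four-digit year and two-digit month, day, hour and minute with no separators; the exact field widths nmon uses are not stated above, and the helper name default_nmon_filename and the host names are hypothetical.

```python
from datetime import datetime

def default_nmon_filename(hostname: str, when: datetime) -> str:
    # Assumption: <hostname><Year><Month><Day><Hours><Minute>.nmon with a
    # four-digit year and zero-padded two-digit fields, no separators.
    return f"{hostname}{when:%Y%m%d%H%M}.nmon"

names = [
    default_nmon_filename("web2", datetime(2009, 7, 13, 9, 30)),
    default_nmon_filename("web1", datetime(2009, 7, 14, 0, 5)),
    default_nmon_filename("web1", datetime(2009, 7, 13, 23, 59)),
]

# A plain alphabetical sort groups the files by machine, then by time.
for name in sorted(names):
    print(name)
```

Seconds are not part of the name, so two captures started on the same host within the same minute produce the same filename, which is the side effect noted in the list.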

Question 13: I want nmon output piped into a further command, how?

Answer: Use a FIFO and the -F option.

  • mkfifo /tmp/xyz
  • nmon -F /tmp/xyz -s 5 -c 300
  • your-command </tmp/xyz

If you are doing this with the online data output, I think you are barking mad but some people are still trying it.
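The FIFO mechanism itself can be tried without nmon. The following is a minimal Python sketch in which a writer thread stands in for nmon and the main thread stands in for your-command; the function name read_through_fifo and the sample lines are made up for illustration.

```python
import os
import tempfile
import threading

def read_through_fifo(lines):
    """Push `lines` through a named pipe and return what the reader received."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "xyz")
        os.mkfifo(fifo_path)  # same idea as: mkfifo /tmp/xyz

        def writer():
            # Stand-in for nmon writing to the file named with -F.
            # Opening a FIFO for writing blocks until a reader opens it.
            with open(fifo_path, "w") as out:
                for line in lines:
                    out.write(line + "\n")

        thread = threading.Thread(target=writer)
        thread.start()

        # Stand-in for: your-command </tmp/xyz
        with open(fifo_path) as src:
            received = [line.rstrip("\n") for line in src]

        thread.join()
        return received

print(read_through_fifo(["CPU_ALL,T0001,10.0", "CPU_ALL,T0002,12.5"]))
```

Nothing is stored on disk: the reader sees each line as the writer produces it, and gets end-of-file when the writer closes the pipe.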

Question 14: Why do you support all these old unsupported AIX versions?

Answer: You would be amazed at what versions are running out there. I guess it is a case of "if it isn't broken don't touch it". nmon can also help when planning server consolidation from these old versions to, for example, micro-partitions on newer hardware.

Question 15: What if I want support?

Answer: You have a few options:

  • give me money (and I have no problem with this) or
  • pay for and use Performance Toolbox/6000 which can do most of nmon and lots more too.
  • I have agreement in principle that nmon support can be added to an existing AIX Support contract for an extra fee. So far no one as far as I know has signed up for this. If interested get in touch with your AIX Support channel and ask them to get in contact with Nigel.

Question 16: Why don't you add a Java front end to nmon and get graphical output?

Answer: I don't have the time. If you can give me a framework for getting C functions to generate graphs, please let me know.

Question 17: The command line options don't seem to work right for file capture?

Answer: The -f, -F, -x, -X or -z MUST be the first option on the line and only one of them. This option sets all the other option flags. You can then use the other flags to modify their default behaviour. This has improved with the latest nmon versions.

Question 18: What is paging to a filesystem?

Hopefully, you already understand paging to paging space (also called virtual memory). AIX (and other UNIX versions) page in the read-only code from a program as you start it and as it runs. This is just like paging in from the paging space but is directly from the filesystem, this is also true for shared libraries (which you might not be aware you are using). Also programs using memory mapped files access the files by simply reading and writing memory addresses - AIX will page in the file pages as necessary and they will get paged back to the filesystem to free up memory or if the program forces it.
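The memory mapped file case can be seen from user level with a minimal Python sketch. It uses the standard mmap module on a temporary file; the function name mapped_write_demo is made up, and the comments describe what the operating system is expected to do underneath rather than anything the code can observe directly.

```python
import mmap
import os
import tempfile

def mapped_write_demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello paging")
        with mmap.mmap(fd, 0) as mem:      # map the whole file into memory
            first = bytes(mem[:5])         # reading memory pages file data in
            mem[:5] = b"HELLO"             # writing only changes the page in memory
            mem.flush()                    # the program forcing the page back to the file
        os.lseek(fd, 0, os.SEEK_SET)
        on_disk = os.read(fd, 12)          # ordinary read now sees the change
        return first, on_disk
    finally:
        os.close(fd)
        os.remove(path)

print(mapped_write_demo())
```

No read() or write() call touches the mapped data itself, which is why this kind of I/O shows up as paging to and from the filesystem.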

Question 19: Where can I get nmon and further information?

Answer: From this Wiki !!
The data displayed by nmon are similar to the displays generated by the standard AIX commands such as vmstat, iostat, netpmon, df, and sar. Use the AIX manual pages for these standard commands to understand what the displayed data means.

Following are several useful IBM Redbooks that you can buy or download for free from http://www.redbooks.ibm.com/portals/unix:

Question 20: nmon crashes after about 200 snapshots on AIX?

If you request Workload Manager stats and have WLM switched off then, due to bugs in AIX and a huge memory leak in the libwlm library, nmon will grow in size every time it fails to access the WLM stats until it hits 256 MB and will then crash. This is fixed in nmon 11 by switching off WLM stats after a few failed attempts.

Question 21: TOP process stats get switched on when I request Asynchronous I/O stats?

This is working as normal. To get the aioserver stats the details of all processes have to be collected, sorted and searched.
Having paid the CPU cycles for the TOP process stats you may as well see them on the screen or in the output file, so nmon automatically switches them on for you at no additional charge.

Question 23: nmon2rrd fails, please fix it?

You have been supplied with the source code for nmon2rrd and it is supplied as a "toolbox".
This means users are expected to come up with fixes rather than the original developer.
Note there are updated versions from users on the nmon download site - well done guys.

Question 24: NANQ and INF?

These are output when calculations within nmon have gone wrong, typically when dividing by zero. NANQ means "Not a number" and INF means infinite. Sometimes this can happen due to rounding errors but mostly it is a bug or numbers have overflowed the C data types.

Question 25: nmon and AIX commands do not agree?

See question 26. A lot of this happens with nmon 10 and Shared Processor Logical Partitions (SPLPAR) - what marketing calls Micro-partitions. Some of it is because the AIX commands are very unclear about what they are reporting. What were CPU numbers can now be Physical CPU, Logical CPU or Virtual CPU numbers and the documentation is unclear. So you may not be comparing "like with like". This has been improved in nmon 11 - please report further issues from nmon 11 onwards.

Question 26: nmon reports more than 100% for a process - clearly it is wrong?

Unlike AIX commands, nmon reports the CPU use of a process per CPU. If your process is, for example, taking 250% then it is using 2.5 CPUs and must be multiple threaded. This is far better than the AIX tools because the percentages on larger machines make it very hard to determine if a process is using a whole CPU. On a 64 CPU machine a single rogue process uselessly spinning on the CPU takes up 1.56% of the total CPU - this makes it very unclear what is going on.
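The arithmetic behind the two ways of reporting can be sketched in a few lines of Python. The function names are made up for illustration; the numbers are the ones used above.

```python
def cpus_in_use(per_cpu_percent: float) -> float:
    """nmon style: 100% means one whole CPU, so 250% means 2.5 CPUs."""
    return per_cpu_percent / 100.0

def share_of_machine(per_cpu_percent: float, cpu_count: int) -> float:
    """Whole-machine style: the same load as a percentage of all CPUs."""
    return per_cpu_percent / cpu_count

print(cpus_in_use(250))            # 2.5 CPUs, so the process must be multi-threaded
print(share_of_machine(100, 64))   # 1.5625, one spinning process on a 64 CPU machine
```

The whole-machine figure rounds to the 1.56% quoted above, which is easy to overlook even though a full CPU is being burnt.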

Question 27: On AIX the disk adapters are wrong?

nmon just outputs what it gets from the libperfstat library. With multipath I/O, the disk to adapter mapping often reflects the order of disk discovery rather than some balanced view. This is an AIX problem and not nmon's fault. To list what nmon is extracting from the libperfstat library you can use the sample code and the binaries precompiled for AIX 5.3 from the Roll Your Own Wiki page at ryo - and the adapt sample program.

If you don't like the way libperfstat reports the adapter stats, raise a PMR and refer to the adapt sample - as you will get nowhere reporting nmon errors.

Question 28: on AIX the adapter busy goes over 100%. That is impossible surely?

There are no adapter stats in AIX (see above). They are derived from the disk stats. The adapter busy% is simply the sum of the disk busy%.

So if the adapter busy% is, for example, 350% then you have 3.5 disks busy on that adapter. Or it could be 7 disks at 50% busy or 14 disks at 25% or ....
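As a minimal Python sketch of the sum described above (the function name adapter_busy and the disk busy values are made up), three quite different disk loads give the same adapter figure:

```python
def adapter_busy(disk_busy_percent):
    """Adapter busy% as described above: the sum of its disks' busy%."""
    return sum(disk_busy_percent)

print(adapter_busy([100, 100, 100, 50]))  # 3.5 disks flat out
print(adapter_busy([50] * 7))             # 7 disks at 50% busy
print(adapter_busy([25] * 14))            # 14 disks at 25% busy
```

All three print 350, so the adapter number alone cannot tell you how the load is spread across the disks.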

There is no way to determine the adapter busy and in fact it is not clear what it would really mean. The adapter has a dedicated on-board CPU that is always busy (probably no real OS) and we don't run nmon on these adapter CPUs to find out what they are really doing!!

Question 29: What about nmon for HP/UX, Solaris on Sparc or x86 or Linux on Itanium?

As I don't have access to such machines this is not going to happen. There is also a problem that IBM gives me access to the current hardware because nmon is seen as a competitive advantage. If this was ported to every UNIX then I would not be allowed this access.

Question 30: What about nmon for Windows?

Now you must be joking.

Question 31: Seeing double the number of CPUs?

This is due to the SMT feature of the POWER5 chip, where each CPU (core) runs two processes at the same time.
This gives you a 40% boost in performance for most commercial workloads and it is a really "good thing".
You need to read up on SMT or get yourself a presentation from IBM on the subject.

Question 32: 0509-036 Cannot load program /usr/lib/drivers/nfs_kdes.ext ?

You start nmon and get:

  • nmon for AIX5 exec(): 0509-036 Cannot load program /usr/lib/drivers/nfs_kdes.ext

First let's make this very clear - this is an AIX "feature" and not due to user level code like nmon.
The AIX loader is failing to load the NFS kernel extension. I looked up this error in the IBM problem database and found 10 hits of others reporting this issue with other tools (i.e. not just an nmon problem).

  • PMR 76818, 000, 738 - AIX: after 64bit switch over, NFS error
  • PMR 53438, 499, 000 - questions about starting rpc.mountd
  • PMR 66239, 070, 724 - W4F 4 command showmount missing
  • PMR 43814, 019, 000 - unable to mount - nfs_kdes.ext linked to wrong ext
  • PMR 82641, L6Q, 000 - 0509-022 Cannot load module, NFS error

From the first and last PMRs above:
The suggested fix is changing soft link

  • /usr/lib/drivers/nfs_kdes.ext -> /usr/lib/drivers/nfs_kdes_full.ext
    to
  • /usr/lib/drivers/nfs_kdes.ext -> /usr/lib/drivers/nfs_kdes_null.ext

"it seems that if you install some "DES" file of the expansion pack, it will relink your "nfs_kdes.ext" to "nfs_kdes_full.ext". This extension, however, does not load on 64bit (presumably 64bit AIX kernel). That's why you have to relink to nfs_kdes_null.ext."...PMR 88582, 487, 000 DES fileset e.g. bos.crypto

Do the following:

  • cd /usr/lib/drivers
  • rm nfs_kdes.ext
  • ln -s /usr/lib/drivers/nfs_kdes_null.ext nfs_kdes.ext

I strongly suggest you contact AIX support to confirm this is a sensible resolution to the issue, before continuing - just in case there are other side effects.

Question 33: Hello, I am new to UNIX and want to tune AIX, what do you recommend?

Don't do it. AIX is very good at looking after itself and self tuning. I have seen rookie system admins nearly halt a machine by making "improvements". Go on a course or read the AIX performance Redbooks from http://www.redbooks.ibm.com but don't just try changing things unless you first of all have a problem, and second know what you are doing and have practiced on a non-production machine or LPAR.
See the AIX Wiki What To Do After Installation hints at

Question 34: CPU wait is too high, how can I reduce it?

This question is asked a lot and it can mean your CPUs are actually too fast!

The CPU "waiting for I/O" state and its utilisation number (as opposed to User, System and Idle) mean the CPU is Idle but has a disk I/O outstanding. Historically this was used to highlight that your application is being held up by slow disks or disk problems. In the Wait for I/O state the CPU is actually free to do other work and the CPU is NOT looping waiting for the disk - it in fact actioned the adapter to perform the disk I/O, put the calling process to sleep and carried on. If there is no other process to run, it is in the same loop as in the Idle state, i.e. it is available to do other things. In AIX the processor does one of two things:

  1. In regular stand-alone machines or a dedicated CPU LPAR the processor runs a special kernel level process called "wait" from which it can exit very quickly at the arrival of the next interrupt
  2. In a micro-partition (Shared Processor LPAR) the processor after a few microseconds will call the Hypervisor to yield the processor for other LPARs

In benchmarks, Wait for I/O is seen positively as an opportunity - we can throw in more work to boost throughput.

Any workload in which the CPU does comparatively little work compared to the volume of disk I/O is going to give you high Wait for I/O.

If this high Wait for I/O is a sudden change from the normal pattern then it needs investigating and you should make sure as many disks as possible are involved in the disk I/O.

But lots of workloads just run like this - a common example I come across regularly is SAP databases. SAP cleverly caches lots of data but on a large database it has to do lots of disk I/O for particular customer or whatever records. Once the data is available it is sent to the SAP application servers, i.e. little work is done on the database.

In fact, faster CPUs would mean even higher wait values.

Question 35: On AIX, free memory is near zero, how do I free more memory?

This is just how AIX works and is perfectly normal. All of memory will be soaked up with copies of filesystem blocks after a reasonable length of time and the free memory will be near zero. AIX will then use the lrud process to keep the free list at a reasonable level.
If you see the lrud process taking more than 30% of a CPU then you need to investigate and make memory parameter changes.

Question 36: How can I set numperm better?

You can't. This number just reflects the amount of memory being used for disk blocks - called the buffer cache. It is controlled by three parameters minperm, maxperm and strictperm but these set thresholds and algorithms. The actual numperm number reflects what is actually going on. You will have to find other places for tuning these parameters as it is beyond the scope of this FAQ.

It is also worth noting that the nmon values for numperm and maxperm are based on a percentage of physical memory. The AIX commands report a percentage but not of all memory - they seem to remove some memory that might be something like the memory allocated to the AIX kernel (i.e. it could never be used as cache). Unfortunately this is not documented and the memory size not counted is not available with any public API. So nmon does the best it can but the numbers will not be absolutely the same.

Question 37: What format is the nmon output file?

Plain ASCII text that you can edit with vi (but you might hit the 2048 byte line limit on the AIX vi). I use the Open Source vim on AIX to avoid this or do it on Linux.

  • The first token on the line tells you what sort of data it is
    • AAA lines are basic nmon data about this collection of data
    • BBB lines are about the configuration of the machine
    • ZZZ lines include the date and time stamp, stored here once to reduce output
    • others should be obvious
  • the second field is the Timestamp - see the ZZZ section for the actual time
  • then there is the data
  • each sort of data (CPU, DISK, etc.) has a Header line that describes the columns and the header lines also include the graph titles

You do not need to sort the nmon output file for nmon2rrd or the Analyser but if you do then you can see the sections more easily for editing.
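The layout described in the list can be read with a few lines of Python. This is only a sketch under stated assumptions: the fields are taken to be comma separated, the sample lines (section names, snapshot tags and values) are invented rather than copied from a real capture, and the function name parse_nmon is hypothetical.

```python
import csv
from collections import defaultdict

# Hypothetical sample in the layout described above, not real nmon output.
SAMPLE = """\
AAA,host,myhost
BBB,0000,config line
CPU_ALL,CPU Total myhost,User%,Sys%,Wait%,Idle%
ZZZZ,T0001,09:30:00,13-JUL-2009
CPU_ALL,T0001,12.5,3.0,1.5,83.0
ZZZZ,T0002,09:30:05,13-JUL-2009
CPU_ALL,T0002,20.0,4.0,0.5,75.5
"""

def parse_nmon(text):
    times = {}                  # snapshot tag -> time and date from the ZZZ lines
    headers = {}                # section -> graph title and column names
    data = defaultdict(list)    # section -> [(snapshot tag, values), ...]
    for row in csv.reader(text.splitlines()):
        if not row:
            continue
        section = row[0]
        if section.startswith("ZZZ"):
            times[row[1]] = " ".join(row[2:])
        elif section.startswith(("AAA", "BBB")):
            continue            # capture and configuration details, skipped here
        elif len(row) > 1 and row[1] in times:
            data[section].append((row[1], row[2:]))
        else:
            headers[section] = row[1:]   # header line for this sort of data
    return times, headers, dict(data)

times, headers, data = parse_nmon(SAMPLE)
for tag, values in data["CPU_ALL"]:
    print(times[tag], dict(zip(headers["CPU_ALL"][1:], values)))
```

The sketch relies on each ZZZ line appearing before the data lines that use its tag, which holds for the file as written but not necessarily after sorting.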

Question 38: I have collected once a second for 8 hours but I can't get the Analyser to work?

You have 28800 data points and you want to see this on a screen say 1024 pixels wide !!

  • that is 28 data points per pixel.

My new Thinkpad has 1400 pixels across the screen, so I am down to just 21 data points per pixel

  • what were you thinking !!

I think even with the best will in the world, the analyser spreadsheet is going to struggle.
On a tiny machine you get about 1.5KB per snapshot and a normal size machine with a few nmon options it is more like 60KB each. At 60KB the maths --> 28800*60KB = 1.6GB. How big is your output file?
I hope you have at least 4 GBs of memory in your PC to handle this!

As I hope you know, the nmon file is text and editable with vi (but you might hit the 2048 byte line limit on the AIX vi). I use the Open Source vim on AIX to avoid this or do it on Linux. If you take a look at the file format you should be able to cut down the file size and make a series of files, but each will need the header section that you will find at the top of the file and then a different set of snapshots.
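Cutting a big file into a series of smaller ones can be scripted rather than done by hand. The following Python sketch assumes the file is unsorted (as written by nmon), that everything before the first ZZZ line is the header section, and that each snapshot starts with its ZZZ line; the function name split_nmon and the sample lines are made up.

```python
def split_nmon(lines, snapshots_per_file):
    """Return a list of line lists, each one the header plus a run of snapshots."""
    header, snapshots = [], []
    for line in lines:
        if line.startswith("ZZZ"):
            snapshots.append([line])      # a ZZZ line starts a new snapshot
        elif snapshots:
            snapshots[-1].append(line)    # data belonging to the current snapshot
        else:
            header.append(line)           # everything before the first ZZZ line
    return [
        header + [l for snap in snapshots[i:i + snapshots_per_file] for l in snap]
        for i in range(0, len(snapshots), snapshots_per_file)
    ]

# Hypothetical three-snapshot capture, split into files of two snapshots each.
capture = [
    "AAA,host,myhost",
    "CPU_ALL,CPU Total myhost,User%",
    "ZZZZ,T0001,09:30:00,13-JUL-2009", "CPU_ALL,T0001,10.0",
    "ZZZZ,T0002,09:30:01,13-JUL-2009", "CPU_ALL,T0002,11.0",
    "ZZZZ,T0003,09:30:02,13-JUL-2009", "CPU_ALL,T0003,12.0",
]
parts = split_nmon(capture, 2)
print(len(parts), [len(p) for p in parts])
```

Each part can then be written to its own file. For a multi-gigabyte capture the lines would be better streamed than held in memory, which this sketch does not attempt.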

Question 39: nmon does not work on my Linux machine!!

nmon runs on x86 (Intel and AMD), mainframe and POWER processors and on a dozen or so versions of Linux.
If you report problems I will need to know which platform and which Linux version plus distro before I can help so please include these with initial questions.

Question 40: When do we get nmon 10 for Linux?

The Linux & AIX source code for nmon is very different apart from the curses framework and basic approach. AIX gets all the information from system and library calls and in Linux this has to be read from the /proc filesystem. This means the AIX code is more straightforward.
So there is no need for Linux and AIX to have the same version number.
From nmon version 11, the AIX and Linux user interfaces were made the same and released with the same version number to keep people happy.
There was no nmon for Linux version 10.

Question 41: The boxes and lines in nmon do not work right online with: DTterm, xterm, rxvt, putty, VNC, (whatever you have)?

nmon uses curses to handle the displaying of characters on the terminal. This is controlled mostly by your TERM variable setting. The nmon developer tests with all of the above. They work perfectly and they work perfectly all the time. If it does not work for you then you have some setting wrong on your machine or X Windows or have some strange settings for TERM and/or TERMINFO shell variable setting or you are using a duff terminal emulator.

Let me state that again: your system has a problem not nmon.

The TERM shell variable should be set to the terminal emulator you are using.

  • If you are using a xterm then TERM should be xterm
  • If you are using DTterm then TERM should be dtterm
  • If you are using an AIX term then TERM should be set to aixterm
  • Get the idea - other combinations are your problem.

Unless you are using a genuine 1970's DEC VT100 then you should not be using this setting with more advanced terminal emulators. I remember VT100's well, even found a bug in the firmware once!

The TERMINFO variable should not be set to anything (in fact not set at all). If it is then you or someone has been mucking about with terminfo databases and why are you blaming nmon?

Terminal Emulators:

  • xterm works well in black and white.
  • aixterm works well and has colour and nmon uses the colour.
  • DTterm works well and has colour and nmon uses the colour.
  • rxvt and xterm-color combination (see WWW for details on setup, on google.com search for xterm-color and AIX) - this combination also lets vim (the improved vi from Open Source) use syntax highlighting in C code.
  • The Windows telnet terminals emulation is very poor indeed and not recommended under any circumstances - you are on your own.
  • The best alternative on a Windows PC is putty (see WWW for details and download) and is highly recommended - I use this every day - this will work with TERM set to xterm perfectly.
  • VNC is, of course, even better and gives you X windows on a Windows PC at zero cost - again highly recommended.

The -B option starts nmon with no boxes (or colour). Some purists do not like to waste the screen space with the box lines. You could add 'B' to the NMON shell variable to make this automatic: export NMON=B

Question 42: I have 2400 disks (small SAN LUNs) and nmon is slow to collect the stats from so many, can you help?

I guess you are learning the folly of small LUNs and that it makes the machine totally unmanageable. But you are not the first or worst - the record stands at 4500. Some suggestions:

  • Have you got more than four paths to each LUN?
    • If yes, you need to fix this ASAP as it is bad for performance and terrible for RAS (and I mean really bad).
  • Using the -D flag to stop nmon collecting disk configuration each time can really help the start up time.
  • Collect this disk configuration just the once - unless you are changing the disks a lot!!
  • You can use nmon User Defined Disk Groups to limit the output but nmon will still have to collect all the data from all the disks and then reduce what is actually reported.
  • But the only real solution is to reduce the number of disks you have - yes, I know this is a lot of work but you have a machine setup that can not be managed and that is not viable in the long term.
  • Don't blame nmon for highlighting the issue.

I recommend 32 to 64 LUNs and make the disk subsystem do the hard work of spreading the data across disks - i.e. not you. After all that is what you buy big disk subsystems for and there are better uses of your time and thought.

Question 43: Adapter stats and IOADAPT are not saved to the nmon file and seem to be missing with AIX 5.1?

Correct, this data is not available on AIX 5.1 from the libperfstat library.
This also causes a problem with nmon2rrd version 10, which expects the IOADAPT section and crashes.
Recommended action: upgrade AIX, as 5.1 is not supported without purchasing extended support.

Question 44: What is CharIO (a column of the TOP processes stats)?

This is the character I/O that a process is generating and it is counted from calls to the read() and write() system calls.
I/O started in other ways like Async I/O (commonly used by an RDBMS), paging or memory mapped files is not included.
The number is fetched from the AIX kernel using the getprocs64() system call and the structure found in /usr/include/procinfo.h - look for the pi_ioch variable.

Question 45: On Linux the disk stats are all doubled?

nmon collects the data from /proc and displays it. On newer kernels this is the /proc/diskstats file. It was decided a long time ago that hiding data was a very bad idea as it can go wrong and then be very misleading - this is how the ozone hole was missed for 5 years and not detected - the algorithm decided the data must be wrong and deleted it from the stats. The Linux disk stats (in three different files and four formats depending on the Linux version - great coding guys!!) report both disk level and disk partition level stats in the same file, so the same I/O shows up twice. nmon just shows you the stats - it is your job to understand them. nmon does not hide any of them, and with LUNs on SAN disks, software RAID and LVM it is much safer to show everything.
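The doubling is easy to reproduce from a /proc/diskstats style listing. The Python sketch below uses invented sample rows in the full whole-disk field layout (major, minor, name, then the counters, with sectors read as the sixth field) and a simple naming heuristic to spot partitions; both the heuristic and the function name sectors_read are assumptions for illustration, and this is not how nmon itself treats the data.

```python
# Hypothetical sample: one disk and its two partitions.
SAMPLE = """\
   8       0 sda 1000 0 8000 50 2000 0 16000 90 0 100 140
   8       1 sda1 400 0 3000 20 500 0 4000 30 0 40 50
   8       2 sda2 600 0 5000 30 1500 0 12000 60 0 60 90
"""

def sectors_read(diskstats_text):
    """Return (total over every row, total over whole disks only)."""
    rows = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 6:
            rows[fields[2]] = int(fields[5])   # sectors read

    def is_partition(name):
        # Heuristic: "sda1" is a partition if "sda" is also listed.
        base = name.rstrip("0123456789")
        return base != name and base in rows

    whole = [v for n, v in rows.items() if not is_partition(n)]
    return sum(rows.values()), sum(whole)

print(sectors_read(SAMPLE))
```

Adding up every row gives twice the whole-disk figure, because the partition rows count the same sectors again. The heuristic would misfire on device names that legitimately end in a digit, which is one more reason showing everything is the safer choice.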

Question 46: On AIX the disks seem to be mostly on the first adapter?

nmon now collects the adapter data from AIX libperfstat. This is the sum of the disk stats, added up by knowing which disk is connected to which adapter. This, of course, is complex for multipath IO disks. AIX seems to build this map from the order in which disks are discovered rather than used. Depending on your initial setup it can often mean that most disks are assigned to the first one or two adapters. Sorry, there is nothing that nmon can do about this. To list what nmon is extracting from the libperfstat library you can use the sample code and binaries precompiled for AIX 5.3 from the Roll Your Own Wiki page at ryo

Question 47: On nmon for Linux the CPU Wait for IO number is zero or odd?

This number is not available in the /proc filesystem until the 2.6 kernel and then it appears in the undocumented fields at the end of a line - I have fixed this for the 2.6 kernels in nmon for Linux version 11c.

Question 48: On nmon for Linux the paging details are missing and the PAGE lines for the capture to file are missing.

This data was very hard to locate and now appears in nmon for Linux version 11d onward for the 2.6 kernel.
Before this kernel version the data is not present in /proc.

Question 49: I want to collect data every second and then see weekly and monthly reports. How?

Let us take this in simple bite-size chunks:

  1. First, a practical point: most laptop and PC screens are 1024 x 768 pixels. No matter how many data points you have, you cannot see more than about 800 of them. This is why I recommend about 300 to 400 data captures with nmon to get good looking graphs.
  2. Second, one second stats for a day give you (60 x 60 x 24) 86400 data points! So OK, let us try one minute stats: then we have 1440 data points, which is still too many. So we need to move to 5 minute captures and we get to a sensible 288 data points and a good looking graph.
  3. Third, we then collect data for a month: 288 x 31 = 8928 data points - oh dear, far too many data points again!! So now we have to drop down to once an hour data capture (24 x 31) and we have 744 data points, which is only just possible - we had better start thinking about the purchase of a bigger screen.
  4. If you then want to compare months or have a yearly report ... well you get the idea by now, we are now monitoring 12 hour periods.
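The arithmetic in the list above can be checked with a few lines of shell. This is only a sketch: the points function and the variable names are made up for illustration, and a 31 day month is assumed as in the text.

```shell
#!/bin/sh
# points DURATION_SECONDS INTERVAL_SECONDS
# Number of data points a capture produces: duration divided by interval.
points() {
    echo $(( $1 / $2 ))
}

day=$(( 60 * 60 * 24 ))   # seconds in one day
month=$(( day * 31 ))     # seconds in a 31 day month

echo "every second for a day:      $(points "$day" 1)"
echo "every minute for a day:      $(points "$day" 60)"
echo "every 5 minutes for a day:   $(points "$day" 300)"
echo "every 5 minutes for a month: $(points "$month" 300)"
echo "every hour for a month:      $(points "$month" 3600)"
```

Anything that comes out well above the 800 or so points a screen can show is a sign that the interval is too short for the period you want to graph.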

But the above is only a physical problem. The much larger logical problem is still there to catch you out and that problem is averaging out. A long time ago I noticed that the shorter the time period that you use to monitor, the more fluctuations you notice in the data.

Philosophy: If you keep using shorter and shorter periods you will eventually see that the CPUs are either 100% busy or 100% idle; all the other numbers are just a feature of humans not thinking fast enough and having to average out the CPU use over longer periods.

Anyway, for performance tuning we need to concentrate on the peaks. Take a look at the below graph:

If we average the whole day we get 50%, which completely hides the peaks of the day time and the heavy CPU load during the evening batch. If this computer was not used during Saturday and Sunday the average might come down to 35%. The point is that averaging data over longer periods removes all the important peaks.
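To see the averaging effect with numbers, here is a small sketch. The avg_and_peak function is a made-up helper and the four sample values are invented for illustration; they are not taken from the graph.

```shell
#!/bin/sh
# avg_and_peak: read one CPU busy percentage per line on stdin and
# print the average next to the highest value seen.
avg_and_peak() {
    awk '{ sum += $1; if ($1 > max) max = $1 }
         END { if (NR > 0) printf "average=%.0f%% peak=%.0f%%\n", sum / NR, max }'
}

# Invented samples: two quiet periods followed by two busy ones.
printf '%s\n' 10 10 90 90 | avg_and_peak
```

The average on its own says the machine is half idle, while the peak shows it was nearly flat out for half of the samples.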

This is in addition to the data management problem.

Due to these three problems:

  1. Data overload - too many data points
  2. Averaging out - eliminates the vital data
  3. Manipulation - the data will need to be stored, manipulated and displayed - non-trivial

I think many people make the mistake of assuming that producing long term reports from nmon is an easy task, but it will turn out to be very hard work and often the results are utterly pointless or meaningless.

If you must attempt this then I recommend:

  • rrdtool to summarise data for you and draw graphs
  • ploticus looks like a good tool
  • take a look at Ganglia

Question 50: nmon will not start on AIX 5.1 due to a libperfstat error?

The error is something like:
exec(): 0509-036 Cannot load program <nmon binary file here> because of the following errors:
0509-150 Dependent module libperfstat.a(shr.o) could not be loaded.
0509-022 Cannot load module libperfstat.a(shr.o).
0509-026 System error: A file or directory in the path name does not exist.

You will need to have installed the libperfstat library from the AIX CDROMs.
This is in the bos.perf.libperfstat package.

I hope you realise that AIX 5.1 is not normally supported without extra payments as it is so old.

Question 51: How do I work out the Physical CPU use on Linux on POWER for shared processor LPARs?

Here is a Korn shell script that shows you where to get the data and the maths involved.

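#!/usr/bin/ksh
# Physical CPU use = (PURR ticks used) / (timebase ticks per second) / (elapsed seconds)

# First PURR reading for this LPAR
before=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
echo before=$before

integer seconds=2
sleep $seconds

# Second PURR reading after the interval
after=`grep purr /proc/ppc64/lparcfg | sed 's/purr=//'`
echo after=$after

# Timebase frequency from /proc/cpuinfo
timebase=`grep timebase /proc/cpuinfo | awk '{print $3 }' `
echo timebase=$timebase

# Let bc do the maths to 5 decimal places
string="($after-$before)/$timebase/$seconds"
echo string $string
bc <<EOF
scale=5
$string
EOF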

Question 52: The Disk Busy stats are missing on AIX

If you are watching this on line it will be flashing
--> To enable disk stats as root: chdev -l sys0 -a iostat=true
at you - this is a big hint on how to switch them on !!!

Question 53: Sort order problems with massive nmon output files.

So you collected more than 9999 snapshots in a single nmon capture. Ignore for a moment the fact that the Excel Analyser can't cope with all this data and that it makes the data unmanageable; we suggest a good aim is between 400 and 700 snapshots per file for good graphs and manageable file sizes. Anyway, you then find out that if you sort the file the rows don't even sort in the right order. The problem is that you have four digit and five digit Timeshot numbers - the T numbers. This mucks up the sort ordering. What can you do? Try this on the AIX system (it should work on Linux too); it makes all the T numbers 5 digits and then they can be sorted:

sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/' original.nmon >original5digit.csv
sort -n original5digit.csv >fixed.csv




Full marks if you understand the sed command - this is very advanced regular expression stuff
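For anyone who wants the marks, here is the same sed expression taken apart and run on two sample lines. This is a sketch: pad_tnumbers is a made-up wrapper name and the two CPU_ALL lines are invented sample data in the nmon file style.

```shell
#!/bin/sh
# pad_tnumbers: the sed expression from above, reading stdin.
pad_tnumbers() {
    # \(,T\)                    group 1: the ",T" in front of the number
    # \([0-9][0-9][0-9][0-9]\)  group 2: exactly four digits
    # \(,\)                     group 3: the comma that must follow at once
    # \10\2\3                   group 1, a literal 0, then groups 2 and 3
    sed 's/\(,T\)\([0-9][0-9][0-9][0-9]\)\(,\)/\10\2\3/'
}

# A four digit T number gains a leading zero, a five digit one is left alone.
printf '%s\n' 'CPU_ALL,T9999,12.5' 'CPU_ALL,T10000,13.0' | pad_tnumbers
```

The five digit number is untouched because after ",T1000" the next character is another digit, not the comma that group 3 insists on, so the pattern does not match.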

Question 54: AIX 5.3 updated but then nmon gives "Illegal instruction(coredump)"

This has been reported shortly after an upgrade to a higher AIX 5.3 ML (like ML5 or ML6) and reboot. After a lot of research and experiments the following was found by a persistent nmon user called Xi Chen. The problem seems to be nmon jumping to a library like libperfstat where the jump vectors are not right, so the library/system call jumps to address zero and attempts to execute instruction zero (invalid, of course). This is a bug in AIX and its update process where the libperfstat kernel package does not match the library. Try the following command: # lslpp -L | grep -i perfstat

You may get something like:

# lslpp -L | grep -i perfstat
  bos.perf.libperfstat      5.3.0.50    C     F    Performance Statistics Library
  bos.perf.perfstat         5.3.0.60    C     F    Performance Statistics




Update the package bos.perf.libperfstat to the same (5.3.0.60) or at least much closer levels (like 5.3.0.60 and 5.3.0.61) as bos.perf.perfstat. Preferably, the latest available levels.
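The comparison of the two levels can be scripted. This is a sketch only: check_perfstat_levels is a made-up helper name, and it assumes the fileset name is in the first column and the level in the second, as in the listing above.

```shell
#!/bin/sh
# check_perfstat_levels: read "lslpp -L" output on stdin and report
# whether the two perfstat filesets are at the same level.
check_perfstat_levels() {
    awk '$1 == "bos.perf.libperfstat" { lib = $2 }
         $1 == "bos.perf.perfstat"    { krn = $2 }
         END {
             if (lib == "" || krn == "")
                 print "perfstat filesets not found"
             else if (lib == krn)
                 print "levels match: " lib
             else
                 print "levels differ: libperfstat " lib ", perfstat " krn
         }'
}

# Sample input copied from the listing above.
check_perfstat_levels <<'EOF'
  bos.perf.libperfstat      5.3.0.50    C     F    Performance Statistics Library
  bos.perf.perfstat         5.3.0.60    C     F    Performance Statistics
EOF
```

On the AIX machine itself the input would come from the real command: lslpp -L | check_perfstat_levels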

Question 54: AIX 5.3 updated but then nmon gives "Assert Failure"

This has been reported shortly after an upgrade - some machines have this problem while others don't. There does not seem to be a pattern. There has been a lot of investigation of this issue with tools being written but it is still a mystery. The libperfstat library is claiming that an invalid parameter has been passed but tools have shown this is not true. The three parameters are a pointer to memory (just malloc'ed in the code), the number of adapters (just returned by the previous call to libperfstat) and the size of the diskadapter structure (which has never changed). The output looks like this:

ERROR: Assert Failure in file="nmon11.c" in function="main" at line=3300
ERROR: Reason=System call returned -1
ERROR: Expression=[[perfstat_diskadapter((perfstat_id_t * )FIRST_DISKADAPTER, p->adapt, sizeof(perfstat_diskadapter_t), adapters)]]
ERROR: errno=22
ERROR: errno means : Invalid argument

Then it has been found that a reboot fixes most of these Assert Failures. We don't fully understand this but it may be adapters in funny states, or kernel modules that need to be reloaded, or libperfstat in a twist - one thing we do know - it's not nmon! If you hit this problem:

  1. Check the software levels, see the "Illegal instruction(coredump)" question above
  2. Do you think that you rebooted after the upgrade or do you know for absolutely sure?
  3. Try: export NMON_IGNORE_ASSERT=1 and then start nmon from this same ksh. This may work around the problem as nmon bravely tries to carry on even with library errors.
  4. Try the latest beta version of nmon (if it supports your AIX level).
  5. I know rebooting can be a problem with production systems but it fixes this the vast majority of the time.
  6. If it is still a problem, let us know via the usual AIX Performance Tools Forum.

Question 55: On AIX 5.3 ML6, nmon output files contain zeros, missing CPU stats, corrupt ZZZ lines and "nfs" strings found in the stats

This is yet another bug in the AIX libperfstat library at this ML6.  The NFS data returned to nmon is corrupt and these characters may be output directly from the library (very bad form chaps!).

The work around is:

  1. Do not include NFS statistics (remove the -N)
  2. Move to nmon12 that codes around these bugs.

Question 56: Does nmon capture point in time stats or averages?

Well, there are two types of numbers

  • rates and
  • absolutes.

For an absolute example, free memory is an absolute - nmon just shows you how much memory is free.
For a rate example, the network stats are rates, here nmon does the following:

  1. Capture a complete set of counters - these are incremented by the kernel like the number of bytes sent.
  2. then nmon waits the number of seconds you asked
  3. then nmon captures a second set of these counters
  4. then nmon calculates the difference between the two sets and divides by the number of seconds, so everything is per second
  5. this number is then displayed on screen or written to the data file

So the rates are the average between the two capture points. As the number of seconds increases the rates get more and more steady but note if you reduce the seconds to just one (the minimum to make sure nmon does not use too much CPU time) you will see lots more peaks and dips in the numbers.
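The five steps can be sketched in shell. This is not nmon's own code (nmon is a compiled program); the function names are made up, and reading the received byte counters from /proc/net/dev is an assumption that only applies to Linux, so that part is skipped elsewhere.

```shell
#!/bin/sh
# rate_per_second FIRST SECOND SECONDS
# Difference between two counter readings divided by the interval length.
rate_per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# rx_bytes: total bytes received on all interfaces (Linux /proc/net/dev).
# The first two lines of the file are headers; the interface name and its
# colon are stripped so the receive byte counter becomes field 1.
rx_bytes() {
    awk 'NR > 2 { sub(/^[^:]*:/, ""); sum += $1 }
         END { printf "%.0f\n", sum }' /proc/net/dev
}

seconds=1
if [ -r /proc/net/dev ]; then
    first=$(rx_bytes)        # step 1: first set of counters
    sleep "$seconds"         # step 2: wait the requested seconds
    second=$(rx_bytes)       # step 3: second set of counters
    # steps 4 and 5: difference per second, then display it
    echo "received bytes/s: $(rate_per_second "$first" "$second" "$seconds")"
fi
```

Whatever happened between the two readings is averaged into the one number, which is why a longer interval gives steadier rates.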

"Point in time" numbers would be very misleading as they would miss all the peaks and dips in between - you would have to take dozens of them to be sure you are really seeing a representative number.

Question 57: Why is the Process memory percentage zero? (same for System and User percent)

This seems to happen in AIX 5.3 TL07 or thereabouts. In fact, it is the AIX libperfstat library, which nmon uses, that has a bug in it that returns a large negative number for the Process% value. The Process, System and User Percentages are approximations (remember memory has many modes, types and uses and some overlap) and the calculation goes wrong.

nmon reports this problem by showing 0% - which is clearly impossible.

The bug was very hard to reproduce and track down because the problem only happens in particular circumstances and changes in memory use (like starting and stopping large memory applications). I am pretty sure you have a good chance of the number being fixed (for at least some time but may reappear), if you reboot the machine/LPAR.

The fix is to update AIX to AIX 5.3 TL09 (or even better AIX 6) but there may be a PTF or efix. You will have to ask AIX Support for a fix to the libperfstat library that corrects the real_system, real_process and real_user members of the perfstat_memory_total_t structure. That will give them the right details to search for in the Retain database. Do not ask for nmon classic support as the answer could be short and/or rude!

In my experience AIX systems administrators don't like adding these updates to a production machine. So it may be better to just accept that if any of these numbers are zero then do not use any of these percentages.


Question 100: When will nmon collect data from lots of machines or LPARs?

Answer: Never.
I like to think nmon does one job and does it well - it collects data from one machine and saves it in one file.
Going to multiple machines or LPARs has many problems:

  • Collecting data from lots of machines or LPARs would require network access and lots of error handling for missing or late data.
  • The nmon output file would then be far more complex and have to include the machine names and totally rewrite the time stamps.
  • We already suffer from more data than Excel can handle.
  • There would simply be too much data to display
  • This complication would mean nmon becomes very large and code stability would take a long time to settle down

What you do need is:

  • Less data, and then you drill down into particular nodes
  • Automated database generation to store the data
  • Automated graphing of the data you really want
  • History for the last hour, day, week, month, year
  • Small simple daemons on the nodes and automated central collection point
  • Simple method of collecting more stats
  • Open Source code to make it safe and simple to implement.

This tool is called Ganglia, see http://ganglia.sourceforge.net/ and Question 101.

Question 101: When will nmon collect data like "topas -C"?

It may not be obvious but topas and topas -C are two completely different programs hidden in one binary. The cross partition stats involve communicating with each LPAR and the HMC to get the data, unlike the local stats that just call the local kernel API. The cross partition version of nmon has already been written: it is called Ganglia, please see http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia for more details. OK, it is an excellent Open Source tool and nothing to do with nmon, but it has all the right stats, many brilliant features, is very simple to implement and has very little impact on performance. There is no need to duplicate this work. It also supports lots of operating systems, the output is via a website, the data is in graph form and it keeps historic data - so this is better than text output on a dumb screen and only for root users.



 