- I have been sent 35 GB of nmon files covering the last month: 6 servers with about 40 LPARs (virtual machines) each, roughly 6,000 files in total.
- How do I load that into the nmon Analyser to see the graphs?
My first reaction is: that is not my problem - I wrote nmon for AIX and nmon for Linux, not the nmon Analyser.
If you stop and think, it is obvious that you are never going to load 35 GB into Excel on a laptop with 8 to 16 GB of memory.
Excel is likely to grind the laptop to a halt and then explode at a few hundred MB of data.
First lesson: Don't try to do a capacity planning exercise using performance tuning data.
- Even for one LPAR, the 30 nmon files will probably explode Excel. You don't have an nmon problem or an Excel problem - you have a too-much-data problem. Don't even dream of merging nmon files for a week or a month. You are facing a major data management issue.
Note: I will not comment on the guys taking 1000's of nmon data snapshots a day or running nmon for days at a time - they can explode Excel with a single file. They deserve what they asked for :-) While writing this, someone wanted more than 10,000 data points on a graph displayed on a screen 1920 pixels across!
Second Lesson: Change to nmonchart.
This ksh script graphs an nmon file much faster than Excel (typically in a second or two) and can tackle much larger files. As it is a script, you can run it on 100's of files in a directory on AIX or Linux. If it's on a POWER server, run lots of them in parallel. Depending on the size of the nmon files, it can process a few thousand files an hour (a single-stream shell script looping through the files). You view the results in your browser - one set of nmon graphs per browser tab. When viewing the graphs you can have loads open at one time and flick between them. This scales up to, say, two dozen LPARs.
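As a minimal sketch of that parallel approach (assuming nmonchart is on your PATH and takes an input .nmon file and an output .html file as arguments), a small shell function can batch-convert a whole directory:

```shell
# Hedged sketch: batch-convert nmon files with nmonchart, a few at a time.
# Assumes nmonchart is on the PATH and is called as "nmonchart in.nmon out.html".
# NMONCHART can be overridden (e.g. for testing) - that variable is mine,
# not part of nmonchart itself.
batch_nmonchart() {            # usage: batch_nmonchart <directory> [max-parallel]
    dir=$1
    maxpar=${2:-8}             # how many conversions to run at once
    n=0
    for f in "$dir"/*.nmon; do
        [ -e "$f" ] || continue                 # no matching files at all
        ${NMONCHART:-nmonchart} "$f" "${f%.nmon}.html" &
        n=$((n + 1))
        if [ $((n % maxpar)) -eq 0 ]; then
            wait                                # throttle: let this batch finish
        fi
    done
    wait                                        # wait for the stragglers
}
```

For example, batch_nmonchart /nmon/server1 8 converts everything under /nmon/server1 eight files at a time - the directory name and parallelism are illustrative.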
Third Lesson: Building a data repository for capacity planning is non-trivial
You could gear up with various tools that take nmon data and then let you graph LPARs over time:
- Free-to-use tools like nmon2rrd and nmon2web, which use rrdtool to store the data and then generate rrdtool graphs.
- Non-free 3rd-party tools like MPG and many others that take nmon files as input to their performance monitoring and capacity planning tools.
- There is also IBM's own comprehensive ITM tools.
If you are going to do long-term performance and capacity planning, this is a good idea. If it is a once-only exercise, then you are not going to want to find the hardware and spend a couple of weeks setting these tools up.
There are alternatives like LPAR2RRD and Ganglia too, which don't use nmon files.
Fourth Lesson: Newer tools avoid all this data management
We also have new-wave performance tools with a new data collector, njmon - this outputs JSON format and many more stats than nmon, which you can then inject live into new-wave time-aware databases and live dynamic graphing tools like InfluxDB + Grafana, ELK or Splunk. For more information: http://nmon.sourceforge.net/pmwiki.php?n=Site.Njmon You can also inject nmon files using the nmon2json formatter (http://nmon.sourceforge.net/pmwiki.php?n=Site.Nmon2json)
Assuming it's a once-only project
So let's assume you are doing a "once only" audit, capacity planning or server consolidation exercise. No time to get organised and tooled up.
What you need to do is extract a summary from the 100's of nmon files, then have those summaries in CSV format so you can load the resulting data into an Excel spreadsheet.
I am going to assume you are a real technical person - no Windows tools here! You have access to AIX or Linux and are OK with simple shell scripts using grep and awk.
Let us get organised.
If you have nmon files from different servers then place them in different directories.
You may have noticed nmon file names are very carefully designed: <hostname>_<date>_<time>.nmon
This means they sort very nicely with the ls command: by hostname and then by date and time.
Note: I will not comment on the guys deciding on their own rubbish file names via an often-buggy shell script and then blaming nmon!!
You can then pick out the set of nmon files for a particular day, like today's (23rd November 2018): ls -1 *_20181123_*.nmon
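If the files arrive in one big heap, the naming convention makes splitting them up easy. Here is a minimal sketch (the function name is mine, and it assumes hostnames contain no underscores):

```shell
# Hedged sketch: file each <hostname>_<date>_<time>.nmon into a
# per-hostname directory. Assumes the hostname part has no underscores.
split_by_host() {              # usage: split_by_host <directory-of-nmon-files>
    dir=$1
    for f in "$dir"/*.nmon; do
        [ -e "$f" ] || continue
        base=${f##*/}          # strip the directory part
        host=${base%%_*}       # text before the first underscore
        mkdir -p "$dir/$host"
        mv "$f" "$dir/$host/"
    done
}
```

After running it, the per-day ls -1 *_20181123_*.nmon trick works neatly inside each server's directory.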
So here are three scripts to extract General information, CPU stats and Memory usage.
In our example we should quiz the IT staff to decide on a busy day, then focus on that - later we might explore other days for comparison.
So we have 6 directories for the 6 servers, each with roughly 40 LPARs and 30-ish days of files.
We want a summary of the LPARs for a specific day. Instead of time-based graphs like nmon's, we need to step back and get the basic config plus minimum, average, maximum and 95th-percentile stats on CPU and Memory.
With 240 LPARs in our example, we need to cut the stats per LPAR down to a basic few.
Here is what I recommend:
Basic configuration
- nmonfile,
- nmonversion,
- AIXversion,
- time,
- date,
- interval,
- snapshots,
- CPUs-Max-Current,
- serialno,
- LPAR-Number-Name,
- machineTypeModel,
- hardware-description
CPU stats:
- nmonfile,
- snapshots,
- VP,
- E,
- VP:E,
- poolCPU,
- poolIdle,
- Weight,
- Capped,
- total,
- min,
- avg,
- max,
- 95%
Memory stats:
- nmonfile,
- count,
- total_used,
- min_used,
- avg_used,
- max_used,
- 95%MB
I have selected a wild random set of nmon files and changed the Serial Numbers and hostnames - I hope your nmon files are MUCH more consistent.
The scripts are for AIX only (at the moment) and complain if given Linux data or if a file is missing the LPAR stats used. Below is the raw output:
Sample nsum_gen output - don't try to read this, see below
blue:nag:/home/nag/nsum $ ./nsum_gen *.nmon
nmonfile, nmonversion, AIXversion, time, date, interval, snapshots, CPUs-Max-Current, serialno, LPAR-Number-Name, MachineTypeModel, hardware-description
E850_170915_1103.nmon , TOPAS-NMON, 7.1.4.32, 11:03:38, 15-SEP-2017, 3, 360, 128-32, XB47186, 1-VPM_E850, IBM-8408-44E, POWER8 64 bit
aix721_1_170118.nmon , TOPAS-NMON, 7.2.1.2, 04:17:28, 18-JAN-2017, 30, 9999999, 16-8, ABC1234, 4-aix721_1, IBM-8286-42A, POWER8 64 bit
blue_171122_1042.nmon , TOPAS-NMON, 7.2.1.1, 10:42:31, 22-NOV-2017, 1, 100, 32-16, 011ECV7, 17-w3-blue, IBM-8286-42A, POWER8 64 bit
sampleC.nmon , TOPAS-NMON, 6.1.7.17, 00:00:06, 01-DEC-2014, 1200, 72, 28-28, 06E214R, 2-sampleC, IBM-8205-E6C, POWER7_in_P7_mode 64 bit
sampleD.nmon , TOPAS-NMON, 6.1.9.100, 00:00:05, 24-MAR-2016, 300, 288, 32-24, 9999999, 34-34_PROD_1, IBM-9117-MMB, POWER7_COMPAT_mode 64 bit
testhost1234__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:01, 12-NOV-2018, 60, 480, 24-4, 0XX6C17, 40-XX, IBM-9117-MMC, POWER7 64 bit
testhost1235__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:02, 12-NOV-2018, 60, 480, 56-12, 7654321, 34-testhost1235, IBM-9117-MMC, POWER7 64 bit
testhost1236__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:04, 12-NOV-2018, 60, 480, 28-8, 9876543, 33-testhost1236, IBM-9117-MMC, POWER7 64 bit
testhost1237__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:01, 12-NOV-2018, 60, 480, 100-100, 8877665, 20-testhost1237, IBM-9117-MMC, POWER7 64 bit
testhost1239__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:01, 12-NOV-2018, 60, 480, 28-8, 1122334, 42-testhost1239, IBM-9117-MMC, POWER7 64 bit
testhost1254__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:01, 12-NOV-2018, 60, 480, 28-8, 1133557, 45-testhost1254, IBM-9117-MMC, POWER7 64 bit
testhost1284__20181112_0900.nmon , TOPAS-NMON, 7.2.2.0, 09:00:01, 12-NOV-2018, 60, 480, 28-16, 6666666, 47-testhost1284, IBM-9117-MMC, POWER7 64 bit
testhost8888_161219_0000.nmon , TOPAS-NMON, 6.1.4.4, 00:00:02, 19-DEC-2016, 300, 288, 38-6, PRA8899, 5-Mail_WAS01, IBM-9119-FHA, POWER7_in_P6_mode 64 bit
testhost9999_170208_1201.nmon , TOPAS-NMON, 6.1.9.30, 12:01:17, 08-FEB-2017, 300, 288, 64-40, 0499TDD, 204-testhome9999, IBM-9119-FHB, POWER7_COMPAT_mode 64 bit
Sample nsum_cpu output - don't try to read this, see below
blue:nag:/home/nag/nsum $ ./nsum_cpu *.nmon
nmonfile, snapshots, VP, E, VP:E, poolCPU, poolIdle, Weight, Capped, total, min, avg, max, 95percentile
E850_170915_1103.nmon, 360, 8, 8.00, 100.00%, 0, 0.00, 128, 0, 416.61, 0.01, 1.16, 3.29, 2.41
aix721_1_170118.nmon, 417, 1, 0.20, 500.00%, 0, 0.00, 128, 1, 11.65, 0.03, 0.03, 0.06, 0.04
blue_171122_1042.nmon, 100, 4, 3.00, 133.33%, 0, 13.82, 200, 0, 197.08, 0.01, 1.97, 3.74, 3.32
sampleC.nmon, 30, 9, 7.00, 128.57%, 0, 1.00, 51, 1, 93.80, 1.31, 3.13, 6.38, 5.96
sampleD.nmon, 288, 6, 3.50, 171.43%, 26, 10.89, 128, 0, 647.43, 0.20, 2.25, 5.70, 5.06
testhost1234__20181112_0900.nmon, 480, 1, 0.10, 1000.00%, 0, 0.00, 192, 01, 8.75, 0.01, 0.02, 0.08, 0.03
testhost1235__20181112_0900.nmon, 480, 3, 0.50, 600.00%, 0, 0.00, 192, 0, 66.14, 0.04, 0.14, 0.69, 0.44
testhost1236__20181112_0900.nmon, 480, 2, 0.30, 666.67%, 0, 0.00, 128, 0, 48.31, 0.03, 0.10, 0.47, 0.34
testhost1237__20181112_0900.nmon, 480, 25, 5.00, 500.00%, 0, 0.00, 192, 0, 785.83, 0.25, 1.64, 5.53, 3.61
testhost1239__20181112_0900.nmon, 480, 2, 0.40, 500.00%, 0, 0.00, 128, 0, 42.08, 0.07, 0.09, 0.31, 0.13
testhost1254__20181112_0900.nmon, 480, 2, 0.40, 500.00%, 0, 0.00, 128, 0, 67.96, 0.07, 0.14, 0.81, 0.47
testhost1284__20181112_0900.nmon, 480, 4, 1.00, 400.00%, 0, 0.00, 128, 0, 39.88, 0.05, 0.08, 0.37, 0.26
testhost8888_161219_0000.nmon, 288, 3, 2.00, 150.00%, 0, 0.00, 128, 0, 103.47, 0.13, 0.36, 0.82, 0.58
testhost9999_170208_1201.nmon, 29, 10, 4.00, 250.00%, 0, 0, 255, 1, 6.44, 0.19, 0.22, 0.28, 0.28
Sample nsum_ram output - don't try to read this, see below
blue:nag:/home/nag/nsum $ ./nsum_ram *.nmon
nmonfile, count, total_used, min_used, avg_used, max_used, 95percentileMB
E850_170915_1103.nmon, 360, 5837558.20, 12661.50, 16215.44, 18561.70, 18531.40
aix721_1_170118.nmon, 417, 678009.00, 1619.10, 1625.92, 1634.10, 1633.40
blue_171122_1042.nmon, 100, 1603607.00, 16035.60, 16036.07, 16036.70, 16036.70
sampleC.nmon, 31, 1630130.20, 52549.30, 52584.85, 52681.40, 52680.00
sampleD.nmon, 288, 13116070.20, 36336.70, 45541.91, 48097.00, 48093.90
testhost1234__20181112_0900.nmon, 480, 966217.10, 2008.80, 2012.95, 2016.70, 2016.20
testhost1235__20181112_0900.nmon, 480, 10405805.60, 21624.10, 21678.76, 21834.70, 21711.50
testhost1236__20181112_0900.nmon, 480, 6958848.70, 14456.30, 14497.60, 14534.80, 14516.50
testhost1237__20181112_0900.nmon, 480, 21318534.40, 40954.80, 44413.61, 47161.80, 46664.30
testhost1239__20181112_0900.nmon, 480, 7821156.70, 16244.60, 16294.08, 16371.00, 16329.20
testhost1254__20181112_0900.nmon, 480, 7804580.00, 16208.60, 16259.54, 16367.00, 16318.10
testhost1284__20181112_0900.nmon, 480, 10062864.20, 20935.10, 20964.30, 21031.40, 21001.40
testhost8888_161219_0000.nmon, 288, 5322236.40, 18016.40, 18479.99, 18641.30, 18596.40
testhost9999_170208_1201.nmon, 29, 114679.10, 3950.80, 3954.45, 3958.50, 3958.10
The scripts run in under a second - with very large files they may take a second or two.
So I run:
echo General Info >nsum.csv
nsum_gen *.nmon >>nsum.csv
echo CPU >>nsum.csv
nsum_cpu *.nmon >>nsum.csv
echo Memory >>nsum.csv
nsum_ram *.nmon >>nsum.csv
If I had a directory full of many days, I would select just one day:
echo General Info >nsum.csv
nsum_gen *20181123*.nmon >>nsum.csv
echo CPU >>nsum.csv
nsum_cpu *20181123*.nmon >>nsum.csv
echo Memory >>nsum.csv
nsum_ram *20181123*.nmon >>nsum.csv
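If you have several server directories, the six commands above can be wrapped in a small function and repeated per directory. A sketch, assuming nsum_gen, nsum_cpu and nsum_ram are executable and on your PATH:

```shell
# Hedged sketch: build one nsum.csv per server directory for a chosen day.
# Assumes the three nsum_* scripts are on the PATH; the function name is mine.
summarise_day() {              # usage: summarise_day <directory> <YYYYMMDD>
    dir=$1
    day=$2
    (
        cd "$dir" || exit 1
        echo "General Info"      > nsum.csv
        nsum_gen *"$day"*.nmon  >> nsum.csv
        echo "CPU"              >> nsum.csv
        nsum_cpu *"$day"*.nmon  >> nsum.csv
        echo "Memory"           >> nsum.csv
        nsum_ram *"$day"*.nmon  >> nsum.csv
    )
}
```

Then something like for d in server1 server2 server3; do summarise_day "$d" 20181123; done covers all the servers (the directory names are illustrative).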
Let us use a spreadsheet to sum() columns and make the output easier to read.
Next we open the nsum.csv file in Excel - or your favourite spreadsheet.
You may have to tell it how to open CSV files.
- I have made the title lines Bold
- Added Column Totals to some columns of data and coloured them RED
General information

Note these are a random set of LPARs from different machines - if this was all one server, we could check that with the Serial Number.
CPU information
Key
- snapshots = the number of times the data was collected. Allows you to ignore nmon files that are unexpectedly short
- VP = Virtual Processor
- E = Entitlement
- VP:E ratio = lower is better for efficient use of CPUs. Ignore it if the LPAR is small (E below 1). Higher than 200% is usually a mistake and NOT best practice.
- poolCPU = number of CPUs in the shared CPU pool. Zero means the user has not set the option on the HMC.
- poolIdle = number of CPUs that are idle (unused) in the shared CPU pool.
- Weight = how important the LPAR is; a higher weight lets it grab more of the unused CPU time from the pool.
- Capped = not allowed to go over the Entitlement at any time.
- total = ignore this as it is used to calculate the average.
- min, avg and max CPU used = exactly what you expect.
- 95th percentile = CPU used. Put all the values in a list, highest first, then remove the top 5%; the highest value now at the top is this number. A great way to ignore sudden peaks in volatile stats.
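The percentile recipe just described (sort highest first, throw away the top 5%, take the new top value) can be sketched in a few lines of shell and awk - illustrative logic, not the actual nsum_cpu code:

```shell
# Hedged sketch of the min/avg/max/95th-percentile logic described above
# (not the real nsum_cpu script): sort descending, drop the top 5% of
# samples, and the new highest value is the 95th percentile.
stats() {                      # reads one number per line on stdin
    sort -rn |
    awk '{ v[NR] = $1; sum += $1 }
         END {
             drop = int(NR * 0.05)      # number of top samples to discard
             printf "min=%.2f avg=%.2f max=%.2f p95=%.2f\n",
                    v[NR], sum / NR, v[1], v[drop + 1]
         }'
}
```

For example, seq 1 20 | stats reports min=1.00 avg=10.50 max=20.00 p95=19.00.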
Notes:
- For this (let us pretend) POWER7 server, the 95th-percentile total is 23 CPUs - we would use the IBM POWER7 performance report for this server model to find/calculate the rPerf number, and from that determine a suitable POWER9 server to run this workload.
- We can also do some sanity checks: if E = Entitlement is more than 1, then the VP:E ratio should not be more than 200% (150% is better). If it is, the LPAR is badly spread across CPUs and very inefficient.
- Only one LPAR has got the Pools stats switched on but it reports there are 10.89 unused CPUs - that is a tuning opportunity.
- We can see the max CPU is 29 but the 95% is a little lower. The LPARs might not peak at the same time.
Memory information
Key
- count = the number of times the data was collected. Allows you to ignore nmon files that are unexpectedly short.
- total_used = ignore this as it is used to calculate the average.
- min_used, avg_used and max_used = exactly what you expect.
- 95percentileMB = memory in use. Put all the values in a list, highest first, then remove the top 5%; the highest value now at the top is this number. A great way to ignore sudden peaks in volatile stats.
Notes:
- I formatted the cells to remove the fractions of MBs - we have enough digits already
- This says we need 298,087 MB of memory.
- Don't forget AIX soaks up available memory in to the filesystem cache to reduce disk I/O - we might not need all of the memory.
Well, I hope this lets you quickly analyse vast numbers of nmon files.
Comments are welcome below - especially if I have not explained something well.
All of the above will take about 4 minutes per server, once you have your nmon files grouped sensibly into directories.
Download the scripts and samples:
Download by clicking on this link: nsum_scripts_samples_v1.zip
- The file is just 460 KB in .zip format
- The three scripts are only 100 lines of ksh in total: nsum_gen, nsum_cpu and nsum_ram
- The file sampleD.nmon is just something to try if you have no nmon files handy, and nsum.xlsx is the Excel spreadsheet I screen-grabbed above - it's all a bit messy as the files were taken at random.
- All pretty straightforward if you know your grep and awk - only tested on AIX so far.
- Please report successes in the comments.
- If you hit a bug, I will need the nmon file to help you and the "oslevel -s" output of the AIX you are running the scripts on.
- We could add totals for network I/O and disk I/O stats, but they are not normally very useful for capacity planning or server consolidation. We will get round to this only if there is demand.
- Also note you can use my rperf script (also on DeveloperWorks) on AIX to determine the rPerf rating of the LPAR.
- - - The End - - -