AIXpert Blog is about the AIX operating system from IBM running on POWER based machines called Power Systems and software related to it like IBM Systems Director, PowerVM for virtualisation and PowerSC for security plus performance monitoring and nmon
I have never really understood why I get peaks and troughs in nmon questions and urgent situations, but July seems to be a peak and fortunately I was not at work for a large part of it. Performance and nmon questions tend to come in three flavours: really dumb, genuine, and mega-urgent critical "let's blame the hardware". Let me give you a flavour of each type of question from the last month. Perhaps reading these will let you avoid the same problem, or at least let you learn that "you are not alone".
"nmon has stopped producing the LPAR Tab - help?" or "I can't find the XYZ stats!"
I know I am a guru but there is a lot that I can't do in a vacuum. I need a few facts to get started.
Is this the nmon for AIX or nmon for Linux?
Which version of the OS is involved?
On AIX oslevel -s gives you this in the detail I need (including service packs).
On Linux it is much harder as there is Distro and Version - which are held in different places depending on the flavour. cat /etc/*ease is a good start.
If Linux, what hardware platform? i.e. POWER, x86, x86_64 or mainframe.
Which version of nmon?
Then I can look at the actual question.
All of these are included in the nmon output file - so a small sample file, showing the problem is a good idea.
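As a rough sketch of how you could pull those facts out of a sample file yourself before emailing: nmon capture files are comma-separated text, and the descriptive records near the top start with AAA. The function name read_nmon_header and the sample lines below are made up for illustration, and the exact set of AAA keys varies between nmon for AIX and nmon for Linux, so treat the field names as assumptions.

```python
def read_nmon_header(lines):
    """Collect AAA header records from an nmon capture into a dict.

    Assumes each header record looks like "AAA,key,value[,more...]".
    """
    header = {}
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[0] != "AAA" or len(fields) < 3:
            continue  # not a header record, skip it
        # The second field is the key; any remaining fields form the value.
        header[fields[1]] = ",".join(fields[2:])
    return header


# Made-up sample lines, only to show the shape of the data.
sample = [
    "AAA,progname,topas_nmon",
    "AAA,host,example_lpar",
    "AAA,AIX,6.1.7.15",
    "ZZZZ,T0001,12:00:00,01-JUL-2012",
]
info = read_nmon_header(sample)
print(info["progname"], info["AIX"])
```

For a real file you would pass an open file object instead of the sample list, for example read_nmon_header(open("myfile.nmon")).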
Lesson: This information is required because a good 50% of issues are fixed by updating the OS or nmon version.
"nmon Error! - there is a problem with nmon2rrd and later in the email they are actually trying to run nmon2web.pl"
This sort of email really, really annoys me, as it is like shouting "your baby is ugly!" when it turns out they are talking about someone else's baby, and someone you don't really know at that. For the record: I (Nigel Griffiths) developed the original nmon now called classic nmon and nmon for Linux, the nmon Analyser is developed by Stephen Atkins and nmon2web is developed by Bruce Spenser. Also nmon2rrd is sample C code on analysing nmon files written by me and given away for community support - I generally recommend nmon2web instead. We have all got used to users not understanding "who does what" and to passing email on to the right guy. Unlike the song, we can't expect the whole world "to sing in perfect harmony ..."
Lesson: If you want some support please don't start with "your baby is ugly". The smart guys start with "Your baby is beautiful but I have a question ..."
"We have just upgraded AIX 5 to AIX 6 and nmon is broken - what are you going to do about it?"
Most AIX 5 users are running old copies and the classic nmon - which they downloaded years ago and probably forgot where they got it.
Amazingly, when it fails to run they then try older and older versions. How anyone decides that trying nmon_aix433 on AIX 6 is a good idea is flabbergasting to me. You see, the clue is in the name: "nmon_aix433" was written for AIX 4.3.3, which is something like 14 years ago. I know I am a very good C programmer but even I can't write programs that will support new operating system stats that will not be released for a decade in the future :-)
OK, so people asking this question missed the announcement of nmon for AIX now being a supported AIX command and delivered inside AIX.
The details are at the nmon wiki http://www.ibm.com/developerworks/wikis/display/WikiPtype/nmon
It reads - 21st November 2008: STOP PRESS: nmon for AIX included with AIX from 5.3 TL09, AIX 6.1 TL02 and Virtual I/O Server (VIOS) 2.1 . It is installed by default.
So that is 3.5+ year old news. No one should be running the classic nmon for AIX in 2012 because the AIX releases it was written for are out of service.
OK you may have very old systems but you should be upgrading to an AIX version with the official topas_nmon version.
Lesson: If you like your tools then make sure you keep up to date with the news and code level.
"Where is the download for nmon for AIX 7?"
nmon for your specific AIX 7 release is already on your disk at /usr/bin/nmon
This comes from the AIX Performance Tool team at IBM and has full IBM support - just like any other AIX command. This does not include analysing your data!
For the record: /usr/bin/nmon is a shell script that calls /usr/bin/topas_nmon = the actual binary and the same program file as topas but starts in nmon mode.
Lesson: Engage the brain before whacking off yet another email and looking silly.
"Can you put me on the nmon emailing list?"
Actually this is not a dumb question, as somewhere out there on the great "Interweb" there is an old statement about asking to go on it.
If anyone finds it, please, let me know so I can remove/update it - thanks.
I did have a few thousand people on the email list at one time but we have moved to this AIXpert Blog, and you can follow my Twitter account mr_nmon as the communications channels for AIX, performance, nmon, POWER and nmon for Linux. Now you, not me, control whether you want to listen or not.
Lesson: Move with the times.
"What does this number mean?" Could be any of the 100s of stats we collect.
nmon was written for performance tuning experts but we have all needed to learn when we started out.
Most of the stats are available in other AIX or UNIX commands, so their manual pages help, or it is common computer knowledge.
I either refer the questioner to the correct AIX manual pages or nmon documentation or nmon FAQ for others - see the nmon wiki for these.
There are also the POWER CPU related numbers that confuse many: CPU min, desired and max (and that we call desired the Entitlement when running); virtual CPUs, logical CPUs and physical CPUs; and SMT, core, processor and socket.
However, nmon is not a replacement for some basic computer science knowledge and I have had to ask some people to read a manual, read a Redbook or take education before asking more questions.
Lesson: We can all help out the newbies and straighten out the experts.
"Does nmon save data at a 'point in time'?" Meaning when nmon saves the stats is it just the stats at that exact time or averaged out since the last capture.
Answer is: Yes and no!
Some stats are just numbers and they don't fluctuate much like free memory. In these cases nmon just reports the value at the time asked.
Some numbers are rates, like disk or network megabytes per second. Obviously, that can't be point in time, and forget just capturing the last second (lots of work). So nmon averages between the previously saved data and now. It does this by taking the difference between the previous and new counters (which are total-so-far counters) and dividing by the elapsed time.
Some numbers fluctuate a lot, like the CPU stats. When you look at the CPUs they are either 100% busy or 100% idle. Any idea about a number in between, like the utilisation percentage stats, is a human perception and so again it is an average.
Averages will hide the peaks - if you want to see the peaks you need to run nmon at a faster rate = smaller number of seconds between captures.
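The counter arithmetic can be sketched in a few lines. This is only an illustration of the "difference divided by elapsed time" idea, not the actual nmon source; the function name rate_per_second and the numbers are made up.

```python
def rate_per_second(prev_total, new_total, elapsed_seconds):
    """Average rate between two snapshots of a total-so-far counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (new_total - prev_total) / elapsed_seconds


# Hypothetical disk counter: KB transferred since boot, sampled 60 seconds apart.
prev_kb = 1_000_000
new_kb = 1_600_000
kb_per_sec = rate_per_second(prev_kb, new_kb, 60)
print(kb_per_sec)  # 10000.0
```

Note the result says nothing about what happened inside the 60 seconds: a steady 10,000 KB/s and one short burst followed by silence give the same answer.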
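To see how much an average can hide, here is a small sketch with made-up numbers: a CPU that is 10% busy for 50 seconds and then 100% busy for 10 seconds, summarised with a 60 second and a 10 second capture interval. The helper interval_averages is hypothetical and only mimics the averaging effect.

```python
def interval_averages(per_second, interval):
    """Average a list of per-second values over fixed-size capture intervals."""
    averages = []
    for start in range(0, len(per_second), interval):
        chunk = per_second[start:start + interval]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up workload: 50 quiet seconds followed by a 10 second burst.
busy = [10] * 50 + [100] * 10

peak_at_60s = max(interval_averages(busy, 60))
peak_at_10s = max(interval_averages(busy, 10))
print(peak_at_60s)  # 25.0
print(peak_at_10s)  # 100.0
```

With one capture a minute the burst looks like a lazy 25%; with a capture every 10 seconds the 100% peak is visible.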
Lesson: This question is always from theoretical Performance Architects - hands-on tuners and benchmarkers seem to know this by osmosis.
"Shared CPU, Uncapped LPAR utilisation numbers do not look right nor does the average of the logical CPUs?"
Correct. They are very misleading.
I have been pointing this out for 5+ years.
For these types of LPARs, you need to monitor the physical CPU use.
The problem is the utilisation numbers (User+System) get to roughly 95% as you get to the Entitlement and stay at just below 100% as you use double, quadruple or higher numbers of physical CPU. They do not show you how much CPU time you are using above Entitlement.
Plus you can't average the logical CPUs (these are the SMT threads) to get the machine average because they are time-sharing the physical CPUs.
Also for Dedicated CPU LPARs all the Shared Processor stats don't mean anything, so they are not collected and there is no LPAR Tab in the nmon Analyser.
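One simple way to make the physical CPU use readable is to express it as a percentage of the Entitlement. The sketch below is just that arithmetic with made-up numbers; the function name percent_of_entitlement is hypothetical and this is not a stat that I am claiming nmon calculates this way.

```python
def percent_of_entitlement(physical_cpu_used, entitlement):
    """Physical CPU consumed, as a percentage of the LPAR's Entitlement."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * physical_cpu_used / entitlement


# Hypothetical Uncapped LPAR with an Entitlement of 2.0 CPUs.
at_entitlement = percent_of_entitlement(2.0, 2.0)
well_above = percent_of_entitlement(8.0, 2.0)
print(at_entitlement)  # 100.0
print(well_above)      # 400.0
```

In both cases the User+System utilisation would sit somewhere just under 100%, which is exactly why it can't be trusted here.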
Lesson: POWER systems are function rich with advanced features, which means we can't use 1990s stats to understand them.
Mega-urgent critical let us first blame the hardware
"Critical testing performance issue for a new solution - can you join the project tomorrow morning (it was 7 pm)"
A couple of months ago I was urgently sent to a large project with a performance problem that had stumped the large team developing the software. I had been sent nmon data and the large group machine basically looked OK. As part of the project briefing, it was casually mentioned that the critical DB2 database was on its own LPAR and the database was a single file spread across disks. I nearly fell off the chair but managed to keep a straight face and politely asked "did I hear that right?" and "are they using Direct I/O or Concurrent I/O to the file?". The answer was no DIO or CIO because they followed the regular DB2 policy. The UNIX standard forces you to do a single update at a time for regular files (1 I/O at a time) to ensure integrity in case there are overlapping writes that have to get written in time order. If you know your application does not need this (like every RDBMS), the DIO and CIO features can be used to allow many I/O operations at the same time. We changed the setting and the database went 100 times faster on the first attempt. This team also have the "nmon wall of windows" - seven large flat panel screens all showing nmon on the various LPARs of the solution - I was so proud. Unfortunately, due to security, I can't show you the picture.
Lesson: No matter how hard you look at nmon data there are some things you can never see. But I hit this problem regularly.
"Your boss's boss's boss says that you can fix this critical situation with a performance problem"
Not the sort of email I like on day 3 of my holiday and I was "just checking in" to reduce the pile of junk email.
The team had decided there was an obvious problem with the POWER7 machine.
So we do the regular checks: System Firmware, AIX level to check we are running the version with bugs fixed then request they send me the nmon data.
I got a bucket full of nmon files from 4 machines over 5 days - no clear description of when the problem was happening so I have to guess from the data.
The nmon collections start at 3 pm for 24 hours - that tells you something about the team!!
I make comments about missing Firmware updates and AIX Service Packs being rather silly (see other AIXpert Blogs on these), which annoyed the customer.
I also send 3 pages of comments about the performance stats, with some hints and tips about what to look for and whether the numbers can be explained as normal for this particular workload. There was nothing very major causing an obvious bottleneck.
Then we looked at the whole system.
There are two main critical LPARs on the heavily over-committed machine. By over-committed I mean this: if you add up the LPAR Entitlements of a machine, they have to add up to at most the number of physical CPUs in the shared pool. But they have most LPARs Uncapped with the Virtual CPU (spreading factor) number much higher than the Entitlement. Normally, I don't recommend this for performance, as the LPAR has to compete for CPU cycles above the Entitlement. In this case, the two main LPARs have an Entitlement of 6 to 10 CPUs but a Virtual CPU of 40. Now the bad news: these two LPARs are busy at the same time - they are doing a database unload in one and a load of the same data in the other LPAR. If I tell you the machine has 64 physical CPUs, you can immediately see the problem. Both LPARs can't get 40 CPUs at the same time (we can't run 80 Virtual CPUs flat-out on 64 physical CPUs) and that does not include the other LPARs also running.
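The sum is simple enough to sketch. The function name virtual_cpu_shortfall is made up and the check deliberately ignores the other LPARs and the hypervisor; it only restates the 80 versus 64 arithmetic above.

```python
def virtual_cpu_shortfall(virtual_cpus_per_lpar, physical_cpus):
    """How many Virtual CPUs cannot run flat-out at the same time.

    Returns 0 when all the Virtual CPUs fit on the physical CPUs.
    """
    demand = sum(virtual_cpus_per_lpar)
    return max(0, demand - physical_cpus)


# The two busy LPARs from this story: 40 Virtual CPUs each on 64 physical CPUs.
shortfall = virtual_cpu_shortfall([40, 40], 64)
print(shortfall)  # 16
```

So at best 16 of those Virtual CPUs are waiting at any moment, before any other LPAR on the machine asks for a single cycle.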
Lesson: The single AIX view can't answer the whole machine problem here. We need IBM Tivoli Monitoring, Ganglia or other tools that can show the whole machine.
Lesson: The team needs to update System Firmware & AIX service packs. If your car brakes fail but you have not serviced your car for 3 years - it is your fault.
"We just ported a database from Solaris to AIX/POWER7 and need some performance help"
Round the usual loops: Firmware & AIX levels
It was a new POWER7 machine (Firmware is normally up to date but worth checking) and AIX 6 TL7 = good.
They sent me topas data = a bit like nmon data but with loads missing! OK we all need to learn.
Looked at data and made what comments I could.
Have you raised a PMR - "No. Oh dear, oh dear, oh dear" - I will post a further AIXpert Blog about how to get IBM to support performance problems.
Three days later I got an email: "Thanks for your help - everything is working fine. We made the updates recommended by Oracle and the "Hardware" problem went away!"
Lesson: Everything "looks" like a hardware issue until you work out it isn't plus you need to engage the application vendors too!
Just for the record: nmon is not my job it is my pet project, even now that I am only in charge of the nmon for Linux code.
ps: If you think you recognise yourself or your questions:
You are wrong = I bent a few facts in the larger questions to protect the identities of those teams involved.
Don't for a minute think your question was unique - I have seen the same old problems many times.