Optimizing AIX 5L performance: Monitoring your CPU, Part 2

CPU monitoring with lparstat, vmstat, sar, procmon, and nmon

Identify which AIX® tools to use to monitor your Central Processing Unit (CPU) in a given situation, and find out why some tools might be better than others. Part 1 of this series discussed tuning methodology and the importance of having procedures for CPU performance tuning. It also briefly introduced some performance tools to use as part of your tuning repertoire, gave an overview of the POWER CPU, and discussed how the architectural improvements in successive generations of the POWER chip have contributed to the hardware improvements of the System p™ product line.

Ken Milberg, Future Tech UNIX Consultant, Technology Writer, and Site Expert, Future Tech

Ken Milberg is a Technology Writer and Site Expert for techtarget.com and provides Linux technical information and support at searchopensource.com. He is also a writer and technical editor for IBM Systems Magazine, Open Edition. Ken holds a bachelor's degree in computer and information science and a master's degree in technology management from the University of Maryland. He is the founder and group leader of the NY Metro POWER-AIX/Linux Users Group. Through the years, he has worked for both large and small organizations and has held diverse positions from CIO to Senior AIX Engineer. Today, he works for Future Tech, a Long Island-based IBM business partner. Ken is a PMI certified Project Management Professional (PMP), an IBM Certified Advanced Technical Expert (CATE, IBM System p5 2006), and a Solaris Certified Network Administrator (SCNA). You can contact him at kmilberg@gmail.com.



24 April 2007


About this series

This three-part series focuses on various aspects of Central Processing Unit (CPU) performance and monitoring. The first installment provides an overview of how to efficiently monitor your CPU, discusses the methodology for performance tuning, and covers considerations that can impact performance, either positively or negatively. Though the first part goes through some commands, the second installment focuses in much more detail on actual CPU monitoring and on analyzing trends and results. The third installment focuses on proactively controlling thread usage and other ways to tune your CPU to maximize performance. Throughout this series, I'll also expound on various best practices of AIX® CPU performance tuning and monitoring.

Introduction

Performance tuning is clearly more than running some commands and observing the output. A UNIX® administrator needs to know which tools to run for what purpose and what the best methods are for capturing data. There are times when you might not have 30 days to systematically analyze your data to determine trends, and there might be instances where you do not even have 30 minutes to make an important judgment call on what your bottleneck is. After all, that is the main purpose behind CPU monitoring -- determining exactly what your bottleneck is. You do not want to tune your CPU unless the data that you've compiled clearly shows that the CPU is the bottleneck. In fact, more often than not, you'll find your bottleneck will be memory or I/O related rather than CPU related.

As an AIX administrator, one of your most important roles is to tune your systems. Tuning cannot be done without first monitoring your system and analyzing the results. This goes for both long-term trending and short-term (that job must finish in the next hour) issues. While there are specific tools that analyze only the CPU, in some circumstances you might want to use tools that look at all possible bottlenecks on your system. As you probably already know, the CPU is the fastest component of the system. If your CPU is a bottleneck, it affects performance throughout your system. As I go through the tools, please note that the following commands have been enhanced in AIX 5.3 to report accurate statistics on shared partitions using Advanced POWER Virtualization: mpstat, sar, topas, and vmstat. Furthermore, the following trace-based tools have also been updated: curt, filemon, netpmon, pprof, and splat.

Enough with the chatter, let's start monitoring your systems.

UNIX generic CPU monitoring tools

In this section, you'll examine UNIX generic tools that are available in all UNIX distributions (Solaris to AIX). While some of the output varies among distributions, most flags work across all UNIX systems. These can help you gather information on the fly, but I wouldn't rely on them for historical trending and analysis.

Let's start with vmstat. vmstat reports back information about processes, memory, paging, blocked I/O, and overall CPU activity. While it has its roots in virtual memory (the vm in vmstat), I have found unquestionably that running vmstat on a host is the quickest way for me to determine exactly what is happening on an AIX server.

Using vmstat

You just received that dreaded call, "Why is the system so slow?", and you need to do a quick analysis to determine where the bottleneck might be. vmstat is the best place to start. See Listing 1 for an example of running vmstat.

Listing 1. Running vmstat
# vmstat 1

System configuration: lcpu=2 mem=3920MB

kthr    memory                page              faults          cpu    
-----  -----------    ------------------------ ------------  -----------
r  b    avm   fre    re  pi  po  fr   sr  cy  in   sy  cs   us sy id wa
0  0  229367 332745   0   0   0   0    0   0   3  198  69    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   3   33  66    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   2   33  68    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0  80  306 100    0  1 97  1
0  0  229367 332745   0   0   0   0    0   0   1   20  68    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   2   36  64    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   2   33  66    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   2   21  66    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   1  237  64    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   2   19  66    0  0 99  0
0  0  229367 332745   0   0   0   0    0   0   6   37  76    0  0 99  0

The most important fields to look at here are:

  • r -- The average number of runnable kernel threads over whatever sampling interval you have chosen.
  • b -- The average number of kernel threads that are in the virtual memory waiting queue over your sampling interval. r should always be higher than b; if it is not, it usually means you have a CPU bottleneck.
  • fre -- The size of your memory free list. Do not worry so much if the amount is really small. More importantly, determine if there is any paging going on if this amount is small.
  • pi -- Pages paged in from paging space.
  • po -- Pages paged out to paging space.
  • CPU section:
    • us
    • sy
    • id
    • wa

Let's look at the last section, which also comes up in most other CPU monitoring tools, albeit with different headings:

  • us -- user time
  • sy -- system time
  • id -- idle time
  • wa -- waiting on I/O

Clearly, this system has no bottleneck to speak of. How do you determine this? Let's look at the more important fields to analyze in the vmstat output. Even though this system is running AIX 5.3, you will not see the number of physical processors or the percentage of your consumed entitled capacity because it is not running in a micro-partitioned environment. If it were running in a micro-partitioned environment, you would see these additional fields, as vmstat was enhanced to work in a virtualized and micro-partitioned environment.

If your us and sy entries together consistently average over 80 percent, you more than likely have a CPU bottleneck. If they add up to 100 percent, your system is working flat out. If the numbers are small but wa (waiting on I/O) is high (usually greater than 30), there might be I/O problems on the system, which can keep the CPU from working as hard as it could. If more time is spent in sy than in us, your system is spending more time processing kernel code than doing the application's actual work. This is also not a good thing.
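
The rules of thumb above can be applied mechanically to captured vmstat output. The following is a minimal sketch rather than a supported tool: the function name vmstat_verdict and the verdict strings are made up for illustration, and it assumes the header layout shown in Listing 1, where the four cpu columns start at the us heading.

```shell
# Classify vmstat samples read from stdin using the rules of thumb above.
# The thresholds (80 for us+sy, 30 for wa) come from the text; the function
# name and the verdict strings are illustrative only.
vmstat_verdict() {
    awk '
        # Header row: remember where the cpu block starts. "sy" appears
        # twice in the header, so locate "us" and count from there.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) if ($i == "us") us_col = i
            next
        }
        # Data rows start with a number; ignore rows seen before the header.
        us_col && $1 ~ /^[0-9]+$/ {
            us += $us_col; sy += $(us_col + 1); wa += $(us_col + 3); n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            us /= n; sy /= n; wa /= n
            printf "avg us=%.1f sy=%.1f wa=%.1f: ", us, sy, wa
            if (us + sy > 80) { print "likely CPU bound" }
            else if (wa > 30) { print "possible I/O problem" }
            else { print "no obvious CPU bottleneck" }
            if (sy > us) { print "note: more system time than user time" }
        }'
}
```

On a live system, the samples would be piped in, for example with vmstat 1 5 | vmstat_verdict. Averaging a handful of samples is a deliberate simplification; a single busy interval should not trigger a verdict on its own.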

While the vmstat command is more commonly associated with memory, I have found that it is the quickest and most accurate way to determine what my bottleneck is.

So why did the user complain about the system? Because it really seemed like it was running slow to him. I was only able to get to the root cause after I determined there were no systems problems and his buddy in the adjoining cube had no issues to speak of. So I had him reboot his PC and everything came up clean afterwards. Apparently, something was running haywire on the PC client.

The next day I get another call and start vmstat again (see Listing 2).

Listing 2. Running vmstat again
# vmstat 1 
System configuration: lcpu=2 mem=3920MB

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------  -----------
r  b   avm   fre   re  pi  po  fr   sr  cy  in  sy  cs   us sy id wa
9  0  4200  2746   0   0   0   0    0   0   3  198  69   70 30  0  0
4  7  4200  2746   0   0   0   0    0   0   3   33  66   67 31  2  0
2  6  4200  2746   0   0   0   0    0   0   2   33  68   65 34  1  0
3  9  4200  2746   0   0   0   0    0   0  80  306 100   80 20  0  1
2  7  4200  2746   0   0   0   0    0   0   1   20  68   80 20  0  0

So what does this tell you?

Clearly, this system is CPU bound. There is no paging going on, nor any I/O problems to speak of. There are lots of runnable threads and not enough CPU cycles to process what needs to be done. How long did it take for me to reach this conclusion? Exactly five seconds. Try doing that with other utilities.
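
To put a number on "lots of runnable threads," one option is to compare the average of the r column against the lcpu value that vmstat prints in its configuration banner. This is only a sketch: comparing the run queue to the logical CPU count is a common rule of thumb and an assumption here, not a threshold taken from this article, and the function name runq_check is hypothetical.

```shell
# Compare the average run queue (r) with the logical CPU count (lcpu).
# Reads vmstat output, including the "System configuration" banner, on stdin.
# The "r greater than lcpu" test is an assumed rule of thumb.
runq_check() {
    awk '
        # Pull the logical CPU count out of "System configuration: lcpu=2 ..."
        /^System configuration:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^lcpu=/) { split($i, kv, "="); lcpu = kv[2] + 0 }
            next
        }
        # Data rows: the first two fields are r and b.
        $1 ~ /^[0-9]+$/ { r += $1; b += $2; n++ }
        END {
            if (n == 0 || lcpu == 0) { print "not enough data"; exit 1 }
            printf "lcpu=%d avg_r=%.1f avg_b=%.1f\n", lcpu, r / n, b / n
            if (r / n > lcpu) { print "run queue exceeds logical CPUs" }
        }'
}
```

Fed the five samples from Listing 2, this reports an average run queue of 4.0 against two logical CPUs, which matches the conclusion reached by eye.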

Using sar

The next command, sar, is the UNIX System Activity Reporting tool (part of the bos.acct fileset). It has been around for what seems like forever in the UNIX world. This command essentially writes to standard output the contents of the cumulative activity counters selected by its flags. For example, the following command uses the -u flag to report CPU statistics. As with vmstat, if you are using shared partitioning in a virtualized environment, it reports back two additional columns of information: physc and entc, which give the number of physical processors consumed by the partition and the percentage of entitled capacity utilized.

I ran this command on the system (see Listing 3) when there were no users around. Unless there were some batch jobs running, I would not expect to see a lot of activity.

Listing 3. Running sar with no users around
# sar -u 1 5 (or sar 1 5)

AIX test01 3 5     03/18/07

System configuration: lcpu=2 


17:36:53    %usr    %sys    %wio   %idle   physc
17:36:54       0       0       0     100    2.00
17:36:55       1       0       0     99     2.00
17:36:56       0       0       0     100    2.00
17:36:57       0       0       0     100    2.00
17:36:58       0       0       0     100    2.00

Average        0       0       0     100    2.00

Clearly, this system also shows no CPU bottleneck to speak of.

The columns used above are similar to the vmstat output fields. Table 1 shows how the sar and vmstat fields correspond.

Table 1. sar output fields and the corresponding vmstat field
sar      vmstat
%usr     us
%sys     sy
%wio     wa
%idle    id
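
If you collect with sar but are used to reading vmstat, the mapping in Table 1 can be applied to the Average row of a sar -u report. The sketch below assumes the column headings shown in Listing 3; the function name sar_avg_as_vmstat is hypothetical.

```shell
# Print the "Average" row of a sar -u report using the vmstat field names
# from Table 1. Columns are located by heading, so an extra column such as
# physc does not disturb the result.
sar_avg_as_vmstat() {
    awk '
        # Header row: record the position of each sar column by name.
        /%usr/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Summary row: relabel the four CPU fields.
        $1 == "Average" && ("%usr" in col) {
            printf "us=%s sy=%s wa=%s id=%s\n",
                   $(col["%usr"]), $(col["%sys"]), $(col["%wio"]), $(col["%idle"])
        }'
}
```

A typical invocation would be sar -u 1 5 | sar_avg_as_vmstat. This works because the heading row and the Average row carry the same number of fields, with the timestamp replaced by the word Average.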

One of the reasons I prefer vmstat to sar is that it gives you the CPU utilization information, and it provides overall monitoring information on memory and I/O. With sar, you need to run separate commands to pull the information. One advantage that sar gives you is the ability to capture daily information and to run reports on this information (without writing your own script to do so). It does this by using a process called the System Activity Data Collector, which is essentially a back-end to the sar command. When enabled, usually through cron (on a default AIX partition, you would usually find it commented out), it collects data periodically in binary format.

AIX-specific CPU monitoring tools

Let's now discuss commands that are specific to AIX. These commands were written to enable administrators to monitor systems in a partitioned environment. They are particularly helpful when you are using Advanced POWER Virtualization features, such as shared processors and Micro-Partitioning.

Using lparstat

When the user first reported system slowness, a decision was made to kick off lparstat. The purpose of the lparstat command is to report logical partition (LPAR) information and related statistics. In AIX 5L Version 5.3, the lparstat command displays hypervisor statistical data about many POWER Hypervisor calls. The lparstat command is a relatively new command that is typically used to assist in shared processor partitioned environments.

I used the -h flag, as shown in Listing 4, because I also wanted to see the POWER Hypervisor statistics.

Listing 4. The -h flag for the lparstat command
# lparstat -h 1 5

System configuration: type=Dedicated mode=Capped smt=On lcpu=4 mem=3920 

%user  %sys  %wait  %idle  %hypv hcalls
-----  ----  -----  -----  ----- ------
  0.0   0.7    0.0   99.3   44.4 5933918 
  0.4   0.3    0.0   99.3   44.9 5898086 
  0.0   0.1    0.0   99.9   45.1 5930473 
  0.0   0.1    0.0   99.9   44.6 5931287 
  0.0   0.1    0.0   99.9   44.6 5931274

As you can see, in some ways, the output generated above is similar to the sar command. Note that for partitions running AIX 5.2 or AIX 5.3 in either a dedicated environment or shared and capped, the overall CPU utilization is based on the user, sys, wait, and idle values. In AIX 5.3 partitions running in uncapped mode, the utilization would be based on the entitled capacity percentage.
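
The distinction between the two ways of measuring utilization can be sketched as a small filter over lparstat output. This is illustrative only: the function name lparstat_busy is hypothetical, and the %entc column heading for shared partitions is an assumption, since Listing 4 was taken on a dedicated partition and does not show it.

```shell
# Summarize lparstat samples read from stdin.
# Dedicated or capped: average of %user + %sys.
# Uncapped: average of the entitled-capacity column, assumed to be "%entc".
lparstat_busy() {
    awk '
        /^System configuration:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^mode=/) { split($i, kv, "="); mode = kv[2] }
            next
        }
        # Header row: index the columns by name.
        $1 == "%user" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows begin with a number such as 0.4.
        ("%user" in col) && $1 ~ /^[0-9.]+$/ {
            busy += $(col["%user"]) + $(col["%sys"])
            if ("%entc" in col) entc += $(col["%entc"])
            n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            if (mode == "Uncapped" && ("%entc" in col)) {
                printf "mode=%s avg_entc=%.1f\n", mode, entc / n
            } else {
                printf "mode=%s avg_busy=%.1f\n", mode, busy / n
            }
        }'
}
```

For example, lparstat 1 5 | lparstat_busy. Reading the mode from the configuration banner keeps the decision in one place, so the same filter can be pointed at either kind of partition.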

mpstat

Another command I use frequently is the mpstat command (see Listing 5), which is part of the bos.acct fileset. This is a tool created specifically for AIX 5.3 (unlike lparstat) that displays performance numbers for all logical CPUs on your partitioned system. When you run the mpstat command, two sections of statistics are displayed. The first section shows the system configuration, which is displayed when the command starts and whenever there is a change in the system configuration. The second section shows utilization statistics, which are displayed at user-specified intervals.

Listing 5. Running mpstat
 # mpstat 1 1
System configuration: lcpu=2 ent=2.0

cpu min maj mpc int cs ics rq mig lpa sysc           us sy wa id  pc  %ec   lcs 
0    0   0   0  164 83 40   0 1   100  17             0  0 0 100 0.17 8.3   113
1    0   0   0  102  1  1   1 0   100 3830453 66 34   0  0 0 100  .83 41.6

I like the mpstat command, because it reports back collection information for each logical CPU on your partition in a format that is clearly illustrated. You can even see the simultaneous multithreading (SMT) thread utilization by using the -s option. The downside to both the lparstat and mpstat commands is that they require the writing of scripts and other tools to deal with the formatting of data and graph output. Essentially, you would need to write your own shell scripts. Though most administrators love to script, they also don't like to reinvent the wheel. If there are already tools in place to help you analyze historical data, it makes little sense to write your own utilities.
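
To give a sense of how small such a script can be, here is a sketch of a filter that turns the whitespace-separated output of mpstat or lparstat into CSV for graphing. The function name stat_to_csv and the stamp column are hypothetical, and the filter makes no attempt to validate the columns it is given.

```shell
# Convert whitespace-separated stat output on stdin to CSV.
# $1 is a label (for example a timestamp) prepended to every data row.
stat_to_csv() {
    awk -v stamp="$1" '
        # Skip blank lines, the configuration banner, and dashed rulers.
        NF == 0 || /^System configuration:/ || /^[- ]+$/ { next }
        {
            # The first surviving row is the header; label its extra column.
            line = (seen++ ? stamp : "stamp")
            for (i = 1; i <= NF; i++) line = line "," $i
            print line
        }'
}
```

One possible invocation is mpstat 1 1 | stat_to_csv "$(date +%H:%M:%S)" > mpstat.csv. Even so, this only covers formatting; storing, rotating, and charting the data is where a purpose-built tool saves the real effort.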

GUI tools

In this section, take a look at the utilities that enable you to graphically look at your analysis and also allow you to analyze historical data. Although it takes some time to fully understand these tools, they are more flexible than the command-line tools you already looked at.

procmon

Let's start with procmon (see Figure 1). This utility (released in AIX 5.3) not only provides overall performance statistics, but it also allows you to take action on the running processes. It essentially allows an administrator to either kill or renice a process on the fly. You can also export procmon data to a file, which makes it a nice data collection tool. procmon actually runs as a plug-in to the performance workbench, which is started by using the perfwb command (in /usr/bin), part of the bos.perf.gtools.perfwb fileset.

Figure 1. procmon output

What I like about procmon is that it allows you to take action on a process, which might increase performance on a system. While it has its limitations, I strongly recommend that you download and use this tool; I have found that most administrators tend not to.

topas

Another tool that you should be aware of is topas. Truthfully, I've never been a huge fan of topas (part of the bos.perf.tools fileset), although it has been improved substantially in AIX 5.3. Prior to these changes, it could not capture historical data, nor was it enhanced for use in shared partitioned environments. Now that it can collect performance data from multiple partitions, topas is far more useful as a performance management and capacity planning tool. The look and feel of topas (see Figure 2) is quite similar to top and monitor (used in other UNIX variants). topas displays all kinds of information on your screen in a text-based, GUI-like format. In its default mode, it shows you the hostname, the refresh interval, and a potpourri of CPU, memory, and I/O information.

Figure 2. topas display

Some new features also include the ability to run topas on a Virtual I/O Server (VIO Server). To do this, you would use the following command:

# topas -cecdisp

On an LPAR, you would run:

# topas -C

Regarding the performance monitoring features that were introduced in 5.3 TL 4, topas uses a daemon named xmwlm, which is started automatically from the inittab. In TL 5 of AIX 5.3, it keeps seven days of data by default and records almost all of the data that topas displays interactively, except for process and Workload Manager (WLM) information. It uses the topasout command to generate text-based reports. While topas has come a long way in addressing its deficiencies, a lot of administrators might prefer another utility -- nmon, for example.

nmon

Easily my favorite of all performance monitoring tools is nmon (not an "officially" supported IBM tool). The data that you collect from nmon (see Figure 3) is available either from your screen or through reports that usually are run from cron. In the words of its creator, Nigel Griffiths, "Why use five or six tools when one free tool can give you everything you need?"

Figure 3. nmon sample output

It's important to note that unlike some of the other tools already discussed, nmon is also available for Linux®, which really helps the Linux on POWER user base with performance issues. What attracts most administrators to nmon is that not only does it have a very efficient front-end monitor, as shown in Figure 3 (which the admin can call upon on the fly), but it also provides the ability to capture data to a text file for graphing reports, as the output is in a .csv (spreadsheet) format (see Figure 4). In fact, moments after running an nmon session, you can actually see the nicely rendered charts on an Excel spreadsheet, which can be handed off to senior management or other technical teams for further analysis. Further, unlike topas, I've never seen any performance-type overhead associated with this utility.

Let's look at a simple task. First let's tell nmon to create a file, name the run, and do data collection every 30 seconds for 180 intervals (1.5 hours):

# nmon -f -t -r test2 -s 30 -c 180
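
The interval and count are worth checking before a long capture is started, because together they fix how long nmon keeps collecting. The helper below is a small sketch of that arithmetic; the function name nmon_duration is made up for illustration.

```shell
# Print how long COUNT snapshots taken every INTERVAL seconds will run.
# Usage: nmon_duration INTERVAL COUNT
nmon_duration() {
    interval=$1
    count=$2
    total=$((interval * count))
    printf '%d snapshots x %ds = %ds (%dh %02dm)\n' \
        "$count" "$interval" "$total" "$((total / 3600))" "$((total % 3600 / 60))"
}

nmon_duration 30 180   # prints: 180 snapshots x 30s = 5400s (1h 30m)
```

The same helper shows, for instance, that a 24-hour capture at one-minute granularity needs -s 60 -c 1440.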

When this is completed, sort the file, as shown in Listing 6.

Listing 6. Sorting the file
# sort -A testsystem_yymmdd_hhmm.nmon > testsystem_yymmdd.hhmm.csv
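
Because the capture is plain comma-separated text, a quick sanity check is possible on the server itself. The sketch below rests on an assumption about the file layout that you should verify against your own capture: that overall CPU snapshots appear as rows beginning with CPU_ALL, followed by a snapshot tag such as T0001 and then the user, system, wait, and idle percentages. The function name nmon_cpu_avg is hypothetical.

```shell
# Average the overall CPU figures in an nmon capture read from stdin.
# ASSUMED row layout: CPU_ALL,T0001,user,sys,wait,idle,...
nmon_cpu_avg() {
    awk -F, '
        # Snapshot rows carry a T-number in field 2; the heading row does not.
        $1 == "CPU_ALL" && $2 ~ /^T[0-9]+$/ {
            u += $3; s += $4; w += $5; d += $6; n++
        }
        END {
            if (n == 0) { print "no CPU_ALL snapshots found"; exit 1 }
            printf "snapshots=%d user=%.1f sys=%.1f wait=%.1f idle=%.1f\n",
                   n, u / n, s / n, w / n, d / n
        }'
}
```

It would be run as nmon_cpu_avg < testsystem_yymmdd.hhmm.csv, using the placeholder file name from Listing 6. It is no substitute for the analyzer's charts, but it confirms that the capture contains usable CPU data.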

When this is completed, FTP the .csv file to your workstation, start the nmon analyzer spreadsheet (make sure you enable macros), and then click on analyze nmon data (see Figure 4).

Figure 4. nmon analyzer output

The nmon analyzer is an awesome tool, written by Stephen Atkins, that graphically presents nmon data (CPU, memory, network, or I/O) in an Excel spreadsheet. Perhaps the only drawback that keeps it from being an enterprise type of tool is that it cannot gather statistics on large numbers of LPARs at once, as it is not a database (nor was it meant to be). That is where a tool such as Ganglia (see Resources for a link) helps; it has received the blessing of Nigel Griffiths, as it can integrate nmon analysis.

Summary

Part 2 of this series reviewed many tools and utilities that you can use to capture and analyze performance data from System p servers running AIX. Some of these commands have been available since the early days of UNIX. Many are specific to AIX, and others are unsupported IBM utilities, but most AIX administrators use them all. Regardless of which tool you like best, you need one tool to look at performance activity in real time and another to capture data for historical performance tuning, trending, and capacity planning analysis. Some tools can do both (for example, nmon), but most are geared toward one or the other. I encourage you to play around and find the tools that not only work best for you, but that can also provide value to people who are not systems administrators capable of reading endless vmstat displays.

Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.
