Optimizing AIX 5L performance: Tuning your memory settings, Part 2

Use ps, sar, svmon, and vmstat to monitor memory usage and analyze the results. This three-part series focuses on the various aspects of memory management and tuning on IBM System p™ servers running AIX®. Part 1 provided an overview of memory on AIX, including a discussion of virtual memory and the Virtual Memory Manager (VMM). It also drilled down into the tuning parameters and outlined recent improvements in AIX Version 5.3 with respect to memory management. Part 2 focuses on the details of actual memory subsystem monitoring and discusses how to analyze the results. Part 3 deals specifically with swap space and how best to tune your VMM settings for optimum swap space configuration and performance. Throughout this series, I'll also cover some of the best practices of memory performance tuning and monitoring.

Ken Milberg, Future Tech UNIX Consultant, Technology Writer, and Site Expert, Future Tech

Ken Milberg is a Technology Writer and Site Expert for techtarget.com and provides Linux technical information and support at searchopensource.com. He is also a writer and technical editor for IBM Systems Magazine, Open Edition. Ken holds a bachelor's degree in computer and information science and a master's degree in technology management from the University of Maryland. He is the founder and group leader of the NY Metro POWER-AIX/Linux Users Group. Through the years, he has worked for both large and small organizations and has held diverse positions from CIO to Senior AIX Engineer. Today, he works for Future Tech, a Long Island-based IBM business partner. Ken is a PMI certified Project Management Professional (PMP), an IBM Certified Advanced Technical Expert (CATE, IBM System p5 2006), and a Solaris Certified Network Administrator (SCNA). You can contact him at kmilberg@gmail.com.



14 June 2007


Introduction

The most important part of tuning your memory subsystem does not involve actual tuning. Before tuning your system, you must have a strong understanding of what is actually going on in the host system. To do that, an AIX® administrator must know which tools to use and how to analyze the data that they capture. To reiterate what I discussed in some other tuning documents published recently (see Resources), you cannot properly tune a system without first monitoring the host, whether it's running as a logical partition (LPAR) or on its own physical server. There are many commands that allow you to capture and analyze data, so you'll need to understand what they are and which ones are most suitable for the intended job. After you capture your data, you need to analyze the results. What initially looks like a Central Processing Unit (CPU) problem might turn out to be a memory or I/O problem, provided you use the right tools to capture data and understand how to do the analysis. Only when this is properly done can you really consider making changes to your system. Just as a medical doctor cannot treat an illness without knowledge of your history and the symptoms you are experiencing, you need to come up with a diagnosis before tuning your subsystems. Tuning your memory subsystem when you have a CPU or I/O bottleneck will not help you, and it might even hurt the health of the host.

This article helps you understand the importance of getting the diagnosis correct as well. You will see that performance tuning is much more than actual tuning itself. Some of the tools you will be looking at are generic monitoring tools that are available on all flavors of UNIX, while others were written specifically for AIX. I will point out some of the tools that have been optimized for AIX Version 5.3 and the new ones developed specifically for AIX 5.3 systems.

I can't reiterate enough the importance of generating baseline data. The time to be monitoring your system is not when you get that ticket from the Help Desk complaining about poor performance. Data should be captured on your servers as soon as they are put into production. If you do this, you can be proactive in your tuning, with the objective of actually finding the problem before the user points it out to you. How can you determine whether the data you are looking at substantiates a performance issue if you have no data from when performance on the box was acceptable? This is all part of an appropriate performance tuning methodology: capturing data effectively and properly analyzing the results and the trends. Let's get on with it.

UNIX generic memory monitoring

In this section, I provide an overview of generic UNIX tools available on all UNIX distributions—ps, sar and vmstat. Most of these tools allow you to quickly troubleshoot a performance problem, but they are not really geared for historical trending and analysis.

Most administrators tend to shy away from ever using the ps command to troubleshoot a possible memory bottleneck. In fact, I would add that many UNIX administrators don't even know that you can use ps to help you determine the cause of a memory problem. The most commonly used function of ps is to look at the processes running on your systems (see Listing 1).

Listing 1. Using ps to look at the processes running on your system
# ps -ef | more
  UID   PID  PPID   C    STIME    TTY  TIME CMD
    root     1     0   0   May 03      -  0:03 /etc/init
    root 11244 19154   0                  0:00 <defunct>
    root 11384     1   0   May 03      -  0:00 /usr/lib/errdemon
    root 12434 16618   0   May 03      -  0:29 /usr/opt/ifor/bin/i4llmd -b -n wcclwts -l /var/ifor/llmlg
    root 13218 16618   0   May 03      -  0:00 /usr/sbin/rsct/bin/IBM.AuditRMd
    root 13440     1   0   May 03      -  0:00 /usr/ccs/bin/shlap
    root 13690 13954   0   May 03      -  0:00 dtlogin <:0>        -daemon
    root 13954     1   0   May 03      -  0:00 /usr/dt/bin/dtlogin -daemon

As you can see, there is not much here that can help you diagnose a memory bottleneck. The command in Listing 2 shows the memory usage of each active process running on your system, sorted by resident set size (RSS) in descending order. This uses ps the old-fashioned Berkeley Software Distribution (BSD) way, without the dash. What I like about this command is that you don't have to call up any GUI-type tools to quickly get a sense of what is going on from a memory perspective.

Listing 2. Memory usage for each active process
# ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r
  PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
 15256      - A    64:15  755  2572  2888    xx  2356   316  0.9  0.0 /usr/lpp/
 22752      - A     0:08  261  1960  1980 32768   465    20  0.0  0.0 dtwm
 14654      - A     0:00  324  1932  1932    xx   198     0  0.0  0.0 /usr/sbin
 20700      - A     0:07  271  1868  1896 32768    95    28  0.0  0.0 /usr/dt/b
 20444      - A     0:03  203  1736  1824 32768   551    88  0.0  0.0 dtfile
 17602      - A     0:00  274   948  1644 32768   817   696  0.0  0.0 sendmail:
 13218      - A     0:00   74  1620  1620    xx   116     0  0.0  0.0 /usr/sbin

Let's briefly identify what some of this information means.

  • RSS—The amount of RAM used for the text and data segments per process. PID 15256 is using 2888k.
  • %MEM—The actual amount of the RSS / Total RAM. Watch for processes that consume 40-70 percent of %MEM.
  • TRS—The amount of RAM used for the text segment of a process in kilobytes.
  • SIZE—The actual amount of paging space allocated for this process (text and data).
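As a rough illustration of how the %MEM guideline could be applied, the sketch below filters ps gv output for processes at or above a threshold. The function name mem_hogs and its default of 40 percent are assumptions made for this example, and the field positions are taken from the header shown in Listing 2.

```shell
# Hypothetical helper (not part of AIX): reads `ps gv` output on stdin and
# prints PID, RSS (KB), %MEM, and COMMAND for each process whose %MEM is at
# or above the given threshold. Field numbers follow the Listing 2 header:
# $1=PID, $7=RSS, $12=%MEM, $13=COMMAND.
mem_hogs() {
    threshold=${1:-40}    # percent; assumed default, adjust to taste
    awk -v limit="$threshold" '
        $1 == "PID" { next }                              # skip the header row
        $12 + 0 >= limit + 0 { print $1, $7, $12, $13 }   # numeric comparison
    '
}
```

On an AIX host you would pipe the real command into it, for example: ps gv | mem_hogs 40. With the output in Listing 2, nothing would be printed, because no process there comes close to the threshold.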

While this command provides a lot of useful information, I don't usually start with this unless one of my trusted administrators has already diagnosed that there is a memory issue of some kind on the system. You should start with the old standby, vmstat. You should actually use vmstat to identify the cause of your bottleneck, even before you have determined that it might be memory related. vmstat reports back information about kernel threads, CPU activity, virtual memory, paging, blocked I/O disks, and related information (see Listing 3). For me, it's the quickest and dirtiest way of finding out what is going on.

Listing 3. Using vmstat to identify the cause of a bottleneck
# vmstat 1 4

System Configuration: lcpu=4 mem=4096MB
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re   pi  po  fr   sr    cy  in   sy  cs  us sy id wa
 1  2 136583  127    0   4   57  44   92    0 345 2223 605  30 40 29 1
 2  7 136587  118    0   2  230   0   245   0 329 3451 526  20 37 10 33
 1  6 136587  157    0   3   67   0   678   0 334 3304 536  25 32 20 23
 3  8 136587  111    0   5   61   0   693   0 329 3341 511  19 26 35 20

Let's first define what the columns mean:

Memory data:

  • avm—The amount of active virtual memory (in 4k pages) you are using, not including file pages.
  • fre—The size of your memory free list. In most cases, I don't worry when this is small, as AIX loves using every last drop of memory and does not return it as fast as you might like. This setting is determined by the minfree parameter of the vmo command. At the end of the day, the paging information is more important.
  • pi—Pages paged in from the paging space.
  • po—Pages paged out to the paging space.

CPU and I/O:

  • r—The average number of runnable kernel threads over the timing interval you have specified.
  • b—The average number of kernel threads that are in the virtual memory waiting queue over your timing interval. If r is not higher than b, that is usually a symptom of a CPU problem, which could be caused by either an I/O or memory bottleneck.
  • us—User time.
  • sy—System time.
  • id—Idle time.
  • wa—Waiting on I/O.

Let's return to the vmstat output and what is wrong with your system. First, a disclaimer: Please do not go to senior management with a detailed analysis and recommended tuning strategy based on a few seconds of vmstat output. You have to work a little harder before you can properly diagnose the ills of your system. You should use vmstat when you have a production performance issue and need to know as soon as possible what is going on in your system, so that you can either alert people to what the problem might be or take immediate action if it is possible and appropriate.

Now then, back to the output. What is going on? Several things, actually. At first glance, you might think you have a CPU bottleneck, as the system is definitely working hard and there is little idle time. As you look at things more carefully, though, you'll see that while the CPU might be breathing heavily, there are other things going on, such as paging. There is a lot of paging out going on (po), which usually occurs when you are short of real memory. In the output, even your free list has dropped dangerously low. The paging is probably happening because your free list (fre) has fallen below the minfree threshold that you set using vmo. What about the I/O? When you are seeing blocked processes or high values for waiting on I/O (wa), it usually signifies either real I/O issues, where you are waiting for file accesses, or an I/O condition associated with paging due to a lack of memory on your system. In this case, it seems to be the latter. You are having VMM issues, which seem to be causing blocked processes and the waiting on I/O condition. You might benefit from either tuning your memory parameters or possibly doing a dynamic LPAR (DLPAR) operation and adding more RAM to your LPAR.
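The reading above can be sketched as a small filter that averages the paging and I/O wait columns and counts the intervals in which blocked threads outnumber runnable ones. The function name vmstat_summary is hypothetical, and the script assumes the 17-column layout shown in Listing 3; a different vmstat layout would need different field numbers.

```shell
# Hypothetical helper: reads vmstat output on stdin and summarizes the
# columns discussed above. Field numbers follow Listing 3:
# $1=r, $2=b, $6=pi, $7=po, $17=wa.
vmstat_summary() {
    awk '
        # Only data rows have 17 fields with a numeric first field; the
        # headers and the System Configuration line are skipped.
        NF == 17 && $1 ~ /^[0-9]+$/ {
            n++; pi += $6; po += $7; wa += $17
            if ($2 > $1) blocked++        # more blocked than runnable threads
        }
        END {
            if (n == 0) { print "no data rows found"; exit 1 }
            printf "samples=%d avg_pi=%.2f avg_po=%.2f avg_wa=%.2f b_gt_r=%d\n",
                   n, pi / n, po / n, wa / n, blocked
        }
    '
}
```

For example, vmstat 1 4 | vmstat_summary. Fed the four samples from Listing 3, it reports an average po of 103.75 pages and b greater than r in all four intervals, which matches the diagnosis of a memory shortage rather than a pure CPU problem.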

Let's drill down deeper. You can use the ps command that you looked at earlier to try to identify the offending processes. What I'd like to do at this point is run a sar to see if the condition continues to show with another tool. It is a good idea to use multiple tools to further help with the diagnosis to make sure it is right.

While I don't like sar as much as other tools (you need too many flags and have to enter too many commands prior to diagnosing a problem), it allows you to collect data in real time and to view data that was previously captured (using sadc). Most of the older tools allow you to do one or the other. sar has been around for almost as long as UNIX itself, and everyone has used it at one time or another. Use of the -r flag provides some VMM information (see Listing 4).

Listing 4. Using sar with the -r flag to obtain VMM information
# sar -r 1 5
System Configuration: lcpu=4 mem=4096MB

06:18:05    slots  cycle/s  fault/s   odio/s
06:18:06  1048052     0.00   387.25     0.00
06:18:07  1048052     0.00   112.97     0.00
06:18:08  1048052     0.00    45.00    79.21
06:18:09  1048052     0.00   216.00     0.00
06:18:10  1048052     0.00     8.00     0.00

Average   1048052        0       79       16

So what does this actually mean?

  • cycle/s—Reports back the number of page replacement cycles per second.
  • fault/s—Provides the number of page faults per second.
  • Slots—Provides the number of free pages on the paging spaces.
  • odio/s—Provides the number of non paging disk I/Os per second.

You're seeing a lot of page faults per second here, but not much else. You're also seeing that there are 1048052 4k pages available on your paging space, which comes out to 4GB. Time to drill down further using more specific AIX tools.
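The 4GB figure is simply the slot count multiplied by the 4k page size. A minimal sketch of that arithmetic, using a hypothetical function name and integer shell arithmetic (so the result is truncated to whole megabytes):

```shell
# Hypothetical helper: convert a count of 4 KB pages to whole megabytes.
pages_to_mb() {
    echo $(( $1 * 4 / 1024 ))
}

pages_to_mb 1048052    # the slots value from Listing 4; prints 4093
```

The result, 4093MB, is just under 4GB, which is why the number is rounded to 4GB in the text.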

Specific AIX memory monitoring

In this section, I provide an overview of the specific AIX tools available to you—svmon, procmon, topas, and nmon. Most of these tools allow you to both quickly troubleshoot a performance problem and capture data for historical trending and analysis.

svmon is an analysis utility used specifically for the VMM. It provides a lot of information, including real, virtual, and paging space memory used. The -G flag gives you a global view of memory utilization on your host (see Listing 5).

Listing 5. Using svmon with the -G flag
# svmon -G
               size      inuse       free        pin    virtual
memory      1048576    1048416        160      79327     137750
pg space    1048576        524

               work       pers       clnt      lpage
pin           79327          0          0          0
in use       137764     910652          0          0

The size column reports the total size of RAM in 4k pages. The inuse column reports the pages in RAM used by processes, plus the persistent pages that belonged to a terminated process and are still resident in RAM. The free column reports the number of pages on the free list. The pin column reports the number of pages pinned in physical memory (RAM). Pinned pages cannot be paged out.
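Raw page counts are easier to compare as percentages. The sketch below, with a hypothetical function name, reads svmon -G output in the layout of Listing 5 and expresses the free, pinned, working (work), and persistent (pers) page counts as a share of total RAM, along with the share of paging space in use. It assumes the row labels memory, pg space, and in use appear exactly as in that listing.

```shell
# Hypothetical helper: reads `svmon -G` output on stdin (layout as in
# Listing 5) and prints selected counters as percentages.
svmon_global_pct() {
    awk '
        $1 == "memory"              { size = $2; free = $4; pin = $5 }
        $1 == "pg" && $2 == "space" { pgsize = $3; pgused = $4 }
        $1 == "in" && $2 == "use"   { work = $3; pers = $4 }
        END {
            if (size == 0 || pgsize == 0) { print "no svmon -G data found"; exit 1 }
            printf "free=%.2f%% pin=%.2f%% work=%.2f%% pers=%.2f%% pgsp_used=%.2f%%\n",
                   100 * free / size, 100 * pin / size, 100 * work / size,
                   100 * pers / size, 100 * pgused / pgsize
        }
    '
}
```

For example, svmon -G | svmon_global_pct. For the numbers in Listing 5, this shows roughly 13 percent of RAM holding working pages and about 87 percent holding persistent pages, with almost nothing left on the free list.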

The pg space row reports the actual use of paging space (in 4k pages). It's important to make the distinction between this and what is reported in vmstat. The vmstat avm column shows ALL the virtual memory that is accessed, even if it is not paged out. I also like to look at the working and persistent numbers. These values show the number of working and persistent pages in RAM. Why is this important? As you might remember from Part 1, I discussed some of the differences between working and persistent storage. Computational memory is used while your processes are working on actual computation. They use working segments, which are temporary (transitory) and only exist until the process terminates or the page is stolen. File memory uses persistent segments and has an actual permanent storage location on disk. Data files or executable programs are mapped to persistent segments rather than working segments. Given the alternative, you would much rather have file memory paged to disk than computational memory. In this situation, computational memory is unfortunately paged out more than file memory. Perhaps a little tuning of the vmo parameters might help shift the balance in your favor.

Another useful feature of svmon is that you can display memory statistics for a given process. Listing 6 provides an example.

Listing 6. Using svmon to display memory statistics for a given process
# svmon -P | grep -p 15256
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd LPage
   15256 X                12102     3221        0    12022      N     N     N

From here you can determine that this process is not using paging space. Using the ps command I discussed earlier, in conjunction with svmon, positions you to find the offending memory resource hog.
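To rank processes rather than inspect one PID at a time, the per-process summary rows can be filtered and sorted. This is only a sketch: the function name is hypothetical, and it assumes each summary row has the nine columns shown in Listing 6, with single-word command names and Y or N in the 64-bit and Mthrd columns. Other lines in the svmon -P report are ignored.

```shell
# Hypothetical helper: reads `svmon -P` output on stdin and prints the top N
# processes by Inuse as: PID, command, Inuse in KB, Pgsp in KB.
# Summary row layout (Listing 6):
#   Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd LPage
svmon_top_inuse() {
    count=${1:-5}    # number of processes to show; assumed default
    awk '
        NF == 9 && $1 ~ /^[0-9]+$/ && $7 ~ /^[YN]$/ && $8 ~ /^[YN]$/ {
            print $1, $2, $3 * 4, $5 * 4    # 4 KB pages converted to KB
        }
    ' | sort -k3,3nr | head -n "$count"
}
```

For example, svmon -P | svmon_top_inuse 10. For PID 15256 in Listing 6, the row would read 15256 X 48408 0, that is, about 47MB resident and no paging space in use.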

Let's use something a little more user friendly—topas. topas is a nice little performance monitoring tool which can be used for a number of purposes (see Figure 1).

Figure 1. The topas tool

As you can see, running topas gives you a list of your process information along with CPU, I/O, and VMM activity. From this view you can see that there is very little paging space used on the system. I like to use this command for quickly troubleshooting an issue, especially when I want a little more than vmstat on my screen. I see topas as a graphical type of vmstat. With recent improvements, it can now also capture data for historical analysis.

What about procmon? First released in AIX Version 5.3, it not only provides overall CPU performance statistics, but it also allows you to take action on the actual running processes. You might already know that you can either kill or renice a process on the fly, but I bet you didn't know that you can drill down into the memory.

Though I would say this is more of a tool people use for CPU analysis, there are also nice hooks into svmon that can help you in a pinch. This view sets options for using the svmon utility from procmon, which allows you to pull your information in a nicer format (see Figure 2).

Figure 2. View setting options for using the svmon utility from procmon

You can also export procmon data to a file, which makes it a nice little data collection tool.

My favorite of all performance tools is actually a non-supported IBM tool called nmon. Similar in some respects to topas, nmon makes the data it collects available either on your screen or through reports that you can capture for trending and analysis. What this tool provides that others simply do not is the ability to view pretty-looking charts from a Microsoft® Excel spreadsheet, which can be handed off to senior management or other technical teams for further analysis. This is done with the use of yet another unsupported tool called the nmon analyzer, which provides the hooks into nmon. Figure 3 shows an example of the kind of output that you can expect from an nmon analysis.

Figure 3. nmon analysis output

There are many different types of nmon views you can see using this tool, which provide all sorts of CPU, I/O, and memory utilization information.

Summary

In this article, you looked at the various tools that are available to capture data for memory analysis. You also spent some time troubleshooting a system that had some performance problems that you were able to pin (pardon the pun) on virtual memory. I can't reiterate enough that tuning is actually a small part of appropriate tuning methodology. Without capturing data and taking the time to properly analyze your system, you will basically be doing the same thing as a doctor throwing antibiotics at a sick patient without even examining him or her.

There are many different types of performance monitoring tools available to you. Some are tools that you can run from the command line to quickly enable you to gauge the health of your system. Some are more geared to long-term trending and analysis. Some tools even provide you with graphically formatted data that can be handed off to non-technical staff. Regardless of which tool you use, you must also spend the time to learn about what the information you are looking at really means. Don't jump to conclusions based on a small sampling of data. Also, do not rely on only one tool. To substantiate your results, you really should look at a minimum of two tools while performing your analysis. I also briefly discussed tuning methodology and the importance of establishing a baseline while the system is behaving normally. After you examine your data and tune, you must continue to capture data and analyze the results of any changes that are made. Further, you should only make one change at a time, so you can really determine the effect of each individual change.

Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.
