Easy system monitoring with SAR

SAR helps you pinpoint performance bottlenecks

Learn how to correlate user complaints with the system activity reporter (SAR) and build a performance baseline for trending purposes using SAR logs. SAR is the perfect tool for systems administrators. It captures important system performance metrics at periodic intervals.

Sean Walberg (sean@ertw.com), Senior Network Engineer, P.Eng

Sean Walberg has been working with Linux and UNIX systems since 1994 in academic, corporate, and Internet service provider environments. He has written extensively about systems administration over the past several years. You can contact him at sean@ertw.com.



28 February 2006


Users seem to remember performance problems some time after they occur. Ignoring the "If it wasn't important then, why is it important now?" question that you long to ask, the question then becomes, "What was the condition of the system at the time of the alleged problem?" By periodically taking performance snapshots and reviewing the data, you're one step closer to pinpointing the cause of the problem and creating a solution.

Collecting data

The SAR suite of utilities is bundled with your system (in fact, it is installed on most flavors of UNIX®), but probably not enabled. To enable SAR, you must run some utilities at periodic intervals through the cron facility. Use the crontab -e command while running as the root user, and then provide the configuration shown in Listing 1.

Listing 1. Run crontab for the root user to enable the SAR collection
# Collect measurements at 10-minute intervals
0,10,20,30,40,50   * * * *   /usr/lib/sa/sa1
# Create daily reports and purge old files
0                  0 * * *   /usr/lib/sa/sa2 -A

The first command, sa1, is a shell script that calls sadc to collect the performance data in a binary log file. The sa1 command also ensures that each day has its own file, which I explain in the Timing is everything section. Run this command every ten minutes, which is a good tradeoff between granularity and system impact.

The second command, sa2, is another shell script that dumps all the data from the current day's binary log file into a text file, and then purges any log files older than seven days. The -A argument specifies what is extracted from the binary file into the text file. Although you can read the text file to see the status of the system for the day, I show you how to query the binary log files to be more precise.


Extracting useful information

Data is being collected, but it must be queried to be useful. Running the sar command without options generates basic statistics about CPU usage for the current day. Listing 2 shows the output of sar without any parameters. The examples here are from Sun Solaris 10; other platforms are similar, but you might see slightly different column names, and in some UNIX flavors sadc collects more or less data based on what's available.

Listing 2. Default output of sar (showing CPU usage)
-bash-3.00$ sar

SunOS unknown 5.10 Generic_118822-23 sun4u    01/20/2006

00:00:01    %usr    %sys    %wio   %idle
00:10:00       0       0       0     100
... cut ...
09:30:00       4      47       0      49

Average        0       1       0      98

Each line in the output of sar is a single measurement, with the timestamp in the left-most column. The other columns hold the data. (These columns vary depending on the command-line arguments you use.) In Listing 2, the CPU usage is broken into four categories:

  • %usr: The percentage of time the CPU is spending on user processes, such as applications, shell scripts, or interacting with the user.
  • %sys: The percentage of time the CPU is spending executing kernel tasks. In this example, the number is high, because I was pulling data from the kernel's random number generator.
  • %wio: The percentage of time the CPU is waiting for input or output from a block device, such as a disk.
  • %idle: The percentage of time the CPU isn't doing anything useful.

The last line is an average of all the datapoints. However, because most systems experience busy periods followed by idle periods, the average doesn't tell the entire story.
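
Because the average hides the busy periods, it can help to pull out only the measurements where the CPU was actually busy. The following is a minimal sketch, assuming the Solaris column layout of Listing 2 (time, %usr, %sys, %wio, %idle); the function name busy_intervals and the threshold argument are my own illustration and are not part of the SAR suite.

```shell
# busy_intervals THRESHOLD   (hypothetical helper, not part of SAR)
# Reads default sar CPU output on stdin and prints the timestamp and %idle
# of every measurement whose %idle is below THRESHOLD.
# Assumes %idle is the last column, as in Listing 2.
busy_intervals() {
    awk -v limit="$1" '
        # Data rows start with HH:MM:SS and end in a number. This skips the
        # banner, the column header row, and the Average row.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $NF ~ /^[0-9.]+$/ {
            if ($NF + 0 < limit + 0) print $1, $NF
        }
    '
}

# Example (not run here): sar | busy_intervals 50
```

On a platform whose CPU report ends in a column other than %idle, the field reference would need to change.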

Watching disk activity

Disk activity is also monitored. High disk usage means that there will be a greater chance that an application requesting data from disk will block (pause) until the disk is ready for that process. The solution typically involves splitting file systems across disks or arrays; however, the first step is to know that you have a problem.

The sar -d command reports disk-related statistics for each measurement period. For the sake of brevity, Listing 3 shows a single measurement period and only hard disk drive activity.

Listing 3. Output of sar -d (showing disk activity)
$ sar -d

SunOS unknown 5.10 Generic_118822-23 sun4u    01/22/2006

00:00:01   device       %busy   avque   r+w/s  blks/s  avwait  avserv
... cut ...
14:00:02   dad0             31     0.6      78   16102     1.9     5.3
           dad0,c            0     0.0       0       0     0.0     0.0
           dad0,h           31     0.6      78   16102     1.9     5.3
           dad1              0     0.0       0       1     1.6     1.3
           dad1,a            0     0.0       0       1     1.6     1.3
           dad1,b            0     0.0       0       0     0.0     0.0
           dad1,c            0     0.0       0       0     0.0     0.0

As in the previous example, the time is along the left. The other columns are as follows:

  • device: This is the disk, or disk partition, being measured. In Sun Solaris, you must translate this disk into a physical disk by looking up the reported name in /etc/path_to_inst, and then cross-reference that information to the entries in /dev/dsk. In Linux®, the major and minor numbers of the disk device are used.
  • %busy: This is the percentage of time the device is being read from or written to.
  • avque: This is the average depth of the queue that is used to serialize disk activity. The higher the avque value, the more blocking is occurring.
  • r+w/s, blks/s: This is disk activity per second in terms of read or write operations and disk blocks, respectively.
  • avwait: This is the average time (in milliseconds) that a disk read or write operation waits before it is performed.
  • avserv: This is the average time (in milliseconds) that a disk read or write operation takes to execute.

Some of these numbers, such as the avwait and avserv values, correlate directly with user experience. High wait times on the disk likely point to several people contending for the disk, which should be confirmed by high avque numbers. High avserv values point to slow disks.
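
Picking the busy devices out of a full day of sar -d output by eye is tedious. The sketch below assumes the Solaris layout of Listing 3, in which only the first row of each measurement period carries a timestamp; the function name busy_disks and its threshold argument are hypothetical and only illustrate the idea.

```shell
# busy_disks MIN_BUSY   (hypothetical helper, not part of SAR)
# Reads sar -d output on stdin and prints, for every device whose %busy is
# at least MIN_BUSY, the period timestamp, device, %busy, avque, avwait,
# and avserv. Assumes the seven data columns shown in Listing 3.
busy_disks() {
    awk -v min="$1" '
        # A leading HH:MM:SS starts a new measurement period; remember it.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/  { ts = $1; off = 1 }
        # Continuation rows begin with the device name instead.
        $1 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { off = 0 }
        # Device rows have seven columns after the optional timestamp and a
        # numeric %busy; this skips the banner, header, and Average rows.
        NF == 7 + off && $(2 + off) ~ /^[0-9.]+$/ {
            if ($(2 + off) + 0 >= min + 0)
                print ts, $(1 + off), "busy=" $(2 + off), "avque=" $(3 + off), "avwait=" $(6 + off), "avserv=" $(7 + off)
        }
    '
}

# Example (not run here): sar -d | busy_disks 30
```

Filtering on %busy first and then reading avque and avwait for the remaining rows follows the order of reasoning above: find the busy device, then confirm contention.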

Other metrics

Many other items are collected, with corresponding arguments to view them:

  • The -b argument shows information on buffers and the efficiency of using a buffer versus having to go to disk.
  • The -c argument shows system calls broken down into some of the popular calls, such as fork(), exec(), read(), and write(). High process creation can lead to poor performance and is a sign that you might need to move some applications to another computer.
  • The -g, -p, and -w arguments show paging (swapping) activity. High paging is a sign of memory starvation. In particular, the -w argument shows the number of process switches: A high number can mean too many things are running on the computer, which is spending more time switching than working.
  • The -q argument shows the size of the run queue, which is the same as the load average for the time.
  • The -r argument shows free memory and swap space over time.

Each UNIX flavor implements its own set of measurements and command-line arguments for sar. Those I've shown are common and represent the elements that I find most useful.


Timing is everything

The examples thus far have shown the current day's data, which has its uses, but it also has two problems:

  • You're interested in an hour of data, but you get the whole day.
  • You need to go back to a different day.

As you saw earlier, sa1 saves the data in a different file for each day. Looking at the sa1 script itself tells you which directory is used; in the case of Sun Solaris 10, it is /var/adm/sa. Several files reside in this directory, starting with either "sa" or "sar" followed by a number. The number represents the day of the month, with the files beginning with "sar" being text dumps of the data for that day (created by the nightly run of sa2) and the files beginning with "sa" holding the binary version. Indeed, the file for the current day of the month is the one that is read when you launch sar.

Specifying -f to the sar command selects the file to read from. If today were the 23rd day of the month, I could look at yesterday's data by reading from sa22 with the command sar -f /var/adm/sa/sa22. You can also pass the other arguments I showed you to access different types of data.
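
If you query older days often, a small helper saves typing the path. This is only a sketch: the function name sa_file and the SA_DIR variable are my own, the default directory is the Solaris 10 location mentioned above, and the two-digit day number reflects how sa1 names its files with the day of the month.

```shell
# SA_DIR defaults to the Solaris 10 location; other platforms differ.
SA_DIR=${SA_DIR:-/var/adm/sa}

# sa_file DAY   (hypothetical helper, not part of SAR)
# Prints the path of the binary log for the given day of the month.
sa_file() {
    # Strip a leading zero so that 08 and 09 are not read as octal by printf.
    case $1 in
        0?) day=${1#0} ;;
        *)  day=$1 ;;
    esac
    printf '%s/sa%02d\n' "$SA_DIR" "$day"
}

# Example (not run here): sar -f "$(sa_file 22)"
```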

The second thing you can do to narrow the scope of the query is to specify the time by using the -s and -e arguments (think start and end). Note that -s is not inclusive, so you must subtract an extra ten minutes from the chosen start time. Continuing with the previous example, Listing 4 shows swap file usage and the run queue for the 22nd from 2:30 p.m. to 3:00 p.m.

Listing 4. A complex sar query specifying date, time, and multiple data sets
# sar -f /var/adm/sa/sa22 -s 14:20 -e 15:00 -w -q -i 4

SunOS unknown 5.10 Generic_118822-23 sun4u    01/22/2006

14:20:00 swpin/s bswin/s swpot/s bswot/s pswch/s
14:30:00    0.00     0.0    0.00     0.0     140
14:40:01    0.00     0.0    0.00     0.0     144
14:50:01    0.00     0.0    0.00     0.0     140
15:00:00    0.00     0.0    0.00     0.0     139

Average     0.00     0.0    0.00     0.0     140

14:20:00 runq-sz %runocc swpq-sz %swpocc
14:30:00    10.5     100     0.0       0
14:40:01    10.5     100     0.0       0
14:50:01    10.4     100     0.0       0
15:00:00    10.5     100     0.0       0

Average     10.5     100     0.0       0
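
The start-time adjustment is easy to get wrong by hand. The following sketch computes the value to pass to -s from the first measurement you want to see; the function name sar_start is hypothetical, and the default of ten minutes assumes the collection interval from Listing 1.

```shell
# sar_start HH:MM [INTERVAL_MINUTES]   (hypothetical helper, not part of SAR)
# Prints the time to pass to sar -s so that the measurement taken at HH:MM
# is included. The interval defaults to the ten minutes used in Listing 1.
sar_start() {
    interval=${2:-10}
    hh=${1%%:*}
    mm=${1##*:}
    # Strip a leading zero so that 08 and 09 are not treated as octal.
    case $hh in 0?) hh=${hh#0} ;; esac
    case $mm in 0?) mm=${mm#0} ;; esac
    total=$(( hh * 60 + mm - interval ))
    # Each file holds one day, so do not go back past midnight.
    if [ "$total" -lt 0 ]; then
        total=0
    fi
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

# Example (not run here):
#   sar -f /var/adm/sa/sa22 -s "$(sar_start 14:30)" -e 15:00 -w -q
```

For the query in Listing 4, sar_start 14:30 yields 14:20, which matches the -s value used there.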

Making sense of it all

A brief look at Listing 4 shows that swap activity was nil, approximately 140 process switches per second occurred, and the load average was slightly more than ten. Assuming that you were investigating a claim of poor performance at the time, what does this tell you?

  • Whatever process is running isn't memory intensive, because you don't see swapping.
  • Chances are that this problem is caused by a long-running set of processes, because the run queue and process switches are relatively consistent. Had they not been, you could suspect application-level problems, such as a busy Web server.
  • Knowing that the output of Listing 3 shows part of the same time period, you can see that one of the disks was being used heavily (31 percent busy according to sar -d, and about 16,000 blocks per second). This disk holds the home directory partition; depending on what the user was trying to do, he or she might have experienced slow responses.

A quick look at the CPU usage for the time period shows that system (kernel) time accounted for approximately 80 percent of the CPU; the rest was consumed by user tasks. As the systems administrator, you can use this information in three ways:

  • Go back over previous days' logs. In this case, I found that the problem started at 1:00 p.m. and ended the next morning.
  • Try to correlate the activity to any cron jobs that might have been started that day.
  • Try to find a trend. Looking at data from a couple of other days, I saw that the performance was normal, which isn't indicative of a system that has reached its limits.

In this case, the problem seemed to be isolated, and for good reason: I was intentionally exercising the disks with shell scripts to create some interesting sar reports! However, had a trend appeared, such as busy home drives during working hours, it would have been a call to do something about the problem. Possible solutions include splitting home directories off to other disks, installing faster disks, or moving to something like Network Attached Storage (NAS).
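
Looking for a trend across days can be sketched by collecting the Average row from each day's CPU report. The function name cpu_average is hypothetical, the row layout is assumed to match Listing 2, and the loop in the comment assumes the Solaris 10 directory; remember that the average alone can hide short busy periods.

```shell
# cpu_average LABEL   (hypothetical helper, not part of SAR)
# Reads default sar CPU output on stdin and prints LABEL followed by the
# values from the Average row (%usr %sys %wio %idle in Listing 2).
cpu_average() {
    awk -v label="$1" '$1 == "Average" { $1 = label; print }'
}

# Example (not run here), one line per retained day:
#   for f in /var/adm/sa/sa[0-9][0-9]; do
#       sar -f "$f" | cpu_average "${f##*/}"
#   done
```

Days whose figures stand well apart from the others are the ones worth querying in detail with -s and -e.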


Conclusion

Obtaining quantitative data about your system at periodic intervals is an effective way of finding performance bottlenecks and determining whether further action is needed. SAR and related utilities do just this: snapshots are taken every ten minutes, and a front end allows you to access the data. Though tactical in nature, SAR provides a wealth of information that enables systems administrators to discover just what aspect of the system is suffering and whether it requires further investigation.

Resources

Learn

  • SAR runs on most flavors of UNIX, including AIX®, HP-UX, and Linux.
  • The UNIX Insider Performance Q&A column has some valuable advice on performance-tuning Solaris, including more interpretation of sar results.
  • If you liked sar, you might also like iostat and vmstat, which let you dig into current system activity in more depth. The Solaris System Administration Guide outlines these tools' use along with more information on sar. As with sar, most of this information applies to other flavors of UNIX.
  • I've written about using vmstat to watch current activity for Linux, which also applies to systems such as AIX, Solaris, and HP-UX.

Get products and technologies

  • SarCheck® has a commercial offering built around SAR that provides a graphical view of the data. A free evaluation is available.
