Easy system monitoring with SAR
SAR helps you pinpoint performance bottlenecks
Users seem to remember performance problems some time after they occur. Ignoring the "If it wasn't important then, why is it important now?" question that you long to ask, the question then becomes, "What was the condition of the system at the time of the alleged problem?" By periodically taking performance snapshots and reviewing the data, you're one step closer to pinpointing the cause of the problem and creating a solution.
The SAR suite of utilities is bundled with your system (in fact, it is installed on most flavors of UNIX®), but it is probably not enabled. To enable SAR, you must run some utilities at periodic intervals through the cron facility. Use the crontab -e command while running as the root user, and then provide the configuration shown in Listing 1.
Listing 1. Run crontab for the root user to enable the SAR collection
# Collect measurements at 10-minute intervals
0,10,20,30,40,50 * * * * /usr/lib/sa/sa1
# Create daily reports and purge old files
0 0 * * * /usr/lib/sa/sa2 -A
The first command, sa1, is a shell script that calls sadc to collect the performance data in a binary log file. The sa1 command also ensures that each day has its own file, which I explain in the Timing is everything section. Run this command every ten minutes, which is a good tradeoff between granularity and system impact.
The second command, sa2, is another shell script that dumps all the data from the current day's binary log file into a text file, and then purges any log files older than seven days. The -A argument specifies what is extracted from the binary file into the text file. Although you can read the text file to see the status of the system for the day, I show you how to query the binary log files to be more precise.
Extracting useful information
Data is being collected, but it must be queried to be useful. Running the sar command without options generates basic statistics about CPU usage for the current day. Listing 2 shows the output of sar without any parameters. The examples here are from Sun Solaris 10; other platforms are similar, but the column names might differ slightly, because sadc collects more or less data depending on what's available in each UNIX flavor.
Listing 2. Default output of sar (showing CPU usage)
-bash-3.00$ sar

SunOS unknown 5.10 Generic_118822-23 sun4u    01/20/2006

00:00:01    %usr    %sys    %wio   %idle
00:10:00       0       0       0     100
. cut ...
09:30:00       4      47       0      49

Average        0       1       0      98
Each line in the output of sar is a single measurement, with the timestamp in the left-most column. The other columns hold the data. (These columns vary depending on the command-line arguments you use.) In Listing 2, the CPU usage is broken into four categories:
- %usr: The percentage of time the CPU is spending on user processes, such as applications, shell scripts, or interacting with the user.
- %sys: The percentage of time the CPU is spending executing kernel tasks. In this example, the number is high, because I was pulling data from the kernel's random number generator.
- %wio: The percentage of time the CPU is waiting for input or output from a block device, such as a disk.
- %idle: The percentage of time the CPU isn't doing anything useful.
The last line is an average of all the datapoints. However, because most systems experience busy periods followed by idle periods, the average doesn't tell the entire story.
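Because the average hides the busy periods, it can help to filter the output for the samples that matter. The following is a minimal sketch, not part of the SAR suite: busy_intervals is a hypothetical helper name, and it assumes the Solaris column layout from Listing 2, where %idle is the fifth field and values are whole numbers. Other platforms might need a different field number or pattern.

```shell
# busy_intervals: read `sar` CPU output on stdin and print the samples
# whose %idle is below a threshold (default 50).
# Assumption: Listing 2 layout (time, %usr, %sys, %wio, %idle).
busy_intervals() {
    threshold=${1:-50}
    awk -v limit="$threshold" '
        # Data rows start with HH:MM:SS and have a numeric fifth field.
        # This skips the banner, the column header, and the Average line.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $5 ~ /^[0-9]+$/ && $5 + 0 < limit {
            print $1, "idle=" $5 "%"
        }
    '
}

# Usage (on a system with SAR enabled):
#   sar | busy_intervals 50
```

Run against the data in Listing 2 with a threshold of 50, this would report only the 09:30:00 sample.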
Watching disk activity
Disk activity is also monitored. High disk usage means that there will be a greater chance that an application requesting data from disk will block (pause) until the disk is ready for that process. The solution typically involves splitting file systems across disks or arrays; however, the first step is to know that you have a problem.
The output of sar -d shows various disk-related statistics for one measurement period. For the sake of brevity, Listing 3 shows only hard disk drive activity.
Listing 3. Output of sar -d (showing disk activity)
$ sar -d

SunOS unknown 5.10 Generic_118822-23 sun4u    01/22/2006

00:00:01   device        %busy   avque   r+w/s  blks/s  avwait  avserv
. cut ...
14:00:02   dad0             31     0.6      78   16102     1.9     5.3
           dad0,c            0     0.0       0       0     0.0     0.0
           dad0,h           31     0.6      78   16102     1.9     5.3
           dad1              0     0.0       0       1     1.6     1.3
           dad1,a            0     0.0       0       1     1.6     1.3
           dad1,b            0     0.0       0       0     0.0     0.0
           dad1,c            0     0.0       0       0     0.0     0.0
As in the previous example, the time is along the left. The other columns are as follows:
- device: This is the disk, or disk partition, being measured. In Sun Solaris, you must translate this disk into a physical disk by looking up the reported name in /etc/path_to_inst, and then cross-reference that information to the entries in /dev/dsk. In Linux®, the major and minor numbers of the disk device are used.
- %busy: This is the percentage of time the device is being read from or written to.
- avque: This is the average depth of the queue that is used to serialize disk activity. The higher the avque value, the more blocking is occurring.
- r+w/s, blks/s: This is disk activity per second in terms of read or write operations and disk blocks, respectively.
- avwait: This is the average time (in milliseconds) that a disk read or write operation waits before it is performed.
- avserv: This is the average time (in milliseconds) that a disk read or write operation takes to execute.
Some of these numbers, such as the avwait and avserv values, translate directly into user experience. High wait times on the disk likely point to several people contending for the disk, which should be confirmed by high avque numbers. High avserv values point to slow disks.
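To pick out the devices with high wait times from a long report, you can filter on the avwait column. This is a rough sketch under stated assumptions: slow_disks is a hypothetical helper, and it expects the Solaris layout from Listing 3, in which only the first device row of each sample carries the timestamp. The default threshold of 1.5 milliseconds is arbitrary and chosen only for illustration.

```shell
# slow_disks: read `sar -d` output on stdin and print every device row
# whose avwait (in milliseconds) exceeds a threshold (default 1.5).
# Assumption: Listing 3 layout with seven data columns per device.
slow_disks() {
    max_wait=${1:-1.5}
    awk -v limit="$max_wait" '
        {
            off = 0
            # Only the first row of a sample starts with HH:MM:SS;
            # remember that timestamp for the rows that follow it.
            if ($1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) {
                stamp = $1
                off = 1
            }
            if (NF - off != 7) next          # banner, blank, or Average lines
            dev = $(1 + off)
            avwait = $(6 + off)
            if (avwait !~ /^[0-9.]+$/) next  # column header row
            if (avwait + 0 > limit) print stamp, dev, "avwait=" avwait
        }
    '
}

# Usage (on a system with SAR enabled):
#   sar -d | slow_disks 1.8
```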
Many other items are collected, with corresponding arguments to view them:
- The -b argument shows information on buffers and the efficiency of using a buffer versus having to go to disk.
- The -c argument shows system calls broken down into some of the popular calls, such as fork(), exec(), read(), and write(). High process creation can lead to poor performance and is a sign that you might need to move some applications to another computer.
- The -g, -p, and -w arguments show paging (swapping) activity. High paging is a sign of memory starvation. In particular, the -w argument shows the number of process switches: A high number can mean too many things are running on the computer, which is spending more time switching than working.
- The -q argument shows the size of the run queue, which is the same as the load average for the time.
- The -r argument shows free memory and swap space over time.
Each UNIX flavor implements its own set of measurements and command-line arguments for sar. Those I've shown are common and represent the elements that I find most useful.
Timing is everything
The examples thus far have shown the current day's data, which has its uses, but it also has two problems:
- You're interested in an hour of data, but you get the whole day.
- You need to go back to a different day.
As you saw earlier, sa1 saves the data in a different file for each day. Looking at the sa1 script itself tells you which directory is used; in the case of Sun Solaris 10, it is /var/adm/sa. Several files reside in this directory, starting with either "sa" or "sar" followed by a number. The number represents the day of the month. Files beginning with "sar" are text dumps of the data for that day (created by the nightly run of sa2), and files beginning with "sa" hold the binary version. Indeed, the file for the current date is the one that is read when you launch sar without the -f argument. Passing -f to the sar command selects the file to read from. If today were the 23rd day of the month, I could look at yesterday's data by reading from sa22 with the command sar -f /var/adm/sa/sa22. You can also pass the other arguments I showed you to access different types of data.
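If you query past days often, building the file name by hand gets tedious. Here is a small sketch of a helper that turns a day of the month into the binary log path. The name sa_file and the SA_DIR variable are my own inventions for illustration; the default directory is the Solaris 10 location mentioned above, so check your own sa1 script before relying on it.

```shell
# Directory holding the binary logs. Assumption: the Solaris 10 default;
# override SA_DIR if your sa1 script writes somewhere else.
SA_DIR=${SA_DIR:-/var/adm/sa}

# sa_file: print the binary log path for a given day of the month.
sa_file() {
    # Strip one leading zero so that "08" is not rejected as an
    # invalid octal number by printf.
    day=${1#0}
    printf '%s/sa%02d\n' "$SA_DIR" "$day"
}

# Usage (on a system with SAR enabled):
#   sar -f "$(sa_file 22)" -d
```

The two-digit padding matches the day-of-month naming that the collection scripts use, so day 8 maps to sa08.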
The second thing you can do to narrow the scope of the query is to specify the time by using the -s and -e arguments (think start and end). Note that -s is not inclusive, so you must subtract an extra ten minutes from the chosen start time. Continuing with the previous example, Listing 4 shows swap file usage and the run queue for the 22nd from 2:30 p.m. to 3:00 p.m.
Listing 4. A complex sar query specifying date, time, and multiple data sets
# sar -f /var/adm/sa/sa22 -s 14:20 -e 15:00 -w -q -i 4

SunOS unknown 5.10 Generic_118822-23 sun4u    01/22/2006

14:20:00 swpin/s bswin/s swpot/s bswot/s pswch/s
14:30:00    0.00     0.0    0.00     0.0     140
14:40:01    0.00     0.0    0.00     0.0     144
14:50:01    0.00     0.0    0.00     0.0     140
15:00:00    0.00     0.0    0.00     0.0     139

Average     0.00     0.0    0.00     0.0     140

14:20:00 runq-sz %runocc swpq-sz %swpocc
14:30:00    10.5     100     0.0       0
14:40:01    10.5     100     0.0       0
14:50:01    10.4     100     0.0       0
15:00:00    10.5     100     0.0       0

Average     10.5     100     0.0       0
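The ten-minute adjustment to the -s value is easy to get wrong by hand, so it can be scripted. The sketch below uses a hypothetical helper, sar_start, that takes the first sample you want to see and the collection interval (ten minutes by default, matching the crontab in Listing 1) and prints the value to pass to -s. It clamps at midnight rather than reaching into the previous day's file.

```shell
# sar_start: given the first sample time you want (HH:MM) and the
# collection interval in minutes, print the time to pass to `sar -s`.
sar_start() {
    printf '%s\n' "$1" | awk -F: -v step="${2:-10}" '{
        # Convert to minutes since midnight and step back one interval.
        total = $1 * 60 + $2 - step
        if (total < 0) total = 0   # do not cross into the previous day
        printf "%02d:%02d\n", int(total / 60), total % 60
    }'
}

# Usage (on a system with SAR enabled):
#   sar -f /var/adm/sa/sa22 -s "$(sar_start 14:30)" -e 15:00 -w -q
```

For the query in Listing 4, sar_start 14:30 yields 14:20, which is the value used there.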
Making sense of it all
A brief look at Listing 4 shows that swap activity was nil, approximately 140 process switches per second occurred, and the load average was slightly more than ten. Assuming that you were investigating a claim of poor performance at the time, what does this tell you?
- Whatever process is running isn't memory intensive, because you don't see swapping.
- Chances are that this problem is caused by a long-running set of processes, because the run queue and process switches are relatively consistent. Had they not been, you could suspect application-level problems, such as a busy Web server.
- Knowing that the output of Listing 3 shows part of the same time period, you can see that one of the disks was being used heavily (31 percent busy according to sar -d, but also 16,000 blocks per second). This disk is the home directory partition; depending on what the user was trying to do, he or she might have experienced slow responses.
A quick look at the CPU usage for the time period shows that the system took up approximately 80 percent of the CPU; the rest was consumed by user tasks. As the systems administrator, you can use this information in three ways:
- Go back over previous days' logs. In this case, I found that the problem started at 1:00 p.m. and ended the next morning.
- Try to correlate the activity to any cron jobs that might have been started that day.
- Try to find a trend. Looking at data from a couple of other days, I saw that the performance was normal, which isn't indicative of a system that has reached its limits.
In this case, the problem seemed to be isolated, and for good reason -- I was intentionally loading the disks with shell scripts to create some interesting data. However, had a trend appeared, such as busy home drives during working hours, it would have been a call to do something about the problem. Possible solutions range from splitting home directories off to other disks, to installing faster disks, to moving to something like Network Attached Storage (NAS).
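Looking for a trend across several days, as described above, usually means comparing one number per day. A minimal sketch, assuming the CPU layout from Listing 2 where %idle is the last column: avg_idle is a hypothetical helper that pulls that figure from the Average line, and the commented loop shows how it might be applied to each binary log.

```shell
# avg_idle: read `sar` CPU output on stdin and print the %idle value
# from the Average line.
# Assumption: %idle is the last column, as in Listing 2.
avg_idle() {
    awk '$1 == "Average" { print $NF }'
}

# Usage (on a system with SAR enabled; the path is the Solaris 10 default):
#   for f in /var/adm/sa/sa[0-9]*; do
#       printf '%s %s\n' "$f" "$(sar -f "$f" | avg_idle)"
#   done
```

A daily average is a coarse measure, so treat a change in it as a prompt to look at the individual samples for that day.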
Obtaining quantitative data about your system at periodic intervals is an effective way of finding performance bottlenecks and determining whether further action is needed. SAR and related utilities do just this -- snapshots are taken every ten minutes, and a front end allows you to access the data. Though tactical in nature, SAR provides a wealth of information that enables systems administrators to discover just what aspect of the system is suffering and whether it requires further investigation.
- SAR runs on most flavors of UNIX, including AIX®, HP-UX, and Linux.
- SarCheck® has a commercial offering built around SAR that provides a graphical view of the data. A free evaluation is available.
- I've written about using vmstat to watch current activity for Linux, which also applies to systems such as AIX, Solaris, and HP-UX.