SystemTap is a dynamic method of monitoring and tracing the operation of a running Linux kernel. The key word there is dynamic, because instead of building a special kernel with instrumentation, SystemTap allows you to install that instrumentation dynamically at run time. It does this with an application programming interface (API) called Kprobes, which this article explores. Let's begin with an exploration of some of the earlier kernel tracing approaches, then dig into the SystemTap architecture and its use.
SystemTap is similar to an older technology called DTrace, which
originated in the Sun Solaris operating system. Within DTrace, developers
can write scripts in the D programming language (a subset of the
C language but modified to support
trace-specific behaviors). A DTrace script contains a number of probes and
associated actions that occur when the probe "fires." For example, a probe
can represent something as simple as invoking a system call or more
complicated interactions such as a particular line of code being executed.
Listing 1 shows a simple example of a DTrace script that counts the number
of system calls made by each process. (Note the use of the aggregation
@num, which works like a dictionary to associate counts with processes.)
The format of the script includes the probe (which fires when a system
call is made) and an action (the corresponding action script).
Listing 1. A simple DTrace script to count system calls per process
syscall:::entry
{
@num[pid,execname] = count();
}
DTrace has been an enviable part of Solaris, so it's not surprising to find it developed for other operating systems, as well. DTrace was released under the Common Development and Distribution License (CDDL) and has been ported into the FreeBSD operating system.
Another useful kernel-tracing facility called ProbeVue was developed
by IBM for the IBM® AIX® operating system, version 6.1. You
can use ProbeVue to explore the behavior and performance of a system as
well as provide detailed information about a particular process. The tool
does this in a dynamic way, using a standard kernel. Listing 2 shows a
sample script for ProbeVue that indicates the particular process that's
calling the sync system call.
Listing 2. A simple ProbeVue script to indicate which of your processes invoked sync
@@syscall:*:sync:entry
{
printf( "sync() syscall invoked by process ID %d\n", __pid );
exit();
}
Given the usefulness of DTrace and ProbeVue on their respective operating systems, an open source project for Linux was inevitable. SystemTap began in 2005 and provides similar functionality to both DTrace and ProbeVue. It was developed by a community that includes Red Hat, Intel, Hitachi, and IBM, among others.
Each of these solutions provides similar functionality, using probes and associated action scripts when the probes fire. Now, let's look at the installation of SystemTap, then explore its architecture and use.
Depending on your distribution and kernel, you may be able to support SystemTap with nothing more than a SystemTap installation. In other cases, a debug kernel image is required. This section walks through installation of SystemTap on Ubuntu version 8.10 (Intrepid Ibex), which needs a few extra steps and so is not representative of a typical SystemTap installation. In the Resources section, you'll find more information on installations for other distributions and versions.
For most users, a simple installation of SystemTap is all that's required.
For Ubuntu, you use apt-get:
$ sudo apt-get install systemtap
With the installation complete, you can test your kernel to see whether it supports SystemTap. The following simple command-line script meets that goal:
$ sudo stap -ve 'probe begin { log("hello world") exit() }'
If this script works, you'll see "hello world" on standard output
(stdout). If not, you have some additional work to do. For
Ubuntu 8.10, a debug kernel image was required. Ideally, apt-get
would retrieve the package linux-image-debug-generic directly. Because
that wasn't possible, you can download the generic debug image and
install it through dpkg, as shown below:
$ wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-debug-2.6.27-14-generic_2.6.27-14.39_i386.ddeb
$ sudo dpkg -i linux-image-debug-2.6.27-14-generic_2.6.27-14.39_i386.ddeb
You now have a generic debug image installed. There is just one more step for Ubuntu 8.10: The SystemTap distribution had a problem that you can easily solve by modifying the SystemTap source. Check out Resources for information on how to update the run time time.c file.
If you have a custom kernel, you'll need to ensure that the kernel
options CONFIG_RELAY,
CONFIG_DEBUG_FS,
CONFIG_DEBUG_INFO, and
CONFIG_KPROBES are enabled.
Let's dig into some of the details of SystemTap to understand how it provides dynamic probes within a running kernel. You'll also see how SystemTap works, from the scripting process to getting these scripts active within a running kernel.
Dynamically instrumenting a kernel
Two of the methods used in SystemTap to instrument a running kernel are Kprobes and return probes. But an important element of understanding any kernel is the map of the kernel, which provides symbol information (such as functions and variables as well as their addresses). Having the map, you can resolve the address of any symbol and make changes to support probing behavior.
Kprobes has been in the mainline Linux kernel since version 2.6.9 and provides a general service for probing a kernel. It provides a few different services, but two of the most important are Kprobe and Kretprobe. The Kprobe is architecture specific and inserts a breakpoint instruction at the first byte of the instruction to be probed. When the instruction is hit, the particular handler for the probe is executed. When complete, the original instruction is executed (from the breakpoint), and execution continues at that point.
Kretprobes are a bit different, operating on the return of the called function. Because a function may have many return points, this sounds a bit complicated. However, it actually uses a simple technique called a trampoline. Rather than instrument every return point in a function (which might not catch all cases), you add a small amount of code to the function entry. This code replaces the return address on the stack with the trampoline address (the Kretprobe address). When the function exits, instead of returning to the caller, it jumps to the Kretprobe, which executes its functionality, then returns to the actual caller.
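In script terms, these two mechanisms surface as entry and return probes on the same function. The following is a minimal sketch, assuming a kernel that still has a function named sys_sync (the name used later in this article; newer kernels may name it differently). The entry_time array and the message text are illustrative choices, not part of SystemTap.

```systemtap
# Sketch: pair an entry probe with a return probe to time one kernel function.
# entry_time is a hypothetical name; it maps a thread ID to its entry timestamp.
global entry_time

# Entry probe: fires when sys_sync is called
probe kernel.function("sys_sync").call {
  entry_time[tid()] = gettimeofday_us()
}

# Return probe: fires when sys_sync returns to its caller
probe kernel.function("sys_sync").return {
  if (tid() in entry_time) {
    elapsed = gettimeofday_us() - entry_time[tid()]
    delete entry_time[tid()]
    printf("sys_sync returned to %s[%d] after %d us\n",
           execname(), pid(), elapsed)
  }
}
```

Keying the array by thread ID keeps concurrent callers from overwriting each other's timestamps. Running sync in another shell should trigger both probes.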
Figure 1 presents the basic flow of the SystemTap process, involving three
interacting utilities over five phases. The process begins with the
SystemTap script. You use the stap utility to
convert the stap script into the kernel module that provides the probe
behaviors. The stap process begins with a translation of the script into a
parse tree (pass 1). The symbols are then resolved using symbol
information about the currently running kernel in the elaboration step
(pass 2). The translation process then converts the parse tree into
C source (pass 3) and uses the resolved
information as well as what are called tapset scripts (libraries of
useful functionality defined by SystemTap). The final step of stap is the
construction of the kernel module (pass 4), which uses the local kernel
module build process.
Figure 1. The SystemTap process
With the availability of the kernel module, stap
hands control over to two other SystemTap utilities:
staprun and stapio.
These two utilities work in concert to manage the installation of the
module into the kernel and route its output to stdout (pass 5). If you
press Ctrl-C in the shell or the script exits, the cleanup process is
performed, which unloads the module and causes all associated utilities to
exit.
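To watch the passes individually, a small script and a few invocations are enough. This is a sketch only: the file name hello.stp and the module name hello are assumptions for illustration, and the commands appear as comments so that the block remains a valid script.

```systemtap
# hello.stp: a minimal script for observing the five passes.
#
# Possible invocations (file and module names are illustrative):
#   sudo stap -v hello.stp               run all five passes, reporting each one
#   sudo stap -v -p4 -m hello hello.stp  stop after pass 4, leaving hello.ko behind
#   sudo staprun hello.ko                load and run the prebuilt module (pass 5)
probe begin {
  printf("module loaded\n")
  exit()
}
```

With -v, stap prints one summary line per pass, which makes it easy to see where the time goes and whether a cached module was reused.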
An interesting feature of SystemTap is the ability to cache script translations. If a script is to be installed and hasn't changed, you can use the existing module instead of going through the process of rebuilding it. Figure 2 shows the user-space and kernel-space elements along with the stap-based translation process.
Figure 2. The SystemTap process from the kernel/user-space perspective
Scripting in SystemTap is quite simple but also flexible, with many options to get out of it what you need. The Resources section provides links to manuals that detail the language and possibilities, but this section explores a few examples to give you a taste of SystemTap scripts.
SystemTap scripts are made up of probes and associated blocks of code to be executed when the probe fires. Probes have a number of defined patterns, such as those shown in Table 1. This table enumerates several probe types, including calling a kernel function and returning from a kernel function.
Table 1. Example probe patterns
| Probe type | Description |
|---|---|
| begin | Fires when the script begins |
| end | Fires when the script ends |
| kernel.function("sys_sync") | Fires when sys_sync is called |
| kernel.function("sys_sync").call | Same as above |
| kernel.function("sys_sync").return | Fires when sys_sync returns |
| syscall.* | Fires when any system call is made |
| kernel.function("*@kernel/fork.c:934") | Fires when line 934 of fork.c is hit |
| module("ext3").function("ext3_file_write") | Fires when the ext3 write function is called |
| timer.jiffies(1000) | Fires every 1000 kernel jiffies |
| timer.ms(200).randomize(50) | Fires every 200ms, with a linearly distributed random additive (-50 to +50) |
Let's look at a simple example to understand how to construct a probe and
associate code with that probe. Listing 3 shows a sample probe that fires
when the kernel system call sys_sync is invoked. When this probe fires, you
want to count the number of invocations and emit this count along with the
calling process ID (PID). First, declare a global variable that any probe
can use (the global namespace is common to all probes) and initialize it
to zero. Next, define your probe, which is an entry probe into the kernel
function sys_sync. The script associated with the probe increments the
count variable, then emits a message that reports the number of times the
call has been made and the PID for the current invocation. Note that this
example looks very much like C (except for the probe definition syntax),
which is convenient if you have a background in C.
Listing 3. A simple probe and script
global count=0
probe kernel.function("sys_sync") {
count++
printf( "sys_sync called %d times, currently by pid %d\n", count, pid() );
}
You can also declare functions that probes can call, which is perfect for common functions for which you'd like to serve multiple probes. The tool even supports recursion to a given depth.
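As an illustration of a function shared by several probes, consider the sketch below. The helper name record and the array name hits are hypothetical, and the probed function sys_sync is assumed to exist in the running kernel.

```systemtap
# Sketch: one helper function serving two probes.
global hits

# Hypothetical helper: tally an event label per process name
function record(label) {
  hits[label, execname()]++
}

probe kernel.function("sys_sync").call   { record("entry") }
probe kernel.function("sys_sync").return { record("return") }

# Emit the tallies when the script ends (for example, on Ctrl-C)
probe end {
  foreach ([label, procname] in hits) {
    printf("%-6s %-16s %d\n", label, procname, hits[label, procname])
  }
}
```

No types are declared for label; SystemTap infers that it is a string from the call sites.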
SystemTap permits definition of variables of a number of types, but the type is inferred from context, so no type declarations are needed. In SystemTap, you'll find integers (64-bit signed quantities), strings, and literals (strings or integers). You can also use associative arrays and statistics (which we'll explore later).
SystemTap provides the operators that you'd expect from
C and follows the same rules. You'll find
arithmetic operators, binary operators, assignment operators, and pointer
dereferencing. You'll also find some conveniences beyond
C, which include string concatenation,
associative array elements, and aggregation operators.
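A short sketch can make these operators concrete. The variable names who and mask are arbitrary, and the values are chosen only to exercise the syntax.

```systemtap
# Sketch: C-like operators plus SystemTap's string concatenation.
probe begin {
  # "." joins strings; sprintf() turns the integer PID into a string
  who = execname() . "[" . sprintf("%d", pid()) . "]"

  # Shift and bitwise OR behave as they do in C
  mask = (1 << 4) | 0x3

  printf("%s mask=0x%x\n", who, mask)
  exit()
}
```

Because the probe is a begin probe, the process it reports is one of SystemTap's own helper processes rather than an arbitrary application.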
Within a probe, SystemTap provides a comfortable set of statements
reminiscent of C. Note that although the
language allows you to develop complex scripts, only 1000 statements can
be executed per probe (though this number is configurable).
Table 2 provides a short list of the language
statements just to provide an overview. Note here that many elements
appear exactly as they do in C, though there
are some additions specific to SystemTap.
Table 2. Language elements of SystemTap
| Statement | Description |
|---|---|
| if (exp) {} else {} | Standard if-then-else statement |
| for (exp1 ; exp2 ; exp3 ) {} | A for loop |
| while (exp) {} | Standard while loop |
| do {} while (exp) | A do-while loop |
| break | Exit iteration |
| continue | Continue iteration |
| next | Return from the probe |
| return | Return an expression from a function |
| foreach (VAR in ARRAY) {} | Iterate an array, assigning the current key to VAR |
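Several of these statements can be combined in a short script. The sketch below is illustrative only: the array name calls and the five-second interval are arbitrary, and the sorting and limit suffixes on foreach are extras not listed in the table above.

```systemtap
# Sketch: if, next, and foreach working together.
global calls

probe syscall.* {
  # Skip SystemTap's own I/O helper, then return from the probe early
  if (execname() == "stapio") {
    next
  }
  calls[name]++
}

probe timer.s(5) {
  # "-" after the array sorts by descending value; "limit 5" stops after five keys
  foreach (n in calls- limit 5) {
    printf("%-20s %d\n", n, calls[n])
  }
  exit()
}
```

The result is a rough "top five system calls" report for the sampling window.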
This article explores the facilities of statistics and aggregation in the
sample scripts, as these do not have counterparts in the
C language.
Finally, SystemTap provides a number of internal functions that offer
additional information about the current context. For example, you can use
caller() to identify the calling function,
cpu() to identify the current processor number,
and pid() to return the PID. SystemTap provides
a number of other functions, as well, providing access to the call stack
and current registers.
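The context functions are easiest to appreciate in a probe that prints them. This is a minimal sketch that assumes the kernel function sys_sync exists and that kernel debug information is available for the backtrace.

```systemtap
# Sketch: report where, and on whose behalf, a probe fired.
probe kernel.function("sys_sync") {
  # pp() describes the probe point; cpu() is the current processor number
  printf("%s hit by %s[%d] on cpu %d\n", pp(), execname(), pid(), cpu())

  # Print the kernel call stack at the probe point
  print_backtrace()

  exit()
}
```

The script exits after the first hit, so run sync in another shell to trigger it.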
With a quick introduction to SystemTap under your belt, let's explore a few simple examples to see how SystemTap really works. This article also demonstrates some of the interesting aspects of the scripting language, such as aggregations.
The previous section explored a simple script to monitor the
sync system call. Now, let's look at a more
general script that can monitor all system calls and collect additional
information about them.
Listing 4 shows a simple script that includes a global variable definition
and three separate probes. The first probe is invoked when the script is
first loaded (the begin probe). In this probe,
you simply emit a text message to indicate that the script is running in
the kernel. Next, you have a syscall probe.
Note the use of the wildcard (*) here, which
tells SystemTap to monitor all matching system calls. When the probe
fires, you increment an associative array element for the given PID and
process name. The final probe is a timer probe. This probe fires after
10,000 milliseconds (10 seconds). The script for this probe then emits the
collected data (iterating through each of the associative array members).
When all members have been iterated, the exit
call is made, which causes the module to unload and all associated
SystemTap processes to exit.
Listing 4. Monitoring all system calls (profile.stp)
global syscalllist
probe begin {
printf("System Call Monitoring Started (10 seconds)...\n")
}
probe syscall.*
{
syscalllist[pid(), execname()]++
}
probe timer.ms(10000) {
foreach ( [pid, procname] in syscalllist ) {
printf("%s[%d] = %d\n", procname, pid, syscalllist[pid, procname] )
}
exit()
}
The output for the script in Listing 4 is shown in Listing 5. You can see from this output each of the processes running in user space and the number of system calls each made over the 10-second period.
Listing 5. Output from the profile.stp script
$ sudo stap profile.stp
System Call Monitoring Started (10 seconds)...
stapio[16208] = 104
gnome-terminal[6416] = 196
Xorg[5525] = 90
vmware-guestd[5307] = 764
hald-addon-stor[4969] = 30
hald-addon-stor[4988] = 15
update-notifier[6204] = 10
munin-node[5925] = 5
gnome-panel[6190] = 33
ntpd[5830] = 20
pulseaudio[6152] = 25
miniserv.pl[5859] = 10
syslogd[4513] = 5
gnome-power-man[6215] = 4
gconfd-2[6157] = 5
hald[4877] = 3
$
System call monitoring for a specific process
In this example, you modify your last script slightly to collect system call data for a single process. Further, instead of just capturing counts, you capture the specific system call that's being made for your target process. The script is shown in Listing 6.
This example tests for the particular process of interest (in this case,
the syslog daemon), then changes your
associative array to map system call names to counts. When your timer
probe fires, you emit the system call and count data.
Listing 6. New system call monitoring script (syslog_profile.stp)
global syscalllist
probe begin {
printf("Syslog Monitoring Started (10 seconds)...\n")
}
probe syscall.*
{
if (execname() == "syslogd") {
syscalllist[name]++
}
}
probe timer.ms(10000) {
foreach ( name in syscalllist ) {
printf("%s = %d\n", name, syscalllist[name] )
}
exit()
}
The output for this script is provided in Listing 7.
Listing 7. SystemTap output for the new script (syslog_profile.stp)
$ sudo stap syslog_profile.stp
Syslog Monitoring Started (10 seconds)...
writev = 3
rt_sigprocmask = 1
select = 1
$
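Matching on the process name works, but it catches every process with that name. A possible alternative, sketched here, filters on a PID passed on the command line through stap's -x option and read in the script with target(). The file name target_profile.stp and the PID in the usage comment are illustrative.

```systemtap
# target_profile.stp: count system calls for one PID only.
# Possible invocation (4513 is just an example PID):
#   sudo stap -x 4513 target_profile.stp
global syscalllist

probe syscall.* {
  # target() returns the PID supplied with -x
  if (pid() == target()) {
    syscalllist[name]++
  }
}

# Emit the counts when the user presses Ctrl-C
probe end {
  foreach (n in syscalllist) {
    printf("%s = %d\n", n, syscalllist[n])
  }
}
```

This variant runs until interrupted rather than for a fixed 10 seconds, which suits processes that make system calls only occasionally.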
Using aggregates to capture numerical data
Aggregate instances are a great way to capture statistics on numerical
values. This method is efficient and useful when you're capturing a large
amount of data. In this example, you collect data on network packet
receipt and transmission. Listing 8 defines two new probes to capture
network I/O. Each probe captures the packet length for the given network
device name, PID, and process name. The end probe, which is called when the
user presses Ctrl-C, provides the means to emit the captured data. In this
case, you iterate through the contents of the
recv aggregate, count the packets for each
tuple (device name, PID, and process name), then emit this data. Note the
extractor used here: the
@count extractor returns the number of lengths
captured (the packet count). You could also use the
@sum extractor to total the packet lengths, the
@min or @max extractor to
gather the minimum or maximum length, respectively, or the
@avg extractor to compute the average length.
Listing 8. Gathering network packet length data (net.stp)
global recv, xmit
probe begin {
printf("Starting network capture (Ctl-C to end)\n")
}
probe netdev.receive {
recv[dev_name, pid(), execname()] <<< length
}
probe netdev.transmit {
xmit[dev_name, pid(), execname()] <<< length
}
probe end {
printf("\nEnd Capture\n\n")
printf("Iface Process........ PID.. RcvPktCnt XmtPktCnt\n")
foreach ([dev, pid, name] in recv) {
recvcount = @count(recv[dev, pid, name])
xmitcount = @count(xmit[dev, pid, name])
printf( "%5s %-15s %-5d %9d %9d\n", dev, name, pid, recvcount, xmitcount )
}
delete recv
delete xmit
}
The output for the script in Listing 8 is provided in Listing 9. Note here that the script exits after the user presses Ctrl-C, then emits the captured data.
Listing 9. The output for net.stp
$ sudo stap net.stp
Starting network capture (Ctl-C to end)
^C
End Capture

Iface Process........ PID.. RcvPktCnt XmtPktCnt
 eth0 swapper         0           122        85
 eth0 metacity        6171          4         2
 eth0 gconfd-2        6157          5         1
 eth0 firefox         21424        48        98
 eth0 Xorg            5525         36        21
 eth0 bash            22860         1         0
 eth0 vmware-guestd   5307          1         1
 eth0 gnome-screensav 6244          6         3
Pass 5: run completed in 0usr/50sys/37694real ms.
$
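The other extractors follow the same pattern as @count. The sketch below is a simplified variant of the network script that keys only on the device name; the array name sizes and the output format are illustrative.

```systemtap
# Sketch: several extractors applied to the same aggregate.
global sizes

probe netdev.receive {
  sizes[dev_name] <<< length
}

probe end {
  foreach (dev in sizes) {
    printf("%s: %d packets, %d bytes, min %d, max %d, avg %d\n",
           dev, @count(sizes[dev]), @sum(sizes[dev]),
           @min(sizes[dev]), @max(sizes[dev]), @avg(sizes[dev]))
  }
}
```

Every key visited by foreach has received at least one value, so the extractors always have data to work with.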
This final example explores how easy it is for SystemTap to present data in
other forms, in this case a histogram. Returning to the previous
example, capture your data into an aggregate called histogram (see
Listing 10). Then, use the netdev receive and
transmit probes to capture the packet length data. When the script ends,
the end probe emits the data in a histogram using the
@hist_log extractor.
Listing 10. Capturing and presenting histogram data (nethist.stp)
global histogram
probe begin {
printf("Capturing...\n")
}
probe netdev.receive {
histogram <<< length
}
probe netdev.transmit {
histogram <<< length
}
probe end {
printf( "\n" )
print( @hist_log(histogram) )
}
The output from Listing 10 is shown in Listing 11. In this example, a
browser session, FTP session, and ping were
used to generate network traffic. The @hist_log
extractor produces a base-2 logarithmic histogram (as shown). The
@hist_linear extractor is also available and lets you define the bucket
range and width.
Listing 11. Histogram output from nethist.stp
$ sudo stap nethist.stp
Capturing...
^C
value |-------------------------------------------------- count
8 | 0
16 | 0
32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1601
64 |@ 52
128 |@ 46
256 |@@@@ 164
512 |@@@ 140
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2033
2048 | 0
4096 | 0
$
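For comparison, a linear histogram of the same data might look like the following sketch. The range of 0 to 1600 bytes and the bucket width of 100 are assumptions chosen to cover typical Ethernet frame sizes, not values from the original test run.

```systemtap
# Sketch: packet lengths in fixed-width buckets instead of powers of two.
global histogram

# One handler can serve both probe points, because each provides "length"
probe netdev.receive, netdev.transmit {
  histogram <<< length
}

probe end {
  # Guard against an empty aggregate if no packets were seen
  if (@count(histogram) > 0) {
    # Arguments: aggregate, low bound, high bound, bucket width
    print(@hist_linear(histogram, 0, 1600, 100))
  }
}
```

Linear buckets give finer resolution near the upper end of the range, where the logarithmic histogram lumps everything from 1024 to 2047 bytes together.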
This article barely scratched the surface of the capabilities of SystemTap. In the Resources section, you'll find links to a number of tutorials, examples, and the language reference, which tells you everything you need to know to use SystemTap. SystemTap uses several existing methods and has learned from prior implementations of kernel tracing. Although it's still under active development, the tool is very usable now, and it will be interesting to see what comes in the future.
Learn
- Check out the SystemTap project Web site for the latest information, including current versions, documentation, links, and also how to get involved with the SystemTap project. SystemTap uses Kprobes as the underlying method of installing probe points into a running kernel. Learn more about Kprobes at the same Sourceware Web site.
- The IBM Redpaper "SystemTap: Instrumenting the Linux Kernel for Analyzing Performance and Functional Problems" provides more information on using SystemTap.
- An IBM Blueprint for Linux on IBM systems shows how to install and use SystemTap on Red Hat Enterprise Linux and SUSE Linux Enterprise Server. Another Blueprint discusses how to use the SystemTap GUI, an Eclipse-based tool that simplifies writing SystemTap scripts and visualizing kernel events.
- Learn how to modify SystemTap for Ubuntu 8.10 to correct the bug in the run time's time.c file.
- The paper "Dynamic Instrumentation of Production Systems," presented at the 2004 USENIX conference by its authors from Sun Microsystems, introduced the DTrace facility.
- This architecture paper from 2005 introduces the SystemTap architecture and design format. There, you can learn the motivation and requirements behind SystemTap. In addition to providing a great amount of technical detail on SystemTap, the paper is also a great model for design documentation.
- This Kprobes tutorial, given at the 2006 Ottawa Linux symposium, provides a short but useful introduction to kernel probing with Kprobes. You might also find the article "Kernel debugging with Kprobes" (developerWorks, August 2004) interesting.
- In this presentation titled "Dynamic Tracing and Performance Analysis Using SystemTap," Josh Stone of Intel provides a great tutorial on SystemTap. This presentation provides a fairly complete introduction to SystemTap, its language, and its use.
- The SystemTap Language Reference is a great resource for learning the SystemTap language and all its capabilities.
- Wikipedia provides a number of useful resources for SystemTap, DTrace, and ProbeVue. You'll also find a set of external links for presentations and tutorials for each of these technologies.
Discuss
- Discuss SystemTap and the SystemTap GUI on the SystemTap Blueprints Community Forum.

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.





