Linux introspection and SystemTap

An interface and language for dynamic kernel analysis

Modern operating system kernels provide the means for introspection, the ability to peer dynamically within the kernel to understand its behaviors. These behaviors can indicate problems in the kernel as well as performance bottlenecks. With this knowledge, you can tune or modify the kernel to avoid failure conditions. Discover an open source infrastructure called SystemTap that provides this dynamic introspection for the Linux® kernel.

M. Tim Jones (mtj@mtjones.com), Independent author

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.



09 November 2009


SystemTap is a dynamic method of monitoring and tracing the operation of a running Linux kernel. The key word there is dynamic, because instead of building a special kernel with instrumentation, SystemTap allows you to install that instrumentation dynamically at run time. It does this with an application programming interface (API) called Kprobes, which this article explores. Let's begin with an exploration of some of the earlier kernel tracing approaches, then dig into the SystemTap architecture and its use.

Kernel tracing

SystemTap is similar to an older technology called DTrace, which originated in the Sun Solaris operating system. Within DTrace, developers can write scripts in the D programming language (a subset of the C language but modified to support trace-specific behaviors). A DTrace script contains a number of probes and associated actions that occur when the probe "fires". For example, a probe can represent something as simple as invoking a system call or more complicated interactions such as a particular line of code being executed. Listing 1 shows a simple example of a DTrace script that counts the number of system calls made by each process. (Note the use of the aggregation @num, keyed by PID and process name, to associate counts with processes.) The format of the script includes the probe (which fires when a system call is made) and an action (the corresponding action script).

Listing 1. A simple DTrace script to count system calls per process
syscall:::entry 
{ 

  @num[pid,execname] = count(); 

}

DTrace has been an enviable part of Solaris, so it's not surprising to find it developed for other operating systems, as well. DTrace was released under the Common Development and Distribution License (CDDL) and has been ported into the FreeBSD operating system.

Another useful kernel-tracing facility called ProbeVue was developed by IBM for the IBM® AIX® operating system, version 6.1. You can use ProbeVue to explore the behavior and performance of a system as well as provide detailed information about a particular process. The tool does this in a dynamic way, using a standard kernel. Listing 2 shows a sample script for ProbeVue that indicates the particular process that's calling the sync system call.

Listing 2. A simple ProbeVue script to indicate which of your processes invoked sync
@@syscall:*:sync:entry
{
  printf( "sync() syscall invoked by process ID %d\n", __pid );
  exit();
}

Given the usefulness of DTrace and ProbeVue on their respective operating systems, an open source project for Linux was inevitable. SystemTap began in 2005 and provides similar functionality to both DTrace and ProbeVue. It was developed by a community that includes Red Hat, Intel, Hitachi, and IBM, among others.

Each of these solutions provides similar functionality, using probes and associated action scripts when the probes fire. Now, let's look at the installation of SystemTap, then explore its architecture and use.


Installing SystemTap

Depending on your distribution and kernel, you may support SystemTap with nothing more than a SystemTap installation. In other cases, a debug kernel image is required. This section walks through installation of SystemTap on Ubuntu version 8.10 (Intrepid Ibex), which is not representative of a SystemTap installation. In the Resources section, you'll find more information on installations for other distributions and versions.

For most users, a simple installation of SystemTap is all that's required. For Ubuntu, you use apt-get:

$ sudo apt-get install systemtap

With the installation complete, you can test your kernel to see whether it supports SystemTap. The following simple command-line script meets that goal:

$ sudo stap -ve 'probe begin { log("hello world") exit() }'

If this script works, you'll see "hello world" on standard output (stdout). If not, you have some additional work to do. For Ubuntu 8.10, a debug kernel image was required. It should be possible to retrieve the package linux-image-debug-generic with apt-get, but because that didn't work directly, you can instead download the generic debug image and install it with dpkg, as shown below:

$ wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-debug-2.6.27-14-generic_2.6.27-14.39_i386.ddeb
$ sudo dpkg -i linux-image-debug-2.6.27-14-generic_2.6.27-14.39_i386.ddeb

You now have a generic debug image installed. There is just one more step for Ubuntu 8.10: The SystemTap distribution had a problem that you can easily solve by modifying the SystemTap source. Check out Resources for information on how to update the runtime time.c file.

If you have a custom kernel, you'll need to ensure that the kernel options CONFIG_RELAY, CONFIG_DEBUG_FS, CONFIG_DEBUG_INFO, and CONFIG_KPROBES are enabled.


SystemTap architecture

Let's dig into some of the details of SystemTap to understand how it provides dynamic probes within a running kernel. You'll also see how SystemTap works, from the scripting process to getting these scripts active within a running kernel.

Dynamically instrumenting a kernel

Two of the methods used in SystemTap to instrument a running kernel are Kprobes and return probes. But an important element of understanding any kernel is the map of the kernel, which provides symbol information (such as functions and variables as well as their addresses). Having the map, you can resolve the address of any symbol and make changes to support probing behavior.

Kprobes has been in the mainline Linux kernel since version 2.6.9 and provides a general service for probing a kernel. It provides a few different services, but two of the most important are Kprobe and Kretprobe. The Kprobe is architecture specific and inserts a breakpoint instruction at the first byte of the instruction to be probed. When the instruction is hit, the particular handler for the probe is executed. When complete, the original instruction is executed (from the breakpoint), and execution continues at that point.

Kretprobes are a bit different, operating on the return of the called function. Because a function may have many return points, this sounds a bit complicated. However, it actually uses a simple technique called a trampoline. Rather than instrument every return point in a function (which would not catch all cases), you add a small amount of code to the function entry. This code replaces the return address on the stack with the trampoline address (the Kretprobe address). When the function exits, instead of returning to the caller, it calls the Kretprobe, which executes its functionality, then returns to the actual caller from the Kretprobe.

The SystemTap process

Figure 1 presents the basic flow of the SystemTap process, involving three interacting utilities over five phases. The process begins with the SystemTap script. You use the stap utility to convert the stap script into the kernel module that provides the probe behaviors. The stap process begins with a translation of the script into a parse tree (pass 1). The symbols are then resolved using symbol information about the currently running kernel in the elaboration step (pass 2). The translation process then converts the parse tree into C source (pass 3) and uses the resolved information as well as what are called tapset scripts (libraries of useful functionality defined by SystemTap). The final step of stap is the construction of the kernel module (pass 4), which uses the local kernel module build process.

Figure 1. The SystemTap process

With the availability of the kernel module, stap hands control over to two other SystemTap utilities: staprun and stapio. These two utilities work in concert to manage the installation of the module into the kernel and route its output to stdout (pass 5). If you press Ctrl-C in the shell or the script exits, the cleanup process is performed, which unloads the module and causes all associated utilities to exit.

An interesting feature of SystemTap is the ability to cache script translations. If a script is to be installed and hasn't changed, you can use the existing module instead of going through the process of rebuilding it. Figure 2 shows the user-space and kernel-space elements along with the stap-based translation process.

Figure 2. The SystemTap process from the kernel/user-space perspective

SystemTap scripting

Scripting in SystemTap is quite simple but also flexible, with many options to get out of it what you need. The Resources section provides links to manuals that detail the language and possibilities, but this section explores a few examples to give you a taste of SystemTap scripts.

Probes

SystemTap scripts are made up of probes and associated blocks of code to be executed when the probe fires. Probes have a number of defined patterns, such as those shown in Table 1. This table enumerates several probe types, including calling a kernel function and returning from a kernel function.

Table 1. Example probe patterns
Probe type                                   Description
begin                                        Fires when the script begins
end                                          Fires when the script ends
kernel.function("sys_sync")                  Fires when sys_sync is called
kernel.function("sys_sync").call             Same as above
kernel.function("sys_sync").return           Fires when sys_sync returns
syscall.*                                    Fires when any system call is made
kernel.function("*@kernel/fork.c:934")       Fires when line 934 of fork.c is hit
module("ext3").function("ext3_file_write")   Fires when the ext3 write function is called
timer.jiffies(1000)                          Fires every 1000 kernel jiffies
timer.ms(200).randomize(50)                  Fires every 200ms, with a linearly distributed random additive (-50 to +50)

Let's look at a simple example to understand how to construct a probe and associate code with that probe. A sample probe is shown in Listing 3 that exists to fire when the kernel system call sys_sync is invoked. When this probe fires, you want to count the number of invocations and emit this count with an indication of the calling process ID (PID). First, declare a global value that any probe can use (the global namespace is common to all probes), then initialize it to zero. Next, define your probe, which is an entry probe into the kernel function sys_sync. The script associated with the probe is to increment the count variable, then to emit a message that defines the number of times the call has been made and the PID for the current invocation. Note that this example appears very much like C (except for the probe definition syntax), which, if you have a background in C, is great.

Listing 3. A simple probe and script
global count=0

probe kernel.function("sys_sync") {
  count++
  printf( "sys_sync called %d times, currently by pid %d\n", count, pid() );
}
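
As a sketch of how the entry and return forms of a probe can work together, the following script times each sys_sync call by recording a timestamp on entry and computing the difference on return. The array name entry_us is illustrative, and the script assumes your kernel exposes a function named sys_sync, as the examples in this article do.

```systemtap
# Timestamps of in-flight sys_sync calls, keyed by thread ID
# (entry_us is a hypothetical name chosen for this sketch)
global entry_us

probe kernel.function("sys_sync").call {
  entry_us[tid()] = gettimeofday_us()
}

probe kernel.function("sys_sync").return {
  # Only report calls whose entry was also observed
  if (tid() in entry_us) {
    printf("sys_sync took %d us in %s[%d]\n",
           gettimeofday_us() - entry_us[tid()], execname(), pid())
    delete entry_us[tid()]
  }
}
```

Keying the array by thread ID keeps concurrent callers from overwriting each other's start times.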

You can also declare functions that probes can call, which is perfect for common functions for which you'd like to serve multiple probes. The tool even supports recursion to a given depth.
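
A minimal sketch of a shared function follows. The function name note and the array name calls are hypothetical, and the two probe points are assumed to be available in your installed tapset library.

```systemtap
global calls

# Shared helper: the parameter type (string) is inferred from the callers
function note(what) {
  calls[what, execname()]++
}

probe syscall.sync  { note("sync") }
probe syscall.fsync { note("fsync") }

# Report after 10 seconds, then unload
probe timer.s(10) {
  foreach ([what, proc] in calls)
    printf("%s by %s: %d\n", what, proc, calls[what, proc])
  exit()
}
```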

Variables and types

SystemTap permits definition of variables of a number of types, but the type is inferred from context, so no type declarations are needed. In SystemTap, you'll find integers (64-bit signed quantities), strings, and literals (strings or integers). You can also use associative arrays and statistics (which we'll explore later).

Expressions

SystemTap provides all of the necessary operators that you'd expect from C and follows the same rules. You'll find arithmetic operators, binary operators, assignment operators, and pointer dereferencing. You'll also find some simplifications from C, which include string concatenation, associative array elements, and aggregation operators.
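
The short sketch below exercises a few of these operators together, including string concatenation and an associative array element. The variable names are illustrative only, and no types are declared because they are inferred from use.

```systemtap
global seen

probe begin {
  label = "pid " . sprintf("%d", pid())   # "." concatenates strings
  mask = (1 << 4) | 0x3                    # bitwise operators on integers
  seen[label] = mask % 7                   # associative array element
  # "in" tests array membership; "?:" works as in C
  printf("%s -> %d (%s)\n", label, seen[label],
         (label in seen) ? "stored" : "missing")
  exit()
}
```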

Language elements

Within a probe, SystemTap provides a comfortable set of statements reminiscent of C. Note that although the language allows you to develop complex scripts, only 1000 statements can be executed per probe (though this number is configurable). Table 2 provides a short list of the language statements just to provide an overview. Note here that many elements appear exactly as they do in C, though there are some additions specific to SystemTap.

Table 2. Language elements of SystemTap
Statement                       Description
if (exp) {} else {}             Standard if-then-else statement
for (exp1; exp2; exp3) {}       A for loop
while (exp) {}                  Standard while loop
do {} while (exp)               A do-while loop
break                           Exit iteration
continue                        Continue iteration
next                            Return from the probe
return                          Return an expression from a function
foreach (VAR in ARRAY) {}       Iterate an array, assigning the current key to VAR
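
To show several of these statements in context, here is a small sketch that counts read system calls per process name and prints a handful of the busier ones. The thresholds and variable names are arbitrary choices for illustration.

```systemtap
global reads

probe syscall.read {
  # "next" leaves the probe early; here it skips SystemTap's own I/O process
  if (execname() == "stapio")
    next
  reads[execname()]++
}

probe timer.s(5) {
  shown = 0
  foreach (proc in reads) {
    if (reads[proc] < 10)
      continue              # ignore quiet processes
    printf("%-16s %d\n", proc, reads[proc])
    shown++
    if (shown >= 5)
      break                 # stop after five lines of output
  }
  exit()
}
```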

This article explores the facilities of statistics and aggregation in the sample scripts, as these do not have counterparts in the C language.

Finally, SystemTap provides a number of internal functions that offer additional information about the current context. For example, you can use caller() to identify the calling function, cpu() to identify the current processor number, and pid() to return the PID. SystemTap provides a number of other functions, as well, providing access to the call stack and current registers.
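
The context functions named above can be combined in a single probe, as in the sketch below. It uses a return probe on the assumption that caller() is most reliable there, and it again assumes a kernel function named sys_sync.

```systemtap
probe kernel.function("sys_sync").return {
  # execname(), pid(), cpu(), and caller() all describe the current context
  printf("%s[%d] on cpu %d, caller: %s\n",
         execname(), pid(), cpu(), caller())
}
```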


SystemTap examples

With a quick introduction to SystemTap under your belt, let's explore a few simple examples to see how SystemTap really works. This article also demonstrates some of the interesting aspects of the scripting language, such as aggregations.

System call monitoring

The previous section explored a simple script to monitor the sync system call. Now, let's look at a more general script that can monitor all system calls and collect additional information about them.

Listing 4 shows a simple script that includes a global variable definition and three separate probes. The first probe is invoked when the script is first loaded (the begin probe). In this probe, you simply emit a text message to indicate that the script is running in the kernel. Next, you have a syscall probe. Note the use of the wildcard (*) here, which tells SystemTap to monitor all matching system calls. When the probe fires, you increment an associative array element for the given PID and process name. The final probe is a timer probe. This probe fires after 10,000 milliseconds (10 seconds). The script for this probe then emits the collected data (iterating through each of the associative array members). When all members have been iterated, the exit call is made, which causes the module to unload and all associated SystemTap processes to exit.

Listing 4. Monitoring all system calls (profile.stp)
global syscalllist

probe begin {
  printf("System Call Monitoring Started (10 seconds)...\n")
}

probe syscall.*
{
  syscalllist[pid(), execname()]++
}

probe timer.ms(10000) {
  foreach ( [pid, procname] in syscalllist ) {
    printf("%s[%d] = %d\n", procname, pid, syscalllist[pid, procname] )
  }
  exit()
}

The output for the script in Listing 4 is shown in Listing 5. You can see from this output each of the processes running in user space and the number of system calls each made over the 10-second period.

Listing 5. Output from the profile.stp script
$ sudo stap profile.stp
System Call Monitoring Started (10 seconds)...
stapio[16208] = 104
gnome-terminal[6416] = 196
Xorg[5525] = 90
vmware-guestd[5307] = 764
hald-addon-stor[4969] = 30
hald-addon-stor[4988] = 15
update-notifier[6204] = 10
munin-node[5925] = 5
gnome-panel[6190] = 33
ntpd[5830] = 20
pulseaudio[6152] = 25
miniserv.pl[5859] = 10
syslogd[4513] = 5
gnome-power-man[6215] = 4
gconfd-2[6157] = 5
hald[4877] = 3
$
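
The output above is unordered. As a variation, and only a sketch, the foreach statement accepts a sort suffix and a limit clause, so the same data can be reduced to the busiest processes. The choice of five entries is arbitrary.

```systemtap
global syscalllist

probe syscall.* {
  syscalllist[pid(), execname()]++
}

probe timer.ms(10000) {
  # The "-" after the array name sorts by value, descending;
  # "limit 5" stops after the five largest counts
  foreach ([pid, procname] in syscalllist- limit 5) {
    printf("%s[%d] = %d\n", procname, pid, syscalllist[pid, procname])
  }
  exit()
}
```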

System call monitoring for a specific process

In this example, you modify your last script slightly to collect system call data for a single process. Further, instead of just capturing counts, you capture the specific system call that's being made for your target process. The script is shown in Listing 6.

This example tests for the particular process of interest (in this case, the syslog daemon), then changes your associative array to map system call names to counts. When your timer probe fires, you emit the system call and count data.

Listing 6. New system call monitoring script (syslog_profile.stp)
global syscalllist

probe begin {
  printf("Syslog Monitoring Started (10 seconds)...\n")
}

probe syscall.*
{
  if (execname() == "syslogd") {
    syscalllist[name]++
  }
}

probe timer.ms(10000) {
  foreach ( name in syscalllist ) {
    printf("%s = %d\n", name, syscalllist[name] )
  }
  exit()
}

The output for this script is provided in Listing 7.

Listing 7. SystemTap output for the new script (syslog_profile.stp)
$ sudo stap syslog_profile.stp
Syslog Monitoring Started (10 seconds)...
writev = 3
rt_sigprocmask = 1
select = 1
$
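
Matching on the process name catches every process called syslogd. If you would rather select one process by PID, a possible variation is to compare against target(), which returns the PID supplied with the stap -x option. The sketch below assumes you pass that option when you start the script.

```systemtap
global syscalllist

probe syscall.* {
  # target() is the PID given on the command line with -x
  if (pid() == target())
    syscalllist[name]++
}

probe timer.ms(10000) {
  foreach (name in syscalllist)
    printf("%s = %d\n", name, syscalllist[name])
  exit()
}
```

You would run this with a command such as sudo stap -x PID script.stp, where PID is the process of interest and script.stp is whatever file name you saved the script under.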

Using aggregates to capture numerical data

Aggregate instances are a great way to capture statistics on numerical values. This method is efficient and useful when you're capturing a large amount of data. In this example, you collect data on network packet receipt and transmission. Listing 8 defines two new probes to capture network I/O. Each probe captures the packet length for the given network device name, PID, and process name. The end probe, which fires when the script terminates (for example, when the user presses Ctrl-C), provides the means to emit the captured data. In this case, you iterate through the contents of the recv aggregate, count the packets captured for each tuple (device name, PID, and process name), then emit this data. Note the extractor used here: @count returns the number of lengths captured for a tuple (the packet count). You could also use the @sum extractor to total the packet lengths, @min or @max to gather the minimum or maximum length, respectively, and @avg to compute the average.

Listing 8. Gathering network packet length data (net.stp)
global recv, xmit

probe begin {
  printf("Starting network capture (Ctl-C to end)\n")
}

probe netdev.receive {
  recv[dev_name, pid(), execname()] <<< length
}

probe netdev.transmit {
  xmit[dev_name, pid(), execname()] <<< length
}

probe end {
  printf("\nEnd Capture\n\n")

  printf("Iface Process........ PID.. RcvPktCnt XmtPktCnt\n")

  foreach ([dev, pid, name] in recv) {
    recvcount = @count(recv[dev, pid, name])
    xmitcount = @count(xmit[dev, pid, name])
    printf( "%5s %-15s %-5d %9d %9d\n", dev, name, pid, recvcount, xmitcount )
  }

  delete recv
  delete xmit
}

The output for the script in Listing 8 is provided in Listing 9. Note here that the script exits after the user presses Ctrl-C, then emits the captured data.

Listing 9. The output for net.stp
$ sudo stap net.stp
Starting network capture (Ctl-C to end)
^C
End Capture

Iface Process........ PID.. RcvPktCnt XmtPktCnt
 eth0 swapper         0           122        85
 eth0 metacity        6171          4         2
 eth0 gconfd-2        6157          5         1
 eth0 firefox         21424        48        98
 eth0 Xorg            5525         36        21
 eth0 bash            22860         1         0
 eth0 vmware-guestd   5307          1         1
 eth0 gnome-screensav 6244          6         3
Pass 5: run completed in 0usr/50sys/37694real ms.
$
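
The other extractors work on the same kind of aggregate. The following sketch, which uses a hypothetical array name rxlen and keys only on the device name, reports the count, total, minimum, average, and maximum received packet length per interface.

```systemtap
global rxlen

probe netdev.receive {
  rxlen[dev_name] <<< length
}

probe end {
  foreach (dev in rxlen) {
    # Each extractor reads a different statistic from the same aggregate
    printf("%s: %d packets, %d bytes, min %d, avg %d, max %d\n",
           dev, @count(rxlen[dev]), @sum(rxlen[dev]),
           @min(rxlen[dev]), @avg(rxlen[dev]), @max(rxlen[dev]))
  }
}
```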

Capturing histogram data

This final example explores how easy it is for SystemTap to present data in other forms (in this case, a histogram). Returning to the previous example, capture your data into an aggregate called histogram (see Listing 10). Then, use the netdev receive and transmit probes to capture the packet length data. When the script ends, the end probe emits the data as a histogram using the @hist_log extractor.

Listing 10. Capturing and presenting histogram data (nethist.stp)
global histogram

probe begin {
  printf("Capturing...\n")
}

probe netdev.receive {
  histogram <<< length
}

probe netdev.transmit {
  histogram <<< length
}

probe end {
  printf( "\n" )
  print( @hist_log(histogram) )
}

The output from Listing 10 is shown in Listing 11. In this example, a browser session, FTP session, and ping were used to generate network traffic. The @hist_log extractor is a base-2 logarithmic histogram (as shown). Other histograms can be captured, allowing you to define the bucket sizes.

Listing 11. Histogram output from nethist.stp
$ sudo stap nethist.stp 
Capturing...
^C
value |-------------------------------------------------- count
    8 |                                                      0
   16 |                                                      0
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            1601
   64 |@                                                    52
  128 |@                                                    46
  256 |@@@@                                                164
  512 |@@@                                                 140
 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2033
 2048 |                                                      0
 4096 |                                                      0

$
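
For fixed-width buckets, one option is the @hist_linear extractor, which takes a low bound, a high bound, and a bucket width. The bounds in this sketch (0 to 1500 in steps of 100) are an assumption chosen to suit typical Ethernet packet sizes, and the array name sizes is illustrative.

```systemtap
global sizes

# One handler can serve several probe points
probe netdev.receive, netdev.transmit {
  sizes <<< length
}

probe end {
  # Buckets from 0 to 1500, each 100 wide
  print(@hist_linear(sizes, 0, 1500, 100))
}
```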

Going further

This article barely scratched the surface of the capabilities of SystemTap. In the Resources section, you'll find links to a number of tutorials, examples, and the language reference, which tells you everything you need to know to use SystemTap. SystemTap uses several existing methods and has learned from prior implementations of kernel tracing. Although it's still under active development, the tool is very usable now, and it will be interesting to see what comes in the future.

Resources

Learn

Get products and technologies

  • With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.

Discuss
