
Speed your code with the GNU profiler

Target the parts of your applications that take the most time

Martyn Honeyford (ibmmartyn@gmail.com), Software Engineer, IBM
Martyn Honeyford graduated from Nottingham University with a BS in Computer Science in 1996. He has worked as a software engineer at IBM UK Labs in Hursley, England, ever since. His current role is as a developer in the WebSphere MQ Everyplace development team. When not working, Martyn can usually be found either playing the electric guitar (badly) or playing video games more than most people would consider healthy.

Summary:  Improving the performance of your applications is rarely a wasted effort, but it's not always clear which functions the program is spending most of its execution time on. Learn how to pinpoint performance bottlenecks using gprof for both user-space and system calls on Linux®.

Date:  03 Apr 2006
Level:  Introductory


Introduction

The performance needs of software vary, but it's probably not surprising that many applications have very stringent speed requirements. Video players are a good example: a video player is not much use if it can only play a video at 75 percent of the required speed.

Other applications, such as video encoding, are lengthy operations that are best run "batch" style, where you start a job and leave it running while you go do something else. Although these applications don't have such hard performance limits, increased speed still brings benefits, such as encoding more videos in a given period or encoding at a higher quality in the same time.

In general, for all but the simplest applications, the better the performance, the more useful and popular the application will be. For this reason, performance considerations are (or should be) in the forefront of many application developers' minds.

Unfortunately, much of the effort expended on making applications faster is wasted, because developers often micro-optimize their software without first understanding how the program behaves at a macro scale. For instance, you might spend a great deal of time making a particular function run twice as fast, which is all well and good; but if that function is called only rarely (when a file is opened, say), then reducing its execution time from 200ms to 100ms isn't going to make much difference to the overall execution time of the software.

Your time is more fruitfully spent optimizing the parts of the software that are called most frequently. For example, if the application spends 50 percent of its time in string-handling functions and you can make those functions 10 percent faster, the overall execution time drops by 5 percent: the other half of the run is unchanged, while the string-handling half shrinks from 50 percent to 45 percent of the original run time.

It is therefore vital to have accurate information on exactly where the time is being spent within your application -- measured against realistic input data -- if you hope to optimize it effectively. This activity is called code profiling. This article introduces a profiling tool provided with the GNU compiler tool chain, the imaginatively named GNU profiler (gprof), for readers who are relatively new to open source development tools.

gprof to the rescue

Before I race off and explain how to use gprof, it's important to know where in the development cycle profiling should take place. In general, code should be written with the following three goals, in order of importance:

  1. Make the software work correctly. This must always be the focus of development. In general, there is no point writing software that is very fast if it does not do what it is supposed to! Obviously, correctness is something of a grey area; a video player that works on 99 percent of your files or plays video with the occasional visual glitch is still of some use, but in general, correctness is more important than speed.

  2. Make the software maintainable. This is really a sub-point of the first goal. In general, if software is not written to be maintainable, then even if it works to begin with, sooner or later you (or someone else) will end up breaking it trying to fix bugs or add new features.

  3. Make the software fast. Here is where profiling comes in. Once the software is working correctly, then start profiling to help it run more quickly.

Assuming you have a working application, let's look at how to use gprof to measure where time is spent during the application's execution, giving you clues as to where your optimization efforts will pay off most.

gprof can profile C, C++, Pascal, and Fortran 77 applications. The examples here use C.


Listing 1. Sample time-wasting application
#include <stdio.h>
#include <stdlib.h>  /* for atoi() and exit() */

int a(void) {
  int i=0,g=0;
  while(i++<100000)
  {
     g+=i;
  }
  return g;
}
int b(void) {
  int i=0,g=0;
  while(i++<400000)
  {
    g+=i;
  }
  return g;
}

int main(int argc, char** argv)
{
   int iterations;

   if(argc != 2)
   {
      printf("Usage %s <No of Iterations>\n", argv[0]);
      exit(-1);
   }
   else
      iterations = atoi(argv[1]);

   printf("No of iterations = %d\n", iterations);

   while(iterations--)
   {
      a();
      b();
   }
}

As you can see from the code, this very simple app includes two functions, a and b, both of which sit in busy loops wasting CPU cycles. The main function simply sits in a loop calling each of them. The second function, b, iterates four times as many times as a, so once you've profiled the code you would expect it to spend roughly 20 percent of its time in a and 80 percent in b. Let's enable profiling and see whether those expectations are correct.

Enabling profiling is as simple as adding -pg to the gcc compile flags. Compile as follows:

gcc example1.c -pg -o example1 -O2 -lc
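Note that -pg is needed both when compiling and when linking: compiling with -pg inserts the call-counting instrumentation into each function, while linking with -pg pulls in the profiling startup code that writes the output file. If you build in separate steps, a sketch of the equivalent two-step build looks like this:

gcc -pg -O2 -c example1.c
gcc -pg example1.o -o example1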

Once you have built the application, simply run it as normal:

./example1 50000

Once this completes, you should see a file called gmon.out created in the current directory.
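gmon.out is only written when the program terminates normally (by returning from main or calling exit), and it is overwritten on every run. If you are using glibc and want to keep the data from several runs, setting the GMON_OUT_PREFIX environment variable should make each run write to a separate file named with that prefix plus the process ID. This is a glibc feature rather than part of gprof itself, so treat it as a hint to verify on your system:

GMON_OUT_PREFIX=example1.gmon ./example1 50000
ls example1.gmon.*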

Using the output

First look at the "flat profile," which you obtain with the gprof command, passing the executable itself and the gmon.out file as follows:

gprof example1 gmon.out -p

This outputs the following:


Listing 2. Results of flat profile
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.24     63.85    63.85    50000     1.28     1.28  b
 20.26     79.97    16.12    50000     0.32     0.32  a

You can see from this output that, as expected, approximately four times as much time is spent in b as in a. The absolute numbers are less useful: they are statistical estimates and tend to be fairly inaccurate due to sampling and rounding error.

Astute readers will notice that many of the functions the program calls (such as printf) are missing from the output. This is because they reside in the C runtime library (libc.so), which was not (in this case) compiled with -pg, so no profiling information is gathered for functions in that library. I will return to this point below.

Next you might like to see the "call graph," which you obtain as follows:

gprof example1 gmon.out -q

This outputs the following:


Listing 3. Call graph
                     Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.01% of 79.97 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00   79.97                 main [1]
               63.85    0.00   50000/50000       b [2]
               16.12    0.00   50000/50000       a [3]
-----------------------------------------------
               63.85    0.00   50000/50000       main [1]
[2]     79.8   63.85    0.00   50000         b [2]
-----------------------------------------------
               16.12    0.00   50000/50000       main [1]
[3]     20.2   16.12    0.00   50000         a [3]
-----------------------------------------------
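Incidentally, you do not have to request the reports one at a time: running gprof with no report option prints the flat profile followed by the call graph, and the -b (brief) option suppresses the lengthy explanatory text that normally accompanies each table:

gprof example1 gmon.out -b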

Finally, you might want an "annotated source" listing, which prints out the source code of the application, annotated with how many times each function was called.

To use this feature, compile the source with debugging information enabled (-g) so that gprof can map the profile back to individual source lines:

gcc example1.c -g -pg -o example1 -O2 -lc

Re-run the application as before:

./example1 50000

Your gprof command is now:

gprof example1 gmon.out -A

This outputs the following:


Listing 4. Annotated source
*** File /home/martynh/profarticle/example1.c:
                #include <stdio.h>

       50000 -> int a(void) {
                  int i=0,g=0;
                  while(i++<100000)
                  {
                     g+=i;
                  }
                  return g;
                }
       50000 -> int b(void) {
                  int i=0,g=0;
                  while(i++<400000)
                  {
                    g+=i;
                  }
                  return g;
                }

                int main(int argc, char** argv)
       ##### -> {
                   int iterations;

                   if(argc != 2)
                   {
                      printf("Usage %s <No of Iterations>\n", argv[0]);
                      exit(-1);
                   }
                   else
                      iterations = atoi(argv[1]);

                   printf("No of iterations = %d\n", iterations);

                   while(iterations--)
                   {
                      a();
                      b();
                   }
                }



Top 10 Lines:

     Line      Count

        3      50000
       11      50000
Execution Summary:

        3   Executable lines in this file
        3   Lines executed
   100.00   Percent of the file executed

   100000   Total number of line executions
 33333.33   Average executions per line
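Because the program was compiled with -g, gprof should also be able to break the profile down by individual source lines rather than whole functions, using the -l (line-by-line) option. Line-by-line output can be slow and noisy on large programs, so treat the per-line figures as a rough guide rather than exact measurements:

gprof example1 gmon.out -l -p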

Shared library support

As I alluded to earlier, profiling support is added by the compiler, so if you want profiling information for any shared libraries you use, including the C library itself, you need to compile them with -pg as well. Luckily, many distributions ship a version of the C library that is already compiled with profiling support (libc_p.a).

In the case of the distribution I use, Gentoo, you need to add "profile" to your USE flags and re-emerge glibc. Once that finishes, you should see that /usr/lib/libc_p.a has been created. For distributions that do not ship libc_p as standard, check whether it can be installed separately; failing that, you might need to download the glibc source and build it yourself.
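A quick way to check whether your system already provides the profiling C library is simply to look for the file (the path varies between distributions, so adjust as needed):

ls -l /usr/lib/libc_p.a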

Once you have a libc_p.a file, you can simply rebuild your example like so:

gcc example1.c -g -pg -o example1 -O2 -lc_p

If you then re-run the application as before and obtain a flat profile or a call graph, you should see lots of C runtime functions including printf (none of which are significant in our test program).
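With the profiling C library linked in, the reports become much longer. If you only care about particular functions, GNU gprof accepts a symbol specification ("symspec") on most of its report options; for example, the following should restrict the flat profile to printf. The exact symspec syntax can vary between gprof versions, so check gprof --help or the manual on your system:

gprof example1 gmon.out --flat-profile=printf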

User time vs. kernel time

Now that you know how to use gprof, you can very simply and effectively profile your applications to analyze, and hopefully remove, performance bottlenecks.

However, you might have noticed by now one of gprof's biggest limitations: it only profiles user time while the application is running. Typically, applications spend some of their time in user code and some in "system code," such as kernel system calls.

If you modify Listing 1 slightly, you can see the problem clearly:


Listing 5. Adding system calls to Listing 1
#include <stdio.h>
#include <stdlib.h>  /* for atoi() and exit() */
#include <unistd.h>  /* for sleep() */

int a(void) {
  sleep(1);
  return 0;
}
int b(void) {
  sleep(4);
  return 0;
}

int main(int argc, char** argv)
{
   int iterations;

   if(argc != 2)
   {
      printf("Usage %s <No of Iterations>\n", argv[0]);
      exit(-1);
   }
   else
      iterations = atoi(argv[1]);

   printf("No of iterations = %d\n", iterations);

   while(iterations--)
   {
      a();
      b();
   }
}

As you can see, I have modified the code so that, rather than sitting in busy loops, a and b call the C runtime function sleep to suspend execution for 1 and 4 seconds, respectively.

Compile this application as before:

gcc example2.c -g -pg -o example2 -O2 -lc_p

and execute it to loop 30 times:

./example2 30

The resulting flat profile shows the following:


Listing 6. Flat profile showing system calls
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00      120     0.00     0.00  sigprocmask
  0.00      0.00     0.00       61     0.00     0.00  __libc_sigaction
  0.00      0.00     0.00       61     0.00     0.00  sigaction
  0.00      0.00     0.00       60     0.00     0.00  nanosleep
  0.00      0.00     0.00       60     0.00     0.00  sleep
  0.00      0.00     0.00       30     0.00     0.00  a
  0.00      0.00     0.00       30     0.00     0.00  b
  0.00      0.00     0.00       21     0.00     0.00  _IO_file_overflow
  0.00      0.00     0.00        3     0.00     0.00  _IO_new_file_xsputn
  0.00      0.00     0.00        2     0.00     0.00  _IO_new_do_write
  0.00      0.00     0.00        2     0.00     0.00  __find_specmb
  0.00      0.00     0.00        2     0.00     0.00  __guard_setup
  0.00      0.00     0.00        1     0.00     0.00  _IO_default_xsputn
  0.00      0.00     0.00        1     0.00     0.00  _IO_doallocbuf
  0.00      0.00     0.00        1     0.00     0.00  _IO_file_doallocate
  0.00      0.00     0.00        1     0.00     0.00  _IO_file_stat
  0.00      0.00     0.00        1     0.00     0.00  _IO_file_write
  0.00      0.00     0.00        1     0.00     0.00  _IO_setb
  0.00      0.00     0.00        1     0.00     0.00  ____strtol_l_internal
  0.00      0.00     0.00        1     0.00     0.00  ___fxstat64
  0.00      0.00     0.00        1     0.00     0.00  __cxa_atexit
  0.00      0.00     0.00        1     0.00     0.00  __errno_location
  0.00      0.00     0.00        1     0.00     0.00  __new_exitfn
  0.00      0.00     0.00        1     0.00     0.00  __strtol_internal
  0.00      0.00     0.00        1     0.00     0.00  _itoa_word
  0.00      0.00     0.00        1     0.00     0.00  _mcleanup
  0.00      0.00     0.00        1     0.00     0.00  atexit
  0.00      0.00     0.00        1     0.00     0.00  atoi
  0.00      0.00     0.00        1     0.00     0.00  exit
  0.00      0.00     0.00        1     0.00     0.00  flockfile
  0.00      0.00     0.00        1     0.00     0.00  funlockfile
  0.00      0.00     0.00        1     0.00     0.00  main
  0.00      0.00     0.00        1     0.00     0.00  mmap
  0.00      0.00     0.00        1     0.00     0.00  moncontrol
  0.00      0.00     0.00        1     0.00     0.00  new_do_write
  0.00      0.00     0.00        1     0.00     0.00  printf
  0.00      0.00     0.00        1     0.00     0.00  setitimer
  0.00      0.00     0.00        1     0.00     0.00  vfprintf
  0.00      0.00     0.00        1     0.00     0.00  write

If you analyze the output, you can see that although the profiler has registered the correct number of calls to each function, the time recorded for them (and indeed for all functions) is 0.00. This is because the sleep function makes a call into the kernel to suspend the application, which then does nothing until the kernel wakes it up again. The time spent actually executing user code is tiny compared with the time spent suspended, so it rounds down to zero. The underlying reason is that gprof works by sampling the program at fixed intervals of its CPU time, so no samples are taken while the program is not actually running -- for example, while it is blocked in the kernel waiting for a timer or for I/O.

This is really a mixed blessing. On the one hand, it makes it difficult to optimize applications that spend most of their time in kernel space or that run slowly due to external factors such as overloading the operating system's I/O subsystem. On the other hand, it means profiling is not affected by anything else occurring on the system (such as another user using lots of CPU time).
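To make the sampling behaviour concrete, here is a small, self-contained illustration of my own (it is not part of gprof, but it uses setitimer with ITIMER_PROF, broadly the same kind of CPU-time interval timer that glibc's profiling support relies on). The program installs a SIGPROF handler that counts timer ticks, then sleeps for two seconds and busy-loops for roughly two seconds. The counter should barely move during the sleep but advance steadily during the busy loop, which is exactly why the sleep-based example above accumulates no time:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks = 0;

static void on_sigprof(int sig)
{
   (void)sig;
   ticks++;                           /* one "sample" per timer tick */
}

int main(void)
{
   struct sigaction sa;
   struct itimerval it;
   time_t end;

   memset(&sa, 0, sizeof(sa));
   sa.sa_handler = on_sigprof;
   sa.sa_flags = SA_RESTART;
   sigaction(SIGPROF, &sa, NULL);

   memset(&it, 0, sizeof(it));
   it.it_interval.tv_usec = 10000;    /* tick every 10ms of CPU time */
   it.it_value.tv_usec = 10000;
   setitimer(ITIMER_PROF, &it, NULL); /* counts user + system CPU time only */

   sleep(2);                          /* suspended in the kernel: almost no ticks */
   printf("ticks after sleep(2):   %d\n", (int)ticks);

   end = time(NULL) + 2;              /* burn CPU for roughly two seconds */
   while (time(NULL) < end)
      ;
   printf("ticks after busy loop:  %d\n", (int)ticks);

   return 0;
}

On a typical system the first count printed should be at or near zero, while the second should be in the region of 200 (one tick every 10ms for about two seconds of CPU time).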

In general, a good benchmark for seeing how useful gprof will be in helping you optimize your application is to run it under the time command. This command notes how long it takes for an application to complete and also measures how long it spends in user space and in kernel space.

If you run the example from Listing 5 under time:

time ./example2 30

This outputs the following:


Listing 7. Output of time command
No of iterations = 30

real    2m30.295s
user    0m0.000s
sys     0m0.004s

You can see that virtually no time is spent executing code in user space, so gprof is not going to be much help here.
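When time shows that the bulk of the run is spent in, or waiting on, the kernel, a different tool is needed to see where that time goes. One simple suggestion is strace with its -c option, which counts system calls and the time spent in each; it measures something quite different from gprof and adds its own overhead, so use it for orientation rather than precise figures:

strace -c ./example2 30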

Conclusion

Despite the limitations outlined above, gprof can be a very useful tool for optimizing your code, particularly when that code is CPU bound in user space. It is usually a good idea to run your application under time first to determine whether gprof is likely to yield useful information.

If gprof is not suitable for your profiling needs, a number of alternatives overcome some of gprof's limitations, including OProfile and Sysprof (see Resources for links to more information on these tools).

On the other hand, one of gprof's main advantages over the alternatives is availability: if gcc is installed, gprof almost certainly is too, on any Linux machine you use.


Resources

Learn

Get products and technologies

  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

  • With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.

Discuss

