The performance needs of software vary, but it's probably not surprising that many applications have very stringent speed requirements. Video players are a good example: a video player is not much use if it can only play a video at 75 percent of the required speed.
Other applications, such as video encoding, are lengthy operations that are best run "batch" style, where you start a job and leave it running while you go do something else. Although these types of applications don't have such hard performance limits, increasing speed will still bring benefits, such as being able to encode more videos over a given period and being able to encode at a higher quality in the same time.
In general, for all but the simplest applications, the better the performance, the more useful and popular the application will be. For this reason, performance considerations are (or should be) in the forefront of many application developers' minds.
Unfortunately, much of the effort that is expended attempting to make applications faster is wasted, because developers will often micro-optimize their software without fully exploring how the program operates at a macro scale. For instance, you might spend a large amount of time making a particular function run twice as fast, which is all well and good, but if that function is called very rarely (when a file is opened, say) then reducing the execution time from 200ms to 100ms isn't going to make much difference to the overall execution time of the software.
A more fruitful use of your time would be spent optimizing those parts of the software that are called more frequently. For example, if the application spends 50 percent of its time in string-handling functions and if you can optimize these functions by 10 percent, this will yield a 5 percent overall improvement in the application's execution time.
It is therefore vital that you have accurate information on exactly where the time is being spent within your applications -- and for real input data -- if you hope to have a chance of optimizing it effectively. This activity is called code profiling. This article introduces a profiling tool provided with the GNU compiler tool chain, the imaginatively named GNU profiler (gprof), for those who are relatively new to open source development tools.
Before I race off and explain how to use gprof, it's important to know where in the development cycle profiling should take place. In general, code should be written with the following three goals, in order of importance:
- Make the software work correctly. This must always be the focus of development. In general, there is no point writing software that is very fast if it does not do what it is supposed to! Obviously, correctness is something of a grey area; a video player that works on 99 percent of your files or plays video with the occasional visual glitch is still of some use, but in general, correctness is more important than speed.
- Make the software maintainable. This is really a sub-point of the first goal. In general, if software is not written to be maintainable, then even if it works to begin with, sooner or later you (or someone else) will end up breaking it trying to fix bugs or add new features.
- Make the software fast. Here is where profiling comes in. Once the software is working correctly, then start profiling to help it run more quickly.
Assuming you are in the position of having a working application, let's look at how to use gprof to accurately measure where time is being spent during the application's execution in order to gain clues as to where your optimizing efforts would be most profitably spent.
gprof can profile C, C++, Pascal, and Fortran 77 applications. The examples here use C.
Listing 1. Sample time-wasting application
#include <stdio.h>
int a(void) {
int i=0,g=0;
while(i++<100000)
{
g+=i;
}
return g;
}
int b(void) {
int i=0,g=0;
while(i++<400000)
{
g+=i;
}
return g;
}
int main(int argc, char** argv)
{
int iterations;
if(argc != 2)
{
printf("Usage %s <No of Iterations>\n", argv[0]);
exit(-1);
}
else
iterations = atoi(argv[1]);
printf("No of iterations = %d\n", iterations);
while(iterations--)
{
a();
b();
}
}
|
As you can see from the code, this very simple app includes two functions, a and b, both of which sit in busy loops wasting CPU cycles. The main function just sits in a loop calling each of these functions. The second function, b, loops four times more than a, so you would expect that once you've profiled the code, it should spend approximately 20 percent of its time in a and 80 percent in b. Let's enable profiling and see if those expectations are correct.
Enabling profiling is as simple as adding -pg to the gcc compile flags. Compile as follows:
gcc example1.c -pg -o example1 -O2 -lc
Once you have built the application, simply run it as normal:
./example1 50000
Once this completes, you should see a file called gmon.out created in the current directory.
First look at the "flat profile," which you obtain with the gprof command, passing the executable itself and the gmon.out file as follows:
gprof example1 gmon.out -p
Which outputs the following:
Listing 2. Results of flat profile
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 80.24 63.85 63.85 50000 1.28 1.28 b 20.26 79.97 16.12 50000 0.32 0.32 a |
You can see from this output that, as expected, approximately four times as much time is spent in b as is spent in a. The actual numbers are not very useful; they tend to be very inaccurate due to rounding errors.
Astute readers will notice that many of the functions called (such as printf) are missing from the output. This is because they reside within the C runtime library (libc.so), which was not (in this case) compiled with -pg and, hence, no profiling is gathered for any functions within this library. I will return to this below.
Next you might like to see the "call graph," which you obtain as follows:
gprof example1 gmon.out -q
Which outputs the following:
Listing 3. Call graph
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.01% of 79.97 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.00 79.97 main [1]
63.85 0.00 50000/50000 b [2]
16.12 0.00 50000/50000 a [3]
-----------------------------------------------
63.85 0.00 50000/50000 main [1]
[2] 79.8 63.85 0.00 50000 b [2]
-----------------------------------------------
16.12 0.00 50000/50000 main [1]
[3] 20.2 16.12 0.00 50000 a [3]
-----------------------------------------------
|
Finally, you might want to have an "annotated source" listing, which prints out the source code to the application, with notes on how many times each function is called.
To use this function, compile the source with debugging enabled so that the source is put into the executable:
gcc example1.c -g -pg -o example1 -O2 -lc
Re-run the application as before:
./example1 50000
Your gprof command is now:
gprof example1 gmon.out -A
This outputs the following:
Listing 4. Annotated source
*** File /home/martynh/profarticle/example1.c:
#include <stdio.h>
50000 -> int a(void) {
int i=0,g=0;
while(i++<100000)
{
g+=i;
}
return g;
}
50000 -> int b(void) {
int i=0,g=0;
while(i++<400000)
{
g+=i;
}
return g;
}
int main(int argc, char** argv)
##### -> {
int iterations;
if(argc != 2)
{
printf("Usage %s <No of Iterations>\n", argv[0]);
exit(-1);
}
else
iterations = atoi(argv[1]);
printf("No of iterations = %d\n", iterations);
while(iterations--)
{
a();
b();
}
}
Top 10 Lines:
Line Count
3 50000
11 50000
Execution Summary:
3 Executable lines in this file
3 Lines executed
100.00 Percent of the file executed
100000 Total number of line executions
33333.33 Average executions per line
|
As I alluded to earlier, profiling support is added by the compiler, so if you wish to obtain profiling information from any shared libraries, including the C library (libc.a), you need to also compile them with -pg. Luckily, many distributions ship with a version of the C library already compiled with profiling support (libc_p.a).
In the case of the distribution I use, gentoo, you need to add "profile" to your USE flags and re-emerge glibc. Once this finishes, you should see that /usr/lib/libc_p.a has been created. For distributions that do not ship with a libc_p as standard, you will need to check if it is installable separately, or you might need to download the glibc source and build it yourself.
Once you have a libc_p.a file, you can simply rebuild your example like so:
gcc example1.c -g -pg -o example1 -O2 -lc_p
If you then re-run the application as before and obtain a flat profile or a call graph, you should see lots of C runtime functions including printf (none of which are significant in our test program).
Now that you know how to use gprof, you can very simply and effectively profile your applications to analyze, and hopefully remove, performance bottlenecks.
However, you might have noticed by now one of gprof's biggest limitations: it only profiles user time while the application is running. Typically, applications spend some of their time in user code and some in "system code," such as kernel system calls.
If you modify Listing 1 slightly, you can see the problem clearly:
Listing 5. Adding system calls to Listing 1
#include <stdio.h>
int a(void) {
sleep(1);
return 0;
}
int b(void) {
sleep(4);
return 0;
}
int main(int argc, char** argv)
{
int iterations;
if(argc != 2)
{
printf("Usage %s <No of Iterations>\n", argv[0]);
exit(-1);
}
else
iterations = atoi(argv[1]);
printf("No of iterations = %d\n", iterations);
while(iterations--)
{
a();
b();
}
}
|
As you can see, I have modified the code so that rather than sitting in busy loops, a and b instead call the C-runtime function sleep to suspend execution for 1 and 4 seconds, respectively.
Compile this application as before:
gcc example2.c -g -pg -o example2 -O2 -lc_p
and execute it to loop 30 times:
./example2 30
The resulting flat profile shows the following:
Listing 6. Flat profile showing system calls
Flat profile: Each sample counts as 0.01 seconds. no time accumulated % cumulative self self total time seconds seconds calls Ts/call Ts/call name 0.00 0.00 0.00 120 0.00 0.00 sigprocmask 0.00 0.00 0.00 61 0.00 0.00 __libc_sigaction 0.00 0.00 0.00 61 0.00 0.00 sigaction 0.00 0.00 0.00 60 0.00 0.00 nanosleep 0.00 0.00 0.00 60 0.00 0.00 sleep 0.00 0.00 0.00 30 0.00 0.00 a 0.00 0.00 0.00 30 0.00 0.00 b 0.00 0.00 0.00 21 0.00 0.00 _IO_file_overflow 0.00 0.00 0.00 3 0.00 0.00 _IO_new_file_xsputn 0.00 0.00 0.00 2 0.00 0.00 _IO_new_do_write 0.00 0.00 0.00 2 0.00 0.00 __find_specmb 0.00 0.00 0.00 2 0.00 0.00 __guard_setup 0.00 0.00 0.00 1 0.00 0.00 _IO_default_xsputn 0.00 0.00 0.00 1 0.00 0.00 _IO_doallocbuf 0.00 0.00 0.00 1 0.00 0.00 _IO_file_doallocate 0.00 0.00 0.00 1 0.00 0.00 _IO_file_stat 0.00 0.00 0.00 1 0.00 0.00 _IO_file_write 0.00 0.00 0.00 1 0.00 0.00 _IO_setb 0.00 0.00 0.00 1 0.00 0.00 ____strtol_l_internal 0.00 0.00 0.00 1 0.00 0.00 ___fxstat64 0.00 0.00 0.00 1 0.00 0.00 __cxa_atexit 0.00 0.00 0.00 1 0.00 0.00 __errno_location 0.00 0.00 0.00 1 0.00 0.00 __new_exitfn 0.00 0.00 0.00 1 0.00 0.00 __strtol_internal 0.00 0.00 0.00 1 0.00 0.00 _itoa_word 0.00 0.00 0.00 1 0.00 0.00 _mcleanup 0.00 0.00 0.00 1 0.00 0.00 atexit 0.00 0.00 0.00 1 0.00 0.00 atoi 0.00 0.00 0.00 1 0.00 0.00 exit 0.00 0.00 0.00 1 0.00 0.00 flockfile 0.00 0.00 0.00 1 0.00 0.00 funlockfile 0.00 0.00 0.00 1 0.00 0.00 main 0.00 0.00 0.00 1 0.00 0.00 mmap 0.00 0.00 0.00 1 0.00 0.00 moncontrol 0.00 0.00 0.00 1 0.00 0.00 new_do_write 0.00 0.00 0.00 1 0.00 0.00 printf 0.00 0.00 0.00 1 0.00 0.00 setitimer 0.00 0.00 0.00 1 0.00 0.00 vfprintf 0.00 0.00 0.00 1 0.00 0.00 write |
If you analyze the output, you can see that, although the profiler has registered the correct number of times each function has been called, the time registered for these functions (and indeed all functions) is 0.00. This is because the sleep function actually makes a call into kernel space to suspend the application and then effectively halts the execution and waits for the kernel to wake it up again. As the total amount of time executing in user space is tiny compared to the time spent sleeping in kernel space, it gets rounded to zero. The reason for this is that gprof works by instrumenting the program such that measurements are taken at fixed intervals of the program's run time. Therefore, no measurements are taken while the program is not running.
This is really a mixed blessing. On the one hand, it makes it difficult to optimize applications that spend most of their time in kernel space or that run slowly due to external factors such as overloading the operating system's I/O subsystem. On the other hand, it means profiling is not affected by anything else occurring on the system (such as another user using lots of CPU time).
In general, a good benchmark for seeing how useful gprof will be in helping you optimize your application is to run it under the time command. This command notes how long it takes for an application to complete and also measures how long it spends in user space and in kernel space.
If you look at the Listing 2 example:
time ./example2 30
This outputs the following:
Listing 7. Output of time command
No of iterations = 30 real 2m30.295s user 0m0.000s sys 0m0.004s |
You can see that approximately no time is spent executing code in user space, so gprof will not really be useful here.
Despite the limitations outlined above, gprof can be a very useful tool for optimizing your code, particularly if your code is generally CPU bound in user space. It is usually a good idea to run your application under time first to determine if gprof is likely to yield useful information.
If gprof is not suitable for your profiling needs, a number of alternatives overcome some of gprof's limitations, including OProfile and Sysprof (see Resources for links to more information on these tools).
On the other hand -- assuming you have gcc installed -- one of the main advantages of gprof over the alternatives is that you most likely already have it installed on any Linux machine you use.
Learn
-
For more information on using gprof, read the GNU profiler manual.
-
Learn more about OProfile in "Identify performance bottlenecks with OProfile for Linux on POWER" (developerWorks, May 2005) and "Smashing performance with OProfile" (developerWorks, October 2003).
-
Visit the OProfile home page.
-
Read more about the System-wide Linux profiler (Sysprof).
-
In the developerWorks Linux zone, find more resources for Linux developers.
-
Stay current with developerWorks technical events and Webcasts.
Get products and technologies
-
Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
-
With IBM trial software, available for download directly from developerWorks, build your next development project on Linux.
Discuss
-
Check out developerWorks
blogs and get involved in the developerWorks community.
Martyn Honeyford graduated from Nottingham University with a BS in Computer Science in 1996. He has worked as a software engineer at IBM UK Labs in Hursley, England, ever since. His current role is as a developer in the WebSphere MQ Everyplace development team. When not working, Martyn can usually be found either playing the electric guitar (badly) or playing video games more than most people would consider healthy.



