Smashing performance with OProfile

Identifying performance bottlenecks in real-world systems

Analyzing the performance of the Linux operating system and application code can be difficult due to unexpected interactions between the hardware and the software, but profiling is one way you can identify such performance problems. This article looks at OProfile, a profiling tool for Linux that will be included in the upcoming stable kernel.

Prasanna Panchamukhi (prasanna@in.ibm.com), Developer, Linux Technology Center, IBM India Software Labs

Prasanna S. Panchamukhi works as a developer for IBM's Linux Technology Center in Bangalore, India. He is currently involved in improving various debugging tools for Linux. Previously, he was involved in writing fiber channel device drivers and developing network processor applications, as well as maintaining Unix operating systems. You can reach Prasanna at prasanna@in.ibm.com.



16 October 2003

Profiling is a formal summary or analysis of data, often in the form of a graph or table, representing distinctive performance features or characteristics. A profiling table gives the percentage and number of samples collected for specified processor events, such as cache line misses, Translation Lookaside Buffer (TLB) misses, and so on.

OProfile is one of several profiling and performance monitoring tools for Linux. It works on various architectures, including the IA32, IA64, and AMD Athlon families, has a low overhead, and will be included in the 2.6 version of the kernel.

OProfile can help you identify performance problems such as poor cache utilization, inefficient type conversion and redundant operations, and branch mispredictions, as well as optimization opportunities such as loop unrolling. It collects information about processor events, including TLB misses, stalls, memory references, the total lines allocated in the DCU (Data Cache Unit), the number of cycles of a DCU miss, and the number of cacheable and non-cacheable instruction fetches. OProfile is fine-grained: it can collect samples for a set of instructions, or for a function, system call, or interrupt handler. OProfile works by sampling, and using the collected profile data, you can easily pinpoint performance problems.

Installing OProfile

OProfile is included in the Linux 2.5 kernel and above and on most newer distributions, including Red Hat 9. You can also download OProfile using the link in the Resources section later in this article. You'll need to recompile the kernel with OProfile enabled; here is how to do that:

  1. Enable OProfile:
     #cd /usr/src/linux
     #make xconfig (or #make menuconfig)

    Enable OProfile in the profiling menu and set CONFIG_PROFILING=y and CONFIG_OPROFILE=y in the .config file. Also enable Local APIC and IO-APIC in the "Processor type and features" menu.

  2. Recompile like so:
    #make dep (2.4 kernels only)
    #make bzImage
  3. Boot the new kernel.
  4. To configure and install the OProfile utilities, enter this:
    #./configure --with-linux=/usr/src/linux/ --with-qt-dir=/usr/lib/qt/
         --with-kernel-support
    #make
    #make install

For information about system requirements and for more detailed installation instructions, see the links in the Resources section.


Overview of OProfile tools

  • op_help: Lists available events with short descriptions
  • opcontrol: Controls OProfile data collection
  • oprofpp: Retrieves useful profile data
  • op_time: Lists the relative profile values for all images on the system
  • op_to_source: Produces annotated source, assembly, or mixed source-assembly output
  • op_merge: Merges sample files belonging to the same application
  • op_import: Converts sample database files from the foreign format (abi) to native format

Three quick steps to start profiling

  1. Start up the profiler:
    # opcontrol --setup --ctr0-event=CPU_CLK_UNHALTED
        --ctr0-count=600000 --vmlinux=/usr/src/linux-2.4.20/vmlinux
        (For RTC mode users, use --rtc-value=2048 instead.)
     # opcontrol --start
  2. Now the profiler is running; go do whatever you want to profile.
  3. Dump the profiled data:
    # opcontrol --dump
    (Use --stop to halt collection instead, or --shutdown to also stop the daemon.)

OProfile analysis: Cache memory utilization problem

Caches are the memory closest to the processor's execution units. Cache memory is much smaller and faster than main memory and can be either internal or external to the processor chip. Caches hold copies of the most frequently used instructions and data; by providing fast access to that data, they let software run much faster than if it had to fetch everything from main memory. On the Intel Pentium III used for the examples in this article, data is cached in lines of 32 bytes each.

For multiple CPUs, the cache lines in a CPU's cache are invalidated when a CPU modifies data that is shared between CPUs.

If the data or instruction is not present in the cache, or if the cache line has been invalidated, the CPU updates its cache by reading the data from main memory. The processor event that counts these cache line fills is known as L2_LINES_IN. Reading data from main memory costs many more CPU cycles. OProfile helps you identify a cache memory problem like the one in Listing 1.

Listing 1. Code with cache memory problem
/*
 * Shared data being modified by two threads running on different CPUs.
 */

/* Shared structure between two threads, which will be optimized later */
struct shared_data_align {
    unsigned int num_proc1;
    unsigned int num_proc2;
};
/* 
 * Shared structure between two threads remains unchanged (non optimized)
 * This is required in order to collect some samples for the L2_LINES_IN event.
 */
struct shared_data_nonalign {
    unsigned int num_proc1;
    unsigned int num_proc2;
};

/*
 * In the example program below, the parent process creates a clone
 * thread sharing its memory space. The parent thread running on one CPU 
 * increments the num_proc1 element of the common and common_aln. The cloned
 * thread running on another CPU increments the value of num_proc2 element of
 * the common and common_aln structure.
 */
/* Declare global data */
struct shared_data_nonalign common_aln;

/*Declare local shared data */
struct shared_data_align common;

    /* Now clone a thread sharing memory space with the parent process */
    if ((pid = clone(func1, buff+8188, CLONE_VM, &common)) < 0) {
        perror("clone");
        exit(1);
    }
    
    /* Increment num_proc1 of common in a loop */
    for (j = 0; j < 200; j++)
        for(i = 0; i < 100000; i++) {
            common.num_proc1++;
        }

    /* Increment num_proc1 of common_aln in a loop */
    for (j = 0; j < 200; j++)
        for(i = 0; i < 100000; i++) {
            common_aln.num_proc1++;
        }
/*
 * The routine below is called by the cloned thread, to increment the num_proc2 
 * element of common and common_aln structure in loop.
 */
int func1(struct shared_data_align *com)
{
    int i, j;
    /* Increment the value of num_proc2 in loop */
    for (j = 0; j < 200; j++)
        for (i = 0; i < 100000; i++) {
            com->num_proc2++;
        }

    /* Increment the value of num_proc2 in loop */
    for (j = 0; j < 200; j++)
        for (i = 0; i < 100000; i++) {
            common_aln.num_proc2++;
        }
    return 0;
}

The above program is profiled for the event L2_LINES_IN. Note the samples collected in func1 and in main:

Listing 2. OProfile data for L2_LINES_IN
# opcontrol --setup --ctr0-event=L2_LINES_IN
    --ctr0-count=500 --vmlinux=/usr/src/linux-2.4.20/vmlinux
#opcontrol --start
#./appln
#opcontrol --stop
#oprofpp -l ./appln
Cpu type: PIII
Cpu speed was (MHz estimation) : 699.57
Counter 0 counted L2_LINES_IN events (number of allocated lines in L2) with a 
unit mask of 0x00 (No unit mask) count 500
vma      samples  %           symbol name
080483d0 0        0           _start
080483f4 0        0           call_gmon_start
08048420 0        0           __do_global_dtors_aux
08048480 0        0           fini_dummy
08048490 0        0           frame_dummy
080484c0 0        0           init_dummy
08048630 0        0           __do_global_ctors_aux
08048660 0        0           init_dummy
08048670 0        0           _fini
080484d0 4107     49.2033     main
080485b8 4240     50.7967     func1

Now, profile the same application (executable) with the event CPU_CLK_UNHALTED, which collects samples on the number of cycles the CPU runs without halting. The number of samples collected in a routine is proportional to the time the processor spends executing its instructions: the more samples collected, the more time the processor has spent there. Note the number of samples collected in main and func1:

Listing 3. OProfile data for CPU_CLK_UNHALTED
#oprofpp -l ./appln
Cpu type: PIII
Cpu speed was (MHz estimation) : 699.667
Counter 0 counted CPU_CLK_UNHALTED events (clocks processor is not halted) with
a unit mask of 0x00 (No unit mask) count 10000
vma      samples  %           symbol name
080483d0 0        0           _start
080483f4 0        0           call_gmon_start
08048420 0        0           __do_global_dtors_aux
08048480 0        0           fini_dummy
08048490 0        0           frame_dummy
080484c0 0        0           init_dummy
08048640 0        0           __do_global_ctors_aux
08048670 0        0           init_dummy
08048680 0        0           _fini
080484d0 40317    49.9356     main
080485bc 40421    50.0644     func1

We will now optimize the shared data structure by separating its two elements into different cache lines. On the Pentium III, each L2 cache line is 32 bytes, so padding 28 bytes after the first element of the shared_data_align structure pushes the two elements onto two different cache lines. The parent thread modifies num_proc1, which is read into a cache line on CPU 1 on first access; subsequent accesses by the parent thread then hit that line. The cloned thread modifies num_proc2, which now sits on a different cache line, read into CPU 2. Because the two threads running in parallel now modify elements on different cache lines, modifying one line no longer forces the other to be re-read from memory, and the number of cache lines read in is reduced.

Listing 4. Optimized data structures
    /*
     * The padding is added to separate the two unsigned ints in such a
     * way that num_proc1 and num_proc2 fall on two different cache lines.
     */

struct shared_data_align {
    unsigned int num_proc1;
    char padding[28];
    unsigned int num_proc2;
};
    /*
     * This structure remains unchanged, so that some cache line
     * reads can still be seen in the profile data.
     */
struct shared_data_nonalign {
    unsigned int num_proc1;
    unsigned int num_proc2;
};

Note that shared_data_nonalign has not been optimized.

Since you already have OProfile enabled and running (you do, don't you?), here are some profiles you can try running yourself: collect OProfile data for the event L2_LINES_IN with the count set to 500, as shown in Listing 2.

Also try collecting OProfile data for the event CPU_CLK_UNHALTED with the count set to 10000; compare the data collected with and without optimization and note the performance improvement.


OProfile analysis: Branch misprediction

Modern processors perform branch prediction (see Resources), exploiting regularities in the underlying algorithms and data. When a prediction is correct, the cost of a branch is low; but prediction is not always right, and some branches are inherently hard to predict. You can improve branch behavior in your software, and you can identify the trouble spots by profiling the application and the kernel for the event BR_MISS_PRED_TAKEN_RET.

The code in Listing 5 exhibits branch misprediction. The example program creates a cloned thread sharing the parent process's memory space. The parent thread, running on one processor, toggles the value of num_proc1 based on the value of num_proc2, branching on a variable modified by the other thread. The generated code assumes that num_proc2 will equal 1 every time and lays out that branch as the default path. Whenever num_proc2 is not 1, the branch is mispredicted.

The cloned thread, running on another processor, toggles the value of num_proc2 based on the value of num_proc1, again branching on a variable modified by the other thread. Because num_proc2 is therefore not always equal to 1, branch mispredictions occur in the parent thread; likewise, the parent thread's toggling of num_proc1 causes mispredictions in the cloned thread.

Listing 5. Code showing branch misprediction
/*shared structure between the two processes */
struct share_both {
    unsigned int num_proc1;
    unsigned int num_proc2;
};
    
/* 
 * The parent process clones a thread by sharing its memory space. The parent
 * process just toggles the value of num_proc1 in loop.
 */

/* Declare local shared data */
struct share_both common;

    /* clones a thread with memory space same as parent thread*/
    if ((pid = clone(func1, buff+8188, CLONE_VM, &common)) < 0) {
        perror("clone");
        exit(1);
    }
    /* toggles the value of num_proc1 in loop */    
    for (j = 0; j < 200; j++)
        for(i = 0; i < 100000; i++) {
            if (common.num_proc2)
                common.num_proc1 = 0;
            else 
                common.num_proc1 = 1;
        }

    
/*
 * The function below is called by the cloned thread, which just toggles the
 * value of num_proc2 every time in the loop.
 */
int func1(struct share_both *com)
{
    int i, j;
    /* toggles the value of num_proc2 in loop */    
    for (j = 0; j < 200; j++)
        for (i = 0; i < 100000; i++) {
            if (com->num_proc1)
                   com->num_proc2 = 0;
            else 
                   com->num_proc2 = 1;
        }
    return 0;
}

Branch misprediction can be shown by compiling the above code without optimization.

#gcc -o branch parent_thread_source_code clone_thread_source_code

Now, profile the branch application for the event BR_MISS_PRED_TAKEN_RET with the count set to 500, as shown in Listing 2. Note the samples collected in main and func1.

Also profile the same executable for the event CPU_CLK_UNHALTED and with the count set to 10000, as shown in Listing 2.

You can optimize branch misprediction by using the -O2 compiler option:

#gcc -O2 -c clone_thread_source_code
#gcc -o branch clone_thread_source_code.o parent_thread_source_code

Now, start profiling the application for the BR_MISS_PRED_TAKEN_RET event and CPU_CLK_UNHALTED as in Listing 2; note the performance improvement.

Let's look at another example of branch misprediction in the next section on kernel profiling.


Examples of kernel profiling

The profile data shown below was collected for the event BR_MISS_PRED_TAKEN_RET while running the kernbench benchmark on the 2.5.70 kernel. 23,360 samples were collected for vma_merge and 20,717 for do_mmap_pgoff.

Listing 6. Kernel profile data
# oprofpp -l -i /boot/vmlinux | tail -20
c0143510 4719     1.26446     page_add_rmap
c0117740 4791     1.28375     schedule
c0140320 4825     1.29286     find_vma_prepare
c010f720 4862     1.30278     sys_mmap2
c0134fc0 5005     1.34109     __alloc_pages
c0123670 5473     1.46649     run_timer_softirq
c0134800 5648     1.51339     bad_range
c0139250 6571     1.7607      mark_page_accessed
c0143bd0 6919     1.85395     __pte_chain_free
c013f180 6973     1.86842     do_no_page
c0140ec0 7393     1.98096     get_unmapped_area
c01400f0 8020     2.14896     vm_enough_memory
c0140ff0 9897     2.65191     find_vma
c01594e0 10939    2.93111     link_path_walk
c0134e70 11467    3.07259     buffered_rmqueue
c0117370 11690    3.13234     scheduler_tick
c013eeb0 17463    4.67922     do_anonymous_page
c01153e0 20322    5.44529     do_page_fault
c01408e0 20717    5.55113     do_mmap_pgoff
c0140600 23360    6.25933     vma_merge

Branch misprediction can be eliminated by removing branches altogether. On Intel IA32 processors, branches can be eliminated using the SETcc instructions or, on P6-family processors, the conditional move instructions CMOVcc and FCMOVcc.

The following line of C code contains a conditional branch:

ebx = (A < B) ? CONST1 : CONST2;

Here are the assembly instructions for the above C code:

Listing 7. Equivalent assembly instructions
    cmp A, B             ; compare
    jge L30              ; conditional branch
    mov ebx, CONST1 
    jmp L31              ; unconditional branch
  L30:
    mov ebx, CONST2
  L31:

This code can be optimized to eliminate branches like this:

Listing 8. Equivalent assembly instructions, branches removed
    xor ebx, ebx             ; clear ebx
    cmp A, B
    setge bl                 ; bl = 1 if A >= B, else 0
    dec ebx                  ; ebx = 0 if A >= B, else 0xFFFFFFFF
    and ebx, (CONST1-CONST2) ; ebx = 0 or (CONST1-CONST2)
    add ebx, CONST2          ; ebx = CONST2 if A >= B, else CONST1

The optimized code first clears register EBX, then compares A and B. If A is greater than or equal to B, SETGE sets the low byte of EBX to 1. The DEC then turns EBX into either zero or an all-ones mask. ANDing the mask with the difference (CONST1-CONST2) leaves either zero or that difference, and adding CONST2 yields either CONST2 (when A >= B) or CONST1 (when A < B) — the same result as the branching version, with no jumps.

I hope you have gained some insight into OProfile and the ways you can optimize the kernel code. The 2.6 kernel release is coming up and there will be a lot to profile.

Resources
