Evaluate performance for Linux on POWER

Analyze performance using Linux tools


Application performance evaluation can be a complex task on modern machines, and commonly available tools rarely cover all the performance variables. Each workload differs in which computer subsystem it stresses: measuring and tuning a CPU-bound program is quite different from tuning an I/O-bound or memory-bound program. In this article we focus on CPU-bound and memory-bound programs in compiled-language environments (C, C++, and others). We demonstrate how to:

  • Find program hotspots (regions, functions, or methods where a high proportion of the executed instructions occur)
  • Measure how the program behaves on POWER7 using the hardware performance counters available on the processor
  • Identify the tools to use for performance evaluation on Linux

CPI model for POWER7

Understanding application performance analysis begins with a discussion of CPI metrics. The Cycles per Instruction (CPI) metric is the number of processor cycles needed to complete an instruction. Each instruction is decomposed into multiple stages: a classic RISC pipeline has an instruction fetch stage followed by instruction decode/register fetch, execution, an optional memory access, and finally the writeback. A CPU can improve its CPI (that is, lower its CPI value) by exploiting instruction-level parallelism: each stage handles a different instruction at the same time. When optimizing, try to minimize the CPI value to maximize system utilization. Figure 1 shows an optimal instruction flow in a pipelined processor.

Figure 1. Optimal instruction flow of a pipelined processor
Chart showing independent instructions flowing through fetch, decode, execute and writeback

Sometimes one stage is not fully independent of the others, or an instruction has dependencies that force the processor to satisfy them before continuing execution. For instance, a memory load followed by an arithmetic instruction that uses the loaded value makes the processor first fetch the data from cache or memory and only then issue the arithmetic instruction. When this happens, the pipeline is said to stall. Figure 2 shows what a stalled pipeline might look like.

Figure 2. A pipelined processor with stalls
Chart showing dependent operations where instructions are held up waiting for others to complete

In the examples in Figure 1 and Figure 2, consider that during 11 cycles of operation a fully populated pipeline (where one instruction is completed per cycle) lets the processor execute eight instructions. However, when a three-cycle stall occurs, only five instructions are executed in the same number of cycles. The performance loss is about 40%. Depending on the algorithm, some stalls are unavoidable; however, careful analysis can provide hints and advice on how to rewrite or adjust some pieces of code to avoid them. Find a more complete and didactic explanation of modern CPU pipelining and instruction-level parallelism in the article "Modern Microprocessors - a 90 minute guide" (see Related topics).

The CPI Breakdown Model (CBM) relates functional processor stages to performance counters to show which CPU functional unit is generating stalls. The CBM depends on the CPU architecture and processor model; the Power Architecture and Intel Architecture have completely different CBMs, and the POWER5 CBM, although similar, differs from the POWER7 CBM. Figure 3 shows a part of the POWER7 CBM. (See a text version of this information.)

Figure 3. Partial POWER 7 CBM
Screen capture of CPI Breakdown Model for Power7

In the Power Architecture, hardware performance counters are a set of special-purpose registers whose contents are updated when certain events occur within the processor. The POWER7 processor has a built-in Performance Monitoring Unit (PMU) with six thread-level Performance Counter Monitors (PCMs) per PMU. Four of these are programmable, meaning it is possible to monitor four events at the same time, out of more than 500 possible performance events. POWER7 performance counters are defined in groups, and the PMU can only watch events of the same group at one time. Figure 3 shows a subset of the performance counters used to define the POWER7 CBM. Following a profile, the counters in Figure 3 denote which CPU functional unit is causing processor stalls and provide possible hints on how to tune the algorithm to eliminate them.

In Figure 3, white boxes are specific POWER7 PCMs watched in a profile. From their values, the gray boxes [each marked with an asterisk (*)] are calculated; these metrics have no specific hardware counters.

Note: Find a comprehensive PMU reference for POWER7 in the paper, "Comprehensive PMU Event Reference POWER7" (See Related topics).

Tools on Linux

How can you use the PCMs found in POWER7 processors? Although various profiling methods are available on POWER, such as hardware interrupts, code instrumentation (for example, gprof), and operating system hooks (SystemTap), the PCMs provide an extensive set of counters that work directly with processor functionality. A PCM profiler constantly samples the processor register values at regular intervals using operating system interrupts. Although sample profiling might produce less numerically accurate results than instruction tracing, it has less impact on overall system performance and allows the target benchmark to run at nearly full speed. The resulting data is not exact; it is an approximation with an error margin.

The two most commonly-used tools for PCM profiling on Linux are OProfile and perf (see Related topics). Although both use the same principle, constantly sampling the special hardware register (through a syscall) along a workload's backtrace, each is configured and used in a different way.

The OProfile tool is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. It consists of a kernel driver and daemon for collecting sample data, and several post-profiling tools for turning data into information. Debug symbols (-g option to gcc) are not necessary unless you want annotated source. With a recent Linux 2.6 kernel, OProfile can provide gprof-style call-graph profiling information. OProfile has a typical overhead of 1-8%, depending on sampling frequency and workload.

On POWER, OProfile works by watching groups of hardware performance counters, though different groups cannot be used together. This means that collecting different performance counters from the same workload may require running it multiple times with different OProfile event configurations, and that you cannot watch the entire POWER7 CBM at the same time. The available groups are defined in the aforementioned "Comprehensive PMU Event Reference POWER7" document, or by running the command in Listing 1:

Listing 1. OProfile groups listing
# opcontrol -l

The commands in Listing 2 demonstrate a simple OProfile configuration and invocation:

Listing 2. OProfile POWER7 CPU cycles configuration
# opcontrol --no-vmlinux
# opcontrol -e PM_CYC_GRP1:500000 -e PM_INST_CMPL_GRP1:500000 -e PM_RUN_CYC_GRP1:500000 
# opcontrol --start

Run the workload. After it completes, issue the commands in Listing 3 to collect the profile data and stop OProfile:

Listing 3. OProfile run command sequence
# opcontrol --dump 
# opcontrol --stop
# opcontrol --shutdown

To get the performance counter report, issue the command in Listing 4:

Listing 4. OProfile report generation
# opreport -l > workload_report

Note: Find a comprehensive guide to OProfile (although not updated for POWER7) in the developerWorks article "Identify performance bottlenecks with OProfile for Linux on POWER" (see Related topics).

The perf tool, introduced in Linux kernel 2.6.29, analyzes performance events at both hardware and software levels. The perf tool has the advantage of being program oriented, instead of system oriented like OProfile. It has some preset performance counter lists, like 'cpu-cycles OR cycles', 'branch-misses', or 'L1-icache-prefetch-misses', and it has the ability to multiplex the PMU groups, which allows gathering of multiple performance counters from different groups at the same time at the cost of sample precision.

One drawback is that, although perf allows gathering of hardware performance counters directly, it does not recognize the counter names defined by the POWER7 CBM; you need to use raw hexadecimal codes instead. Table 1 is a mapping of OProfile events to the hexadecimal codes which you can use with perf (using the record raw events option) to utilize the CBM for POWER7.

Table 1. POWER7 perf events raw codes
Counter             Raw code

Note: Find a comprehensive guide to perf (although not updated for POWER7) in the IBM Wiki "Using perf on POWER7 systems" (see Related topics).

You can get the raw codes used with perf that correspond to the POWER7 events defined in OProfile from the libpfm4 project (see Related topics): They are defined in the POWER7 specific header (lib/events/power7_events.h). The example program examples/showevtinfo also shows the event names and corresponding raw hexadecimal codes.

To obtain counter information, profiling is a common approach. Profiling allows a developer to identify hotspots in code execution and data access, find performance-sensitive areas, understand memory access patterns, and more. Before starting to profile, it is necessary to work out a performance evaluation strategy. The program might be composed of various modules and/or dynamic shared objects (DSOs), it might utilize the kernel intensively, it might depend more on data access patterns (high pressure on L2 or L3 cache access), or it might focus on the vector operation units. The next section focuses on possible performance evaluation strategies.

Strategies for Performance Evaluation

An initial performance evaluation step is to find program hotspots by inspecting the CPU cycle utilization counters. To do this on POWER7, watch the events listed in Table 2:

Table 2. POWER7 CPU cycle utilization counters
PM_CYC              Processor Cycles
PM_INST_CMPL        Number of PowerPC instructions that completed
PM_RUN_CYC          Processor Cycles gated by the run latch. Operating systems use the run latch to indicate when they are doing useful work. The run latch is typically cleared in the OS idle loop. Gating by the run latch filters out the idle loop.
PM_RUN_INST_CMPL    Number of run instructions completed

Running OProfile with these events shows the overall time the processor spent in each symbol. Listing 5 shows example profile output (from the command opreport -l) for the 403.gcc component of the SPECcpu2006 benchmark suite, compiled with IBM Advance Toolchain 5.0 for POWER (see Related topics).

Listing 5. Output from 'opreport -l' for 403.gcc benchmark (counters PM_CYC_GRP1 and PM_INST_CMPL_GRP1)
CPU: ppc64 POWER7, speed 3550 MHz (estimated) 
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit 
mask of 0x00 (No unit mask) count 500000 
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC 
Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 500000 

samples  %        samples  %        image name      app name        symbol name 
204528    7.9112  32132     1.3848  gcc_base.none   gcc_base.none   reg_is_remote_cons\
125218    4.8434  246710   10.6324  gcc_base.none   gcc_base.none   bitmap_operation 
113190    4.3782  50950     2.1958    memset 
90316     3.4934  22193     0.9564  gcc_base.none   gcc_base.none   compute_transp 
89978     3.4804  11753     0.5065  vmlinux         vmlinux         .pseries_dedicated_\
88429     3.4204  130166    5.6097  gcc_base.none   gcc_base.none   bitmap_element_\
67720     2.6194  41479     1.7876  gcc_base.none   gcc_base.none   ggc_set_mark 
56613     2.1898  89418     3.8536  gcc_base.none   gcc_base.none   canon_rtx 
53949     2.0868  6985      0.3010  gcc_base.none   gcc_base.none   delete_null_\
51587     1.9954  26000     1.1205  gcc_base.none   gcc_base.none   ggc_mark_rtx_\
48050     1.8586  16086     0.6933  gcc_base.none   gcc_base.none   single_set_2 
47115     1.8224  33772     1.4555  gcc_base.none   gcc_base.none   note_stores
Listing 6. Output from 'opreport -l' for 403.gcc benchmark (counters PM_RUN_CYC_GRP1 and PM_RUN_INST_CMPL_GRP1)
Counted PM_RUN_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles gated by the 
run latch.  Operating systems use the run latch to indicate when they are doing useful 
work.  The run 
latch is typically cleared in the OS idle loop.  Gating by the run latch filters out 
the idle loop.) with a unit mask of 0x00 (No unit mask) count 500000 
Counted PM_RUN_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of run 
instructions completed.) with a unit mask of 0x00 (No unit mask) count 500000 

samples  %        samples  %        image name      app name        symbol name 
204538    8.3658  32078     1.3965  gcc_base.none   gcc_base.none   reg_is_remote_consta\
124596    5.0961  252227   10.9809  gcc_base.none   gcc_base.none   bitmap_operation 
112326    4.5943  50890     2.2155    memset 
90312     3.6939  21882     0.9527  gcc_base.none   gcc_base.none   compute_transp 
0              0  0              0  vmlinux         vmlinux         .pseries_dedicated\
88894     3.6359  124831    5.4346  gcc_base.none   gcc_base.none   bitmap_element_all\
67995     2.7811  41331     1.7994  gcc_base.none   gcc_base.none   ggc_set_mark
56460     2.3093  89484     3.8958  gcc_base.none   gcc_base.none   canon_rtx
54076     2.2118  6965      0.3032  gcc_base.none   gcc_base.none   delete_null_pointer\
51228     2.0953  26057     1.1344  gcc_base.none   gcc_base.none   ggc_mark_rtx_childr\
48057     1.9656  16005     0.6968  gcc_base.none   gcc_base.none   single_set_2 
47160     1.9289  33766     1.4700  gcc_base.none   gcc_base.none   note_stores

Each watched event is represented by a pair of columns in the output. The first column shows the number of samples collected from a PCM for the specified event, and the second shows the percentage of the total samples it represents. As seen in this report, the symbol reg_is_remote_constant_p consumes the most processor cycles and is therefore a good candidate for code optimization. This profile only identifies which symbols consume the most CPU cycles, not whether the processor pipeline is fully utilized. You can investigate pipeline utilization by comparing the counter results.

Consider the counter PM_INST_CMPL_GRP1 (the second pair of columns): the symbol bitmap_operation shows a higher percentage than the symbol reg_is_remote_constant_p. This performance counter increments for each completed processor instruction, whereas PM_CYC_GRP1 counts only the number of CPU cycles consumed. Without further analysis, this might indicate that reg_is_remote_constant_p contains more CPU stalls than bitmap_operation, since the number of instructions completed for reg_is_remote_constant_p is significantly lower. This profile provides an initial hint on which symbol to focus subsequent optimization efforts.

Before you start to dig into the code, it is wise to understand whether the workload is CPU bound or memory bound, because optimization approaches differ for each workload type. For example, memory accesses most often come from cache or main memory (as opposed to NUMA remote-node memory access), and performance depends almost entirely on the algorithms and data structures used. To investigate memory access patterns, watch the two performance counters in Table 3:

Table 3. POWER7 memory utilization counters
PM_MEM0_RQ_DISP    Read requests dispatched for main memory
PM_MEM0_WQ_DISP    Write requests dispatched for main memory

These two counters can indicate whether a memory access pattern is mainly from memory reads, writes, or both. Using the same benchmark as before (403.gcc from SPECcpu2006), the profile shows:

Listing 7. Output from 'opreport -l' for 403.gcc benchmark (counters PM_MEM0_RQ_DISP and PM_MEM0_WQ_DISP)
CPU: ppc64 POWER7, speed 3550 MHz (estimated) 
Counted PM_MEM0_RQ_DISP_GRP59 events ((Group 59 pm_nest2)  Nest events (MC0/MC1/PB/GX), 
Pair0 Bit1) with a unit mask of 0x00 (No unit mask) count 1000 
Counted PM_MEM0_WQ_DISP_GRP59 events ((Group 59 pm_nest2)  Nest events (MC0/MC1/PB/GX), 
Pair3 Bit1) with a unit mask of 0x00 (No unit mask) count 1000 
samples  %        samples  %        app name                 symbol name 
225841   25.8000  289       0.4086  gcc_base.none            reg_is_remote_constant_p.\
90068    10.2893  2183      3.0862  gcc_base.none            compute_transp 
54038     6.1733  308       0.4354  gcc_base.none            single_set_2 
32660     3.7311  2006      2.8359  gcc_base.none            delete_null_pointer_checks 
26352     3.0104  1498      2.1178  gcc_base.none            note_stores 
21306     2.4340  1950      2.7568  vmlinux                  .pseries_dedicated_idle_sl\
18059     2.0631  9186     12.9865             memset 
15867     1.8126  659       0.9316  gcc_base.none            init_alias_analysis

Another interesting set of performance counters to observe is the access pressure on the caches, both L2 and L3. The following example profiles the SPECcpu2006 483.xalancbmk component (see Related topics), built with the GCC of a RHEL 6.2 Linux system. This component uses memory allocation routines heavily, so expect a lot of pressure on the memory subsystem. To observe this, watch the following counters in Table 4 with OProfile:

Table 4. POWER7 cache/memory access counters
PM_DATA_FROM_L2      The processor's Data Cache was reloaded from the local L2 due to a demand load
PM_DATA_FROM_L3      The processor's Data Cache was reloaded from the local L3 due to a demand load
PM_DATA_FROM_LMEM    The processor's Data Cache was reloaded from memory attached to the same module this processor is located on
PM_DATA_FROM_RMEM    The processor's Data Cache was reloaded from memory attached to a different module than this processor is located on

The profile output shows the following:

Listing 8. Output from 'opreport -l' for 483.xalancbmk benchmark (counters PM_DATA_FROM_L2_GRP91 and PM_DATA_FROM_L3_GRP91)
CPU: ppc64 POWER7, speed 3550 MHz (estimated) 
Counted PM_DATA_FROM_L2_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
was reloaded from the local L2 due to a demand load.) with a unit mask of 0x00 (No unit
 mask) count 1000 
Counted PM_DATA_FROM_L3_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
 was reloaded from the local L3 due to a demand load.) with a unit mask of 0x00 (No unit
 mask) count 1000 
samples  %        samples  %        image name     app name       symbol name 
767827   25.5750  7581      0.2525  gcc_base.none  gcc_base.none  bitmap_element_allocate
377138   12.5618  8341      0.2778  gcc_base.none  gcc_base.none  bitmap_operation 
93334     3.1088  3160      0.1052  gcc_base.none  gcc_base.none  bitmap_bit_p 
70278     2.3408  5913      0.1969   _int_free 
56851     1.8936  22874     0.7618  oprofile       oprofile       /oprofile 
47570     1.5845  2881      0.0959  gcc_base.none  gcc_base.none  rehash_using_reg 
41441     1.3803  8532      0.2841   _int_malloc
Listing 9. Output from 'opreport -l' for 483.xalancbmk benchmark (counters PM_DATA_FROM_LMEM_GRP91 and PM_DATA_FROM_RMEM_GRP91)
Counted PM_DATA_FROM_LMEM_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
was reloaded from memory attached to the same module this processor is located on.) with
 a unit mask of 0x00 (No unit mask) count 1000 
Counted PM_DATA_FROM_RMEM_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
 was reloaded from memory attached to a different module than this processor is located 
on.) with a unit mask of 0x00 (No unit mask) count 1000
samples  %        samples  %        image name     app name       symbol name 
1605      0.3344  0              0  gcc_base.none  gcc_base.none  bitmap_element_allocate
1778      0.3704  0              0  gcc_base.none  gcc_base.none  bitmap_operation 
1231      0.2564  0              0  gcc_base.none  gcc_base.none  bitmap_bit_p 
205       0.0427  0              0   _int_free 
583       0.1215  327      100.000  oprofile       oprofile       /oprofile 
0              0  0              0  gcc_base.none  gcc_base.none  rehash_using_reg 
225       0.0469  0              0   _int_malloc

Interpreting the profile output shows that most of the cache pressure comes from L2 access, with almost no demand reload from L3, since the total and relative sample values for L2 access (PM_DATA_FROM_L2) are much higher than for L3 demand reload (PM_DATA_FROM_L3). Further information, such as whether L2 access is causing CPU stalls due to cache misses, can only be obtained with more comprehensive analysis (by watching more counters). One conclusion that can be drawn from this example profile is that main memory access (the PM_DATA_FROM_LMEM event) is quite low compared to cache access, and that there is no remote access (the PM_DATA_FROM_RMEM event), indicating no remote NUMA node memory access. Analysis of hotspots and memory access patterns can give direction to optimization efforts; in this case, further analysis is required to identify what really causes the CPU stalls, because simply identifying the workload hotspots and memory access pattern is not enough.

To come up with better strategies for performance optimization, further analysis requires the perf tool rather than OProfile, because many of the POWER7 CBM counters (the 22 presented in Figure 3) need to be watched simultaneously. Many of these events are in different groups, meaning that using OProfile would require many runs of the same workload. The perf tool multiplexes the watching of hardware counters when the specified counters are in more than one group. Although this results in a less accurate outcome, the overall result tends to be very similar to the expected one, with the advantage of less time spent profiling.

The following example uses perf to profile the same SPECcpu2006 483.xalancbmk component. To profile this component, issue the command in Listing 10:

Listing 10. perf command to generate the POWER7 CBM
$ /usr/bin/perf stat -C 0 -e r100f2,r4001a,r100f8,r4001c,r2001a,r200f4,r2004a,r4004a,
r40014,r30004 taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl > power7_cbm.dat

This command makes perf watch the raw events defined by the -e argument on the CPU specified by -C. The taskset call ensures that the component runs exclusively on CPU number 0. The workload ./Xalan_base.none -v t5.xml xalanc.xsl can be replaced by another application to profile. After the profile is complete, the perf command outputs a simple table with the total count for each raw event and the total number of elapsed seconds:

Listing 11. Output from 'perf stat' for 483.xalancbmk benchmark
 Performance counter stats for 'taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl': 

   366,860,486,404 r100f2                                                       [18.15%] 
     8,090,500,758 r4001a                                                       [13.65%] 
    50,655,176,004 r100f8                                                       [ 9.13%] 
    11,358,043,420 r4001c                                                       [ 9.11%] 
    10,318,533,758 r2001a                                                       [13.68%] 
 1,301,183,175,870 r200f4                                                       [18.22%] 
     2,150,935,303 r2004a                                                       [ 9.10%] 
                 0 r4004a                                                       [13.65%] 
   211,224,577,427 r4004e                                                       [ 4.54%] 
   212,033,138,844 r4004c                                                       [ 4.54%] 
   264,721,636,705 r20016                                                       [ 9.09%] 
    22,176,093,590 r40018                                                       [ 9.11%] 
   510,728,741,936 r20012                                                       [ 9.10%] 
    39,823,575,049 r40016                                                       [ 9.07%] 
     7,219,335,816 r40012                                                       [ 4.54%] 
         1,585,358 r20018                                                       [ 9.08%] 
   882,639,601,431 r4000a                                                       [ 9.08%] 
     1,219,039,175 r2001c                                                       [ 9.08%] 
         3,107,304 r1001c                                                       [13.62%] 
   120,319,547,023 r20014                                                       [ 9.09%] 
    50,684,413,751 r40014                                                       [13.62%] 
   366,940,826,307 r30004                                                       [18.16%] 

     461.057870036 seconds time elapsed

To analyze the perf output against the POWER7 CBM, a Python script is provided (check Downloadable resources), which composes the counter metrics from the collected virtual and hardware counters. To create a report, issue the command in Listing 12:

Listing 12. POWER7 CBM python script invocation
$ power7_cbm.dat

Output similar to Listing 13 will be printed:

Listing 13. Output of the report script for 483.xalancbmk benchmark
CPI Breakdown Model (Complete) 

Metric                         :            Value :    Percent 
PM_CMPLU_STALL_DIV             :    49802421337.0 :        0.0 
PM_CMPLU_STALL_FXU_OTHER       :    67578558649.0 :        5.2 
PM_CMPLU_STALL_SCALAR_LONG     :        2011413.0 :        0.0 
PM_CMPLU_STALL_SCALAR_OTHER    :     7195240404.0 :        0.6 
PM_CMPLU_STALL_VECTOR_LONG     :              0.0 :        0.0 
PM_CMPLU_STALL_VECTOR_OTHER    :     1209603592.0 :        0.1 
PM_CMPLU_STALL_ERAT_MISS       :    22193968056.0 :        1.7 
PM_CMPLU_STALL_REJECT_OTHER    :    18190293594.0 :        1.4 
PM_CMPLU_STALL_DCACHE_MISS     :   261865838255.0 :       20.3 
PM_CMPLU_STALL_STORE           :     2001544985.0 :        0.2 
PM_CMPLU_STALL_LSU_OTHER       :   202313206181.0 :       15.7 
PM_CMPLU_STALL_THRD            :        2025705.0 :        0.0 
PM_CMPLU_STALL_BRU             :   208356542821.0 :       16.2 
PM_CMPLU_STALL_IFU_OTHER       :     2171796336.0 :        0.2 
PM_CMPLU_STALL_OTHER           :    30895294057.0 :        2.4 
PM_GCT_NOSLOT_IC_MISS          :     9805421042.0 :        0.8 
PM_GCT_NOSLOT_BR_MPRED         :     7823508357.0 :        0.6 
PM_GCT_NOSLOT_BR_MPRED_IC_MISS :    11059314150.0 :        0.9 
PM_GCT_EMPTY_OTHER             :    20292049774.0 :        1.6 
PM_1PLUS_PPC_CMPL              :   365158978504.0 :       28.3 
OVERHEAD_EXPANSION             :      590057044.0 :        0.0 
Total                                             :       96.1

This report is based on statistical values within an error margin, so final percentages are not entirely accurate. Even with a high error margin, about 20% of total CPU stalls are due to data cache misses (PM_CMPLU_STALL_DCACHE_MISS). The final instruction completion percentage (PM_1PLUS_PPC_CMPL) is about 28%.

Future optimizations should try to maximize this number by decreasing CPU stall and/or GCT (Global Completion Table) percentages. Based on this report, another avenue for analysis is to identify the code where the stalls are happening. To accomplish this, use the perf record command. It traces the performance of a raw counter and creates a map with a process backtrace, allowing identification of which symbol generated the most hardware events. This is similar to the way OProfile works. In this example, to trace the PM_CMPLU_STALL_DCACHE_MISS events, issue the command in Listing 14:

Listing 14. perf record for PM_CMPLU_STALL_DCACHE_MISS event
$ /usr/bin/perf record -C 0 -e r20016 taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl

The perf command creates a data file (usually "perf.data") with the results. It can be read interactively using the perf report command, as in Listing 15:

Listing 15. Output from 'perf report' for 483.xalancbmk benchmark
Events: 192  raw 0x20016
    39.58%  Xalan_base.none  Xalan_base.none  [.] xercesc_2_5::ValueStore::contains 
    11.46%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::XStringCachedAllocator
     9.90%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::XStringCachedAllocator
     7.29%  Xalan_base.none  Xalan_base.none  [.] xercesc_2_5::ValueStore::isDuplica
     5.21%  Xalan_base.none     [.] _int_malloc 
     5.21%  Xalan_base.none  Xalan_base.none  [.] __gnu_cxx::__normal_iterator<xa
     4.17%  Xalan_base.none     [.] __GI___libc_malloc 
     2.08%  Xalan_base.none     [.] malloc_consolidate.part.4 
     1.56%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::ReusableArenaBlock<xa
     1.56%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::ReusableArenaBlock<xa
     1.04%  Xalan_base.none     [.] __free

With this analysis using the POWER7 CBM counters and the perf report tool, your optimization effort can concentrate on optimizing memory and cache access in the symbol xercesc_2_5::ValueStore::contains(xercesc_2_5::FieldValueMap const*).

This example is just a subset of the possible analyses. The POWER7 CBM shows that although data cache stalls are the largest cause of CPU stalls, the load-store unit (PM_CMPLU_STALL_LSU) and branch unit (PM_CMPLU_STALL_BRU) are also sources of stalls. Further analysis can address these counters.

Case Study

The following case study applies these performance evaluation strategies to analyze a trigonometric math function implementation. Based on the analysis results, optimization opportunities will be identified. The function used in this case study is the ISO C hypot function, which returns the length of the hypotenuse of a right triangle. The function is defined by C99 and POSIX.1-2001 as:

double hypot(double x, double y);
The hypot() function returns sqrt(x*x+y*y): the length of the hypotenuse of a right-angled triangle with sides of length x and y. If x or y is an infinity, positive infinity is returned. If x or y is a NaN, and the other argument is not an infinity, a NaN is returned. If the result overflows, a range error occurs, and the functions return HUGE_VAL, HUGE_VALF, or HUGE_VALL, respectively. If both arguments are subnormal, and the result is subnormal, a range error occurs, and the correct result is returned.

Although the algorithm seems simple, the floating-point (FP) handling of Infinity and NaN arguments and the overflow/underflow behavior of FP operations impose challenges that affect performance. The GNU C Library (see Related topics) provides an implementation of hypot located in the source tree at sysdeps/ieee754/dbl-64/e_hypot.c:

Note: The license information for this code sample is included in Appendix.

Listing 16. Default GLIBC hypot source code
double __ieee754_hypot(double x, double y) 
{ 
        double a,b,t1,t2,y1,y2,w; 
        int32_t j,k,ha,hb; 

        GET_HIGH_WORD(ha,x); 
        ha &= 0x7fffffff; 
        GET_HIGH_WORD(hb,y); 
        hb &= 0x7fffffff; 
        if(hb > ha) {a=y;b=x;j=ha; ha=hb;hb=j;} else {a=x;b=y;} 
        SET_HIGH_WORD(a,ha);    /* a <- |a| */ 
        SET_HIGH_WORD(b,hb);    /* b <- |b| */ 
        if((ha-hb)>0x3c00000) {return a+b;} /* x/y > 2**60 */ 
        k=0; 
        if(ha > 0x5f300000) {   /* a>2**500 */ 
           if(ha >= 0x7ff00000) {       /* Inf or NaN */ 
               u_int32_t low; 
               w = a+b;                 /* for sNaN */ 
               GET_LOW_WORD(low,a); 
               if(((ha&0xfffff)|low)==0) w = a; 
               GET_LOW_WORD(low,b); 
               if(((hb^0x7ff00000)|low)==0) w = b; 
               return w; 
           } 
           /* scale a and b by 2**-600 */ 
           ha -= 0x25800000; hb -= 0x25800000;  k += 600; 
           SET_HIGH_WORD(a,ha); 
           SET_HIGH_WORD(b,hb); 
        } 
        if(hb < 0x20b00000) {   /* b < 2**-500 */ 
            if(hb <= 0x000fffff) {      /* subnormal b or 0 */ 
                u_int32_t low; 
                GET_LOW_WORD(low,b); 
                if((hb|low)==0) return a; 
                t1=0; 
                SET_HIGH_WORD(t1,0x7fd00000);   /* t1=2^1022 */ 
                b *= t1; 
                a *= t1; 
                k -= 1022; 
            } else {            /* scale a and b by 2^600 */ 
                ha += 0x25800000;       /* a *= 2^600 */ 
                hb += 0x25800000;       /* b *= 2^600 */ 
                k -= 600; 
                SET_HIGH_WORD(a,ha); 
                SET_HIGH_WORD(b,hb); 
            } 
        } 
    /* medium size a and b */ 
        w = a-b; 
        if (w>b) { 
            t1 = 0; 
            SET_HIGH_WORD(t1,ha); 
            t2 = a-t1; 
            w  = __ieee754_sqrt(t1*t1-(b*(-b)-t2*(a+t1))); 
        } else { 
            a  = a+a; 
            y1 = 0; 
            SET_HIGH_WORD(y1,hb); 
            y2 = b - y1; 
            t1 = 0; 
            SET_HIGH_WORD(t1,ha+0x00100000); 
            t2 = a - t1; 
            w  = __ieee754_sqrt(t1*y1-(w*(-w)-(t1*y2+t2*b))); 
        } 
        if(k!=0) { 
            u_int32_t high; 
            t1 = 1.0; 
            GET_HIGH_WORD(high,t1); 
            SET_HIGH_WORD(t1,high+(k<<20)); 
            return t1*w; 
        } else return w; 
}

This implementation is quite complex, mainly because the algorithm performs many bit-by-bit FP to integer conversions. It assumes that certain FP operations, such as comparisons and multiplications, are more costly when done with floating-point instructions than with fixed-point instructions. This is true on some architectures, but not on the Power Architecture.

Your first step in evaluating this implementation is to create a benchmark that can be profiled. In this case, since it is simply a function with two arguments and a straightforward algorithm (no internal function calls or alternative paths), a simple benchmark can be created to evaluate it (see hypot_bench.tar.gz in Downloadable resources). The benchmark is part of the performance evaluation: optimizations should speed up algorithms, or the critical parts of algorithms, that dominate the total workload time. Synthetic benchmarks like this one should represent normal utilization of the function. Because optimization efforts tend to be resource and time consuming, focus on the most common usage cases or the expected behavior; trying to optimize code that accounts for a small fraction of total program time tends to be a waste of resources.

Since this is a performance analysis of a single function, you can skip hotspot analysis and focus on CBM analysis. Using the benchmark in hypot_bench.c along with perf, collect the CBM information shown in Listing 17:

Listing 17. CBM output for the hypot benchmark
CPI Breakdown Model (Complete) 

Metric                         :            Value :    Percent 
PM_CMPLU_STALL_DIV             :        8921688.0 :        8.7 
PM_CMPLU_STALL_FXU_OTHER       :    13953382275.0 :        5.0 
PM_CMPLU_STALL_SCALAR_LONG     :    24380128688.0 :        8.7 
PM_CMPLU_STALL_SCALAR_OTHER    :    33862492798.0 :       12.0 
PM_CMPLU_STALL_VECTOR_LONG     :              0.0 :        0.0 
PM_CMPLU_STALL_VECTOR_OTHER    :      275057010.0 :        0.1 
PM_CMPLU_STALL_ERAT_MISS       :         173439.0 :        0.0 
PM_CMPLU_STALL_REJECT_OTHER    :         902838.0 :        0.0 
PM_CMPLU_STALL_DCACHE_MISS     :       15200163.0 :        0.0 
PM_CMPLU_STALL_STORE           :        1837414.0 :        0.0 
PM_CMPLU_STALL_LSU_OTHER       :    94866270200.0 :       33.7 
PM_CMPLU_STALL_THRD            :         569036.0 :        0.0 
PM_CMPLU_STALL_BRU             :    10470012464.0 :        3.7 
PM_CMPLU_STALL_IFU_OTHER       :      -73357562.0 :        0.0 
PM_CMPLU_STALL_OTHER           :     7140295432.0 :        2.5 
PM_GCT_NOSLOT_IC_MISS          :        3586554.0 :        0.0 
PM_GCT_NOSLOT_BR_MPRED         :     1008950510.0 :        0.4 
PM_GCT_NOSLOT_BR_MPRED_IC_MISS :         795943.0 :        0.0 
PM_GCT_EMPTY_OTHER             :    42488384303.0 :       15.1 
PM_1PLUS_PPC_CMPL              :    53138626513.0 :       18.9 
OVERHEAD_EXPANSION             :       30852715.0 :        0.0 
Total                                             :      108.7

The profile analysis shows that most of the CPU stalls, and hence the performance loss, come from the Load and Store Unit (LSU - counter PM_CMPLU_STALL_LSU_OTHER). The LSU has various counters associated with it; however, during CPU stall analysis your focus is on the counters associated with performance degradation. On POWER, the most important of these are associated with Load-Hit-Store (LHS) hazards: a large stall that occurs when the CPU writes data to an address and then tries to load that data again too soon afterward. The next step is to check whether this is happening in this particular algorithm, by first profiling the event PM_LSU_REJECT_LHS (raw code "rc8ac"), as shown in Listing 18.

Listing 18. perf record of PM_LSU_REJECT_LHS POWER7 event
$ perf record -C 0 -e rc8ac taskset -c 0 ./hypot_bench_glibc
$ perf report
Events: 14K raw 0xc8ac
    79.19%  hypot_bench_gli       [.] __ieee754_hypot
    10.38%  hypot_bench_gli       [.] __hypot
     6.34%  hypot_bench_gli       [.] __GI___finite

The profile output shows that the symbol __ieee754_hypot generates most of the PM_LSU_REJECT_LHS events. The next step is to investigate the assembly code generated by the compiler to identify which instructions are generating the event. Expand the symbol __ieee754_hypot to annotate the assembly by navigating the perf report screen and selecting the __ieee754_hypot symbol, which shows the output in Listing 19.

Listing 19. perf report of PM_LSU_REJECT_LHS POWER7 event
         :        00000080fc38b730 <.__ieee754_hypot>:
    0.00 :          80fc38b730:   7c 08 02 a6     mflr    r0
    0.00 :          80fc38b734:   fb c1 ff f0     std     r30,-16(r1)
    0.00 :          80fc38b738:   fb e1 ff f8     std     r31,-8(r1)
   13.62 :          80fc38b73c:   f8 01 00 10     std     r0,16(r1)
    0.00 :          80fc38b740:   f8 21 ff 71     stdu    r1,-144(r1)
   10.82 :          80fc38b744:   d8 21 00 70     stfd    f1,112(r1)
    0.23 :          80fc38b748:   e9 21 00 70     ld      r9,112(r1)
   17.54 :          80fc38b74c:   d8 41 00 70     stfd    f2,112(r1)
    0.00 :          80fc38b750:   79 29 00 62     rldicl  r9,r9,32,33
    0.00 :          80fc38b754:   e9 61 00 70     ld      r11,112(r1)
    0.00 :          80fc38b758:   e8 01 00 70     ld      r0,112(r1)
    8.46 :          80fc38b75c:   d8 21 00 70     stfd    f1,112(r1)

Early in the code the implementation uses the macro GET_HIGH_WORD to transform a double into an integer for subsequent bit-wise operations. GLIBC's math/math_private.h defines the macro with the code in Listing 20.

Listing 20. GET_HIGH_WORD macro definition
#define GET_HIGH_WORD(i,d)                                      \
do {                                                            \
  ieee_double_shape_type gh_u;                                  \
  gh_u.value = (d);                                             \
  (i) = gh_u.parts.msw;                                         \
} while (0)

A possible culprit causing an LHS stall in this macro is the pair of operations that stores the double d into the union's internal value and then reads part of it back into the variable i. The POWER7 processor does not have a native instruction to move the contents of a floating-point register, bit-by-bit, to a fixed-point register. The way this is accomplished on POWER is to store the FP number from the floating-point register to memory and then load the same memory location into a fixed-point (general-purpose) register. Since memory access is slower than register operations (even when the access hits the L1 data cache), the CPU stalls while the store completes so that the subsequent load can proceed.

Note: The document, "POWER ISA 2.06 (POWER7)" (see Related topics), contains more information.

Most often, performance counter events trigger interrupts that save the PC address of an instruction close to, but not exactly at, the executing instruction. This can lead to assembly annotation that is not completely accurate. To mitigate this behavior, POWER4 and later processors have a limited set of performance counters called marked events. Marked instructions generate fewer events per time frame; however, the recorded PC is exact, resulting in an accurate assembly annotation. Marked events have the PM_MRK prefix in the OProfile counter list obtained by opcontrol -l.

To double check the analysis, watch the PM_MRK_LSU_REJECT_LHS counter. Both counters, PM_MRK_LSU_REJECT_LHS and PM_LSU_REJECT_LHS, watch for the same performance event. However, the marked counter (PM_MRK_LSU_REJECT_LHS) generates fewer events per time frame but with a more accurate assembly annotation. (See Listing 21.)

Listing 21. perf record of PM_MRK_LSU_REJECT_LHS POWER7 event
$ perf record -C 0 -e rd082 taskset -c 0 ./hypot_bench_glibc
$ perf report
Events: 256K raw 0xd082
    64.61%  hypot_bench_gli       [.] __ieee754_hypot
    35.33%  hypot_bench_gli       [.] __GI___finite

This generates the assembly annotation in Listing 22.

Listing 22. perf report of PM_MRK_LSU_REJECT_LHS POWER7 event
         :        00000080fc38b730 <.__ieee754_hypot>:
    1.23 :          80fc38b7a8:   c9 a1 00 70     lfd     f13,112(r1)
    0.00 :          80fc38b7ac:   f8 01 00 70     std     r0,112(r1)
   32.66 :          80fc38b7b0:   c8 01 00 70     lfd     f0,112(r1)
    0.00 :          80fc38b954:   f8 01 00 70     std     r0,112(r1)
    0.00 :          80fc38b958:   e8 0b 00 00     ld      r0,0(r11)
    0.00 :          80fc38b95c:   79 00 00 0e     rldimi  r0,r8,32,0
   61.72 :          80fc38b960:   c9 61 00 70     lfd     f11,112(r1)

Another symbol accounts for about 35% of the generated events and shows similar behavior, as seen in Listing 23.

Listing 23. More highlights of the perf report
         :        00000080fc3a2610 <.__finitel>:
    0.00 :          80fc3a2610:   d8 21 ff f0     stfd    f1,-16(r1)
  100.00 :          80fc3a2614:   e8 01 ff f0     ld      r0,-16(r1)

Based on this information, your optimization effort might aim to eliminate these stalls by removing the FP to integer conversions. The POWER processor has a fast and efficient floating-point execution unit, so there is no need to perform these computations with fixed-point instructions. The algorithm that POWER currently uses in GLIBC (sysdeps/powerpc/fpu/e_hypot.c) removes all of the LHS stalls by using FP operations only. The result is the much simpler algorithm in Listing 24.

Listing 24. PowerPC GLIBC hypot source code
double
__ieee754_hypot (double x, double y)
{
  x = fabs (x);
  y = fabs (y);

  TEST_INF_NAN (x, y);

  if (y > x)
    {
      double t = x;
      x = y;
      y = t;
    }
  if (y == 0.0 || (x / y) > two60)
    return x + y;
  if (x > two500)
    {
      x *= twoM600;
      y *= twoM600;
      return __ieee754_sqrt (x * x + y * y) / twoM600;
    }
  if (y < twoM500)
    {
      if (y <= pdnum)
        {
          x *= two1022;
          y *= two1022;
          return __ieee754_sqrt (x * x + y * y) / two1022;
        }
      else
        {
          x *= two600;
          y *= two600;
          return __ieee754_sqrt (x * x + y * y) / two600;
        }
    }
  return __ieee754_sqrt (x * x + y * y);
}

The TEST_INF_NAN macro is a further small optimization that tests whether a number is NaN or INFINITY before starting further FP operations (because operations on NaN and INFINITY can raise FP exceptions, and the function specification does not allow that). On POWER7, the isinf and isnan calls are optimized by the compiler to FP instructions and do not generate extra function calls, whereas on older processors (POWER6 and earlier) they generate calls to the respective functions. The optimization is essentially the same tests, but inlined to avoid the function-call overhead.

Finally, to compare both implementations, perform the following simple test: recompile GLIBC with and without the new algorithm and compare the total time of each benchmark run. The default GLIBC implementation results are in Listing 25:

Listing 25. Benchmark with default GLIBC hypot
$ /usr/bin/time ./hypot_bench_glibc
INF_CASE       : elapsed time: 14:994339 
NAN_CASE       : elapsed time: 14:707085 
TWO60_CASE     : elapsed time: 12:983906 
TWO500_CASE    : elapsed time: 10:589746 
TWOM500_CASE   : elapsed time: 11:215079 
NORMAL_CASE    : elapsed time: 15:325237 
79.80user 0.01system 1:19.81elapsed 99%CPU (0avgtext+0avgdata 151552maxresident)k 
0inputs+0outputs (0major+48minor)pagefaults 0swaps

The optimized version results are in Listing 26:

Listing 26. Benchmark with optimized GLIBC hypot
$ /usr/bin/time ./hypot_bench_glibc 
INF_CASE       : elapsed time: 4:667043 
NAN_CASE       : elapsed time: 5:100940 
TWO60_CASE     : elapsed time: 6:245313 
TWO500_CASE    : elapsed time: 4:838627 
TWOM500_CASE   : elapsed time: 8:946053 
NORMAL_CASE    : elapsed time: 6:245218 
36.03user 0.00system 0:36.04elapsed 99%CPU (0avgtext+0avgdata 163840maxresident)k 
0inputs+0outputs (0major+50minor)pagefaults 0swaps

This is a final performance improvement of more than 100% (a speedup of about 2.2x), cutting the benchmark time roughly in half.


Conclusion

Performance evaluation with hardware counter profiling is a powerful tool for understanding how a workload behaves on a given processor and for providing hints about where to focus performance optimizations. The latest POWER7 processor has hundreds of performance counters available, so we presented a simple model for mapping a workload to CPU stalls. Understanding the POWER7 CBM is somewhat complicated, so we also explained Linux tools that simplify it. The strategies for performance evaluation focused on how to find hotspots, how to understand the memory access pattern of an application, and how to use the POWER7 CBM. Finally, we used a recent optimization of a math function within GLIBC to illustrate how performance analysis led to the optimized code.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

Downloadable resources

Related topics
