Evaluate performance for Linux on POWER
Analyze performance using Linux tools
Introduction
Application performance evaluation can be a complex task on modern machines, and no single commonly available tool handles all of the performance variables involved. Each workload stresses different computer subsystems: measuring and tuning a CPU-bound program is quite different from tuning an I/O-bound or memory-bound one. In this article we focus on CPU-bound and memory-bound programs in compiled-language environments (C, C++, and others). We demonstrate how to:
- Find program hotspots (a region of a program, such as a function or method, where a high proportion of the executed instructions occur)
- Measure how the program behaves on POWER7 using the hardware performance counters available on the processor
- Identify the tools to use for performance evaluation on Linux
CPI model for POWER7
Understanding application performance analysis begins with a discussion of the Cycles per Instruction (CPI) metric: the number of processor cycles needed to complete an instruction. Each instruction is decomposed into multiple stages: a classic RISC pipeline has an instruction fetch stage followed by instruction decode/register fetch, execution, an optional memory access, and finally writeback. A CPU can improve its CPI (that is, lower it) by exploiting instruction-level parallelism: while one instruction occupies one stage, other instructions occupy the remaining stages. For example, a program that completes one billion instructions in two billion cycles runs at a CPI of 2.0. When optimizing, try to minimize the CPI value to maximize system utilization. Figure 1 shows an optimal instruction flow in a pipelined processor.
Figure 1. Optimal instruction flow of a pipelined processor

Sometimes one stage cannot proceed independently of the others, or an instruction has dependencies that the processor must satisfy before continuing execution. For instance, a memory load followed by an arithmetic instruction on the loaded value forces the processor to fetch the data from cache or memory before the arithmetic instruction can issue. When this happens, the pipeline is said to stall. Figure 2 shows what a stalled pipeline might look like.
Figure 2. A pipelined processor with stalls

In the examples in Figure 1 and Figure 2, consider that during 11 cycles of operation a fully populated pipeline (where one instruction is completed per cycle) lets the processor execute eight instructions, but when a three-cycle stall occurs only five instructions are executed in the same number of cycles: a performance loss of almost 40%. Depending on the algorithm, some stalls are unavoidable; however, careful analysis can provide hints and advice on how to rewrite or adjust pieces of code to avoid them. A more complete and didactic explanation of modern CPU pipelining and instruction-level parallelism is given in the article "Modern Microprocessors - a 90 minute guide" (see Related topics).
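To make the idea of dependency stalls concrete, here is a minimal C sketch (not from any workload profiled in this article): the first loop forms a dependency chain, so each multiplication must wait for the previous result, while the second gives the pipeline independent instructions to overlap.

#include <stddef.h>

/* Each multiplication depends on the previous result, so the pipeline
   must wait for each product before issuing the next one. */
double dependent_chain(const double *v, size_t n)
{
    double acc = 1.0;
    for (size_t i = 0; i < n; i++)
        acc = acc * v[i];       /* loop-carried dependency */
    return acc;
}

/* Two independent accumulators give the processor independent
   instructions that it can overlap in the pipeline. */
double independent_chains(const double *v, size_t n)
{
    double acc0 = 1.0, acc1 = 1.0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        acc0 = acc0 * v[i];     /* these two multiplications do not  */
        acc1 = acc1 * v[i + 1]; /* depend on each other              */
    }
    /* (an odd trailing element is omitted for brevity) */
    return acc0 * acc1;
}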
The CPI Breakdown Model (CBM) relates functional processor stages to performance counters, showing which CPU functional unit is generating stalls. The CBM depends on the CPU architecture and processor model: the Power Architecture and the Intel Architecture have completely different CBMs, and the POWER5 CBM, although similar, differs from the POWER7 CBM. Figure 3 shows a part of the POWER7 CBM. (See a text version of this information.)
Figure 3. Partial POWER7 CBM

In the Power Architecture, hardware performance counters are a set of special-purpose registers whose contents are updated when certain events occur within the processor. The POWER7 processor has a built-in Performance Monitoring Unit (PMU) with six thread-level Performance Counter Monitors (PCMs), four of which are programmable, meaning it is possible to monitor four events at the same time, out of more than 500 possible performance events. POWER7 performance counters are organized in groups, and the PMU can only watch events from the same group at any one time. Figure 3 shows a subset of the performance counters used to define the POWER7 CBM. Followed through a profile, these counters denote which CPU functional unit is causing processor stalls and provide hints on how to tune the algorithm to eliminate them.
In Figure 3, the white boxes are specific POWER7 PCMs watched during a profile. The gray boxes, each marked with an asterisk (*), are calculated from those values; these metrics have no specific hardware counters.
Note: Find a comprehensive PMU reference for POWER7 in the paper, "Comprehensive PMU Event Reference POWER7" (See Related topics).
Tools on Linux
How can you use the PCMs found in POWER7 processors? Although various profiling methods are available on POWER, such as hardware interrupts, code instrumentation (for example, gprof), and operating system hooks (SystemTap), the PCMs provide an extensive set of counters that map directly to processor functionality. A PCM profiler constantly samples the processor's register values at regular intervals using operating system interrupts. Although sample profiling can be less numerically accurate than instruction tracing, it has less impact on overall system performance and allows the target benchmark to run at nearly full speed. The resulting data is not exact; it is an approximation with an error margin.
The two most commonly used tools for PCM profiling on Linux are
OProfile
and perf
(see Related topics). Although both use the
same principle, constantly sampling the special hardware registers
(through a syscall) along with a workload's backtrace, each is configured
and used in a different way.
The OProfile
tool is a system-wide profiler for Linux systems,
capable of profiling all running code at low overhead. It consists of a
kernel driver and daemon for collecting sample data, and several
post-profiling tools for turning data into information. Debug symbols
(-g
option to gcc) are not necessary unless you want
annotated source. With a recent Linux 2.6 kernel, OProfile
can provide gprof-style call-graph profiling information.
OProfile
has a typical overhead of 1-8%, depending on
sampling frequency and workload.
On POWER, OProfile
works by watching groups of hardware performance
counters, and different groups cannot be used together. This means that
getting different performance counters for the same workload requires
running it multiple times with different
OProfile
event configurations, and that you cannot watch the
entire POWER7 CBM at the same time. The available groups are defined in
the aforementioned "Comprehensive PMU Event Reference POWER7" document,
or by running the command in Listing 1:
Listing 1. OProfile groups listing
# opcontrol -l
The commands in Listing 2 demonstrate a simple
OProfile
configuration and invocation:
Listing 2. OProfile POWER7 CPU cycles configuration
# opcontrol -l
# opcontrol --no-vmlinux
# opcontrol -e PM_CYC_GRP1:500000 -e PM_INST_CMPL_GRP1:500000 \
  -e PM_RUN_CYC_GRP1:500000 -e PM_RUN_INST_CMPL_GRP1:500000
# opcontrol --start
Then run the workload. When it finishes, dump the samples and stop the profiler with the command sequence in Listing 3.
Listing 3. OProfile run command sequence
# opcontrol --dump
# opcontrol --stop
# opcontrol --shutdown
To get the performance counter report, issue the command in Listing 4:
Listing 4. OProfile report generation
# opreport -l > workload_report
Note: Find a comprehensive guide to OProfile
(although not
updated for POWER7) in the developerWorks article "Identify performance
bottlenecks with OProfile for Linux on POWER" (see Related topics).
The perf
tool, introduced in Linux kernel 2.6.29, analyzes
performance events at both hardware and software levels. The
perf
tool has the advantage of being program oriented,
instead of system oriented like OProfile. It has some preset performance
counter lists, like 'cpu-cycles OR cycles', 'branch-misses', or
'L1-icache-prefetch-misses', and it can multiplex the PMU
groups, allowing multiple performance counters from different
groups to be gathered at the same time at the cost of sampling precision.
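For instance, the preset event names can be listed with the perf list command; a few lines of its output (abridged here, and varying with the kernel version) look like this:

$ perf list
  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  branch-misses                              [Hardware event]
  ...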
One drawback is that, although perf allows gathering of hardware
performance counters directly, it does not recognize the counter names
used in the POWER7 CBM; raw hexadecimal codes must be used instead.
Table 1 maps the
OProfile
events to the hexadecimal codes that you can pass to
perf
(as raw events) to utilize the CBM
for POWER7.
Table 1. POWER7 perf events raw codes
Counter | Raw code |
---|---|
PM_RUN_CYC | 200f4 |
PM_CMPLU_STALL | 4000a |
PM_CMPLU_STALL_FXU | 20014 |
PM_CMPLU_STALL_DIV | 40014 |
PM_CMPLU_STALL_SCALAR | 40012 |
PM_CMPLU_STALL_SCALAR_LONG | 20018 |
PM_CMPLU_STALL_VECTOR | 2001c |
PM_CMPLU_STALL_VECTOR_LONG | 4004a |
PM_CMPLU_STALL_LSU | 20012 |
PM_CMPLU_STALL_REJECT | 40016 |
PM_CMPLU_STALL_ERAT_MISS | 40018 |
PM_CMPLU_STALL_DCACHE_MISS | 20016 |
PM_CMPLU_STALL_STORE | 2004a |
PM_CMPLU_STALL_THRD | 1001c |
PM_CMPLU_STALL_IFU | 4004c |
PM_CMPLU_STALL_BRU | 4004e |
PM_GCT_NOSLOT_CYC | 100f8 |
PM_GCT_NOSLOT_IC_MISS | 2001a |
PM_GCT_NOSLOT_BR_MPRED | 4001a |
PM_GCT_NOSLOT_BR_MPRED_IC_MISS | 4001c |
PM_GRP_CMPL | 30004 |
PM_1PLUS_PPC_CMPL | 100f2 |
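For example, to count run cycles (PM_RUN_CYC) and group completions (PM_GRP_CMPL) using the raw codes from Table 1, a perf invocation along these lines can be used (./workload stands in for your program):

$ perf stat -e r200f4,r30004 ./workload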
Note: Find a comprehensive guide to perf
in the IBM Wiki "Using perf on POWER7
systems" (see Related topics).
You can get the raw codes used with perf
that correspond to
the POWER7 events defined in OProfile
from the
libpfm4
project (see Related
topics): they are defined in the POWER7-specific header
(lib/events/power7_events.h). The example program
examples/showevtinfo also shows the event names and
corresponding raw hexadecimal codes.
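For instance, assuming libpfm4 has been built in the current directory, an invocation such as the following (the grep filter is only illustrative) prints the POWER7 event names together with their codes:

$ ./examples/showevtinfo | grep -i -A 3 PM_CMPLU_STALL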
To obtain counter information, profiling is a common approach. Profiling allows a developer to identify hotspots in code execution and data access, find performance sensitive areas, understand memory access patterns, and more. Before starting to profile, it is necessary to work out a performance evaluation strategy. The program might be composed of various modules and/or dynamic shared objects (DSO), it might intensively utilize the kernel, it might depend more on data pattern access (high pressure on L2 or L3 cache access) or might focus on the vector operation units. The next section will focus on possible performance evaluation strategies.
Strategies for Performance Evaluation
An initial performance evaluation step is to find program hotspots by inspecting the CPU cycle utilization counters. To do this on POWER7, watch the events listed in Table 2:
Table 2. POWER7 CPU cycle utilization counters
Counter | Description |
---|---|
PM_CYC | Processor Cycles |
PM_INST_CMPL | Number of PowerPC Instructions that completed |
PM_RUN_CYC | Processor Cycles gated by the run latch. Operating systems use the run latch to indicate when they are doing useful work. The run latch is typically cleared in the OS idle loop. Gating by the run latch filters out the idle loop. |
PM_RUN_INST_CMPL | Number of run instructions completed |
Running OProfile
with these events will show the overall time
the processor spent in each symbol. Below is an example profile, the
output of the command opreport -l
, for
the 403.gcc component from the SPECcpu2006 benchmark
suite compiled with IBM Advance Toolchain 5.0 for POWER (see Related topics).
Listing 5. Output from 'opreport -l' for the 403.gcc benchmark (counters PM_CYC_GRP1 and PM_INST_CMPL_GRP1)
CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles) with a unit
mask of 0x00 (No unit mask) count 500000
Counted PM_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of PowerPC
Instructions that completed.) with a unit mask of 0x00 (No unit mask) count 500000
samples  %        samples  %        image name      app name        symbol name
204528    7.9112  32132     1.3848  gcc_base.none   gcc_base.none   reg_is_remote_constant_p.isra.3.part.4
125218    4.8434  246710   10.6324  gcc_base.none   gcc_base.none   bitmap_operation
113190    4.3782  50950     2.1958  libc-2.13.so    libc-2.13.so    memset
90316     3.4934  22193     0.9564  gcc_base.none   gcc_base.none   compute_transp
89978     3.4804  11753     0.5065  vmlinux         vmlinux         .pseries_dedicated_idle_sleep
88429     3.4204  130166    5.6097  gcc_base.none   gcc_base.none   bitmap_element_allocate
67720     2.6194  41479     1.7876  gcc_base.none   gcc_base.none   ggc_set_mark
56613     2.1898  89418     3.8536  gcc_base.none   gcc_base.none   canon_rtx
53949     2.0868  6985      0.3010  gcc_base.none   gcc_base.none   delete_null_pointer_checks
51587     1.9954  26000     1.1205  gcc_base.none   gcc_base.none   ggc_mark_rtx_children_1
48050     1.8586  16086     0.6933  gcc_base.none   gcc_base.none   single_set_2
47115     1.8224  33772     1.4555  gcc_base.none   gcc_base.none   note_stores
Listing 6. Output from 'opreport -l' for the 403.gcc benchmark (counters PM_RUN_CYC_GRP1 and PM_RUN_INST_CMPL_GRP1)
Counted PM_RUN_CYC_GRP1 events ((Group 1 pm_utilization) Processor Cycles gated by the
run latch. Operating systems use the run latch to indicate when they are doing useful
work. The run latch is typically cleared in the OS idle loop. Gating by the run latch
filters out the idle loop.) with a unit mask of 0x00 (No unit mask) count 500000
Counted PM_RUN_INST_CMPL_GRP1 events ((Group 1 pm_utilization) Number of run
instructions completed.) with a unit mask of 0x00 (No unit mask) count 500000
samples  %        samples  %        image name      app name        symbol name
204538    8.3658  32078     1.3965  gcc_base.none   gcc_base.none   reg_is_remote_constant_p.isra.3.part.4
124596    5.0961  252227   10.9809  gcc_base.none   gcc_base.none   bitmap_operation
112326    4.5943  50890     2.2155  libc-2.13.so    libc-2.13.so    memset
90312     3.6939  21882     0.9527  gcc_base.none   gcc_base.none   compute_transp
0         0       0         0       vmlinux         vmlinux         .pseries_dedicated_idle_sleep
88894     3.6359  124831    5.4346  gcc_base.none   gcc_base.none   bitmap_element_allocate
67995     2.7811  41331     1.7994  gcc_base.none   gcc_base.none   ggc_set_mark
56460     2.3093  89484     3.8958  gcc_base.none   gcc_base.none   canon_rtx
54076     2.2118  6965      0.3032  gcc_base.none   gcc_base.none   delete_null_pointer_checks
51228     2.0953  26057     1.1344  gcc_base.none   gcc_base.none   ggc_mark_rtx_children_1
48057     1.9656  16005     0.6968  gcc_base.none   gcc_base.none   single_set_2
47160     1.9289  33766     1.4700  gcc_base.none   gcc_base.none   note_stores
Each watched event is represented by a pair of columns in the output: the first column shows the number of samples collected from the PCM for the specified event, and the second shows what percentage of the total samples it represents. As this report shows, the symbol reg_is_remote_constant_p consumes the most processor cycles and so is a good candidate for code optimization. This profile identifies only which symbols consume the most CPU cycles, not whether the processor pipeline is being fully utilized. You can investigate pipeline utilization by comparing the counter results.
Consider the counter PM_INST_CMPL_GRP1 (the second pair of columns): the symbol bitmap_operation shows a higher percentage than the symbol reg_is_remote_constant_p. This performance counter is incremented for each completed processor instruction, whereas PM_CYC_GRP1 counts only the CPU cycles used. Dividing cycle samples by instruction samples gives a rough per-symbol CPI: about 204528/32132, or 6.4, for reg_is_remote_constant_p, versus about 125218/246710, or 0.5, for bitmap_operation. Without further analysis, this suggests that reg_is_remote_constant_p contains more CPU stalls than bitmap_operation, since the number of instructions completed for it is significantly lower. This profile provides an initial hint about which symbol to focus subsequent optimization efforts on.
Before you start digging into the code, it is wise to determine whether the workload is CPU bound or memory bound, because the optimization approaches are quite different for each workload type. For example, memory accesses are most often satisfied from the caches or from local main memory (as opposed to remote NUMA node memory), and performance then depends almost entirely on the algorithms and data structures used. To investigate the memory access pattern, watch the two performance counters in Table 3:
Table 3. POWER7 memory utilization counters
Counter | Description |
---|---|
PM_MEM0_RQ_DISP | Read requests dispatched for main memory |
PM_MEM0_WQ_DISP | Write requests dispatched for main memory |
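A possible OProfile setup for these two counters, mirroring Listing 2 and assuming the group-59 event names and the sample count of 1000 that appear in Listing 7 below, is:

# opcontrol --no-vmlinux
# opcontrol -e PM_MEM0_RQ_DISP_GRP59:1000 -e PM_MEM0_WQ_DISP_GRP59:1000
# opcontrol --start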
These two counters can indicate whether a memory access pattern is mainly from memory reads, writes, or both. Using the same benchmark as before (403.gcc from SPECcpu2006), the profile shows:
Listing 7. Output from 'opreport -l' for the 403.gcc benchmark (counters PM_MEM0_RQ_DISP and PM_MEM0_WQ_DISP)
CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_MEM0_RQ_DISP_GRP59 events ((Group 59 pm_nest2) Nest events (MC0/MC1/PB/GX),
Pair0 Bit1) with a unit mask of 0x00 (No unit mask) count 1000
Counted PM_MEM0_WQ_DISP_GRP59 events ((Group 59 pm_nest2) Nest events (MC0/MC1/PB/GX),
Pair3 Bit1) with a unit mask of 0x00 (No unit mask) count 1000
samples  %        samples  %        app name        symbol name
225841   25.8000  289       0.4086  gcc_base.none   reg_is_remote_constant_p.isra.3.part.4
90068    10.2893  2183      3.0862  gcc_base.none   compute_transp
54038     6.1733  308       0.4354  gcc_base.none   single_set_2
32660     3.7311  2006      2.8359  gcc_base.none   delete_null_pointer_checks
26352     3.0104  1498      2.1178  gcc_base.none   note_stores
21306     2.4340  1950      2.7568  vmlinux         .pseries_dedicated_idle_sleep
18059     2.0631  9186     12.9865  libc-2.13.so    memset
15867     1.8126  659       0.9316  gcc_base.none   init_alias_analysis
Another interesting set of performance counters to observe measures the
access pressure on the L2 and L3 caches. The following example profiles
the SPECcpu2006 483.xalancbmk component (see
Related topics), built with the GCC shipped
with RHEL 6.2. This component uses memory allocation routines
heavily, so expect a lot of pressure on the memory subsystem. To observe
this, watch the following counters in Table 4
with OProfile:
Table 4. POWER7 cache/memory access counters
Counter | Description |
---|---|
PM_DATA_FROM_L2 | The processor's Data Cache was reloaded from the local L2 due to a demand load |
PM_DATA_FROM_L3 | The processor's Data Cache was reloaded from the local L3 due to a demand load |
PM_DATA_FROM_LMEM | The processor's Data Cache was reloaded from memory attached to the same module this processor is located on |
PM_DATA_FROM_RMEM | The processor's Data Cache was reloaded from memory attached to a different module than this processor is located on |
The profile output shows the following:
Listing 8. Output from 'opreport -l' for the 483.xalancbmk benchmark (counters PM_DATA_FROM_L2_GRP91 and PM_DATA_FROM_L3_GRP91)
CPU: ppc64 POWER7, speed 3550 MHz (estimated)
Counted PM_DATA_FROM_L2_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
was reloaded from the local L2 due to a demand load.) with a unit mask of 0x00
(No unit mask) count 1000
Counted PM_DATA_FROM_L3_GRP91 events ((Group 91 pm_dsource1) The processor's Data Cache
was reloaded from the local L3 due to a demand load.) with a unit mask of 0x00
(No unit mask) count 1000
samples  %        samples  %        image name      app name        symbol name
767827   25.5750  7581      0.2525  gcc_base.none   gcc_base.none   bitmap_element_allocate
377138   12.5618  8341      0.2778  gcc_base.none   gcc_base.none   bitmap_operation
93334     3.1088  3160      0.1052  gcc_base.none   gcc_base.none   bitmap_bit_p
70278     2.3408  5913      0.1969  libc-2.13.so    libc-2.13.so    _int_free
56851     1.8936  22874     0.7618  oprofile        oprofile        /oprofile
47570     1.5845  2881      0.0959  gcc_base.none   gcc_base.none   rehash_using_reg
41441     1.3803  8532      0.2841  libc-2.13.so    libc-2.13.so    _int_malloc
Listing 9. Output from 'opreport -l' for the 483.xalancbmk benchmark (counters PM_DATA_FROM_LMEM_GRP91 and PM_DATA_FROM_RMEM_GRP91)
Counted PM_DATA_FROM_LMEM_GRP91 events ((Group 91 pm_dsource1) The processor's Data
Cache was reloaded from memory attached to the same module this proccessor is located
on.) with a unit mask of 0x00 (No unit mask) count 1000
Counted PM_DATA_FROM_RMEM_GRP91 events ((Group 91 pm_dsource1) The processor's Data
Cache was reloaded from memory attached to a different module than this proccessor is
located on.) with a unit mask of 0x00 (No unit mask) count 1000
samples  %        samples  %        image name      app name        symbol name
1605      0.3344  0         0       gcc_base.none   gcc_base.none   bitmap_element_allocate
1778      0.3704  0         0       gcc_base.none   gcc_base.none   bitmap_operation
1231      0.2564  0         0       gcc_base.none   gcc_base.none   bitmap_bit_p
205       0.0427  0         0       libc-2.13.so    libc-2.13.so    _int_free
583       0.1215  327     100.000   oprofile        oprofile        /oprofile
0         0       0         0       gcc_base.none   gcc_base.none   rehash_using_reg
225       0.0469  0         0       libc-2.13.so    libc-2.13.so    _int_malloc
Interpreting the profile output shows that most of the cache pressure comes from L2 accesses, with almost no demand reloads from L3: the total and relative sample counts for L2 access (PM_DATA_FROM_L2) are much higher than for L3 demand reloads (PM_DATA_FROM_L3). Further information, such as whether L2 accesses are causing CPU stalls due to cache misses, requires more comprehensive analysis (watching more counters). Two further conclusions can be drawn from this example profile: main memory access (the PM_DATA_FROM_LMEM event) is quite low compared to cache access, and there is no remote access (the PM_DATA_FROM_RMEM event), indicating no remote NUMA node memory access. Analysis of hotspots and memory access patterns can give direction to optimization efforts, but in this case further analysis is required, because simply identifying the workload hotspots and the memory access pattern is not enough to correctly identify what causes the CPU stalls.
To come up with better strategies for performance optimization, further
analysis requires the perf
tool rather than
OProfile
, because the 22 POWER7 CBM counters presented in
Figure 3 need to be watched simultaneously. Many of
these events are in different groups, which means that using
OProfile
would require many runs of the same workload. The
perf
tool multiplexes the watching of hardware counters
when the specified counters belong to more than one group. Although this
gives a less accurate outcome, the overall result tends to be very
close to what is expected, with the advantage of less time spent
profiling.
The following example uses perf
to profile the same
SPECcpu2006 483.xalancbmk component. To profile this
component, issue the command in Listing 10:
Listing 10. perf command to generate the POWER7 CBM
$ /usr/bin/perf stat -C 0 -e r100f2,r4001a,r100f8,r4001c,r2001a,r200f4,r2004a,r4004a,\
r4004e,r4004c,r20016,r40018,r20012,r40016,r40012,r20018,r4000a,r2001c,r1001c,r20014,\
r40014,r30004 taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl > power7_cbm.dat
This command causes perf
to watch the raw events defined
by the -e argument on the CPU specified by -C. The taskset call ensures
that the component runs exclusively on CPU number 0. The workload
./Xalan_base.none -v t5.xml xalanc.xsl
can be replaced by
another application to profile. After the run completes, the perf
command outputs a simple table with the total count of each raw event
and the total number of elapsed seconds:
Listing 11. Output from 'perf stat' for the 483.xalancbmk benchmark
Performance counter stats for 'taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl':

    366,860,486,404 r100f2    [18.15%]
      8,090,500,758 r4001a    [13.65%]
     50,655,176,004 r100f8    [ 9.13%]
     11,358,043,420 r4001c    [ 9.11%]
     10,318,533,758 r2001a    [13.68%]
  1,301,183,175,870 r200f4    [18.22%]
      2,150,935,303 r2004a    [ 9.10%]
                  0 r4004a    [13.65%]
    211,224,577,427 r4004e    [ 4.54%]
    212,033,138,844 r4004c    [ 4.54%]
    264,721,636,705 r20016    [ 9.09%]
     22,176,093,590 r40018    [ 9.11%]
    510,728,741,936 r20012    [ 9.10%]
     39,823,575,049 r40016    [ 9.07%]
      7,219,335,816 r40012    [ 4.54%]
          1,585,358 r20018    [ 9.08%]
    882,639,601,431 r4000a    [ 9.08%]
      1,219,039,175 r2001c    [ 9.08%]
          3,107,304 r1001c    [13.62%]
    120,319,547,023 r20014    [ 9.09%]
     50,684,413,751 r40014    [13.62%]
    366,940,826,307 r30004    [18.16%]

      461.057870036 seconds time elapsed
To analyze the perf
output against the POWER7 CBM, a Python
script is provided (check power7_cbm.zip in Downloadable resources);
it composes the counter metrics from
the collected virtual and hardware counters. To create a report, issue the
command in Listing 12:
Listing 12. POWER7 CBM python script invocation
$ power7_cbm.py power7_cbm.dat
Output similar to Listing 13 will be printed:
Listing 13. Output from 'power7_cbm.py' for the 483.xalancbmk benchmark
CPI Breakdown Model (Complete)

Metric                         :            Value : Percent
PM_CMPLU_STALL_DIV             :    49802421337.0 :     0.0
PM_CMPLU_STALL_FXU_OTHER       :    67578558649.0 :     5.2
PM_CMPLU_STALL_SCALAR_LONG     :        2011413.0 :     0.0
PM_CMPLU_STALL_SCALAR_OTHER    :     7195240404.0 :     0.6
PM_CMPLU_STALL_VECTOR_LONG     :              0.0 :     0.0
PM_CMPLU_STALL_VECTOR_OTHER    :     1209603592.0 :     0.1
PM_CMPLU_STALL_ERAT_MISS       :    22193968056.0 :     1.7
PM_CMPLU_STALL_REJECT_OTHER    :    18190293594.0 :     1.4
PM_CMPLU_STALL_DCACHE_MISS     :   261865838255.0 :    20.3
PM_CMPLU_STALL_STORE           :     2001544985.0 :     0.2
PM_CMPLU_STALL_LSU_OTHER       :   202313206181.0 :    15.7
PM_CMPLU_STALL_THRD            :        2025705.0 :     0.0
PM_CMPLU_STALL_BRU             :   208356542821.0 :    16.2
PM_CMPLU_STALL_IFU_OTHER       :     2171796336.0 :     0.2
PM_CMPLU_STALL_OTHER           :    30895294057.0 :     2.4
PM_GCT_NOSLOT_IC_MISS          :     9805421042.0 :     0.8
PM_GCT_NOSLOT_BR_MPRED         :     7823508357.0 :     0.6
PM_GCT_NOSLOT_BR_MPRED_IC_MISS :    11059314150.0 :     0.9
PM_GCT_EMPTY_OTHER             :    20292049774.0 :     1.6
PM_1PLUS_PPC_CMPL              :   365158978504.0 :    28.3
OVERHEAD_EXPANSION             :      590057044.0 :     0.0
Total : 96.1
This report is based on statistical values within an error margin, so the final percentages are not entirely accurate. Even with a high error margin, about 20% of total CPU stalls are due to data cache misses (PM_CMPLU_STALL_DCACHE_MISS), and the final instruction completion percentage (PM_1PLUS_PPC_CMPL) is about 28%.
Future optimizations should try to maximize this number by decreasing CPU
stalls and/or the GCT (Global Completion Table) percentages. Based on this
report, another avenue of analysis is to identify the code where the
stalls are happening. You can accomplish this with the
perf record
command: it traces the performance of a raw
counter and creates a map with the process backtrace, allowing
identification of the symbols that generated the most hardware events.
This is similar to the way OProfile
works. In this example, to trace the
PM_CMPLU_STALL_DCACHE_MISS events, issue the command in Listing 14:
Listing 14. perf record for PM_CMPLU_STALL_DCACHE_MISS event
$ /usr/bin/perf record -C 0 -e r20016 taskset -c 0 ./Xalan_base.none -v t5.xml xalanc.xsl
The perf command will create a data file (perf.data by default) with the results. It can be read interactively using the perf report command, as in Listing 15:
Listing 15. Output from 'perf report' for the 483.xalancbmk benchmark
Events: 192  raw 0x20016
39.58%  Xalan_base.none  Xalan_base.none  [.] xercesc_2_5::ValueStore::contains
11.46%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::XStringCachedAllocator
 9.90%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::XStringCachedAllocator
 7.29%  Xalan_base.none  Xalan_base.none  [.] xercesc_2_5::ValueStore::isDuplica
 5.21%  Xalan_base.none  libc-2.13.so     [.] _int_malloc
 5.21%  Xalan_base.none  Xalan_base.none  [.] __gnu_cxx::__normal_iterator<xa
 4.17%  Xalan_base.none  libc-2.13.so     [.] __GI___libc_malloc
 2.08%  Xalan_base.none  libc-2.13.so     [.] malloc_consolidate.part.4
 1.56%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::ReusableArenaBlock<xa
 1.56%  Xalan_base.none  Xalan_base.none  [.] xalanc_1_8::ReusableArenaBlock<xa
 1.04%  Xalan_base.none  libc-2.13.so     [.] __free
[...]
With this analysis of the POWER7 CBM counters and the perf report tool, your optimization effort might concentrate on optimizing memory and cache access in the symbol xercesc_2_5::ValueStore::contains(xercesc_2_5::FieldValueMap const*).
This example is just a subset of the possible analysis. The POWER7 CBM shows that although data cache stalls are the largest cause of CPU stalls, the load-store unit (PM_CMPLU_STALL_LSU) and the branch unit (PM_CMPLU_STALL_BRU) are both sources of stalls as well. Further analysis can address these counters.
Case Study
The following case study applies these performance evaluation strategies to analyze the implementation of a math function. Based on the analysis results, optimization opportunities will be identified. The function used in this case study is the ISO C hypot function, which returns the length of the hypotenuse of a right triangle. It is defined by C99 and POSIX.1-2001 as:
double hypot(double x, double y);
The hypot() function returns sqrt(x*x+y*y), the length of the hypotenuse of a right-angled triangle with sides of length x and y. If x or y is an infinity, positive infinity is returned. If x or y is a NaN, and the other argument is not an infinity, a NaN is returned. If the result overflows, a range error occurs, and the functions return HUGE_VAL, HUGE_VALF, or HUGE_VALL, respectively. If both arguments are subnormal, and the result is subnormal, a range error occurs, and the correct result is returned.
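Part of the implementation challenge is numeric: the intermediate products in sqrt(x*x+y*y) can overflow (or underflow) even when the true result is representable, which is why the implementations below scale their inputs. The following fragment, written only to illustrate the problem, demonstrates this:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.5e200, y = 2.0e200;

    /* x*x is about 2.25e400, far beyond DBL_MAX, so the naive
       formula overflows to +Inf... */
    double naive = sqrt(x * x + y * y);

    /* ...while the true hypotenuse (2.5e200) is perfectly
       representable, and hypot() computes it by scaling first. */
    double good = hypot(x, y);

    printf("naive=%g hypot=%g\n", naive, good);  /* naive=inf hypot=2.5e+200 */
    return 0;
}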
Although the algorithm seems simple, the floating-point (FP) handling of Infinity and NaN arguments, and the overflow/underflow behavior of the FP operations, impose challenges that have performance impacts. The GNU C Library (see Related topics) provides an implementation of hypot, located in the source tree at sysdeps/ieee754/dbl-64/e_hypot.c:
Note: The license information for this code sample is included in Appendix.
Listing 16. Default GLIBC hypot source code
double __ieee754_hypot(double x, double y)
{
    double a,b,t1,t2,y1,y2,w;
    int32_t j,k,ha,hb;

    GET_HIGH_WORD(ha,x);
    ha &= 0x7fffffff;
    GET_HIGH_WORD(hb,y);
    hb &= 0x7fffffff;
    if(hb > ha) {a=y;b=x;j=ha; ha=hb;hb=j;} else {a=x;b=y;}
    SET_HIGH_WORD(a,ha);        /* a <- |a| */
    SET_HIGH_WORD(b,hb);        /* b <- |b| */
    if((ha-hb)>0x3c00000) {return a+b;} /* x/y > 2**60 */
    k=0;
    if(ha > 0x5f300000) {       /* a > 2**500 */
        if(ha >= 0x7ff00000) {  /* Inf or NaN */
            u_int32_t low;
            w = a+b;            /* for sNaN */
            GET_LOW_WORD(low,a);
            if(((ha&0xfffff)|low)==0) w = a;
            GET_LOW_WORD(low,b);
            if(((hb^0x7ff00000)|low)==0) w = b;
            return w;
        }
        /* scale a and b by 2**-600 */
        ha -= 0x25800000; hb -= 0x25800000; k += 600;
        SET_HIGH_WORD(a,ha);
        SET_HIGH_WORD(b,hb);
    }
    if(hb < 0x20b00000) {       /* b < 2**-500 */
        if(hb <= 0x000fffff) {  /* subnormal b or 0 */
            u_int32_t low;
            GET_LOW_WORD(low,b);
            if((hb|low)==0) return a;
            t1=0;
            SET_HIGH_WORD(t1,0x7fd00000); /* t1=2^1022 */
            b *= t1;
            a *= t1;
            k -= 1022;
        } else {                /* scale a and b by 2^600 */
            ha += 0x25800000;   /* a *= 2^600 */
            hb += 0x25800000;   /* b *= 2^600 */
            k -= 600;
            SET_HIGH_WORD(a,ha);
            SET_HIGH_WORD(b,hb);
        }
    }
    /* medium size a and b */
    w = a-b;
    if (w>b) {
        t1 = 0;
        SET_HIGH_WORD(t1,ha);
        t2 = a-t1;
        w  = __ieee754_sqrt(t1*t1-(b*(-b)-t2*(a+t1)));
    } else {
        a  = a+a;
        y1 = 0;
        SET_HIGH_WORD(y1,hb);
        y2 = b - y1;
        t1 = 0;
        SET_HIGH_WORD(t1,ha+0x00100000);
        t2 = a - t1;
        w  = __ieee754_sqrt(t1*y1-(w*(-w)-(t1*y2+t2*b)));
    }
    if(k!=0) {
        u_int32_t high;
        t1 = 1.0;
        GET_HIGH_WORD(high,t1);
        SET_HIGH_WORD(t1,high+(k<<20));
        return t1*w;
    } else return w;
}
This implementation is quite complex, mainly because the algorithm performs many bit-level floating-point to integer conversions. It assumes that certain FP operations, like compares and multiplications, are more costly when performed with floating-point instructions than with fixed-point instructions. This is true on some architectures, but not on the Power Architecture.
Your first step in evaluating this implementation is to create a benchmark that can be profiled. In this case, since it is simply a function with two arguments and a straightforward algorithm (no internal function calls or additional paths), a simple benchmark can be created to evaluate it (check the hypot_bench.tar.gz in Downloadable resources). The benchmark is part of the performance evaluation: optimizations should speed up algorithms, or critical parts of algorithms, that leverage the total workload performance. A synthetic benchmark like this one should represent normal utilization of the function. Since optimization efforts tend to be resource- and time-consuming, you need to focus on the most common usage cases or expected behavior; trying to optimize code that represents a small share of total program usage tends to be a waste of resources.
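For reference, a minimal sketch of such a micro-benchmark follows. The real benchmark, with its INF, NaN, and scaling cases, is in the hypot_bench.tar.gz download; the inputs and iteration count here are arbitrary.

/* build: gcc -O2 bench.c -lm (add -lrt on older glibc for clock_gettime) */
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Arbitrary "normal case" inputs; a full benchmark would also
       exercise Inf, NaN, and the overflow/underflow ranges. */
    const double xs[] = { 3.0, 1.5e10, 2.5e-10, 7.0 };
    const double ys[] = { 4.0, 2.5e10, 1.5e-10, 24.0 };
    volatile double sink = 0.0;  /* keeps the calls from being optimized away */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 100000000L; i++)
        sink += hypot(xs[i & 3], ys[i & 3]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %f s (sink=%g)\n", secs, sink);
    return 0;
}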
Since this is a performance analysis of a single function, you can skip
hotspot analysis and focus on CBM analysis. Using the benchmark in
hypot_bench.c along with perf
produces the CBM
information in Listing 17:
Listing 17. Output from 'power7_cbm.py' for hypot benchmark
CPI Breakdown Model (Complete)

Metric                         :            Value : Percent
PM_CMPLU_STALL_DIV             :        8921688.0 :     8.7
PM_CMPLU_STALL_FXU_OTHER       :    13953382275.0 :     5.0
PM_CMPLU_STALL_SCALAR_LONG     :    24380128688.0 :     8.7
PM_CMPLU_STALL_SCALAR_OTHER    :    33862492798.0 :    12.0
PM_CMPLU_STALL_VECTOR_LONG     :              0.0 :     0.0
PM_CMPLU_STALL_VECTOR_OTHER    :      275057010.0 :     0.1
PM_CMPLU_STALL_ERAT_MISS       :         173439.0 :     0.0
PM_CMPLU_STALL_REJECT_OTHER    :         902838.0 :     0.0
PM_CMPLU_STALL_DCACHE_MISS     :       15200163.0 :     0.0
PM_CMPLU_STALL_STORE           :        1837414.0 :     0.0
PM_CMPLU_STALL_LSU_OTHER       :    94866270200.0 :    33.7
PM_CMPLU_STALL_THRD            :         569036.0 :     0.0
PM_CMPLU_STALL_BRU             :    10470012464.0 :     3.7
PM_CMPLU_STALL_IFU_OTHER       :      -73357562.0 :     0.0
PM_CMPLU_STALL_OTHER           :     7140295432.0 :     2.5
PM_GCT_NOSLOT_IC_MISS          :        3586554.0 :     0.0
PM_GCT_NOSLOT_BR_MPRED         :     1008950510.0 :     0.4
PM_GCT_NOSLOT_BR_MPRED_IC_MISS :         795943.0 :     0.0
PM_GCT_EMPTY_OTHER             :    42488384303.0 :    15.1
PM_1PLUS_PPC_CMPL              :    53138626513.0 :    18.9
OVERHEAD_EXPANSION             :       30852715.0 :     0.0
Total : 108.7
The profile analysis shows that most of the CPU stalls, and hence the performance loss, come from the Load-Store Unit (LSU, counter PM_CMPLU_STALL_LSU_OTHER). The LSU has various counters associated with it; however, during CPU stall analysis the focus is on the counters associated with performance degradation. On POWER, the ones that indicate performance degradation are associated with Load-Hit-Store (LHS) hazards: a large stall that occurs when the CPU writes data to an address and then tries to reload that data too soon afterward. The next step is to check whether that is happening in this particular algorithm, by first checking the event PM_LSU_REJECT_LHS (raw code rc8ac), as shown in Listing 18.
Listing 18. perf record of PM_LSU_REJECT_LHS POWER7 event
$ perf record -C 0 -e rc8ac taskset -c 0 ./hypot_bench_glibc
$ perf report

Events: 14K raw 0xc8ac
79.19%  hypot_bench_gli  libm-2.12.so  [.] __ieee754_hypot
10.38%  hypot_bench_gli  libm-2.12.so  [.] __hypot
 6.34%  hypot_bench_gli  libm-2.12.so  [.] __GI___finite
The profile output shows that the symbol __ieee754_hypot generates
most of the PM_LSU_REJECT_LHS events. Investigate the assembly code
generated by the compiler to identify which instructions are producing
the event. Expand the symbol __ieee754_hypot to annotate the
assembly by iterating on the perf report
screen and selecting
the __ieee754_hypot symbol, which shows the output in Listing 19.
Listing 19. perf report of PM_LSU_REJECT_LHS POWER7 event
: 00000080fc38b730 <.__ieee754_hypot>:
 0.00 :  80fc38b730:   7c 08 02 a6     mflr    r0
 0.00 :  80fc38b734:   fb c1 ff f0     std     r30,-16(r1)
 0.00 :  80fc38b738:   fb e1 ff f8     std     r31,-8(r1)
13.62 :  80fc38b73c:   f8 01 00 10     std     r0,16(r1)
 0.00 :  80fc38b740:   f8 21 ff 71     stdu    r1,-144(r1)
10.82 :  80fc38b744:   d8 21 00 70     stfd    f1,112(r1)
 0.23 :  80fc38b748:   e9 21 00 70     ld      r9,112(r1)
17.54 :  80fc38b74c:   d8 41 00 70     stfd    f2,112(r1)
 0.00 :  80fc38b750:   79 29 00 62     rldicl  r9,r9,32,33
 0.00 :  80fc38b754:   e9 61 00 70     ld      r11,112(r1)
 0.00 :  80fc38b758:   e8 01 00 70     ld      r0,112(r1)
 8.46 :  80fc38b75c:   d8 21 00 70     stfd    f1,112(r1)
[...]
Early in the code, the implementation uses the macro GET_HIGH_WORD to move the high word of a double into an integer for subsequent bitwise operations. GLIBC's math/math_private.h defines the macro using the code in Listing 20.
Listing 20. GET_HIGH_WORD macro definition
#define GET_HIGH_WORD(i,d)          \
do {                                \
  ieee_double_shape_type gh_u;      \
  gh_u.value = (d);                 \
  (i) = gh_u.parts.msw;             \
} while (0)
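The macro works through a union that overlays the double with two 32-bit words. A sketch of the ieee_double_shape_type it relies on, with the field order assumed here for the big-endian layout used on POWER (GLIBC selects it by endianness), is:

#include <stdint.h>

typedef union
{
    double value;
    struct
    {
        uint32_t msw;   /* most significant word: sign, exponent, high mantissa */
        uint32_t lsw;   /* least significant word of the mantissa */
    } parts;
} ieee_double_shape_type;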
A possible culprit causing an LHS stall in this macro is the sequence that stores the floating-point value into the union and then immediately reads it back into the integer variable i. The POWER7 processor has no native instruction to move the contents of a floating-point register, bit by bit, to a fixed-point register. The way this is accomplished on POWER is to store the FP number from the floating-point register to memory, and then load the same memory location into a fixed-point (general-purpose) register. Since memory access is slower than register operations (even when the access hits the L1 data cache), the CPU stalls while the store completes so that the subsequent load can proceed.
Note: The document, "POWER ISA 2.06 (POWER7)" (see Related topics), contains more information.
Most often, performance counter events trigger interrupts that record the
PC address of an instruction close to the one actually executing. This can
lead to assembly annotation that is not completely accurate. To mitigate
this behavior, POWER4 and later processors have a limited set of
performance counters called marked
events. Marked instructions generate fewer events
per time frame; however, the recorded PC is exact, resulting in
accurate assembly annotation. Marked events have the PM_MRK prefix in
the OProfile
counter list obtained with
opcontrol -l
.
To double-check the analysis, watch the PM_MRK_LSU_REJECT_LHS counter. Both counters, PM_MRK_LSU_REJECT_LHS and PM_LSU_REJECT_LHS, watch the same performance event, but the marked counter (PM_MRK_LSU_REJECT_LHS) generates fewer events per time frame with a more accurate assembly annotation. (See Listing 21.)
Listing 21. perf record of PM_MRK_LSU_REJECT_LHS POWER7 event
$ perf record -C 0 -e rd082 taskset -c 0 ./hypot_bench_glibc
$ perf report

Events: 256K raw 0xd082
64.61%  hypot_bench_gli  libm-2.12.so  [.] __ieee754_hypot
35.33%  hypot_bench_gli  libm-2.12.so  [.] __GI___finite
This generates the assembly annotation in Listing 22.
Listing 22. perf report of PM_MRK_LSU_REJECT_LHS POWER7 event
: 00000080fc38b730 <.__ieee754_hypot>:
[...]
 1.23 :  80fc38b7a8:   c9 a1 00 70     lfd     f13,112(r1)
 0.00 :  80fc38b7ac:   f8 01 00 70     std     r0,112(r1)
32.66 :  80fc38b7b0:   c8 01 00 70     lfd     f0,112(r1)
[...]
 0.00 :  80fc38b954:   f8 01 00 70     std     r0,112(r1)
 0.00 :  80fc38b958:   e8 0b 00 00     ld      r0,0(r11)
 0.00 :  80fc38b95c:   79 00 00 0e     rldimi  r0,r8,32,0
61.72 :  80fc38b960:   c9 61 00 70     lfd     f11,112(r1)
[...]
Another symbol, shown in Listing 23, accounts for about 35% of the generated events and exhibits similar behavior.
Listing 23. More highlights of the perf report
: 00000080fc3a2610 <.__finitel>:
  0.00 :  80fc3a2610:   d8 21 ff f0     stfd    f1,-16(r1)
100.00 :  80fc3a2614:   e8 01 ff f0     ld      r0,-16(r1)
Based on this information, your optimization effort might eliminate these stalls by removing the FP to INT conversions. The POWER processor has a fast and efficient floating-point execution unit, so there is no need to perform these computations with fixed-point instructions. The algorithm that POWER currently uses in GLIBC (sysdeps/powerpc/fpu/e_hypot.c) has removed all of the LHS stalls by using FP operations only. The result is the much simpler algorithm in Listing 24.
Listing 24. PowerPC GLIBC hypot source code
double __ieee754_hypot (double x, double y)
{
  x = fabs (x);
  y = fabs (y);

  TEST_INF_NAN (x, y);

  if (y > x)
    {
      double t = x;
      x = y;
      y = t;
    }
  if (y == 0.0 || (x / y) > two60)
    {
      return x + y;
    }
  if (x > two500)
    {
      x *= twoM600;
      y *= twoM600;
      return __ieee754_sqrt (x * x + y * y) / twoM600;
    }
  if (y < twoM500)
    {
      if (y <= pdnum)
        {
          x *= two1022;
          y *= two1022;
          return __ieee754_sqrt (x * x + y * y) / two1022;
        }
      else
        {
          x *= two600;
          y *= two600;
          return __ieee754_sqrt (x * x + y * y) / two600;
        }
    }
  return __ieee754_sqrt (x * x + y * y);
}
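The two60, two500, and similar names are scaling constants defined alongside the implementation; a sketch of plausible definitions follows (the values are implied by the names, but the exact value of pdnum should be checked against the GLIBC source):

/* Sketch only; the real definitions live next to the PowerPC e_hypot.c. */
static const double two60   = 0x1p+60;    /* 2^60   */
static const double two500  = 0x1p+500;   /* 2^500  */
static const double two600  = 0x1p+600;   /* 2^600  */
static const double two1022 = 0x1p+1022;  /* 2^1022 */
static const double twoM500 = 0x1p-500;   /* 2^-500 */
static const double twoM600 = 0x1p-600;   /* 2^-600 */
static const double pdnum   = 0x1p-1022;  /* assumed: smallest normal double */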
The TEST_INF_NAN
macro is a further small optimization that
tests whether a number is NaN or INFINITY before starting further FP
operations (operations on NaN and INFINITY can raise FP exceptions, and
the function specification does not allow that). On POWER7,
the isinf
and isnan
function calls are optimized
by the compiler into FP instructions and do not generate extra function
calls, while on older processors (POWER6 and older) they generate
calls to the respective functions. The optimization is basically the same
implementation, but inlined to avoid the function calls.
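GLIBC's actual TEST_INF_NAN lives in the PowerPC math headers; an illustrative sketch with the same intent, checking Inf before NaN because hypot must return +Inf even when the other argument is NaN, might be:

/* Illustrative sketch only; not the real GLIBC macro. Needs <math.h>. */
#define TEST_INF_NAN(x, y)                                   \
  do {                                                       \
    if (isinf (x) || isinf (y))                              \
      return INFINITY;       /* Inf wins, even over NaN */   \
    if (isnan (x) || isnan (y))                              \
      return (x) + (y);      /* propagate the NaN */         \
  } while (0)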
Finally, to compare both implementations, perform the following simple test: recompile GLIBC with and without the new algorithm, and compare the total time of each benchmark run. The default GLIBC implementation results are in Listing 25:
Listing 25. Benchmark with default GLIBC hypot
$ /usr/bin/time ./hypot_bench_glibc
INF_CASE     : elapsed time: 14:994339
NAN_CASE     : elapsed time: 14:707085
TWO60_CASE   : elapsed time: 12:983906
TWO500_CASE  : elapsed time: 10:589746
TWOM500_CASE : elapsed time: 11:215079
NORMAL_CASE  : elapsed time: 15:325237
79.80user 0.01system 1:19.81elapsed 99%CPU (0avgtext+0avgdata 151552maxresident)k
0inputs+0outputs (0major+48minor)pagefaults 0swaps
The optimized version results are in Listing 26:
Listing 26. Benchmark with optimized GLIBC hypot
$ /usr/bin/time ./hypot_bench_glibc
INF_CASE     : elapsed time: 4:667043
NAN_CASE     : elapsed time: 5:100940
TWO60_CASE   : elapsed time: 6:245313
TWO500_CASE  : elapsed time: 4:838627
TWOM500_CASE : elapsed time: 8:946053
NORMAL_CASE  : elapsed time: 6:245218
36.03user 0.00system 0:36.04elapsed 99%CPU (0avgtext+0avgdata 163840maxresident)k
0inputs+0outputs (0major+50minor)pagefaults 0swaps
This is a final performance improvement of more than 100% (79.80 versus 36.03 user seconds), cutting the benchmark time roughly in half.
Conclusion
Performance evaluation with hardware counter profiling is a powerful tool for understanding how a workload behaves on a given processor and for pointing out where to work on performance optimizations. The latest POWER7 processor has hundreds of performance counters available, so we presented a simple model that maps workload behavior to CPU stalls. Understanding the POWER7 CBM is somewhat complicated, so we also explained the Linux tools that simplify it. The strategies for performance evaluation focused on how to find hotspots, how to understand the memory access pattern of an application, and how to use the POWER7 CBM. Finally, we used a recent optimization of a math function within GLIBC to explain the performance analysis that led to the optimized code.
Appendix
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.
Downloadable resources
- PDF of this content
- GLIBC hypot benchmark (hypot_bench.tar.gz | 6KB)
- Python script to format perf output (power7_cbm.zip | 2KB)
Related topics
- Read more about the operation of processors in Modern Microprocessors - a 90 minute guide.
- Learn more about OProfile on the project website.
- Learn more about the perf tool—the code is maintained within the Linux kernel source.
- Read Identify performance bottlenecks with OProfile for Linux on POWER (John Engel, developerWorks, May 2005) for a comprehensive guide to OProfile (although not updated for POWER7).
- Explore the IBM Wiki Using perf on POWER7 systems, a comprehensive guide to perf.
- Find raw POWER7 codes for perf on the libpfm4 project site.
- Read the description of the SPECcpu2006 483.xalancbmk component.
- See the GNU C Library at the project page.