Using perf on POWER7

Updated 8/18/14, 5:05 PM by Bill_Buros

Aug 2014.   This article's relevancy and currency may need to be improved due to out-of-date information.   Please consider helping to improve the contents or ask questions on the DeveloperWorks Forums.


Leveraging the new performance events kernel subsystem (introduced in the 2.6.32 kernel), here are some hints and tips on taking advantage of "perf" when running on a POWER7 system.


Introduction to Perf Events

"PerfEvents" (performance events kernel subsystem for Linux) is an exciting new Linux feature that provides a framework for analyzing performance events at both hardware and software levels. The foundation that PerfEvents provides will support future layers of analysis tools such as perf, PAPI and exp.

The perf tool was developed in conjunction with the kernel subsystem to demonstrate the use of the new subsystem and its corresponding API.  The perf tool is easy to install and once installed, instant hardware/software counter analysis is available.

Distro Kernels Supported

PerfEvents support initially started with mainline 2.6.32 kernels. As such, PerfEvents should be available in follow-on SLES 11 service packs and the RHEL 6 release stream.

.config specifications

To ensure PerfEvents is enabled, several .config parameters should be verified (and enabled, with the kernel rebuilt, if not set).

CONFIG_PPC_PERF_CTRS=y       # should be set automatically (hidden from make menuconfig)
CONFIG_HAVE_PERF_EVENTS=y    # should be set automatically (hidden from make menuconfig)
CONFIG_EVENT_TRACING=y       # should be set automatically (hidden from make menuconfig)

Kernel Performance Events And Counters (General setup -> Kernel Performance Events and Counters)

CONFIG_PERF_EVENTS=y         # enables kernel support for performance counter registers
CONFIG_PERF_COUNTERS=n       # obsoleted by PERF_EVENTS (old config option)


The basic "perf" RPM must be installed to access the perf tool. Additional RPMs, perf-debuginfo and perf-debugsource, are available to facilitate debugging and performance analysis of perf itself.

Perf tool

Once the perf RPM is installed, the perf tool should be immediately available.

Simply running perf or perf --help validates accessibility and shows the options available.

% perf


 usage: perf [--version] [--help] COMMAND [ARGS]


 The most commonly used perf commands are:

   annotate        Read (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a file
   diff            Read two files and display the differential profile
   kmem            Tool to trace/measure kernel memory(slab) properties
   list            List all symbolic event types
   lock            Analyze lock events
   probe           Define new dynamic tracepoints
   record          Run a command and record its profile into
   report          Read (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   stat            Run a command and gather performance counter statistics
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           Read (created by perf record) and display trace output


 See 'perf help COMMAND' for more information on a specific command.

Building the perf tool from source

The perf tool resides within mainline kernels under the ../tools/perf directory, so a mainline kernel source tree will need to be downloaded. Once the kernel source is expanded, you can cd into the ../tools/perf directory and run make and make install. This installs the new perf into /root/bin by default (when run as root) and does not overwrite the distro-installed version.

Also be sure to cd into the ../tools/perf/Documentation directory and run make/make install so that the updated man pages are installed. That way when you run "/root/bin/perf <command> --help" the man page will display all the available options correctly. Note that you may have to install one of the optional distro packages such as asciidoc to complete the Documentation make.
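The build steps just described can be sketched as follows. The kernel source path below is an assumption for illustration (the article does not give one), and the script checks for the tree first so it is harmless elsewhere:

```shell
# Sketch of building perf and its man pages from a mainline kernel tree.
# KSRC is a placeholder path, not a value from the article.
KSRC="$HOME/linux-mainline"
if [ -d "$KSRC/tools/perf" ]; then
    cd "$KSRC/tools/perf" && make && make install    # installs perf to $HOME/bin
    cd Documentation && make && make install         # man pages (needs asciidoc)
    built="built from $KSRC"
else
    built="no kernel source tree at $KSRC; steps shown for reference only"
fi
echo "$built"
```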

Quick Test Drive

perf top can be used to quickly examine a running system.

% perf --version

perf version 0.0.2.PERF


% perf top
   PerfTop:   259 irqs/sec  kernel:94.6% [1000Hz cycles], (all, 8 CPUs)
  samples  pcnt function                               DSO
  ------- ----- -------------------------------------  ----------------------------

  1935.00 26.7% .account_system_vtime                  /boot/vmlinux-2.6.34-rc7         
  1855.00 25.6% __novmx__sigsetjmp_ent                 /lib64/power6/     
  1790.00 24.7% .account_system_time                   /boot/vmlinux-2.6.34-rc7         
   791.00 10.9% 00000012.plt_call.strcat@@GLIBC_2.3+0  /root/bin/perf                   
   175.00  2.4% .irq_exit                              /boot/vmlinux-2.6.34-rc7         
   168.00  2.3% .trace_hardirqs_off                    /boot/vmlinux-2.6.34-rc7         
    86.00  1.2% .acct_update_integrals                 /boot/vmlinux-2.6.34-rc7         
    59.00  0.8% 00000012.plt_call.strcmp@@GLIBC_2.3+0  /root/bin/perf                   
    47.00  0.6% .raw_local_irq_restore                 /boot/vmlinux-2.6.34-rc7         
    44.00  0.6% .clear_user_page                       /boot/vmlinux-2.6.34-rc7          
    38.00  0.5% .rcu_irq_exit                          /boot/vmlinux-2.6.34-rc7         
    37.00  0.5% .__percpu_counter_add                  /boot/vmlinux-2.6.34-rc7


Several Key Points

Per-process collection

Each perf command (top/record/stat) accepts either a command or a process ID (PID) specifier. Some commands also support system-wide collection from all CPUs (-a, --all-cpus). You can optionally specify a command to execute when doing a system-wide perf run, which has the effect of automatically stopping perf when the specified command completes; for example, perf stat -a sleep 10 collects system-wide counts for 10 seconds.
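The target-selection forms above can be sketched as follows. The PID and durations are placeholders, not values from the article, and the script is guarded so it degrades gracefully on systems without perf installed:

```shell
# Illustrative perf target-selection forms; PID 1234 and the sleep
# durations are placeholders.
if command -v perf >/dev/null 2>&1; then
    perf stat true >/dev/null 2>&1 || true   # profile one command until it exits
    # perf stat -p 1234 sleep 10             # attach to PID 1234 for 10 seconds
    # perf record -a sleep 10                # system-wide: all CPUs, stop after 10 s
    status="perf found"
else
    status="perf not installed; commands shown for reference only"
fi
echo "$status"
```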

Event Codes

The perf tool supports a collection of hardware and software event codes via the -e or --event <event> argument. The "perf list" command displays the list of pre-defined architecture-independent events. Below is a subset of the pre-defined events:

% perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  cache-references                           [Hardware event]
  cache-misses                               [Hardware event]
  branch-instructions OR branches            [Hardware event]
  branch-misses                              [Hardware event]
  bus-cycles                                 [Hardware event]

  cpu-clock                                  [Software event]
  task-clock                                 [Software event]
  page-faults OR faults                      [Software event]
  minor-faults                               [Software event]
  major-faults                               [Software event]
  context-switches OR cs                     [Software event]
  cpu-migrations OR migrations               [Software event]

  L1-dcache-loads                            [Hardware cache event]
  L1-dcache-load-misses                      [Hardware cache event]
  L1-dcache-stores                           [Hardware cache event]
  L1-dcache-store-misses                     [Hardware cache event]
  L1-dcache-prefetches                       [Hardware cache event]
  L1-dcache-prefetch-misses                  [Hardware cache event]
  L1-icache-loads                            [Hardware cache event]
  L1-icache-load-misses                      [Hardware cache event]
  L1-icache-prefetches                       [Hardware cache event]
  L1-icache-prefetch-misses                  [Hardware cache event]
  LLC-loads                                  [Hardware cache event]
  LLC-load-misses                            [Hardware cache event]
  LLC-stores                                 [Hardware cache event]
  LLC-store-misses                           [Hardware cache event]
  LLC-prefetches                             [Hardware cache event]
  LLC-prefetch-misses                        [Hardware cache event]
  dTLB-loads                                 [Hardware cache event]
  dTLB-load-misses                           [Hardware cache event]
  dTLB-stores                                [Hardware cache event]
  dTLB-store-misses                          [Hardware cache event]
  dTLB-prefetches                            [Hardware cache event]
  dTLB-prefetch-misses                       [Hardware cache event]
  iTLB-loads                                 [Hardware cache event]
  iTLB-load-misses                           [Hardware cache event]
  branch-loads                               [Hardware cache event]
  branch-load-misses                         [Hardware cache event]

Native events

The pre-defined events listed above were designed to be used across all architectures, allowing performance analysts to monitor certain common hardware phenomena (e.g., branch mis-predicts and cache misses) without needing to dig into processor-specific documentation.  Under the covers, these pre-defined events map to architecture-specific "native events".  For a given processor type, there are typically many more native events available than those which map to perf's pre-defined set. To profile or count such native events, the perf tool provides a "raw event" specifier you can pass on the command line; e.g., 'perf --event=r<NNN>', where <NNN> is a hex code that identifies the specific native event.

The rNNN raw events can be very useful for validating the pre-defined event mappings above, or for using hardware events beyond the pre-defined set. To use the rNNN raw events, you will need to obtain a utility called 'evt2raw'.  You can get this utility by downloading (via git) and building the latest libpfm4 library.

git clone git://


cd libpfm4


The ../libpfm4/perf_examples/evt2raw program will display the rNNN hex code for a given event; for example:

% /root/libpfm4/perf_examples/evt2raw PM_MRK_DATA_FROM_RMEM

The ../libpfm4/examples/showevtinfo program can be used to list all of the hardware counters available to perf. Here, we show the start of the list on a POWER7 machine.

% /root/libpfm4/examples/showevtinfo
Supported PMU models:
        [43, ppc970, "PPC970"]
        [44, ppc970mp, "PPC970MP"]
        [46, power4, "POWER4"]
        [47, power5, "POWER5"]
        [48, power5p, "POWER5+"]
        [49, power6, "POWER6"]
        [50, power7, "POWER7"]
        [51, perf, "perf_events generic PMU"]
        [67, power_torrent, "IBM Power Torrent PMU"]
Detected PMU models:
        [50, power7, "POWER7", 547 events, 1 max encoding, 6 counters, core PMU]
        [51, perf, "perf_events generic PMU", 99 events, 1 max encoding, 0 counters, OS generic PMU]
Total events: 3023 available, 646 supported
IDX      : 104857600
PMU name : power7 (POWER7)
Equiv    : None
Flags    : None
Desc     :  L2 I cache demand request due to BHT or redirect
Code     : 0x4898
IDX      : 104857601
PMU name : power7 (POWER7)
Name     : PM_GCT_UTIL_7_TO_10_SLOTS
Equiv    : None
Flags    : None
Desc     : GCT Utilization 7-10 entries
Code     : 0x20a0
IDX      : 104857602
PMU name : power7 (POWER7)
Name     : PM_PMC2_SAVED
Equiv    : None
Flags    : None
Desc     : PMC2 was counting speculatively. The speculative condition was met and the counter value was committed by copying it to the backup register.
Code     : 0x10022
#----------------------------- ...... skipped

To use the raw counters, substitute the value of the "Code" field as the rNNN code. For example, to show perf top activity for the hardware counter PM_VMX1_STALL, use rb008c.

% perf top -e rb008c


Perf Commands

In addition to perf top and perf list, other commands are available, such as record, annotate, stat, trace, etc. Several of the commands, such as top and stat, produce immediate output. The record command creates a raw output file (the default filename is, and other commands, such as annotate, report, sched and trace, use that raw output file as input. This section only provides an introductory description of the most commonly used perf commands; users should check the manual page for each command for more detailed information. (The workload of a perf command can be any arbitrary application. We pick the simple command sleep 11 as the example workload in this section.)


Perf Stat

The perf stat command provides a default report of task-clock-msecs, context-switches, page-faults, etc.

% perf stat sleep 11

 Performance counter stats for 'sleep 11':

       1.249874  task-clock-msecs         #      0.000 CPUs
              1  context-switches         #      0.001 M/sec
              0  CPU-migrations           #      0.000 M/sec
             60  page-faults              #      0.048 M/sec
        4937791  cycles                   #   3950.631 M/sec
        1911340  instructions             #      0.387 IPC 
         401690  branches                 #    321.384 M/sec
          35173  branch-misses            #      8.756 %   
         453074  cache-references         #    362.496 M/sec
          13110  cache-misses             #     10.489 M/sec

   11.001584298  seconds time elapsed

If specific perf events are requested (-e), only those events are reported. (The all-cpus parameter (-a) requests system-wide data collection from all CPUs.)

% perf stat -a -e L1-dcache-loads sleep 11

 Performance counter stats for 'sleep 11':

         692431  L1-dcache-loads        

   11.001530314  seconds time elapsed


Perf Record

The command perf record gathers a performance counter profile and writes the profile data to the file, without displaying anything. The file can then be inspected later using perf report. (Other perf commands also read the event data and create output according to their own functionality.)

For example, the following command runs sleep 11 and gathers cache-misses events from all CPUs:

-a: system-wide collection from all CPUs
-f: overwrite the default output file
-g: enable call-graph recording
-e: select the target event (use perf list to list available events)

% perf record -f -a -e cache-misses sleep 11
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB (~4130 samples) ]


Perf Report

Use perf report to output an analysis of The report output includes the command, object, and function for the target executable. For example, the following command produces a sorted report of the symbols that consumed the most time during the sleep 11 recording above. (In this example, roughly 46% of the samples are attributed to the perf command itself.)

% perf report
# Samples: 1977
# Overhead  Command  Shared Object            Symbol
# ........  .......  ................         ......
 7.28%      perf                  [.] __novmx__sigsetjmp_ent
 6.17%      init                              f1d138 [H] 0x00000000f1d138
 2.43%      perf                    8000000000ef06ec [H] 0x8000000000ef06ec
 2.02%      perf  [kernel.kallsyms]                  [k] .find_vma
 1.87%      perf  [kernel.kallsyms]                  [k] .show_map_vma
 1.77%      perf  [kernel.kallsyms]                  [k] .kmem_cache_alloc
 1.52%      swapper                          17a0e0  [H] 0x0000000017a0e0
 1.47%      perf  perf                               [.] 00000012.plt_call.strcat@@GLIBC_2.3+0
 1.47%      perf  [kernel.kallsyms]                  [k] .generic_getxattr
 1.42%      perf  [kernel.kallsyms]                  [k] .proc_pid_status
 1.32%      perf  [kernel.kallsyms]                  [k] .kmem_cache_free
 1.32%      perf  [kernel.kallsyms]                  [k] .__block_prepare_write



If you used the call-graph (-g) option with the previous perf record command, then you can see the cache-misses event info on a call-graph:

% perf report -g
# Samples: 2139
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................  .......
 6.78%         perf      [.] __novmx__sigsetjmp_ent
     |          |         
     |          |--81.82%--__novmx__sigsetjmp_ent
     |          |          |         
     |          |          |--61.11%--__novmx__sigsetjmp_ent
     |          |          |          |         
     |          |          |          |--54.55%--__novmx__sigsetjmp_ent
     |          |          |          |          |         
     |          |          |          |          |--50.00%--00000012.plt_call.strcat@@GLIBC_2.3+0
     |          |          |          |          |          00000012.plt_call.strcat@@GLIBC_2.3+0
     |          |          |          |          |          00000012.plt_call.strcat@@GLIBC_2.3+0
     |          |          |          |          |          00000012.plt_call.strcat@@GLIBC_2.3+0




Perf Annotate

perf annotate reads the input file and displays an annotated version of the code. If the object file has debug symbols, then the source code is displayed alongside the assembly code. If there is no debug info in the object, only annotated assembly is displayed.

% perf annotate -s .kmem_cache_alloc

 Percent |      Source code & Disassembly of vmlinux-2.6.34-rc7
         :      Disassembly of section .text:
         :      c000000000176f54 <.kmem_cache_alloc>:
    0.00 :      c000000000176f54:       7c 08 02 a6     mflr    r0
    0.00 :      c000000000176f58:       f8 01 00 10     std     r0,16(r1)
    0.00 :      c000000000176f5c:       fb 01 ff c0     std     r24,-64(r1)
    0.00 :      c000000000176f60:       fb 21 ff c8     std     r25,-56(r1)
    0.00 :      c000000000176f64:       fb 41 ff d0     std     r26,-48(r1)
    0.00 :      c000000000176f68:       fb 61 ff d8     std     r27,-40(r1)
    0.00 :      c000000000176f6c:       fb 81 ff e0     std     r28,-32(r1)
    0.00 :      c000000000176f74:       fb c1 ff f0     std     r30,-16(r1)

(This shows that 0% of the sampled cache misses came from these instructions.)


Perf Sched

perf sched is a family of tools that uses performance events to objectively characterize arbitrary workloads from a scheduling and latency point of view. perf sched currently has five sub-commands:

perf sched record            # low-overhead recording of arbitrary workloads
perf sched latency           # output per task latency metrics
perf sched map               # show summary/map of context-switching
perf sched trace             # output the fine grained trace
perf sched replay            # replay a captured workload using simulated threads

The following shows a perf sched example usage:

% perf sched record -f -a sleep 11

% perf sched latency
     Task        | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay at         |
  events/1:28    |   0.006 ms |        1 | avg:0.021 ms | max:0.021 ms | max at: 61313.025873 s
  kjournald:4034 |   0.149 ms |        3 | avg:0.020 ms | max:0.020 ms | max at: 61308.995880 s
  events/2:29    |   0.008 ms |        1 | avg:0.020 ms | max:0.020 ms | max at: 61308.995869 s
  events/6:33    |   0.030 ms |        1 | avg:0.018 ms | max:0.018 ms | max at: 61312.875868 s
  events/0:27    |   0.029 ms |        4 | avg:0.018 ms | max:0.020 ms | max at: 61311.665870 s
  events/5:32    |   0.024 ms |        1 | avg:0.018 ms | max:0.018 ms | max at: 61304.905894 s
  perf:30802     |  20.890 ms |        1 | avg:0.013 ms | max:0.013 ms | max at: 61314.416794 s
  flush-8:0:4040 |   0.012 ms |        2 | avg:0.013 ms | max:0.013 ms | max at: 61307.165882 s
  TOTAL:         |  23.713 ms |       35 |


The perf sched map can interpret and analyze the events in, and can be used as a tool to check the system-wide workload balance. For example, the following command prints an abbreviated text map of scheduling events on an 8-CPU box:

The 8 columns stand for individual CPUs, from CPU0 to CPU7.

A dot means an idle CPU.

The two-letter shortcuts stand for tasks that are running on a CPU.

A '*' means that the CPU has the event.

The annotation at the right shows, the first time a task occurs, the mapping from its shortcut to the task name and PID.

% perf sched record -f -a sleep 11

% perf sched map
  *M0          .   .   .   .   .        61308.055860 secs M0 => sync_supers:1174
  *H0          .   .   .   .   .        61308.055866 secs
  *.           .   .   .   .   .        61308.055874 secs
   .  *N0      .   .   .   .   .        61308.372623 secs N0 => init:1
   .  *.       .   .   .   .   .        61308.372643 secs
   .   .  *O0  .   .   .   .   .        61308.995869 secs O0 => events/2:29
   .   .   O0  .  *P0  .   .   .        61308.995880 secs P0 => kjournald:4034
   .   .  *.   .   P0  .   .   .        61308.995881 secs
   .   .   .   .  *.   .   .   .        61308.995963 secs
   .   .   .   .  *Q0  .   .   .        61309.007010 secs Q0 => kblockd/4:1181
   .   .   .   .  *P0  .   .   .        61309.007016 secs


The perf sched trace command prints out the fine-grained tracepoint information for the workload:

% perf sched record -f -a sleep 11

% perf sched trace
perf-31331 [002] 795.70:sched_stat_runtime:comm=perf pid=31331 runtime=2926294[ns]
    init-0 [001] 796.34:sched_stat_sleep:comm=events/1 pid=28 delay=1629990220[ns]
    init-0 [001] 796.36:sched_wakeup:comm=events/1 pid=28 prio=120 success=1 target_cpu=001
    init-0 [001] 796.42:sched_stat_wait:comm=events/1 pid=28 delay=9780[ns]


The perf sched replay command starts a scheduling workload 'simulator', which takes the recorded events and turns them into sleep/run patterns of user-space threads, which are then executed as if the real workload were running:

% perf sched replay
run measurement overhead: 250 nsecs
sleep measurement overhead: 54188 nsecs
the run test took 999968 nsecs
the sleep test took 1063229 nsecs
nr_run_events:        59
nr_sleep_events:      78
nr_wakeup_events:     34
task      0 (           <unknown>:         0), nr_events: 81
task      1 (    igned char commo:        28), nr_events: 5
task      2 (           <unknown>:     31331), nr_events: 4
task      3 (    4a00 T program_c:     31332), nr_events: 9
task      4 (    , REC->prio, REC:        12), nr_events: 4
task      5 (    0000083a4 T .sta:        10), nr_events: 3
task      6 (    c000000000005100:         1), nr_events: 7



​Perf Bench

perf bench is a family of tools that provides a general framework for benchmark suites. Currently perf bench has the following sub-commands:

perf bench sched            # scheduler and IPC mechanism
     - messaging:             Benchmark for scheduler and IPC mechanisms
     - pipe:                  Flood of communication over pipe() between two processes
     - all:                   test all suite (pseudo suite)

perf bench mem              # memory access performance
     - memcpy:                Simple memory copy in various ways
     - all:                   test all suite (pseudo suite)

perf bench all              # test all subsystem (pseudo subsystem)

Here is a quick way to run all available benchmark suites:

% perf bench all

# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.300 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 11.102 [sec]

      11.102111 usecs/op
          90072 ops/sec

# Running mem/memcpy benchmark...
# Copying 1MB Bytes from 0xfff92cd0010 to 0xfff92de0010 ...

       4.628258 GB/Sec


Perf Diff

The perf diff command reads two files and displays the differential profiles. It is useful for showing the delta percentage between two different recordings for the same symbol. The following example shows how to run perf diff for the workload sleep 11.

% perf record -o perf.data1 -f -e cycles sleep 11

% perf record -o perf.data2 -f -e instructions sleep 11

% perf report -i perf.data1
# Samples: 1001
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
    47.95%  sleep  [kernel.kallsyms]    [k] .__percpu_counter_add
    35.76%  sleep  [kernel.kallsyms]    [k] .account_system_time
     9.49%  sleep  [kernel.kallsyms]    [k] .acct_update_integrals
     5.29%  sleep  [kernel.kallsyms]    [k] .trace_hardirqs_off
     1.10%  sleep  [kernel.kallsyms]    [k] .account_system_vtime
     0.10%  sleep  [kernel.kallsyms]    [k] .write_mmcr0


% perf report -i perf.data2
# Samples: 1001
# Overhead  Command      Shared Object  Symbol
# ........  .......  .................  ......
    79.62%  sleep  [kernel.kallsyms]    [k] .idle_cpu
    13.19%  sleep  [kernel.kallsyms]    [k] .rcu_irq_exit
     4.40%  sleep  [kernel.kallsyms]    [k] .irq_exit
     0.70%  sleep  [kernel.kallsyms]    [k] .trace_hardirqs_off
     0.60%  sleep  [kernel.kallsyms]    [k] .account_system_vtime
     0.50%  sleep  [kernel.kallsyms]    [k] .acct_update_integrals
     0.30%  sleep  [kernel.kallsyms]    [k] .write_mmcr0
     0.30%  sleep  [kernel.kallsyms]    [k] .account_system_time
     0.20%  sleep  [kernel.kallsyms]    [k] .raw_local_irq_restore
     0.10%  sleep  [kernel.kallsyms]    [k] .jiffies_to_timeval
     0.10%  sleep  [kernel.kallsyms]    [k] .__percpu_counter_add

% perf diff perf.data1 perf.data2
# Baseline      Delta      Shared Object  Symbol
# ........ ..........  .................  ......
     0.00%    +79.62%  [kernel.kallsyms]  [k] .idle_cpu
     0.00%    +13.19%  [kernel.kallsyms]  [k] .rcu_irq_exit
     0.00%     +4.40%  [kernel.kallsyms]  [k] .irq_exit
     5.31%     -4.61%  [kernel.kallsyms]  [k] .trace_hardirqs_off
     1.10%     -0.50%  [kernel.kallsyms]  [k] .account_system_vtime
     9.52%     -9.02%  [kernel.kallsyms]  [k] .acct_update_integrals
    35.87%    -35.57%  [kernel.kallsyms]  [k] .account_system_time
     0.10%     +0.20%  [kernel.kallsyms]  [k] .write_mmcr0
     0.00%     +0.20%  [kernel.kallsyms]  [k] .raw_local_irq_restore
    48.10%    -48.00%  [kernel.kallsyms]  [k] .__percpu_counter_add
     0.00%     +0.10%  [kernel.kallsyms]  [k] .jiffies_to_timeval

% perf diff perf.data2 perf.data1
# Baseline      Delta      Shared Object  Symbol
# ........ ..........  .................  ......
     0.10%    +48.00%  [kernel.kallsyms]  [k] .__percpu_counter_add
     0.30%    +35.57%  [kernel.kallsyms]  [k] .account_system_time
     0.50%     +9.02%  [kernel.kallsyms]  [k] .acct_update_integrals
     0.70%     +4.61%  [kernel.kallsyms]  [k] .trace_hardirqs_off
     0.60%     +0.50%  [kernel.kallsyms]  [k] .account_system_vtime
     0.30%     -0.20%  [kernel.kallsyms]  [k] .write_mmcr0


Perf Kmem

The perf kmem command can be used to trace and measure kernel memory (slab) properties.

First, execute perf kmem record to record the kmem events of an arbitrary workload into a data file, without displaying anything. For example, the following command sorts by the bytes key (-s bytes) and records the workload sleep 11 into the default file.

% perf kmem -s bytes record sleep 11

Then you can use perf kmem stat to get the kernel memory statistics:

% perf kmem stat

Total bytes requested: 65495393
Total bytes allocated: 65632424
Total bytes wasted on internal fragmentation: 137031
Internal fragmentation: 0.208786%
Cross CPU allocations: 135/1616247


​Perf Timechart

The perf timechart command can be used to create an SVG output file showing how CPU cycles and I/O wait times are distributed across the processes in the system over time.

First, execute perf timechart record to gather a performance counter profile and write it into a data file, without displaying anything. For example, the following command records system-wide cache-misses events (-a -e) from all CPUs for the workload sleep 11, and overwrites (-f) the data in the default file.

% perf timechart record -f -a -e cache-misses sleep 11

Then you can use perf timechart to create an SVG output file, result.svg:

% perf timechart -o result.svg

Finally, use scp or an equivalent to copy the file to another system for viewing. Most browsers are able to display the SVG image; if not, download an SVG viewer to see the image.


Perf Trace

The perf trace command reads the file (created by perf record) and displays the trace output.

The following command lists the scheduler tracepoint events available:

# perf list 2>&1 | grep sched
  sched:sched_switch                         [Tracepoint event]
  sched:sched_stat_runtime                   [Tracepoint event]
  sched:sched_stat_iowait                    [Tracepoint event]
  sched:sched_stat_sleep                     [Tracepoint event]
  sched:sched_stat_wait                      [Tracepoint event]
  sched:sched_process_fork                   [Tracepoint event]
  sched:sched_process_wait                   [Tracepoint event]
  sched:sched_process_exit                   [Tracepoint event]
  sched:sched_process_free                   [Tracepoint event]
  sched:sched_migrate_task                   [Tracepoint event]
  sched:sched_wakeup_new                     [Tracepoint event]
  sched:sched_wakeup                         [Tracepoint event]
  sched:sched_wait_task                      [Tracepoint event]
  sched:sched_kthread_stop_ret               [Tracepoint event]
  sched:sched_kthread_stop                   [Tracepoint event]


For example, the following commands generate sched_stat_sleep info for the workload sleep 11. (Note that you must use the -R option to collect raw sample records from all opened counters.)

% perf record -R -f -a -e sched:sched_stat_sleep sleep 11

% perf trace
      init-0 [003] sched_stat_sleep: comm=events/3    pid=30     delay=9962210    [ns]
 events/3-30 [003] sched_stat_sleep: comm=sshd        pid=7652   delay=21484064   [ns]
  perf-14714 [006] sched_stat_sleep: comm=perf        pid=14715  delay=31317730   [ns]
   swapper-0 [000] sched_stat_sleep: comm=events/0    pid=27     delay=2529988594 [ns]
   swapper-0 [000] sched_stat_sleep: comm=sync_supers pid=1174   delay=6000010064 [ns]
      init-0 [001] sched_stat_sleep: comm=init        pid=1      delay=5005011138 [ns]
      init-0 [001] sched_stat_sleep: comm=events/1    pid=28     delay=9999991086 [ns]
      init-0 [002] sched_stat_sleep: comm=events/2    pid=29     delay=5468738094 [ns]
      init-0 [002] sched_stat_sleep: comm=kjournald   pid=4034   delay=7955082500 [ns]