CPI analysis on POWER5, Part 1: Tools for measuring performance

An introduction to software resources

Duc Vianney, Alex Mericas, Bill Maron, Thomas Chen, Steve Kunkel, and Bret Olszewski of the IBM Systems & Technology Group Systems Performance team contributed to this article.

Summary:  This article begins a short series on workload performance analysis on Power Architecture™ systems. Part 1 introduces the CPU feature set and a variety of useful tools for collecting data.

Date:  04 Apr 2006
Level:  Introductory

Cycles per instruction (CPI) is a basic metric for analyzing the performance of a workload. CPI is defined as the number of processor clock cycles needed to complete an instruction, and is calculated as CPI = Total Cycles / Number of Instructions Completed. A high CPI value usually implies underutilization of machine resources.
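
As a minimal sketch of the formula above, the following shell function divides a cycle count by an instruction count. The function name cpi and the sample counts are illustrative assumptions, not values taken from a real measurement.

```shell
#!/bin/sh
# cpi CYCLES INSTRUCTIONS
# Prints CPI = Total Cycles / Number of Instructions Completed.
# "cpi" is a hypothetical helper name used only for this illustration.
cpi() {
    awk -v cycles="$1" -v instr="$2" 'BEGIN {
        if (instr <= 0) exit 1          # avoid dividing by zero
        printf "%.3f\n", cycles / instr
    }'
}

# Illustrative counts only: 1,200,000 cycles for 800,000 instructions.
cpi 1200000 800000    # prints 1.500
```

A CPI of 1.5 here means that, on average, each completed instruction cost one and a half cycles.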

Profiling is a common approach to collecting timing and resource utilization data for a workload. Profiling allows you to identify hotspots in code and data, find performance-sensitive areas, and identify problem instructions, data areas, or both. The POWER5™ microprocessor, like many of the PowerPC® processors that immediately preceded it, provides on-chip performance monitor units (PMUs) to record performance events through six performance monitor counters (PMCs), two of which are dedicated to counting PowerPC instructions completed and total cycles for any given event. As a result, with an appropriate set of performance monitor application programming interfaces (PMAPIs) designed to provide access to those PMCs, you can profile many performance-sensitive events related to the core or the memory subsystem, as well as the effects of simultaneous multithreading (SMT) and symmetric multiprocessing (SMP) on your workload. The data is extensive and informative, since the POWER5 supports approximately 900 total events, 500 unique events, and 230 events per counter.

On a POWER5 system, you can break down your workload's CPI into individual components. The POWER5 has several programmable counters whose events can be used to calculate each component of CPI, which helps you determine how to improve performance on a given workload.

This series introduces a CPI breakdown model to help you analyze where your workload spends its processing cycles as it progresses through the core resources, and what penalties it incurs when it encounters performance inhibitors. We give an overview of the POWER5 architecture, discuss the POWER5 performance monitoring facilities and the performance events and profilers that are available on Linux® and AIX®, and show how to construct the CPI stack for your workload using the pmcount command. Finally, you'll see an example showing how to use the CPI breakdown model to identify and improve a performance problem.

This article reviews the performance tools available on POWER5 systems. After you've collected the profiling data, you can analyze the performance of your workload using the CPI (cycles per instruction) breakdown approach, which is the topic of the next article in this series.

POWER5 architecture

The POWER5 processor chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.9MB level-2 (L2) cache and controls, a 36MB level-3 (L3) cache directory and controls, and the fabric controller that controls the flow of information and control data between the L2 and L3 and between chips. The L3 cache is a victim cache of the L2 cache. Each core contains a 64KB level-1 (L1) instruction cache (Icache), a 32KB level-1 data cache (Dcache), two fixed-point execution units (FXU - FX Exec Unit), two floating-point execution units (FPU - FP Exec Unit), two load/store execution units (LDU - LD Exec Unit), one branch execution unit (BRU - BR Exec Unit), and one execution unit to perform logical operations on the condition register (CRU - CR Exec Unit). Instructions are dispatched in groups in program order and are issued out of program order to the execution units, with a bias towards oldest operations first. A group contains up to five internal instructions (IOPs), and the last instruction is always the branch instruction (or a no-op if there is no branch instruction). Only one group of instructions can be dispatched in a cycle, and all instructions in a group are dispatched together.

The 64-bit POWER5 core is a speculative, out-of-order execution core coupled with a multilevel storage hierarchy. Figure 1 shows an overview of the POWER5 pipeline organization. The pipeline structure comprises a master pipeline and several execution unit pipelines, all of which can progress independently from each other. As discussed in the paper "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" (see Resources), the master pipeline presents speculative and in-order instructions to the mapping, sequencing, and dispatch function, and ensures an orderly completion of the real execution path. It throws away any potential speculative results associated with mispredicted branches. The execution unit pipelines allow out-of-order issuing of speculative and non-speculative instructions.

In general, up to eight instructions are fetched from the instruction cache (I cache) and fed into the pipeline to go through various pipeline stages until completion. In each cycle, up to five instructions are pulled from the instruction fetch buffer (Inst Queue) and sent through the three-stage instruction decode pipeline to form a dispatch group. Complex instructions are cracked into internal operations (IOPs) to allow for simpler inner core dataflow.

From dispatch to completion, instructions are tracked in groups. Each group has an entry in the Global Completion Table (GCT). A group can contain up to five IOPs. During dispatch, various machine resources are assigned to instructions. These resources are:

  • GCT: one entry per group
  • FXU Issue queue: one entry per Fixed point or load/store IOP
  • FPU Issue queue: one entry per Floating Point IOP
  • BRU Issue queue: one entry per branch IOP
  • CRU Issue queue: one entry per CR IOP
  • mappers: one entry per new destination (either GPR, FPR, XER, CR, Link/CNT)
  • LRQ: one entry per load IOP
  • SRQ: one entry per store IOP

Figure 1. POWER5 Pipeline Structure

LRQ is the Load Reorder Queue, a 32-entry queue which holds real addresses and tracks the order of loads. SRQ is the Store Reorder Queue, a 32-entry queue which tracks all stores active in the LSU. In Figure 1, IFAR represents the Instruction Fetch Address Register.

The following machine resources are released at various stages:

  • GCT: when the group completes, or in other words, when all instructions in the oldest group finish their execution
  • Issue queue: after the instructions have been successfully issued
  • mappers: when the group completes
  • LRQ: when the group completes
  • SRQ: when the store is sent to the storage subsystem

The branch prediction logic BR Scan scans all the fetched instructions looking for up to two branches per cycle. Depending upon the branch type found, various branch prediction mechanisms in Branch BR engage to help predict the target address of the branch or the branch direction or both.


POWER5 Performance Monitor

The POWER5 microprocessor, like its predecessors in the PowerPC line, provides Performance Monitor Unit (PMU) counters and a number of PMCs to monitor and record several performance events. The PowerPC 604 had two PMCs while the 604e had four. The POWER3™, POWER4™, and PowerPC 970 had eight PMCs. The POWER5 has six PMCs per thread, for a total of 12 per CPU to support the implementation of SMT. The number of supporting events has also increased over the years. The PowerPC 604e had 128 total events, 100 unique events, and 32 events per counter. The POWER4 had 900 total events, 300 unique events, and 115 events per counter. Meanwhile, the POWER5 has 900 total events, 500 unique events, and 230 events per counter (see Resources for more information).

Monitoring an event on a POWER5 processor is a challenging task because of the speculative, out-of-order execution of the core and the instruction groupings. At any clock cycle, you have to handle a typical situation of five instructions/group, 20 groups past dispatch, 32 outstanding loads, 16 outstanding misses, two independent threads, and the decoupled nest/core. As an example, analysis of stall conditions on the POWER5 processor might be complicated because the processor speculatively executes instructions out of order and completes instructions in groups. With up to five instructions per group, any one of the instructions could block completion, and multiple conditions could block any single instruction (a translation miss followed by a cache miss, for example). Even if stall conditions could be measured on a per instruction (or even per group) basis, that information by itself is not sufficient to break down the completion stall component because speculation might cause the instructions to be discarded before they complete. Until a stall condition has resolved, it might not be possible to know its fundamental cause.

To measure these speculative events, the POWER5 PMU has speculative counters. Two of the programmable counters have backup registers that are not accessible to software. When software writes an initial value to the counter, its backup register is also written. A counter can be configured for a particular stall condition and begins counting speculatively on any cycle when no group is completing. The first group that completes will report the last condition that held its completion. If the condition matches what the counter is configured for, the count value is committed by updating the backup register. If it does not match, the counter is rewound to its previously check-pointed value.
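
The commit-or-rewind behavior of a speculative counter can be sketched as a toy model in shell. This is only an illustration of the description above, not the hardware's actual logic: the function name, the event tokens (stall, complete:cause), and the cause names dcache_miss and fxu are all assumptions made up for the example.

```shell
#!/bin/sh
# Toy model of one speculative counter with a backup register.
# Usage: simulate_spec_counter CONFIGURED_CAUSE EVENT...
#   stall           a cycle in which no group completes
#   complete:CAUSE  a group completes and reports the last condition
#                   that held up its completion
# All names here are hypothetical and exist only for this sketch.
simulate_spec_counter() {
    configured=$1
    shift
    counter=0     # value that counts speculatively
    backup=0      # check-pointed value, invisible to software
    for ev in "$@"; do
        case $ev in
            stall)
                counter=$((counter + 1))
                ;;
            complete:*)
                cause=${ev#complete:}
                if [ "$cause" = "$configured" ]; then
                    backup=$counter      # match: commit the count
                else
                    counter=$backup      # mismatch: rewind to checkpoint
                fi
                ;;
        esac
    done
    echo "$counter"
}

# Two stall cycles are committed, three are rewound, one more is committed.
simulate_spec_counter dcache_miss \
    stall stall complete:dcache_miss \
    stall stall stall complete:fxu \
    stall complete:dcache_miss          # prints 3
```

In this run, the three stall cycles that ended with a non-matching cause are discarded, so only cycles attributable to the configured condition remain in the count.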

Several profiling tools can collect this data on a POWER5 system. If you are running Linux on POWER™ (LoP), you can use OProfile, the system-wide statistical profiler for Linux. If you are running AIX, you have many other performance tools developed to leverage the PMAPIs, such as pmlist, tcount, tprof, and the two hardware performance monitor (hpm) tools, hpmstat and hpmcount. Lastly, you can always use the pmcount tool, which runs on both LoP and AIX.

Profiling in Linux on POWER

In Linux, the most popular profiling tool is OProfile (see Resources), a system-wide profiler for Linux systems running on the Intel®, PowerPC, AMD, sparc64, and PA-RISC platforms. The LTC Toolchain team added support for POWER4, PowerPC 970, and POWER5 counters to OProfile in release 0.8.2. RPMs with LoP support are included in SLES 9 SP1 and RHEL 4 U1. John Engel describes OProfile and how to use it to identify performance bottlenecks for LoP in an article published in IBM developerWorks (see Resources). Listing 1 shows a simple shell script for running OProfile.


Listing 1. Invoking OProfile

#!/bin/ksh
# Oprofile run script
opcontrol --init
# Loads the oprofile driver and oprofilefs
opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.97-pseries64
# Specifies the vmlinux kernel image
opcontrol --reset
# Clears current session data
opcontrol --event=CYCLES:1000
# Takes a sample for every 1000th hardware event
opcontrol --verbose --start
# Starts profiling
# Run the application you want to profile here
opcontrol --dump
# Flushes the collected profiling data to the oprofile daemon
opcontrol --stop
# Stops the profiling data collection
opcontrol --shutdown
# Stops data collection and removes the daemon
 

Running the program opreport generates a profiling report with a list of all symbols. The following command produces a report directed to the file named ammp.sym_out:

opreport --symbols > ammp.sym_out

The contents of ammp.sym_out can be displayed as follows:


Listing 2. The profiling report

more ammp.sym_out

CPU: ppc64 POWER5, speed 1656 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask)
 count 1200000
samples  %        image name               app name                 symbol name
90896    63.5556  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      mm_fv_update_nonbon
10219     7.1453  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      f_nonbon
7531      5.2658  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      atom
7432      5.1965  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      __vsqrt_GP
6614      4.6246  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      a_m_serial
5582      3.9030  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      tpac
4213      2.9458  libm.so.6                ammp_base.SLES9-RC5      __sinl
1905      1.3320  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      f_torsion
1636      1.1439  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      __vrec_GP
1342      0.9383  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      f_box
1219      0.8523  libm.so.6                ammp_base.SLES9-RC5      __ieee754_acos
1024      0.7160  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      f_bond
793       0.5545  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      fv_update_nonbon
652       0.4559  libm.so.6                ammp_base.SLES9-RC5      __acosl
278       0.1944  libc.so.6                ammp_base.SLES9-RC5      getc
.
.
.
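
A report like this one is easy to post-process. The sketch below, which assumes the column layout shown in Listing 2 (sample count first, percentage second, symbol name last), lists every symbol at or above a chosen percentage. The function name hot_symbols and the 5% threshold are illustrative choices.

```shell
#!/bin/sh
# hot_symbols MIN_PERCENT
# Reads an opreport symbol report on stdin and prints "symbol percent"
# for each row whose percentage is at least MIN_PERCENT.
# Assumes the layout of Listing 2; "hot_symbols" is a hypothetical name.
hot_symbols() {
    awk -v min="$1" '
        # Data rows start with an integer sample count; header rows do not.
        $1 ~ /^[0-9]+$/ && $2 + 0 >= min + 0 { print $NF, $2 }'
}

# A few rows copied from Listing 2 serve as sample input.
hot_symbols 5 <<'EOF'
samples  %        image name               app name                 symbol name
90896    63.5556  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      mm_fv_update_nonbon
10219     7.1453  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      f_nonbon
7531      5.2658  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      atom
7432      5.1965  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      __vsqrt_GP
6614      4.6246  ammp_base.SLES9-RC5      ammp_base.SLES9-RC5      a_m_serial
EOF
```

With this input, the function reports mm_fv_update_nonbon, f_nonbon, atom, and __vsqrt_GP, which together account for most of the sampled cycles.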

The following command generates a profiling report with full path names:

opreport --long-filenames > ammp.filenames_out

The contents of ammp.filenames_out can be displayed similarly:


Listing 3. Profiling report, with path names

more ammp.filenames_out

CPU: ppc64 POWER5, speed 1656 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask)
 count 1200000
   CYCLES:1200000|
  samples|      %|
------------------
   142609 99.7140 /benchmarks/speccpu/speccpu/benchspec/CFP2000/188.ammp/run/00000014/ammp_base.SLES9-RC5
           CYCLES:1200000|
          samples|      %|
        ------------------
           135282 94.8622 /benchmarks/speccpu/speccpu/benchspec/CFP2000/188.ammp/run/00000014/ammp_base.SLES9-RC5
             6339  4.4450 /lib/tls/libm.so.6
              627  0.4397 /lib/tls/libc.so.6
              330  0.2314 /boot/vmlinux
                9  0.0063 /e1000
                8  0.0056 /ohci_hcd
                5  0.0035 /ehci_hcd
                4  0.0028 /usbcore
                3  0.0021 /scsi_mod
                2  0.0014 /ipr
      129  0.0902 /usr/bin/oprofiled
           CYCLES:1200000|
          samples|      %|
        ------------------
              105 81.3953 /boot/vmlinux
               22 17.0543 /usr/bin/oprofiled
                2  1.5504 /lib/tls/libc.so.6
       80  0.0559 /bin/bash
           CYCLES:1200000|
          samples|      %|
        ------------------
               33 41.2500 /lib/tls/libc.so.6
               24 30.0000 /boot/vmlinux
               19 23.7500 /bin/bash
                4  5.0000 /lib/ld-2.3.3.so
       42  0.0294 /boot/vmlinux
       29  0.0203 /sbin/iprinit
.
.
.

Profiling in AIX

AIX 5.2 introduced the libpmapi library, which contains APIs designed to provide access to some of the counting facilities of the Performance Monitor (PM) feature included in selected IBM microprocessors. Those APIs include the following:

  • A set of system-level APIs to allow counting of the activity of a whole machine or of a set of processes with a common ancestor.
  • A set of first party kernel-thread-level APIs to allow threads running in 1:1 mode to count their own activity.
  • A set of third party kernel-thread-level APIs to allow a debug program to count the activity of target threads running in 1:1 mode.

With the support of POWER5 in AIX 5.3, many profiling tools that use PMAPIs have been enhanced, including pmlist, tprof, hpmcount, hpmstat, and the sample program tcount.

The tcount command

tcount is a sample program that was shipped with AIX 5.2 and located in /usr/pmapi/samples/tcount. The program accepts event numbers, counting flags, and a workload as inputs, and then uses the PMAPIs to produce a listing of the PMCs for each CPU. The following is a description of tcount:


Listing 4. Usage message for tcount

# /usr/pmapi/samples/tcount -H

usage:  /usr/pmapi/samples/tcount -H
        /usr/pmapi/samples/tcount [-h] [-k] [-u] [-p] [-m|-M] [-t value]
               [-f filter] [-e event{,event}] [-T] workload

where:  -H                      this help screen
        -h                      count in hypervisor mode
        -k                      count in kernel mode
        -u                      count in user mode
        -p                      count in process tree mode (default is global)
        -r                      count in runlatch mode
        -t value                threshold value
        -m                      use upper threshold multiplier, if available
        -M                      use third threshold multiplier, if available
        -f v,u,c                event filter (default is v,u,c)
        -e event,event          comma-separated list of events to count
        -g group_id             event group ID
        -T                      retrieve timestamps
        workload                workload to execute

 Default counting modes are user,kernel,global.
 Valid filters are: verified, unverified, caveat.
 These represent the testing status of an event.

Following is an example of tcount usage:


Listing 5. Example usage of tcount

# /usr/pmapi/samples/tcount -pur -g 131 mytest 100000

called with 2 arguments, first one is 100000
loop count is 100000, array pad is 0
loop count is 100000
*** Configuration :
Mode = user; Process tree = off; Thresholding = off
Event Group is specified.
Group 131: pm_Dmiss
Counter  1, event 16: PM_DATA_FROM_L3
Counter  2, event 18: PM_DATA_FROM_LMEM
Counter  3, event 100: PM_LD_MISS_L1
Counter  4, event 171: PM_ST_MISS_L1
Counter  5, event  0: PM_INST_CMPL
Counter  6, event  0: PM_RUN_CYC

*** Results :
CPU       PMC 1         PMC 2         PMC 3         PMC 4         PMC 5         PMC 6
====  ============  ============  ============  ============  ============  ============
[ 0]  0             1670          10247         200569        2645054       3089454
[ 1]  0             0             0             0             0             0
[ 2]  0             0             0             0             8             0
[ 3]  0             0             0             0             8             0
====  ============  ============  ============  ============  ============  ============
ALL   0             1670          10247         200569        2645070       3089454
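
Because group 131 places PM_INST_CMPL on PMC 5 and PM_RUN_CYC on PMC 6, the results table already holds what is needed for a rough cycles-per-instruction figure. The following sketch divides the two columns of the ALL row. It uses run cycles rather than total cycles, so treat the result as an approximation of CPI. The function name is an assumption for illustration.

```shell
#!/bin/sh
# Reads the tcount "Results" table on stdin and prints PMC 6 / PMC 5
# for the ALL row. Assumes PMC 5 = PM_INST_CMPL and PMC 6 = PM_RUN_CYC,
# as in group 131. The function name is hypothetical.
run_cycles_per_instruction() {
    # Fields of the ALL row: $1 = "ALL", $2..$7 = PMC 1..PMC 6.
    awk '$1 == "ALL" && $6 > 0 { printf "%.3f\n", $7 / $6 }'
}

# Rows copied from Listing 5.
run_cycles_per_instruction <<'EOF'
CPU       PMC 1         PMC 2         PMC 3         PMC 4         PMC 5         PMC 6
====  ============  ============  ============  ============  ============  ============
[ 0]  0             1670          10247         200569        2645054       3089454
[ 1]  0             0             0             0             0             0
====  ============  ============  ============  ============  ============  ============
ALL   0             1670          10247         200569        2645070       3089454
EOF
# prints 1.168
```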

The pmlist command

The pmlist command lists information about supported processors, including:

  • The supported processors
  • The information summary for a specified processor
  • The event table for a specified processor
  • Any existing event groups for a specified processor
  • Any existing event sets for a specified processor
  • The event set and formula for a specified derived metric

Listing 6. Usage information for pmlist

pmlist
	utility to dump and search processors event and group tables
	currently supports text and spreadsheet output formats
usage: 	
pmlist -h
pmlist -l [ -o t | c ]
pmlist -s | -e <short|select> | -c counter[,event] | -g group | -S set | -D DerivedMetric [-p procname] [-s] [-d]
       [-o t|c] [-f filter]
where:
   -h                this help screen
   -l                lists all supported processor types
   -s                displays processor information summary
   -e short|select   lists all events with this short name or select event value
   -c -1             lists all events for all counters
   -c counter        lists all events for the specified counter
   -c counter,event  lists the specified event for the specified counter
   -D -1             lists all the derived metrics
   -D DerivedMetric  lists detailed information for the specified derived metric
   -g -1             lists all the event groups
   -g group          lists the specified event group
   -S -1             lists all the event sets
   -S set            lists the specified event set
   -p procname       specifies the processor for which information will be listed
   -d                displays event detailed description
   -o format         specifies the output format:
                        t is for text (default)
                        c is for comma separated values
   -f v,u,c          specifies the event filters (default is v,u,c).
                     these represent the testing status of an event:
                        v is for verified
                        u is for unverified
                        c is for caveat

To display the list of all supported processors, type:

# pmlist -l

The output looks like:


Listing 7. A list of supported processors

pmlist -l
Processors supported (specify with -p)
====================
PowerPC604
PowerPC604e
RS64-II
POWER3
RS64-III
POWER3-II
POWER4
MPC7450
POWER4-II
POWER5
PowerPC970

As previously discussed, the PMUs and PMCs are processor dependent, so the same counter and event numbers can refer to different events on different processors. Make sure you understand what a counter measures on the processor you are profiling.

To display detailed information for event 4 of counter 1 on the POWER5 processor, type:

# pmlist -p POWER5 -c 1,4 -d

The output looks like:


Listing 8. Detailed event information

Event #  Status  group Threshold share  Short Name  Long Name Description
=== Counter 1
#4,v,g,n,n,PM_3INST_CLB_CYC,Cycles 3 instructions in CLB

The cache line buffer (CLB) is an eight-deep, four-wide instruction buffer. Fullness is indicated in the eight valid bits associated with each of the four-wide slots, with full(0) corresponding to the number of cycles when eight instructions are in the queue, and full(7) corresponding to the number of cycles when one instruction is in the queue. This signal gives a real time history of the number of instruction quads valid in the instruction queue.

The same pmlist command specifying the POWER4 processor produces different output.


Listing 9. Detailed event information for POWER4

# pmlist -p POWER4 -c 1,4 -d

Event #  Status  group Threshold share  Short Name  Long Name Description
=== Counter 1
#4,v,g,n,n,PM_DC_PREF_STREAM_ALLOC,D cache new prefetch stream allocated
   A new Prefetch Stream was allocated

For the PowerPC970, the pmlist command displays yet another result.


Listing 10. Detailed event information for PPC970

# pmlist -p PowerPC970 -c 1,4 -d

Event #  Status  group Threshold share  Short Name  Long Name Description
=== Counter 1
#4,v,g,n,n,PM_DATA_TABLEWALK_CYC,Cycles doing data tablewalks
   This signal is asserted every cycle when a tablewalk is active. While a tablewalk is active any request attempting
   to access the TLB will be rejected and retried.

To list information on an event on a POWER5, type:


Listing 11. Detailed event information for POWER5

# pmlist -p POWER5 -e PM_INST_CMPL

POWER5: information about PM_INST_CMPL event

Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 2
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 3
=== Pmc 4
=== Pmc 5
  0,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 6

The same event on a POWER4 is described as follows:


Listing 12. The same event on POWER4

# pmlist -p POWER4 -e PM_INST_CMPL

POWER4: information about PM_INST_CMPL event
Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
 86,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 2
=== Pmc 3
=== Pmc 4
 77,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 5
=== Pmc 6
 86,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 7
 78,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 8
 81,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
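
Since the set of counters that can count a given event differs between processors, it can be handy to extract that set from the pmlist output. The following is a minimal sketch that assumes the comma-separated layout shown in Listings 11 and 12, with "=== Pmc N" lines separating the counters. The function name pmcs_for_event is made up for this example.

```shell
#!/bin/sh
# pmcs_for_event EVENT_NAME
# Reads "pmlist -p <processor> -e <event>" output on stdin and prints the
# numbers of the PMCs that list the event, separated by spaces.
# Assumes the layout of Listings 11 and 12; the function name is hypothetical.
pmcs_for_event() {
    awk -F, -v name="$1" '
        /^=== Pmc/ { split($0, w, " "); pmc = w[3]; next }   # remember current PMC
        {
            for (i = 1; i <= NF; i++)
                if ($i == name) { printf "%s%s", sep, pmc; sep = " "; break }
        }
        END { print "" }'
}

# POWER5 output copied from Listing 11.
pmcs_for_event PM_INST_CMPL <<'EOF'
Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 2
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 3
=== Pmc 4
=== Pmc 5
  0,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 6
EOF
# prints: 1 2 5
```

Fed the POWER4 output from Listing 12 instead, the same function prints 1 4 6 7 8.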

The derived metrics available on a POWER5 system can be displayed using pmlist with the -D -1 option, as follows:


Listing 13. Derived metrics

# pmlist -p POWER5 -D -1

Derived metrics supported:
        PMD_UTI_RATE                   Utilization rate
        PMD_MIPS                       MIPS
        PMD_INST_PER_CYC               Instructions per cycle
        PMD_HW_FP_PER_CYC              HW floating point instructions per Cycle
        PMD_HW_FP_PER_UTIME            HW floating point instructions / user time
        PMD_HW_FP_RATE                 HW floating point rate
        PMD_FX                         Total Fixed point operations
        PMD_FX_PER_CYC                 Fixed point operations per Cycle
        PMD_FP_LD_ST                   Floating point load and store operations
        PMD_INST_PER_FP_LD_ST          Instructions per floating point load/store
        PMD_PRC_INST_DISP_CMPL         % Instructions dispatched that completed
        PMD_DATA_L2                    Total L2 data cache accesses
        PMD_PRC_L2_ACCESS              % accesses from L2 per cycle
        PMD_L2_TRAF                    L2 traffic
        PMD_L2_BDW                     L2 bandwidth per processor
        PMD_L2_LD_EST_LAT_AVG          Estimated latency from loads from L2 (Average)
        PMD_UTI_RATE_RC                Utilization rate (versus run cycles)
        PMD_INST_PER_CYC_RC            Instructions per run cycle
        PMD_LD_ST                      Total load and store operations
        PMD_INST_PER_LD_ST             Instructions per load/store
        PMD_LD_PER_LD_MISS             Number of loads per load miss
        PMD_LD_PER_DTLB                Number of loads per DTLB miss
        PMD_ST_PER_ST_MISS             Number of stores per store miss
        PMD_LD_PER_TLB                 Number of loads per TLB miss
        PMD_LD_ST_PER_TLB              Number of load/store per TLB miss
        PMD_TLB_EST_LAT                Estimated latency from TLB miss
        PMD_MEM_LD_TRAF                Memory load traffic
        PMD_MEM_BDW                    Memory bandwidth per processor
        PMD_MEM_LD_EST_LAT             Estimated latency from loads from memory
        PMD_LD_LMEM_PER_LD_RMEM        Number of loads from local memory per loads from remote memory
        PMD_PRC_MEM_LD_RC              % loads from memory per run cycle

The hpmstat command

The hpmstat command is a utility used to continuously monitor a set of events counting system-wide activity. It reports results (raw hardware counts and derived metrics) at a given interval.


Listing 14. The hpmstat utility

# hpmstat -h

usage:
   hpmstat [-H] [-k] [-o file] [-r] [-s set] [-T] [-U] [-u] interval count
   hpmstat [-h]

where:
        interval       counting time interval (default is 1 and in seconds)
        count          number of iterations to count
        -H             adds hypervisor activity on behalf of the process
        -h             displays this help message
        -k             count system activity only (default is to count system,
                       user and hypervisor activity)
        -o file        output file name
        -r             enable runlatch, disable counts while executing in
                       idle cycle
        -s set         pre-defined set of events (1 to 9) - see command pmlist
        -T             write time stamps instead of time in seconds
        -U             the counting time interval is microseconds
        -u             count user activity only

The hpmcount command

The hpmcount command provides the execution wall clock time, raw hardware performance counts, derived hardware metrics, and resource utilization statistics (obtained from the getrusage() system call) for the application named by command.


Listing 15. The hpmcount utility

# hpmcount -h

usage:
   hpmcount [-a] [-H] [-k] [-o file] [-s set] command
   hpmcount [-h]

where:
        command        program to be executed
        -a             aggregate counters on POE runs
        -H             adds hypervisor activity on behalf of the process
        -h             displays this help message
        -k             adds system activity on behalf of the process
        -o file        output file name
        -s set         pre-defined set of events (1 to 9) - see command pmlist

Hardware events monitored by the hpm tools include:

  • Cycles
  • Instructions
  • Floating point instructions
  • Integer instructions
  • Load/stores
  • Cache misses
  • TLB misses
  • Branch taken/not taken
  • Branch mispredictions

The hpm tools provide derived metrics, including:

  • IPC - instructions per cycle
  • Floating point rate (Mflip/s)
  • FP computation intensity (flip per FP load/store)
  • Instructions per load/store
  • Load/stores per cache miss
  • Cache hit rate
  • Loads per load miss
  • Stores per store miss
  • Loads per TLB miss
  • Branches mispredicted %

As an example, hpmcount produces the following data for the command pwd:


Listing 16. Sample hpmcount output

# hpmcount pwd

/usr/pmapi/tools
 Execution time (wall clock time): 0.001848 seconds

 ########  Resource Usage Statistics  ########

 Total amount of time in user mode            : 0.000276 seconds
 Total amount of time in system mode          : 0.001063 seconds
 Maximum resident set size                    : 148 Kbytes
 Average shared memory use in text segment    : 0 Kbytes*sec
 Average unshared memory use in data segment  : 0 Kbytes*sec
 Number of page faults without I/O activity   : 45
 Number of page faults with I/O activity      : 0
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 0
 Number of involuntary context switches       : 0

 #######  End of Resource Statistics  ########

  PM_FPU_1FLOP (FPU executed one flop instruction )           :               0
  PM_CYC (Processor cycles)                                   :           85563
  PM_MRK_FPU_FIN (Marked instruction FPU processing finished) :               0
  PM_FPU_FIN (FPU produced a result)                          :               0
  PM_INST_CMPL (Instructions completed)                       :           17597
  PM_RUN_CYC (Run cycles)                                     :           85563

  Utilization rate                                 :           2.799 %
  MIPS                                             :           9.522
  Instructions per cycle                           :           0.206
  HW Float point instructions per Cycle            :           0.000
  HW floating point / user time                    :           0.000 M HWflop/sec
  HW floating point rate (HW Flops / WCT)          :           0.000 M HWflops/sec
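
Two of the derived metrics above can be cross-checked against the raw counts. The sketch below assumes that instructions per cycle is PM_INST_CMPL / PM_CYC and that MIPS is instructions completed divided by wall clock time in microseconds. These formulas are inferred from the numbers in Listing 16 rather than taken from the hpmcount documentation, and the function names are illustrative.

```shell
#!/bin/sh
# Recompute two derived metrics from the raw values in Listing 16.
# The formulas are assumptions that happen to reproduce the listed output;
# the definitions hpmcount actually uses may differ.

# ipc INSTRUCTIONS CYCLES
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN { printf "%.3f\n", i / c }'
}

# mips INSTRUCTIONS WALL_CLOCK_SECONDS
mips() {
    awk -v i="$1" -v t="$2" 'BEGIN { printf "%.3f\n", i / t / 1000000 }'
}

ipc 17597 85563        # prints 0.206, matching "Instructions per cycle"
mips 17597 0.001848    # prints 9.522, matching "MIPS"
```

The reciprocal of the first value, roughly 4.86 cycles per instruction, is the CPI of this short pwd run.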

The tprof command

The tprof command reports CPU usage for individual programs and the system. The profiling report contains the following sections and subsections:

  • Summary report section
    1. CPU usage summary by process name
    2. CPU usage summary by threads (tid)
  • Global profile section (pertains to the execution of all processes on the system)
    1. CPU usage of user mode routines
    2. CPU usage of kernel routines
    3. CPU usage summary for kernel extensions
    4. CPU usage of each kernel extension's subroutines
    5. CPU usage summary for shared libraries
    6. CPU usage of each shared library's subroutines
    7. CPU usage of each Java class
    8. CPU usage of each Java method of each Java class
  • Process and thread level profile sections (one section for each process or thread)
    1. CPU usage of user mode routines for this process/thread
    2. CPU usage of kernel routines for this process/thread
    3. CPU usage summary for kernel extensions for this process/thread
    4. CPU usage of each kernel extension's subroutines for this process/thread
    5. CPU usage summary for shared libraries for this process/thread
    6. CPU usage of each shared library's subroutines for this process/thread
    7. CPU usage of each Java class for this process/thread
    8. CPU usage of each Java method of each Java class for this process/thread

tprof supports two profiling modes: time-based (the default) and event-based. In time-based profiling, the decrementer interrupt drives tprof. In event-based profiling, either software-based events or Performance Monitor (PM) events drive the interrupt. Two new flags, -E and -f, control event-based profiling. The -E flag enables event-based profiling; its parameter is one of the four software-based events (EMULATION, ALIGNMENT, ISLBMISS, DSLBMISS) or a PM event. By default, the profiling event is processor cycles (PM_CYC). All PM event names are prefixed with PM_, such as PM_CYC for processor cycles or PM_INST_CMPL for instructions completed. The pmlist command lists all the Performance Monitor events that are supported on a processor.
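
The naming rule for the -E parameter can be sketched as a small classifier. This is an illustration only: the function name classify_event is hypothetical, and the check looks at the name alone. Whether a given PM event is actually supported on a processor still has to be confirmed with pmlist.

```python
# The four software-based events accepted by the -E flag.
SOFTWARE_EVENTS = {"EMULATION", "ALIGNMENT", "ISLBMISS", "DSLBMISS"}


def classify_event(name):
    """Classify a -E parameter as 'software' or 'pm' by its name alone."""
    if name in SOFTWARE_EVENTS:
        return "software"
    # Performance Monitor events all carry the PM_ prefix.
    if name.startswith("PM_") and len(name) > len("PM_"):
        return "pm"
    raise ValueError(f"not a software-based event or PM_ event: {name!r}")


for event in ("ALIGNMENT", "PM_CYC", "PM_INST_CMPL"):
    print(event, "->", classify_event(event))
```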

The -f flag specifies the sampling frequency for event-based profiling. For software-based events and processor cycles (PM_CYC), supported frequencies range from 1 to 500 milliseconds, with a default of 10 milliseconds. For all other PM events, the range is from 10,000 to MAXINT occurrences of the event, with a default of 10,000 events. As an example, Listing 17 shows the output from event-based profiling of the sleep command in real-time mode:


Listing 17. Event-based profiling output

Configuration information
=========================
System: AIX 5.3 Node: stram Machine: 005D13DA4C00
Tprof command was:
    tprof -u -R -E PM_CYC -f 10 -r rootstring -x sleep 50
Trace command was:
    trace -a -J tprof -o rootstring.trc
Total Samples = 195
Total Elapsed Time = 1.96s
Performance Monitor based report
    Processor name: power5
    Monitored event: Processor cycles (PM_CYC)
    Sampling frequency:    10 ms
PURR was used to calculate percentages
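
The -f rules described before Listing 17 can be sketched as a validator. This is a hedged illustration, not tprof's own logic: the function name sampling_frequency is hypothetical, and MAXINT is assumed here to be the largest 32-bit signed integer, which the article does not state.

```python
# Assumption: MAXINT is taken to be the largest 32-bit signed integer.
MAXINT = 2**31 - 1

SOFTWARE_EVENTS = {"EMULATION", "ALIGNMENT", "ISLBMISS", "DSLBMISS"}


def sampling_frequency(event="PM_CYC", freq=None):
    """Return (value, unit) for -f, applying the documented range and default."""
    if event in SOFTWARE_EVENTS or event == "PM_CYC":
        # Software-based events and PM_CYC are sampled by time.
        low, high, default, unit = 1, 500, 10, "ms"
    else:
        # All other PM events are sampled by occurrence count.
        low, high, default, unit = 10_000, MAXINT, 10_000, "events"
    if freq is None:
        freq = default
    if not low <= freq <= high:
        raise ValueError(f"-f {freq} is outside {low}..{high} {unit} for {event}")
    return freq, unit


# The invocation in Listing 17 used -E PM_CYC -f 10.
print(sampling_frequency("PM_CYC", 10))
```

For the command in Listing 17, this yields a 10 ms interval, which agrees with the "Sampling frequency: 10 ms" line in the report.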

Postprocessing the report file named toto produces the following output:


Listing 18. Postprocessed output

Configuration information
=========================
System: 5.3 Node: monvelo Machine: 0054BDAA4C00
Tprof command was:
    ./tprof -r toto
Tprof command used to produce input files was:
    ./tprof -c -A all -C all -r toto -x ls
Trace command was:
    trace -a -L 1000000 -T 500000 -j 000,001,002,003,38F,005,006,134,139,465,00A,234 -o toto.trc -Call
Total Samples = 368
Total Elapsed Time = 1.84s

Listing 19 shows time-based profiling of the pwd command:


Listing 19. Time-based profiling

# tprof -x pwd

Starting Command pwd
/usr/pmapi/samples
stopping trace collection.
CPU: 4294967295 PID: 160112 TID 463293
Trace Started on Tue Nov 22 11:35:19 2005

CPU: 4294967295 TID 466957      TIME: 8255614 Nanoseconds
Trace Stoped on Tue Nov 22 11:35:19 2005

Global Hook Counts
-----------------------------
           TrcOn: 1
          TrcOff: 1
          TrcHdr: 5
         TrcUtil: 649
        trc_exec: 3
        trc_fork: 2
       kern_prof: 2
         trc_ldr: 9
672 hooks processed ( incl. 672 utility hooks )
0.008 seconds in measured interval
process pid 282 name wait found in syms file
Generating pwd.prof
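
As a quick consistency check, the per-hook counts in Listing 19 should add up to the total on the "hooks processed" line. The sketch below parses those lines; the function name parse_hooks and the regular expressions are illustrative assumptions about the layout shown above, not a documented output format.

```python
import re

# Hook-count lines copied from Listing 19.
listing = """\
           TrcOn: 1
          TrcOff: 1
          TrcHdr: 5
         TrcUtil: 649
        trc_exec: 3
        trc_fork: 2
       kern_prof: 2
         trc_ldr: 9
672 hooks processed ( incl. 672 utility hooks )
"""


def parse_hooks(text):
    """Return ({hook name: count}, reported total) from the hook-count lines."""
    counts = {}
    total = None
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s+(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
            continue
        match = re.match(r"\s*(\d+) hooks processed", line)
        if match:
            total = int(match.group(1))
    return counts, total


counts, total = parse_hooks(listing)
print(sum(counts.values()), "counted;", total, "reported")
```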

The contents of the file pwd.prof are:


Listing 20. Profile output file

Configuration information
=========================
System: AIX 5.3 Node: java1 Machine: 00C4ED1A4C00
Tprof command was:
    tprof -u -v -z -x pwd
Trace command was:
    /usr/bin/trace -ad -L 1000000 -T 500000 -j 000,001,002,003,38F,005,006,134,139,5A2,465,00A,234 -o -
Total Samples = 1
Total Elapsed Time = 0.01s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Process                                FREQ  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
wait                                      1      1      1      0      0      0
=======                                ====  ===== ======   ==== ======  =====
Total                                     1      1      1      0      0      0

Process                   PID      TID  Total Kernel   User Shared  Other
=======                   ===      ===  ===== ======   ==== ======  =====
wait                      282      289      1      1      0      0      0
=======                   ===      ===  ===== ======   ==== ======  =====
Total                                       1      1      0      0      0
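
The summary tables in a .prof file are regular enough to process with a short script. The following sketch parses the per-process table from Listing 20 and computes the share of samples spent in the kernel. The function names are hypothetical, and the parser assumes whitespace-separated columns and process names without spaces.

```python
# Per-process summary copied from Listing 20.
summary = """\
Process                                FREQ  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
wait                                      1      1      1      0      0      0
=======                                ====  ===== ======   ==== ======  =====
Total                                     1      1      1      0      0      0
"""


def parse_summary(text):
    """Return one dict per table row, keyed by the column headers."""
    lines = [l for l in text.splitlines() if l.strip() and not l.startswith("=")]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        row = {header[0]: fields[0]}
        # Every column after the process name holds an integer sample count.
        row.update({key: int(value) for key, value in zip(header[1:], fields[1:])})
        rows.append(row)
    return rows


def kernel_share(row):
    """Percentage of a row's samples that were taken in kernel mode."""
    return 100.0 * row["Kernel"] / row["Total"] if row["Total"] else 0.0


for row in parse_summary(summary):
    print(f'{row["Process"]}: {kernel_share(row):.1f}% kernel')
```

In this run the only sample landed in the wait process in kernel mode, so pwd itself finished too quickly to be sampled at the default 10 ms interval.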

Coming up next

The next article in this series shows how to use the data produced by the tools introduced here, and describes a CPI breakdown model that attributes delays and inefficiencies in execution to their sources in more detail.


