Cycles per instruction (CPI) is a key metric for analyzing the performance of a workload. CPI is defined as the number of processor clock cycles needed to complete an instruction: CPI = Total Cycles / Number of Instructions Completed. A high CPI value usually implies underutilization of machine resources.
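The CPI definition above can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and the counter values are made-up numbers chosen only to illustrate the arithmetic, not measurements from a POWER5 system.

```python
def cycles_per_instruction(total_cycles: int, instructions_completed: int) -> float:
    """Return CPI = total cycles / number of instructions completed."""
    if instructions_completed <= 0:
        raise ValueError("instructions_completed must be positive")
    return total_cycles / instructions_completed


# Hypothetical counter readings, used only to show the calculation.
cpi = cycles_per_instruction(total_cycles=3_000_000, instructions_completed=2_000_000)
print(f"CPI = {cpi:.2f}")  # CPI = 1.50
```

A lower value means more instructions complete per cycle; the reciprocal of CPI is instructions per cycle (IPC).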
Profiling is a common approach to collect timing and resource utilization for a workload. Profiling allows you to identify hotspots in code and data, find performance-sensitive areas, and identify problem instructions, data areas, or both. The POWER5™ microprocessor, like many of the PowerPC® processors that immediately preceded it, provides on-chip performance monitor units (PMUs) to record performance events through six performance monitor counters (PMCs), two of which are dedicated to counting PowerPC instructions completed and total cycles. As a result, with an appropriate set of performance monitor application programming interfaces (PMAPIs) designed to provide access to those PMCs, you can profile many performance-sensitive events related to the core or the memory subsystem, as well as the effects of simultaneous multithreading (SMT) and symmetric multiprocessing (SMP) on your workload. The data is extensive and informative, since the POWER5 supports approximately 900 total events, 500 unique events, and 230 events per counter.
On a POWER5 system, you can break down your workload's CPI into individual components. The POWER5 has several programmable counters that count the events needed to calculate those components, which helps you determine how to improve performance for a given workload.
This series introduces a CPI breakdown model to help you analyze where your workload spends its processing cycles while progressing through the core resources and the penalty incurred upon encountering these inhibitors. We give an overview of the POWER5 architecture, discuss the POWER5 performance monitoring facilities and performance events and profilers that are available on Linux® and AIX®, and show how to construct the CPI stack for your workload using the pmcount command. Finally, you'll see an example showing how to use the CPI breakdown model to identify and improve a performance problem.
This article reviews the performance tools available on POWER5 systems. After you've collected the profiling data, you can analyze the performance of your workload using the CPI (cycles per instruction) breakdown approach, which is the topic of the next article in this series.
The POWER5 processor chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.9MB level-2 (L2) cache and controls, a 36MB level-3 (L3) cache directory and controls, and the fabric controller that controls the flow of information and control data between the L2 and L3 and between chips. The L3 cache is a victim cache of the L2 cache. Each core contains a 64KB level-1 (L1) instruction cache (Icache), a 32KB level-1 data cache (Dcache), two fixed-point execution units (FXU - FX Exec Unit), two floating-point execution units (FPU - FP Exec Unit), two load/store execution units (LDU - LD Exec Unit), one branch execution unit (BRU - BR Exec Unit), and one execution unit that performs logical operations on the condition register (CRU - CR Exec Unit). Instructions are dispatched in program order in groups and issued out of program order to the execution units, with a bias towards the oldest operations first. A group contains up to five internal operations (IOPs), and the last instruction is always the branch instruction (or a no-op if there is no branch instruction). Only one group of instructions can be dispatched in a cycle, and all instructions in a group are dispatched together.
The 64-bit POWER5 core is a speculative, out-of-order execution core coupled with a multilevel storage hierarchy. Figure 1 shows an overview of the POWER5 pipeline organization. The pipeline structure comprises a master pipeline and several execution unit pipelines, all of which can progress independently from each other. As discussed in the paper "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" (see Resources), the master pipeline presents speculative and in-order instructions to the mapping, sequencing, and dispatch function, and ensures an orderly completion of the real execution path. It throws away any potential speculative results associated with mispredicted branches. The execution unit pipelines allow out-of-order issuing of speculative and non-speculative instructions.
In general, up to eight instructions are fetched from the instruction cache (I cache) and fed into the pipeline to go through various pipeline stages until completion. In each cycle, up to five instructions are pulled from the instruction fetch buffer (Inst Queue) and sent through the three-stage instruction decode pipeline to form a dispatch group. Complex instructions are cracked into internal operations (IOPs) to allow for simpler inner core dataflow.
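The group-formation rule (up to five IOPs, with a branch or a no-op last) can be illustrated with a toy model. This is a sketch under stated assumptions, not the hardware's actual algorithm: the names `form_groups` and `BRANCH_OPS` are hypothetical, the model assumes that a branch always closes the current group, and it ignores instruction cracking and every other real group-formation restriction.

```python
from typing import List

GROUP_SLOTS = 5   # a group holds up to five IOPs, per the text
NOP = "nop"
# Assumption: a tiny, illustrative set of branch mnemonics.
BRANCH_OPS = {"b", "bc", "bl", "blr", "bctr"}


def form_groups(iops: List[str]) -> List[List[str]]:
    """Pack IOP mnemonics into groups whose last entry is a branch or a no-op."""
    groups: List[List[str]] = []
    current: List[str] = []
    for iop in iops:
        if iop in BRANCH_OPS:
            # Assumption: a branch ends the group it is placed in.
            current.append(iop)
            groups.append(current)
            current = []
            continue
        if len(current) == GROUP_SLOTS - 1:
            # Four non-branch IOPs already queued: close the group with a no-op.
            current.append(NOP)
            groups.append(current)
            current = []
        current.append(iop)
    if current:
        # Leftover non-branch IOPs still need a last entry.
        current.append(NOP)
        groups.append(current)
    return groups


for group in form_groups(["add", "ld", "fmul", "st", "sub", "bc", "add"]):
    print(group)
```

In this model a branch that arrives early produces a short group, which is one way a branch-heavy instruction stream can leave dispatch slots unused.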
From dispatch to completion, instructions are tracked in groups. Each group has an entry in the Global Completion Table (GCT). A group can contain up to five IOPs. During dispatch, various machine resources are assigned to instructions. These resources are:
- GCT: one entry per group
- FXU Issue queue: one entry per Fixed point or load/store IOP
- FPU Issue queue: one entry per Floating Point IOP
- BRU Issue queue: one entry per branch IOP
- CRU Issue queue: one entry per CR IOP
- mappers: one entry per new destination (either GPR, FPR, XER, CR, Link/CNT)
- LRQ: one entry per load IOP
- SRQ: one entry per store IOP
Figure 1. POWER5 Pipeline Structure
LRQ is the Load Reorder Queue, a 32-entry queue which holds real addresses and tracks the order of loads. SRQ is the Store Reorder Queue, a 32-entry queue which tracks all stores active in the LSU. In Figure 1, IFAR represents the Instruction Fetch Address Register.
The following machine resources are released at various stages:
- GCT: when the group completes, or in other words, when all instructions in the oldest group finish their execution
- Issue queue: after the instructions have been successfully issued
- mappers: when the group completes
- LRQ: when the group completes
- SRQ: when the store is sent to the storage subsystem
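The allocation rules listed above can be tallied for a single dispatch group. This is a minimal bookkeeping sketch: the `Iop` class, its `kind` strings, and the `new_dests` field are hypothetical names introduced for illustration, and the model counts entries only; it does not model queue capacities or release timing.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Iop:
    kind: str            # one of: "fx", "load", "store", "fp", "branch", "cr"
    new_dests: int = 0   # new destination registers that need a mapper entry


def dispatch_resources(group: List[Iop]) -> Dict[str, int]:
    """Count the entries one dispatch group needs in each machine resource."""
    if not 1 <= len(group) <= 5:
        raise ValueError("a group holds between one and five IOPs")
    need = Counter({"GCT": 1})  # one GCT entry per group
    for iop in group:
        if iop.kind in ("fx", "load", "store"):
            need["FXU issue queue"] += 1   # fixed-point or load/store IOP
        elif iop.kind == "fp":
            need["FPU issue queue"] += 1
        elif iop.kind == "branch":
            need["BRU issue queue"] += 1
        elif iop.kind == "cr":
            need["CRU issue queue"] += 1
        else:
            raise ValueError(f"unknown IOP kind: {iop.kind}")
        if iop.kind == "load":
            need["LRQ"] += 1               # one LRQ entry per load IOP
        if iop.kind == "store":
            need["SRQ"] += 1               # one SRQ entry per store IOP
        need["mappers"] += iop.new_dests   # one entry per new destination
    return dict(need)


group = [Iop("load", 1), Iop("fx", 1), Iop("fp", 1), Iop("store"), Iop("branch")]
print(dispatch_resources(group))
```

If any of these resources has no free entry, the group cannot dispatch, which is why the release points listed above matter for throughput.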
The branch prediction logic BR Scan scans all the fetched instructions looking for up to two branches per cycle. Depending upon the branch type found, various branch prediction mechanisms in Branch BR engage to help predict the target address of the branch or the branch direction or both.
The POWER5 microprocessor, like its predecessors in the PowerPC line, provides a Performance Monitor Unit (PMU) with a number of PMCs to monitor and record performance events. The PowerPC 604 had two PMCs while the 604e had four. The POWER3™, POWER4™, and PowerPC 970 had eight PMCs. The POWER5 has six PMCs per thread, for a total of 12 per CPU to support the implementation of SMT. The number of supported events has also increased over the years. The PowerPC 604e had 128 total events, 100 unique events, and 32 events per counter. The POWER4 had 900 total events, 300 unique events, and 115 events per counter. Meanwhile, the POWER5 has 900 total events, 500 unique events, and 230 events per counter (see Resources for more information).
Monitoring an event on a POWER5 processor is a challenging task because of the speculative, out-of-order execution of the core and the instruction groupings. At any clock cycle, you have to handle a typical situation of five instructions/group, 20 groups past dispatch, 32 outstanding loads, 16 outstanding misses, two independent threads, and the decoupled nest/core.
As an example, analysis of stall conditions on the POWER5 processor might be complicated because the processor speculatively executes instructions out of order and completes instructions in groups. With up to five instructions per group, any one of the instructions could block completion, and multiple conditions could block any instructions (a translation miss followed by a cache miss, for example). Even if stall conditions could be measured on a per instruction (or even per group) basis, that information by itself is not sufficient to break down the completion stall component because speculation might cause the instructions to be discarded before they complete. Until a stall condition has resolved, it might not be possible to know its fundamental cause.
To measure these speculative events, the POWER5 PMU has speculative counters. Two of the programmable counters have backup registers that are not accessible to software. When software writes an initial value to the counter, its backup register is also written. A counter can be configured for a particular stall condition and begins counting speculatively on any cycle when no group is completing. The first group that completes will report the last condition that held its completion. If the condition matches what the counter is configured for, the count value is committed by updating the backup register. If it does not match, the counter is rewound to its previously check-pointed value.
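The commit-or-rewind behavior of a speculative counter can be modelled in a few lines. This is a software sketch of the mechanism as described in the text, not a description of the hardware implementation; the class name, method names, and the condition strings are hypothetical.

```python
class SpeculativeCounter:
    """Toy model of a PMC with a backup (checkpoint) register."""

    def __init__(self, condition: str, initial: int = 0) -> None:
        self.condition = condition  # stall condition this counter is configured for
        self.value = initial        # speculative count
        self.backup = initial       # written together with the counter

    def cycle_without_completion(self) -> None:
        """Count speculatively on a cycle when no group completes."""
        self.value += 1

    def group_completed(self, last_stall_condition: str) -> None:
        """Commit or rewind, based on the condition the completing group reports."""
        if last_stall_condition == self.condition:
            self.backup = self.value   # commit: update the backup register
        else:
            self.value = self.backup   # rewind to the check-pointed value


counter = SpeculativeCounter("dcache_miss")
for _ in range(3):
    counter.cycle_without_completion()
counter.group_completed("dcache_miss")   # matches: 3 cycles are committed
for _ in range(2):
    counter.cycle_without_completion()
counter.group_completed("fxu_busy")      # does not match: 2 cycles are discarded
print(counter.value)  # 3
```

Only the stall cycles whose cause turned out to match the configured condition survive, which is what makes the count usable for a completion-stall breakdown.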
Several profilers run on a POWER5 system. If you are running Linux on POWER™ (LoP), you can use OProfile, the system-wide statistical profiler for Linux. If you are running AIX, you have many other performance tools developed to leverage the PMAPIs, such as pmlist, tcount, tprof, and the two hardware performance monitor (hpm) tools, hpmstat and hpmcount. Lastly, you can use the pmcount tool, which runs on both LoP and AIX.
In Linux, the most popular profiling tool is OProfile (see Resources), a system-wide profiler for Linux systems running on the Intel®, PowerPC, AMD, sparc64, and PA-RISC platforms. Support for POWER4, PowerPC 970, and POWER5 counters was added to OProfile in release 0.8.2 by the LTC Toolchain team. RPMs with LoP support are included in SLES 9 SP1 and RHEL 4 U1. John Engel describes OProfile and how to use it to identify performance bottlenecks for LoP in an article published in IBM developerWorks (see Resources). Here is a simple shell script for running OProfile.
Listing 1. Invoking OProfile
#!/bin/ksh
# Oprofile run script
opcontrol --init                  # Loads the oprofile driver and oprofilefs
opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.97-pseries64   # Specifies the vmlinux kernel image
opcontrol --reset                 # Clears current session data
opcontrol --event=CYCLES:1000     # Takes a sample for every 1000th hardware event
opcontrol --verbose --start       # Starts profiling
# Runs the application you want to profile here
opcontrol --dump                  # Flushes the collected profiling data to the oprofile daemon
opcontrol --stop                  # Stops the profiling data collection
opcontrol --shutdown              # Stops data collection and removes the daemon
Running the program opreport generates a profiling report with a list of all symbols. The following command produces a report directed to the file named ammp.sym_out:
opreport --symbols > ammp.sym_out
The contents of ammp.sym_out can be displayed as follows:
Listing 2. The profiling report
more ammp.sym_out
CPU: ppc64 POWER5, speed 1656 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask) count 1200000
samples  %        image name           app name             symbol name
90896    63.5556  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  mm_fv_update_nonbon
10219     7.1453  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  f_nonbon
7531      5.2658  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  atom
7432      5.1965  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  __vsqrt_GP
6614      4.6246  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  a_m_serial
5582      3.9030  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  tpac
4213      2.9458  libm.so.6            ammp_base.SLES9-RC5  __sinl
1905      1.3320  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  f_torsion
1636      1.1439  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  __vrec_GP
1342      0.9383  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  f_box
1219      0.8523  libm.so.6            ammp_base.SLES9-RC5  __ieee754_acos
1024      0.7160  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  f_bond
793       0.5545  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  fv_update_nonbon
652       0.4559  libm.so.6            ammp_base.SLES9-RC5  __acosl
278       0.1944  libc.so.6            ammp_base.SLES9-RC5  getc
.
.
.
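A symbol report like Listing 2 is easy to post-process when you want to pick out hotspots automatically. The following is a minimal sketch that assumes the five-column layout shown in Listing 2 (samples, percentage, image name, app name, symbol name); the function names and the 5% threshold are hypothetical choices, and only a few rows of the listing are embedded as sample input.

```python
from typing import List, Tuple

# A few rows copied from Listing 2, used as sample input.
REPORT = """\
CPU: ppc64 POWER5, speed 1656 MHz (estimated)
samples  %        image name           app name             symbol name
90896    63.5556  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  mm_fv_update_nonbon
10219     7.1453  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  f_nonbon
7531      5.2658  ammp_base.SLES9-RC5  ammp_base.SLES9-RC5  atom
4213      2.9458  libm.so.6            ammp_base.SLES9-RC5  __sinl
"""


def parse_symbols(report: str) -> List[Tuple[str, int, float]]:
    """Return (symbol, samples, percent) for each row after the header line."""
    rows = []
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if not in_table:
            # The table starts after the line whose first word is "samples".
            in_table = bool(fields) and fields[0] == "samples"
            continue
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # skip anything that is not a data row
        rows.append((fields[4], int(fields[0]), float(fields[1])))
    return rows


def hotspots(rows: List[Tuple[str, int, float]], threshold_pct: float) -> List[str]:
    """Return the symbols whose share of samples is at least threshold_pct."""
    return [symbol for symbol, _, pct in rows if pct >= threshold_pct]


rows = parse_symbols(REPORT)
print(hotspots(rows, threshold_pct=5.0))
```

For the rows above, the routine mm_fv_update_nonbon dominates the profile, so it is the natural first candidate for closer analysis.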
The following command generates a profiling report with full path names:
opreport --long-filenames > ammp.filenames_out
The contents of ammp.filenames_out can be displayed similarly:
Listing 3. Profiling report, with path names
more ammp.filenames_out
CPU: ppc64 POWER5, speed 1656 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask)
count 1200000
CYCLES:1200000|
samples| %|
------------------
142609 99.7140 /benchmarks/speccpu/speccpu/benchspec/CFP2000/188.ammp/run/00000014/ammp_base.SLES9-RC5
CYCLES:1200000|
samples| %|
------------------
135282 94.8622 /benchmarks/speccpu/speccpu/benchspec/CFP2000/188.ammp/run/00000014/ammp_base.SLES9-RC5
6339 4.4450 /lib/tls/libm.so.6
627 0.4397 /lib/tls/libc.so.6
330 0.2314 /boot/vmlinux
9 0.0063 /e1000
8 0.0056 /ohci_hcd
5 0.0035 /ehci_hcd
4 0.0028 /usbcore
3 0.0021 /scsi_mod
2 0.0014 /ipr
129 0.0902 /usr/bin/oprofiled
CYCLES:1200000|
samples| %|
------------------
105 81.3953 /boot/vmlinux
22 17.0543 /usr/bin/oprofiled
2 1.5504 /lib/tls/libc.so.6
80 0.0559 /bin/bash
CYCLES:1200000|
samples| %|
------------------
33 41.2500 /lib/tls/libc.so.6
24 30.0000 /boot/vmlinux
19 23.7500 /bin/bash
4 5.0000 /lib/ld-2.3.3.so
42 0.0294 /boot/vmlinux
29 0.0203 /sbin/iprinit
.
.
.
AIX 5.2 introduced the libpmapi library, which contains APIs that are designed to provide access to some of the counting facilities of the PM feature included in selected IBM microprocessors. Those APIs include the following:
- A set of system-level APIs to allow counting of the activity of a whole machine or of a set of processes with a common ancestor.
- A set of first party kernel-thread-level APIs to allow threads running in 1:1 mode to count their own activity.
- A set of third party kernel-thread-level APIs to allow a debug program to count the activity of target threads running in 1:1 mode.
With the support of POWER5 in AIX 5.3, many profiling tools that use PMAPIs have been enhanced including pmlist, tprof, hpmcount and hpmstat, and the sample program tcount.
tcount is a sample program that shipped with AIX 5.2 and is located in /usr/pmapi/samples/tcount. The program accepts event numbers, counting flags, and a workload as inputs, and then uses the PMAPIs to produce a listing of the PMCs for each CPU. Its usage message is as follows:
Listing 4. Usage message for tcount
# /usr/pmapi/samples/tcount -H
usage: /usr/pmapi/samples/tcount -H
/usr/pmapi/samples/tcount [-h] [-k] [-u] [-p] [-m|-M] [-t value]
[-f filter] [-e event{,event}] [-T] workload
where: -H this help screen
-h count in hypervisor mode
-k count in kernel mode
-u count in user mode
-p count in process tree mode (default is global)
-r count in runlatch mode
-t value threshold value
-m use upper threshold multiplier, if available
-M use third threshold multiplier, if available
-f v,u,c event filter (default is v,u,c)
-e event,event comma-separated list of events to count
-g group_id event group ID
-T retrieve timestamps
workload workload to execute
Default counting modes are user,kernel,global.
Valid filters are: verified, unverified, caveat.
These represent the testing status of an event.
Following is an example of tcount usage:
Listing 5. Example usage of tcount
# /usr/pmapi/samples/tcount -pur -g 131 mytest 100000
called with 2 arguments, first one is 100000
loop count is 100000, array pad is 0
loop count is 100000
*** Configuration :
Mode = user; Process tree = off; Thresholding = off
Event Group is specified.
Group 131: pm_Dmiss
Counter 1, event 16: PM_DATA_FROM_L3
Counter 2, event 18: PM_DATA_FROM_LMEM
Counter 3, event 100: PM_LD_MISS_L1
Counter 4, event 171: PM_ST_MISS_L1
Counter 5, event 0: PM_INST_CMPL
Counter 6, event 0: PM_RUN_CYC
*** Results :
CPU          PMC 1        PMC 2        PMC 3        PMC 4        PMC 5        PMC 6
====  ============ ============ ============ ============ ============ ============
[ 0]             0         1670        10247       200569      2645054      3089454
[ 1]             0            0            0            0            0            0
[ 2]             0            0            0            0            8            0
[ 3]             0            0            0            0            8            0
====  ============ ============ ============ ============ ============ ============
ALL              0         1670        10247       200569      2645070      3089454
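The totals in Listing 5 are enough for a first-order look at the workload. The sketch below takes the values from the ALL row and computes a CPI figure and the L1 miss counts per thousand instructions. It assumes that PM_RUN_CYC is an acceptable cycle count for the CPI ratio; the dictionary and the helper name are hypothetical.

```python
# Totals from the ALL row of Listing 5 (event group 131, pm_Dmiss).
totals = {
    "PM_DATA_FROM_L3": 0,
    "PM_DATA_FROM_LMEM": 1670,
    "PM_LD_MISS_L1": 10247,
    "PM_ST_MISS_L1": 200569,
    "PM_INST_CMPL": 2645070,
    "PM_RUN_CYC": 3089454,
}


def per_thousand_instructions(event: str, counts: dict) -> float:
    """Return occurrences of an event per 1000 completed instructions."""
    return 1000.0 * counts[event] / counts["PM_INST_CMPL"]


# Assumption: run cycles stand in for total cycles in the CPI ratio.
cpi = totals["PM_RUN_CYC"] / totals["PM_INST_CMPL"]
print(f"CPI (run cycles / instructions completed): {cpi:.3f}")
for event in ("PM_LD_MISS_L1", "PM_ST_MISS_L1"):
    rate = per_thousand_instructions(event, totals)
    print(f"{event} per 1000 instructions: {rate:.2f}")
```

Normalizing by instructions completed makes runs of different lengths comparable, which raw counter values are not.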
The pmlist command lists information about supported processors, including:
- The supported processors
- The information summary for a specified processor
- The event table for a specified processor
- Any existing event groups for a specified processor
- Any existing event sets for a specified processor
- The event set and formula for a specified derived metric
Listing 6. Usage information for pmlist
pmlist
utility to dump and search processors event and group tables
currently supports text and spreadsheet output formats
usage:
pmlist -h
pmlist -l [ -o t | c ]
pmlist -s | -e >short|select< | -c counter[,event] | -g group | -S set | -D DerivedMetric -p procname] [-s] [-d]
[-o t|c] [-f filter]
where:
-h this help screen
-l lists all supported processor types
-s displays processor information summary
-e short|select lists all events with this short name or select event value
-c -1 lists all events for all counters
-c counter lists all events for the specified counter
-c counter,event lists the specified event for the specified counter
-D -1 lists all the derived metrics
-D DerivedMetric lists detailed information for the specified derived metric
-g -1 lists all the event groups
-g group lists the specified event group
-S -1 lists all the event sets
-S set lists the specified event set
-p procname specifies the processor for which information will be listed
-d displays event detailed description
-o format specifies the output format:
t is for text (default)
c is for comma separated values
-f v,u,c specifies the event filters (default is v,u,c).
these represent the testing status of an event:
v is for verified
u is for unverified
c is for caveat
To display the list of all supported processors, type:
# pmlist -l
The output looks like:
Listing 7. A list of supported processors
pmlist -l
Processors supported (specify with -p)
====================
PowerPC604
PowerPC604e
RS64-II
POWER3
RS64-III
POWER3-II
POWER4
MPC7450
POWER4-II
POWER5
PowerPC970
As previously discussed, the PMUs and PMCs are processor dependent, so the same counter and event number can mean different things on different processors. Make sure you understand what a counter measures on the processor you are profiling.
To display detailed information for event 4 of counter 1 on a POWER5 processor, type:
# pmlist -p POWER5 -c 1,4 -d
The output looks like:
Listing 8. Detailed event information
Event # Status group Threshold share Short Name Long Name Description
=== Counter 1
#4,v,g,n,n,PM_3INST_CLB_CYC,Cycles 3 instructions in CLB
The cache line buffer (CLB) is an eight-deep, four-wide instruction buffer. Fullness is indicated in the eight valid bits associated with each of the four-wide slots, with full(0) corresponding to the number of cycles when eight instructions are in the queue, and full(7) corresponding to the number of cycles when one instruction is in the queue. This signal gives a real time history of the number of instruction quads valid in the instruction queue.
The same pmlist command for a POWER4 processor produces different output.
Listing 9. Detailed event information for POWER4
# pmlist -p POWER4 -c 1,4 -d
Event # Status group Threshold share Short Name Long Name Description
=== Counter 1
#4,v,g,n,n,PM_DC_PREF_STREAM_ALLOC,D cache new prefetch stream allocated
A new Prefetch Stream was allocated
The pmlist command for PowerPC970 displays yet another output.
Listing 10. Detailed event information for PPC970
# pmlist -p PowerPC970 -c 1,4 -d
Event # Status group Threshold share Short Name Long Name Description
=== Counter 1
#4,v,g,n,n,PM_DATA_TABLEWALK_CYC,Cycles doing data tablewalks
This signal is asserted every cycle when a tablewalk is active. While a tablewalk is active any request attempting to access the TLB will be rejected and retried.
To list information on an event on a POWER5, type:
Listing 11. Detailed event information for POWER5
# pmlist -p POWER5 -e PM_INST_CMPL
POWER5: information about PM_INST_CMPL event
Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 2
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 3
=== Pmc 4
=== Pmc 5
0,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 6
The same event on a POWER4 is described as follows:
Listing 12. The same event on POWER4
# pmlist -p POWER4 -e PM_INST_CMPL
POWER4: information about PM_INST_CMPL event
Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
86,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 2
=== Pmc 3
=== Pmc 4
77,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 5
=== Pmc 6
86,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 7
78,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
=== Pmc 8
81,c,g,n,n,8001,PM_INST_CMPL,Instructions completed
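Because the same event has different event numbers on different counters and processors, it can help to turn the pmlist -e output into a lookup table before passing event numbers to a tool such as tcount. The following is a minimal sketch that assumes the "=== Pmc N" layout shown in Listings 11 and 12; the function name is hypothetical and the sample text is copied from Listing 11.

```python
import re
from typing import Dict

# Output of "pmlist -p POWER5 -e PM_INST_CMPL", copied from Listing 11.
POWER5_LISTING = """\
POWER5: information about PM_INST_CMPL event
Event#,Status,Grouped,Threshold,Shared,SelectEvent,ShortName,LongName
=== Pmc 1
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 2
174,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 3
=== Pmc 4
=== Pmc 5
0,v,g,n,n,00009,PM_INST_CMPL,Instructions completed
=== Pmc 6
"""


def event_numbers_by_pmc(listing: str) -> Dict[int, int]:
    """Map each PMC that supports the event to the event number on that PMC."""
    numbers: Dict[int, int] = {}
    pmc = None
    for raw in listing.splitlines():
        line = raw.strip()
        header = re.match(r"=== Pmc (\d+)", line)
        if header:
            pmc = int(header.group(1))
            continue
        # A data row starts with the event number; PMCs without the event have no row.
        if pmc is not None and line and line[0].isdigit():
            numbers[pmc] = int(line.split(",")[0])
    return numbers


print(event_numbers_by_pmc(POWER5_LISTING))  # {1: 174, 2: 174, 5: 0}
```

Running the same function over Listing 12 would give a different mapping for POWER4, which is the reason event numbers should never be hard-coded across processor types.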
The derived metrics available on a POWER5 system can be displayed using "pmlist -D -1" as follows:
Listing 13. Derived metrics
# pmlist -p POWER5 -D -1
Derived metrics supported:
PMD_UTI_RATE Utilization rate
PMD_MIPS MIPS
PMD_INST_PER_CYC Instructions per cycle
PMD_HW_FP_PER_CYC HW floating point instructions per Cycle
PMD_HW_FP_PER_UTIME HW floating point instructions / user time
PMD_HW_FP_RATE HW floating point rate
PMD_FX Total Fixed point operations
PMD_FX_PER_CYC Fixed point operations per Cycle
PMD_FP_LD_ST Floating point load and store operations
PMD_INST_PER_FP_LD_ST Instructions per floating point load/store
PMD_PRC_INST_DISP_CMPL % Instructions dispatched that completed
PMD_DATA_L2 Total L2 data cache accesses
PMD_PRC_L2_ACCESS % accesses from L2 per cycle
PMD_L2_TRAF L2 traffic
PMD_L2_BDW L2 bandwidth per processor
PMD_L2_LD_EST_LAT_AVG Estimated latency from loads from L2 (Average)
PMD_UTI_RATE_RC Utilization rate (versus run cycles)
PMD_INST_PER_CYC_RC Instructions per run cycle
PMD_LD_ST Total load and store operations
PMD_INST_PER_LD_ST Instructions per load/store
PMD_LD_PER_LD_MISS Number of loads per load miss
PMD_LD_PER_DTLB Number of loads per DTLB miss
PMD_ST_PER_ST_MISS Number of stores per store miss
PMD_LD_PER_TLB Number of loads per TLB miss
PMD_LD_ST_PER_TLB Number of load/store per TLB miss
PMD_TLB_EST_LAT Estimated latency from TLB miss
PMD_MEM_LD_TRAF Memory load traffic
PMD_MEM_BDW Memory bandwidth per processor
PMD_MEM_LD_EST_LAT Estimated latency from loads from memory
PMD_LD_LMEM_PER_LD_RMEM Number of loads from local memory per loads from remote memory
PMD_PRC_MEM_LD_RC % loads from memory per run cycle
The hpmstat command is a utility used to continuously monitor a set of events counting system-wide activity. It reports results (raw hardware counts and derived metrics) at a given interval.
Listing 14. The hpmstat utility
# hpmstat -h
usage:
hpmstat [-H] [-k] [-o file] [-r] [-s set] [-T] [-U] [-u] interval count
hpmstat [-h]
where:
interval counting time interval (default is 1 and in seconds)
count number of iterations to count
-H adds hypervisor activity on behalf of the process
-h displays this help message
-k count system activity only (default is to count system,
user and hypervisor activity)
-o file output file name
-r enable runlatch, disable counts while executing in
idle cycle
-s set pre-defined set of events (1 to 9) - see command pmlist
-T write time stamps instead of time in seconds
-U the counting time interval is microseconds
-u count user activity only
The hpmcount command provides the execution wall clock time, raw hardware performance counts, derived hardware metrics, and resource utilization statistics (obtained from the getrusage() system call) for the application named by command.
Listing 15. The hpmcount utility
# hpmcount -h
usage:
hpmcount [-a] [-H] [-k] [-o file] [-s set] command
hpmcount [-h]
where:
command program to be executed
-a aggregate counters on POE runs
-H adds hypervisor activity on behalf of the process
-h displays this help message
-k adds system activity on behalf of the process
-o file output file name
-s set pre-defined set of events (1 to 9) - see command pmlist
Hardware events monitored by the hpm tools include:
- Cycles
- Instructions
- Floating point instructions
- Integer instructions
- Load/stores
- Cache misses
- TLB misses
- Branch taken/not taken
- Branch mispredictions
The hpm tools provide derived metrics, including:
- IPC - instructions per cycle
- Floating point rate (Mflip/s)
- FP computation intensity (flip per FP load/store)
- Instructions per load/store
- Load/stores per cache miss
- Cache hit rate
- Loads per load miss
- Stores per store miss
- Loads per TLB miss
- Branches mispredicted %
As an example, hpmcount produces the following data for the command pwd:
Listing 16. Sample hpmcount output
# hpmcount pwd
/usr/pmapi/tools
Execution time (wall clock time): 0.001848 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 0.000276 seconds
Total amount of time in system mode : 0.001063 seconds
Maximum resident set size : 148 Kbytes
Average shared memory use in text segment : 0 Kbytes*sec
Average unshared memory use in data segment : 0 Kbytes*sec
Number of page faults without I/O activity : 45
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 0
Number of involuntary context switches : 0
####### End of Resource Statistics ########
PM_FPU_1FLOP (FPU executed one flop instruction ) : 0
PM_CYC (Processor cycles) : 85563
PM_MRK_FPU_FIN (Marked instruction FPU processing finished) : 0
PM_FPU_FIN (FPU produced a result) : 0
PM_INST_CMPL (Instructions completed) : 17597
PM_RUN_CYC (Run cycles) : 85563
Utilization rate : 2.799 %
MIPS : 9.522
Instructions per cycle : 0.206
HW Float point instructions per Cycle : 0.000
HW floating point / user time : 0.000 M HWflop/sec
HW floating point rate (HW Flops / WCT) : 0.000 M HWflops/sec
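Two of the derived metrics in Listing 16 can be reproduced from the raw counts, which is a useful sanity check when you post-process counter data yourself. The sketch below is not the hpmcount implementation: the MIPS formula (instructions completed divided by wall clock time) is an assumption that happens to match the printed value, and the function name is hypothetical.

```python
def derived_metrics(cycles: int, instructions: int, wall_clock_seconds: float) -> dict:
    """Compute IPC, CPI, and MIPS from raw counts and elapsed time."""
    if cycles <= 0 or instructions <= 0 or wall_clock_seconds <= 0:
        raise ValueError("all inputs must be positive")
    return {
        "ipc": instructions / cycles,
        "cpi": cycles / instructions,
        # Assumption: MIPS = instructions completed / wall clock time / 1e6.
        "mips": instructions / wall_clock_seconds / 1e6,
    }


# Raw values from Listing 16: PM_CYC, PM_INST_CMPL, and the wall clock time.
metrics = derived_metrics(cycles=85563, instructions=17597, wall_clock_seconds=0.001848)
print(f"Instructions per cycle : {metrics['ipc']:.3f}")  # 0.206, as in the listing
print(f"MIPS                   : {metrics['mips']:.3f}")  # 9.522, as in the listing
print(f"Cycles per instruction : {metrics['cpi']:.3f}")
```

The CPI for this short pwd run is close to 5, far above the value seen for the compute-bound example in Listing 5, which is typical for a command dominated by startup and system activity.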
The tprof command reports CPU usage for individual programs and the system. The profiling report contains the following sections and subsections:
- Summary report section
  - CPU usage summary by process name
  - CPU usage summary by threads (tid)
- Global profile section (pertains to the execution of all processes on the system)
  - CPU usage of user mode routines
  - CPU usage of kernel routines
  - CPU usage summary for kernel extensions
  - CPU usage of each kernel extension's subroutines
  - CPU usage summary for shared libraries
  - CPU usage of each shared library's subroutines
  - CPU usage of each Java class
  - CPU usage of the Java methods of each Java class
- Process and thread level profile sections (one section for each process or thread)
  - CPU usage of user mode routines for this process/thread
  - CPU usage of kernel routines for this process/thread
  - CPU usage summary for kernel extensions for this process/thread
  - CPU usage of each kernel extension's subroutines for this process/thread
  - CPU usage summary for shared libraries for this process/thread
  - CPU usage of each shared library's subroutines for this process/thread
  - CPU usage of each Java class for this process/thread
  - CPU usage of the Java methods of each Java class for this process/thread
tprof supports two profiling modes: time-based (the default) and event-based. In time-based profiling, the decrementer interrupt drives tprof. In event-based profiling, either software-based events or Performance Monitor (PM) events drive the interrupt. Two flags control event-based profiling: -E and -f. The -E flag enables event-based profiling; its parameter is one of the four software-based events (EMULATION, ALIGNMENT, ISLBMISS, DSLBMISS) or a PM event. By default, the profiling event is processor cycles (PM_CYC). All PM events are prefixed with PM_, such as PM_CYC for processor cycles or PM_INST_CMPL for instructions completed. The pmlist command lists all PM events that are supported on a processor.
The -f flag specifies the sampling frequency for event-based profiling. For software-based events and processor cycles (PM_CYC), supported frequencies range from 1 to 500 milliseconds, with a default of 10 milliseconds. For all other PM events, the range is from 10,000 to MAXINT occurrences of the event, with a default of 10,000 events. As an example, Listing 17 shows the output of an event-based profiling run of the command sleep in real-time mode:
Listing 17. Event-based profiling output
Configuration information
=========================
System: AIX 5.3 Node: stram Machine: 005D13DA4C00
Tprof command was:
tprof -u -R -E PM_CYC -f 10 -r rootstring -x sleep 50
Trace command was:
trace -a -J tprof -o rootstring.trc
Total Samples = 195
Total Elapsed Time = 1.96s
Performance Monitor based report
Processor name: power5
Monitored event: Processor cycles (PM_CYC)
Sampling frequency: 10 ms
PURR was used to calculate percentages
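The -f ranges described for event-based profiling can be checked before launching a long run. This is a minimal validation sketch based only on the ranges stated in the text; it is not part of tprof, the function name is hypothetical, and MAXINT is assumed here to be the 32-bit signed maximum.

```python
from typing import Optional, Tuple

SOFTWARE_EVENTS = {"EMULATION", "ALIGNMENT", "ISLBMISS", "DSLBMISS"}
MAXINT = 2**31 - 1  # assumption: 32-bit signed maximum


def check_sampling_frequency(event: str = "PM_CYC",
                             frequency: Optional[int] = None) -> Tuple[int, str]:
    """Return (frequency, unit) for a tprof -E event, applying the default if needed."""
    if event in SOFTWARE_EVENTS or event == "PM_CYC":
        low, high, default, unit = 1, 500, 10, "ms"
    elif event.startswith("PM_"):
        low, high, default, unit = 10_000, MAXINT, 10_000, "events"
    else:
        raise ValueError(f"not a software-based or PM event: {event}")
    value = default if frequency is None else frequency
    if not low <= value <= high:
        raise ValueError(f"-f {value} is outside {low}..{high} {unit} for {event}")
    return value, unit


print(check_sampling_frequency("PM_CYC", 10))    # (10, 'ms'), as in Listing 17
print(check_sampling_frequency("PM_INST_CMPL"))  # (10000, 'events')
```

The unit changes with the event type, so the same -f value means a time interval for PM_CYC and an occurrence count for other PM events.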
The postprocessing of the report file named toto is as follows:
Listing 18. Postprocessed output
Configuration information
=========================
System: 5.3 Node: monvelo Machine: 0054BDAA4C00
Tprof command was:
./tprof -r toto
Tprof command used to produce input files was:
./tprof -c -A all -C all -r toto -x ls
Trace command was:
trace -a -L 1000000 -T 500000 -j 000,001,002,003,38F,005,006,134,139,465,00A,234 -o toto.trc -Call
Total Samples = 368
Total Elapsed Time = 1.84s
Listing 19 shows the time-based profiling of the command pwd:
Listing 19. Time-based profiling
# tprof -x pwd
Starting Command pwd
/usr/pmapi/samples
stopping trace collection.
CPU: 4294967295 PID: 160112 TID 463293
Trace Started on Tue Nov 22 11:35:19 2005
CPU: 4294967295 TID 466957 TIME: 8255614 Nanoseconds
Trace Stoped on Tue Nov 22 11:35:19 2005
Global Hook Counts
Global Hook Counts
-----------------------------
TrcOn: 1
TrcOff: 1
TrcHdr: 5
TrcUtil: 649
trc_exec: 3
trc_fork: 2
kern_prof: 2
trc_ldr: 9
672 hooks processed ( incl. 672 utility hooks )
0.008 seconds in measured interval
process pid 282 name wait found in syms file
Generating pwd.prof
The contents of the file pwd.prof are:
Listing 20. Profile output file
Configuration information
=========================
System: AIX 5.3 Node: java1 Machine: 00C4ED1A4C00
Tprof command was:
tprof -u -v -z -x pwd
Trace command was:
/usr/bin/trace -ad -L 1000000 -T 500000 -j 000,001,002,003,38F,005,006,134,139,5A2,465,00A,234 -o -
Total Samples = 1
Total Elapsed Time = 0.01s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Process FREQ Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 1 1 1 0 0 0
======= ==== ===== ====== ==== ====== =====
Total 1 1 1 0 0 0
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
wait 282 289 1 1 0 0 0
======= === === ===== ====== ==== ====== =====
Total 1 1 0 0 0
The next article in this series shows how to use the data produced by the tools this article introduced, and describes a CPI breakdown model allowing more detailed breakdowns of the sources of delays or inefficiencies in execution.
Learn
- Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support was presented at the Eighth Workshop on Computer Architecture Evaluation using Commercial Workloads in February 2005.
- A detailed description of the POWER5 design is available from the IBM Journal of Research and Development's POWER5 System Microarchitecture.
- The book The PowerPC Compiler Writer's Guide (IBM, 1996) describes, mainly by coding examples, the code patterns that perform well on Power Architecture processors.
- Alex Mericas's Performance Monitor PowerPC Perspective was originally presented in February 2005.
- Frank Levine's A Programmer's View of Performance Monitoring in the PowerPC Microprocessor is a detailed discussion of performance monitor support on the PowerPC 604 and 604e.
- Performance Monitoring on the PowerPC 604 Microprocessor by Charles Roth, Frank Levine, and Ed Welbon provides an in-depth discussion of performance monitoring on the PowerPC 604.
- Sam Siewert's Big iron lessons, Part 3: Performance monitoring and tuning contains a general discussion of performance tuning considerations for the system architect (developerWorks 2005).
- For a typical workload CPI analysis, see the article "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support." (PDF)
- You will find OProfile at SourceForge, and a guide to using OProfile at developerWorks.
- Find benchmarks to fulfil your every need (or at least many of them) at spec.org.
- The book Performance Tuning for Linux Servers doesn't just cover kernel tuning: it shows how to maximize the end-to-end performance of real-world applications and databases running on Linux.
- Keep abreast of all Power Architecture-related news and publications: subscribe to the Power Architecture community newsletter.
Get products and technologies
- See all Power Architecture-related downloads on one page.
Discuss
- Take part in the IBM developerWorks Power Architecture technology forums.



