Performance Monitor Counter data analysis using Counter Analyzer

Use Counter Analyzer to analyze PMC data of POWER and Cell Broadband Engine platforms

To understand what happens inside a processor when an application is executed, processor architects designed a set of special registers to count the events taking place while the processor executes instructions. These registers, called Performance Monitor Counters (PMCs), provide useful information about the processor, such as how many I-cache misses take place and how many instructions are completed. Counter Analyzer is a plug-in of the Visual Performance Analyzer, an Eclipse-based performance analysis tool. It can analyze raw events, metrics, and the CPI breakdown model, and helps you better understand these events. This article briefly introduces the Performance Monitor Counter and its related tools, and then shows you how to use these tools together with Counter Analyzer, using a Caesar cipher tool as an example.


Qi Liang (liangqi@cn.ibm.com), Staff Software Engineer, IBM

Qi Liang is a software engineer at the IBM China Systems and Technology Lab. Qi worked on Visual Performance Analyzer, a GUI performance analysis tool. He has extensive knowledge of the performance area and experience with a variety of performance tools on AIX, Linux, and Java. He has also worked on Eclipse plug-in development for more than five years, and was part of the Eclipse BIRT (Business Intelligence and Reporting Tools) project before joining IBM.



03 February 2009

Also available in Chinese

Introduction to Performance Monitor Counter

Modern processors have a special hardware facility, the Performance Monitor Unit (PMU), to collect events related to the operations in the processor. If a D-cache miss occurs (that is, the processor fails to find data in the D-cache), an interrupt is raised so that the corresponding register can record the event by incrementing its value. The Performance Monitor Counter (PMC) registers help reveal what is going on inside the chip.

Take a look at the PMCs in POWER5™. Each POWER5 processor core has six 32-bit PMC registers, PMC1 through PMC6. PMC1, PMC2, PMC3, and PMC4 are programmable, so you can specify which events to collect by setting the Monitor Mode Control Register 1 (MMCR1). PMC5 and PMC6 are not programmable, so their events are always collected, no matter which event group you select. PMC5 counts PM_INST_CMPL for completed instructions, and PMC6 counts PM_RUN_CYC for cycles. Normally each PMC is incremented by the number of times the corresponding event occurred in one cycle. At any given time, therefore, at most six events can be collected. The processor designer defines which six events can be collected together as an event group. You specify the events for the first four PMCs by selecting an event group.

The following table shows the PMCs of different POWER processors.

Table 1. Performance Monitor Counters
Processor        Performance Monitor Counters   Events   Event Groups
PowerPC 970      8                              230      49
PowerPC 970 MP   8                              230      51
POWER4           8                              244      63
POWER4 II        8                              244      63
POWER5           6                              474      163
POWER5 II        6                              483      188
POWER6           6                              553      202

How to evaluate performance using PMC

There are three approaches to evaluating performance using PMCs from a low level to a high level: the raw PMC event, metrics, and the CPI breakdown model.

Raw PMC events tell you specific things about how your code executes in the processor. For example, PM_TLB_MISS in POWER5 tells you how many TLB (Translation Look-aside Buffer) misses took place. The TLB is the cache holding the mapping from virtual addresses to physical pages in memory. If there are too many TLB misses, your code is probably accessing data across a large address range, and you may need to improve its spatial locality.

From a single raw event it is not easy to tell how good or bad the performance is, or even what the event exactly means. You need higher-level information. The second approach is metrics: values calculated by formulas defined in terms of events and pre-defined variables. For example, if you want to know the MIPS (Million Instructions per Second) rate, you can define a metric named PMD_MIPS. By definition, MIPS = 10^-6 * Instruction Count / Execution Time. PM_INST_CMPL increases by one when one instruction is completed in the pipeline, so it can represent the Instruction Count. There is a pre-defined variable named total_time for the wall-clock time of the whole data collection. So, define PMD_MIPS as: PMD_MIPS = 1e-06 * PM_INST_CMPL / total_time.
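As a minimal sketch of how such a metric is evaluated, the following C function applies the PMD_MIPS formula to a pair of inputs. The function name pmd_mips is hypothetical, and the sketch assumes that total_time is expressed in seconds; it is not how Counter Analyzer itself is implemented.

```c
#include <assert.h>

/* Hypothetical helper (not part of any IBM tool): evaluates the metric
 * PMD_MIPS = 1e-06 * PM_INST_CMPL / total_time.
 *   inst_cmpl  - value of the PM_INST_CMPL event (completed instructions)
 *   total_time - wall-clock time of the data collection, assumed in seconds */
double pmd_mips(unsigned long long inst_cmpl, double total_time)
{
    assert(total_time > 0.0);   /* a zero-length run has no meaningful rate */
    return 1e-06 * (double)inst_cmpl / total_time;
}
```

For example, 3,000,000,000 completed instructions collected over 2 seconds give pmd_mips(3000000000ULL, 2.0), which is 1500 MIPS.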

The last one is the CPI (Cycles per Instruction) breakdown. CPI is the average number of cycles an instruction takes to complete, and it is a very important measurement of overall system performance. A lower CPI means one instruction takes less time to execute, which of course means better performance. The breakdown divides the cycles into parts according to how the time is spent in different pipeline states.

Again, taking POWER5 as an example, the time an instruction stays in the pipeline can be divided into three parts:

  • Pipeline is dealing with instruction, including instruction decode, issue, and execution.
  • Pipeline stalls because the GCT is empty.
  • Pipeline stalls because of other reasons.

The Global Completion Table (GCT) is a table representing a group of instructions currently being processed by the processor. It stores the instructions, their logical program order, and the completion order of the instructions in the group. An empty GCT means there is nothing to do in the pipeline.

Total time (PMD_TOTAL_CPI)
    Pipeline is executing an instruction (PMD_CPI_CMPL_CYC)
    Pipeline stalls because of GCT empty (PMD_CPI_GCT_EMPTY)
    Pipeline stalls because of other reasons (PMD_CPI_STALL_CYC)

PMD_CPI_GCT_EMPTY can be caused by:

  • I-cache miss (Processor cannot get the next instruction from instruction cache),
  • Branch mis-prediction (Processor guesses the wrong branch, so it has to flush the pipeline and load instruction from proper branch again), or
  • Other reasons.

PMD_CPI_STALL_CYC can be caused by:

  • Stall by LSU instruction,
  • Stall by FXU instruction,
  • Stall by FPU instruction, or
  • Other reasons.

Total time (PMD_TOTAL_CPI)
    Pipeline is executing an instruction (PMD_CPI_CMPL_CYC)
    Pipeline stalls because of GCT empty (PMD_CPI_GCT_EMPTY)
        I-cache miss (PMD_CPI_GCT_EMPTY_IC_MISS)
        Branch mis-prediction (PMD_CPI_GCT_EMPTY_BR_MPRED)
        Other reasons (PMD_CPI_GCT_EMPTY_OTHER)
    Pipeline stalls because of other reasons (PMD_CPI_STALL_CYC)
        Stall by LSU instruction (PMD_CPI_STALL_LSU)
        Stall by FXU instruction (PMD_CPI_STALL_FXU)
        Stall by FPU instruction (PMD_CPI_STALL_FPU)
        Other reasons (PMD_CPI_STALL_OTHERS)

With these breakdowns, you get the final tree structure, named the CPI breakdown model. The CPI breakdown model is completely determined by the PMC events designed for the specific processor.

PMD_TOTAL_CPI - Total cycles
    PMD_CPI_CMPL_CYC - Completion cycles
    PMD_CPI_GCT_EMPTY - Completion Table empty (GCT empty)
        PMD_CPI_GCT_EMPTY_IC_MISS - I-Cache Miss Penalty
        PMD_CPI_GCT_EMPTY_BR_MPRED - Branch Misprediction Penalty
        PMD_CPI_GCT_EMPTY_OTHER
    PMD_CPI_STALL_CYC
        PMD_CPI_STALL_LSU - Stall by LSU instruction
            PMD_CPI_STALL_LSU_REJECT - Stall by LSU Reject
                PMD_CPI_STALL_LSU_ERAT_MISS - Stall by LSU Translation Reject
                PMD_CPI_STALL_LSU_REJECT_OTHERS
            PMD_CPI_STALL_LSU_DCACHE_MISS - Stall by LSU D-cache miss
            PMD_CPI_STALL_LSU_OTHERS
        PMD_CPI_STALL_FXU - Stall by FXU instruction
            PMD_CPI_STALL_FXU_DIV - Stall by any form of DIV/MTSPR/MFSPR instruction
            PMD_CPI_STALL_FXU_OTHERS
        PMD_CPI_STALL_FPU - Stall by FPU instruction
            PMD_CPI_STALL_FPU_DIV - Stall by any form of FDIV/FSQRT instruction
            PMD_CPI_STALL_FPU_OTHERS
        PMD_CPI_STALL_OTHERS

Tools collecting Performance Monitor Counter data

Many profiling tools can collect PMC data. This article discusses only the AIX® tools hpmcount and hpmstat, and the Cell SDK tool CPC (also called cellperfctr). All of these tools can output an XML format file to feed Counter Analyzer.

hpmcount and hpmstat - AIX

AIX V5.3 and V6.1 provide hpmcount and hpmstat to collect events from the PMCs. They do almost the same work and differ only in monitoring scope: hpmcount collects events from the workload application that it launches, while hpmstat collects events from the whole system.

Before touching hpmcount and hpmstat, have a look at pmlist, the utility tool listing PMC events, event groups, and metrics for the POWER processor.

First, run pmlist to know whether the processor you are working with is supported.

Listing 1. List all processors hpmcount/hpmstat supports
bash-3.00# pmlist -l
Processors supported (specify with -p)
====================
RS64-II
POWER3
RS64-III
POWER3-II
POWER4
POWER4-II
POWER5
PowerPC970
POWER5-II
POWER6
PowerPC970MP

With pmlist, you can also see the descriptions of the events and event groups for a given processor type. Here is group "0" of the POWER5 processor.

Listing 2. Event group 0 of POWER5
bash-3.00# pmlist -p POWER5 -g 0
Group #0: pm_utilization
Group name: CPI and utilization data
Group description: CPI and utilization data
Group status: Verified
Group members:
Counter  1, event 190: PM_RUN_CYC  : Run cycles
Counter  2, event 71: PM_IOPS_CMPL  : Internal operations completed
Counter  3, event 56: PM_INST_DISP  : Instructions dispatched
Counter  4, event 12: PM_CYC [shared core] : Processor cycles
Counter  5, event  0: PM_INST_CMPL  : Instructions completed
Counter  6, event  0: PM_RUN_CYC  : Run cycles

It lists the ID number, name, description, status, and six events of group "0." If you want to know all event groups, use "-g -1." See Resources for information on the pmlist command.

hpmcount

Now launch hpmcount to collect group "0" and output the result in an XML format file.

Listing 3. Collect event group 0 using hpmcount
bash-3.00# hpmcount -g 0 -o sleep -x sleep 2
  • -g 0 collects the performance events defined in group 0.
  • -o sleep specifies the output file name as sleep.
  • -x specifies XML output. The XML format output file is what Counter Analyzer needs.
  • sleep 2 is the workload application to run.

The output file has a name like "sleep_0000.319628". Rename it to sleep.pmf. For additional information on hpmcount, see the Resources section.

hpmstat

You can also run hpmstat to collect group "0." Unlike hpmcount, it takes no workload application, because hpmstat collects events in system-wide mode.

Listing 4. Collect event group 0 using hpmstat
bash-3.00# hpmstat -g 0 -o hpmstat.pmf -x

The output file has exactly the name you specify, hpmstat.pmf. This is another difference between hpmcount and hpmstat. For additional information on hpmstat, see the Resources section.

Multiplexing mode

You can specify more than one event group to collect in multiplexing mode using both hpmcount and hpmstat. For example, you can collect group 0 and group 1 using hpmstat.

Listing 5. Collect event groups 0 and 1 using hpmstat
bash-3.00# hpmstat -g 0,1 -o hpmstat2.pmf -x

Download the output file named hpmstat2.pmf. Now see what the multiplexing mode is.

Processors only have a fixed number of PMCs. For example, POWER5 has only six PMCs, and it collects six events defined in one event group at any time. So, in multiplexing mode, hpmcount and hpmstat control the PMCs to collect events of different event groups in different time slices in turn. For example, collect two event groups, G0 (E1, E2, E3, E4, E5, and E6) and G1 (E7, E8, E9, E10, E5, and E6) for six time slices (T1, T2, T3, T4, T5, and T6). E1 through E10 are the events to collect, and both G0 and G1 have E5 and E6. In the following table, you can see how these events are collected.

        T1   T2   T3   T4   T5   T6   ...
PMC1    E1   E7   E1   E7   E1   E7   ...
PMC2    E2   E8   E2   E8   E2   E8   ...
PMC3    E3   E9   E3   E9   E3   E9   ...
PMC4    E4   E10  E4   E10  E4   E10  ...
PMC5    E5   E5   E5   E5   E5   E5   ...
PMC6    E6   E6   E6   E6   E6   E6   ...
Group   G0   G1   G0   G1   G0   G1   ...

E5 and E6 are collected all the time, but the other events are collected only in alternating time slices. Their raw values are therefore not comparable, because their collection times differ. To do calculations with these events, you have to use normalized event counts, which let you handle the events as if each had been collected during the whole data collection. The normalized event count is defined as:

\[ E_{normalized} = \sum E \cdot T_{total} / T_{event} \]

Writing E1_{T1} for the count of E1 during time slice T1, this gives:

\[ E1_{normalized} = (E1_{T1} + E1_{T3} + E1_{T5}) \cdot (T1 + T2 + T3 + T4 + T5 + T6) / (T1 + T3 + T5) \]

\[ E2_{normalized} = (E2_{T1} + E2_{T3} + E2_{T5}) \cdot (T1 + T2 + T3 + T4 + T5 + T6) / (T1 + T3 + T5) \]

...

\[ E5_{normalized} = (E5_{T1} + E5_{T2} + E5_{T3} + E5_{T4} + E5_{T5} + E5_{T6}) \cdot (T1 + T2 + T3 + T4 + T5 + T6) / (T1 + T2 + T3 + T4 + T5 + T6) \]

...
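The normalization above can be sketched in C as follows. This is only an illustration of the formula, not the code hpmcount or hpmstat actually uses; the function name normalized_count and the per-slice array layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of hpmcount or hpmstat): computes
 * sum(E) * T_total / T_event for one event.
 *   counts[i]    - events counted in time slice i
 *   active[i]    - nonzero if the event's group was collected in slice i
 *   slice_len[i] - length of time slice i (any consistent time unit) */
double normalized_count(const unsigned long long *counts,
                        const int *active,
                        const double *slice_len,
                        size_t n_slices)
{
    double sum_events = 0.0;  /* events seen while the group was active */
    double t_total = 0.0;     /* length of the whole collection */
    double t_event = 0.0;     /* time during which this event was collected */

    for (size_t i = 0; i < n_slices; i++) {
        t_total += slice_len[i];
        if (active[i]) {
            sum_events += (double)counts[i];
            t_event += slice_len[i];
        }
    }
    assert(t_event > 0.0);    /* the event must be collected at least once */
    return sum_events * t_total / t_event;
}
```

With six slices of equal length and an event such as E1 that is collected only in T1, T3, and T5 with counts 100, 120, and 80, the result is 300 * 6 / 3 = 600. For an event such as E5 that is active in every slice, the scaling factor is 1 and the result is simply the sum of the counts.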

CPC - Cell SDK on Cell Broadband Engine

Cell SDK 3.0 also contains a tool named CPC, or cellperfctr (Cell Performance Counter), to collect performance events from the Cell Broadband Engine PMCs.

Before you start to run CPC, you need a small SPU application named "simple" from the Cell SDK. You can get it from the cell-tutorial-source-*.*-*.rpm. Build all the tutorial applications in Cell SDK 3.0, and run the "simple" application, as follows.

Listing 6. Get SPU application "simple"
bash-3.00# rpm -ql cell-tutorial-source
/opt/cell/sdk/src/tutorial_source.tar
bash-3.00# cp /opt/cell/sdk/src/tutorial_source.tar ~
bash-3.00# tar xvf tutorial_source.tar
tutorial/
tutorial/Makefile
tutorial/euler/
tutorial/euler/STEP3_multi_spe/
...
bash-3.00# cd tutorial/
bash-3.00# export CELL_TOP=/opt/cell/sdk/
bash-3.00# make
bash-3.00# cd simple/
bash-3.00# ./simple
Hello Cell (0x1820008)
Hello Cell (0x1820688)
Hello Cell (0x1820900)
Hello Cell (0x1820b98)
Hello Cell (0x1821578)
Hello Cell (0x1820e10)
Hello Cell (0x1821ce0)
Hello Cell (0x1821300)
Hello Cell (0x18221d0)
Hello Cell (0x1821088)
Hello Cell (0x18217f0)
Hello Cell (0x1821a68)
Hello Cell (0x1821f58)
Hello Cell (0x1822448)
Hello Cell (0x18226c0)
Hello Cell (0x1822938)

The program has successfully executed.
bash-3.00#

This application is really simple: it creates 16 SPE threads, and each thread prints "Hello Cell" and its ID. The simplest way to run CPC is to collect system clock cycles, whose event name is System_Clock_Cycles, or C.

Listing 7. Collect system clock cycles using CPC
bash-3.00# cpc -e C -X cycles.pmf ./simple
  • -e C specifies to collect cycle events.
  • -X cycles.pmf specifies XML output and names the output file cycles.pmf.
  • ./simple specifies to run the workload application.

You can also specify more than one event. CPC supports at most four events in a set, and also supports more than one event set. In the following case, two event sets, (2100, 2101, 2102, 2103) and (2106, 2109, 2111, 2119), are collected. The -i 10u option performs hardware sampling with a sample interval of 10 microseconds. CPC has a rich set of options for Cell BE processor PMC data collection. See the CPC help for details.

Listing 8. Collect two event sets using CPC
bash-3.00# cpc -e 2100,2101,2102,2103 -e 2106,2109,2111,2119 -i 10u -X 2sets.pmf ./simple

Sample application: Caesar cipher tool

This article uses a Caesar cipher tool as the sample program for later discussion. The Caesar cipher is a very old and straightforward encryption algorithm. It shifts each letter in the plain message n characters forward to get the cipher text. For example, if the shift number is 2, "H" is substituted by "J," "z" is substituted by "b," and so on.

Substitutions
Plain Text    H e l l o w o r l d !
Cipher Text   J g n n q y q t n f !

In this sample program, only alphabetic and numeric characters (a-z, A-Z, and 0-9) are ciphered. Other characters, such as commas, remain unchanged. If a character shifts beyond 'z', 'Z', or '9', the shift wraps around and continues from 'a', 'A', or '0'. You can download caesar_src.zip. The command syntax is as follows:

Listing 9. Command syntax of Caesar
Syntax: caesar [enc|dec] shift_number input_file output_file
enc|dec: 	  Encipher (enc) or decipher (dec)
shift_number: The number of characters to shift
input_file:   The input file. 
              The plain text for encipher, or the cipher text for decipher.
output_file:  The output file. 
              The cipher text for encipher, or the plain text for decipher.

Here is the source listing of the function, int cipher(char * buffer, int length, int shift), in caesar.c.

Listing 10. Caesar encipher/decipher
// char * buffer - The buffer holding the bytes read from file
// int length    - How many bytes are read into buffer
// int shift     - The shift number for ciphering
//
// Return the number of ciphered bytes
int cipher(char * buffer, int length, int shift) {
    int bytes = 0;
    int j = 0;
    for (; j < length; j ++) {

        if (buffer[j] >= 'a' && buffer[j] <= 'z') {
            buffer[j] = (buffer[j] - 'a' + 26 + shift) % 26 + 'a';
            bytes ++;
        }
        if (buffer[j] >= 'A' && buffer[j] <= 'Z') {
            buffer[j] = (buffer[j] - 'A' + 26 + shift) % 26 + 'A';
            bytes ++;
        }
        if (buffer[j] >= '0' && buffer[j] <= '9') {
            buffer[j] = (buffer[j] - '0' + 10 + shift) % 10 + '0';
            bytes ++;
        }
    }

    return bytes;
}

Different code executes depending on whether the character is in "a-z", "A-Z", or "0-9". Note that the shift number can be negative, which means shifting characters backwards.

Prepare an input file hello.txt containing "Hello world!" first.

Listing 11. Prepare plain text file as input
bash-3.00# echo 'Hello world!' > hello.txt
bash-3.00# cat hello.txt
Hello world!

Then compile the program and run it to encipher the message in hello.txt. You get the ciphered message "Jgnnq yqtnf!" in hello.enc.

Listing 12. Compile and run
bash-3.00# xlc -o caesar1 caesar1.c
bash-3.00# ./caesar1
Syntax: caesar [enc|dec] shift_number input_file output_file
bash-3.00# ./caesar1 enc 2 hello.txt hello.enc
10 bytes are processed.
1283 ticks elapsed.
bash-3.00# cat hello.enc
Jgnnq yqtnf!

It works well. Now use hpmcount to launch this cipher tool to cipher a huge input file, hundreds of MB for example, to see what the PMC data looks like and where you can make some improvement.

In order to get complete CPI breakdown data, you need to collect event groups 0, 1, 5, 28, 29, 30, 31, 40, 43, 44, 48, 49, 79, 81, and 91. If you just want the total CPI, you can choose any group, since the PMC5 and PMC6 always collect instruction completion and cycle events.

Listing 13. Collect the CPI breakdown event groups using hpmcount
bash-3.00# hpmcount -g 0,1,5,28,29,30,31,40,43,44,48,49,79,81,91 \
-o caesar1_cpi -x ./caesar1 enc 4 huge.txt huge.enc
147040048 bytes are processed.
13627187 ticks elapsed.

The command writes an output file with a name like caesar1_cpi_0000.311434. Rename it to caesar1_cpi.pmf for later use.


Use Counter Analyzer to view PMC data

Counter Analyzer, a VPA (Visual Performance Analyzer) plug-in, helps you analyze and understand PMC data more easily. It reads the XML format files generated by hpmcount, hpmstat, and CPC. Besides supporting the three analysis methods discussed earlier, Counter Analyzer also supports PMC data comparison, and provides metrics definitions and a CPI breakdown model for a number of processors.

Processor        Metrics   CPI Breakdown Model
PowerPC 970      Y         N
PowerPC 970 MP   Y         N
POWER4           Y         Y
POWER4 II        Y         Y
POWER5           Y         Y
POWER5 II        Y         Y
POWER6           Y         Y

Install VPA

You can get Visual Performance Analyzer at http://www.alphaworks.ibm.com/tech/vpa. It's very easy to install VPA. Assume you have downloaded vpa-rcp-${version}-win32.zip and are going to install it to C:\. There are only three steps to take:

  • Unzip vpa-rcp-${version}-win32.zip to c:\
  • cd c:\vpa-rcp
  • Run vpa.exe

If you see the following welcome view, VPA ran successfully. Then select Tools -> Counter Analyzer to switch to the Counter Analyzer perspective.

Figure 1. VPA welcome view

Before opening the XML output file, you need to change its suffix name to .pmf (Performance Monitoring File), which helps the Visual Performance Analyzer to recognize it as the input for Counter Analyzer. After Counter Analyzer opens the file, the raw PMC events are displayed in the Counter Analyzer editor. There are three editor pages, Details, Metrics, and CPI Breakdown, which display the raw PMC events, metrics, and CPI breakdown individually. Open caesar1_cpi.pmf in VPA.

View raw PMC events

After opening caesar1_cpi.pmf, you see many events and their count values on the Details page, with the event descriptions shown in tool tips. An event name such as PM_CMPLU_STALL_DIV[u] includes the counting mode in square brackets. The [u] means that the event was counted while the application was executing in user mode. The other possible values are k (kernel mode), h (hypervisor mode), r (runlatch mode), and n (nointerrupt mode).

The selected event in the following figure is PM_CMPLU_STALL_DIV collected in user mode. Because hpmcount doesn't support per-processor data collection, the events from different processors are accumulated into one value, so data from a uniprocessor system looks the same as data from a multiprocessor system.

Note that if you specify more than one event group, the event counts are normalized values, as discussed previously. The event counts therefore look as if they were collected during the whole collection period.
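If you post-process event labels yourself, the counting-mode suffix described above is easy to separate from the event name. The following C function is a rough sketch under the assumption that a label has the form NAME[mode], as in PM_CMPLU_STALL_DIV[u]; the function name split_event_name is hypothetical and is not part of Counter Analyzer or the .pmf format.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: splits a label such as "PM_CMPLU_STALL_DIV[u]"
 * into the event name and the counting mode between the brackets.
 * Returns 1 on success, or 0 if the label has no "[...]" suffix or an
 * output buffer is too small. */
int split_event_name(const char *label,
                     char *name, size_t name_size,
                     char *mode, size_t mode_size)
{
    assert(label != NULL && name != NULL && mode != NULL);

    const char *open = strchr(label, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (open == NULL || close == NULL)
        return 0;

    size_t name_len = (size_t)(open - label);
    size_t mode_len = (size_t)(close - open - 1);
    if (name_len >= name_size || mode_len >= mode_size)
        return 0;

    memcpy(name, label, name_len);      /* text before '[' */
    name[name_len] = '\0';
    memcpy(mode, open + 1, mode_len);   /* text between '[' and ']' */
    mode[mode_len] = '\0';
    return 1;
}
```

Called with the label "PM_CMPLU_STALL_DIV[u]", the function stores "PM_CMPLU_STALL_DIV" in name and "u" in mode.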

Figure 2. View raw PMC events

View metrics

Open the Metrics page by clicking the Metrics tab. The left pane, named the metrics pane, lists all metrics. They might be grouped, depending on the processor of your system. In this case, the metrics of POWER5 are grouped into "cpi_breakdown" and "performance." If there is no metric group defined, all metrics are listed. The right pane, named the variable pane, lists the variables required for metrics calculation. For example, the metric PMD_MIPS is defined as 1e-06 * PM_INST_CMPL / total_time, so its value depends on the total data collection time as well as on PM_INST_CMPL. Counter Analyzer reads the data collection time from the .pmf file and displays it in the variable pane. If a metric uses other variables in its formula definition, you need to enter appropriate values for them in the variable pane.

Figure 3. View metrics

View CPI breakdown data

The last page is the CPI breakdown page, which displays CPI breakdown data as a tree, as discussed earlier. Each tree node is called a component.

  • The fig6 icon marks a component that has a formula defined with events. For example, PMD_CPI_GCT_EMPTY = PM_GCT_NOSLOT_CYC / PM_INST_CMPL.
  • The fig7 icon marks a component that has no formula of its own; its value is determined by its parent and siblings. For example, PMD_CPI_GCT_EMPTY_OTHER = PMD_CPI_GCT_EMPTY - PMD_CPI_GCT_EMPTY_IC_MISS - PMD_CPI_GCT_EMPTY_BR_MPRED.

Figure 4. View CPI breakdown data

You can also export CPI breakdown data as HTML by choosing Export as HTML ... in the context menu. This is helpful if you need the report for your performance analysis work.
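To make the two kinds of CPI breakdown components concrete, here is a small C sketch. The first function evaluates a formula-based component such as PMD_CPI_GCT_EMPTY = PM_GCT_NOSLOT_CYC / PM_INST_CMPL, and the second derives an "other" component from its parent and siblings. The function names are hypothetical, and this is an illustration of the arithmetic rather than Counter Analyzer's implementation.

```c
#include <assert.h>

/* Formula-based component: a cycle-type event divided by the number of
 * completed instructions, for example
 * PMD_CPI_GCT_EMPTY = PM_GCT_NOSLOT_CYC / PM_INST_CMPL. */
double cpi_component(unsigned long long cycles_event,
                     unsigned long long inst_cmpl)
{
    assert(inst_cmpl > 0);   /* CPI is undefined without completed instructions */
    return (double)cycles_event / (double)inst_cmpl;
}

/* Derived component: what remains of the parent after subtracting the
 * siblings that have their own formulas, for example
 * PMD_CPI_GCT_EMPTY_OTHER = PMD_CPI_GCT_EMPTY
 *     - PMD_CPI_GCT_EMPTY_IC_MISS - PMD_CPI_GCT_EMPTY_BR_MPRED. */
double cpi_remainder(double parent, const double *siblings, int n_siblings)
{
    double rest = parent;
    for (int i = 0; i < n_siblings; i++)
        rest -= siblings[i];
    return rest;
}
```

For instance, with made-up numbers, a parent value of 0.5 and sibling values of 0.2 and 0.1 leave about 0.2 for the "other" component.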

Compare PMC data

You can compare PMC data from two counter data collection runs. For example, you are doing performance tuning work for an application. You can collect the PMC data against the original application as baseline data. After you do some improvement in your code, you can run it again and compare it with the baseline so that you know how much and where the code is improved, and understand your changes better.

For the sample program, you already have the baseline PMC data, caesar1_cpi.pmf. Now examine the source file to see where you can make some improvement.

Listing 14. Caesar encipher/decipher
// char * buffer - The buffer holding the bytes read from file
// int length    - How many bytes are read into buffer
// int shift     - The shift number for ciphering
//
// Return the number of ciphered bytes
int cipher(char * buffer, int length, int shift) {
    int bytes = 0;
    int j = 0;
    for (; j < length; j ++) {

        if (buffer[j] >= 'a' && buffer[j] <= 'z') {
            buffer[j] = (buffer[j] - 'a' + 26 + shift) % 26 + 'a';    // (1)
            bytes ++;
        }
        if (buffer[j] >= 'A' && buffer[j] <= 'Z') {                   // (2)
            buffer[j] = (buffer[j] - 'A' + 26 + shift) % 26 + 'A';    // (1)
            bytes ++;
        }
        if (buffer[j] >= '0' && buffer[j] <= '9') {                   // (2)
            buffer[j] = (buffer[j] - '0' + 10 + shift) % 10 + '0';    // (1)
            bytes ++;
        }
    }

    return bytes;
}

After examining the PMC data and source file carefully, you can see two places to improve.

  • From the raw PMC events, you find a lot of PM_CMPLU_STALL_DIV events, which means the pipeline spends a lot of time waiting for the division unit to become available. After reading the whole source file, you find that the lines marked (1) use division heavily through the % operator. Change the way letters are shifted, as shown here for lowercase letters.
    Listing 15. New encipher code
                    buffer[j] += shift;
                    if (buffer[j] > 'z')
                        buffer[j] -= 26;
                    else if (buffer[j] < 'a')
                        buffer[j] += 26;
  • The checks for a-z, A-Z, and 0-9 are mutually exclusive, so at most one of them can match. You can add else before the latter two ifs, marked (2), to avoid unnecessary character comparisons.

Now run hpmcount again to collect PMC data for the new version. It writes an output file with a name like caesar2_cpi_0000.311458. Rename it to caesar2_cpi.pmf. Click the fig8 icon on the toolbar. In the Select To Compare dialog, select caesar1_cpi.pmf and click the Add button. Then select caesar2_cpi.pmf and add it as well. The first file works as the baseline during comparison. You can also select the metrics or CPI breakdown model to use. In this case, select the predefined POWER5 metrics and CPI breakdown model.

Figure 5. Select .pmf files to compare

After you click OK, Counter Analyzer loads these two files and displays the comparison results as follows. Columns (A) and (B) represent the two files being compared, respectively. The next two columns are the delta value (B - A) and the percentage value ((B - A) * 100 / A). A red value marks the larger one, while a blue value marks the smaller one.

Figure 6. Compare raw PMC events

From the raw PMC event comparison, you can see:

  • The total cycles decrease by 21.88%, so the overall time performance is improved.
  • Both PM_CMPLU_STALL_DIV and PM_CMPLU_STALL_FXU become 0. This matches the optimization, which removed all division.
  • Both PM_CMPLU_STALL_DCACHE_MISS and PM_CMPLU_STALL_LSU decrease by about 40-50%. Fewer pipeline stalls are caused by D-cache misses and load/store instructions, so data locality is improved.
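The delta and percentage columns of the comparison view follow the two formulas given earlier, B - A and (B - A) * 100 / A. As a minimal sketch, they can be computed like this in C; the type and function names are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Result of comparing a baseline value A with a new value B. */
typedef struct {
    double delta;    /* B - A */
    double percent;  /* (B - A) * 100 / A */
} comparison;

/* Hypothetical helper: compares one event count or metric value from the
 * baseline file (a) with the same item from the second file (b). */
comparison compare_counts(double a, double b)
{
    comparison c;
    assert(a != 0.0);   /* the percentage is undefined for a zero baseline */
    c.delta = b - a;
    c.percent = (b - a) * 100.0 / a;
    return c;
}
```

A negative percentage means the value went down relative to the baseline. For example, a count that drops from 200 to 150 gives a delta of -50 and a percentage of -25.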

You can also view the metrics comparison and the CPI breakdown data comparison. The CPI breakdown comparison is shown in Figure 7 below. You can see that the total CPI is improved by 4.41%, from 1.688 to 1.614. PMD_CPI_STALL_FXU (the pipeline stalls related to fixed-point division) becomes 0.

Figure 7. Compare CPI breakdown Data

Conclusion

Performance Monitor Counter is an important facility for performance monitoring or measurement. You can collect PMC data using tools like hpmcount, hpmstat, and CPC. Besides inspecting raw PMC events, you also get high-level measurement with metrics and the CPI breakdown model. Counter Analyzer supports all three analysis methods, raw PMC events, metrics, and CPI breakdown data, and can help you understand PMC data better. It also supports comparing PMC data from two different runs so that you can find out how much and which parts are improved and understand your optimization better.


Download

Description                    Name             Size
Source code for this article   caesar_src.zip   17KB

Resources

Learn

Get products and technologies

Discuss
