Performance tuning C/C++ applications with Performance Advisor in Rational Developer for Power Systems Software

Performance Advisor, introduced in IBM Rational Developer for Power Systems Software 8.5, provides a rich set of tools that enable you to get better performance from your C/C++ applications that run on IBM Power Systems. In this tutorial, Mike Kucera walks you through the main functionality of the Performance Advisor and shows you how to improve the performance of an example application.

Mike Kucera (mkucera@ca.ibm.com), Software Developer, IBM

Mike Kucera is a member of the Performance Advisor developer team and has had a significant impact on the design of the user interface. He is also a committer on the Eclipse CDT open source project.



26 June 2012


Performance Advisor overview

The 8.5 release of IBM® Rational® Developer for Power Systems Software™ introduced a new component called Performance Advisor, which provides a rich set of features for performance tuning C and C++ applications on IBM® AIX® and IBM® PowerLinux™.

Performance Advisor is both easy to use and very powerful. If you are new to performance tuning, you will find that Performance Advisor is a great way to get started, because the user interface is simple and the tool provides plenty of feedback and guidance. If you are an experienced performance tuner, you will find a rich set of tools that you can use to effectively isolate and fix performance issues.

Rational Developer for Power Systems Software (often called RD Power or RDp unofficially) is already well-known for its development and debugging tools, and Performance Advisor is well-integrated with those. You can use Performance Advisor as a stand-alone tool, or you can seamlessly integrate it into your existing code, build, test, and debug cycle.

This tutorial takes you through a day in the life of a Performance Advisor user.


Understanding performance tuning

Before we begin the tutorial, let's take a look under the hood to see how Performance Advisor works.

Where the performance data comes from

Performance Advisor gathers data from several sources. The raw application performance data comes from low-level operating system tools that sample the state of the processor and memory at regular intervals. The debug information generated by the compiler allows this data to be matched back to the original source code. XLC compilers can generate XML report files that provide information on optimizations that were performed during compilation. And finally, the application's build and runtime systems are analyzed to determine whether there are any potential environmental problems.

All of this data is automatically gathered, correlated, analyzed, and presented to you in a way that is quick to access and easy to understand. This makes it much easier to determine the best strategies for optimizing your application.

Sampling the instruction pointer

The main source of performance data on AIX is the tprof command (on Linux, the equivalent tool is OProfile).

While your application is running, tprof will wake up approximately every 10 milliseconds to record a sample of the state of the processor's instruction pointer, which contains the memory address of the currently executing instruction. Each sample is called a tick. After the performance run is complete, the debug information generated by the compiler is used to map each tick to its corresponding source code line. From this correlated data, you can tell which parts of the program execute most often. These parts are called hot and are usually the best place to start looking for opportunities to optimize the program code.
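To make the idea of ticks concrete, the following C++ sketch shows one way that sampled instruction-pointer values could be attributed to functions and counted. It is an illustration only, not a description of how tprof or Performance Advisor is implemented. The Symbol type, the countTicks function, and the addresses mentioned afterward are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function name and its start address.
struct Symbol {
    std::uint64_t start;
    std::string name;
};

// Attribute each sampled instruction-pointer value (one tick) to the function
// whose start address is the closest one at or below the sample.
// `symbols` must be sorted by ascending start address.
std::map<std::string, int> countTicks(const std::vector<Symbol>& symbols,
                                      const std::vector<std::uint64_t>& samples) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    std::map<std::string, int> ticks;
    for (std::uint64_t ip : samples) {
        // Find the first symbol that starts strictly after the sampled address.
        auto it = std::upper_bound(symbols.begin(), symbols.end(), ip,
                                   [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
        if (it == symbols.begin()) {
            continue;  // the sample lies below every known symbol
        }
        --it;  // step back to the symbol that contains the address
        ++ticks[it->name];
    }
    return ticks;
}
```

For example, with symbols main at 0x1000, sqrt at 0x2000, and render at 0x3000, the samples 0x2004, 0x2010, and 0x2ffc would all be counted as ticks for sqrt, which would make sqrt the hottest function in that small data set.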

Pitfalls of sampling-based performance data

Several other processes might be executing on the target system at the same time as the application being profiled, and it is common for many of the ticks to be attributed to those competing processes. For this reason, it is best to run on a relatively "quiet" system. The ideal situation is to have a machine dedicated to performance testing.

Because the raw data is based on sampling, increasing the number of samples increases the statistical relevance of the collected data. We suggest that the application being profiled run for at least 30 seconds, preferably much longer. This can be done by choosing an input set that causes the application to run for an extended time or by using a script to run the application multiple times in a loop.
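A little arithmetic shows why short runs produce thin data. The sketch below assumes one sample roughly every 10 milliseconds, as described earlier for tprof; the function names are hypothetical and the results are estimates, not guarantees about what a real run will collect.

```cpp
#include <cassert>

// Approximate number of ticks collected during a run, assuming one sample
// every `intervalMs` milliseconds (about 10 ms in the tprof description).
long expectedTicks(double runSeconds, double intervalMs = 10.0) {
    assert(runSeconds >= 0.0 && intervalMs > 0.0);
    return static_cast<long>(runSeconds * 1000.0 / intervalMs);
}

// Ticks expected to land in a function that accounts for `share`
// (0.0 to 1.0) of the total execution time.
double expectedFunctionTicks(double runSeconds, double share, double intervalMs = 10.0) {
    assert(share >= 0.0 && share <= 1.0);
    return expectedTicks(runSeconds, intervalMs) * share;
}
```

Under these assumptions, a 30-second run yields about 3000 ticks, so a function that accounts for 1% of the execution time receives only about 30 of them. A 3-second run would leave that same function with about 3 ticks, which is too few to draw conclusions from.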

You can run tprof directly from the AIX command line, but it can be quite difficult to use this way. It is highly configurable and takes many command line options. The raw data files it generates are very large, and they can be time-consuming to analyze manually. Performance Advisor uses tprof to gather the raw performance data, but this is done transparently. Therefore, as an end user, you never have to deal with these low-level tools directly.

Sampling the call stack

Call stack sampling data is also collected from low-level system tools. This data comes from the procstack command on AIX and from OProfile on Linux.

The application's call stack is sampled at regular intervals, and all of the application's currently executing functions are recorded. You can then explore the runtime call paths leading to and from any function of interest by using a graphical viewer. With this information, you can answer questions such as, "Is this function hot because it takes too long to execute or because it is called too often?"
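As a rough illustration of how sampled call stacks help answer that kind of question, the following C++ sketch counts which functions directly call a function of interest across a set of samples. The CallStack alias and the countDirectCallers function are hypothetical; this is not the algorithm used by Performance Advisor.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One sampled call stack: outermost caller first, currently executing
// function last.
using CallStack = std::vector<std::string>;

// For every sampled stack that contains `target`, count which function
// called it directly. A caller with many samples marks a hot call path.
std::map<std::string, int> countDirectCallers(const std::vector<CallStack>& stacks,
                                              const std::string& target) {
    assert(!target.empty());
    std::map<std::string, int> callers;
    for (const CallStack& stack : stacks) {
        for (std::size_t i = 1; i < stack.size(); ++i) {
            if (stack[i] == target) {
                ++callers[stack[i - 1]];
            }
        }
    }
    return callers;
}
```

If most of the samples that contain a hot function share the same direct caller, that caller is usually the next place to look.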

System Scorecard report

The performance of your application also depends on how it was built and on the environment where it runs. Often, a simple change to the build or runtime environment can have a large impact on an application's performance without actually having to change the application's source code.

Performance Advisor analyzes the build and runtime hosts and scores them based on several criteria, including hardware level, OS level, compiler version, and build options. It then generates a report called the System Scorecard, which provides recommendations for improving the system configuration for better application performance. This is a good place to start if you are looking for some low-hanging fruit.

Compiler transformation reports

XLC compilers can produce XML report files that describe optimizations that were performed during compilation. These reports are not strictly required, but if they are generated, more information will be available for analysis. One of the most interesting things that these reports reveal is the location of function calls that were inlined during compilation.


Tutorial: Using Performance Advisor to improve the performance of a C++ application

You can use the Performance Advisor to analyze and compare the performance of an application across several machines running AIX or PowerLinux, but to keep things simple, we will focus on performance tuning on a single AIX machine.

A sample application called RayTracer, which we use throughout this article, is provided in the Download section. RayTracer is a small C++ application that generates image files that picture various geometric shapes. We will use Performance Advisor to incrementally improve the performance of this application and to compare its performance to a baseline as we go.

You can download the RayTracer sample program and follow along with the demonstration, but keep in mind that performance results are dependent on the system where the application runs, so you will get different numbers from the ones shown in the upcoming examples.

Prerequisite:

You will need access to an AIX server with XLC 11.1 and the Rational Developer for Power Systems Software server component installed.

Getting started

  1. Begin by switching to the Performance Advisor perspective (Window > Open Perspective > Other > Performance Advisor).
Figure 1. Performance Advisor perspective
screen capture

The first thing that you need is a connection to the remote machine where the application will be built and run.

  1. In the Remote Systems view, under New Connection, right-click the AIX node, select New Connection, and then follow the wizard.
Figure 2. Creating a new connection
Remote Systems tab view

Next, we need a remote C++ project that we will use to edit and build our application.

  1. Extract the RayTracer source code to a folder somewhere on the remote machine.
  2. From the main menu, select New > Remote C/C++ Project.
Figure 3. Creating a remote C/C++ project
Select 'Remote C/C++ Project' from the 'New' menu
  3. Follow the wizard to set up the project. (There are a few different types of remote projects in Rational Developer for Power Systems Software, but Performance Advisor supports all of them.)
Figure 4. The new remote C/C++ project wizard
First page of the remote C/C++ project wizard

Building the application

The application must be compiled with debug information enabled to collect line-level performance data. For both XLC and GCC compilers, this is done by passing the -g option. Additionally, if you are using XLC, you will need the -qlistfmt=xml=all option to generate XML transformation reports during a build.

For a more detailed overview of compiler options used by Performance Advisor, please see the documentation. The makefile provided with RayTracer uses the correct options for XLC on the AIX platform.

Creating a launch configuration

We want to be able to launch the RayTracer application from within the IDE.

  1. From the main menu select Run > Run Configurations.
Figure 5. The Run Configurations dialog window
Creating a Remote Compiled Application launch
  2. In the dialog window, double-click Remote Compiled Application, and then browse to the location of the RayTracer executable file.

Tip:
If you click the Run button, you will see that RayTracer is launched on the remote machine and the console output shows locally in the Console View.

Creating a performance tuning session

Now comes the fun part: performance tuning the application.

The main view in Performance Advisor is the Performance Explorer view. From here, you can start performance runs, organize the data, and analyze the results.

Performance runs are organized by using two artifacts: Sessions and Activities. Each Activity represents a single performance run of the application, and a Session is just a list of Activities. There are two types of Activities: System Scorecard and Hotspot Detection. Both are covered in the upcoming steps.

  1. To create a Session, click the New Session toolbar button at the top of the Performance Explorer view to open the New Performance Tuning Session wizard.
Figure 6. New Session Toolbar Button
Click the New Session toolbar button
Figure 7. New Performance Tuning Session wizard, first page
Create a Performance Tuning Session

You can use the New Performance Tuning Session wizard to configure complex scenarios, such as tuning large applications across several machines. For this tutorial, however, we will set up a simple scenario where we build and tune our small RayTracer application on a single machine.

The first page of the wizard asks for the following information:

  • Name of the session
  • Build host: Rational Developer for Power Systems Software 8.5 introduces a new feature that allows a remote project to be synchronized across more than one host. For projects using this feature, you would select the build host at this point. This tutorial uses only one host, so the default is correct.
  • Runtime host: Performance Advisor supports a scenario where you build on one host but execute the performance run on a different host. This is useful if your organization has dedicated performance testing machines, or if you want to test your application on a machine other than the one where you built it without copying all of the project files there. For this tutorial, leave the runtime host the same as the build host.
  • Temporary data directory: During performance data collection, some temporary files are created. You need to provide a folder to store these files during the run (they are automatically deleted after the run). Click the Use Default button to pick a default location under your home directory.
  1. After you have provided the information, click Next.
Figure 8. New Performance Tuning Session wizard, second page
Define Application

On the second page of the wizard, you provide the locations of the executables and shared libraries that make up the application. This information is used to provide more accurate recommendations (you can also update this data after the session is created).

  1. Browse to the location of the RayTracer executable file, and add it to the list of the executables.
  2. Click Next.
Figure 9. New Performance Tuning Session wizard, third page
Create System Scorecard Activity

The third page prompts you to create a System Scorecard Activity along with the new Session. This is a good idea if you are performance tuning on a particular remote machine for the first time. On this page, you can specify a minimum and a preferred Power platform version. If you do this, the recommendations that are generated later will be focused on the platforms that you care about most.

  1. For this simple example, just leave the defaults and click Finish.

The new Session and Activity show in the Performance Explorer view.

Figure 10. Performance Explorer with the new Session and Activity
Performance Explorer now shows the new Session

Running the System Scorecard

The System Scorecard Activity starts in the new state, meaning that it is ready to run. The bottom panel of the Performance Explorer is used to run Activities.

  1. Select the System Scorecard Activity, and click the Begin Data Collection button.
Figure 11. Running a System Scorecard Activity
The activity goes from new to running

The Activity goes into the running state. In the background, Performance Advisor is analyzing the runtime host and the executable. When this process is finished, the Activity goes into the complete state. The time of completion now appears next to the Activity.

  1. Double-click the Activity to open the Scorecard Viewer.
Figure 12. The System Scorecard viewer
System Scorecard showing link to 2 recommendations

Here (Figure 12), we can see that we are actually missing some key best practices. It turns out that RayTracer was initially built without turning on compiler optimization.

  1. Let's find out what we can do about this by clicking the link that says 2 recommendations to open the Recommendations view.

The Recommendations view shows automatically generated recommendations for the currently selected Activity. Each recommendation indicates a Confidence Level, and the higher the confidence level, the more likely that following the recommendation will have a positive impact on performance.

Figure 13. The Recommendations view
Recommendations: increase optimization level and review warnings

Performance Advisor has determined that the application should be rebuilt using the -O compiler option. This is a quick and easy way to get better performance.

Increasing the optimization level will probably also lengthen the time that it takes for a build to finish. For this reason, Performance Advisor selectively recommends optimization level increases for the hottest parts of your application. That way, you get the most benefit by optimizing the hot parts of the application while avoiding spending time optimizing parts that have little effect on overall performance.

Establishing a baseline

Performance Advisor has recommended that we rebuild RayTracer with a higher compiler optimization level. But before we do that, it's a good idea to establish a baseline for comparison. That means executing a performance run of the application before making any changes.

  1. Right-click the Session, and select New Activity.
  2. Create a new Hotspot Detection Activity, name it Hotspot Detection 1, and use the launch configuration that you created previously.
Figure 14. The New Activity window
select Hotspot Detection as the Activity type

The new Activity will appear in the Performance Explorer view.

Figure 15. Performance Explorer with hotspot detection
Performance Explorer now contains the new Activity
  1. Select the new Activity, and click the button that says Launch Program and Collect Data.
  2. When the Activity is complete, right-click it, and select Set as Baseline.
Figure 16. Setting the baseline
Set Hotspot Detection 1 as the baseline
Figure 17. The Baseline Activity
Hotspot Detection 1 is now the baseline

Comparing performance runs

  1. Open the makefile, add -O2 to the compiler options, and rebuild the application.
Figure 18. The makefile editor
Add -O2 using the makefile editor

Now let's see what effect this has on the application's performance.

  1. Create another Hotspot Detection Activity, name it Hotspot Detection 2, and run it.
  2. When that has finished, right-click it, and select Compare with Baseline.

The Hotspots Comparison browser shown in Figure 19 will open.

Figure 19. Hotspots Comparison browser
MyApplication showing 2.034x faster

This viewer compares the results of two performance runs. At the top of the viewer, it says that the application got approximately 2 times faster. That's a big result from such a small change!


Analyzing the performance data

Now let's try making a change to the application itself. But before we can do that, we need to look at the performance data and figure out what change we should make.

Hotspots browser

Double-click the Hotspot Detection 2 Activity to open the Hotspots Browser.

Figure 20. Hotspots browser
sqrt is the hottest function in the table

The left pane of the Hotspots Browser shows the Process Hierarchy Tree. This tree shows all of the processes and threads that were sampled during the performance run.

Filtering and finding functions

The Hotspots Browser has two features that make it easier to find what you are looking for:

  • You can create filters against the process tree. In fact, the My Application and Other Processes nodes are just predefined filters. Any custom filters that you create are listed under the My Filters node. To create your own filter, right-click the My Filters node, and select New Filter.
  • Also, if you are looking for a function with a specific name, you can filter the function table by typing part of the function name into the filter box above the table. The wildcard * is supported and is useful when dealing with complex C++ function names.
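To illustrate the kind of matching that a * wildcard implies, here is a minimal C++ sketch of a name filter. It is not the matcher used by the Hotspots Browser. The wildcardMatch and matchesFilter functions are hypothetical, and treating the typed text as a substring filter is an assumption based on the description above.

```cpp
#include <cstddef>
#include <string>

// Returns true if `text` matches `pattern`, where '*' in the pattern matches
// any run of characters (including none). Other characters match literally.
bool wildcardMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0;
    std::size_t t = 0;
    std::size_t starPos = std::string::npos;  // position of the last '*' seen
    std::size_t resumeAt = 0;                 // where in text to retry after it
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starPos = p++;
            resumeAt = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;
            ++t;
        } else if (starPos != std::string::npos) {
            // Mismatch: let the last '*' absorb one more character and retry.
            p = starPos + 1;
            t = ++resumeAt;
        } else {
            return false;
        }
    }
    // Only trailing '*' characters may remain in the pattern.
    while (p < pattern.size() && pattern[p] == '*') {
        ++p;
    }
    return p == pattern.size();
}

// Assumed filter-box behavior: the typed text may match any part of the name.
bool matchesFilter(const std::string& filter, const std::string& functionName) {
    return wildcardMatch("*" + filter + "*", functionName);
}
```

With this sketch, the filter text sphere*inter matches MyShape::sphere_find_intersection, which shows how a wildcard saves you from typing a long qualified C++ name in full.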

Processes that correspond to the application being profiled are isolated under the My Application node. All other processes running on the system at the same time as the application show up under the Other Processes node.

You can expand the My Application node to drill down and examine the processes, threads, and modules that make up the application. RayTracer is single-threaded, so only one thread is shown in this example. For multithreaded or multiprocess applications, each thread and process can be examined individually or as a group.

You can select a node in the Process Hierarchy Tree to see the functions that were sampled in that level of the hierarchy. By default, the functions are sorted by the amount of time that each function takes from the profile. The hottest functions are at the top of the list and make a good starting point for performance tuning.

By looking at this data, we can see that the sqrt function is taking up a significant percentage of the application's execution time. But sqrt is a library function, so we can't directly alter it. Besides, it's not our job to optimize library code. So instead, let's try to find out how sqrt is used by our application.

Invocations browser

  1. Right-click on the sqrt entry in the function hotspots table, and select Show callers/callees, which will open the Invocations Browser.
Figure 21. Invocations browser
Call graph with sqrt in the center

The Invocations Browser is focused on one function at a time, and it shows a graphical representation of all of the sampled call stacks that include that function.

The Invocations Browser is very flexible. You can zoom in and out, isolate specific call paths, and focus on different parts of the application. Here, we can see that the hottest function that calls sqrt is this one:

MyShape::sphere_find_intersection

Let's look at the code for that function.

  1. Right-click the sphere_find_intersection node, and select Open Source.
Figure 22. Open Source option
open source selected on the context menu

Performance Source Viewer

The Performance Source Viewer opens on the source code for this function:
sphere_find_intersection

The viewer displays your source code along with line-level performance data (see Figure 23). To the left of the code, you can see how much execution time each individual line of code takes from the total profile.

Figure 23. Performance source code viewer
Sphere.cpp open in the viewer

Next to the Performance Source Viewer is the standard Outline View. When the Performance Source Viewer is open, the Outline View shows a breakdown of the code blocks within each function in the file. You can use this to find hot blocks of code, such as hot loops.

Figure 24. The standard Eclipse outline view
sphere_find_intersection in the outline view
  1. Back in the Performance Source Viewer, click the first toolbar button to jump to the hottest line of code in the file.

It's no surprise that the hottest line contains a call to the sqrt function.

Figure 25. Go to the hottest line
Go to hottest line toolbar button

It looks like there's an opportunity for a simple optimization. The value returned by the sqrt function is not used in the first branch of the following if statement. Changing the code so that sqrt is called only in the else branch might have a positive effect on performance, so let's try making that change.

The code cannot be directly edited in the Performance Source Viewer because the line-level performance data would not line up correctly after a change.

  1. Click the Switch to Editor toolbar button.
Figure 26. The remote C/C++ editor
editing the code to optimize it
  2. Change the code so that sqrt is called only in the else branch, and then rebuild the application.
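The RayTracer source is not reproduced in this article, so the following C++ sketch only illustrates the shape of the change. The function names, parameters, and the simplified quadratic (t*t + b*t + c = 0, with -1.0 meaning "no intersection") are assumptions made for this example and are not taken from Sphere.cpp.

```cpp
#include <cmath>

// Before: sqrt is evaluated for every ray, including rays that miss the
// sphere, even though the result is only needed when there is a hit.
double nearestHitBefore(double b, double c) {
    double discriminant = b * b - 4.0 * c;
    double root = std::sqrt(discriminant);  // wasted work on a miss
    if (discriminant < 0.0) {
        return -1.0;  // miss: root is never used in this branch
    } else {
        return (-b - root) / 2.0;  // smaller solution of the quadratic
    }
}

// After: sqrt is called only in the branch that uses its result.
double nearestHitAfter(double b, double c) {
    double discriminant = b * b - 4.0 * c;
    if (discriminant < 0.0) {
        return -1.0;  // miss: no sqrt call at all
    } else {
        double root = std::sqrt(discriminant);
        return (-b - root) / 2.0;
    }
}
```

Both versions return the same results. The second one simply skips the sqrt call for rays that miss, so the benefit depends on how many rays fall into that branch.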

Comparing source changes

  1. Create another Hotspot Detection activity, name it Hotspot Detection 3, and run it.
  2. When it is complete, compare it to the previous Activity by right-clicking it and selecting Compare with Previous.
Figure 27. Select Compare with previous
Compare with previous menu item selected

As the screen capture in Figure 28 shows, the code change has had a positive impact on performance: the application is now 2.353 times faster than the previous run.

Figure 28. Comparison browser
My Application is 2.353x faster

The function impact table shows that sqrt has had a significant impact on the change. But there are three functions above sqrt that have even bigger impacts. By looking at the Invocations Browser (Figure 29), we can see that these functions are all called by sqrt, so by reducing the calls to sqrt, we also reduce the calls to those functions.

Figure 29. Invocations browser
sqrt calls 3 other functions in the graph

Now let's compare Hotspot Detection 3 to the baseline.

Figure 30. Comparison browser
My Application is now 4.785x faster

The application is 4.785 times faster as a combined result of the two changes. Not bad!
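The combined figure is what you would expect from multiplying the speedups of the two steps. The C++ sketch below shows that arithmetic. It assumes that a speedup is the baseline measurement divided by the new measurement, which is the usual definition; the article does not state the exact formula that the comparison browser uses, and the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Speedup of a run relative to a baseline, computed as baseline / current.
// Values above 1.0 mean the current run is faster. Both inputs must use the
// same unit (for example, seconds or ticks).
double speedup(double baseline, double current) {
    assert(baseline > 0.0 && current > 0.0);
    return baseline / current;
}

// Speedups of successive steps compound by multiplication.
double combinedSpeedup(const std::vector<double>& stepSpeedups) {
    double total = 1.0;
    for (double s : stepSpeedups) {
        total *= s;
    }
    return total;
}
```

Multiplying the two reported values, 2.034 and 2.353, gives approximately 4.786. That is very close to the 4.785 shown in Figure 30; the small difference is presumably due to rounding in the displayed values.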

Automatic source tracking

We can go one step further and compare the line-level performance for sphere_find_intersection before and after the change.

  1. Open the Performance Source Viewer on sphere_find_intersection in Hotspot Detection 2.

Looking at the viewer, you can see the original code from before the change (along with the original performance data).

  1. Now open the viewer on the same file in Hotspot Detection 3.
  2. Dock the two viewers next to each other to see both at the same time.
Figure 31. Comparing the source code
Two instances of Sphere.cpp docked together

At this point, you might be thinking, "But I edited that file. How can I see the code from before the change was made?"

Performance Advisor includes a feature called Automatic Source Tracking. Every time you execute a performance run, the states of all of the source files in the project are saved as a snapshot. If you go back to view performance data from previous runs, you will see exactly what the code looked like at that time. This enables you to view detailed comparisons of line-level performance data between any two performance runs. This feature is completely transparent and works automatically. It does not interfere with any version control system that you might already be using, such as IBM® Rational Team Concert™.

Function inlining

  1. Let's go back a bit and take another look at the comparison between Hotspot Detection 1 and Hotspot Detection 2.
Figure 32. Comparison browser
compare the first two activities again

Several of the top functions have no speedup information. In fact, they say "Not detected in Activity Hotspot Detection 2." What's going on here?

The only difference between the two runs is that we turned on compiler optimization in the second run. One of the main optimizations performed by the compiler is function inlining.

  1. Open the Performance Source Viewer on the RayTrace.cpp file in Hotspot Detection 1.

The viewer shows nothing terribly interesting. There is some line-level data, but all it really tells us is that there are no hot lines of code in this file.

Figure 33. RayTrace.cpp from Hotspot Detection 1
RayTrace.cpp with no interesting information
  1. Now open the same file in Hotspot Detection 2.
Figure 34. RayTrace.cpp from Hotspot Detection 2
RayTrace.cpp with inlining information

There's a lot more information in the viewer for the second run. The little arrow icons on the left side of the viewer indicate lines of code that contain function calls that were inlined by the compiler.

  1. Hover the mouse cursor over an arrow to get a pop-up window that shows which functions were inlined.

Clicking an arrow icon expands the viewer to show the source of the inlined function directly in line with the rest of the code.

Figure 35. Exploring function inlining
Expand MyPoint inline

The reason that several of the top functions in the Hotspots Comparison browser were not detected in the second run is that the compiler fully inlined those functions into their call sites. This had a large positive impact on the application's performance. You can explore the result in detail by using the Performance Source Viewer's support for function inlining.
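As a reminder of what a typical inlining candidate looks like, consider the following C++ sketch. The Point type and both functions are hypothetical and are not taken from RayTracer. Whether a call is actually inlined is the compiler's decision and depends on the optimization level and other options.

```cpp
// A small value type, similar in spirit to the point and vector classes
// that a ray tracer uses heavily.
struct Point {
    double x;
    double y;
    double z;
};

// A short function like this is a typical inlining candidate: the call
// overhead is large compared to the work that the function does.
inline double dot(const Point& a, const Point& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// When the optimizer inlines dot() here, no separate dot() call remains at
// run time, so samples are attributed to lengthSquared() instead.
double lengthSquared(const Point& p) {
    return dot(p, p);
}
```

If dot is fully inlined everywhere, it no longer appears as a function of its own in the profile, which is the same effect that caused some functions to show as not detected in Hotspot Detection 2.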


Summary

Performance Advisor provides a rich set of tools for performance tuning C/C++ applications on AIX and PowerLinux. This tutorial provided a quick overview of the most basic features.

Here is a list of some of the more advanced features not covered in this article:

  • The Recommendations view is always available. Recommendations are generated for every Activity and are an excellent way to get guidance on where to look next for performance opportunities.
  • Complex tuning scenarios are supported. You can build and performance test your application across many AIX and PowerLinux servers, and then compare the results.
  • You do not have to be sitting in front of your computer to run a performance test. Performance Advisor allows you to schedule performance runs for when you are offline. For example, you can schedule a performance run for overnight and analyze the data the next day.
  • Performance Advisor comes with a set of shell scripts that can be used to execute a performance run on servers that do not have Rational Developer for Power Systems Software installed or on servers that you do not have direct access to. Simply give the scripts to your clients to run, and then import the resulting data for analysis.
  • Performance data can be shared between team members by using the import and export capabilities.

Download

Description          Name            Size
Sample application   RayTracer.zip   137KB

