I've been quiet for a bit as I tried to figure out whether I should talk about low level application profiling tools like tprof, oprofile, and procstack, or talk a bit about higher level performance analysis tools.
I decided I would jump right into a higher level tool that's part of the Rational Developer for Power product, called the Rational Performance Analyzer. Full disclosure - this is my day job - I work on Rational Developer for Power, so I'm a bit biased about the product...
Still, let me spend a little time talking about tprof and procstack before jumping into Rational Developer for Power (RDP)...
I call tprof a 'data gathering tool' - it gathers profile data at a fairly low level and writes it to a text report. It does a great job of hardware sampling of the entire operating system, mapping ticks from raw addresses to shared library/executable offsets using the equivalent of the procmap command against the running processes. If you really know what you are doing, you can use the text reports from tprof to solve pretty much all your performance problems. But - it is very time consuming, and you end up writing quite a few scripts (or full-fledged programs) to aggregate the sample information in useful ways.
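If you want to try it yourself, a typical run looks something like this - a sketch only, since the available flags and the report naming vary by AIX level, so check the man page on your system:

```
# Profile the whole system for the duration of the sleep (60 seconds).
# Extra flags such as -s (shared libraries), -k (kernel) and -u (user
# mode) add detail - see your AIX level's tprof man page.
tprof -x sleep 60
# The text report lands in a .prof file named after the command run
# under -x (sleep.prof for the run above).
```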
tprof can get you information on what is hot, but you also need to know why it is hot. There are two fundamental reasons a function can be hot. One: the function takes a long time to compute a result. The other: the function is being called more often than it should be. tprof can help you tell the two apart. If the routine is called relatively frequently, the prolog/epilog code will be hotter (in relative terms) than the body of the function. If the prolog/epilog code has about the same heat as the body of the function, the routine is called relatively infrequently. For those who aren't familiar with the terms, 'prolog' and 'epilog' refer to the code a compiler generates for the entry point(s) and the return point(s) of a function, respectively.
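To make the two cases concrete, here's a contrived C++ sketch (the names and numbers are mine, not from any real workload): hot_body is hot because a single call does a lot of work, while tiny is hot because it is called a huge number of times - and in the second case a big share of its ticks sit in the prolog/epilog.

```cpp
#include <cstdint>

// Case 1: hot because one call does a lot of work -
// the ticks pile up in the loop body, not the prolog/epilog.
std::uint64_t hot_body(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        sum += i * i;
    return sum;
}

// Case 2: hot because it is called extremely often - each call is
// trivial, so the entry/return code is a big fraction of its time.
// (Build at -O0 or mark it noinline, or the compiler will inline it
// away and the effect disappears.)
std::uint64_t tiny(std::uint64_t x) { return x + 1; }

int main() {
    volatile std::uint64_t sink = hot_body(100000000ULL);
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < 100000000ULL; ++i)
        acc = tiny(acc);
    return static_cast<int>((sink + acc) & 1);
}
```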
If you know a function is hot because it is being called relatively frequently, you can either rebuild the code with profiling enabled (e.g. -pg) and run a tool like gprof to get call graph information, or you can run procstack in a loop (perhaps a few times a second) to get an idea of which call stacks are hot. While procstack is harder to use because you have to write some scripts around it, it has the advantage of not requiring your code to be rebuilt, and it does not perturb your application's execution (too much) because it samples the application much like tprof does.
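A minimal version of that loop might look like this (a sketch - procstack takes a PID on AIX, and whether fractional sleeps work depends on your shell and sleep command):

```
#!/bin/sh
# Append the target process's call stacks to a file a few times a
# second; stacks that appear frequently are the hot paths.
PID=$1
i=0
while [ $i -lt 100 ]; do
    procstack $PID >> stacks.txt
    sleep 0.25      # fractional sleep is not portable everywhere
    i=$((i + 1))
done
# stacks.txt then gets post-processed by your aggregation scripts.
```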
Rational Developer for Power tries to hit the sweet spot: it provides a powerful sampling-based profiler that uses tprof and procstack to gather data, but takes care of all the aggregation of the data gathering tools' output and the compiler output for you.
Here's a snapshot of the Performance Analyzer UI after running tprof/procstack against an application - the first view into the application is more or less what you would expect:
Performance Explorer View: The Performance Explorer view on the left lets you organize your performance runs. You set up performance sessions (each one framed around a particular problem you're solving), and then run multiple performance activities within the session, changing different things, experimenting, and saving your results with names that make sense to you.
Hot Spots View: The hot spots view in the top right breaks the system's processes into two chunks - the processes associated with your application and all the other processes - and by default focuses on your application's processes. The functions in the processes are sorted hottest-first, so you can easily see where time is being spent in your application.
Recommendations View: The recommendations view on the bottom is a nice feature. The Performance Analyzer analyzes your application (as its name would suggest) and provides recommendations on things you might want to consider up-front to improve performance. It has a fuzzy logic engine built in that it uses to attach a confidence level to its recommendations too. In this case, two of the files were not built with optimization, and it has pretty high confidence that rebuilding those files with optimization will improve performance. The recommendations provide XLC compiler specific details too - this example says the compiler option -O2 should be used.
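Acting on that recommendation is just a rebuild of the flagged files - something like the following (the file names here are made up for illustration; xlC is the XL C++ compiler driver on AIX):

```
# Recompile the two un-optimized files at -O2, keeping -g so the
# profiler can still map ticks back to source lines.
xlC -O2 -g -c way2obj.C
xlC -O2 -g -c otherfile.C   # hypothetical second file
```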
The views above are important bread and butter for performance analysis, but other than the recommendations view, they are a clean roll-up of information that tprof provides. If you click on one of the hot methods though, you can see some pretty cool stuff. Click on the 'way2obj::addtobound' C++ method, and you get an enhanced source view:
The view looks like an editor, but it's a read-only view of the source that was captured as part of profiling. This is important - the Performance Analyzer keeps track of the source code associated with a particular run so that you can be sure the source you are looking at matches the performance data gathered. Delta changes to source code are saved to keep the size of the metadata reasonable. As will be shown in a bit, it also means you can compare two different runs, with two different captured sources, against each other and not have to worry about hot line numbers getting out of sync with the source.
Note the hot spots down the left margin - they indicate how much time is spent on the different lines. Also note that some lines are grey and others are in yellow italics. Yellow italics mean we are taking a best guess at the amount of time on the line. That's a consequence of compiler optimization and code motion, where the relationship between a source line and the generated code can get fuzzy (perhaps I'll cover this in a separate blog post...)
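As a quick illustration of why the attribution gets fuzzy, consider loop-invariant code motion - one of many such transformations, and the example here is my own:

```cpp
// As written, the a/b divide sits "on" the line inside the loop.
double scaled_sum(const double *v, int n, double a, double b) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += v[i] * (a / b);  // a/b is loop-invariant: the optimizer
                                // can hoist the divide above the loop,
                                // so its ticks no longer land on this
                                // source line
    return sum;
}
```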
One really important part of performance analysis is understanding what the optimizer has done to the code, and one major factor is inlining. (All good optimizing compilers can 'inline' a method call into the caller, eliminating the overhead of the call and, often more importantly, optimizing the inlined code more aggressively because the compiler knows more about the parameters and return values.) The Performance Analyzer uses inline reports from the XLC compiler to determine what was inlined where, and then marries that with the tprof hits and line number data to provide detailed performance information on the inlined code. You can see a blue circled arrow before lines with method or function calls where the XLC optimizer has inlined the code. By default, the Performance Analyzer rolls all the time for the inlined call up to that line, but you can click on the blue arrow to expand the inline call (which itself may have additional inline calls that can be expanded) to get more detailed time on the hot spot. This is really handy: with most profilers, you would need to resort to reading assembler to figure out where the hot inlined code came from, but the Rational Performance Analyzer figures it all out for you.
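Here's a tiny sketch of the kind of thing the blue arrow marks (my own example, not from the product): once the optimizer inlines squared into sum_of_squares, the call instruction is gone and the samples physically land in the caller's loop - the compiler's inline report is what lets the tool split that time back out.

```cpp
// At -O2, an optimizing compiler will typically inline squared()
// into sum_of_squares(): the call disappears, and ticks that logically
// belong to squared() show up at the call site in the caller.
static inline double squared(double x) { return x * x; }

double sum_of_squares(const double *v, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += squared(v[i]);  // the expandable "inlined call" line
    return sum;
}
```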
So - with being able to save activities as you experiment, one really nice feature is comparing two runs. A pretty common scenario: our application is running 10% slower this week than last week - figure out who 'broke' the code... This is pretty simple with the Performance Analyzer. Do one run with the old code and one run with the new code, saving each performance run to its own activity. Select them both, right-click, compare, and you get the following:
Performance Analyzer does all the annoying math for you to figure out which routines got faster, which got slower, and which ones have appeared (or gone away). And - it sorts the results by how much those routines affect the overall application, not just by whether they got faster or slower. In the preceding example, the biggest contributor to speeding up the program was getregfillnum, which ran 1.773x faster. Note that defineneighbourhood1 actually sped up a lot more - 13x - but it contributed less to the overall application speed-up. This makes it really easy to track down where regressions (or - if you are lucky enough - performance improvements) came from. And this is why saving the source for the different runs is so important: you can click into these routines, see the right version of the source for each of these runs, and then compare the source code to see what's changed.
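To see why the sort order matters, plug in some hypothetical numbers (the real ones come from the comparison view): if getregfillnum accounted for 20% of the total runtime, a 1.773x speed-up shrinks that to about 11.3%, saving roughly 8.7% of the whole run. If defineneighbourhood1 was only 2% of the runtime, even a 13x speed-up saves at most about 1.8%. The smaller ratio wins because it applies to a much bigger slice of the program.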
Finally, I'll briefly touch on the Invocations View (this post is already too long, but it's worth a quick look). Remember I mentioned at the start that the Performance Analyzer gathers data from both tprof and procstack. So far, we've only talked about the tprof data. We use the procstack data to get a sampling of call stacks, which we provide a graphical view of:
The key thing to remember is that the invocations view is sample-based. So - it doesn't necessarily show all possible paths to a particular routine. What it shows is the paths that were sampled as we ran procstack against the application. The Performance Analyzer can get a good idea of the call paths that matter using this technique, and assign time to the different paths, but it can't provide call counts (i.e. how many times a routine was called) because the code is not instrumented in any way - it is just being sampled.
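As a hypothetical example: if 100 stack snapshots show main -> parse -> alloc 75 times and main -> render -> alloc 25 times, the view can attribute roughly 75% of alloc's time to the parse path - but it has no way of knowing whether parse called alloc once or a thousand times between snapshots.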