Rich-client application performance, Part 1: Tools, techniques, and tips for analyzing performance

Significant performance issues are likely to arise even in well-planned applications. In this two-part article, Chris Grindstaff offers techniques for analyzing and addressing performance problems. In this first installment, you'll learn how to measure the performance of Eclipse-based Rich Client Platform (RCP) applications, determine if slowdowns are caused by CPU or I/O bottlenecks, and keep the UI thread idle to maintain responsiveness. Part 2 addresses memory problems.

Chris Grindstaff (chris@gstaff.org), Software Engineer, IBM, Software Group

Chris Grindstaff is a software engineer at IBM in Research Triangle Park, North Carolina. Chris wrote his first program at age 7, when he convinced his grade school teacher that "typing" sentences would be just as onerous a punishment as writing them by hand. Chris is currently interested in a variety of open source projects. He has worked extensively with Eclipse and authored several popular Eclipse plug-ins, which can be found on his Web site.



31 July 2007


Even with good up-front planning, any application is likely to have some significant performance issues. This two-part article offers some techniques to help you analyze those problems, with a focus on Eclipse-based rich client applications running on Windows. Here in Part 1, I show how to measure the performance of your Eclipse-based RCP application, determine if slowdowns are caused by CPU or I/O bottlenecks, and keep the UI thread idle to maintain responsiveness. I also give you some tips on avoiding threading mistakes and improving application startup performance. Part 2 discusses some approaches to tracking down memory problems. Many of these techniques have applicability beyond Eclipse applications.

Key concepts

Whenever you're investigating a performance problem, the first order of business is to determine whether the task in question is predominantly CPU-bound or I/O-bound.

CPU-bound means that the CPU is the bottleneck to getting the job done, so a faster CPU completes the action faster. For example, if your 100MHz CPU takes 50 seconds to sort a list of 100,000 e-mails, you can expect it to take roughly 5 seconds on a 1GHz CPU, all else being equal.

But a faster CPU doesn't always make a task run faster. I/O-bound tasks are those for which I/O completion is the bottleneck. Good examples are reading a large file from disk or downloading a file from a Web site. In general, the CPU speed is irrelevant for I/O-bound tasks because the I/O subsystem handles file reading. Typically, the source device can't maintain a high enough transfer rate to keep the CPU busy. The CPU has nothing to do while it waits for the data, so it sleeps.

No guessing

Don't bother guessing why your application is slow. You'll probably be wrong. Profile, don't guess.

So several factors can cause slowdowns: the CPU is busy, the application is doing too much I/O, the application is waiting for I/O to complete, or some combination of these. It rarely makes sense to guess the cause because tools do a better job. The next section shows some tools that help determine whether a task is CPU-bound or I/O-bound.
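One rough way to tell the two cases apart from inside the JVM is to compare a task's wall-clock time with the CPU time its thread actually consumed. The sketch below is an illustration only: the class name CpuOrIoCheck and its helper methods are made up for this example, and it assumes the JVM supports per-thread CPU timing through java.lang.management.ThreadMXBean. A fraction near 1 suggests CPU-bound work; a fraction near 0 suggests the thread was mostly waiting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuOrIoCheck {

    /**
     * Runs the task on the current thread and returns the fraction of the
     * elapsed wall-clock time that the thread spent executing on a CPU.
     */
    public static double cpuFraction(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("per-thread CPU time not available");
        }
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();
        task.run();
        long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
        long wallNanos = Math.max(1, System.nanoTime() - wallStart);
        return (double) cpuNanos / wallNanos;
    }

    /** Stand-in for CPU-bound work: spins for about 200 ms. */
    public static void spin() {
        long deadline = System.nanoTime() + 200_000_000L;
        while (System.nanoTime() < deadline) {
            // busy loop, keeps the CPU occupied
        }
    }

    /** Stand-in for a task that waits on I/O: the thread sleeps for 200 ms. */
    public static void pause() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.printf("spin:  %.2f of wall time on CPU%n", cpuFraction(CpuOrIoCheck::spin));
        System.out.printf("pause: %.2f of wall time on CPU%n", cpuFraction(CpuOrIoCheck::pause));
    }
}
```

This only measures the calling thread, so it is a quick sanity check rather than a replacement for the external tools described next.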

Monitoring tools for Windows

The cost of application monitoring

Developers often ask if profiling an application changes its behavior. This is a good start, but they should be asking to what degree profiling affects an application. In general, the more intrusive the technique, the more significant the impact. Instrumenting is more expensive than sampling, which in turn is more expensive than a few well-placed log messages.

For example, most profilers introduce enough overhead that a method call's absolute times are no longer valid, only the relative times. The overhead of profiling also varies depending on whether you are sampling the call stacks or instrumenting each method call. Instrumenting causes a small, fast method that is called frequently to show up as a hot spot because the instrumentation overhead adds up as the method is called repeatedly. On the other hand, stack sampling would not cause it to appear as a hot spot because it spends little time on the call stack.

Depending on the sort of analysis you're doing, you may not care how much overhead profiling imposes on the application. For example, you might be willing to pay a higher price if you can accurately capture how many times a method is called, which can be crucial to understanding the application's order analysis.

On the Windows® operating system, I typically use a combination of Performance Monitor (Perfmon) and Sysinternals Process Explorer (see Resources). Process Explorer completely replaces the built-in Task Manager, so I rarely use Task Manager. As with all monitoring tools, you need to be careful (see The cost of application monitoring). If you tell Process Explorer to refresh its views too frequently, you place more load on the machine.

I base my choice of tool on whether I'm doing ad-hoc performance analysis or longer runs. If I want to monitor an application for a longer time period or correlate performance data with application logs, I typically turn to Perfmon. It's easy to set up Perfmon to log to a comma-separated value (CSV) file with time stamps so you can correlate with other logs.

To open Perfmon:

  1. Click Start > Run > perfmon.
  2. Click on Perfmon's + button to launch the Add Counters dialog (see Figure 1).
  3. Click on the Explain button for a detailed explanation of each counter.
Figure 1. Perfmon Add Counters dialog

I typically add the following counters with Process selected as the Performance object:

  • % User Time: Amount of work the process is doing.
  • Handle count: Number of handles opened by the process. Among other things, handles represent the number of files or sockets an application has open.
  • IO Data Bytes/sec: Amount of disk, network, or device I/O the process is doing.
  • Private Bytes: Amount of memory associated with the process that cannot be shared -- a rough indicator of your application's size. (This value corresponds to VM Size in Task Manager.)
  • Thread Count: Number of threads associated with the process.

On the other hand, if I want to observe a reproducible problem in "real time," I use Sysinternals Process Explorer. Process Explorer has the advantage that it can focus on a single process instead of the entire machine. When you're looking at a specific problem, it's often desirable to look only at the application in question.

Double-click within Process Explorer on the application you want to monitor to open the Properties dialog for that process (see Figure 2):

Figure 2. Javaw process properties

The javaw process in Figure 2 is from JEdit. In this example, I opened a 14MB text file from disk. Here's what you can tell from the three graphs in Figure 2, starting at the bottom and working upward:

  • The large spike in the I/O Bytes History graph represents the disk I/O required to read the 14MB file. If you hover on the line, it shows that 14MB was read.
  • The Private Bytes jump by 33MB. A 14MB text file will become 28MB on the Java™ heap, primarily because of the Java language's use of 16-bit Unicode characters. The remaining 5MB is needed by Swing and JEdit objects to manage the editing.
  • The large spike in the CPU Usage History shows the processing that occurs after the file has been read into memory. In this case, JEdit is updating the display, syntax-highlighting the file, and so on.
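The memory arithmetic above can be checked with a few lines of code. This is a minimal sketch that assumes the file uses a one-byte-per-character encoding and that the editor keeps the text as 16-bit Java chars; the class and method names are hypothetical. (Newer JVMs may store some strings more compactly, so treat the result as an upper-bound estimate.)

```java
public class HeapSizeEstimate {

    /** Bytes needed to hold the given number of one-byte characters as 16-bit Java chars. */
    public static long charBytesOnHeap(long fileBytes) {
        return fileBytes * Character.BYTES; // Character.BYTES is 2
    }

    public static void main(String[] args) {
        long fileMb = 14;                 // size of the text file on disk
        long observedJumpMb = 33;         // jump in Private Bytes seen in Process Explorer
        long textMb = charBytesOnHeap(fileMb);
        System.out.println("text on heap: " + textMb + "MB");
        System.out.println("left for Swing and JEdit objects: " + (observedJumpMb - textMb) + "MB");
    }
}
```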

If the slow operation is I/O bound, you need to determine what part of the application is responsible for the I/O. If the slow operation is CPU-bound, then it's time to pull out your profiler.


Setting up a profiler

Setting up a profiler for RCP applications differs from profiling many other types of applications because RCP applications are often launched with an executable or shell script instead of launching the Java runtime directly. The issue is further complicated because the RCP launcher creates the command-line arguments for the Java processor and launches it. This extra level of indirection can get in the way when you're trying to profile or otherwise precisely control JVM invocation arguments. Instead of relying on the application launcher to start the Java runtime, I often extract the Java command line and start it directly. Here's one way to do that:

  1. Start the application normally.
  2. After it's launched, start Process Explorer and find the javaw or Java process. Open the process's properties and copy the command-line arguments from the details (see Figure 3).
  3. Paste this into a batch file and modify it for your needs. (This way, you create a core batch file and several variants that add or remove VM arguments, classpath entries, and so on.)
Figure 3. Command-line arguments for Java process
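If you can run a small piece of code inside the application, the JVM can also report the arguments it was started with. This is a complementary sketch, not the Process Explorer approach described above; the class name JvmArgsDump is hypothetical, and it relies on the standard java.lang.management.RuntimeMXBean API.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmArgsDump {

    /** Returns the arguments passed to the JVM itself (for example -Xmx), not those passed to main(). */
    public static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        for (String arg : jvmArguments()) {
            System.out.println(arg);
        }
        // The classpath is exposed separately as a system property.
        System.out.println("-classpath " + System.getProperty("java.class.path"));
    }
}
```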

Identifying long-running actions in the UI thread

Most modern operating systems have a single UI thread. Likewise, the Standard Widget Toolkit (SWT) has a single UI thread. You must be careful not to perform long-running operations such as excessive disk I/O, network calls, or plain old lots of work from this single thread.

To see why, imagine you have a button in your application. You want to do some work when the button is clicked, so you add an event handler to it. When a user clicks on the button, the OS calls into the GUI toolkit, which in turn calls your event handler. The code for your event handler is now running on the UI thread and, as long as your code is running, the UI thread can't respond to other UI events. This means the UI appears to freeze, which is disconcerting to the user. The point here is that if your code is running on the UI thread, the application can't handle UI events from the OS. If your application has a button to cancel a long-running operation but you're doing work in the UI thread, the cancellation event won't be delivered until after the operation has finished! (If your code runs for too long on the UI thread, the OS steps in, notifies the user, and offers the option to terminate the application.)
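The effect is easy to reproduce without any GUI toolkit. In the sketch below, a single-threaded executor stands in for the UI thread's event loop (an assumption made for illustration; a real toolkit's loop is more involved), and the class name SingleUiThreadDemo is hypothetical. The cancel event is queued right after the slow handler starts, but it cannot be handled until the handler returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleUiThreadDemo {

    /**
     * Queues a slow button handler and then a cancel event on one "UI" thread
     * and returns the order in which the two were actually handled.
     */
    public static List<String> handle(long handlerMillis) {
        List<String> handled = new CopyOnWriteArrayList<>();
        // A single-threaded executor stands in for the toolkit's event loop.
        ExecutorService uiThread = Executors.newSingleThreadExecutor();
        uiThread.execute(() -> {
            sleep(handlerMillis); // long-running work done inside the event handler
            handled.add("button handler done");
        });
        uiThread.execute(() -> handled.add("cancel event delivered"));
        uiThread.shutdown();
        try {
            uiThread.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(handled);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // prints [button handler done, cancel event delivered]
        System.out.println(handle(500));
    }
}
```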

That's why slow and unpredictable I/O is a problem on the UI thread. Each type of I/O can have wildly different characteristics. Disk I/O tends to follow a linear model: time = latency + amount of data / transfer rate. Network I/O, on the other hand, is much less regular. It's both slower than disk I/O and much less reliable because it's affected by (potentially transient) network congestion between endpoints.
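As a worked example of the linear disk model (time = latency + amount of data / transfer rate), the sketch below plugs in some numbers. The latency and transfer rate are assumptions chosen for illustration, not measurements, and the class name is hypothetical.

```java
public class DiskReadEstimate {

    /** Estimated read time in milliseconds: fixed latency plus data size divided by transfer rate. */
    public static double estimateMillis(double latencyMillis, double megabytes, double megabytesPerSecond) {
        return latencyMillis + megabytes * 1000.0 / megabytesPerSecond;
    }

    public static void main(String[] args) {
        // Assumed figures: 10 ms latency, 14MB file, 50MB/s sustained transfer rate.
        double millis = estimateMillis(10.0, 14.0, 50.0);
        System.out.printf("estimated read time: %.0f ms%n", millis); // prints 290 ms
    }
}
```

No comparable formula works well for network I/O, because latency and throughput vary from one call to the next.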

The impact of network I/O on the UI thread is easy to miss during development because you're likely to be on a fast network with low latency. In such an environment, it's easy to do network calls on the UI thread inadvertently and not find out until customers running over a slower or less reliable network notice that the UI freezes every time they go to the network. Couple that with bad socket timeouts, and it's not uncommon to see the "white screen of death" Windows puts up when your application doesn't respond to UI events for at least five seconds.

Table 1 shows some techniques for finding long-running operations in a UI thread, along with the advantages and disadvantages:

Table 1. Techniques for finding long-running operations in a UI thread
Technique | Pros | Cons
Use a profiler | Not much trouble to set up if you already have one. | Usually cost money. Run-time overhead may be too high.
Instrument JDK | Once you set it up, it works with your application until you upgrade the JDK. Very low runtime overhead. | Not easy to share with others.
Instrument your code | Scales to many users because customers, QA, developers, and others can run with this instrumentation enabled. | May require you to rearchitect your application to find all the places you make network calls. Must be diligent in not adding new methods that aren't instrumented. Need to post-process logs. Log files can grow large.

Instrumenting techniques

You can use a number of techniques to gain insight into what your application is doing. This section explains a few.

Using aspects

You can use aspect-oriented techniques to "weave" your changes into the classes you're instrumenting. For example, it's straightforward to weave code into SocketInputStream and SocketOutputStream that checks if the streams are being accessed on the UI thread (see Resources for links to more information on aspect-oriented techniques and tools).

Swing vs. SWT

Swing and SWT differ in how you identify the UI thread. In SWT, the UI thread is the thread that created the Display; in a typical RCP application, that is the thread named main. In Swing, you ask the event queue if the current thread is the dispatch thread with java.awt.EventQueue.isDispatchThread(). The remaining examples use the SWT method; if you're using Swing, substitute appropriately.

Using breakpoints

If you can run your application under a debugger, it's sometimes easier to instrument the JDK by employing conditional breakpoints. I once worked on a large application that was making network calls on the UI thread. The application's structure (lots of third-party code) made it difficult to isolate who was responsible for the network calls. Placing a conditional breakpoint in Eclipse on the SocketInputStream class, like the one in Figure 4, made it easy to determine the offender:

Figure 4. Conditional breakpoint

Using a security manager

Another option I've used successfully is to replace the application's security manager with an instrumented security manager. Lots of interesting calls pass through the security manager. For example, the security manager in Listing 1 logs a message whenever code running on the UI thread triggers a socket permission check (for example, by opening a socket):

Listing 1. Log an error when opening a socket on the UI thread
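import java.security.Permission;
import java.util.logging.Level;

// logger is assumed to be an existing java.util.logging.Logger.
// Note: SecurityManager is deprecated for removal since Java 17.
SecurityManager securityManager = new SecurityManager() {
    // Not calling super grants every permission: this manager
    // only observes calls, it does not enforce a policy.
    public void checkPermission(Permission perm) {
        if(perm instanceof java.net.SocketPermission) {
            // Assumes the UI thread is the one named "main".
            if(Thread.currentThread().getName().equals("main")) {
                logger.log(Level.SEVERE, "Network call on UI thread");
                new Error().printStackTrace(); // shows who made the call
            }
        }
    }
};
System.setSecurityManager(securityManager);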

Instrumenting your code

If your application is well layered so that your network calls pass through a single (or a handful) of places, you can have the application code check the current thread before making network calls, as shown in Listing 2. I leave this type of code active in production builds because the thread check is cheap. Creating and logging the exception incurs some overhead, but the stack trace is invaluable for tracking down who caused the problem.

Listing 2. Log an error when doing network calls on UI thread
if(Thread.currentThread().getName().equals("main")) {
    logger.log(Level.SEVERE, "Network call on UI thread");
    new Error().printStackTrace();
}
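In a well-layered application, the check from Listing 2 can live in one small helper that every network-facing method calls. The following is a sketch under the same assumption the listings make (the UI thread is named "main"); the class UiThreadGuard and its methods are hypothetical names, and the stack trace is attached to the log record instead of being printed separately.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class UiThreadGuard {

    private static final Logger LOGGER = Logger.getLogger(UiThreadGuard.class.getName());
    private static final AtomicInteger VIOLATIONS = new AtomicInteger();

    /** Logs a warning with a stack trace if called on the UI thread; returns true if it logged. */
    public static boolean warnIfOnUiThread(String operation) {
        // Cheap check: compares the current thread's name with the assumed UI thread name.
        if (!"main".equals(Thread.currentThread().getName())) {
            return false;
        }
        VIOLATIONS.incrementAndGet();
        // The Error is created only to capture the caller's stack trace.
        LOGGER.log(Level.SEVERE, "Network call on UI thread: " + operation, new Error("call site"));
        return true;
    }

    /** Number of violations seen so far, handy for tests or periodic reporting. */
    public static int violations() {
        return VIOLATIONS.get();
    }

    /** Demo helper: runs the check on a new thread with the given name and returns its result. */
    public static boolean checkOnThreadNamed(String threadName) {
        boolean[] flagged = new boolean[1];
        Thread thread = new Thread(() -> flagged[0] = warnIfOnUiThread("demo fetch"), threadName);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return flagged[0];
    }

    public static void main(String[] args) {
        System.out.println("worker flagged: " + checkOnThreadNamed("Worker-1"));
        System.out.println("main flagged:   " + checkOnThreadNamed("main"));
    }
}
```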

Modifying the JDK classes

As a last resort, you can instrument the JDK by modifying its classes. Such a move is unsupported, tricky, and hacky -- and may violate the license -- but for those rare situations when one of the aforementioned techniques can't help, it's a valuable option. The gist of this technique is to recompile the classes in the JDK, and then use -Xbootclasspath/p: to prepend the JAR or directory to your boot classpath.


Avoiding long-running actions in the UI thread

Here are some techniques for avoiding long-running actions in the UI thread, using a common example: a table or tree that is filled from some sort of database query, network call, or disk.

Good

Do not assume you can populate the table on the UI thread. That works fine with hundreds of items but not thousands.

Better

Do not assume the table or tree must be completely filled before showing the user the initial results. For example, if you're writing an e-mail client, you don't want to load all the mail messages for all folders and populate the table before showing users a "page" full of mail messages.

Better yet

Take advantage of SWT/JFace virtual widgets. You can use several different techniques, but all boil down to "defer work as long as possible." In the UI thread, populate your tree or table with placeholder values and, in a background Job, retrieve the real values and update the tree as you fetch them.

Best

Be careful how much work you do in your event handlers. In particular, watch out for your SWT selection handlers, especially those tied to tables, trees, and lists. I see a lot of code that gets this wrong. For example, Listing 3 shows a selection listener from a mail application; each time a message is selected, a database query is executed to read and update the UI with the mail details:

Listing 3. Selection listener that responds to each selection change
viewer.addSelectionChangedListener(new ISelectionChangedListener() {
    public void selectionChanged(SelectionChangedEvent event) {
        new Job("go to db") {
            protected IStatus run(IProgressMonitor monitor) {
                //do expensive work here
                return Status.OK_STATUS; 
            }
        }.schedule();
    }
});

Listing 3's developer, knowing this approach to be expensive, does the work in a background Job. The problem is the developer didn't anticipate that a user would select the first message in the inbox and then hold down the keyboard Down arrow for several seconds. Typically in a situation like this, each selection change causes a new background Job to be executed. Before long, the JobManager is flooded with Jobs.

Instead of a selection handler, a postSelection handler, as shown in Listing 4, would be a better choice. JFace provides PostSelection handlers to perform event-coalescing so that the last selection is the only one your application is told about -- not the flurry of selections your application likely doesn't care about. You can think of this as ignoring the fine-grained events because they're too noisy and instead focusing on the coarse-grained ones.

Listing 4. Selection listener that responds only to the last selection change
viewer.addPostSelectionChangedListener(new ISelectionChangedListener() {
    public void selectionChanged(SelectionChangedEvent event) {
        new Job("go to db") {
            protected IStatus run(IProgressMonitor monitor) {
                //do expensive work here
                return Status.OK_STATUS; 
            }
        }.schedule();
    }
});
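To make the idea of event coalescing concrete, here is a toolkit-independent sketch. It is not JFace's implementation; it is a generic "wait for a quiet period" approach built on java.util.concurrent, and the class name SelectionCoalescer and the 200 ms quiet period are assumptions for illustration. Each new selection cancels the pending one, so only the last selection in a burst reaches the handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class SelectionCoalescer<T> {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long quietMillis;
    private final Consumer<T> handler;
    private ScheduledFuture<?> pending;

    public SelectionCoalescer(long quietMillis, Consumer<T> handler) {
        this.quietMillis = quietMillis;
        this.handler = handler;
    }

    /** Called for every fine-grained selection change; only the last one in a burst is handled. */
    public synchronized void selectionChanged(T selection) {
        if (pending != null) {
            pending.cancel(false); // drop the previous selection if it has not been handled yet
        }
        pending = scheduler.schedule(() -> handler.accept(selection), quietMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops accepting work and waits for the last pending selection to be handled. */
    public void shutdownAndWait() {
        scheduler.shutdown();
        try {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Simulates holding the Down arrow across ten rows and returns the selections handled. */
    public static List<Integer> demo() {
        List<Integer> handled = new CopyOnWriteArrayList<>();
        SelectionCoalescer<Integer> coalescer = new SelectionCoalescer<>(200, handled::add);
        for (int row = 0; row < 10; row++) {
            coalescer.selectionChanged(row);
        }
        coalescer.shutdownAndWait();
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        System.out.println("handled selections: " + demo());
    }
}
```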

Dealing with disk I/O

You've heard the adage before: memory read/writes are measured in nanoseconds, disk read/writes in milliseconds. In most of the RCP applications I've worked on, the CPU, not disk I/O, has been the bottleneck. That said, disk I/O can still be an issue and you should not ignore it. A common problem area in RCP applications is inefficient reading of images from disk.

I typically use Sysinternals Process Monitor to gain insight into how an application uses files. For example, it's easy to spot the use of unbuffered file reads in Figure 5:

Figure 5. Process monitor showing unbuffered read I/O

If you look at each of the lines in Figure 5, you'll see that a file named big.txt is being read. The last column shows that the file is being read one byte at a time. Listing 5 shows the code that causes this:

Listing 5. Unbuffered I/O (don't do this)
InputStream in = new FileInputStream(args[0]);
int c;
while ((c = in.read()) != -1) {
    //stuff characters in buffer, etc
}

On my ThinkPad T60p with a 7200RPM disk drive, it takes 24 seconds to read a 7MB file. Switching to a BufferedInputStream reduces the time to 350 milliseconds. Most of this speed-up can be attributed to better utilizing the hard disk. Most examples in your programs will not be as dramatic as this, but it's still worth fixing nonbuffered I/O with buffered streams such as BufferedInputStream.
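The buffered variant of Listing 5 might look like the sketch below. The loop is unchanged; only the stream is wrapped. The class and method names are hypothetical, and the method takes any InputStream so it can be tried without a file on disk.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BufferedReadDemo {

    /** Reads the stream one byte at a time through a buffer and returns the number of bytes read. */
    public static long countBytes(InputStream raw) {
        // BufferedInputStream fetches large chunks from the underlying stream,
        // so most read() calls are served from memory.
        try (InputStream in = new BufferedInputStream(raw)) {
            long count = 0;
            while (in.read() != -1) {
                count++; // stuff characters in buffer, etc.
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countBytes(new FileInputStream(args[0])) + " bytes read");
    }
}
```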

Another problem I see in RCP applications is what I call image burn, which occurs when an image is frequently read from disk and disposed. Depending on how frequently this happens, the image may be a good candidate for caching.


Threading and Eclipse Jobs

An application should make good use of the computer's resources. For optimal use of the CPU, the number of application threads should be sized relative to the processor count and the type of work the threads are doing. Each Java thread has a certain amount of native memory associated with it, and there's some amount of CPU overhead in context-switching between threads, so more isn't always better.
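For plain Java code outside the Jobs framework, a common starting point for CPU-bound work is one thread per available processor. The sketch below shows that rule of thumb only; it is not a recommendation taken from any Eclipse API, the class name is hypothetical, and I/O-bound work may justify a different size.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {

    /** One thread per available processor, never fewer than one. */
    public static int cpuBoundPoolSize() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(cpuBoundPoolSize());
        for (int i = 0; i < 4; i++) {
            final int id = i;
            pool.execute(() ->
                    System.out.println("task " + id + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown(); // lets queued tasks finish, then releases the threads
    }
}
```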

I often see threads misused in client applications. It's a good thing that threads are being used for long-running operations that historically would have blocked the UI thread, but in RCP-based applications Jobs should (almost) always be preferred over threads.

You still need to be careful that you don't flood the JobManager with Jobs because its default pool of workers will grow without bounds.

Jobs are analogous to a Runnable: they describe the task, not the thread to run on. Jobs are (not surprisingly) managed by a JobManager, which maintains a pool of workers. This means many Jobs can be handled by a pool of workers. Another big advantage Jobs have over threads is that Jobs are instrumented out-of-the-box. You can set a few flags and run your application, and the JobManager tells you when each and every Job is created, scheduled, run, and completed. The JobManager also tells you how it's managing the worker pool. This is a major advantage when you're trying to understand when background Jobs are executed and how long they take to run.

To enable this support, add the lines shown in Listing 6 to a file. (This information can be found in the .options file of the org.eclipse.core.jobs bundle.) Then start your RCP application with -debug Path_to_debug_file.

Listing 6. Enable Job debug information
# Prints debug information on running background jobs
org.eclipse.core.jobs/jobs=true
# Includes current date and time in job debug information
org.eclipse.core.jobs/jobs/timing=true
# Computes location of error on mismatched IJobManager.beginRule/endRule
org.eclipse.core.jobs/jobs/beginend=true
# Pedantic assertion checking on locks and deadlock reporting
org.eclipse.core.jobs/jobs/locks=true
# Throws an IllegalStateException when deadlock occurs
org.eclipse.core.jobs/jobs/errorondeadlock=true
# Debug shutdown behaviour
org.eclipse.core.jobs/jobs/shutdown=true

Jobs also support advanced scheduling rules. You can create simple or complex scheduling rules that govern when Jobs run. Listing 7 shows one way you can create a rule that prevents two Jobs from running at the same time:

Listing 7. Preventing two Jobs from running simultaneously
ISchedulingRule onlyOne = new ISchedulingRule() {
    public boolean isConflicting(ISchedulingRule rule) {
        return rule == this;
    }
    public boolean contains(ISchedulingRule rule) {
        return rule == this;
    }
};
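// LongRunningJob stands for any Job subclass in your application.
Job job1 = new LongRunningJob();
Job job2 = new LongRunningJob();
// Both Jobs share the same rule, so they cannot run at the same time.
job1.setRule(onlyOne);
job2.setRule(onlyOne);
job1.schedule();
job2.schedule();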

The code in Listing 7 works because the scheduling rule's isConflicting() method is called before a Job runs. While job1 is executing, it "owns" the rule. Browse the implementers of ISchedulingRule for more examples (see Resources).

It's also easy to misuse java.util.Timer. Many applications end up creating several Timers. Each Timer creates a dedicated thread to manage it. Typically, an application should create only a single Timer and give it multiple java.util.TimerTasks to manage, but many developers, working in isolation, create their own Timer, which just wastes more threads. In almost all cases in an RCP application, java.util.Timers can be replaced with a Job scheduled to execute some time in the future.

You can get a lot of the same benefits from the java.util.concurrent classes Executor, ScheduledThreadPoolExecutor, and FutureTask if you're running on Java 5 or later. The java.util.Timer class hasn't been deprecated, but for all intents and purposes, it's been replaced with ScheduledExecutorService.
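Outside the Jobs framework, the "one shared scheduler instead of many Timers" idea might look like the sketch below. The class name SharedScheduler, the thread name, and the delays are assumptions for illustration. Both tasks run on the same thread, so no extra thread is created per task.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SharedScheduler {

    // One daemon thread serves every delayed task in the application.
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "shared-scheduler");
                thread.setDaemon(true);
                return thread;
            });

    /** Schedules a task to run once after the given delay. */
    public static ScheduledFuture<?> runLater(Runnable task, long delayMillis) {
        return SCHEDULER.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Schedules two tasks and returns the names of the threads that ran them. */
    public static Set<String> demo() {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        Runnable record = () -> threadNames.add(Thread.currentThread().getName());
        ScheduledFuture<?> first = runLater(record, 20);
        ScheduledFuture<?> second = runLater(record, 40);
        try {
            first.get(5, TimeUnit.SECONDS);
            second.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return threadNames;
    }

    public static void main(String[] args) {
        System.out.println("tasks ran on: " + demo());
    }
}
```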

One last point worth mentioning about threads: don't busy sleep unless you absolutely must. Listing 8 shows an example of what not to do:

Listing 8. Busy sleeping
while(someCondition) {
    ...more code here...
    Thread.sleep(aFewMilliseconds);
    ...more code here...
}

Applications that do this often create lots of unnecessary garbage in that block of code and generate lots of context switches. Instead of busy sleeping, you should use the higher-level synchronization classes provided in java.util.concurrent, such as BlockingQueue, Semaphore, FutureTask, or CountDownLatch. The concurrent classes provide a way to consume no CPU while waiting for a condition to be true. Sometimes this isn't possible when you're calling third-party code that doesn't use monitors. In such cases, the best you can do is try to minimize the amount of garbage you create while polling.
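As one example of replacing a polling loop, the sketch below waits on a CountDownLatch. The waiting thread consumes no CPU until the worker signals completion or the timeout expires. The class name, the simulated work, and the timeouts are assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchInsteadOfPolling {

    /**
     * Starts a worker that takes workMillis to finish and blocks until it is done.
     * Returns true if the worker finished within timeoutMillis.
     */
    public static boolean waitForWorker(long workMillis, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(workMillis); // stands in for the real background work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown(); // signal the waiting thread
        }, "worker");
        worker.setDaemon(true);
        worker.start();
        try {
            // Blocks without spinning or sleeping in a loop.
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("finished in time: " + waitForWorker(100, 5000));
    }
}
```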


Analyzing and improving startup performance

Improving startup performance can be challenging with RCP applications. In general, startup performance is a function of disk I/O, classloading, and bytecode verification. Of course it's also possible to make your bundles start slowly by doing too much work in them, but typically that isn't the biggest part of startup. Startup tends to be the death of a thousand cuts. Often no one thing takes a lot of time, but everything takes just a little time, which eventually adds up to a lot of time.

RCP applications are built on top of OSGi, the Dynamic Module System for Java. OSGi provides a simple means to hook classloading in a global fashion. Some folks have used this classloading hook to create a Java class cache that improves startup by avoiding going to the disk so frequently. This is a promising technique that needs more research to determine its efficacy.

Another technique Eclipse encourages to improve startup is lazy bundle activation: A bundle isn't loaded and activated until it's needed. Typically, when analyzing startup performance, I gather a list of all the activated bundles and stack traces for why they were activated. I then go through the list deciding if I think the bundle really needs to start. In cases where I don't believe one needs to start, I remove it to characterize the improvement (and to see what breaks). Once I know what sort of improvement results from removing the bundles, I contact the developer(s) who own the code to discuss removing or delaying the bundle's activation.
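A bundle opts in to lazy activation through its manifest. The fragment below is a sketch with a hypothetical bundle name and activator class; Bundle-ActivationPolicy is the standard OSGi header (R4.1 and later), and older Eclipse releases used the Eclipse-LazyStart: true header for the same purpose.

```
Bundle-SymbolicName: com.example.mail.ui
Bundle-Activator: com.example.mail.ui.Activator
Bundle-ActivationPolicy: lazy
```

With this policy, the activator runs when the first class is loaded from the bundle, not when the framework starts.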

To gather bundle activation and classloading information, use the debug options in Listing 9, found in the .options file in the org.eclipse.osgi bundle, or take a look at the latest version in CVS (see Resources):

Listing 9. Enable OSGi debug options
org.eclipse.osgi/debug=true
org.eclipse.osgi/debug/bundleTime=true
org.eclipse.osgi/debug/monitorbundles=true
org.eclipse.osgi/monitor/activation=true
org.eclipse.osgi/monitor/classes=true

However, plug-in developers can go out of their way to thwart lazy loading. For example, one product I worked on had a set of stackable views. An extension was defined so others could contribute their own stacked view. At startup, one or potentially none of the views would be visible, but the extension writer went ahead and created the views, even if they weren't going to be shown. A change was made so the extension showed just the view's title and icon, but it didn't activate the bundle that contributed the extension until the end user actually tried to view it. This made a significant improvement because fewer bundles were activated.

As another example, let's say you're building an application that has a login dialog. Your goal should be to activate only the bundles needed to show the login dialog. I've seen applications where 70% of all bundles are activated just to show the login dialog.

In a way, I'm suggesting that you engage in a shell game, where the time to start the application is moved around but the total is still the same. The users shouldn't need to "pay" for a feature of the product they aren't using. The goal is to spread out the cost instead of paying for everything up front. If an application still isn't fast enough after you've done all you can to improve the actual performance, don't forget to improve the perceived performance.


Conclusion

I can't stress strongly enough that a performance advocate needs to be involved in your application's architecture and design phase. At a minimum, the architects or lead developers must know basic order analysis or time complexity (for example, Big O) to understand how the application's storage requirements or execution time changes as it grows. A last-minute sprint to fix big performance problems tends to be reactive -- and less effective -- because it's nearly impossible to make major architectural changes late in the game.

But even the best-architected applications will have performance bottlenecks, and you need tools and techniques for assessing and addressing them. Now you know how to measure RCP application performance, determine whether slowdowns are due to CPU or I/O bottlenecks, use some instrumentation techniques, keep the UI thread responsive, sidestep thread misuse by using Jobs, and improve startup performance. In the next installment, I'll share some techniques for understanding memory usage and chasing down memory leaks.

Acknowledgment

Thanks to Brian Goetz for his suggestions and ideas during this article's review process.
