Developing with real-time Java, Part 2: Improve service quality

Use real-time Java to reduce variability in Java applications

Some Java™ applications fail to provide reasonable quality of service despite achieving other performance goals, such as average latency or overall throughput. By introducing pauses or interruptions that aren't under the application's control, the Java language and runtime system can sometimes be responsible for an application's inability to meet service-performance metrics. This article, second in a three-part series, explains the root causes of delays and interruptions in a JVM and describes techniques you can use to mitigate them so that your applications deliver more consistent service quality.

Mark Stoodley (mstoodle@ca.ibm.com), WebSphere Real Time Technical Lead, IBM

Mark Stoodley joined the IBM Toronto Lab in 2002 after completing his doctorate in computer engineering at the University of Toronto. Mark has developed optimizations for two different JIT compilers at IBM and is now a team lead within the Testarossa JIT team as well as technical lead for the IBM WebSphere Real Time JVM, working with a team spread across three continents. He spends his spare moments negotiating physically with his house to improve its appearance.



Charlie Gracie (charlie_gracie@ca.ibm.com), J9 Garbage Collection Team Lead, IBM

Charlie Gracie joined the IBM Ottawa Lab in 2004 to work on the J9 Virtual Machine team, after graduating with a BCS degree from the University of New Brunswick.



08 September 2009


Variability in a Java application — usually caused by pauses, or delays, occurring at unpredictable times — can occur throughout the software stack. Delays can be introduced by:

  • Hardware (during processes such as caching)
  • Firmware (processing of system-management interrupts such as CPU temperature data)
  • Operating system (responding to an interrupt or executing a regularly scheduled daemon activity)
  • Other programs running on the same system
  • The JVM (garbage collection, Just-in-time compilation, and class loading)
  • The Java application itself

You can rarely compensate at a higher level for delays introduced by a lower level, so if you try to solve variability only at the application level, you may merely shift JVM or OS delays around without solving the real problem. Fortunately, delays at the lower levels tend to be shorter than those at the higher levels, so you need to look below the JVM or the OS only if your requirements for reducing variability are extremely stringent. If your requirements aren't quite so strict, you can probably afford to focus your efforts at the JVM level and in your application.

Real-time Java gives you the tools you need to battle the sources of variability in a JVM and in your applications to deliver the quality of service your users require. This article covers the sources of variability at the JVM and application levels in detail and describes tools and techniques you can use to mitigate their effects. Then it introduces a simple Java server application that demonstrates some of these concepts.

Addressing the sources of variability

The primary sources of variability in a JVM stem from the Java language's dynamic nature:

  • Memory is never explicitly freed by the application but is instead reclaimed periodically by the garbage collector.
  • Classes are resolved when the application first uses them.
  • Native code is compiled (and can be recompiled) by a Just-in-time (JIT) compiler while the application runs, based on which classes and methods are invoked frequently.

At the Java application level, management of threading is the key area related to variability.

Garbage-collection pauses

When the garbage collector runs to reclaim memory that is no longer used by the program, it can stop all the application threads. (This type of collector is known as a stop-the-world, or STW, collector.) Or it can perform some of its work concurrently with the application. In either case, the resources the garbage collector needs are not available to the application, so garbage collection (GC) is, as is well known, a source of pauses and variability in Java application performance. Although each of the many GC models has its pros and cons, when the goal for an application is short GC pauses, the two main choices are generational and real-time collectors.

Generational collectors organize the heap into at least two sections typically called the new and the old (sometimes called tenured) spaces. New objects are always allocated in the new space. When the new space runs out of free memory, the garbage is collected only in that space. Use of a relatively small new space can keep the usual GC cycle time quite short. Objects that survive some number of new-space collections are promoted into old space. Old-space collections typically occur much less frequently than new-space collections, but because old space is much larger than new space, these GC cycles can take much longer. Generational garbage collectors offer relatively short average GC pauses, but the cost of old-space collections can cause the standard deviation of these pause times to be quite large. Generational collectors are most effective in applications for which the set of live data does not change much over time but lots of garbage is being generated. In this scenario, old-space collections are extremely rare, and so GC pause times are due to short new-space collections.

In contrast to generational collectors, real-time garbage collectors control their behavior to reduce the length of GC cycles greatly (by exploiting cycles when the application is otherwise idle) or to reduce the impact of these cycles on application performance (by performing work in small increments according to a "contract" with the application). Using one of these collectors allows you to predict the worst case for completing a specific task. For example, the garbage collector in the IBM® WebSphere® Real-Time JVMs divides the GC cycles up into small pieces of work — called GC quanta — that can be completed incrementally. The scheduling of quanta has an extremely low impact on application performance, with delays as low as hundreds of microseconds but typically less than 1 millisecond. To achieve these kinds of delays, the garbage collector must be able to plan its work by introducing the concept of an application utilization contract. This contract governs how frequently the GC is allowed to interrupt the application to perform its work. For example, the default utilization contract is 70 percent, which only allows the GC to use up to 3 ms out of every 10 ms, with typical pauses around 500 microseconds, when running on a real-time operating system. (See "Real-time Java, Part 4: Real-time garbage collection" for a detailed description of IBM WebSphere Real Time garbage collector operation.)

Heap size and application utilization are important tuning options to consider when running an application on a real-time garbage collector. As application utilization is increased, the garbage collector receives less time to complete its work, so a larger heap is required to make sure that the GC cycle can be completed incrementally. If the garbage collector cannot keep up with the allocation rate, the GC falls back to a synchronous collection.

For example, an application running on the IBM WebSphere Real-Time JVMs, with their default 70 percent application utilization, requires more heap by default than if it were running on a JVM using a generational garbage collector (which provides no utilization contract). Because real-time garbage collectors control the length of GC pauses, increasing the heap size lowers the GC frequency without making individual pause times longer. In non-real-time garbage collectors, on the other hand, increasing the heap size typically reduces the frequency of GC cycles, which lowers the overall impact of the garbage collector; when a GC cycle does occur, the pauses are generally larger (because there's more heap to examine).

In the IBM WebSphere Real Time JVMs, you can adjust the size of the heap with the -Xmx<size> option. For example, -Xmx512m specifies a 512MB heap. You can also adjust the application utilization. For example, -Xgc:targetUtilization=80 sets it to 80 percent.
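
For instance, a command line that combines both options might look like the following (the application class name here is just a placeholder):

$ java -Xmx512m -Xgc:targetUtilization=80 MyServerApp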

Java class-loading pauses

The Java language specification requires that classes be loaded, verified, resolved, and initialized when an application first references them. If the first reference to a class C occurs during a time-critical operation, then the time to load, verify, resolve, and initialize C might cause that operation to take longer than expected. Because loading C includes verifying it — which may require other classes to be loaded — the full delay a Java application incurs before it can use a particular class for the first time can be significantly longer than expected.

Why might a class be referenced for the first time only late in an application's execution? Rarely executed paths are one common reason a new class gets loaded. For example, the code in Listing 1 contains an if condition that may execute rarely. (Exception and error handling has been mostly omitted, for brevity, from all of this article's listings.)

Listing 1. Example of a rarely executed condition loading a new class
Iterator<MyClass> cursor = list.iterator();
while (cursor.hasNext()) {
    MyClass o = cursor.next();
    if (o.getID() == 17) {
        NeverBeforeLoadedClass o2 = new NeverBeforeLoadedClass(o);
        // do something with o2
    }
    else {
        // do something with o
    }
}

Exception classes are another example of classes that may not load until well into an application's execution, because exceptions are ideally (though not always) rare occurrences. Because exceptions are rarely quick to process, the additional overhead of loading extra classes may push operation latency above a critical threshold. In general, exceptions thrown during time-critical operations should be avoided whenever possible.

New classes can also be loaded when certain services, such as reflection, are used within the Java class library. The underlying implementation of the reflection classes generates new classes on the fly to be loaded into the JVM. Repeated use of the reflection classes in timing-sensitive code can result in ongoing class-loading activity that introduces delays. Running with the -verbose:class option is the best way to detect these classes being created. To avoid their creation at inconvenient times, avoid using the reflection services to look up classes, fields, or methods from strings during the time-critical parts of your application. Instead, call these services early in your application and store the results for later use, which prevents most of these kinds of classes from being created on the fly when you don't want them to be.
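
As a minimal sketch of this idea (the class name and helper methods here are purely illustrative, not part of the article's example code), the reflective lookups can be performed once during initialization and the resulting Method objects kept in a map for use on the time-critical path:

import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

class ReflectionCache {
    private static final Map<String,Method> cache = new HashMap<String,Method>();

    // Call during startup: the reflective lookup (and any class loading or
    // on-the-fly class generation it triggers) happens outside the critical path.
    static void prime(String className, String methodName) throws Exception {
        Class clazz = Class.forName(className);
        cache.put(className + "#" + methodName, clazz.getMethod(methodName));
    }

    // Call from time-critical code: a plain map lookup, no reflection machinery.
    static Method lookup(String className, String methodName) {
        return cache.get(className + "#" + methodName);
    }
}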

A generic technique to avoid class-loading delays during the time-sensitive parts of your application is to preload classes during application startup or initialization. Although this preloading step introduces some additional startup delay (unfortunately, improving one metric often has negative consequences for other metrics), if used carefully it can eliminate unwanted class loading later on. This startup process is simple to implement, as shown in Listing 2:

Listing 2. Controlled class loading from a list of classes
Iterator<String> classIt = listOfClassNamesToLoad.iterator();
while (classIt.hasNext()) {
    String className = classIt.next();
    try {
        Class clazz = Class.forName(className);
        String n = clazz.getName();
    } catch (Exception e) {
        System.err.println("Could not load class: " + className);
        System.err.println(e);
    }
}

Notice the clazz.getName() call, which forces the class to be initialized. Building the list of classes requires gathering information from your application while it runs, or using a utility that can determine which classes your application will load. For example, you could capture the output of your program while running with the -verbose:class option. Listing 3 shows what the output of this command might look like if you use an IBM WebSphere Real Time product:

Listing 3. Excerpt of output from java run with -verbose:class
    ...
    class load: java/util/zip/ZipConstants
    class load: java/util/zip/ZipFile
    class load: java/util/jar/JarFile
    class load: sun/misc/JavaUtilJarAccess
    class load: java/util/jar/JavaUtilJarAccessImpl
    class load: java/util/zip/ZipEntry
    class load: java/util/jar/JarEntry
    class load: java/util/jar/JarFile$JarFileEntry
    class load: java/net/URLConnection
    class load: java/net/JarURLConnection
    class load: sun/net/www/protocol/jar/JarURLConnection
    ...

By storing the list of classes loaded by your application during one execution and using that list to populate the list of class names for the loop shown in Listing 2, you can be sure that those classes load before your application starts to run. Of course, different executions of your application may take different paths, so the list from one execution may not be complete. On top of that, if your application is under active development, newly written or modified code may rely on new classes that aren't part of the list (or classes that are in the list may no longer be required). Unfortunately, maintaining the list of classes can be an extremely troublesome part of using this approach to class preloading. If you use this approach, keep in mind that the name of the class output by -verbose:class does not match the format that's needed by Class.forName(): the verbose output separates class packages with forward slashes, whereas Class.forName() expects them to be separated by periods.
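
One small transformation takes care of that mismatch. For example, something along these lines (the variable names are illustrative) could be applied to each line captured from the -verbose:class output before it is handed to the loop in Listing 2:

// convert the -verbose:class form (java/util/zip/ZipFile) into the
// binary name that Class.forName() expects (java.util.zip.ZipFile)
String classNameForLoading = verboseClassName.replace('/', '.');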

For applications for which class loading is an issue, some tools can help you manage preloading, including the Real Time Class Analysis Tool (RATCAT) and the IBM Real Time Application Execution Optimizer for Java (see Resources). These tools provide some automation for identifying the correct list of classes to preload and incorporating class-preloading code into your application.

JIT code-compilation pauses

Yet a third source of delays within the JVM itself is the JIT compiler. It acts while your application runs to translate the program's methods from the bytecodes generated by the javac compiler into the native instructions of the CPU the application runs on. The JIT compiler is fundamental to the Java platform's success because it enables high application performance without sacrificing the platform neutrality of Java bytecodes. Over the last decade and more, JIT compiler engineers have made tremendous strides in improving throughput and latency for Java applications.

A JIT optimization example

A good example of JIT optimization is specialization of arraycopies. For a frequently executed method, the JIT compiler can profile the length of a particular arraycopy call to see if certain lengths are most common. After profiling the call for a while, the JIT compiler may find that the length is almost always 12 bytes long. With this knowledge, the JIT can generate an extremely fast path for the arraycopy call that directly copies the 12 bytes required in the manner most efficient for the target processor. The JIT inserts a conditional check to see if the length is 12 bytes, and if it is, then the ultraefficient fast-path copy is performed. If the length isn't 12, then a different path occurs that performs the copy in the default manner, which may involve much longer overhead because it can handle any array length. If most operations in the application use the fast path, then the common operation latency will be based on the time it takes to copy those 12 bytes directly. But any operation that requires a copy of a different length will appear to be delayed relative to the common operation timing.

Unfortunately, such improvements are accompanied by pauses in Java application performance, because the JIT compiler "steals" cycles from the application program to compile (or even recompile) the code for a particular method. Depending on the size of the method that is compiled and how aggressively the JIT chooses to compile it, the compilation time can range from less than a millisecond to more than a second (for particularly large methods that are observed by the JIT compiler to be contributing significantly to the execution time of the application). But the activity of the JIT compiler itself isn't the only source of unexpected variations in application-level timings. Because JIT compiler engineers have focused almost exclusively on average-case performance to improve throughput and latency most efficiently, JIT compilers commonly perform a variety of optimizations that are "usually" right or "mostly" high-performance. In the common case, these optimizations are extremely effective, and the heuristics that have been developed do a good job of fitting the optimization to the situations that are most common while an application runs. In some cases, however, such optimizations can introduce too much performance variability.

In addition to preloading all classes, you can also request that the JIT compiler explicitly compile the methods of those classes during application initialization. Listing 4 extends the class-preloading code in Listing 2 to control method compilation:

Listing 4. Controlled method compilation
Iterator<String> classIt = listOfClassNamesToLoad.iterator();
while (classIt.hasNext()) {
    String className = classIt.next();
    try {
        Class clazz = Class.forName(className);
        String n = clazz.getName();
        java.lang.Compiler.compileClass(clazz);
    } catch (Exception e) {
        System.err.println("Could not load class: " + className);
        System.err.println(e);
    }
}
java.lang.Compiler.disable();  // optional

This code causes a set of classes to be loaded and the methods of those classes all to be compiled by the JIT compiler. The last line disables the JIT compiler for the remainder of application execution.

This approach generally results in lower overall throughput or latency performance than allowing the JIT compiler full freedom in selecting which methods will be compiled. Because the methods have not been invoked before the JIT compiler runs, the JIT compiler has much less information about how best to optimize the methods it compiles — so expect these methods to execute more slowly. Also, because the compiler is disabled, no methods will be recompiled even if they are responsible for a large fraction of the program's execution time, so adaptive JIT compilation frameworks like those used in most modern JVMs will not be active. Calling Compiler.disable() isn't absolutely necessary: precompiling the methods already eliminates a large number of JIT-compiler-induced pauses. If the compiler is left enabled, however, the pauses that remain will be more aggressive recompilations performed on the hot methods of the application, which typically require longer compilation times and so have a higher potential impact on application timings. Also, the JIT compiler in a particular JVM may not be unloaded when the disable() method is invoked, so there may still be memory consumed, shared libraries loaded, and other artifacts of the JIT compiler present during the application program's run-time phase.

The degree to which native code compilation affects application performance varies with the application, of course. The best way to see whether compilation could be a problem is to turn on verbose JIT output, which indicates when compilations occur, so you can judge whether they might affect your application timings. For example, with the IBM WebSphere Real Time JVM, you can turn on verbose JIT logging with the -Xjit:verbose command-line option.

Beyond this preload and early-compile approach, there isn't much an application writer can do to avoid pauses incurred by the JIT compiler, short of using exotic vendor-specific JIT compiler command-line options — a risky approach. JVM vendors rarely support these options in production scenarios. Because they aren't default configurations, they're less well tested by the vendors, and they can change in both name and meaning from one release to the next.

However, some alternative JVMs can provide a few options for you, depending on how important JIT-compiler-induced pauses are to you. Real-time JVMs designed for use in hard real-time Java systems generally provide more options. The IBM WebSphere Real Time for Real Time Linux® JVM, for example, has five code-compilation strategies available with varying capability to reduce JIT compiler pauses:

  • Default JIT compilation, whereby the JIT compiler thread runs at low priority
  • Default JIT compilation at low priority with Ahead-of-time (AOT) compiled code used initially
  • Program-controlled compilation at startup with recompilation enabled
  • Program-controlled compilation at startup with recompilation disabled
  • AOT-compiled code only

These options are listed in general descending order of expected level of throughput/latency performance and expected pause times. So the default JIT compilation option, which uses a JIT compilation thread running at the lowest priority (which can be lower than application threads), provides the highest expected throughput performance but is also expected to show the greatest pauses that are due to JIT compilation (of these five options). The first two options use asynchronous compilation, which means that an application thread that tries to invoke a method that has been selected for (re)compilation need not wait until the compilation is complete. The last option has the lowest expected throughput/latency performance but zero pauses from the JIT compiler because the JIT compiler is completely disabled in this scenario.

The IBM WebSphere Real Time for Real Time Linux JVM provides a tool called admincache that allows you to create a shared class cache containing class files from a set of JAR files and, optionally, to store AOT-compiled code for those classes in the same cache. You can set an option in your java command line that causes classes stored in the shared class cache to be loaded from the cache and AOT code to be automatically loaded into the JVM when the class is loaded. A class-preloading loop like the one shown in Listing 2 is all that's required to ensure you get the full benefits of the AOT-compiled code. See Resources for a link to the admincache documentation.

Thread management

Controlling the execution of threads in a multithreaded application such as a transaction server is crucial to eliminating variability in transaction times. Although the Java programming language defines a threading model that includes a notion of thread priorities, the behavior of threads in a real JVM is largely implementation-defined with few rules a Java program can rely upon. For example, although Java threads can be assigned 1 of 10 thread priorities, the mapping of those application-level priorities to OS priority values is implementation-defined. (It's perfectly valid for a JVM to map all Java thread priorities onto the same OS priority value.) On top of that, the scheduling policy for Java threads is also implementation-defined but usually ends up being time-sliced so that even high-priority threads end up sharing CPU resources with lower-priority threads. Sharing resources with lower-priority threads can cause higher-priority threads to experience delays when they are scheduled out so that other tasks can get a time slice. Keep in mind that the amount of CPU a thread gets becomes dependent not only on the priority but also on the total number of threads that need to be scheduled. Unless you can strictly control how many threads are active at any given time, the time it takes even your highest-priority threads to execute an operation may fall within a relatively large range.

So even if you specify the highest Java thread priority (java.lang.Thread.MAX_PRIORITY) for your worker threads, that may not provide much isolation from lower-priority tasks on the system. Unfortunately, beyond using a fixed set of worker threads (don't keep allocating new threads and relying on GC to collect unused ones, and don't grow and shrink your thread pool) and minimizing the number of low-priority activities on the system while your application runs, there may not be much more you can do, because the standard Java threading model does not provide the tools needed to control threading behavior. Even a soft real-time JVM, if it relies on the standard Java threading model, cannot usually provide much help here.

A hard real-time JVM that supports the Real Time Specification for Java (RTSJ), however — such as the IBM WebSphere Real Time for Real Time Linux V2.0 or Sun's RTS 2 — can give markedly improved threading behavior over standard Java. Among its enhancements to the standard Java language and VM specifications, the RTSJ introduces two new types of threads, RealtimeThread and NoHeapRealtimeThread, which are much more rigorously defined than the standard Java threading model. These kinds of threads provide true preemptive priority-based scheduling: If a high-priority task needs to execute and a lower-priority task is currently scheduled on a processor core, then the lower-priority task is preempted so that the high-priority task can execute.

Most real-time OSs can perform this preemption on the order of tens of microseconds, which affects only applications with extremely sensitive timing requirements. Both new thread types also typically use a FIFO (first-in, first-out) scheduling policy instead of the familiar round-robin scheduling used by JVMs running on most OSs. The most obvious difference between the round-robin and FIFO scheduling policies is that, among threads of the same priority, once scheduled a thread continues to execute until it blocks or voluntarily releases the processor. The advantage of this model is that the time to execute a particular task can be more predictable because the processor isn't shared, even if there are several tasks with the same priority. On top of that, if you can keep that thread from blocking by eliminating synchronization and I/O activity, the OS will not interfere with the task once it starts. In practice, however, eliminating all synchronization is extremely difficult, so it can be hard to reach this ideal for realistic tasks. Nonetheless, FIFO scheduling provides an important helping hand to an application designer trying to put a cap on delays.

You can think of the RTSJ as a large box of tools that can help you to design applications with real-time behavior; you can use just a couple of the tools or you can completely rewrite your application to provide extremely predictable performance. It is usually not difficult to modify your application to use RealtimeThreads, and you can do it without even having access to a real-time JVM to compile your Java code, through careful use of the Java reflection services.
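
For example, here is a minimal sketch of that reflective approach (the class and method names in this sketch are illustrative and not part of the article's example code). It looks up the RTSJ's six-argument RealtimeThread constructor by reflection, so the code compiles on any JVM, and it falls back to an ordinary Java thread when the javax.realtime classes aren't available:

import java.lang.reflect.Constructor;

class PortableThreads {
    // Create a javax.realtime.RealtimeThread when the RTSJ classes are present;
    // otherwise fall back to a plain java.lang.Thread. All RTSJ types are accessed
    // reflectively, so no real-time JVM is needed to compile this code.
    static Thread newWorkerThread(Runnable logic) {
        try {
            Class rtThreadClass = Class.forName("javax.realtime.RealtimeThread");
            Constructor ctor = rtThreadClass.getConstructor(
                Class.forName("javax.realtime.SchedulingParameters"),
                Class.forName("javax.realtime.ReleaseParameters"),
                Class.forName("javax.realtime.MemoryParameters"),
                Class.forName("javax.realtime.MemoryArea"),
                Class.forName("javax.realtime.ProcessingGroupParameters"),
                Runnable.class);
            return (Thread) ctor.newInstance(null, null, null, null, null, logic);
        } catch (Exception e) {
            return new Thread(logic); // not a real-time JVM: use a regular thread
        }
    }
}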

Taking advantage of the variability benefits of FIFO scheduling, however, can require some further changes to your application. FIFO scheduling behaves differently from round-robin scheduling, and the differences can cause hangs in some Java programs. For example, if your application relies on Thread.yield() to allow other threads to run on a core — a technique frequently used to poll for some condition without using a full core to do it — then the desired effect will not occur because, with FIFO scheduling, Thread.yield() does not block the current thread. Because the current thread remains schedulable and it is already the thread at the front of the scheduling queue in the OS kernel, it will simply continue to execute. So a coding pattern intended to provide fair access to CPU resources while waiting for a condition to become true in fact consumes 100 percent of whichever CPU core it happens to start running on. And that's the best possible result. If the thread that needs to set that condition has a lower priority, then it may never be able to get access to a core to set the condition. With today's multicore processors, this problem may be less likely to occur, but it emphasizes that you need to think carefully about which priorities you use if you employ RealtimeThreads. The safest approach is to make all threads use a single priority value and eliminate the use of Thread.yield() and other kinds of spin loops that will fully consume a CPU because they never block. Of course, taking full advantage of the priority values available to RealtimeThreads will give you the best chance of meeting your service-quality goals. (For more tips on using RealtimeThreads in your application, refer to "Real-time Java, Part 3: Threading and synchronization.")
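
To make the problem concrete, here is a small illustrative sketch (not taken from the article's server code) that contrasts a yield-based polling loop, which can monopolize a core under FIFO scheduling, with a blocking wait on a monitor, which releases the core until the condition is set:

class ConditionFlag {
    private boolean ready = false;

    // Polling version: under FIFO scheduling, Thread.yield() does not block, so this
    // loop keeps running and can consume an entire core until ready becomes true.
    void awaitByPolling() {
        while (!isReady()) {
            Thread.yield();
        }
    }

    private synchronized boolean isReady() {
        return ready;
    }

    // Blocking version: wait() releases the monitor and blocks the thread, so
    // lower-priority threads can run and eventually set the condition.
    synchronized void awaitByBlocking() throws InterruptedException {
        while (!ready) {
            wait();
        }
    }

    // Called by the thread that makes the condition true.
    synchronized void setReady() {
        ready = true;
        notifyAll();
    }
}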


A Java server example

In the remainder of this article, we'll apply some of the ideas introduced in earlier sections to a relatively simple Java server application built using the Executors service in the Java Class Library. With only a small amount of application code, the Executors service allows you to create a server managing a pool of worker threads, as shown in Listing 5:

Listing 5. Server and TaskHandler classes using Executors service
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadFactory;

class Server {
    private ExecutorService threadPool;
    Server(int numThreads) {
        // ThreadFactory is an interface, so obtain the library's default factory
        ThreadFactory theFactory = Executors.defaultThreadFactory();
        this.threadPool = Executors.newFixedThreadPool(numThreads, theFactory);
    }

    public void start() {
        while (true) {
            // main server handling loop, find a task to do
            // create a "TaskHandler" object to complete this operation
            TaskHandler task = new TaskHandler();
            this.threadPool.execute(task);
        }
        // this.threadPool.shutdown();  // unreachable after while (true); call when the server stops
    }

    public static void main(String[] args) {
        int serverThreads = Integer.parseInt(args[0]);
        Server theServer = new Server(serverThreads);
        theServer.start();
    }
}

class TaskHandler implements Runnable {
    public void run() {
        // code to handle a "task"
    }
}

This server creates as many worker threads as needed up to the maximum specified when the server is created (decoded from the command line in this particular example). Each worker thread performs some bit of work using the TaskHandler class. For our purposes, we'll create a TaskHandler.run() method that should take the same amount of time every time it runs. Any variability in the time measured to execute TaskHandler.run(), therefore, is due to pauses or variability in the underlying JVM, some threading issue, or pauses introduced at a lower level of the stack. Listing 6 shows the TaskHandler class:

Listing 6. TaskHandler class with predictable performance
import java.lang.Runnable;
class TaskHandler implements Runnable {
    static public int N=50000;
    static public int M=100;
    static long result=0L;
    
    // constant work per transaction
    public void run() {
        long dispatchTime = System.nanoTime();
        long x=0L;
        for (int j=0;j < M;j++) {
            for (int i=0;i < N;i++) {
                x = x + i;
            }
        }
        result = x;
        long endTime = System.nanoTime();
        Server.reportTiming(dispatchTime, endTime);
    }
}

The loops in this run() method compute M (100) times the sum of the first N (50,000) integers, about 5 million additions in total. The values of M and N were chosen so that the transaction times on the machine we ran it on measured around 10 ms, which means a single operation can be interrupted by an OS scheduling quantum (which typically lasts about 10 ms). We constructed the loops in this computation so that a JIT compiler could generate excellent code that would execute for an extremely predictable amount of time: the run() method does not explicitly block between the two calls to System.nanoTime() used to time how long the loops take to run. Because the measured code is highly predictable, we can use it to show that significant sources of delays and variability do not necessarily originate from the code you're measuring.

Let's make this application slightly more realistic by forcing the garbage-collector subsystem to be active while TaskHandler code is running. Listing 7 shows this GCStressThread class:

Listing 7. GCStressThread class to generate garbage continuously
class GCStressThread extends Thread {
    HashMap<Integer,BinaryTree> map;
    volatile boolean stop = false;

    class BinaryTree {
        public BinaryTree left;
        public BinaryTree right;
        public Long value;
    }
    private void allocateSomeData(boolean useSleep) {
        try {
            for (int i=0;i < 125;i++) {
                if (useSleep)
                    Thread.sleep(100);
                BinaryTree newTree = createNewTree(15); // create full 15-level BinaryTree
                this.map.put(new Integer(i), newTree);
            }
        } catch (InterruptedException e) {
            stop = true;
        }
    }

    public void initialize() {
        this.map = new HashMap<Integer,BinaryTree>();
        allocateSomeData(false);
        System.out.println("\nFinished initializing\n");
    }

    public void run() {
        while (!stop) {
            allocateSomeData(true);
        }
    }
}

The GCStressThread maintains a set of BinaryTrees via a HashMap. It iterates over the same set of Integer keys for the HashMap storing new BinaryTree structures, which are simply fully populated 15-level BinaryTrees. (So there are 2^15 = 32,768 nodes in each BinaryTree stored into the HashMap.) The HashMap holds 125 BinaryTrees at any one time (the live data), and every 100 ms it replaces one of them with a new BinaryTree. In this way, this data structure maintains a fairly complicated set of live objects as well as generates garbage at a particular rate. The HashMap is first initialized with a full set of 125 BinaryTrees using the initialize() routine, which does not bother pausing between allocations of each tree. Once the GCStressThread has been started (just before the server is started) it operates throughout the handling of the TaskHandler operations by the server's worker threads.

We won't use a client to drive this server. We'll simply create NUM_OPERATIONS (10,000) operations directly inside the server main loop (in the Server.start() method). Listing 8 shows the Server.start() method:

Listing 8. Dispatching operations inside the server
public void start() {
    for (int m=0; m < NUM_OPERATIONS;m++) {
        TaskHandler task = new TaskHandler();
        threadPool.execute(task);
    }
    try {
        while (!serverShutdown) { // boolean set to true when done
            Thread.sleep(1000);
        }
    }
    catch (InterruptedException e) {
    }
}

If we collect statistics on the times to complete each TaskHandler.run() invocation, we can see how much variability is introduced by the JVM and by the application's design. We used an IBM xServer e5440 with eight physical cores running the Red Hat RHEL MRG real-time operating system. (Hyperthreading is disabled. Although hyperthreading can provide some throughput improvement in a benchmark, its virtual cores are not full physical cores, so the timings of individual operations on processors with hyperthreading enabled can differ markedly.) When we ran this server with six threads on the eight-core machine (generously leaving one core for the Server main thread and one for the GCStressThread) with the IBM Java6 SR3 JVM, we got the following (representative) results:

$ java -Xms700m -Xmx700m -Xgcpolicy:optthruput Server 6
10000 operations in 16582 ms
Throughput is 603 operations / second
Histogram of operation times:
9ms - 10ms      9942    99 %
10ms - 11ms     2       0 %
11ms - 12ms     32      0 %
30ms - 40ms     4       0 %
70ms - 80ms     1       0 %
200ms - 300ms   6       0 %
400ms - 500ms   6       0 %
500ms - 542ms   6       0 %

You can see that almost all of the operations complete within 10 ms, but some operations take longer than a half second (50 times longer). That's quite a variation! Let's see how we can eliminate some of this variability by eliminating delays incurred by Java class loading, JIT native code compilation, GC, and threading.

We first collected the list of classes loaded by the application through an entire run with -verbose:class. We stored the output to a file and then edited it so that there was one properly formatted class name on each line of that file. We added a preload() method to the Server class to load each of those classes, JIT-compile all of their methods, and then disable the JIT compiler, as shown in Listing 9:

Listing 9. Preloading classes and methods for the server
private void preload(String classesFileName) {
    try {
        FileReader fReader = new FileReader(classesFileName);
        BufferedReader reader = new BufferedReader(fReader);
        String className = reader.readLine();
        while (className != null) {
            try {
                Class clazz = Class.forName(className);
                String n = clazz.getName();
                Compiler.compileClass(clazz);
            } catch (Exception e) {
            }
            className = reader.readLine();
        }
    } catch (Exception e) {
    }
    Compiler.disable();
}

Class loading isn't a significant problem in our simple server because our TaskHandler.run() method is so simple: once that class is loaded, not much class loading occurs later in the execution of the Server, which can be verified by running with -verbose:class. The main benefit derives from compiling the methods before running any measured TaskHandler operations. Although we could have used a warm-up loop, this approach tends to be JVM-specific because the heuristics the JIT compiler uses to select methods to compile differ among JVM implementations. Using the Compiler.compileClass() service provides more controllable compilation activity, but as we mentioned earlier in the article, we should expect a throughput drop from this approach. The results from running the application with these options are:

$ java -Xms700m -Xmx700m -Xgcpolicy:optthruput Server 6
10000 operations in 20936 ms
Throughput is 477 operations / second
Histogram of operation times:
11ms - 12ms     9509    95 %
12ms - 13ms     478     4 %
13ms - 14ms     1       0 %
400ms - 500ms   6       0 %
500ms - 527ms   6       0 %

Notice that although the longest delays haven't changed much, the histogram is much shorter than it was initially. Many of the shorter delays were clearly introduced by the JIT compiler, so performing the compilations earlier and then disabling the JIT compiler was clearly a step forward. Another interesting observation is that the common operation times have gotten somewhat longer (from around 9 to 10 ms, to 11 to 12 ms). The operations have been slowed down because the quality of the code generated by a forced JIT compilation before the methods have been invoked is typically lower than that of fully exercised code. That's not a surprising result, because one of the great advantages of the JIT compiler is exploiting dynamic characteristics of the application that's running to make it run more efficiently.

We'll continue to use this class-preloading and method-precompiling code in the rest of the article.

Because our GCStressThread generates a constantly changing set of live data, using a generational GC policy isn't expected to provide much pause-time benefit. Instead, we tried the real-time garbage collector in the IBM WebSphere Real Time for Real Time Linux V2.0 SR1 product. The results were initially disappointing, even after we added the -Xgcthreads8 option, which allows the collector to use eight GC threads rather than the default single thread. (The collector cannot reliably keep up with the allocation rate of this application with only a single GC thread.)

$ java -Xms700m -Xmx700m -Xgcpolicy:metronome -Xgcthreads8 Server 6
10000 operations in 72024 ms
Throughput is 138 operations / second
Histogram of operation times:
11ms - 12ms     82      0 %
12ms - 13ms     250     2 %
13ms - 14ms     19      0 %
14ms - 15ms     50      0 %
15ms - 16ms     339     3 %
16ms - 17ms     889     8 %
17ms - 18ms     730     7 %
18ms - 19ms     411     4 %
19ms - 20ms     287     2 %
20ms - 30ms     1051    10 %
30ms - 40ms     504     5 %
40ms - 50ms     846     8 %
50ms - 60ms     1168    11 %
60ms - 70ms     1434    14 %
70ms - 80ms     980     9 %
80ms - 90ms     349     3 %
90ms - 100ms    28      0 %
100ms - 112ms   7       0 %

Using the real-time collector has lowered the maximum operation time substantially, but it has also increased the spread of the operation times. Worse, the throughput rate has dropped substantially.

The final step is to use RealtimeThreads — rather than regular Java threads — for the worker threads. We created a RealtimeThreadFactory class that we can give to the Executors service, as shown in Listing 10:

Listing 10. RealtimeThreadFactory class
import java.util.concurrent.ThreadFactory;
import javax.realtime.PriorityScheduler;
import javax.realtime.RealtimeThread;
import javax.realtime.Scheduler;
import javax.realtime.PriorityParameters;

class RealtimeThreadFactory implements ThreadFactory {
    public Thread newThread(Runnable r) {
        RealtimeThread rtThread = new RealtimeThread(null, null, null, null, null, r);

        // adjust parameters as needed
        PriorityParameters pp = (PriorityParameters) rtThread.getSchedulingParameters();
        PriorityScheduler scheduler = PriorityScheduler.instance();
        pp.setPriority(scheduler.getMaxPriority());

        return rtThread;
    }
}
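
To use this factory, only the Server constructor from Listing 5 needs to change; a minimal sketch of that change (the rest of the server code stays the same) looks like this:

Server(int numThreads) {
    // worker threads are now created by RealtimeThreadFactory, so each one is a
    // maximum-priority RealtimeThread rather than a regular java.lang.Thread
    this.threadPool = Executors.newFixedThreadPool(numThreads, new RealtimeThreadFactory());
}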

Passing an instance of the RealtimeThreadFactory class to the Executors.newFixedThreadPool() service causes the worker threads to be RealtimeThreads using FIFO scheduling with the highest priority available. The garbage collector will still interrupt these threads when it needs to perform work, but no other lower-priority tasks will interfere with the worker threads:

$ java -Xms700m -Xmx700m -Xgcpolicy:metronome -Xgcthreads8 Server 6
Handled 10000 operations in 27975 ms
Throughput is 357 operations / second
Histogram of operation times:
11ms - 12ms     159     1 %
12ms - 13ms     61      0 %
13ms - 14ms     17      0 %
14ms - 15ms     63      0 %
15ms - 16ms     1613    16 %
16ms - 17ms     4249    42 %
17ms - 18ms     2862    28 %
18ms - 19ms     975     9 %
19ms - 20ms     1       0 %

With this last change, we substantially improve both the worst operation time (down to only 19 ms) and the overall throughput (up to 357 operations per second). So we have substantially reduced the variability of operation times, but we paid a fairly steep price in throughput performance. The operation of the garbage collector, which uses up to 3 ms of every 10 ms, explains why an operation that typically takes about 12 ms can be extended by 4 to 5 ms, which is why the bulk of the operations now take around 16 to 17 ms. The throughput drop is probably more than you expect because the real-time JVM, in addition to using the Metronome real-time garbage collector, also has modified locking primitives that protect against priority inversion, an important problem when FIFO scheduling is used (see "Real-time Java, Part 1: Using Java code to program real-time systems"). Unfortunately, the synchronization between the master thread and the worker threads contributes more overhead that ultimately has an impact on throughput, although it isn't measured as part of any operation time (so it doesn't show up in the histogram).

So while our server benefits from the modifications made to improve predictability, it certainly experiences a fairly big throughput drop. Nonetheless, if those few incredibly long operation times represent an unacceptable level of service quality, then using RealtimeThreads with a real-time JVM may be just the right solution.


Wrap-up

In the world of Java applications, throughput and latency have traditionally been the metrics chosen by application and benchmark designers for reporting and optimization. This choice has had a widespread impact on the evolution of Java runtimes built to improve performance. Although Java runtimes started out as interpreters with poor runtime latency and throughput, modern JVMs can compete well with other languages on these metrics for many applications. Until relatively recently, though, the same could not be said about some other metrics that can have a big impact on an application's perceived performance — especially variability, which affects quality of service.

The introduction of real-time Java has given application designers the tools they need to address sources of variability in a JVM and in their applications to deliver the quality of service their consumers and customers expect. This article introduced a number of techniques you can use to modify a Java application to reduce pauses and variability that stem from the JVM and from thread scheduling. Reducing variability frequently incurs a drop in latency and throughput performance. The degree to which that drop is acceptable determines which tools are appropriate for a particular application.

Resources

