JVM updates in WebSphere Application Server V8: Adaptive spinning and the lock nursery

This article introduces refinements to concurrency-related optimizations available in the Java™ Virtual Machine shipped with IBM® WebSphere® Application Server V8. As these optimizations are related to synchronization between threads, this discussion begins with a short introduction to synchronization concepts within the JVM, as well as existing optimizations, and then introduces two new refinements that can help to reduce memory and CPU consumption within the JVM. This content is part of the IBM WebSphere Developer Technical Journal.

Michael Dawson, Advisory Software Developer, IBM Ottawa Lab

Michael Dawson graduated in 1989 from the University of Waterloo with a bachelor's degree in computer engineering and in 1991 from Queen's University with a master's degree in electrical engineering, specializing in cryptography. He then did security consulting work and developed EDI security products. Next, he was the development lead for a start-up that delivered security products across various platforms. He has since held leadership roles in teams developing e-commerce applications and delivering them as services, including EDI communication services, credit card processing, online auctions, and electronic invoicing. The technologies used ranged from C/C++ to Java and J2EE platforms and components across a range of operating systems. In 2006, Michael joined IBM, where he works on the J9 JVM and WebSphere Real Time.



02 November 2011

Introduction

One of the advantages of Java is its native support for threading and concurrency. While it is best to avoid synchronization between threads whenever possible, the performance of an application often depends on the Java Virtual Machine's (JVM) ability to handle synchronization quickly and efficiently when it occurs. The IBM JVM has long incorporated optimizations to make synchronization fast and efficient. Refinements made to these optimizations, which are present in the JVM shipped with IBM WebSphere Application Server V8, include techniques that reduce both the memory and CPU requirements for the JVM's support of efficient synchronization, which in turn can lead to improved application performance.

Java has built-in support for synchronization between threads. The synchronized keyword used either on methods or on a block ensures that only a single thread is running the code within that method or block at a time. The keyword also provides certain guarantees on the visibility of data between threads once a thread has run through a synchronized method or block. Some examples include those shown in Listing 1.

Listing 1. Synchronization between threads
public class Test {
   private final Object syncObject = new Object();

   synchronized void syncMethod() {
      /* do something; exclusive lock on this instance of Test */
   }

   public static synchronized void aMethod() {
      /* do something; exclusive lock on the object for the Test class */
   }

   public void bMethod() {
      synchronized (syncObject) {
         /* do something; exclusive lock on syncObject */
      }
   }
}

In order to support synchronization, a JVM must conceptually provide a monitor for each Java object (an object monitor). Only a single thread can own the monitor at a given time, and any other thread that tries to acquire the monitor while it is held (in other words, to "lock" it) will be blocked until it becomes available.
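To see monitor ownership and blocking in action, consider this small self-contained example (the class and field names are illustrative, not from the article): a second thread that tries to enter a synchronized block must wait until the first thread releases the object's monitor.

```java
import java.util.concurrent.CountDownLatch;

public class MonitorDemo {
    static final Object monitor = new Object();
    static final StringBuilder order = new StringBuilder();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (monitor) {          // acquire the object's monitor
                order.append("A");
                held.countDown();             // signal that the monitor is now held
                try { Thread.sleep(100); } catch (InterruptedException e) { }
                order.append("B");
            }                                 // monitor released here
        });
        Thread waiter = new Thread(() -> {
            synchronized (monitor) {          // blocks until holder releases the monitor
                order.append("C");
            }
        });
        holder.start();
        held.await();                         // start waiter only once holder owns the monitor
        waiter.start();
        holder.join();
        waiter.join();
        System.out.println(order);            // ABC
    }
}
```

The waiter's append can only happen after the holder exits its synchronized block, so the output order is deterministic.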

While these monitors could be implemented using operating system locking structures and primitives on a one-for-one basis, the large number of objects and the rate at which they are created and destroyed make this approach inefficient and would result in poor performance. If the JVM had to use an operating system (OS) structure and call out to the OS even when no other thread holds the lock, it would incur both the cost of the OS resources and the additional CPU cycles needed to call out to the operating system.

More recent JVMs also support additional locking functionality through the APIs provided by the classes in java.util.concurrent (juc). The implementation of juc locks differs enough from object monitors that the optimizations and refinements outlined here have not yet been applied to them to the same extent.
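For comparison, here is a minimal sketch of juc-based locking; ReentrantLock is the closest juc analogue to a synchronized block (the class name and counter are illustrative):

```java
import java.util.concurrent.locks.ReentrantLock;

public class JucExample {
    // a juc lock used in place of a synchronized block
    private final ReentrantLock lock = new ReentrantLock();
    private int counter = 0;

    public void increment() {
        lock.lock();           // acquire; blocks if another thread holds the lock
        try {
            counter++;         // critical section
        } finally {
            lock.unlock();     // always release, even if the critical section throws
        }
    }

    public int get() { return counter; }

    public static void main(String[] args) throws InterruptedException {
        JucExample ex = new JucExample();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) ex.increment(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) ex.increment(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(ex.get()); // 2000
    }
}
```

Unlike the synchronized keyword, a juc lock must be released explicitly, which is why the unlock sits in a finally block.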

Now, let's introduce some of the techniques that are used to optimize how monitors are supported within the JVM, as well as the refinements that are available in the JVM shipped with WebSphere Application Server V8: adaptive spinning and lock nursery.

Bimodal locks

In typical applications, most objects are never locked. In well-designed applications, objects that are locked have low contention: in most cases, when a thread needs to lock an object, no other thread is holding it. To take advantage of this behavior, the IBM JVM incorporates bimodal locking techniques.

The key concept is that a monitor has a "thin" lock that can be tested efficiently, but which does not support blocking, and -- only when necessary -- an "inflated" lock. The inflated lock is typically implemented using OS resources that can support blocking, but it is less efficient because of the additional path length required when making calls to the operating system. Because thin locks don't support blocking, spinning is often used: threads spin for a short period of time in case the lock becomes available soon after they first try to acquire it. The general flow for acquiring a monitor based on bimodal locks is shown in Listing 2.

Listing 2. Bimodal locking flow
start:
if (not inflated) {
    for (i = 0; i < spin; i++) {
        if (lock(thin lock) == successful) {
            goto success;
        }
        for (j = 0; j < spin2; j++) { } /* busy wait */
    }
    inflate and create OS resources
}
lock using OS resources
success:
thread holds lock

Of course, a practical implementation is more complicated and includes further optimizations, such as being able to use the thin lock again once there is no longer contention. The research paper A Study of Locking Objects with Bimodal Fields provides a more in-depth explanation of the concepts and potential implementations (see Resources).

The first advantage of using bimodal locks is that no OS resources are required for an object monitor unless the object is locked and there is enough contention that blocked threads exceed the maximum spin. The second advantage is that thin lock acquisition can be as simple as a compare-and-swap on a field in the object header, leading to very fast, low-overhead locking compared to making OS calls.
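The fast path can be sketched as follows. This is a hypothetical illustration, not the JVM's implementation: a real JVM operates on a lock word in the object header and inflates the lock on failure, whereas here an AtomicInteger stands in for that word, and the class, method, and parameter names are all assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of thin-lock acquisition via compare-and-swap.
public class ThinLockSketch {
    private static final int UNLOCKED = 0;
    private final AtomicInteger lockWord = new AtomicInteger(UNLOCKED);

    // Try to take the thin lock for a (nonzero) thread id, spinning up to
    // maxSpins times. Returns true on success; on failure a real JVM would
    // inflate the lock and fall back to the blocking OS path.
    boolean tryThinLock(int threadId, int maxSpins) {
        for (int i = 0; i < maxSpins; i++) {
            if (lockWord.compareAndSet(UNLOCKED, threadId)) {
                return true;      // one atomic instruction on the fast path
            }
            Thread.onSpinWait();  // busy wait; hints the CPU we are spinning
        }
        return false;             // caller falls back to the inflated lock
    }

    void thinUnlock() {
        lockWord.set(UNLOCKED);   // release by restoring the unlocked value
    }
}
```

When no other thread holds the lock, the compare-and-swap succeeds on the first iteration, which is exactly the cheap common case bimodal locking is designed around.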

Tools that report lock usage and identify sources of contention can help you understand bimodal lock behavior in practice. IBM Monitoring and Diagnostic Tools for Java - Health Center and Java Lock Monitor (JLM) are two such tools that you might find useful (see Resources).

Adaptive spinning

While the spin used with the thin lock mentioned above has been shown to significantly improve performance in many cases, it does consume CPU cycles. In some specific cases, when CPU resources could be fully utilized by application threads other than those waiting to acquire a monitor, this spin can prevent useful work from getting done. For example, consider the case shown in Figure 1. Here there are four application threads: two that are synchronized (t1 and t2) and two that have no synchronization (t3 and t4).

Figure 1. Example locking sequences

In the first sequence, assume spinning at the thin lock level; in the second sequence, assume no spinning. Clearly, thread t3 was able to get more work completed in the case without spinning, because more CPU cycles were available to be shared between t2, t3, and t4. This is because t1 was blocked at the OS level.

In the first sequence, t1 gets scheduled in and consumes CPU cycles spinning while t2 still holds the lock; in the second sequence, t1 is not scheduled in, and t3 can use those cycles. The maximum spin counts are selected to minimize this case, and techniques are used to give preference to other threads over spinning threads; however, the issue cannot be completely eliminated through these techniques.

Analysis of typical locking patterns gives us the insight that spinning helps most cases, but for some specific cases it does not. Before running an application, it is impossible to know for which monitors spinning will not be useful. It is possible, however, to observe monitor usage and identify at run time those monitors for which you do not believe spinning will be helpful. You can then reduce or eliminate spinning for those specific monitors.

Consider the case shown in Figure 2, where your maximum spin counts result in a typical maximum spin time of X, while the typical hold time for the monitor is Y, which is much greater than X.

Figure 2. Example spin sequence

Given, as shown, that the maximum spin time is much shorter than the typical hold time, the spin will only help threads that try to acquire the lock near the end of a hold (as shown in green). All other overlapping attempts will fail. Therefore, for monitors with long hold times relative to the maximum spin time (assuming random arrival times), the likelihood that the spin is going to help is quite low. In these cases, it likely makes more sense to immediately go to the blocking OS path without wasting CPU cycles on spinning.

Another example involves monitors where past history shows that you always exceed the spin and have to go to the blocking OS path in order to acquire the lock. While past history is no guarantee of future behavior, it is often a good indicator.

The JVM shipped with WebSphere Application Server V8 includes spinning refinements that capture locking history and use it to adaptively decide which monitors should spin and which should not. This can free up additional cycles for other threads with work to do and, when CPU resources are fully utilized, improve overall application performance. This enhanced function is enabled by default, and applications deployed on WebSphere Application Server V8 will benefit without the need for any application changes or tuning.
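The article does not describe the JVM's internal heuristics, but the general idea of using per-monitor history to decide whether to spin can be sketched as follows. The class name, thresholds, and sampling scheme here are purely illustrative assumptions, not the actual implementation.

```java
// Hypothetical sketch of a per-monitor adaptive spinning policy: record how
// often spinning actually acquired the lock, and stop spinning for monitors
// where it rarely succeeds.
public class AdaptiveSpinPolicy {
    private static final int MIN_SAMPLES = 100;          // history needed before deciding
    private static final double MIN_SUCCESS_RATE = 0.1;  // below this, stop spinning

    private int spinAttempts = 0;
    private int spinSuccesses = 0;

    // Called after each acquisition attempt that involved spinning.
    void recordSpin(boolean acquiredDuringSpin) {
        spinAttempts++;
        if (acquiredDuringSpin) {
            spinSuccesses++;
        }
    }

    // Should the next acquisition attempt on this monitor bother spinning?
    boolean shouldSpin() {
        if (spinAttempts < MIN_SAMPLES) {
            return true;  // not enough history yet; default to spinning
        }
        return (double) spinSuccesses / spinAttempts >= MIN_SUCCESS_RATE;
    }
}
```

A monitor with long hold times relative to the spin, as in Figure 2, would quickly accumulate failed spins and be routed directly to the blocking OS path.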

Lock nursery

The thin lock that forms part of bimodal locking consumes memory, typically in the object header. However, as mentioned earlier, most objects are never locked at all. For objects that are not locked, the space for the thin lock is wasted. For applications with a large number of objects, the total space set aside for thin locks can add up.

The lock nursery is a technique in which thin locks are omitted from the object header for objects that are not expected to be locked, or are expected to be locked only rarely.

Figure 3. Object header shapes

Of course, the JVM must continue to fully support locking on all objects within the JVM and, ideally, bimodal locking as well. Since the thin lock is no longer in the object header, there will be some additional overhead in getting to it when needed; however, this is expected to only occur rarely and has been optimized with caches and other techniques.
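The idea of keeping lock words out of the object header can be sketched with a lazily populated side table. A real JVM uses far more efficient internal structures and caches; the class name and the use of WeakHashMap here are illustrative assumptions only.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch of an out-of-line lock word table: objects whose
// headers carry no thin lock get a lock word allocated lazily, only when
// they are actually locked for the first time.
public class LockNurserySketch {
    // WeakHashMap lets entries for unreachable objects be reclaimed,
    // so unlocked objects never cost any lock word space at all.
    private final Map<Object, int[]> lockWords = new WeakHashMap<>();

    // Look up (or lazily create) the lock word for an object.
    synchronized int[] lockWordFor(Object o) {
        return lockWords.computeIfAbsent(o, k -> new int[1]);
    }

    // How many objects currently have a lock word allocated.
    synchronized int tableSize() {
        return lockWords.size();
    }
}
```

The table lookup is the "additional overhead" the article mentions; it is paid only on the rare objects that are actually locked, while every never-locked object gets a smaller header.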

The net result of having fewer thin locks overall is a smaller heap (Figure 4), which in turn can mean lower OS resource usage and more efficient garbage collection. Freeing these resources for application threads can improve overall application performance.

Figure 4. Smaller heap from smaller object headers

The research paper Space- and Time-Efficient Implementation of the Java Object Model provides a more in-depth description of basic lock nursery techniques (see Resources).

The initial implementation is relatively conservative in choosing the objects that will not have thin locks in the object header; even so, those objects often make up as much as 50% of the heap, which results in significant savings.

It is already known to be good practice to use uniquely named classes for lock objects. For example, the code in Listing 3 shows a better practice than that shown in Listing 4.

Listing 3. Synchronization between threads
private static class SyncClass {}
private final Object syncObject = new SyncClass();

public void method1() {
   synchronized (syncObject) {
      // do something
   }
}
Listing 4. Synchronization between threads
private final Object syncObject = new Object();

public void method1() {
   synchronized (syncObject) {
      // do something
   }
}

This practice is helpful for problem determination because tools often report the class of the object on which contention is occurring. A uniquely named class for the object helps identify where in the application code those objects are being used. For example, the LOCKS information in a javacore dump includes the name of the object, as shown in Listing 5.

Listing 5. Sample Javacore contents
0SECTION       LOCKS subcomponent dump routine
NULL           ===============================
NULL           
1LKPOOLINFO    Monitor pool info:
2LKPOOLTOTAL     Current total number of monitors: 4
NULL           
1LKMONPOOLDUMP Monitor Pool Dump (flat & inflated object-monitors):
2LKMONINUSE      sys_mon_t:0x05416E30 infl_mon_t: 0x05416E6C:
3LKMONOBJECT       test/LockComparison$SyncClass@0xE65D4848/0xE65D484C: owner "Thread-26"
3LKWAITERQ            Waiting to enter:
3LKWAITER                "Thread-21" (0x04AF6100)
3LKWAITER                "Thread-22" (0x04AD4700)
3LKWAITER                "Thread-23" (0x04A7E100)
3LKWAITER                "Thread-24" (0x04AD4D00)
3LKWAITER                "Thread-25" (0x04DA4800)
3LKWAITER                "Thread-27" (0x04AF6700)
3LKWAITER                "Thread-28" (0x04BBB100)

If you had locked on a java/lang/Object, as opposed to a test/LockComparison$SyncClass, it would have been more difficult to identify where in the code the contention occurred.

In Listing 3, there might be a single location in the code where instances of the inner SyncClass are used. If you had simply used Object instead, the class name would give you no information about where the contention might be occurring. An additional advantage of following this pattern is that it might enable your application to benefit from more aggressive lock nursery implementations in the future, resulting in fewer unused thin locks and smaller heaps.

Conclusion

This article presented a brief introduction to synchronization support within the JVM, along with the optimizations the JVM uses to make it fast and efficient. Building on these concepts, it described the refinements to these optimizations in the JVM shipped with WebSphere Application Server V8, and how they can reduce CPU and memory consumption and potentially lead to improved overall application performance.

These new features are likely the best choice for most, if not all, applications and should result in a net benefit as you migrate to WebSphere Application Server V8. The Java diagnostic guides document command-line options that can be used to control the new behaviors if you run into a situation where you believe this is not the case.

This article described but two of the many performance optimizations that have been made in the JVM in WebSphere Application Server V8, not to mention the many optimizations made in WebSphere Application Server V8 itself. Now is a good time to try out WebSphere Application Server V8 and to take advantage of all the new features and enhancements available.

Resources

