Threading lightly, Part 2: Reducing contention

Improve application performance by staying out of your own way

In Part 1 of this series, we examined the performance overhead of uncontended synchronization. While it's common to hear that synchronized method calls can be 50 times as expensive as unsynchronized method calls, such numbers can be quite misleading. With each successive JVM version, overall performance has improved and the cost of uncontended synchronization has been reduced, making uncontended synchronization overhead less significant. Contended synchronization, however, is quite expensive. Moreover, a high degree of contention is disastrous for scalability -- an application that has a high degree of contended synchronization will exhibit markedly worse performance as the load increases. This article explores several techniques for reducing contention, and hence improving scalability, in your programs.

Brian Goetz (brian@quiotix.com), Software Consultant, Quiotix Corp.

Brian Goetz is a software consultant and has been a professional software developer for the past 15 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California. See a list of Brian's published and upcoming articles in popular industry publications.



05 September 2001

Also available in Japanese

When we say a program is "too slow," we are generally referring to one of two performance attributes -- latency or scalability. Latency describes how long it takes for a given task to complete, whereas scalability describes how a program's performance varies under increasing load or given increased computing resources. A high degree of contention is bad for both latency and scalability.

Why contention is such a problem

Contended synchronizations are slow because they involve multiple thread switches and system calls. When multiple threads contend for the same monitor, the JVM has to maintain a queue of threads waiting for that monitor (and this queue must be synchronized across processors), which means more time spent in the JVM or OS code and less time spent in your program code. Moreover, contention impairs scalability because it forces the scheduler to serialize operations, even if a free processor is available. When one thread is executing a synchronized block, any thread waiting to enter that block is stalled. If no other threads are available for execution, then processors may sit idle.

If we want to write scalable multithreaded programs, we must reduce contention for critical resources. There are a number of techniques for doing so, but before you can apply any of them, you need to take a good look at your code and figure out under what conditions you will be synchronizing on common monitors. Determining which locks are bottlenecks can be quite difficult; locks are sometimes hidden within class libraries or specified implicitly through synchronized methods, and so are less obvious when reviewing the code. Moreover, the current state of tools for detecting contention is quite poor.


Technique 1: Get in, get out

Don't miss the rest of this series

Part 1, "Synchronization is not the enemy" (July 2001)

Part 3, "Sometimes it's best not to share" (October 2001)

One obvious technique for reducing the likelihood of contention is to make synchronized blocks as short as possible. The shorter the time a thread holds a given lock, the lower the probability that another thread will request it while the first thread is holding it. So while you should use synchronization to access or update shared variables, it is usually better to do any thread-safe pre-processing or post-processing outside of the synchronized block.

Listing 1 demonstrates this technique. Our application maintains a HashMap representing attributes of various entities; one such attribute is the list of access rights a given user has, stored as a comma-separated list. The method userHasAdminAccess() looks up a user's access rights in the global attributes table and checks whether the user has the right called "ADMIN".

Listing 1. Spending more time in a synchronized block than necessary
  public boolean userHasAdminAccess(String userName) {
    synchronized (attributesMap) {
      String rights = (String) attributesMap.get("users." + userName + ".accessRights");
      if (rights == null) 
        return false;
      else
        return (rights.indexOf("ADMIN") >= 0);
    }
  }

This version of userHasAdminAccess is thread-safe, but holds the lock for much longer than necessary. To create the concatenated string "users.brian.accessRights", the compiler will create a temporary StringBuffer object, call StringBuffer.append three times, and then call StringBuffer.toString, which means at least two object creations and several method calls. It will then call HashMap.get to retrieve the string, and then call String.indexOf to extract the desired rights identifier. As a percentage of the total work done in this method, the pre- and post-processing are significant; because they are thread-safe, it makes sense to move them out of the synchronized block, as shown in Listing 2.

Listing 2. Reducing the time spent in a synchronized block
  public boolean userHasAdminAccess(String userName) {
    String key = "users." + userName + ".accessRights";
    String rights;

    synchronized (attributesMap) {
      rights = (String) attributesMap.get(key);
    }
    return ((rights != null) 
            && (rights.indexOf("ADMIN") >= 0));
  }

On the other hand, it's possible to take this technique too far. If you have two operations that require synchronization on the same monitor, separated only by a small block of thread-safe code, you are generally better off just using a single synchronized block.
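To make that tradeoff concrete, here is a hypothetical sketch (the Counters class and its method names are invented for illustration, not from the article): the split version acquires the same lock twice in quick succession, while the merged version holds it once for both updates. When the code between the two blocks is short, the merged form is usually cheaper.

```java
import java.util.*;

public class Counters {
  private final Map counts = new HashMap();

  // Split version: acquires the lock on counts twice in quick succession.
  public void recordTwoSplit(String a, String b) {
    synchronized (counts) { increment(a); }
    // a trivially cheap, thread-safe computation could sit here
    synchronized (counts) { increment(b); }
  }

  // Merged version: one acquisition covering both updates -- usually
  // cheaper when the intervening work is small.
  public void recordTwoMerged(String a, String b) {
    synchronized (counts) {
      increment(a);
      increment(b);
    }
  }

  // Caller must hold the lock on counts.
  private void increment(String key) {
    Integer old = (Integer) counts.get(key);
    counts.put(key, new Integer(old == null ? 1 : old.intValue() + 1));
  }

  public int get(String key) {
    synchronized (counts) {
      Integer v = (Integer) counts.get(key);
      return v == null ? 0 : v.intValue();
    }
  }
}
```

Merging the blocks also avoids the subtle correctness issue that another thread could observe (or modify) the map between the two split acquisitions.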


Technique 2: Reducing lock granularity

Another valuable technique for reducing contention is to spread your synchronizations over more locks. For example, suppose that you have a class that stores user information and service information in two separate hash tables, as shown in Listing 3.

Listing 3. An opportunity for reducing lock granularity
public class AttributesStore {
  private HashMap usersMap = new HashMap();
  private HashMap servicesMap = new HashMap();

  public synchronized void setUserInfo(String user, UserInfo userInfo) {
    usersMap.put(user, userInfo);
  }

  public synchronized UserInfo getUserInfo(String user) {
    return (UserInfo) usersMap.get(user);
  }

  public synchronized void setServiceInfo(String service, 
                                          ServiceInfo serviceInfo) {
    servicesMap.put(service, serviceInfo);
  }

  public synchronized ServiceInfo getServiceInfo(String service) {
    return (ServiceInfo) servicesMap.get(service);
  }
}

Here, the accessor methods for user and service data are synchronized, which means that they are synchronizing on the AttributesStore object. While this is perfectly thread-safe, it increases the likelihood of contention for no real benefit. If a thread is executing setUserInfo, it means that not only will other threads be locked out of setUserInfo and getUserInfo, as is desired, but they will also be locked out of getServiceInfo and setServiceInfo.

This problem can be avoided by having the accessors synchronize on the actual objects being shared (the usersMap and servicesMap objects), as shown in Listing 4.

Listing 4. Reducing lock granularity
public class AttributesStore {
  private HashMap usersMap = new HashMap();
  private HashMap servicesMap = new HashMap();

  public void setUserInfo(String user, UserInfo userInfo) {
    synchronized(usersMap) {
      usersMap.put(user, userInfo);
    }
  }

  public UserInfo getUserInfo(String user) {
    synchronized(usersMap) {
      return (UserInfo) usersMap.get(user);
    }
  }

  public void setServiceInfo(String service, 
                             ServiceInfo serviceInfo) {
    synchronized(servicesMap) {
      servicesMap.put(service, serviceInfo);
    }
  }

  public ServiceInfo getServiceInfo(String service) {
    synchronized(servicesMap) {
      return (ServiceInfo) servicesMap.get(service);
    }
  }
}

Now threads accessing the services map will not contend with threads trying to access the users map. (In this case, the same effect could also be obtained by creating the maps with the synchronized wrapper mechanism provided by the Collections framework, Collections.synchronizedMap.) Assuming that requests against the two maps are evenly distributed, this technique would cut the number of potential contentions in half.
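A version of the class using the synchronized wrapper just mentioned might look like the sketch below. UserInfo and ServiceInfo are replaced with plain Objects to keep the sketch self-contained, and the class name AttributesStore2 is invented; each wrapper synchronizes internally on its own map, so user and service lookups no longer contend with each other.

```java
import java.util.*;

public class AttributesStore2 {
  // Each synchronizedMap wrapper guards only its own map, giving the
  // same finer lock granularity as Listing 4 without explicit blocks.
  private final Map usersMap = Collections.synchronizedMap(new HashMap());
  private final Map servicesMap = Collections.synchronizedMap(new HashMap());

  public void setUserInfo(String user, Object userInfo) {
    usersMap.put(user, userInfo);
  }

  public Object getUserInfo(String user) {
    return usersMap.get(user);
  }

  public void setServiceInfo(String service, Object serviceInfo) {
    servicesMap.put(service, serviceInfo);
  }

  public Object getServiceInfo(String service) {
    return servicesMap.get(service);
  }
}
```

Note that compound operations (check-then-act sequences spanning several calls) would still need an explicit synchronized block on the relevant map.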


Applying Technique 2 to HashMap

One of the most common contention bottlenecks in server-side Java applications is the HashMap. Applications use HashMap to cache all sorts of critical shared data (user profiles, session information, file contents), and a single HashMap.get() call may execute a substantial number of bytecode instructions while the lock is held. For example, if you are writing a Web server and all your cached pages are stored in a HashMap, every request will want to acquire and hold the lock on that map, and it will become a bottleneck.

We can extend the lock granularity technique to handle this situation, although we must be careful as there are some potential Java Memory Model (JMM) hazards associated with this approach. The LockPoolMap in Listing 5 exposes thread-safe get() and put() methods, but spreads the synchronization over a pool of locks, reducing contention substantially.

LockPoolMap is thread-safe and functions like a simplified HashMap, but has more attractive contention properties. Instead of synchronizing on the entire map for each get() or put() operation, synchronization is done at the bucket level: there is one lock per bucket, and that lock is acquired whenever a bucket is traversed, whether for read or write. The locks are created when the map is created; creating them lazily would introduce JMM problems.

If you create a LockPoolMap with many buckets, many threads can use the map concurrently with a much lower likelihood of contention. The reduced contention does not come for free, however. By not synchronizing on a global lock, it becomes much more difficult to perform operations that act on the map as a whole, such as the size() method. An implementation of size() would have to sequentially acquire the lock for each bucket, count the nodes in that bucket, then release the lock and move on to the next bucket. But once a lock is released, other threads are free to modify that bucket again, so by the time size() finishes counting, the result may already be stale. Still, the LockPoolMap technique works quite well in some situations, such as shared caches, where an exact element count is rarely needed.

Listing 5. Reducing locking granularity on a HashMap
import java.util.*;

/**
 * LockPoolMap implements a subset of the Map interface (get and put)
 * and performs synchronization at the bucket level, not at the map
 * level.  This reduces contention, at the cost of losing some Map
 * functionality, and is well suited to simple caches.  The number of
 * buckets is fixed and does not increase.
 */

public class LockPoolMap {
  private Node[] buckets;
  private Object[] locks;

  private static final class Node {
    public final Object key;
    public Object value;
    public Node next;

    public Node(Object key) { this.key = key; }
  }

  public LockPoolMap(int size) {
    buckets = new Node[size];
    locks = new Object[size];
    for (int i = 0; i < size; i++)
      locks[i] = new Object();
  }

  private final int hash(Object key) {
    int hash = key.hashCode() % buckets.length;
    if (hash < 0)
      hash *= -1;
    return hash;
  }

  public void put(Object key, Object value) {
    int hash = hash(key);

    synchronized(locks[hash]) {
      Node m;
      for (m=buckets[hash]; m != null; m=m.next) {
        if (m.key.equals(key)) {
          m.value = value;
          return;
        }
      }

      // We must not have found it, so put it at the beginning of the chain
      m = new Node(key);
      m.value = value;
      m.next = buckets[hash];
      buckets[hash] = m;
    }
  }

  public Object get(Object key) {
    int hash = hash(key);

    synchronized(locks[hash]) {
      for (Node m=buckets[hash]; m != null; m=m.next) 
        if (m.key.equals(key))
          return m.value;
    }
    return null;
  }
}
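To make the size() hazard described above concrete, here is a hedged sketch of a LockPoolMap variant with such a method. The fields, constructor, and put() mirror Listing 5 (reproduced so the sketch is self-contained), and the class name LockPoolMapWithSize is invented; the count returned is only a snapshot, since buckets counted early may change before the method returns.

```java
public class LockPoolMapWithSize {
  private final Node[] buckets;
  private final Object[] locks;

  private static final class Node {
    final Object key;
    Object value;
    Node next;
    Node(Object key) { this.key = key; }
  }

  public LockPoolMapWithSize(int size) {
    buckets = new Node[size];
    locks = new Object[size];
    for (int i = 0; i < size; i++)
      locks[i] = new Object();
  }

  private int hash(Object key) {
    // Remainder magnitude is always < buckets.length, so abs is safe.
    return Math.abs(key.hashCode() % buckets.length);
  }

  public void put(Object key, Object value) {
    int h = hash(key);
    synchronized (locks[h]) {
      for (Node m = buckets[h]; m != null; m = m.next) {
        if (m.key.equals(key)) { m.value = value; return; }
      }
      Node m = new Node(key);
      m.value = value;
      m.next = buckets[h];
      buckets[h] = m;
    }
  }

  // The size() described in the text: acquire each bucket lock in turn,
  // count that bucket, release, and move on.  Buckets counted earlier
  // may change before the method returns, so the result is a snapshot.
  public int size() {
    int n = 0;
    for (int i = 0; i < buckets.length; i++) {
      synchronized (locks[i]) {
        for (Node m = buckets[i]; m != null; m = m.next)
          n++;
      }
    }
    return n;
  }
}
```

In a quiescent map (no concurrent writers) the snapshot is exact; under concurrent modification it is only approximate, which is usually acceptable for a cache.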

Table 1 compares the performance of three shared map implementations: a synchronized HashMap, an unsynchronized HashMap (not thread-safe), and a LockPoolMap. The unsynchronized version is present only to show the overhead of contention. A test that performs random put() and get() operations on the map was run with a varying number of threads on a dual-processor Linux system using the Sun 1.3 JDK. The table shows the run time for each combination. This test is somewhat of an extreme case; the test programs do nothing but access the map, so there will be many more contentions than in a realistic program, but it is designed to illustrate the performance penalty of contention.

Table 1. Scalability comparison between HashMap and LockPoolMap

Threads   Unsynchronized HashMap (unsafe)   Synchronized HashMap   LockPoolMap
1                    1.1                             1.4                1.6
2                    1.1                            57.6                3.7
4                    2.1                           123.5                7.7
8                    3.7                           272.3               16.7
16                   6.8                           577.0               37.9
32                  13.5                          1233.3               80.5

While all the implementations exhibit similar scaling characteristics at large numbers of threads, the synchronized HashMap shows a huge performance penalty in going from one thread to two, because there is contention on every single put() and get() operation. With more than one thread, the LockPoolMap technique is approximately 15 times faster than the synchronized HashMap. This difference reflects the time lost to scheduling overhead and to idle time spent waiting to acquire locks. The advantage of LockPoolMap would be even larger on a system with more processors.
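For readers who want to reproduce a rough version of this experiment, the sketch below spins up several threads doing random get()/put() calls against a map wrapped with Collections.synchronizedMap. The thread count, key range, and operation count are arbitrary illustrative choices, not the parameters used for Table 1.

```java
import java.util.*;

public class MapContentionBench {
  // Every get() and put() on this map contends for the same lock.
  static final Map shared = Collections.synchronizedMap(new HashMap());

  public static void main(String[] args) throws InterruptedException {
    final int opsPerThread = 100000;
    int nThreads = 4;
    Thread[] threads = new Thread[nThreads];
    long start = System.currentTimeMillis();

    for (int t = 0; t < nThreads; t++) {
      final Random r = new Random(t);
      threads[t] = new Thread() {
        public void run() {
          for (int i = 0; i < opsPerThread; i++) {
            Integer key = new Integer(r.nextInt(1000));
            if (r.nextBoolean())
              shared.put(key, key);   // contends on the map's lock
            else
              shared.get(key);        // so does every get()
          }
        }
      };
      threads[t].start();
    }
    for (int t = 0; t < nThreads; t++)
      threads[t].join();

    System.out.println("elapsed ms: "
                       + (System.currentTimeMillis() - start));
  }
}
```

Swapping in an unsynchronized HashMap here would not be safe with multiple threads; that variant is useful only as a single-thread baseline.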


Technique 3: Lock collapsing

Another technique that may improve performance is called "lock collapsing" (see Listing 6). Recall that the methods of the Vector class are nearly all synchronized. Imagine that you have a Vector of String values, and you are searching for the longest String. Suppose further that you know that elements will be added only at the end, and that they will not be removed, making it (mostly) safe to access the data as shown in the getLongest() method, which simply loops through the elements of the Vector, calling elementAt() to retrieve each one.

The getLongest2() method is very similar, except that it obtains the lock on the Vector before starting the loop. The result of this is that when elementAt() attempts to acquire the lock, the JVM sees that the current thread already has it, and will not contend. It lengthens the synchronized block, which appears to be in opposition to the "get in, get out" principle, but because it is avoiding so many potential synchronizations, it can be considerably faster, as less time will be lost to the scheduling overhead.

On a dual-processor Linux system running the Sun 1.3 JDK, a test program with two threads that simply looped calling getLongest2() was more than 10 times faster than one that called getLongest(). While both programs had the same degree of serialization, much less time was lost to scheduling overhead. Again, this is an extreme example, but it shows that the scheduling overhead of contention is not trivial. Even when run with a single thread, the collapsed version was some 30 percent faster: it is much faster to acquire a lock that you already hold than one that nobody holds.

Listing 6. Lock collapsing
  Vector v;
  ...
 
  public String getLongest() {
    int maxLen = 0;
    String longest = null;

    for (int i=0; i<v.size(); i++) {
      String s = (String) v.elementAt(i);
      if (s.length() > maxLen) {
        maxLen = s.length();
        longest = s;
      }
    }
    return longest;
  }

  public String getLongest2() {
    int maxLen = 0;
    String longest = null;

    synchronized (v) { 
      for (int i=0; i<v.size(); i++) {
        String s = (String) v.elementAt(i);
        if (s.length() > maxLen) {
          maxLen = s.length();
          longest = s;
        }  
      }  
      return longest;
    }
  }

Conclusion

Contended synchronization can have a serious impact on the scalability of your programs. Even worse, unless you perform realistic load testing, contention-related performance problems do not always present themselves during the development and testing process. The techniques presented in this article are effective for reducing the cost of contention in your programs, and increasing the load they can bear before exhibiting nonlinear scaling behavior. But before you can apply these techniques, you first have to analyze your program to determine where contention is likely to occur.

In the final installment in this series, we'll examine ThreadLocal, an oft-neglected facility of the Thread API. By using ThreadLocal, we can reduce contention by giving each thread its own copy of certain critical objects. Stay tuned!
