Use HashMap for better multithreaded performance

A technique that improves performance for applications with infrequently updated read-write maps

Comments

A disadvantage to using data shared among multiple Java threads is that access to the data must be synchronized to avoid an inconsistent view of the contents. For example, the Hashtable class' put() and get() methods are synchronized. Synchronization is required so simultaneous put() and get() methods have sole access to the data when executing.

The synchronization points around those methods can become bottlenecks when an application's threads access those methods too frequently. Only one thread at a time gets access to the contents. The other threads must wait their turn, which hurts performance and throughput.

For data that changes infrequently, we've developed a technique that allows use of a HashMap, rather than a Hashtable. In a HashMap, the get() and put() methods are not synchronized. The developer is responsible for ensuring that get and put operations never execute at the same time. Our technique gives you a way to do this. (The approach works even if data is updated frequently, but the performance advantage is lost. Repopulating a HashMap offsets the performance gained by avoiding synchronized accessor methods.)

The technique takes advantage of two other characteristics of the Java language:

  • Automatic garbage collection— When the last reference to an object has gone away, the Java run time can free the object automatically. No action by the application is required other than making sure that no object references remain when the application is done using the object.
  • Atomicity of object references— A simple assignment statement that gets access to an object cannot be interrupted. As a result, it's not necessary to synchronize around a single-object assignment statement. If no synchronization point is present, the possibility of queuing for access to the object is eliminated.

You can take advantage of these Java language characteristics when data contained in a list-type object does change by keeping two separate instances of the object. Once one is populated, it doesn't change again. It is effectively immutable. It would be dangerous if the get and put operations were allowed to execute simultaneously. The technique we describe here ensures that all puts complete before any get can execute.

The technique

The sample code in Listing 1 illustrates the technique.

Listing 1. Producer/consumer code that avoids queuing points
static volatile Map currentMap = null;   // this must be volatile
static Object lockbox = new Object();  

public static void buildNewMap() {       // this is called by the producer     
    Map newMap = new HashMap();          // when the data needs to be updated

    synchronized (lockbox) {                 // this must be synchronized because
                                         // of the Java memory model
      // .. do stuff to put things in newMap
      newMap.put(....);
      newMap.put(....);
   }                 
/* After the above synchronization block, everything that is in the HashMap is 
   visible outside this thread */

/* Now make the updated set of values available to the consumer threads.  
   As long as this write operation can complete without being interrupted, 
   and is guaranteed to be written to shared memory, and the consumer can 
   live with the out of date information temporarily, this should work fine */

    currentMap = newMap;

}
public static Object getFromCurrentMap(Object key) {
    Map m = null;
    Object result = null;

    m = currentMap;               // no locking around this is required
    if (m != null) {              // should only be null during initialization
      Object result = m.get(key); // get on a HashMap is not synchronized
    
      // Do any additional processing needed using the result
    }
    return(result);

}

Here's what happening in Listing 1:

  • A class variable points to the current live version of a fully populated HashMap. That variable — called currentMap in Listing 1 — has an initial value of null. If a get needs to read an object from currentMap when it has a null value, the get should be retried later.
  • A second variable — called newMap in Listing 1 — holds the HashMap that is being populated with data. This variable is used by just one thread at a time — a producer thread whose job is to:

    • Create a new HashMap and store it in the newMap variable.
    • Perform a complete set of put operations on newMap such that all data that is needed by the consumer threads are in newMap.
    • When newMap is completely populated, assign the value of the newMap to currentMap. Note that this assignment statement is a unit operation and does not need to be synchronized, even though other threads might be accessing the object's value.

    The producer thread can be executed periodically, as a result of a timer, or it can be a listener that is awakened when some external data, such as a database, has changed.

  • Consumer threads that need to consume the contents of currentMap simply access the object and perform get operations.

Discussion

Once the newMap has been assigned to currentMap, the contents never change. Effectively, the HashMap is immutable. This allows multiple get operations to run in parallel, which can be a major performance boost.

The only thing that might change while the data is being read is the object reference to the currentMap variable. The producer might overlay the current value with a new value at the same time the consumer threads access the value. Because object references are unit operations in the Java language, neither the producer nor the consumer needs to synchronize when accessing that object. The worst that could happen is the consumer gets a reference to currentMap, then the producer overlays that reference with newer contents. In that case, the consumer thread uses data that is slightly out of date. The same result would occur if the consumer thread had executed a second before the producer thread was ready to run. Typically, this should not cause any problems.

When this race does occur, the consumer threads could have a reference to the "old" version of the data. The "new" object reference has overwritten the old one, but some consumers still have a reference to the old one. When the last consumer finishes referencing the old object, the object goes out of scope and is garbage collected. The Java run time keeps track of when that occurs. The application does not need to free the old object explicitly because it happens automatically.

A new version of currentMap might be created periodically based on the application's needs. By following the steps above, you can ensure that those updates occur safely and repeatedly.

The synchronized block and the volatile keyword in Listing 1 are required because no happens-before relationship exists between the writes to the currentMap and the reads from currentMap. As a result, reading threads could see garbage if the synchronized block and the volatile keyword were not used. The producer thread is putting a data structure where other threads will read it. It is responsible for ensuring that the consumer threads see a consistent view. It is aided by the fact that the data structure is not modified after publication. In this case — publishing an effectively immutable object graph — all that is required is to publish the root object reference safely. The easiest way is to make the reference volatile. You could also synchronize access to the root reference, but that would be a possible queuing point. We're trying to avoid queuing points. Brian Goetz refers to this approach as the "cheap read-write lock" trick (see Related topics).

Conclusion

This article's technique is applicable to any situation where shared data changes infrequently and is accessed simultaneously by multiple threads of execution. It is applicable only to situations in which having the absolute latest data is not a requirement of the application.

The end result is nonsynchronized access to shared data that can change over time. In environments where high performance is required, this technique lets you avoid having unnecessary queuing points within the application.

It's important to note that because of the intricacies of the Java memory model, the technique described here works only in Java 1.5 and later. In earlier Java versions, the client application is at risk of viewing an incompletely populated HashMap.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Java development
ArticleID=253880
ArticleTitle=Use HashMap for better multithreaded performance
publish-date=09112007