Java theory and practice: Hey, where'd my thread go?

Learn how to avoid thread leakage in server applications

If you're not careful, threads can disappear from server applications without a (stack) trace. In this article, threading expert Brian Goetz offers some techniques for both prevention and detection of threads going AWOL.


Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix Corp

Brian Goetz is a software consultant and has been a professional software developer for the past 15 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California. See Brian's published and upcoming articles in popular industry publications.



01 September 2002

Also available in Russian and Japanese.

When the main thread in a single-threaded application throws an uncaught exception, you are likely to notice because the stack trace is printed on the console (and because the program stops). But in a multithreaded application, especially one that runs as a server and is not attached to a console, thread death may be a less noticeable event, resulting in partial system failures that can cause confusing application behavior.

In July's installment of Java theory and practice, we looked at thread pools, and examined how an improperly written thread pool could "leak" threads, until eventually all the threads were gone. Most thread pool implementations guard against this by catching thrown exceptions or restarting threads that die, but the problem of thread leakage is not limited to thread pools -- server applications that use threads to service work queues can have this problem too. When a server application loses a worker thread, the application may appear to work fine for quite a while, making the true cause of the problem difficult to identify.

Many applications use threads to provide background services -- processing tasks from an event queue, reading commands from a socket, or performing a long-running task outside of the UI thread. What happens when one of these threads dies due to having thrown an uncaught RuntimeException or Error, or simply stalls, waiting on a blocked I/O operation that wasn't supposed to block?

Sometimes the user will notice that no progress is being made, for example when the thread was performing a long-running, user-initiated task such as spell checking, and may abort the operation or the program. At other times, background threads perform housekeeping tasks, and their disappearance may go unnoticed for a long time.

An example server application

Consider this hypothetical middleware server application, which aggregates messages from a variety of input sources and then submits them to an external server application, receives responses from the external application, and routes the responses back to the appropriate input source. For each of the input sources, there is a plug-in that receives its input messages in its own way -- by scanning a directory of files, waiting on a socket connection, polling a database table, and so on. The plug-ins may be written by third parties, even though they are run in the server JVM. This application has (at least) two internal work queues -- messages received from a plug-in waiting to be sent to the server (the "outgoing messages" queue), and responses received back from the server waiting to be delivered to the appropriate plug-in (the "incoming responses" queue). Messages are routed back to the originating plug-in by calling a service routine, incomingResponse(), on the plug-in object.

After a message is received from a plug-in, it is placed on the outgoing messages queue. Messages on the outgoing messages queue are processed by one or more threads that read each message from the queue, make a note of its source, and submit it to the remote server application (say, through a Web services interface). The remote application eventually responds, also through a Web services interface, and our server places the received response on the incoming responses queue. One or more response threads read responses from the incoming responses queue and route them to the appropriate plug-in, completing the round trip.

In this application, we have two message queues -- for outgoing requests and incoming responses -- and perhaps additional queues within the various plug-ins. We also have several service threads -- one that reads requests off the outgoing message queue and submits them to the external server, one that reads the responses off the incoming responses queue and routes them to the plug-in, and perhaps threads within the plug-ins for servicing sockets or other external request sources.
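To make the moving parts concrete, here is a minimal sketch of the middleware's two queues and its two service threads. It assumes java.util.concurrent (JDK 5 and later), and the names PlugIn, Message, MiddlewareSketch, startSender(), and startResponseDispatcher() are hypothetical. For brevity, the sender thread calls a stand-in for the external service synchronously and queues its reply directly, whereas the real application would receive responses asynchronously through its Web services interface.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Hypothetical plug-in callback, named after incomingResponse() in the text.
interface PlugIn {
    void incomingResponse(String response);
}

// Hypothetical message: the id of the originating plug-in plus a payload.
final class Message {
    final String sourceId;
    final String body;

    Message(String sourceId, String body) {
        this.sourceId = sourceId;
        this.body = body;
    }
}

class MiddlewareSketch {
    final BlockingQueue<Message> outgoingMessages = new LinkedBlockingQueue<>();
    final BlockingQueue<Message> incomingResponses = new LinkedBlockingQueue<>();
    final Map<String, PlugIn> plugIns = new ConcurrentHashMap<>();

    // Called by a plug-in when its input source produces a message.
    void submit(String sourceId, String body) {
        outgoingMessages.add(new Message(sourceId, body));
    }

    // Sender thread: outgoing queue -> external service -> incoming queue.
    // externalService is a stand-in for the remote application.
    Thread startSender(UnaryOperator<String> externalService) {
        return startDaemon("request-sender", () -> {
            Message m = outgoingMessages.take();
            String reply = externalService.apply(m.body);
            incomingResponses.add(new Message(m.sourceId, reply));
        });
    }

    // Response-dispatching thread: incoming queue -> originating plug-in.
    Thread startResponseDispatcher() {
        return startDaemon("response-dispatcher", () -> {
            Message r = incomingResponses.take();
            PlugIn p = plugIns.get(r.sourceId);
            if (p != null) {
                p.incomingResponse(r.body);   // unguarded, as in Listing 1
            }
        });
    }

    // One iteration of a service loop; it may block on a queue.
    interface Step {
        void run() throws InterruptedException;
    }

    private static Thread startDaemon(String name, Step step) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    step.run();
                }
            } catch (InterruptedException e) {
                // Interrupted: leave the loop and let the thread end.
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Note that the dispatcher calls the plug-in with no protection at all; that is precisely the weakness the rest of this article is concerned with.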


It's not always obvious when a thread fails

What happens if one of these threads, such as the response-dispatching thread, disappears? Because the plug-ins can still submit new messages, they might not notice immediately that something is wrong. Messages still arrive through the various input sources, and they are still submitted through our application to the external service. Because a plug-in does not expect its response immediately, it has no idea, yet, that there is a problem. Meanwhile, the undelivered responses pile up in the incoming responses queue. If that queue is held in memory, we might eventually run out of memory. Even if it is not, at some point someone will notice that responses are not being delivered, but that might take a while, since the rest of the system is still functioning properly.
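Because nothing in the application itself reports the failure, one way to surface it sooner is a periodic watchdog. The following is a minimal sketch, assuming the response-dispatching thread calls a hypothetical Heartbeat.beat() each time it delivers a response and that the watchdog can read the queue's backlog. It uses the standard Thread.isAlive() to detect a dead thread, and it treats a thread that is alive but has made no progress while work is waiting as stalled. Heartbeat, ThreadWatchdog, and the threshold parameter are illustrative names, not part of the original application.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical heartbeat: the worker records when it last made progress.
class Heartbeat {
    private final AtomicLong lastBeatNanos = new AtomicLong(System.nanoTime());

    void beat() {
        lastBeatNanos.set(System.nanoTime());
    }

    long lastBeatNanos() {
        return lastBeatNanos.get();
    }
}

class ThreadWatchdog {
    enum Status { OK, DEAD, STALLED }

    // Classifies a worker. nowNanos is a parameter so the check is easy to test;
    // a real watchdog would pass System.nanoTime().
    static Status check(Thread worker, Heartbeat heartbeat, int backlog,
                        long nowNanos, long stallThresholdNanos) {
        if (!worker.isAlive()) {
            return Status.DEAD;      // e.g. killed by an uncaught RuntimeException
        }
        // An idle worker blocked on an empty queue is fine; a worker that has
        // made no progress while responses are waiting is not.
        if (backlog > 0 && nowNanos - heartbeat.lastBeatNanos() > stallThresholdNanos) {
            return Status.STALLED;   // e.g. blocked on I/O or on a monitor
        }
        return Status.OK;
    }
}
```

A background timer could run this check every few seconds and log, or raise an alert, on anything other than OK.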

When the major task-handling aspects are handled with thread pools instead of single threads, there is a certain degree of insurance against the consequences of occasional thread leakage, because a thread pool that performs well with eight threads will probably still do its job acceptably with seven. At first, there might not be any perceptible difference. Eventually, though, system performance will degrade, albeit perhaps in a subtle way.
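The following sketch illustrates that gradual degradation. LeakyPool is a deliberately naive, hypothetical fixed-size pool whose workers do not catch exceptions thrown by tasks, so every failing task permanently removes one worker, while the surviving workers go on executing tasks. The per-thread handler that discards exceptions (Thread.setUncaughtExceptionHandler, JDK 5 and later) stands in for a server with no console on which to print stack traces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Deliberately naive pool: a task that throws kills the worker that ran it.
class LeakyPool {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final List<Thread> workers = new ArrayList<>();

    LeakyPool(int size) {
        for (int i = 0; i < size; i++) {
            Thread w = new Thread(() -> {
                try {
                    while (true) {
                        queue.take().run();   // an exception here ends the thread
                    }
                } catch (InterruptedException e) {
                    // shutdown requested
                }
            }, "leaky-worker-" + i);
            w.setDaemon(true);
            // Simulate "no console": swallow the stack trace instead of printing it.
            w.setUncaughtExceptionHandler((t, e) -> { });
            workers.add(w);
            w.start();
        }
    }

    void execute(Runnable task) {
        queue.add(task);
    }

    // Number of workers still running; it only ever goes down.
    int liveWorkers() {
        int n = 0;
        for (Thread w : workers) {
            if (w.isAlive()) {
                n++;
            }
        }
        return n;
    }
}
```

Starting with eight workers and submitting three failing tasks leaves five live workers that still process new work, so from the outside nothing looks wrong except reduced capacity.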

The problem with thread leakage in server applications is that it is not always easy to detect from the outside. Because most threads handle only a portion of a server's workload, or perhaps only a specific type of background task, the program can appear to the user to be functioning properly when it has in fact suffered a serious failure. This, coupled with the fact that the factors that cause thread leakage do not always leave evidence, can lead to surprising or even mystifying application behavior.


RuntimeException is the leading cause of thread death

Threads can disappear when they throw an uncaught exception or error, or they might simply stop working when they wait on an I/O operation that will never complete or on a monitor for which no one will ever call notify(). The most common source of unexpected thread death is a thrown RuntimeException (such as NullPointerException, ArrayIndexOutOfBoundsException, and the like). In our example application, one place where a RuntimeException is likely to be thrown is when a response is handed back to the plug-in by calling incomingResponse() on the plug-in object. The plug-in code may have been written by a third party, or after the application itself was written, so the application writer cannot audit it for correctness. If the response service thread terminates whenever some plug-in throws a RuntimeException, then one faulty plug-in can take down the entire system. Unfortunately, this vulnerability is quite common.

While we are expected to code defensively against checked exceptions (the compiler forces us to), most Java developers largely ignore unchecked exceptions. In a single-threaded application, the result of an unhandled RuntimeException is obvious, and the stack trace shows exactly where it happened, which provides both notification of the problem and useful information for fixing it. In multithreaded applications, however, threads can die silently from unchecked exceptions, leaving users and developers scratching their heads as to what happened and why.

Task-processing threads, like the request and response handler threads in our example application, spend essentially their whole lives calling service methods through some abstraction barrier such as Runnable. Because we have no idea what is on the other side of this abstraction barrier, we should not assume that the service method is so well behaved that it will never throw an unchecked exception. If the service routine throws a RuntimeException, the calling thread should catch it and either log it and move on to the next item in the queue, or let the thread die and start a replacement. (The latter option stems from the assumption that whoever threw the RuntimeException or Error might also have corrupted the thread's state.)
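The second option, replacing the thread, might be sketched as follows. RestartingWorker is a hypothetical class, not a library API: when a task throws a RuntimeException or Error, the worker logs it, starts a fresh thread on the same queue, and lets the current thread end. A terminated Thread object cannot be started again, so "restarting" in practice means starting a replacement. Catching Error here is a judgment call; after a serious Error such as OutOfMemoryError, starting a new thread may itself fail.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical self-replacing worker for a queue of Runnable tasks.
class RestartingWorker implements Runnable {
    private final BlockingQueue<Runnable> queue;
    private final AtomicInteger restarts;

    private RestartingWorker(BlockingQueue<Runnable> queue, AtomicInteger restarts) {
        this.queue = queue;
        this.restarts = restarts;
    }

    static Thread start(BlockingQueue<Runnable> queue, AtomicInteger restarts) {
        Thread t = new Thread(new RestartingWorker(queue, restarts), "restarting-worker");
        t.setDaemon(true);
        t.start();
        return t;
    }

    @Override
    public void run() {
        while (true) {
            Runnable task;
            try {
                task = queue.take();
            } catch (InterruptedException e) {
                return;                      // shutdown requested
            }
            try {
                task.run();
            } catch (RuntimeException | Error e) {
                System.err.println(Thread.currentThread().getName() + " failed: " + e);
                restarts.incrementAndGet();
                start(queue, restarts);      // replacement keeps draining the queue
                return;                      // this possibly corrupted thread exits
            }
        }
    }
}
```

The simpler "log it and move on" option is the same loop with the two lines after the log statement removed.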

The code in Listing 1 is typical of a thread that processes tasks from a work queue, like the incoming responses thread in our example. It does not guard against the plug-in throwing unchecked exceptions.

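// queue, findPlugIn() and log() are assumed to be members of the enclosing
// server class; IncomingResponse and PlugIn are application types.
private class TrustingPoolWorker extends Thread {
    public void run() {
        IncomingResponse ir;

        while (true) {
            ir = (IncomingResponse) queue.getNext();
            PlugIn plugIn = findPlugIn(ir.getResponseId());
            if (plugIn != null) {
                // No try/catch: a RuntimeException thrown by the plug-in
                // propagates out of run() and silently kills this thread.
                plugIn.handleMessage(ir.getResponse());
            } else {
                log("Unknown plug-in for response " + ir.getResponseId());
            }
        }
    }
}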

We don't have to add much code to make this worker thread far more robust to failures in the plug-in code. By simply catching RuntimeException and taking corrective action, we can protect ourselves against a single poorly written plug-in undermining the entire server. Appropriate corrective actions include logging the error and simply moving on to the next message, terminating the current thread and starting a replacement, or unloading the plug-in that caused the problem, as shown in Listing 2:

private class SaferPoolWorker extends Thread {
    public void run() {
        IncomingResponse ir;

        while (true) {
            ir = (IncomingResponse) queue.getNext();
            PlugIn plugIn = findPlugIn(ir.getResponseId());
            if (plugIn != null) {
                try {
                    plugIn.handleMessage(ir.getResponse());
                }
                catch (RuntimeException e) {
                    // The exception stops here, so this thread survives.
                    log("Plug-in failed on response " + ir.getResponseId() + ": " + e);
                    // Then take some sort of action:
                    // - move on to the next message
                    // - replace the worker thread
                    // - unload the offending plug-in
                }
            }
            else
                log("Unknown plug-in for response " + ir.getResponseId());
        }
    }
}
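The third option, unloading the plug-in that caused the problem, could look like the following sketch. PlugInRegistry, its methods, and the PlugIn interface (with an incomingResponse() callback, as in the text) are hypothetical. The dispatcher survives the failure, and the faulty plug-in is removed so it cannot fail again, while well-behaved plug-ins keep receiving their responses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical plug-in callback, named after incomingResponse() in the text.
interface PlugIn {
    void incomingResponse(String response);
}

class PlugInRegistry {
    private final Map<String, PlugIn> plugIns = new ConcurrentHashMap<>();

    void register(String id, PlugIn plugIn) {
        plugIns.put(id, plugIn);
    }

    boolean isLoaded(String id) {
        return plugIns.containsKey(id);
    }

    // Called by the response-dispatching thread for each response.
    void deliver(String id, String response) {
        PlugIn plugIn = plugIns.get(id);
        if (plugIn == null) {
            System.err.println("Unknown plug-in for response " + id);
            return;
        }
        try {
            plugIn.incomingResponse(response);
        } catch (RuntimeException e) {
            // Log and unload the offender; the calling thread carries on.
            System.err.println("Unloading plug-in " + id + " after: " + e);
            plugIns.remove(id, plugIn);
        }
    }
}
```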

Use the uncaught exception handler provided by ThreadGroup

In addition to treating foreign code as likely to throw RuntimeException, it is wise to use the uncaughtException facility of the ThreadGroup class. ThreadGroup is not useful for much, but for the time being (until uncaught exception handling is added to Thread in JDK 1.5), the uncaughtException feature makes it indispensable. Listing 3 shows an example of using ThreadGroup to detect when a thread dies due to an uncaught exception:

public class ThreadGroupExample {

    // A thread group whose uncaughtException() is called whenever one of its
    // threads terminates because of an uncaught exception.
    public static class MyThreadGroup extends ThreadGroup {

        public MyThreadGroup(String s) {
            super(s);
        }

        @Override
        public void uncaughtException(Thread thread, Throwable throwable) {
            System.out.println("Thread " + thread.getName() 
              + " died, exception was: ");
            throwable.printStackTrace();
        }
    }

    public static final ThreadGroup workerThreads = 
      new MyThreadGroup("Worker Threads");

    public static class WorkerThread extends Thread {
        public WorkerThread(String s) {
            super(workerThreads, s);   // place this thread in workerThreads
        }

        public void run() {
            // Simulate a worker that fails with an unchecked exception.
            throw new RuntimeException();
        }

    }

    public static void main(String[] args) {
        Thread t = new WorkerThread("Worker Thread");
        t.start();
    }
}

If a thread in a thread group dies because it threw an uncaught exception, the thread group's uncaughtException() method is called, and it can write an entry to the log, start a replacement thread, restart the system, or take whatever other corrective or diagnostic action is necessary. At the very least, if the handler logs a message whenever a thread dies, you will have a record of what went wrong and where, rather than just wondering where your request-handling thread went.
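JDK 5.0 did add this facility to Thread itself: Thread.setUncaughtExceptionHandler() installs a handler for one thread, and Thread.setDefaultUncaughtExceptionHandler() installs a JVM-wide fallback, so a custom ThreadGroup is no longer required. In the sketch below, DeathRecorder and HandlerExample are hypothetical names, and recording deaths in a list stands in for writing to a real log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Records every thread that dies from an uncaught exception.
class DeathRecorder implements Thread.UncaughtExceptionHandler {
    final List<String> deaths = Collections.synchronizedList(new ArrayList<>());

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        String entry = "Thread " + t.getName() + " died: " + e;
        deaths.add(entry);
        System.err.println(entry);
    }
}

class HandlerExample {
    public static void main(String[] args) throws InterruptedException {
        DeathRecorder recorder = new DeathRecorder();
        // Threads without a handler of their own fall through their
        // ThreadGroup to this JVM-wide default.
        Thread.setDefaultUncaughtExceptionHandler(recorder);

        Thread worker = new Thread(() -> {
            throw new RuntimeException("boom");
        }, "Worker Thread");
        worker.start();
        worker.join();   // the handler has run by the time the thread terminates
        System.out.println(recorder.deaths);
    }
}
```

A handler set on an individual thread with setUncaughtExceptionHandler() takes precedence over both the thread group and the default handler.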


Summary

It can be confusing when a thread disappears from an application, and all too often, threads disappear without a (stack) trace. As with many risks, the best defense against thread leakage combines prevention and detection: pay attention to places where RuntimeException is likely to be thrown, such as calls into foreign code, and use the uncaughtException handler provided by ThreadGroup to detect when threads terminate unexpectedly.
