Garbage collection (GC) is performed when there's an allocation failure in heap lock allocation, or if there's a specific call to System.GC. The thread that has the allocation failure or the System.GC call takes control and performs the GC. It first acquires all the locks required for a GC and then suspends all the other threads. GC is then done in three phases: mark, sweep, and optionally compaction. We call this a stop-the-world (STW) collection, because all application threads are stopped while garbage is collected.
The mark phase of GC marks all the live objects. Since there's no way to individually identify unreachable objects, we identify all the reachable objects, and then we know everything else is garbage. The process of marking all reachable objects is also known as tracing.
The active state of the JVM is made up of the saved registers for each thread, the set of stacks representing the threads, the statics within Java classes, and the set of local and global Java Native Interface (JNI) references. All invocations of functions within the JVM itself give rise to a frame on the C stack. This frame may itself contain instances of objects, either as a result of an assignment to a local variable or a parameter passed from the caller. All these references are treated equally by the tracing routines. In essence, we view the stack of a thread as a set of 4-byte fields (8 bytes in 64-bit architecture) and scan them from the top to the bottom of each of the stacks. We assume that the stacks are 4-byte aligned (8-byte aligned in 64 bit architecture). Each slot is examined to see if it points at an object in the heap. But this does not make it necessarily a pointer to an object, because it may be just an unlucky combination of bits in a float or integer. When making the scan of the stacks you must be conservative; anything that points at an object is assumed to be an object, but the object in question must not be moved during GC. A slot is considered a pointer to an object if it fulfills the following criteria:
- It is grained on an 8-byte boundary.
- It is within the bounds of the heap.
- The allocbit is on.
Objects referenced in this way are known as roots, and will have their dosed bit set on to indicate that they cannot be moved.
Tracing can now proceed accurately, meaning we can find references within the roots to other objects. Because we know they are real references, we'll be able to move them during compaction as we can fix up these references.
The tracing process uses a stack that can hold 4K entries. All references pushed to the stack are marked at the same time by setting on the relevant markbit. The roots are marked and pushed to the stack, then we start to pop entries off the stack and trace them.
Normal objects (not arrays) are traced by using the mptr to access the classblock, which tells us where references to other objects are to be found in this object. As we find each reference, if it is not already marked, we mark and push it.
Array objects are traced by looking at each array entry and, if it is not already marked, marking and pushing it. Additional code traces a small part of the array at a time to try to avoid mark stack overflow.
The tracing process continues iteratively until the mark stack eventually becomes empty.
Because the mark stack has a fixed size, it can possibly overflow. If that happens, we:
- Set a global flag to indicate that mark stack overflow has occurred.
- Set the NotYetScanned bit in the object that could not be pushed.
Tracing can then continue with all other objects that could not be pushed having their NotYetScanned bit set. When all tracing is complete, we walk the heap by starting at the first object and using the size field to navigate to the next object. All objects that we find that have their NotYetScanned bit set are marked and pushed to the mark stack. The NotYetScanned bit is set off and tracing continues as before. It is possible to get another mark stack overflow, in which case we go through the whole process again until all reachable objects are marked.
Parallel mark was introduced in Java 1.3.0. Given bitwise sweep and compaction avoidance, we spend the majority of our GC time marking objects. This leads immediately to developing a parallel version of GC mark. The goal of parallel mark is to not degrade mark performance on a uniprocessor, and to scale typical mark performance better than 4 on an 8-way machine.
The basic idea is to augment object marking through the addition of helper threads and a facility for sharing work between them. In the original implementation, a single application thread was stolen to perform GC. Parallel mark still demands the participation of one application thread, which is used as the master coordinating agent. This thread performs very much as it always did, including having the responsibility for scanning C-stacks to identify root pointers for the collection. A platform with N processors will also have N-1 new helper threads, which work along with the master thread to complete the marking phase of GC. The default number of threads can be overridden with the -Xgcthreadsn parameter. A value of 1 will result in no helper threads. Values of 1 to N are accepted.
The goal is to provide each marker thread with a local stack and a shareable queue, both of which have references to objects that are marked but not yet scanned. Threads do the bulk of marking work using their local stacks, synchronising on sharable queues only when work balance requires it. Mark bits are updated using atomic primitives that require no additional lock. Because each thread has a mark stack that can hold 4K entries, and a mark queue that can hold 2K entries, the chances of a mark stack overflow are reduced.
Concurrent mark, which was introduced in Java 1.3.1, is designed to give reduced and consistent GC pause times as heap sizes increase. It works by starting a concurrent marking phase before the heap is full. In the concurrent phase we scan the roots by asking each thread to scan its own stack. We then use these roots to trace live objects concurrently. Tracing is done by a low-priority background thread, and by each application thread when it does a heap lock allocation.
As we mark live objects concurrently with application threads running, we have to record any changes to objects that were already traced. This is achieved by using a write barrier that's activated every time we change a reference in an object. The write barrier is required to tell us when an object reference update has taken place so we know we have to rescan a part of the heap. It is the same write barrier as required by resettable. The heap is divided into 512-byte sections and each section is allocated a byte in the card table. Whenever a reference to an object is updated, the card that corresponds to the beginning address of the object that has been updated with the new object reference is marked with 0x01. A byte is used rather than a bit for two reasons: it's quicker to do a write to a byte than to do bit changes, and we want to use the other bits in the future.
An STW collection will be started when one of the following occurs:
- Allocation fails
- A System.GC is requested
- Concurrent mark completes all the marking it can do
We try to start the concurrent mark phase so it will complete at the same time that the heap is exhausted. You can achieve this by constant tuning of the parameters that govern the concurrent mark time.
In the STW phase we again scan all roots, use the marked cards to see what needs to be retraced, and then sweep as normal. We guarantee that all objects that were unreachable at the start of the concurrent phase are collected. What we don't guarantee is that objects that become unreachable during the concurrent phase are collected.
Reduced and consistent pause times are benefits of concurrent mark, but there is a price to pay. Application threads have to "pay a tax" by doing some tracing when they are requesting a heap lock allocation. The amount of tax they have to pay will vary depending on how much idle CPU time there is for the background thread. There is also an overhead for the write barrier.
There is a new parameter to enable concurrent mark:
-Xgcpolicy:<optthruput | optavgpause>- Setting
gcpolicytooptthruputdisables concurrent mark. Users who do not have pause time problems (as seen by erratic application response times) should get the best throughput with this option.optthruputis the default setting.Setting
gcpolicytooptavgpauseenables concurrent mark with its default values. Users who are having problems with erratic application response times caused by normal GCs can alleviate those problems, at the cost of some throughput, when running with theoptavgpauseoption.
After the mark phase, the markbits vector contains a bit for every reachable object in the heap and must be a subset of the allocbits vector. The sweep phase identifies the intersection of the allocbits and markbits vectors -- objects that have been allocated but are no longer referenced. In the bitsweep technique we examine the markbits vector directly and look for "long" runs of zeros that probably identify free space. Once such a long run is found, we look at the length of the object at the start of the run to determine the free space to be released. If the amount of free space is greater than 512 (384 in 1.3.0 and earlier) plus the header size, then we put this free chunk on the freelist. The small pieces of storage that are not on the freelist are known as "dark matter," and they will be recovered when the objects adjacent to them become free or when we compact the heap. There is no need to free the individual objects in the free chunk, because we know that the whole chunk is free storage. Indeed, when we free a chunk we have no knowledge of the objects that were in it. During this process, the markbits are copied to the allocbits so that on completion the allocbits correctly reflects the allocated objects on the heap.
Parallel bitwise sweep, which was introduced in Java 1.3.1, is similar to parallel mark in that it improves sweep time by using available processors. We use the same helper threads as parallel mark, so the default number of helper threads is also the same and can be changed with the -Xgcthreadsn parameter.
32 * the number of helper threads, or the maximum heap size / 16M, whichever is larger |
The helper threads take a section at a time and scan it, performing a modified bitwise sweep. The results of this scan are stored for each section, and when all sections have been scanned the freelist is built.
Once the garbage has been removed from the heap, we can consider compacting the resulting set of objects to remove the spaces between them. Compaction is complex because if we move any object we must alter all its' currently existing references. If one of those references was from a stack and we aren't certain that it was an object reference (it may have been a float, for example), then we clearly are unable to move the object. Objects that are temporarily fixed in position are referred to as "dosed" in the code and have the "dosed" bit set in their header word. Similarly, objects can be "pinned" during some JNI operations, which has the same effect but is "permanent" until the object is explicitly unpinned by JNI. Objects that remain mobile are compacted in two phases by taking advantage of the low three bits zero of the mptr, and using one of these to denote that we have "swapped" it. Note that this swapped bit is applied in two places:
- The size + flags field, where it is known as
OLINK_IsSwapped. - The mptr, where it is known as
GC_FirstSwapped.
In both cases, the least significant bit (x01) is set.
To better understand the compaction process, use the analogy of the heap as a warehouse partly full of furniture of different sizes. The free space is the gaps between the furniture. The free list contains only gaps above a certain size. Compaction pushes everything in one direction and closes all the gaps. It starts with the object closest to the wall and puts it against the wall, takes the second object in line and puts it against the first, takes the third and puts it against the second, and so on. At the end, all the furniture is on one end and all the free space on the other. Pinned and dosed objects that cannot be moved complicate the picture but don't change the overall idea.
Compaction avoidance seeks to focus on proper object placement, and thereby reduce, and in many cases eliminate, the need for compaction. A key aspect of compaction avoidance is a concept called wilderness preservation. The idea is to try to preserve a region of the heap in a pristine state by focusing allocation activity elsewhere. In practice, this is done by establishing a boundary between most of the heap and a reserved wilderness portion. In typical cases, non-compacting GC events are triggered whenever the wilderness is threatened. The wilderness is consumed (eroded) only when necessary to satisfy a large allocation, or when insufficient allocation progress has been made since the previous GC. Figure 1 shows the wilderness on the heap.
Figure 1. The wilderness

The wilderness is allocated at the end of the active part of the heap. Its size is 5% of the active part of the heap, with a maximum of 3M. On heap lock allocation failure, if sufficient allocation progress has been made since the last GC, then a GC is performed. Sufficient progress is defined as at least 30% of the heap being allocated since the last GC. This is the default and can be changed with the -Xminf parameter. If sufficient progress has not been made, the allocation is immediately satisfied from the wilderness if possible. Failing this, a "normal" allocation failure occurs. Insufficient progress will have been made if we get an allocation request for a large object that cannot be satisfied before we've exhausted the free list. This is where the reserved wilderness will be able to satisfy the request, avoiding a GC and a compaction.
Compaction occurs if any of the following are true and -Xnocompactgc has not been specified:
-Xcompactgchas been specified.- Following the sweep phase, there is insufficient free space to satisfy the allocation request.
- A System.GC has been requested and the last allocation failure GC did not compact.
- At least half of the previously available memory has been consumed by TLH allocations (ensuring an accurate sample), and the average TLH size falls below 1000 bytes.
- Less than 5% of the active heap is free.
- Less than 128K of the active heap is free.
In Java 1.2.2 the whole area of reference handling changed considerably. The aim is to treat all references equally and process them in the same way. Thus, we actually create two separate objects on the heap -- the object itself and a separate "reference" object. The reference objects may optionally be associated with a queue to which they will be added when the referent becomes unreachable. Instances of SoftReference, WeakReference, and PhantomReference are created by the user and are immutable; they cannot be made to refer to other than the object they referenced on creation. Objects associated with a finalizer are "registered" with the Finalizer class on creation. The result is the creation of a FinalReference object that is associated with the Finalizer queue and refers to the object to be finalized.
During GC, reference objects get special handling, because the referent field is not traced during the marking phase. Once marking is complete, the references are processed in the order: soft, weak, final, and phantom. Processing of SoftReference objects is specialized in that the ST component is able to decide that these references should be cleared if the referent is unmarked (unreachable except for a path through a reference). The clearing is done if memory is running out, and is selective on the basis of most recent usage. "Usage" is measured by the last time the get method was called, which can lead to some surprises that are entirely valid. As a reference object is processed its referent is marked, ensuring that when, for example, a FinalReference is processed for an object that also has a SoftReference, the FinalReference will see a marked referent and thus the FinalReference will not be queued for processing. As a result, references will be queued in successive GC cycles.
References to unmarked objects are initially queued to the ReferenceHandler thread in the Reference class, which takes objects off its queue and looks at their individual "queue" field. If an object is thus associated with a specific queue it is requeued to it for further processing. The FinalReference objects are requeued and eventually their finalize
method is run by the Finalizer thread.
JNI weak references provide the same capability as WeakReference objects, although the processing is very different. A JNI routine can create a JNI weak reference to an object and later delete that reference. The ST component will clear any weak reference where the referent is unmarked, but there is no equivalent of the queuing mechanism. Note that failure to delete a JNI weak reference will cause a memory leak in the table and performance problems. This is also true for JNI global references. The processing of JNI weak references is handled last in the reference handling process. The result is that a JNI weak reference can exist for an object that has already been finalized and had a phantom reference queued and processed.
Heap expansion occurs after GC and when all the threads have been restarted, but the HEAP_LOCK is still held. The active part of the heap will be expanded up to the maximum if one of the following is true:
- The GC did not free sufficient storage to satisfy the allocation request.
- Free space is less than the minimum free space, which can be set by using the -Xminf parameter. The default is 30%.
- More than 13% of the time is being spent in GC, and expanding by the minimum expansion amount (-Xmine) will not result in a heap greater than the maximum percentage of free space (-Xmaxf).
The amount to expand by is calculated as follows:
- If we are expanding because there is less than -Xminf (default 30%) free space then we calculate how much we need to expand by to get -Xminf free space. If this is greater than the maximum expansion amount, which can be set by the -Xmaxe parameter (default of 0, which means no maximum expansion), then we reduce it to -Xmaxe. If this is less than the minimum expansion amount, which can be set by the -Xmine parameter (default of 1M), then we increase it to -Xmine.
- If we are expanding because GC did not free sufficient storage and we are not spending more than 13% in GC, then we expand by the allocation request.
- If we are expanding for any other reason then we calculate how much we need to expand by to get 17.5% free space. We again adjust this as above, depending on -Xmaxe and -Xmine. We also need to make sure we expand by at least the allocation request if GC did not free sufficient storage.
All calculated expansion amounts are rounded up to a 64K boundary on 32-bit architecture, or a 4M boundary on 64-bit architecture.
Heap shrinkage occurs after GC, but when all the threads are still suspended. Shrinkage will not occur if any of the following are true:
- The GC did not free sufficient space to satisfy the allocation request.
- The maximum free space, which can be set by the -Xmaxf parameter (default is 60%), is set to 100%.
- The heap has been expanded within the last three GCs.
- This is a System.GC and the free space at the beginning of the GC was less than -Xminf (default is 30%) of the live part of the heap.
If none of the above is true and we have more than -Xmaxf free space, we calculate how much to shrink the heap to get it to -Xmaxf free space, without going below the initial (-Xms) value. This figure is rounded down to 64K boundary on 32-bit architecture, or a 4M boundary on 64-bit architecture.
A compaction will occur before the shrink if all of the following are true:
- A compaction was not done on this GC cycle.
- There is not a free chunk at the end of the heap, or the size of the free chunk at the end of the heap is less than 10% of the desired shrinkage amount.
- We didn't shrink and compact on the last GC cycle.
Stay tuned for part 3 in this series, which will include a discussion of verbosegc and other command line parameters. If you missed it previously, catch up by reading Part 1, Object allocation.
- Participate in the discussion forum.
- Garbage Collection - Algorithms for Automatic Dynamic Memory Management, by Richard Jones and Rafael Lins (John Wiley & Son Ltd, 1996, ISBN:0-471-94148-4).
- Read Part 1, Object allocation (developerWorks, August 2002) and Part 3,
verbosegcand command-line parameters (developerWorks, September 2002) in this series. - Find more Java resources on the developerWorks Java technology zone.
Sam Borman joined IBM in 1984 after working at seven other companies as a programmer and systems programmer in the UK, New Zealand, and France. He worked as a developer, and later as a development manager, in CICS. In 1990 he returned to a technical career working on CICSPlex/SM and later on DirectTalk. In 1999 he joined the Java Technology Centre where he took ownership of Garbage Collection. You can contact Sam Borman at sambo@uk.ibm.com.
Comments (Undergoing maintenance)

