 | Level: Introductory Jack Shirazi (jack@JavaPerformanceTuning.com), Director, JavaPerformanceTuning.com Kirk Pepperdine (kirk@JavaPerformanceTuning.com), CTO, JavaPerformanceTuning.com
30 Jun 2004 If you're part of the current blogging craze, then you've likely heard of Blog-City, a blogging site owned and operated by Blog-City Ltd., a small company in Scotland. When some unexpected performance issues cropped up, Java performance experts Jack Shirazi and Kirk Pepperdine were asked to assist in a technical tuning of Blog-City. Their detective work was complicated by hardware constraints and communication channels (IRC, ftp, and the occasional e-mail) used throughout the project.
With the increase in the popularity of blogs as a public diary, blogging hosts have been growing by leaps and bounds. So to the folks at Blog-City, it was clear that their site needed to evolve and grow. To address its growing needs, the company had just recently rolled out Blog-City version 2.0. As if often the case when a new application is rolled out into production, its performance does not quite meet expectations due to a number of factors, not the least of which was the sudden and seemingly random appearance of long periods of time when the application appeared to be hung.
At its core, Blog-City relies on the Blue Dragon Servlet engine (a CFML engine) and a database. Amazingly, all of this software was hosted on a fairly old P3 running Red Hat Linux. The machine, sporting a single hard disk and 512 MB of RAM, though powerful enough for past loads, was struggling to keep up with the growing load. Blog-City had become a victim of it's own success. Even so, this was all the hardware that was going to be available for some time.
Defining the problem
The first step in the process was to determine the cause of the sudden application slowdowns. First on our list of suspects was garbage collection. As we demonstrated in last month's column, the easiest way to determine if garbage collection and memory utilization issues are having a negative effect on your applications performance is to set the -verbose:gc JVM option and examine the log output. So we restarted the application with verbose garbage collection logging turned in, and patiently waited for the application's performance to fall off. Our patience paid off in the form of a very revealing gc log file.
From the initial analysis of the log files, it was apparent that garbage collection was the bottleneck in this application. This was evidenced by the frequency, duration, and overall efficiency of GC. A common reason for a higher than normal GC frequency is that the heap is barely large enough to accommodate all of the live objects that are currently in use and not quite large enough to accommodate the new ones that are being created. Though there are many reasons why an application may be consuming the amount of heap that it does, not having enough memory causes the garbage collector to thrash as it tries to satisfy the current need. In other words, the application will attempt to allocate a new object and fail, and upon failing, will trigger a garbage collection. If that GC fails to recover enough memory, it will force another, more expensive GC to occur. Even if GC does recover enough space to satisfy the immediate request, you can be sure that it won't be very long before the application experiences yet another allocation failure, triggering yet another GC. As a result, instead of the JVM servicing your application, it will focus it attention on the fruitless task of repeatedly scavenging for free heap space.
Though there may be many reasons why an application is consuming all of the available heap space, the expedient solution is to configure a larger heap, if more memory is available. Assuming that the application does not have a memory leak (or what we like to call unintentionally retained objects), it will find a "natural" level of heap consumption, which the GC will be able to comfortably maintain (unless the rate of object creation is so high that GC is always in a race to keep up). In this case, and the case of unintentionally retained objects, we will need to make changes to the application in order to obtain any improvement.
If only it were that simple
Unfortunately, we had to contend with a severe complication -- the machine on which we are running had only 512 MB of RAM available. To make matters worse, we had to share that space with the database and any other processes running on the box. To fully understand why this is critical you must first understand the basics of garbage collection in terms with how it interacts with the underlying operating system.
Virtual memory is no longer your friend
Virtual memory has been used by operating systems for many years. As you know, virtual memory gives an OS the appearance of having more memory than it really has, allowing the computer to run programs that are larger than would fit in the available physical RAM, with that part of the application that does not fit into memory being held on disk. As a further simplification, the OS manages memory a page at a time. A page typically contains anywhere from 512 bytes to 8 KB, with all of the pages combined making up a virtual address space. The OS maintains a table that tells it how to map virtual addresses to physical addresses. When an application asks for the contents of a memory location, the OS (or hardware) will identify the page containing the virtual address. It will then determine if that page is in memory; if not, you get a page fault. Though there are many ways to handle a page fault, the end result is that the page must be loaded into memory from disk. Only then will the application have access to the contents of the desired virtual address.
If related objects were always clustered on the same page in memory, then GC could most likely proceed with little difficulty. But in the real world, related objects are rarely (if ever) clustered. The net effect is that a system that relies on virtual memory will cause the operating system to swap pages in and out of memory as it marks and then sweeps the heap space, and while that is happening, the GC will spend more time waiting for pages to be paged in from disk than actually reclaiming memory. So your application is waiting for the GC, and the GC is waiting for the disk, and in the meantime no real work is getting done. It is a common and sound recommendation to make sure that your JVM heap is no larger than the amount of available physical memory, minus that used by other processes running on your machine. Given that this system had only one disk and it also needed to support the database, we were between a rock and a hard place. On one hand, we needed to increase the amount of memory so that we could decrease the frequency of GC, but on the other hand, we needed to make sure that we played nicely with the database as well, and databases are hungry consumers of memory too. So we needed to understand the minimal amount of memory that was needed by our application.
As we saw last month, this information is readily available in the verbose gc logs. Rather than go through the lengthy process of scanning the logs for this information, we used the free JTune tool (see Resources) to interpret the verbose GC logs. Figure 1 shows an after-GC memory utilization graph when we set -Xmx to 256 MB.
Figure 1. Memory utilization after GC

Analyzing verbose:gc output
In Figure 1, the blue portion represents a partial GC. The orange areas represent a full GC, and the pink squares represent two full GC's that have occurred with less than a millisecond between them. From the summary we find that there were 12,823 scavenges averaging 0.257 seconds. There were 345 full GCs and 44 back-to-back GCs. The average duration of the full GC was 7.303 seconds, which results in 9.36 percent of the runtime being spent in the garbage collector. Though the value is high, it is still under the general target of 10 percent. So, in this instance, GC is taxing the system but not at a serious rate. The real problem is the presence of a memory leak, which you can see in the general upward sloping trend of the heap utilization.
Even though the leak consumes 50 MB of memory, it happens over a long period of time, which makes it less noticeable in a shorter test. The net effect of the memory leak is that it pushes the JVM's memory consumption to the point where it has forced the JVM (and consequently the OS) to consume memory the point where it is forced to start paging. The evidence is in Figure 2. Note the sudden sustained increase in the duration of each GC cycle just after the 55,000-second mark.
Figure 2. GC duration

As you might imagine, the user response started to increase as the "stop-the-world" nature of garbage collection resulted in less time being dedicated to user threads. In the last 10,000 seconds of the log, we see that every full collection (15 in all) takes longer than 30 seconds with the average duration taking approximately 70 seconds -- which results in well more than 10 percent of processing time being dedicated to a full GC. Partial collections (of which there were just over 1000) do not fare much better, averaging 1.24 seconds per invocation, a far cry from the 0.25-second average in the previous 11,800 scavenges.
Collection through the generations
Without getting into too many details (see Resources for a complete description of generational GC), generational heap spaces results in "young" and "old" objects being located in separate heap spaces. In this configuration, young and old generational spaces can be maintained with different GC algorithms and strategies, increasing the overall performance of GC.
One such strategy is to further divide the young generation into a creation space, called Eden, and survivor spaces, for young objects that survive one or more collections. This generally works well if there is enough memory in Eden to accommodate the creation of a new object. If this is not the case, then the object may be created in the old object space. Also, if the survivor space is full, then objects will be moved into the old generation space. We will use these facts to help us with our tuning problem.
Reducing the number of full collections
The problem Blog-City was experiencing was long pause times that would occur at some random point. Once the application started misbehaving, it never was able to get back on track without having to be restarted. Because the long pause times appeared to be directly related to long GCs, we wondered if keeping objects in the young generation for longer would reduce the number of full GCs. Because full GCs are so expensive, collecting more objects in young generation could yield shorter pause times. To achieve this we adjusted some of the garbage collection parameters, the survivor ratio and the tenuring threshold.
The survivor ratio sets the size of a survivor space in relation to the overall size of young generational space. If the size of the survivor ratio is 8 (the default on Intel), then each of the survivor spaces will be 1/8 the size of Eden. Another way of looking at it is to say that the young generation will be divided into ten equally sized values of which Eden will be allocated 8 and each survivor space will receive 1.
Our hypothesis was that by decreasing the survivor ratio, we could decrease the odds that an object will be prematurely promoted to the old generation due to lack of space in a survivor space. Another approach would be to increase the tenuring threshold so that objects would need to survive more GC events before being promoted. With this in mind, Blog-City was restarted with the settings -XX:SurvivorRatio=4.
Selecting a low pause time garbage collection algorithm
Because one of the goals of this technical tuning was to reduce pause time, we decided to move away from the default single-threaded, mark-and-sweep garbage collector. Instead we opted for the parallel copying collector selectable with the flag XX:+UseParallelGC. Again, details on the actual algorithm can be found in Resources, but suffice it to say that this flag invokes a multi-threaded collector. The number of threads is set to the number of CPUs. Given this fact, it's difficult to know why a single-threaded parallel copying garbage collector would work better than the traditional mark and sweep, but it has been observed to offer some performance advantages.
The output in Figures 3 and 4 show the results of a run that used the flags -XX:SurvivorRatio=4 +XX:+UseParallelGC -server -Xmx256M.
Figure 3. Memory Profile under the new configuration 
The resulting graph shows a remarkable difference. Though there is still a memory leak, the overall amount of memory consumption is much lower than in the first figure. A quick comparison of the GC duration reveals a marked decrease in the overall GC duration of young and old generations.
Figure 4. GC Durations under the new configuration
Intentional, unintentional object retention
Because the application was memory-bound, it became even more important to track down the memory leaks and eliminate them. In this instance, the component used to support the caching strategy was determined to the the source of the major leak. From the final memory profile (Figure 3), with the major memory leak eliminated, we can see that there still is another "low grade" leak, but this leak is small enough that it can be ignored until the next release.
The final word
The exercise offered many challenges. First, we were tuning a live application, which means there was limited opportunity to make changes. The second challenge was the exercise was conducted remotely, using IRC chat. Chat doesn't offer the level or quality of communication that is often required during the course of this type of exercise. In this case, the team was already acquainted with the realities of chat and was able to work through these realities with little difficulty.
The last and most difficult challenge is that we were constrained by the hardware. For a number of reasons, it was just not possible to add new hardware to the system. The biggest issue was the amount of real RAM that was on the system and the amount demanded by the JVM and MySQL. But, by systematically applying a number of changes one at a time and then measuring their effect on the system, we were able to bit by bit improve overall system performance.
Resources
About the authors
 | |  | Kirk Pepperdine is the Chief Technical Officer at Java Performance Tuning.com and has been focused on Object technologies and performance tuning for the last 15 years. Kirk is a co-author of the book Ant Developer's Handbook (MacMillan). |
Rate this page
|  |