The Support Authority: Using IBM Support Assistant to find patterns for determining and resolving performance problems in enterprise applications

In this article, the authors share their experience detecting and diagnosing performance problems in production and open source benchmark applications through the presentation of three case studies based on actual situations. These case studies highlight how various IBM® tools, such as Garbage Collection and Memory Visualizer (GCMV) and IBM Monitoring and Diagnostic Tools for Java™ - Memory Analyzer (MAT), can be used together to diagnose complex performance problems through the analysis of verbose garbage collection logs, heap dumps, and thread dumps. This content is part of the IBM WebSphere Developer Technical Journal.

Vijay Mann (vijamann@in.ibm.com), IBM Research, IBM

Vijay Mann is a senior software engineer and researcher at IBM Research - India in New Delhi. He joined IBM in 2003 after working on Wall Street (Portfolio Analytics Group at Morgan Stanley, New York) for a couple of years. He has more than 11 years of experience in enterprise systems development and research. He has authored more than 20 papers in top IEEE/ACM conferences, co-authored 3 IBM Redbooks, and has several issued and filed patents to his credit. His research interests broadly lie in the areas of distributed systems, data center networks, performance analysis, operating systems, and systems management. He holds an MS from Rutgers University, New Jersey (USA).



Venkateswara R Madduri (vmadduri@in.ibm.com), IBM Research, IBM

Venkateswara Madduri works with IBM Research - India, where he focuses on systems management. He can be contacted at vmadduri@in.ibm.com.



Srividya Shamaiah (sshamaia@in.ibm.com), Java Technology Center, IBM

Srividya Shamaiah works with the Java Technology Center at IBM India and has extensive experience troubleshooting Java runtime issues. She speaks extensively at conferences and workshops and is an evangelist for IBM Java tools.



20 June 2012


In each column, The Support Authority discusses resources, tools, and other elements of IBM® Technical Support that are available for WebSphere® products, plus techniques and new ideas that can further enhance your IBM support experience.

Combining tools and experience

Problem determination and performance tuning of production-level multi-tier enterprise Java™ applications is considered to be an extremely difficult task. Fortunately, there are many diagnostic tools available that can be used to detect the root cause of such problems. In this article, we share our experience detecting and diagnosing performance problems in production and open source benchmark applications through the presentation of three case studies based on actual situations. These case studies cover a wide spectrum of performance issues: a memory leak in infrastructure code, high system CPU usage, and high CPU and memory usage.

Each case presents the problem and the way we solved it through analysis of verbose garbage collection (GC) logs and heap dumps using IBM tools such as Garbage Collection and Memory Visualizer (GCMV) and IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer (MAT). Each of these case studies highlights the complexity associated with problem determination and performance analysis of enterprise Java applications. All of the tools described are available as plug-ins in the IBM Support Assistant workbench.


Case study 1: Memory leak in infrastructure code

Our first case study involves an open source benchmark called RUBiS. RUBiS mimics an auction site. It comes with its own workload driver and has a web tier that connects to a database tier. It has been cited heavily in research papers for performance evaluation (see Resources).

While using RUBiS as a benchmark for our other research, we noticed that RUBiS would stop responding at the end of an experiment and had to be restarted for the next experiment. We were using the servlets version of RUBiS and the MySQL JDBC driver. We analyzed the verbose GC log of the application using GCMV for a constant load of 100 clients. Verbose GC logs can be easily enabled either through the Java command line or through the management user interface of most commercial application servers. The analysis revealed that the used heap kept growing all the time, even at constant load (Figure 1). This seemed like a classic memory leak issue. However, as we discovered, the root cause of this memory leak was not as obvious as most classic memory leaks.
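
For reference, on an IBM JDK the relevant options look something like the line below: -verbose:gc writes GC events to stderr (native_stderr.log under WebSphere Application Server), and the IBM-specific -Xverbosegclog option redirects them to a dedicated file that GCMV can load. The log file name and application JAR are placeholders:

    java -verbose:gc -Xverbosegclog:verbosegc.log -jar YourApp.jar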

Figure 1. GCMV: Increasing RUBiS heap usage at constant load

For memory leak analysis, it is a good idea to understand some basic definitions first:

  • Shallow size of an object is the amount of memory allocated to store the object itself, not taking into account the objects it references.
  • Retained size of an object is its shallow size plus the shallow sizes of the objects that are accessible, directly or indirectly, only from this object (a hypothetical sketch illustrating this follows the list).
  • Retained set of X is the set of objects that will be garbage collected if X is garbage collected.
  • Dominator tree is a transformation of the cyclic object graph into a keep-alive tree in which every node is directly responsible for keeping its children alive.
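
As a hypothetical illustration of shallow versus retained size (the class and field names below are invented for this sketch and do not come from the applications discussed):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical cache holder, used only to illustrate the definitions above.
    public class SessionCache {
        // The shallow size of a SessionCache instance covers only the object
        // header and this one reference field -- a few dozen bytes at most.
        private final Map<String, byte[]> entries = new HashMap<String, byte[]>();

        public void put(String key, byte[] value) {
            entries.put(key, value);
        }
    }

    // If a SessionCache instance is the only path to its HashMap, its entry
    // array, every HashMap$Entry, and every key and value, then all of those
    // objects form its retained set, and its retained size can reach many
    // megabytes even though its shallow size is tiny. In MAT's dominator tree,
    // such an instance appears as the node that keeps all of them alive.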

We took two heap dumps of this application within minutes of each other at constant load and compared them using MAT. Heap dumps can be easily enabled in most commercial application servers. A quick look at the object counts in the two heap dumps pointed us to a set of objects that had grown to almost twice their original count and size. Most of these objects were of primitive types, but the ones that had grown the most were related to JDBC PreparedStatement calls (the number of such objects almost doubled between the first and second heap dumps). A breakdown of the objects that grew the most is given in Figure 2. In most memory leaks, there are a few container objects whose size grows the most and other object types whose count grows the most.

Figure 2. MAT: RUBiS objects with most growth between the two heapdumps

From the snapshot in Figure 2, you can see that the count of PreparedStatement objects increased the most. However, we still needed to find where in the code this count was increasing (in other words, where the memory leak was located in the code). One way to do this is to look at the dominator tree in MAT (Figure 3). You can clearly see that two servlets (ViewBidHistory and BrowseCategories) account for more than 50% of the used heap, and these two classes are likely to contain the code that is leaking memory.

Figure 3. MAT: Dominator tree with two servlets that account for more than 50% of the used heap

Because we had the source code of the application, we searched for the usage of PreparedStatement objects in the two servlets (ViewBidHistory and BrowseCategories) and found the pattern shown in Figure 4. Recall that a PreparedStatement object represents a precompiled SQL statement with IN parameters that can be set each time through the corresponding setter methods; in this way, it can be reused to execute the SQL statement efficiently multiple times. However, the RUBiS code repeatedly created "temporary" PreparedStatement objects inside a loop, but closed only the last instance that was created. The JDBC specification states that a driver should close all PreparedStatement objects associated with a Connection when that Connection is closed. In this case, the JDBC driver did not close the PreparedStatement objects at Connection close and maintained references to them, so they were never garbage collected and their number kept increasing, which resulted in a memory leak.
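
The problematic pattern looked roughly like the following sketch; the class, method, query, and variable names here are illustrative, not the actual RUBiS source:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.List;

    // Sketch of the leaking pattern: a new PreparedStatement is created on every
    // loop iteration, but only the instance from the last iteration is closed.
    public class BidHistoryLeakSketch {
        void printBids(Connection conn, List<Integer> itemIds) throws SQLException {
            PreparedStatement stmt = null;
            for (Integer itemId : itemIds) {
                stmt = conn.prepareStatement("SELECT * FROM bids WHERE item_id = ?");
                stmt.setInt(1, itemId);
                ResultSet rs = stmt.executeQuery();
                // ... render the bid rows ...
                rs.close();
            }
            if (stmt != null) {
                // Closes only the last statement; every earlier statement remains
                // referenced by the driver and is never garbage collected.
                stmt.close();
            }
        }
    }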

There are two ways to fix this kind of leak: either move the statement close inside the loop, or move the statement creation above the loop. Both will fix the memory leak, but the latter is likely to give better performance. We found that 7 out of 22 servlets in RUBiS were leaking memory at various places due to this usage pattern. The resulting heap usage graph in GCMV at constant load after fixing the memory leak is given in Figure 5. Notice that the used heap remains almost constant, as expected.
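
A sketch of the second (preferred) fix, again with illustrative names, creates the statement once before the loop and reuses it; try-with-resources (Java 7 and later) is used here for brevity:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.List;

    // Sketch of the fix: one PreparedStatement, reused for every iteration and
    // closed exactly once.
    public class BidHistoryFixedSketch {
        void printBids(Connection conn, List<Integer> itemIds) throws SQLException {
            try (PreparedStatement stmt =
                         conn.prepareStatement("SELECT * FROM bids WHERE item_id = ?")) {
                for (Integer itemId : itemIds) {
                    stmt.setInt(1, itemId);
                    try (ResultSet rs = stmt.executeQuery()) {
                        // ... render the bid rows ...
                    }
                }
            } // the single PreparedStatement is closed here
        }
    }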

Figure 4. RUBiS code that was the root cause for the memory leak

This case study highlighted a memory leak scenario in infrastructure code where the leak happened due to a combination of incorrect usage and a non-compliant driver implementation. Furthermore, the root cause turned out to be something that is not usually perceived as a common cause of a leak.

Figure 5. GCMV: Constant RUBiS heap usage after fixing memory leak at constant load

Case study 2: High system CPU usage

Our second case study is from a telecom application in production (referred to hereafter as TelecomApp). This application served as the home page for millions of mobile phone users. Users could click and get a variety of multimedia content such as ring tones, wallpapers, live scores, and so on. In production, the main components of this application consisted of a cluster of web portal nodes that connect to a cluster of back-end content management system nodes, which in turn get their data from a database running on a large multicore machine.

The reported problem was very high CPU utilization at a low to moderate number of users. We collected detailed system logs using nmon. A quick look at the CPU logs revealed that the web portal node had high levels of system CPU (Figure 6).

Figure 6. Web portal node CPU graph from nmon before fixing the code

Verbose GC logging was enabled through the WebSphere Application Server admin console, and we analyzed the collected verbose GC logs (by default, they go into native_stderr.log in WebSphere Application Server) using GCMV. The GC logs looked fine, except for a very high number of objects being queued for finalization -- close to 40,000 of them (see the dotted line in Figure 7). Recall that any object that has a finalize() method and is found to be unreachable (dead) by the garbage collector is pushed onto a finalization queue. At some later point, the finalizer thread dequeues the object and calls its finalize() method (a minimal sketch of such a finalizable class follows the list below). Too many objects being queued for finalization can cause two types of performance degradation:

  • High CPU utilization, because of the finalizer thread’s work, or
  • High memory utilization, as all the objects that are dead but reachable from an object that is yet to be finalized cannot be reclaimed.
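
As a minimal sketch of a finalizable class (the class name and field are hypothetical), any type that overrides finalize() behaves this way:

    // Minimal sketch: because this class overrides finalize(), the JVM registers
    // every instance with the finalizer. When the garbage collector finds an
    // instance unreachable, it is queued for finalization, and neither it nor
    // anything reachable only from it can be reclaimed until the finalizer
    // thread has run its finalize() method.
    public class NativeHandle {
        private long handle = 42L;  // stand-in for a native resource identifier

        @Override
        protected void finalize() throws Throwable {
            try {
                handle = 0L;  // pretend to release the native resource
            } finally {
                super.finalize();
            }
        }
    }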
Figure 7. GCMV: Objects queued for finalization (dotted line = original code, solid line = after fixing code)
Figure 8. MAT: Histogram of objects referenced by finalizer in finalizer reference statistics

We enabled heap dump settings in WebSphere Application Server and took a heap dump in the staging environment to find out which types of objects were being finalized. We used MAT to analyze the heap dump and looked at the histogram of objects referenced by the finalizer in the finalizer reference statistics (Figure 8). Out of 38,469 objects queued for finalization, 35,626 were of type java.util.zip.ZipFile. We then took thread dumps and analyzed them using IBM Thread and Monitor Dump Analyzer (also shipped with IBM Support Assistant), which pointed us to the code responsible for creating these java.util.zip.ZipFile objects (Figure 9).

Figure 9. IBM Thread and Monitor Dump Analyzer: Code responsible for creating java.util.zip.ZipFile

The code invoked the getResource(filename) method on the class loader to get a handle to the exact path of a particular properties file. This method essentially unzipped all the JAR files in the classpath, one by one, to search for the given file name. This resulted in the creation of a lot of java.util.zip.ZipFile objects (each of which has a finalize() method) and a lot of disk read activity. This unzipping of JAR files and the associated disk reads were the reason behind the observed high system CPU. The code was repeated in several methods, all of which were on the critical path of various client requests. The steps taken to detect the root cause of this problem are shown in Figure 10.
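
The lookup followed a pattern similar to the sketch below; the class name and properties file name are illustrative:

    import java.net.URL;

    // Sketch of the problematic pattern: resolving a properties file through the
    // class loader on every request. In this application, each lookup ended up
    // scanning the JAR files on the classpath, creating many java.util.zip.ZipFile
    // instances (each with a finalize() method) and causing heavy disk reads.
    public class ConfigLocator {
        String findConfigPath() {
            URL url = getClass().getClassLoader().getResource("telecomapp.properties");
            return (url != null) ? url.getPath() : null;
        }
    }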

Figure 10. Correlating output from various tools to resolve TelecomApp high system CPU problem

We fixed this problem by replacing the code that searched the entire classpath for the properties file with code that used the exact location of the file. The system CPU immediately went down (Figure 11), and the number of objects queued for finalization went down as well (see the solid line in Figure 7). Notice the initial spike in objects queued for finalization in both cases; this happens as a result of all the classes being loaded from various JARs at startup.
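
The fix amounted to something like the following sketch; the hard-coded path is a placeholder for the file's actual location:

    import java.io.File;

    // Sketch of the fix: read the properties file from its known location instead
    // of searching the entire classpath on every request.
    public class ConfigLocatorFixed {
        // Placeholder path; in practice this would come from configuration.
        private static final String CONFIG_PATH = "/opt/telecomapp/telecomapp.properties";

        String findConfigPath() {
            File config = new File(CONFIG_PATH);
            return config.exists() ? config.getAbsolutePath() : null;
        }
    }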

This case study highlighted a high system CPU usage scenario where the problem happened due to bad code. This case study also highlighted how output from different tools can be correlated to get to the root cause of a performance problem.

Figure 11. Web portal node CPU graph from nmon after fixing the code

Case study 3: High CPU and memory usage

Our third case study is from the same TelecomApp as the previous case study. Here, we tried to solve the problem of high memory usage coupled with high CPU utilization, which caused the application to run out of heap memory after running for two or three days and halt after generating a heap dump. The application used the default optthruput GC policy of the IBM JDK, which is aimed at optimal throughput.

First, we analyzed the production verbose GC log using GCMV and looked at the used heap. As shown in Figure 12, the used heap curve (pink line) shows that the heap always holds about 750 MB of data; the used heap is lowest at night (between 1:30 am and 6:00 am). The logs also contained a very high number of mark stack overflow errors; these happen when there are too many live objects in the heap (or, more precisely, very deeply nested objects) and the stack that the GC uses during the mark phase overflows. Figure 12 also shows two restarts, indicated by the sudden drops in heap size.

Figure 12. GCMV: Consistently high used heap (750 MB) in the TelecomApp

The verbose GC log also showed high pause times due to high compaction times. This pattern was very uniform throughout, and almost all of the high pause times were caused by high compaction times (Figure 13). Further analysis of the logs for these high pause times revealed allocation failures with reason code 16; a reason code 16 compaction happens when the garbage collector is not able to increase the amount of free storage by at least 10% [9]. The overhead due to garbage collection was around 13% (GC overhead = time spent in GC pauses / total execution time * 100). Figure 13 also shows that GC keeps freeing memory; however, the freed memory (green curve) keeps decreasing due to fragmentation, and eventually compaction is required (red lines). Notice that heap fragmentation can also lead to "dark matter," which cannot be used to satisfy allocation requests: any heap space that lies between two allocated objects and is smaller than 512 bytes is not used by the JVM for allocation requests and is referred to as dark matter.

Figure 13. TelecomApp: GC freed memory (green curve) keeps decreasing due to fragmentation and eventually compaction is required (red lines)

An analysis of the production heap (which was dumped on an out-of-memory error) showed a total of 1.2 GB of memory in use, out of which close to 585 MB of the retained heap was occupied by objects belonging to two classes (Figure 14):

  • com.TelecomApp.cache.impl.CacheEntryImpl
  • com.TelecomApp.mcs.devices.InternalDeviceImpl.

Recall that the retained size of an object is its shallow size plus the shallow sizes of the objects that are accessible, directly or indirectly, only from this object. Table 1 shows the breakdown of the shallow heap by object type. Recall also that the shallow size of an object is the amount of memory allocated to store the object itself, not taking into account the objects it references. You can see that the top three contributors are all related to HashMaps: java.util.HashMap$Entry and java.util.HashMap$Entry[] alone contribute nearly 390 MB (close to 33% of the total heap size). This hints that both com.TelecomApp.cache.impl.CacheEntryImpl and com.TelecomApp.mcs.devices.InternalDeviceImpl are composed largely of HashMaps.

The application development team confirmed that the application uses a lot of internal caches to cache mobile-device-specific policies for faster response times. This explained why the heap usage never went down drastically. It appeared to be an application design issue, since the memory footprint of the application seemed to be larger than what the system could provide. However, the application team maintained that cached entries should automatically expire every 30 minutes or so, and they ruled out a redesign since it would have required considerable effort.

Figure 14. TelecomApp production heap dump break up (retained heap)
Table 1. TelecomApp production heap dump break up (shallow heap)
Count        Total size (bytes)   Type
1,871,075    305,453,504          array of char
8,558,144    273,860,608          java/util/HashMap$Entry
927,421      114,439,696          array of java/util/HashMap$Entry
1,774,002    56,768,064           java/lang/String
639,453      52,688,552           array of java/lang/Object
92,922       49,771,728           array of byte
901,808      43,286,784           java/util/HashMap
899,467      35,978,680           com/TelecomApp/mcs/themes/PropertyValue
109,740      27,300,328           array of int
26,622       16,834,352           array of com/TelecomApp/mcs/themes/StyleValue
514,376      16,460,032           java/util/Hashtable$Entry
329,675      15,824,400           com/TelecomApp/styling/properties/PropertyDetailsImpl
398,769      12,760,608           com/TelecomApp/styling/impl/engine/StylerImpl
251,689      11,886,160           array of com/TelecomApp/mcs/themes/PropertyValue
324,782      10,393,024           com/TelecomApp/mcs/css/version/DefaultCSSProperty
340,836      9,177,496            array of long
95,246       8,393,000            array of java/util/Hashtable$Entry
338,905      8,133,720            java/util/BitSet
491,191      7,859,056            java/util/HashSet
281,817      7,144,296            array of com/TelecomApp/styling/impl/engine/matchers/SimpleMatcher
284,648      6,831,552            com/ibm/ws/cache/Bucket
203,781      6,520,992            javax/servlet/jsp/tagext/TagAttributeInfo
175          5,708,528            array of org/apache/xpath/objects/Xobject
234,155      5,619,720            com/TelecomApp/styling/impl/device/DeviceStylesDelta
221,120      5,306,880            java/util/ArrayList

Verbose GC logs also pointed out that the application was using some large objects, so we analyzed the heap dump for the shallow object sizes. There were only 30,000 objects greater than 1,024 bytes in a total of close to 25 million objects, which constitutes almost 0.12% of the heap in terms of number of objects. However, the cumulative space occupied by these large objects was roughly 222 MB out of the total 1.2 GB (18.5%). The average object size was 51 bytes, while the median size was 32 bytes, which was also the mode. This indicated that the heap comprised a very large number of small objects interspersed throughout the heap between a few large objects. A histogram of the shallow object sizes is given in Figure 15.

Figure 15. Histogram of the object sizes from TelecomApp production heap

To rule out any memory leaks, a long 10-hour run at low and constant load was conducted in the staging environment. The heap usage in the verbose GC log was almost constant throughout the experiment, which ruled out memory leaks. However, the staging environment could not replicate the long pause times observed in the production system, probably because of the lower load offered to the system.

Our analysis so far ruled out any memory leaks and the following characteristics became apparent:

  • The heap was fragmented, which resulted in high compaction times, long periods of sustained high CPU usage, and wasted heap space.
  • The application had a mix of a few large and a lot of small objects.
  • The application had some short-lived objects, but most of the heap seemed to consist of relatively long-lived cached objects.

Garbage collection techniques to resolve these issues have been discussed in the literature. Given the application characteristics above, we considered two solutions:

  1. Use a generational and concurrent collector. This policy (known as the Gencon GC policy in the IBM JDK) is usually recommended for transactional applications that create a lot of short-lived objects. It divides the heap into a nursery space (for new and short-lived objects) and a tenured space (for old objects). Because a nursery collection (minor collection) copies live objects within the nursery, it removes fragmentation there. However, when the tenured space fills up, a global collection occurs and unreachable objects are collected, which can result in fragmentation in the tenured space. Ideally, we would like all the long-lived cached objects to move into the tenured space and leave the nursery only for short-lived objects.
  2. Use a pool of objects for the most frequently created objects. This reduces fragmentation by cutting down on the creation and collection of short-lived objects. However, it requires either creating a pool for frequently created objects in the application code or using the built-in pooling capability in the IBM JVM. A minimal pool sketch is shown after this list.
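
As an illustration of the second option, the following is a minimal object-pool sketch; the pooled type (plain byte buffers) and the sizes are invented for this example, and a real application would pool its own frequently created objects:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal object-pool sketch: frequently used objects are borrowed and
    // returned instead of being allocated and garbage collected per request,
    // reducing the churn of short-lived objects and the resulting fragmentation.
    public class BufferPool {
        private final BlockingQueue<byte[]> pool;
        private final int bufferSize;

        public BufferPool(int poolSize, int bufferSize) {
            this.bufferSize = bufferSize;
            this.pool = new ArrayBlockingQueue<byte[]>(poolSize);
            for (int i = 0; i < poolSize; i++) {
                pool.add(new byte[bufferSize]);
            }
        }

        // Borrow a buffer, falling back to a fresh allocation if the pool is empty.
        public byte[] borrow() {
            byte[] buffer = pool.poll();
            return (buffer != null) ? buffer : new byte[bufferSize];
        }

        // Return a buffer to the pool; it is simply dropped if the pool is full.
        public void release(byte[] buffer) {
            pool.offer(buffer);
        }
    }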
Figure 16. TelecomApp with modified GC policy (Gencon)

We tried both solutions, but realized that the first option was easier to implement on a production system. We first tried the generational and concurrent collector with default settings. The garbage collection policy can be easily specified on the Java command line for WebSphere Application Server using the -Xgcpolicy flag; we selected the generational policy with -Xgcpolicy:gencon. Garbage collection overhead (percentage of time spent in GC pauses) went down from 13% to 1.18%. In the default gencon setting, the size of the nursery heap is set to 55-60 MB and the rest is used as the tenured space. When this setting was put into production, the high pause times went away because no compaction was required (Figure 16). However, a narrow saw-tooth pattern in the used memory and free tenured space (Figure 16) also indicated that the default nursery size was too small. This resulted in too many minor collections and objects being prematurely promoted into the tenured space, which in turn caused the tenured space to fill up quickly and require a global collection.

The saw-tooth pattern revealed that around 400-550 MB of memory was being freed from the tenured space in each global collection. We used this as a yardstick and set the initial nursery size to 256 MB and the maximum nursery size to 512 MB; the initial and maximum nursery sizes can be controlled using the -Xmns and -Xmnx flags, respectively. When this setting was implemented in production, the GC overhead went down further, from 1.18% with the default Gencon settings to 0.8% (Table 2). The problem of sustained high CPU usage (which was being caused by frequent fragmentation and compaction under the original GC settings) was also resolved.
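
For reference, the corresponding settings would be passed as generic JVM arguments (for example, through the WebSphere Application Server admin console); the sizes shown are the values described above:

    -Xgcpolicy:gencon -Xmns256m -Xmnx512m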

Table 2. TelecomApp: performance improvement through GC tuning
GC policy                                   Time spent in GC pauses   Max pause time (seconds)   Mean pause time (seconds)
optthruput GC policy (original)             13%                       11.2                       1.35
Gencon GC policy (default nursery size)     1.18%                     6.26                       0.03
Gencon GC policy (increased nursery size)   0.80%                     1.18                       0.08

This case study highlighted a high CPU usage scenario where the problem was caused by a combination of application design and infrastructure settings (garbage collection policy settings). The application suffered high GC overheads due to heap fragmentation and repeated compaction, and this resulted in high CPU utilization.

This case study also illustrated the complexity associated with performance tuning: optimally tuning GC settings for an enterprise application with an interesting mix of objects (large and small, long-lived and short-lived) can be non-trivial and difficult to automate.


Conclusion

In this article, we presented some of our experiences working with enterprise Java applications that exhibited various performance and scaling problems, and shared the insights we gained while determining the root causes of these problems. We diagnosed an availability problem with an open source benchmark and found the root cause to be a memory leak that occurred due to a usage error combined with a non-compliant JDBC driver implementation. We presented two other case studies from a production telecom application. One involved high CPU utilization caused by code that searched the entire classpath for a properties file; this manifested itself as a high number of objects queued for finalization, created as various JAR files were unzipped. The last case demonstrated various garbage collection issues that enterprise Java applications face; through our analysis we were able to recommend an appropriate GC policy and tune its settings.

Throughout this journey of fixing performance problems, the various tools available in IBM Support Assistant proved to be valuable diagnostic aids, and the analysis of verbose GC logs and heap dumps provided us with vital clues about the probable root causes. Our experience also demonstrated that the underlying causes of such problems tend to be diverse, making it very difficult to provide an end-to-end root cause analysis engine for these kinds of problems.


Acknowledgements

The authors thank Rajeev Palanki and Chris Bailey for their valuable suggestions and guidance.

Resources




