Maximizing Java performance on AIX: Part 2: The need for speed

Amit Mathur (amitmat@us.ibm.com), Senior Technical Consultant and Solutions Enablement Manager, IBM 
Amit Mathur works in the IBM Solutions Development group, working primarily with IBM ISVs on enablement and performance of their applications on IBM eServer platforms, and helping ISVs and customers become self-sufficient through education and articles on developerWorks. Amit has more than fourteen years' experience leading software support and development in C/C++, Java, and databases on UNIX and Linux platforms. He holds a Bachelor of Engineering degree in Electronics and Telecommunication from India. You can reach Amit at amitmat@us.ibm.com.
Sumit Chawla (sumitc@us.ibm.com), IBM Certified IT Architect and Technical Lead, Java Enablement, IBM 
Sumit Chawla leads the Java Enablement initiative for IBM eServer (for AIX, Windows, and Linux platforms), assisting Independent Software Vendors for IBM Servers. Sumit has a Master of Science degree in Computer Science, with almost 10 years of experience in the IT industry, and is certified by IBM as an Application Architect. He is a frequent contributor to the developerWorks eServer zone. You can contact him at sumitc@us.ibm.com.

Summary:  This 5-part series provides several tips and techniques that are commonly used for tuning Java™ applications for optimum performance on AIX®. A discussion of the applicability of each tip is also provided. Using these tips, you should be able to quickly optimize the Java environment to suit your application's needs.

Date:  03 Nov 2007 (Published 29 Mar 2004)
Level:  Intermediate

Introduction

This is the second article in the 5-part series about Java on AIX Performance Tuning. If you have not done so already, we strongly recommend that you review Part 1 of this series before proceeding further.

This article looks at ways to maximize the execution speed and throughput of a system. For programs that involve a user interface, we also look at how to keep the responsiveness of the system within acceptable levels.

You should look at the first section for general tips that apply to most situations. We also provide a quick reference to tools that are useful in CPU bottleneck detection/investigation. The next section describes various types of applications and how they can be tuned. This discussion makes use of your knowledge of the application to decide which tips are best for you. The third section describes the various tips. The article concludes with a look at the next article in the series.


CPU as bottleneck

This article deals with making your application faster, or more responsive, or both.

You can usually find out if the application is slow by comparing the actual and expected performance numbers. Alternatively, an application's user interface may freeze periodically, or network connections to the application may time out because the application is busy. topas or tprof will show whether the CPU is being utilized at 100%. You need to be able to distinguish between abnormal activity and a case of bad sizing; if you need a faster CPU or a bigger machine, there is not much tweaking that can be carried out.

As the first step, you should use topas or other similar tools to see if Java is the biggest user of CPU. If you see that Java is way down in the list of CPU users, doing CPU-specific tuning won't likely be of much use. We provide a quick overview of topas in Part 1.

The ideal case is an application that keeps the CPUs at or above 90% utilization. If you have reached that stage and are still not satisfied with the throughput, you may be using an undersized machine. If you use DLPAR, try adding another CPU or two and measure the difference.

The rest of this section provides a quick introduction to some common tools and shows how to detect Java-specific problems. For more details, please refer to AIX 5L Performance Tools Handbook and Understanding IBM eServer pSeries Performance and Sizing.

vmstat

vmstat can be used to give multiple statistics on the system. For CPU-specific work, try the following command:

vmstat -t 1 3

This will take 3 samples, 1 second apart, with timestamps (-t). You can, of course, change the parameters as you like. The output is shown below:

      kthr     memory             page              faults        cpu        time
      ----- ----------- ------------------------ ------------ ----------- --------
       r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa hr mi se
       0  0 45483   221   0   0   0   0    1   0 224  326 362 24  7 69  0 15:10:22
       0  0 45483   220   0   0   0   0    0   0 159   83  53  1  1 98  0 15:10:23
       2  0 45483   220   0   0   0   0    0   0 145  115  46  0  9 90  1 15:10:24


In this output some of the things to watch for are:

  • If columns r (run queue) and b (blocked) start going up, especially above 10, this usually is an indication that you have too many processes competing for CPU.
  • If cs (context switches) goes very high compared to the number of processes, then you may need to tune the system with vmtune. This topic is beyond the scope of the current article series.
  • In the cpu section, us (user time) indicates the time being spent in programs. If us is high and Java is at the top of the list in tprof, then you need to tune the Java application.
  • In the cpu section, if sy (system time) is higher than expected, and you still have id (idle) time left, this may indicate lock contention. Check the tprof output for lock-related calls in the kernel time. You may want to try multiple instances of the JVM. It may also be possible to find deadlocks in a javacore file.
  • In the cpu section, if wa (I/O wait) is high, this may indicate a disk bottleneck, and you should use iostat and other tools to look at the disk usage.
  • Non-zero values in the pi and po (page in/out) columns may indicate that you are paging and need more memory. It may be that you have the stack size set too high for some of your JVM instances. It could also mean that you have allocated a heap larger than the amount of memory on the system. Of course, other applications may also be using memory, or file pages may be taking up too much of the memory.
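
The checks in the list above can be scripted. The following is a minimal sketch rather than a standard tool: the function name check_vmstat, the RQ_MAX and WA_MAX variables, and the 20% I/O-wait threshold are assumptions chosen for illustration. The field positions assume the column layout of the sample output shown earlier; other AIX levels or partition types may add columns, in which case the field numbers need adjusting.

```shell
# check_vmstat: read `vmstat -t` output on stdin and print one warning line
# for each suspicious condition found in a sample.
# Assumed field layout (from the sample above):
#   1=r 2=b 6=pi 7=po 17=wa 18=timestamp
check_vmstat() {
  awk -v rq_max="${RQ_MAX:-10}" -v wa_max="${WA_MAX:-20}" '
    # Data rows start with a number; header and separator rows do not.
    $1 ~ /^[0-9]+$/ && NF >= 17 {
      ts = (NF >= 18) ? $18 : "sample " NR
      if ($1 > rq_max || $2 > rq_max)
        print ts ": run/blocked queue high (r=" $1 ", b=" $2 ")"
      if ($6 > 0 || $7 > 0)
        print ts ": paging activity (pi=" $6 ", po=" $7 ")"
      if ($17 > wa_max)
        print ts ": high I/O wait (wa=" $17 "%)"
    }
  '
}
```

A typical use would be to pipe the command into the function, for example: vmstat -t 1 3 | check_vmstat. No output means none of the three conditions was seen in the sampled interval.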

iostat

You can use iostat to get the same CPU information that vmstat provides, along with disk I/O etc.

ps

ps is a very flexible tool for identifying the programs that are running on the system and the resources they are using. It displays statistics and status information about processes on the system, such as process or thread ID, I/O activity, CPU and memory utilization.

ps -ef | grep java

This will allow you to find out all active Java process IDs. Many other commands require you to find the process ID first; using -ef will help you distinguish between multiple Java processes by showing their command-line arguments.

 ps -p PID -m -o THREAD

Using the PID (process ID) of the Java process you are interested in, you can examine how many threads are created. This is especially useful for cases when you want to monitor a large application; you can pipe the above output through wc -l to obtain the number of threads being created by the JVM. This can be done in a loop so that you can detect if some threads are starting or dying when they shouldn't.
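
The loop described above might look like the following sketch. The function names count_threads and watch_threads and the default 5-second interval are assumptions for illustration. Instead of a plain wc -l, the sketch counts only the per-thread rows, assuming the AIX layout in which thread rows show "-" in the PID column; if your ps output differs, fall back to wc -l and subtract the header lines.

```shell
# count_threads: count per-thread rows in `ps -p PID -m -o THREAD` output
# read from stdin. Assumes thread rows carry "-" in the PID column (field 2).
count_threads() {
  awk 'NR > 1 && $2 == "-" { n++ } END { print n + 0 }'
}

# watch_threads PID [INTERVAL]: print a timestamped thread count every
# INTERVAL seconds (default 5) for as long as the process can be signalled.
watch_threads() {
  pid=$1
  interval=${2:-5}
  while kill -0 "$pid" 2>/dev/null; do
    echo "$(date +%H:%M:%S) $(ps -p "$pid" -m -o THREAD | count_threads)"
    sleep "$interval"
  done
}
```

Running watch_threads with the JVM's process ID and redirecting the output to a file gives a simple record in which an unexpected rise or drop in the thread count is easy to spot. Note that kill -0 only succeeds for processes you are allowed to signal.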

ps au[x]

Useful for getting the %CPU and %Memory numbers, sorted by top users. This is useful for quickly locating the bottlenecks on the system.

ps v[g]

Shows the virtual memory usage. Note that the preferred way to monitor native and Java heaps is through svmon. This is explained in detail in Part 3 of this series.

 ps eww PID

Using the PID (process ID), you can get an output of the environment settings of the process. This would, for example, show the complete file path of Java being executed, which may not show up in a normal ps listing. Note that in order to obtain a complete environment listing, it is recommended to create a javadump file instead (see IBM developer kits - diagnosis documentation for details).

sar

sar -u -P ALL x y (where x is the interval in seconds and y is the number of samples) can be used to check the balance of CPU usage across multiple CPUs. If the distribution is not balanced, it may indicate that your application is not threaded and you may need to create multiple instances of the application. In the example below, two samples are taken five seconds apart on a 2-processor system that is about 80% utilized.


        # sar -u -P ALL 5 2

        AIX aix4prt 0 5 000544144C00    02/09/01

        15:29:32 cpu    %usr    %sys    %wio   %idle
        15:29:37  0       34      46       0      20
                  1       32      47       0      21
                  -       33      47       0      20
        15:29:42  0       31      48       0      21
                  1       35      42       0      22
                  -       33      45       0      22

        Average   0       32      47       0      20
                  1       34      45       0      22
                  -       33      46       0      21
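
To put a number on how balanced the CPUs are, the Average block of the sar output can be reduced to a single figure. This is a minimal sketch under stated assumptions: the function name cpu_spread is hypothetical, and the parsing assumes %idle is the last column, as in the sample above. "Busy" is taken to be 100 minus %idle.

```shell
# cpu_spread: read `sar -u -P ALL` output on stdin and print the difference,
# in percentage points, between the busiest and the least busy CPU in the
# Average block. The "-" row (all CPUs combined) is ignored.
cpu_spread() {
  awk '
    # First row of the Average block: the CPU id is in field 2.
    $1 == "Average" { in_avg = 1; cpu = $2; idle = $NF }
    # Remaining rows of the block: the CPU id is in field 1.
    in_avg && $1 != "Average" && NF >= 5 { cpu = $1; idle = $NF }
    in_avg && NF >= 5 && cpu != "-" {
      busy = 100 - idle
      if (n == 0 || busy > max) max = busy
      if (n == 0 || busy < min) min = busy
      n++
    }
    END { if (n > 0) print max - min }
  '
}
```

For the sample above the averages are 80% and 78% busy, so the spread is 2 points, which indicates a well-balanced load. A large spread, for example one CPU near 100% and the others near idle, is the pattern that suggests a single-threaded workload.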


You may also see all CPUs being utilized at 100% (when you are running into long Mark cycles), or just one CPU being utilized at 100% (when the JVM is doing compaction). That would indicate that you need to do verbosegc tuning; see Fine-tuning Java garbage collection performance.

tprof

tprof is one of the AIX legacy tools that provides a detailed profile of CPU usage for every AIX process ID and name. It has been completely rewritten for AIX 5.2, and the example below uses the AIX 5.1 syntax. You should refer to AIX 5.2 Performance Tools update: Part 3 for the new syntax.

The simplest way to invoke this command is to use:

 # tprof -kse -x "sleep 10"

At the end of ten seconds, a new file __prof.all is generated that contains information about what commands are using CPU on the system. Searching for FREQ, the information looks something like the example below:

              Process   FREQ  Total Kernel   User Shared  Other
              =======    ===  ===== ======   ==== ======  =====
               oracle    244  10635   3515   6897    223      0
                 java    247   3970    617      0   2062   1291
                 wait     16   1515   1515      0      0      0
    ...
              =======    ===  ===== ======   ==== ======  =====
                Total   1060  19577   7947   7252   3087   1291



This example shows that over half the CPU time is associated with the oracle application and that Java is using about 3970/19577 or 1/5 of the CPU. The wait usually means idle time, but can also include the I/O wait portion of the CPU usage.
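
The percentages quoted above can be computed directly from the summary. The following sketch assumes the input contains only the FREQ summary section in the seven-column layout shown; the function name tprof_share is hypothetical, and a full __prof.all file would first need to be trimmed to that section.

```shell
# tprof_share: read the process summary of a tprof report on stdin and print
# each process name with its percentage of the total ticks.
tprof_share() {
  awk '
    # Summary rows: a name followed by six numeric columns
    # (FREQ Total Kernel User Shared Other).
    NF == 7 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
      if ($1 == "Total") total = $3
      else { name[++n] = $1; ticks[n] = $3 }
    }
    END {
      if (total > 0)
        for (i = 1; i <= n; i++)
          printf "%s %.1f%%\n", name[i], 100 * ticks[i] / total
    }
  '
}
```

Applied to the sample above, this reports oracle at 54.3%, java at 20.3%, and wait at 7.7% of the total ticks.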

To see if a lot of lock contention is present you should also examine the KERNEL section:

       Total Ticks For All Processes (KERNEL) = 7787

     Subroutine                Ticks  %   Source   Address Bytes
     =============             ===== ==== ======== ======== ======
     .unlock_enable_mem         2286 11.7 low.s    930c     1f4
     .waitproc_find_run_queue   1372  7.0 ../../../../../src/bos/kernel/proc/dispatch.c 2a6ec    2b0
     .e_block_thread             893  4.6 ../../../../../src/bos/kernel/proc/sleep2.

In the Shared Objects section, look for libjvm.a, and specifically for gc_* or names that are close to any of the GC phases (Mark, Sweep, Compact). If these account for a large share of the ticks, the JVM process may need GC tuning.

You should also look for subroutines that account for a large percentage of Ticks. For example, one tprof output shows that the value of clProgramCounter2Method was quite high:


    Subroutine                Ticks  %   Source   Address Bytes
     =============             ===== ==== ======== ======== ======
     .clProgramCounter2Method   3551 14.8 /userlvl/ca131/src/jvm/sov/cl/clloadercache.c

After a review of several such examples, it was discovered that removing Throwable.printStackTrace calls made significant performance improvements. The investigation that led to this particular method was started by analyzing tprof output.

Java-specific tips

In almost all cases (see the tips for exceptions), the JIT compiler must be switched on, since the performance difference can be as large as that between interpreting bytecode and executing native code. The JIT can provide up to a 25x improvement, so it is a critical performance component for Java.

Garbage collection is another crucial performance component, so it must be examined and tweaked as needed. Although enabling the GC traces (using -verbosegc) has a slight negative impact, the advantage of being able to monitor and analyze the heap outweighs that cost. Another way of looking at it is that a healthy heap minimizes the amount of information printed through -verbosegc, so by tweaking the heap you also minimize the cost of the additional tracing.


Characteristics-based tuning tips

Now we look at various characteristics of typical applications. You should locate the behavior that resembles that of your application (either by design or through observation) and apply the corresponding tips.

Longevity of application

IBM Java is designed to provide better out-of-the-box characteristics for long-running applications such as server code. If, for some reason, you are running a testcase that lasts less than 5 minutes or so, you may find that the preparation IBM Java does to get ready for the long haul affects the startup time. If a quick start is more important to your application than a long run, look at tips CPU001 : Quick start your application and CPU004 : Get rid of GC completely. If CPU004 : Get rid of GC completely does not work for you, CPU012 : Avoid heap resizing may be considered instead. In severe cases, where your testcase is so short that even JIT initialization is too expensive, you may want to test with the JIT switched off. Disabling the JIT is not listed as a standalone performance tip because, as mentioned in the last article, it is possibly the worst thing you can do to your application's performance.

If your application can afford a slight delay in startup time, you should see tips CPU003 : Compile everything at first touch and CPU008 : Use a small heap. For long-running applications that have clear "initialization" and "run" phases, CPU003 : Compile everything at first touch is very handy.

Level of interaction

Depending on whether your code is computation-intensive or not, the responsiveness of the JVM can range from critical to irrelevant. If the JVM you're trying to tune is running a GUI, long GC pauses would be unacceptable. On the other hand, if you're running multiple JVM instances that share the load, or if batch-mode processing is being carried out, a long pause may be acceptable.

For applications that cannot afford long pause times, see CPU002 : Use Concurrent GC, CPU004 : Get rid of GC completely, CPU007 : Disable explicit System.gc() calls, CPU008 : Use a small heap, CPU009 : Remove Mark Stack Overflow and CPU012 : Avoid Heap resizing. CPU004 would be applicable only to short-running applications in most circumstances. Note that CPU008 must be looked at in conjunction with the memory characteristics of the application, as it may end up having the opposite effect if not applied correctly.

For applications that can afford longer pauses, CPU003 : Compile everything at first touch should be considered. Note that having long pauses is a bad thing in most cases, so even if your application can afford it, you should look at and correct the problem since you do not gain any advantage by having a misconfigured JVM instance.

CPU consumption

If you're running an application with more threads than the number of installed CPUs, and it is normal for you to observe that the overall CPU utilization remains at 90% or higher, any kind of background processing will hurt the throughput of your application. On the other hand, if your application is a server whose threads sleep most of the time, waking up only to service incoming requests, you may be able to diminish the effect of a long GC pause by using background processing.

For applications that are CPU-intensive and hence would like to minimize background processing, consider CPU007 : Disable explicit System.gc() calls, CPU008 : Use a small heap and CPU009 : Remove Mark Stack Overflow. You should consider CPU008 : Use a small heap in conjunction with memory characteristics, as mentioned before.

For applications that are not CPU-intensive, CPU002 : Use Concurrent GC is highly recommended. It reduces the overall pause time when a GC cycle hits.

Well-defined locality of reference

If your application has a few methods that get executed very often, while other methods get executed rarely, CPU003 : Compile everything at first touch would be a very good performance enhancer.

Degree of parallelism

If your application runs multiple threads to get the work done, it will benefit from a system that has a large number of CPUs. For a Dynamic Partition, adding more CPUs will show benefits immediately, as Java threads can be immediately scheduled to newly added CPUs. CPU005 : Large Number of Threads, CPU006 : Reduce Lock contention and CPU011 : More than 24 CPU systems discuss other optimizations you can try.

But if your application has a single thread of execution, you will be limited by the processing power of a single CPU. In this case, you may want to try CPU002 : Use Concurrent GC and CPU010 : Single-CPU systems. CPU010 : Single-CPU systems is especially helpful if you are attempting to run multiple JVM instances on a system (for example, in a clustered environment).


General collection of tips

The text below refers to command-line arguments to Java (specified before the class/jar file names) as "switches". For example, the line "java -mx2g hello" has a single switch, "-mx2g".

Tip CPU001 : Quick start your application

The non-standard switch -Xquickstart can be used to reduce the startup time of your application. This switch reduces the level of JIT optimizations to a minimum, and re-applies them only if the applicable methods become hot again. The result, for applications where execution is not concentrated into a small number of methods, is a much quicker startup.

Note: Due to the multi-stage optimization approach, this switch may have an adverse effect on long-running applications.

Tip CPU002 : Use concurrent GC

The Concurrent Mark Garbage Collection Policy can be specified in order to reduce the length of the pause induced by a GC cycle. It is specified using the -Xgcpolicy:optavgpause switch.

Note: In some cases, CPU-intensive applications may see a decrease in throughput with concurrent mark.

Tip CPU003 : Compile everything (or selected methods) at first touch

The environment variable IBM_MIXED_MODE_THRESHOLD can be set to 0, switching the Mixed-Mode interpreter off. The result is that all methods get JIT-compiled the first time they get invoked. Add this line to your environment settings, or simply run it prior to launching Java:

export IBM_MIXED_MODE_THRESHOLD=0

You can also experiment with non-zero values, to see if a particular MMI threshold gives you better performance than zero. For AIX, Java 1.3.1 uses 600 as the threshold value, while Java 1.4 uses a value greater than 1000 (note that these values are subject to change). The IBM developer kits - diagnosis documentation has a section "Selecting the MMI Threshold" under the chapter "JIT Diagnostics" that provides more information.

If there are only certain classes you wish to affect, you can use JITC_COMPILEOPT=FORCE(0){classname}{methodname} instead. Examples:

export JITC_COMPILEOPT=FORCE(0){com/myapp/*}{*}

This example compiles all methods of all classes within com.myapp.* package on first load.

export JITC_COMPILEOPT=FORCE(0){*}{uniqueName}

This example compiles all methods called "uniqueName" when first loaded.

export JITC_COMPILEOPT=FORCE(0){com/myapp/special}{SpecialMethod}

This example compiles only this particular method on first load. Along with * (which stands for 0 or more characters), you can also use '?' as wildcard for single characters.

Multiple classes and/or methods can be specified using the following syntax:

export JITC_COMPILEOPT=FORCE(0){class1}{method1}{class2}{method2}

Make sure you document clearly that this is an optimization, not a fix!

Note: The start-up time of applications may increase due to this setting.

Tip CPU004 : Get rid of GC completely

The startup and maximum heap sizes can be set to very large values, so that no allocation failures occur during the run. You should enable verbosegc for these runs to ensure that the strategy is working!

Note: When GC occurs, the cycle will likely be quite long, so this can be used in only very rare cases.

Tip CPU005 : Large number of threads

For scaling to a larger number of threads, you should use the -Xss switch to specify a value smaller than the default (normally 512 K, but may vary based on Java version). This will allow you to scale to a much larger number of threads, while reducing the native memory footprint of your application.

Note: If the stack size is too small, you may get Stack Overflow exceptions.
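
A rough calculation shows why the stack size matters as thread counts grow. The sketch below simply multiplies the thread count by the per-thread stack size; the function name stack_reserve_mb and the thread counts used in the example are hypothetical, and the result is only an estimate, since the memory actually committed depends on how the JVM and AIX allocate stacks.

```shell
# stack_reserve_mb THREADS STACK_KB: estimate, in MB, of the native memory
# set aside for thread stacks (number of threads x per-thread stack size).
stack_reserve_mb() {
  echo $(( $1 * $2 / 1024 ))
}
```

For example, 2000 threads at the 512 K default come to about 1000 MB (stack_reserve_mb 2000 512), while the same threads with -Xss256k come to about 500 MB (stack_reserve_mb 2000 256).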

Tip CPU006 : Reduce lock contention

You can try running multiple Java instances, if your application architecture allows it, to reduce lock contention. Application servers often facilitate this kind of configuration; for example, WebSphere allows you to use multiple Nodes on the same physical computer.

Note: This may only mask the problem; you should revisit the parts of your code that are causing excessive lock contention. You can use tprof or Java profiling to locate the areas that need to be revisited.

Tip CPU007 : Disable explicit System.gc() calls

Using a non-standard switch, -Xdisableexplicitgc, you can avoid having to remove the System.gc() calls from your code. Disabling these calls returns GC management to the JVM.

Note: If System.gc() calls are required through functionality (for example, via a button on the application screen), this would be a bad idea since the button will become non-functional. There may be other, legitimate reasons why calls to System.gc() could be present in the code.
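
Before disabling explicit GC, it helps to know where the calls are and why they exist. The following sketch lists them in a source tree; the function name find_explicit_gc is hypothetical, and it only finds calls in .java source files you have on disk, not calls made by third-party libraries shipped as class or jar files.

```shell
# find_explicit_gc DIR: print file:line:text for every System.gc() call
# found in the .java files under DIR.
find_explicit_gc() {
  # /dev/null is passed as an extra file so grep always prints the file name.
  find "$1" -name '*.java' -exec grep -n 'System\.gc[[:space:]]*(' /dev/null {} \;
}
```

Each hit can then be reviewed to decide whether the call is functionally required before -Xdisableexplicitgc is applied.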

Tip CPU008 : Use a small heap

Use a heap size that never allows compaction times to become intolerable. If for some reason your application ends up causing a lot of compaction, having a 256 MB heap is going to take much less time compacting than a 1 GB heap.

Note: If more compactions are triggered as a result of smaller heap, this optimization will backfire. This tip can be used only in cases where a lot of temporary objects are being created.

Tip CPU009 : Remove mark stack overflow

If you observe Mark Stack Overflow messages in verbosegc logs, reduce the number of objects being kept live in the heap so that these messages go away. Newer builds of Java have a much better handling of MSO. This is being included here since MSO can seriously damage the performance of your application and must be treated as a defect rather than an optimization.

Tip CPU010 : Single-CPU systems

You can use the bindprocessor command to tie the Java process to a particular processor. This can be considered to avoid multiple JVM instances fighting for CPU scheduling. You may also want to set -Xgcthreads0 if the system is not a uniprocessor box.

If you are running your application on a 1-CPU LPAR that will not be reconfigured to add more CPUs dynamically, you can also export NO_LPAR_RECONFIGURATION=1 to get better performance in certain cases.

Note: You are disabling the best performance features of Java by forcing it to run in a single-CPU configuration. NO_LPAR_RECONFIGURATION will also disable the ability of Java to adapt dynamically to DLPAR changes, so it should be used with caution.

Tip CPU011 : More than 24 CPU systems

For 24- to 32-way systems, you should test with -Xgcpolicy:subpool since this GC policy is tuned to deliver better performance for larger configurations.

Tip CPU012 : Avoid heap resizing

You can keep a fixed size heap to avoid the time being spent resizing the heap when the free space percentage falls below (or goes above) a certain value. See Fine-tuning Java garbage collection performance for details.

Note: The memory footprint of the application will remain at the specified heap size, even if the heap usage is at 10% of its maximum level.
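
One common way to keep the heap at a fixed size is to set the initial and maximum heap sizes to the same value. The sketch below assumes the standard -Xms and -Xmx switches; the function name fixed_heap_flags, the 512m value, and the MyServer class are hypothetical and used only for illustration.

```shell
# fixed_heap_flags SIZE: print JVM switches that pin the heap at SIZE by
# setting the initial (-Xms) and maximum (-Xmx) heap sizes to the same value.
fixed_heap_flags() {
  echo "-Xms$1 -Xmx$1"
}

# Example (MyServer is a hypothetical main class):
#   java $(fixed_heap_flags 512m) -verbosegc MyServer
```

Keeping -verbosegc enabled, as in the commented example, lets you confirm from the GC trace that the heap is no longer being expanded or shrunk.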


Summary

This article showed how to use AIX tools for Java performance monitoring, and provided a list of common tweaks that can be applied to optimize the application's CPU usage. The next article in the series talks about Memory tweaking for Java applications on AIX.

