This is the second article in the 5-part series about Java on AIX Performance Tuning. If you have not done so already, we strongly recommend that you review Part 1 of this series before proceeding further.
This article looks at ways to maximize the execution speed and throughput of a system. For programs that involve a user interface, we also look at how to keep the responsiveness of the system within acceptable levels.
You should look at the first section for general tips that apply to most situations. We also provide a quick reference to tools that are useful in CPU bottleneck detection/investigation. The next section describes various types of applications and how they can be tuned. This discussion makes use of your knowledge of the application to decide which tips are best for you. The third section describes the various tips. The article concludes with a look at the next article in the series.
This article deals with making your application faster, or more responsive, or both.
You can usually find out if the application is slow by comparing the actual
and expected performance numbers. Alternatively, an application's user interface
may freeze periodically, or network connections to the application may time out
because the application is busy. Tools such as topas or tprof show whether the CPU is
being utilized at 100%. You need to be able to distinguish between
abnormal activity and a case of bad sizing; if you need a faster CPU or a bigger
machine, there is not much that tweaking can achieve.
As the first step, you should use topas or other similar tools to see if Java
is the biggest user of CPU. If you see that Java is way down in the list of CPU
users, CPU-specific tuning is unlikely to be of much use. We provide a quick
overview of topas in Part 1.
The ideal case would be an application that keeps the CPUs at or above 90% utilization. If you have reached that stage and still are not satisfied with the throughput, you may be using an undersized machine. If you use DLPAR, try adding another CPU or two and measure the difference.
The rest of this section provides a quick introduction to some common tools, and shows how to detect Java-specific problems. For more details, please refer to AIX 5L Performance Tools Handbook and Understanding IBM eServer pSeries Performance and Sizing.
vmstat can be used to give multiple statistics
on the system. For CPU-specific work, try the following command:
vmstat -t 1 3
This will take 3 samples, 1 second apart, with timestamps (-t). You can, of
course, change the parameters as you like. The output is shown below:
kthr    memory              page              faults        cpu       time
----- ----------- ------------------------ ------------ ----------- --------
 r  b   avm   fre   re  pi  po  fr  sr  cy   in  sy  cs us sy id wa hr mi se
 0  0 45483   221    0   0   0   0   1   0  224 326 362 24  7 69  0 15:10:22
 0  0 45483   220    0   0   0   0   0   0  159  83  53  1  1 98  0 15:10:23
 2  0 45483   220    0   0   0   0   0   0  145 115  46  0  9 90  1 15:10:24
In this output, some of the things to watch for are:
- Columns r (run queue) and b (blocked) going up, especially above 10. This usually indicates that you have too many processes competing for CPU.
- If cs (context switches) goes very high compared to the number of processes, you may need to tune the system with vmtune. This topic is beyond the scope of the current article series.
- In the cpu section, us (user time) indicates the time being spent in programs. Assuming Java is at the top of the list in tprof, you need to tune the Java application.
- In the cpu section, if sy (system time) is higher than expected, and you still have id (idle) time left, this may indicate lock contention. Check the tprof output for lock-related calls in the kernel time. You may want to try multiple instances of the JVM. It may also be possible to find deadlocks in a javacore file.
- In the cpu section, if wa (I/O wait) is high, this may indicate a disk bottleneck, and you should use iostat and other tools to look at the disk usage.
- Non-zero values in the pi and po (page in/out) columns may indicate that you are paging and need more memory. It may be that you have the stack size set too high for some of your JVM instances, or that you have allocated a heap larger than the amount of memory on the system. Of course, other applications may also be using memory, or file pages may be taking up too much of it.
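Some of the checks in this list can be scripted. The following is a minimal sketch rather than a standard tool: it assumes the 18-column vmstat -t layout shown in the sample output (r and b in columns 1 and 2, pi and po in columns 6 and 7, wa in column 17). The function name vmstat_flags and the 25% I/O wait threshold are illustrative assumptions; only the run queue threshold of 10 comes from the text.

```shell
# vmstat_flags: read `vmstat -t` output on stdin and print a warning for
# each sample that matches one of the symptoms described in the list.
# Assumes the column order of the sample output (18 fields per sample).
# The wa threshold of 25 is an arbitrary illustration, not an AIX default.
vmstat_flags() {
    awk '
        # Data rows start with a number; header and separator rows do not.
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            ts = (NF >= 18) ? $18 : "-"
            if ($1 > 10 || $2 > 10)
                print ts, "run/blocked queue high: r=" $1, "b=" $2
            if ($6 > 0 || $7 > 0)
                print ts, "paging: pi=" $6, "po=" $7
            if ($17 > 25)
                print ts, "I/O wait high: wa=" $17
        }
    '
}

# Usage (on AIX): vmstat -t 1 3 | vmstat_flags
```

A sample that triggers no rule produces no output, so an empty result means none of these symptoms were seen during the sampling window.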
You can use iostat to get the same CPU information that
vmstat provides, along with disk I/O etc.
ps is a very flexible tool for identifying the programs that are running on the
system and the resources they are using. It displays statistics and status information about processes on the system, such as process or thread ID, I/O activity, CPU and memory utilization.
ps -ef | grep java
This will allow you to find out all active Java process IDs. Many other commands require you to find the process ID first; using -ef will help you distinguish between multiple Java processes by showing their
command-line arguments.
ps -p PID -m -o THREAD
Using the PID (process ID) of the Java process you are interested in, you can examine how many threads are created. This is especially useful for cases when you want to monitor a large application; you
can pipe the above output through wc -l to obtain the number of threads being
created by the JVM. This can be done in a loop so that you can detect if some
threads are starting or dying when they shouldn't.
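The monitoring loop described here might look like the following sketch. The function name watch_threads and its arguments are illustrative assumptions. The count includes any non-thread lines that ps prints (such as the header), so it is the change between samples that matters, not the absolute value.

```shell
# watch_threads PID [INTERVAL] [SAMPLES]
# Print a timestamp and the line count of the ps thread listing, once per
# interval. The count includes the non-thread lines ps prints (such as the
# header), so compare successive values rather than reading them as exact.
watch_threads() {
    pid=$1
    interval=${2:-5}
    samples=${3:-12}
    i=0
    while [ "$i" -lt "$samples" ]; do
        count=$(ps -p "$pid" -m -o THREAD | wc -l | tr -d ' ')
        echo "$(date +%H:%M:%S) $count"
        i=$((i + 1))
        # Skip the sleep after the final sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Usage (on AIX): watch_threads 12345 10 6
```

A count that keeps climbing between samples suggests threads are being created and never released; a sudden drop suggests threads are dying.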
ps au[x]
Useful for getting the %CPU and %Memory numbers, sorted by top users. This is useful for quickly locating the bottlenecks on the system.
ps v[g]
Shows the virtual memory usage. Note that the preferred way to monitor native and Java heaps is through svmon. This is explained in detail in Part 3 of this series.
ps eww PID
Using the PID (process ID), you can get an output of the environment settings of the process. This would, for example, show the complete file path of Java being executed, which may not show up in a normal ps listing. Note that in order to obtain a complete environment listing, it is recommended to create a javadump file instead (see IBM developer kits - diagnosis documentation for details).
sar -u -P ALL x y (where x is the interval in seconds and y is the number of samples) can be used to check the balance of CPU usage across multiple CPUs. If the
distribution is not balanced, it may indicate that your application is not multithreaded and you may need to create multiple instances of the application. In the example below, two samples are taken five seconds apart on
a 2-processor system that is 80% utilized.
# sar -u -P ALL 5 2
AIX aix4prt 0 5 000544144C00 02/09/01
15:29:32  cpu  %usr  %sys  %wio  %idle
15:29:37    0    34    46     0     20
            1    32    47     0     21
            -    33    47     0     20
15:29:42    0    31    48     0     21
            1    35    42     0     22
            -    33    45     0     22
Average     0    32    47     0     20
            1    34    45     0     22
            -    33    46     0     21
You may also see all CPUs being utilized at 100% (when you are running into long Mark cycles), or just one CPU being utilized at 100% (when the JVM is doing compaction). That would indicate that you need to do verbosegc tuning; see Fine-tuning Java garbage collection performance.
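To put a number on how balanced the CPUs are, you can summarize the Average rows of the sar output. The sketch below is one possible approach and assumes the layout of the sample above: an "Average" row for the first CPU, one row for each remaining CPU, and a "-" row for the system-wide figure. The function name sar_spread is hypothetical, and busy time is taken to be %usr + %sys.

```shell
# sar_spread: read `sar -u -P ALL` output on stdin and report how evenly
# the average load is spread across CPUs. Busy time is taken as %usr + %sys.
# Assumes an "Average" row for the first CPU, followed by one row per
# remaining CPU and a "-" row for the system-wide total.
sar_spread() {
    awk '
        $1 == "Average"           { in_avg = 1; cpu = $2; usr = $3; sys = $4 }
        $1 != "Average" && in_avg { cpu = $1; usr = $2; sys = $3 }
        in_avg && cpu ~ /^[0-9]+$/ {
            busy = usr + sys
            print "cpu " cpu " busy " busy "%"
            if (n == 0 || busy < min) min = busy
            if (n == 0 || busy > max) max = busy
            n++
        }
        END { if (n > 0) print "spread " (max - min) " points across " n " CPUs" }
    '
}

# Usage (on AIX): sar -u -P ALL 5 2 | sar_spread
```

For the sample output above, both CPUs average 79% busy, so the spread is 0 points. A large spread, with one CPU busy and the others mostly idle, is the unbalanced pattern described earlier.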
tprof is one of the AIX
legacy tools that provides a detailed profile of CPU usage for every AIX process
ID and name. It has been completely rewritten for AIX 5.2; the example below
uses the AIX 5.1 syntax. You should refer to AIX 5.2 Performance Tools update: Part 3 for the new syntax.
The simplest way to invoke this command is to use:
# tprof -kse -x "sleep 10"
At the end of ten seconds, a new file
__prof.all is generated that contains information
about what commands are using CPU on the system. Searching for FREQ,
the information looks something like the example below:
Process    FREQ  Total  Kernel   User  Shared  Other
=======     ===  =====  ======   ====  ======  =====
oracle      244  10635    3515   6897     223      0
java        247   3970     617      0    2062   1291
wait         16   1515    1515      0       0      0
...
=======     ===  =====  ======   ====  ======  =====
Total      1060  19577    7947   7252    3087   1291
This example shows that over half
the CPU time is associated with the oracle application and that Java is using
about 3970/19577, or roughly
1/5, of the CPU. The wait process usually means idle time, but can also include the I/O
wait portion of the CPU usage.
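The share calculation can be automated. The following sketch assumes the summary layout shown above: a header row with FREQ in the second column, the process name in column 1, total ticks in column 3, and a closing row that starts with "Total". The function name tprof_share is hypothetical.

```shell
# tprof_share PROCESS: read a tprof report on stdin and print PROCESS's
# share of total ticks, as a percentage, from the per-process summary.
# Only rows between the "FREQ" header and the "Total" row are considered,
# so other sections of the report that mention the process are ignored.
tprof_share() {
    awk -v proc="$1" '
        $2 == "FREQ"                { in_summary = 1; next }
        in_summary && $1 == proc    { ticks = $3 }
        in_summary && $1 == "Total" { total = $3; in_summary = 0 }
        END { if (total > 0) printf "%.1f\n", 100 * ticks / total }
    '
}

# Usage: tprof_share java < __prof.all
```

With the figures from the example, this prints 20.3 for java and 54.3 for oracle.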
To see if a lot of lock contention is present you should also examine the KERNEL section:
Total Ticks For All Processes (KERNEL) = 7787
Subroutine                 Ticks    %   Source                                          Address  Bytes
=============              =====  ====  ========                                        ======== ======
.unlock_enable_mem          2286  11.7  low.s                                           930c     1f4
.waitproc_find_run_queue    1372   7.0  ../../../../../src/bos/kernel/proc/dispatch.c   2a6ec    2b0
.e_block_thread              893   4.6  ../../../../../src/bos/kernel/proc/sleep2.
In the Shared Objects section, look for
libjvm.a and specifically for gc_* or names that
are close to any of the GC phases (Mark, Sweep, Compact). If these account for a
large share of the ticks, the JVM process may need GC tuning.
You should also look for subroutines that account for a
large percentage of Ticks. For example, one tprof
output shows that the value of clProgramCounter2Method was quite high:
Subroutine                 Ticks    %   Source                                          Address  Bytes
=============              =====  ====  ========                                        ======== ======
.clProgramCounter2Method    3551  14.8  /userlvl/ca131/src/jvm/sov/cl/clloadercache.c
After a review of several such examples, it was discovered
that removing Throwable.printStackTrace calls made significant performance
improvements. The investigation that led to this particular method was started
by analyzing tprof output.
In almost all cases (see tips for exceptions), the JIT compiler must be switched on, since the performance difference can be equivalent to that between executing bytecode and executing native code. The JIT can provide up to 25x improvement, so it is a critical performance component for Java.
Garbage Collection is another crucial performance component, so it must
be examined and tweaked as needed. Note that though enabling the GC traces
(using -verbosegc) has a slight negative impact, the advantage of being able to
monitor and analyze the heap outweighs the negative impact. Another way of
looking at it is that a healthy heap would minimize the amount of information
being printed through -verbosegc, so by tweaking the heap you can minimize the
cost of additional tracing as well.
Characteristics-based tuning tips
Now we look at various characteristics of typical applications. You should locate the behavior that resembles that of your application (either by design or through observation) and apply the corresponding tips.
IBM Java is designed to provide better out-of-the-box characteristics for long-running applications such as server code. If, for some reason, you are trying to run a test case that lasts less than 5 minutes or so, you may find that the preparation that IBM Java does to get ready for the long haul affects the startup time. Look at tips CPU001: Quick start your application and CPU004: Get rid of GC completely if a quick start is more important to your application than a long run. If CPU004 : Get rid of GC completely does not work for you, CPU012 : Avoid heap resizing may be considered instead. In severe cases, you may want to test with the JIT switched off, if your test case is so short that even JIT initialization is too expensive. Note that disabling the JIT is not listed as a standalone performance tip since, as mentioned in the last article, it is possibly the worst thing you can do to your application's performance.
If your application can afford a slight delay in startup time, you should see tips CPU003 : Compile everything at first touch and CPU008 : Use a small heap. For long-running applications that have clear "initialization" and "run" phases, CPU003 : Compile everything at first touch is very handy.
Based on whether your code is computation-intensive or not, the responsiveness of the JVM can range from critical to irrelevant. If the JVM you're trying to tune is running a GUI, long GC pauses would be unacceptable. At the same time if you're running multiple JVM instances that allow load share, or if a batch-mode processing is being carried out, a long pause may be acceptable.
For applications that cannot afford long pause times, see CPU002 : Use Concurrent GC, CPU004 : Get rid of GC completely, CPU007 : Disable explicit System.gc() calls, CPU008 : Use a small heap, CPU009 : Remove Mark Stack Overflow and CPU012 : Avoid Heap resizing. CPU004 would be applicable only to short-running applications in most circumstances. Note that CPU008 must be looked at in conjunction with the memory characteristics of the application, as it may end up having the opposite effect if not applied correctly.
For applications that can afford longer pauses, CPU003 : Compile everything at first touch should be considered. Note that having long pauses is a bad thing in most cases, so even if your application can afford it, you should look at and correct the problem since you do not gain any advantage by having a misconfigured JVM instance.
If you're running an application with more threads than the number of installed CPUs, and it is normal for you to observe that the overall CPU utilization remains at 90% or higher, any kind of background processing will hurt the throughput of your application. On the other hand, if your application is a server whose threads sleep for most of the time, waking up only to service incoming requests, you may be able to diminish the effect of a long GC pause using background processing.
For applications that are CPU-intensive and hence would like to minimize background processing, consider CPU007 : Disable explicit System.gc() calls, CPU008 : Use a small heap and CPU009 : Remove Mark Stack Overflow. You should consider CPU008 : Use a small heap in conjunction with memory characteristics, as mentioned before.
For applications that are not CPU-intensive, CPU002 : Use Concurrent GC is highly recommended. It reduces the overall pause time when a GC cycle hits.
Well-defined locality of reference
If your application has a few methods that get executed very often, while other methods get executed rarely, CPU003 : Compile everything at first touch would be a very good performance enhancer.
If your application runs multiple threads to get the work done, it will benefit from a system that has a large number of CPUs. For a Dynamic Partition, adding more CPUs will show benefits immediately, as Java threads can be immediately scheduled to newly added CPUs. CPU005 : Large Number of Threads, CPU006 : Reduce Lock contention and CPU011 : More than 24 CPU systems discuss other optimizations you can try.
But if your application has a single thread of execution, you will be limited by the processing power of a single CPU. In this case, you may want to try CPU002 : Use Concurrent GC and CPU010 : Single-CPU systems. CPU010 : Single-CPU systems is especially helpful if you are attempting to run multiple JVM instances on a system (for example, in a clustered environment).
The text below refers to command-line arguments to Java (specified before the class/jar file names) as "switches". For example, the line "java -mx2g hello" has a single switch, "-mx2g".
Tip CPU001 : Quick start your application
The non-standard switch -Xquickstart can be used to reduce the startup time
of your application. This switch reduces the level of JIT optimizations to a minimum, and re-applies them only if the applicable methods become hot again. The result, for applications where execution is not concentrated into a small number of methods, is a much quicker startup.
Note: Due to the multi-stage optimization approach, this switch may have an adverse effect on long-running applications.
Tip CPU002 : Use concurrent GC
The Concurrent Mark Garbage Collection Policy can be specified in order to
reduce the pause induced by a GC cycle. It is specified using the -Xgcpolicy:optavgpause
switch.
Note: CPU-intensive applications may see a decrease in throughput with concurrent mark in certain cases.
Tip CPU003 : Compile everything (or selected methods) at first touch
The environment variable IBM_MIXED_MODE_THRESHOLD can be set to 0, switching the Mixed-Mode interpreter off. The result is that all methods get JIT-compiled the first time they get invoked. Add this line to your environment settings, or simply run it prior to launching Java:
export IBM_MIXED_MODE_THRESHOLD=0
You can also experiment with non-zero values, to see if a particular MMI threshold gives you better performance than zero. For AIX, Java 1.3.1 uses 600 as the threshold value, while Java 1.4 uses a value greater than 1000 (note that these values are subject to change). The IBM developer kits - diagnosis documentation has a section "Selecting the MMI Threshold" under the chapter "JIT Diagnostics" that provides more information.
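If there are only certain classes you wish to affect, you can use JITC_COMPILEOPT=FORCE(0){classname}{methodname} instead. The values are quoted in the examples so that the shell does not interpret the parentheses, braces, and wildcards. Examples:
export JITC_COMPILEOPT='FORCE(0){com/myapp/*}{*}'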
This example compiles all methods of all
classes within com.myapp.* package on first load.
export JITC_COMPILEOPT='FORCE(0){*}{uniqueName}'
This example compiles all methods called "uniqueName" when first loaded.
export JITC_COMPILEOPT='FORCE(0){com/myapp/special}{SpecialMethod}'
This example compiles only this particular method on first load. Along with * (which stands for 0 or more characters), you can also use '?' as wildcard for single characters.
Multiple classes and/or methods can be specified using the following syntax:
export JITC_COMPILEOPT='FORCE(0){class1}{method1}{class2}{method2}'
Make sure you document clearly that this is an optimization, not a fix!
Note: The start-up time of applications can increase due to this setting.
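When several class/method pairs are involved, it is easy to mistype the braces. As a convenience, the value can be assembled by a small helper. This is only a sketch: the function name jitc_force is hypothetical, and it does nothing more than concatenate pairs in the {class}{method} syntax described above.

```shell
# jitc_force CLASS METHOD [CLASS METHOD ...]
# Print a JITC_COMPILEOPT value that forces first-touch compilation of each
# class/method pair, following the {class}{method} syntax described above.
# An unpaired trailing argument is ignored.
jitc_force() {
    value='FORCE(0)'
    while [ "$#" -ge 2 ]; do
        value="${value}{$1}{$2}"
        shift 2
    done
    printf '%s\n' "$value"
}

# Usage (quote wildcards so the shell does not expand them):
#   export JITC_COMPILEOPT="$(jitc_force 'com/myapp/*' '*')"
```

Keeping the pairs in a script next to the launch command also makes it easier to document that the setting is an optimization.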
Tip CPU004 : Get rid of GC completely
The startup and maximum heap sizes can be set to very large values, so that no allocation failures occur during the run. You should enable verbosegc for these runs to ensure that the strategy is working!
Note: When GC occurs, the cycle will likely be quite long, so this can be used in only very rare cases.
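One way to confirm that the strategy is working is to count the allocation failures in the captured verbosegc output. The sketch below assumes that the output has been redirected to a file and that each allocation failure is reported on a line containing the text "Allocation Failure"; check a sample of your own log to confirm the wording. The function name, the 1g heap size, and the class name MyApp are placeholders.

```shell
# count_alloc_failures LOGFILE
# Print the number of lines in a captured verbosegc log that mention an
# allocation failure. The search text is an assumption about the log
# wording; adjust it to match what your JVM actually prints.
count_alloc_failures() {
    awk '/Allocation Failure/ { n++ } END { print n + 0 }' "$1"
}

# Usage (heap size and class name are placeholders; this assumes the
# verbosegc output is written to stderr):
#   java -verbosegc -Xms1g -Xmx1g MyApp 2> gc.log
#   count_alloc_failures gc.log
```

A result of 0 means the heap was large enough for the whole run; any other value means a GC cycle was still triggered.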
Tip CPU005 : Large number of threads
For scaling to a larger number of threads, you should use the -Xss switch to specify a value smaller than the default (normally 512 K, but may
vary based on Java version). This will allow you to scale to a much larger number
of threads, while reducing the native memory footprint of your application.
Note: If the stack size is too small, you may get Stack Overflow exceptions.
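A rough calculation shows why the stack size matters as the thread count grows. The sketch below treats the full -Xss value as consumed by every thread, which is an upper-bound simplification; actual usage depends on how the JVM and the operating system commit stack memory. The function name and the thread count of 2000 are illustrative.

```shell
# stack_reserve_mb THREADS STACK_KB
# Print the approximate native memory, in MB, reserved for thread stacks:
# THREADS * STACK_KB / 1024, using integer arithmetic (rounded down).
stack_reserve_mb() {
    echo $(( $1 * $2 / 1024 ))
}

# 2000 threads at the 512 KB default mentioned above, then at 256 KB:
stack_reserve_mb 2000 512    # prints 1000
stack_reserve_mb 2000 256    # prints 500
```

Under this simplification, halving the stack size (for example, with -Xss256k) halves the memory set aside for stacks, at the cost of less headroom for deeply nested calls.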
Tip CPU006 : Reduce lock contention
You can try running multiple Java instances, if your application architecture allows it, to reduce lock contention. This is facilitated by Application servers allowing this kind of a configuration; e.g. WebSphere allows you to use multiple Nodes on the same physical computer.
Note: This may only mask the problem; you should revisit the parts of your code that are causing excessive lock contention. You can use tprof or Java profiling to locate the areas that need to be revisited.
Tip CPU007 : Disable explicit System.gc() calls
Using the non-standard switch -Xdisableexplicitgc, you can avoid having
to remove the System.gc() calls from your code. Neutralizing these calls returns
GC management to the JVM.
Note: If System.gc() calls are required through functionality (for example, via a button on the application screen), this would be a bad idea since the button will become non-functional. There may be other, legitimate reasons why calls to System.gc() could be present in the code.
Tip CPU008 : Use a small heap
Use a heap size that never allows compaction times to become intolerable. If for some reason your application ends up causing a lot of compaction, a 256 MB heap is going to take much less time to compact than a 1 GB heap.
Note: If more compactions are triggered as a result of smaller heap, this optimization will backfire. This tip can be used only in cases where a lot of temporary objects are being created.
Tip CPU009 : Remove mark stack overflow
If you observe Mark Stack Overflow messages in verbosegc logs, reduce the number of objects being kept live in the heap so that these messages go away. Newer builds of Java have a much better handling of MSO. This is being included here since MSO can seriously damage the performance of your application and must be treated as a defect rather than an optimization.
Tip CPU010 : Single-CPU systems
You can use the bindprocessor command to tie the Java process to a particular
processor. This can be considered to avoid multiple JVM instances fighting for
CPU scheduling. You may also want to set -Xgcthreads0 if the system is not a
uniprocessor box.
If you are running your application on a 1-CPU LPAR that will not be reconfigured to add more CPUs dynamically, you can also export NO_LPAR_RECONFIGURATION=1 to get better performance in certain cases.
Note: You are disabling the best performance features of Java by forcing it to run in a single-CPU configuration. NO_LPAR_RECONFIGURATION will also disable the dynamic configurability of Java to adapt to DLPAR's, so it should be used with caution.
Tip CPU011 : More than 24 CPU systems
For 24- to 32-way systems, you should test with -Xgcpolicy:subpool since this
GC policy is tuned to deliver better performance for larger configurations.
Tip CPU012 : Avoid heap resizing
You can keep a fixed size heap to avoid the time being spent resizing the heap when the free space percentage falls below (or goes above) a certain value. See Fine-tuning Java garbage collection performance for details.
Note: The memory footprint of the application will remain at the specified heap size, even if the heap usage is at 10% of its maximum level.
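A fixed-size heap is usually obtained by setting the initial and maximum heap sizes to the same value. The sketch below assumes the standard -Xms and -Xmx switches; the helper name fixed_heap_opts, the 512m size, and the class name MyServer are placeholders.

```shell
# fixed_heap_opts SIZE
# Print heap switches that pin the initial and maximum heap to SIZE,
# so the JVM has no room to grow or shrink the heap.
fixed_heap_opts() {
    echo "-Xms$1 -Xmx$1"
}

# Usage (MyServer is a placeholder class name):
#   java $(fixed_heap_opts 512m) -verbosegc MyServer
```

Running with -verbosegc, as in the usage line, lets you check that the chosen size is large enough for the application's live data.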
This article showed how to use AIX tools for Java performance monitoring, and provided a list of common tweaks that can be applied to optimize the application's CPU usage. The next article in the series talks about Memory tweaking for Java applications on AIX.
- Read other parts in the Maximizing Java performance on AIX series.
- IBM developer kits for AIX, Java technology edition at http://www.ibm.com/developerworks/java/jdk/aix/service.html
- IBM developer kits - diagnosis documentation at http://www.ibm.com/developerworks/java/jdk/diagnosis/
- AIX Performance PMR Data Collection Tools at ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr/
- AIX 5L Performance Tools Handbook at http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG246039.html
- Understanding IBM eServer pSeries Performance and Sizing at http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/SG244810.html
- Fine-tuning Java garbage collection performance at http://www.ibm.com/developerworks/library/i-gctroub/
- Getting more memory in AIX for your Java applications at http://www.ibm.com/developerworks/eserver/articles/aix4java1.html
- AIX 5.2 performance tools update, Part 1 at http://www.ibm.com/developerworks/eserver/articles/Keung_AIXnewPerf.html
- AIX 5.2 Performance Tools update, Part 2 at http://www.ibm.com/developerworks/eserver/articles/AIX5.2PerfTools.html
- AIX 5.2 Performance Tools update: Part 3 at http://www.ibm.com/developerworks/eserver/articles/AIX5.2_performancetoolsupdatepart3.html
Amit Mathur works in the IBM Solutions Development group, working primarily with IBM ISVs on the enablement and performance of their applications on IBM eServer platforms, and helping ISVs and customers become self-sufficient through education and articles on developerWorks. Amit has more than fourteen years' experience leading software support and development in C/C++, Java and databases on UNIX and Linux platforms. He holds a Bachelor of Engineering degree in Electronics and Telecommunication from India. You can reach Amit at amitmat@us.ibm.com.
Sumit Chawla leads the Java Enablement initiative for IBM eServer (for AIX, Windows, and Linux platforms), assisting Independent Software Vendors for IBM Servers. Sumit has a Master of Science degree in Computer Science, with almost 10 years of experience in the IT industry, and is certified by IBM as an Application Architect. He is a frequent contributor to the developerWorks eServer zone. You can contact him at sumitc@us.ibm.com.