CPU bottlenecks

If vmstat has shown that the system is CPU-bound, the next stage is to determine which process is using the most CPU time.

The recommended tool is tprof:

> tprof -s -k -x sleep 60

outputs:

Mon Nov 28 12:40:11 2010
System: AIX 6.1 Node: voodoo Machine: 00455F1B4C00

Starting Command sleep 60 
stopping trace collection
Generating sleep.prof
> cat sleep.prof
Process                                Freq  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
./java                                    5  59.39  24.28   0.00  35.11   0.00
wait                                      4  40.33  40.33   0.00   0.00   0.00
/usr/bin/tprof                            1   0.20   0.02   0.00   0.18   0.00
/etc/syncd                                3   0.05   0.05   0.00   0.00   0.00
/usr/bin/sh                               2   0.01   0.00   0.00   0.00   0.00
gil                                       2   0.01   0.01   0.00   0.00   0.00
afsd                                      1   0.00   0.00   0.00   0.00   0.00
rpc.lockd                                 1   0.00   0.00   0.00   0.00   0.00
swapper                                   1   0.00   0.00   0.00   0.00   0.00
=======                                ====  ===== ======   ==== ======  =====
Total                                    20 100.00  64.70   0.00  35.29   0.00

Process                   PID      TID  Total Kernel   User Shared  Other
=======                   ===      ===  ===== ======   ==== ======  =====
./java                 467018   819317  16.68   5.55   0.00  11.13   0.00
./java                 467018   766019  14.30   6.30   0.00   8.00   0.00
./java                 467018   725211  14.28   6.24   0.00   8.04   0.00
./java                 467018   712827  14.11   6.16   0.00   7.94   0.00
wait                    20490    20491  10.24  10.24   0.00   0.00   0.00
wait                     8196     8197  10.19  10.19   0.00   0.00   0.00
wait                    12294    12295   9.98   9.98   0.00   0.00   0.00
wait                    16392    16393   9.92   9.92   0.00   0.00   0.00
/usr/bin/tprof         421984   917717   0.20   0.02   0.00   0.18   0.00
/etc/syncd             118882   204949   0.04   0.04   0.00   0.00   0.00
./java                 467018   843785   0.03   0.02   0.00   0.00   0.00
gil                     53274    73765   0.00   0.00   0.00   0.00   0.00
gil                     53274    61471   0.00   0.00   0.00   0.00   0.00
/usr/bin/sh            397320   839883   0.00   0.00   0.00   0.00   0.00
rpc.lockd              249982   434389   0.00   0.00   0.00   0.00   0.00
/usr/bin/sh            397318   839881   0.00   0.00   0.00   0.00   0.00
swapper                     0        3   0.00   0.00   0.00   0.00   0.00
afsd                    65776   274495   0.00   0.00   0.00   0.00   0.00
/etc/syncd             118882   258175   0.00   0.00   0.00   0.00   0.00
/etc/syncd             118882   196839   0.00   0.00   0.00   0.00   0.00
=======                   ===      ===  ===== ======   ==== ======  =====
Total                                  100.00  64.70   0.00  35.29   0.00

	Total Samples = 24749	Total Elapsed Time = 61.88s

This output shows that the Java™ process with Process ID (PID) 467018 is using the majority of the CPU time. You can also see that the CPU time is being shared among four threads inside that process (Thread IDs 819317, 766019, 725211, and 712827).

By understanding what the columns represent, you can gather an understanding of what these threads are doing:

Total: The total percentage of CPU time used by this thread or process.
Kernel: The total percentage of CPU time spent by this thread or process inside Kernel routines (on behalf of a request by the JVM or other native code).
User: The total percentage of CPU time spent executing routines inside the executable. Because the Java executable is a thin wrapper that loads the JVM from shared libraries, this CPU time is expected to be very small or zero.
Shared: The total percentage of CPU time spent executing routines inside shared libraries. Time shown under this category covers work done by the JVM itself, the act of JIT compiling (but not the running of the subsequent code), and any other native JNI code.
Other: The total percentage of CPU time not covered by Kernel, User, and Shared. In the case of a Java process, this CPU time covers the execution of Java bytecodes and JIT-compiled methods themselves.

From the previous example, notice the Kernel and Shared values: these account for all of the CPU time used by this process, indicating that the Java process is spending its time doing work inside the JVM (or some other native code).

To understand what is being done during the Kernel and Shared times, the relevant sections of the tprof output can be analyzed.

The shared library section shows which shared libraries are being invoked:

Shared Object                                                  %  
=============                                                          ======
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9gc26.so  17.42
/usr/lib/libc.a[shr.o]                                                   9.38
/usr/lib/libpthreads.a[shr_xpg5.o]                                       6.94
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9thr26.so   1.03
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9prt26.so   0.24
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9vm26.so   0.10
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9ute26.so   0.06
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9jit26.so   0.05
/usr/lib/libtrace.a[shr.o]                                               0.04
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9trc26.so   0.02
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9hookable26.so   0.01

This section shows that almost all of the time is being spent in one particular shared library, which is part of the JVM installation: libj9gc<vm_version>.so, where <vm_version> is the version of the J9 virtual machine (check the output of the java -version command to confirm your VM level). By understanding the functions that the more commonly used JVM libraries carry out, it becomes possible to build a more accurate picture of what the threads are doing:

libbcv<vm_version>.so: Bytecode Verifier
libdbg<vm_version>.so: Debug Server (used by the Java Debug Interface)
libj9gc<vm_version>.so: Garbage Collection
libj9jextract.so: The dump extractor, used by the jextract command
libj9jit<vm_version>.so: The Just In Time (JIT) Compiler
libj9jvmti<vm_version>.so: The JVMTI interface
libj9prt<vm_version>.so: The port layer between the JVM and the Operating System
libj9shr<vm_version>.so: The shared classes library
libj9thr<vm_version>.so: The threading library
libj9ute<vm_version>.so: The trace engine
libj9vm<vm_version>.so: The core Virtual Machine
libj9zlib<vm_version>.so: The compressed file utility library
libjclscar_<vm_version>.so: The Java Class Library (JCL) support routines

In the previous example, the CPU time is being spent inside the garbage collection (GC) implementation, implying either that there is a problem in GC or that GC is running almost continuously.

Again, you can obtain a more accurate understanding of what is occurring inside the libj9gc<vm_version>.so library during the CPU time by analyzing the relevant section of the tprof output:

Profile: /work/JDK/inst.images/rios_aix32_9/sdk/bin/
                                                      libj9gc26.so

  Total % For All Processes (/work/JDK/inst.images/rios_aix32_9/
                                 sdk/bin/libj9gc26.so) = 17.42

Subroutine                                            %   Source              
==========                                         ====== ======              
Scheme::scanMixedObject(MM_Environment*,J9Object*)   2.67 MarkingScheme.cpp   
MarkingScheme::scanClass(MM_Environment*,J9Class*)   2.54 MarkingScheme.cpp   
.GC_ConstantPoolObjectSlotIterator::nextSlot()       1.96 jectSlotIterator.cpp
lelTask::handleNextWorkUnit(MM_EnvironmentModron*)   1.05 ParallelTask.cpp    
orkPackets::getPacket(MM_Environment*,MM_Packet**)   0.70 WorkPackets.cpp     
cheme::fixupRegion(J9Object*,J9Object*,bool,long&)   0.67 CompactScheme.cpp   
WorkPackets::putPacket(MM_Environment*,MM_Packet*)   0.47 WorkPackets.cpp     
rkingScheme::scanObject(MM_Environment*,J9Object*)   0.43 MarkingScheme.cpp   
sweepChunk(MM_Environment*,MM_ParallelSweepChunk*)   0.42 allelSweepScheme.cpp
ment*,J9IndexableObject*,J9Object**,unsigned long)   0.38 MarkingScheme.cpp   
M_CompactScheme::getForwardingPtr(J9Object*) const   0.36 CompactScheme.cpp   
ObjectHeapIteratorAddressOrderedList::nextObject()   0.33 dressOrderedList.cpp
ckets::getInputPacketFromOverflow(MM_Environment*)   0.32 WorkPackets.cpp     
.MM_WorkStack::popNoWait(MM_Environment*)            0.31 WorkStack.cpp       
WorkPackets::getInputPacketNoWait(MM_Environment*)   0.29 WorkPackets.cpp     
canReferenceMixedObject(MM_Environment*,J9Object*)   0.29 MarkingScheme.cpp   
MarkingScheme::markClass(MM_Environment*,J9Class*)   0.27 MarkingScheme.cpp   
._ptrgl                                              0.26 ptrgl.s             
_MarkingScheme::initializeMarkMap(MM_Environment*)   0.25 MarkingScheme.cpp   
.MM_HeapVirtualMemory::getHeapBase()                 0.23 eapVirtualMemory.cpp

This output shows that the most-used functions are:

MarkingScheme::scanMixedObject(MM_Environment*,J9Object*)   	
                                       2.67 MarkingScheme.cpp   
MarkingScheme::scanClass(MM_Environment*,J9Class*)   			
                                       2.54 MarkingScheme.cpp
ObjectSlotIterator.GC_ConstantPoolObjectSlotIterator::nextSlot()  
                                       1.96 ObjectSlotIterator.cpp
ParallelTask::handleNextWorkUnit(MM_EnvironmentModron*)   		
                                       1.05 ParallelTask.cpp

The values show that the time is being spent during the Mark phase of GC. Because the output also contains references to the Compact and Sweep phases, the GC is probably completing but that it is occurring continuously. You could confirm that likelihood by running with -verbosegc enabled.

The same methodology shown previously can be used for any case where the majority of the CPU time is shown to be in the Kernel and Shared columns. If, however, the CPU time is classed as being Other, a different methodology is required because tprof does not contain a section that correctly details which Java methods are being run.

In the case of CPU time being attributed to Other, you can use a Javadump to determine the stack trace for the TIDs shown to be taking the CPU time, and therefore provide an idea of the work that it is doing. Map the value of TID shown in the tprof output to the correct thread in the Javadump by taking the tprof TID, which is stored in decimal, and convert it to hexadecimal. The hexadecimal value is shown as the native ID in the Javadump.

For the previous example:

Process            PID      TID  Total Kernel   User Shared  Other
=======            ===      ===  ===== ======   ==== ======  =====
./java            7018   819317  16.68   5.55   0.00  11.13   0.00

This thread is the one using the most CPU; the TID in decimal is 819317. This value is C8075 in hexadecimal, which can be seen in the Javadump:

3XMTHREADINFO      "main" (TID:0x300E3500, sys_thread_t:0x30010734, 
                             state:R, native ID:0x000C8075) prio=5
4XESTACKTRACE          at java/lang/Runtime.gc(Native Method)
4XESTACKTRACE          at java/lang/System.gc(System.java:274)
4XESTACKTRACE          at GCTest.main(GCTest.java:5)

These entries show that, in this case, the thread is calling GC, and explains the time spent in the libj9gc<vm_version>.so shared library.