CPU bottlenecks
If vmstat has shown that the system is CPU-bound, the next stage is to determine which process is using the most CPU time.
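As a quick sanity check before profiling, the CPU columns of vmstat can confirm that the system is CPU-bound. The snippet below is a minimal sketch that parses a single, illustrative vmstat body line (the sample values are invented, not taken from the system above); on AIX the last four columns of a body line are us, sy, id, and wa.

```shell
# Sample AIX vmstat body line (illustrative values only):
sample="1  0 229367  332  0   0   0   0    0   0  45  500 200 55 40  5  0"
# The last four fields are us (user), sy (system), id (idle), wa (wait).
echo "$sample" | awk '{us=$(NF-3); sy=$(NF-2); id=$(NF-1);
  print "user=" us " sys=" sy " idle=" id;
  if (id < 10) print "CPU-bound: idle below 10%" }'
# prints: user=55 sys=40 idle=5
#         CPU-bound: idle below 10%
```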
> tprof -s -k -x sleep 60
The output is similar to:
Mon Nov 28 12:40:11 2010
System: AIX 6.1 Node: voodoo Machine: 00455F1B4C00
Starting Command sleep 60
stopping trace collection
Generating sleep.prof
> cat sleep.prof
Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
./java 5 59.39 24.28 0.00 35.11 0.00
wait 4 40.33 40.33 0.00 0.00 0.00
/usr/bin/tprof 1 0.20 0.02 0.00 0.18 0.00
/etc/syncd 3 0.05 0.05 0.00 0.00 0.00
/usr/bin/sh 2 0.01 0.00 0.00 0.00 0.00
gil 2 0.01 0.01 0.00 0.00 0.00
afsd 1 0.00 0.00 0.00 0.00 0.00
rpc.lockd 1 0.00 0.00 0.00 0.00 0.00
swapper 1 0.00 0.00 0.00 0.00 0.00
======= ==== ===== ====== ==== ====== =====
Total 20 100.00 64.70 0.00 35.29 0.00
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
./java 467018 819317 16.68 5.55 0.00 11.13 0.00
./java 467018 766019 14.30 6.30 0.00 8.00 0.00
./java 467018 725211 14.28 6.24 0.00 8.04 0.00
./java 467018 712827 14.11 6.16 0.00 7.94 0.00
wait 20490 20491 10.24 10.24 0.00 0.00 0.00
wait 8196 8197 10.19 10.19 0.00 0.00 0.00
wait 12294 12295 9.98 9.98 0.00 0.00 0.00
wait 16392 16393 9.92 9.92 0.00 0.00 0.00
/usr/bin/tprof 421984 917717 0.20 0.02 0.00 0.18 0.00
/etc/syncd 118882 204949 0.04 0.04 0.00 0.00 0.00
./java 467018 843785 0.03 0.02 0.00 0.00 0.00
gil 53274 73765 0.00 0.00 0.00 0.00 0.00
gil 53274 61471 0.00 0.00 0.00 0.00 0.00
/usr/bin/sh 397320 839883 0.00 0.00 0.00 0.00 0.00
rpc.lockd 249982 434389 0.00 0.00 0.00 0.00 0.00
/usr/bin/sh 397318 839881 0.00 0.00 0.00 0.00 0.00
swapper 0 3 0.00 0.00 0.00 0.00 0.00
afsd 65776 274495 0.00 0.00 0.00 0.00 0.00
/etc/syncd 118882 258175 0.00 0.00 0.00 0.00 0.00
/etc/syncd 118882 196839 0.00 0.00 0.00 0.00 0.00
======= === === ===== ====== ==== ====== =====
Total 100.00 64.70 0.00 35.29 0.00
Total Samples = 24749 Total Elapsed Time = 61.88s
This output shows that the Java™ process with Process ID (PID) 467018 is using the majority of the CPU time. You can also see that the CPU time is being shared among four threads inside that process (Thread IDs 819317, 766019, 725211, and 712827).
- Total
- The total percentage of CPU time used by this thread or process.
- Kernel
- The total percentage of CPU time spent by this thread or process inside Kernel routines (on behalf of a request by the JVM or other native code).
- User
- The total percentage of CPU time spent executing routines inside the executable. Because the Java executable is a thin wrapper that loads the JVM from shared libraries, this CPU time is expected to be very small or zero.
- Shared
- The total percentage of CPU time spent executing routines inside shared libraries. Time shown under this category covers work done by the JVM itself, the act of JIT compiling (but not the running of the subsequent code), and any other native JNI code.
- Other
- The total percentage of CPU time not covered by Kernel, User, and Shared. In the case of a Java process, this CPU time covers the execution of Java bytecodes and JIT-compiled methods themselves.
From the previous example, notice the Kernel and Shared values: these account for all of the CPU time used by this process, indicating that the Java process is spending its time doing work inside the JVM (or some other native code).
To understand what is being done during the Kernel and Shared times, the relevant sections of the tprof output can be analyzed.
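For example, a section of interest can be pulled out of the report with awk. This is a sketch under stated assumptions: the report file is named sleep.prof as above, and the shared-library section runs from the "Shared Object" header to the next blank line; the stand-in file created here contains only two illustrative rows.

```shell
# Build a tiny stand-in for the tprof report (normally written by tprof):
cat > sleep.prof <<'EOF'
Process        Freq  Total Kernel  User Shared  Other

Shared Object                                %
=============                           ======
/usr/lib/libc.a[shr.o]                    9.38
/usr/lib/libpthreads.a[shr_xpg5.o]        6.94

EOF
# Print from the "Shared Object" header up to the next blank line:
awk '/Shared Object/,/^$/' sleep.prof
```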
The shared library section shows which shared libraries are being invoked:
Shared Object %
============= ======
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9gc26.so 17.42
/usr/lib/libc.a[shr.o] 9.38
/usr/lib/libpthreads.a[shr_xpg5.o] 6.94
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9thr26.so 1.03
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9prt26.so 0.24
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9vm26.so 0.10
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9ute26.so 0.06
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9jit26.so 0.05
/usr/lib/libtrace.a[shr.o] 0.04
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9trc26.so 0.02
/work/JDK/inst.images/rios_aix32_8/sdk/jre/bin/libj9hookable26.so 0.01
- libbcv<vm_version>.so
- Bytecode Verifier
- libdbg<vm_version>.so
- Debug Server (used by the Java Debug Interface)
- libj9gc<vm_version>.so
- Garbage Collection
- libj9jextract.so
- The dump extractor, used by the jextract command
- libj9jit<vm_version>.so
- The Just In Time (JIT) Compiler
- libj9jvmti<vm_version>.so
- The JVMTI interface
- libj9prt<vm_version>.so
- The port layer between the JVM and the Operating System
- libj9shr<vm_version>.so
- The shared classes library
- libj9thr<vm_version>.so
- The threading library
- libj9ute<vm_version>.so
- The trace engine
- libj9vm<vm_version>.so
- The core Virtual Machine
- libj9zlib<vm_version>.so
- The compressed file utility library
- libjclscar_<vm_version>.so
- The Java Class Library (JCL) support routines
In the previous example, the CPU time is being spent inside the garbage collection (GC) implementation, implying either that there is a problem in GC or that GC is running almost continuously.
Again, you can obtain a more accurate understanding of what is occurring inside the libj9gc<vm_version>.so library during the CPU time by analyzing the relevant section of the tprof output:
Profile: /work/JDK/inst.images/rios_aix32_9/sdk/bin/libj9gc26.so
Total % For All Processes (/work/JDK/inst.images/rios_aix32_9/sdk/bin/libj9gc26.so) = 17.42
Subroutine % Source
========== ====== ======
Scheme::scanMixedObject(MM_Environment*,J9Object*) 2.67 MarkingScheme.cpp
MarkingScheme::scanClass(MM_Environment*,J9Class*) 2.54 MarkingScheme.cpp
.GC_ConstantPoolObjectSlotIterator::nextSlot() 1.96 jectSlotIterator.cpp
lelTask::handleNextWorkUnit(MM_EnvironmentModron*) 1.05 ParallelTask.cpp
orkPackets::getPacket(MM_Environment*,MM_Packet**) 0.70 WorkPackets.cpp
cheme::fixupRegion(J9Object*,J9Object*,bool,long&) 0.67 CompactScheme.cpp
WorkPackets::putPacket(MM_Environment*,MM_Packet*) 0.47 WorkPackets.cpp
rkingScheme::scanObject(MM_Environment*,J9Object*) 0.43 MarkingScheme.cpp
sweepChunk(MM_Environment*,MM_ParallelSweepChunk*) 0.42 allelSweepScheme.cpp
ment*,J9IndexableObject*,J9Object**,unsigned long) 0.38 MarkingScheme.cpp
M_CompactScheme::getForwardingPtr(J9Object*) const 0.36 CompactScheme.cpp
ObjectHeapIteratorAddressOrderedList::nextObject() 0.33 dressOrderedList.cpp
ckets::getInputPacketFromOverflow(MM_Environment*) 0.32 WorkPackets.cpp
.MM_WorkStack::popNoWait(MM_Environment*) 0.31 WorkStack.cpp
WorkPackets::getInputPacketNoWait(MM_Environment*) 0.29 WorkPackets.cpp
canReferenceMixedObject(MM_Environment*,J9Object*) 0.29 MarkingScheme.cpp
MarkingScheme::markClass(MM_Environment*,J9Class*) 0.27 MarkingScheme.cpp
._ptrgl 0.26 ptrgl.s
_MarkingScheme::initializeMarkMap(MM_Environment*) 0.25 MarkingScheme.cpp
.MM_HeapVirtualMemory::getHeapBase() 0.23 eapVirtualMemory.cpp
This output shows that the most-used functions are:
MarkingScheme::scanMixedObject(MM_Environment*,J9Object*)
2.67 MarkingScheme.cpp
MarkingScheme::scanClass(MM_Environment*,J9Class*)
2.54 MarkingScheme.cpp
ObjectSlotIterator.GC_ConstantPoolObjectSlotIterator::nextSlot()
1.96 ObjectSlotIterator.cpp
ParallelTask::handleNextWorkUnit(MM_EnvironmentModron*)
1.05 ParallelTask.cpp
The values show that the time is being spent during the Mark phase of GC. Because the output also contains references to the Compact and Sweep phases, the GC is probably completing, but occurring almost continuously. You could confirm that likelihood by running with -verbosegc enabled.
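For example, verbose GC logging can be switched on from the command line. This is a command-line sketch only, not output from the system above; YourApp is a placeholder for the application class, and the log file name is arbitrary.

```shell
# Print GC events to stderr as they happen:
java -verbose:gc YourApp
# On IBM SDKs, -Xverbosegclog writes the same information to a file:
java -Xverbosegclog:gc.log YourApp
```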
The same methodology shown previously can be used for any case where the majority of the CPU time is shown to be in the Kernel and Shared columns. If, however, the CPU time is classed as being Other, a different methodology is required because tprof does not contain a section that correctly details which Java methods are being run.
In the case of CPU time being attributed to Other, you can use a Javadump to determine the stack trace for the TIDs shown to be taking the CPU time, which gives an idea of the work being done. Map the TID shown in the tprof output to the correct thread in the Javadump by converting the tprof TID, which is in decimal, to hexadecimal. The hexadecimal value is shown as the native ID in the Javadump.
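The conversion can be done with printf, for example; the grep shown in the comment assumes the Javadump files follow the usual javacore naming, which may differ on your system.

```shell
# Convert the decimal TID reported by tprof to the hexadecimal
# native ID format used in the Javadump:
tid=819317
printf 'native ID:0x%08X\n' "$tid"   # prints: native ID:0x000C8075
# That string can then be located in the Javadump, for example:
#   grep "native ID:0x000C8075" javacore.*.txt
```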
For the previous example:
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
./java 467018 819317 16.68 5.55 0.00 11.13 0.00
This thread is the one using the most CPU; its TID in decimal is 819317, which is C8075 in hexadecimal. This value can be seen as the native ID in the Javadump:
3XMTHREADINFO "main" (TID:0x300E3500, sys_thread_t:0x30010734,
state:R, native ID:0x000C8075) prio=5
4XESTACKTRACE at java/lang/Runtime.gc(Native Method)
4XESTACKTRACE at java/lang/System.gc(System.java:274)
4XESTACKTRACE at GCTest.main(GCTest.java:5)
These entries show that, in this case, the thread is calling GC, which explains the time spent in the libj9gc<vm_version>.so shared library.