Another excerpt from our WebSphere Application Server Performance Cookbook, due for external publication sometime in the near future, on determining the health of a JVM. This may or may not look like the final publication.
"A common question is: how does one determine how efficiently the JVM is performing, and what metrics point to a JVM that is in, or heading toward, distress?
%CPU utilization
The threshold for %CPU utilization varies with the environment, the number of JVMs, redundancy, and continuous availability (CA) and/or high availability (HA) requirements. For business-critical HA/CA environments, the threshold can be as low as 50% CPU utilization. For non-critical applications, it could be as high as 95%. Analyze both the non-functional requirements (NFRs) and the service level agreements (SLAs) of the application to determine the thresholds that indicate a potential health issue with the JVM.
Amount of time spent in GC
This metric, gleaned from verbose GC output or PMI metrics, is a general indicator of how efficiently the application is using memory and how quickly the garbage collector can complete its tasks. The more time spent in GC, the more CPU the application uses, which can in turn hurt application response time. As a rule of thumb, time spent in GC below 8% is a marker of a healthy application environment. If it goes above 8%, it is probably time either to tune the JVM or to start capacity planning to grow the environment.
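As a rough illustration of the 8% rule of thumb, the sketch below estimates the share of JVM uptime spent in GC using the standard java.lang.management MXBeans rather than verbose GC or PMI data. The class name GcOverhead and its helper methods are hypothetical names chosen for this example. Note that getCollectionTime() reports approximate accumulated time, and for concurrent collectors it may not correspond strictly to pause time, so treat the result as an estimate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {
    // Rule-of-thumb threshold from the text; an assumption to tune per environment.
    static final double THRESHOLD_PERCENT = 8.0;

    /** Percentage of an interval spent in GC, given GC time and elapsed time in ms. */
    static double percentInGc(long gcMillis, long intervalMillis) {
        if (intervalMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / intervalMillis;
    }

    /** Sums the accumulated collection time reported by all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        double pct = percentInGc(totalGcMillis(), uptimeMillis);
        System.out.printf("GC overhead since JVM start: %.2f%%%n", pct);
        if (pct > THRESHOLD_PERCENT) {
            System.out.println("Above the rule of thumb: consider tuning or capacity planning.");
        }
    }
}
```

Measuring since JVM start smooths over short spikes; sampling totalGcMillis() at two points in time and passing the deltas to percentInGc gives the overhead for a specific interval instead.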
%heap utilization after a full GC
The low water mark after a full GC indicates whether the heap is able to reclaim memory. If the low water mark continues to rise over time after each full GC, the application could be the victim of a memory leak. Heap dumps should be able to identify the culprit, and the application can then be corrected to eliminate the leak. Unfortunately, if the application cannot be fixed, the only way to recover from a memory leak is a controlled restart of the JVM. In a clustered environment this is generally not a problem if the JVM's users can be quiesced to another JVM before the restart; otherwise, in-flight transactions will be affected when the JVM is stopped abruptly.
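One way to make "continues to rise over time" concrete is to fit a straight line through the heap utilization recorded after each full GC and look at its slope. The sketch below is a minimal example under that assumption: the class name HeapLowWaterMark and the sample values are hypothetical, and in practice the samples would come from verbose GC or PMI data. A persistently positive slope is only a hint of a leak, not proof, since caches warming up can produce the same pattern for a while.

```java
class HeapLowWaterMark {
    /**
     * Least-squares slope of post-full-GC heap utilization, expressed in the
     * samples' unit per full GC cycle. Samples are ordered oldest first.
     */
    static double slope(double[] usedAfterFullGc) {
        int n = usedAfterFullGc.length;
        if (n < 2) {
            return 0.0; // not enough points to establish a trend
        }
        double meanX = (n - 1) / 2.0; // mean of the indices 0..n-1
        double meanY = 0.0;
        for (double y : usedAfterFullGc) {
            meanY += y;
        }
        meanY /= n;

        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            num += dx * (usedAfterFullGc[i] - meanY);
            den += dx * dx;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Hypothetical %heap utilization after successive full GCs.
        double[] samples = {41.0, 41.5, 43.1, 44.0, 45.2, 46.7};
        System.out.printf("Trend: %.2f percentage points per full GC%n", slope(samples));
    }
}
```

Using a fitted slope rather than comparing only the first and last sample makes the check less sensitive to a single unusually high or low reading.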
Application response time
Deteriorating (i.e. increasing) response time is often an indication of poor health.
Once you have determined that the application is not healthy, follow the appropriate MustGather and open a PMR with IBM Support."