My Dear Debugger friends,
After I took a small hiatus from my business I would like to reporting back to you with a new posting.
Yesterday our support team was paged out by a customer because of a very nasty problem. Several times a day the JVM keep restarting unexpectedly. Because the problem happens on production environment this was a very urgent issue and needed a solution as soon as possible.
As input we got the Systemout.log file from the Node as well as the Nodeagent and Dmgr. Additionally, we got the javacore dump and the stderr.log.
This should be enough for our first analysis. We started our investigation with having a look at the Systemout logs.
The Dmgr_SystemOut.log showed the time stamps of the server start / stop events
[11:29:40:383] 00000052 NGUtil$Server I ASND0002I: Detected server server1 started on node Node01
[11:42:01:082] 00000053 NGUtil$Server I ASND0003I: Detected server server1 stopped on node Node01
[11:43:40:457] 00000065 P2PGroup I ODCF8040I: Detected process Node01\server1 started.
[11:45:38:384] 00000052 NGUtil$Server I ASND0002I: Detected server server1 started on node Node01
[11:51:24:181] 00000065 P2PGroup I ODCF8041I: Detected process Node01\server1 stopped.
[11:51:24:181] 00000052 NGUtil$Server I ASND0003I: Detected server server1 stopped on node Node01
As the customer mentioned that happened more than 20 times a day, depending on existing workload.
Next we checked the Node and Nodeagent Systemout.log to get a more clear picture of what happened here.
Unfortunately, it does not provided any hints to why the server restart was triggered again and again. No problem, we still have the javacore and stderr.log
Same fore javacore - there was no hint to a possible issue (CPU starvation or Out of Memory situation).
At least the stderr.log provided following possible reason
CJLFF0008W The First Failure Data Capture (FFDC) disk space requested has exceeded 90.0 percent of the max disk space, 10485760 bytes. Space used is, 10719011 bytes, and the estimated space needed to complete the FFDC action is 425984 bytes for handler, CYN.handler.snap.
CJLFF0022I Attempting to delete oldest files in FFDC directory, /opt/IBM/xxx/yyyyyyy/184.108.40.206.1/logs/FFDC/Node01.server1/, to reduce it below the first Warning, 75.0, percentage limit of max disk space, 10485760. This is approximated to 2854691 bytes being deleted by handler, CYN.handler.snap
Could it be there is a disk space problem ? And because of missing disk space the JVM (server1) was triggered for a re-start ?
Very likely NOT - The messages only say that FFDC log management was active because of less log space on disk, solved by deleting old log files by handler.
So what we know at this time
* the JVM (server1) re-started many times a day
* nothing is logged to log files that is related to the re-start situation
* no Out-Of-Memory or CPU starvation issues
What is the best way to investigate a JVM problem where nothing is written to any files ?
After quick consultation with WebSphere support team we got a good Technote for such kind of problem
WebSphere Application Server process exits without leaving a footprint in log files or no core dumps
This document provide several starting points for a deeper analysis. Beside general considerations it also it also discuss OS specifics that can be checked.
From the Node Systemout.log we could see some application code was running before the server went down
CWLLG2015E: An unexpected runtime exception occurred. See exception for more details. Error: Undeclared local variable, input or output parameter response found in service "RetrievePartyStatements" item "UNKNOWN" location "UNKNOWN". You must declare it in order to use it..
com.lombardisoftware.server.core.UndeclaredVariableRuntimeException: Undeclared local variable, input or output parameter response found in service "RetrievePartyStatements" item "UNKNOWN" location "UNKNOWN". You must declare it in order to use it.
Item 5 from the Technote discuss
Check if Java Native Interface (JNI) code have called JVM to shutdown.
Verbose JNI trace reports information about use of native methods and other Java Native Interface activity.
If the application's native code made a call to "JNI_DestroyJavaVM" or System.exit(), it will be logged.
This presupposes the java application code or product code itself includes a System.exit() call - not the case here !
However, lets have a look at item 7 that discuss LINUX specific issues. As customer system is running on LINUX we requested to get the mentioned /var/log/messages file to review for OS triggered event.
And in fact, we "hit the bull's eye" when we searched for Out of Memory records
A more detailed check showed the following situation
As you can see process 17577 use the most resources of virtual memory 'total-vm' that is mapped into the RAM (allocated and used ) - represented by 'rss'.
Because the Linux system run out of memory the OS OOM killer terminated the process that use large memory to solve the OOM situation.
You can check documents from the web in case you want to know more details about Linux OOM killer log:
So, the customer needed to add more memory to the Linux OS or stop running to many programs to overcome the reported situation.
Keep that in mind, especially the link to troubleshooting Technote. It might save your life one day.
And if this does not help, take two of these and call me in the morning.
Your Dr. Debug