IBM Support

SYNTHETIC PLAYBACK AGENT KEEPS CRASHING RANDOMLY

Technical Blog Post


Body

One of the Synthetic agents (SN agent version 01.00.05.03) used to measure portal responsiveness started crashing frequently, at least once every two days.
This kind of problem is usually caused by one of the playback scripts.
The agent was configured to run several scripts, so it was not easy to tell which one was responsible.

When the agent cores, it stops working even though the process still appears to be up and running.
Each time, we had to restart the agent to recover.

Looking at the call stack, I noticed that the exception is always raised by the same thread, with the same call stack:

3XMTHREADINFO3        Java callstack:
4XESTACKTRACE             at sun/awt/X11/XRobotPeer.getRGBPixelsImpl(Native Method)
4XESTACKTRACE             at sun/awt/X11/XRobotPeer.getRGBPixels(XRobotPeer.java:98(Compiled Code))
4XESTACKTRACE             at java/awt/Robot.createScreenCapture(Robot.java:448(Compiled Code))
5XESTACKTRACE                (entered lock: java/awt/Robot@0x00000000FBDEC7D8, entry count: 1)
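When several javacore files are available, checking that they all share the same Java call stack can be scripted. The following is a minimal sketch, assuming the javacore uses the line tags visible in the excerpt above (3XMTHREADINFO3 for the "Java callstack:" header and 4XESTACKTRACE for each frame). The function names java_callstacks and top_frame_counts are hypothetical helpers, not part of any IBM tool.

```python
from collections import Counter


def java_callstacks(javacore_text):
    """Return the Java call stacks found in a javacore as tuples of frames.

    Assumption: each stack starts with a '3XMTHREADINFO3 ... Java callstack:'
    line and each frame is a '4XESTACKTRACE ... at <frame>' line.
    """
    stacks = []
    current = None
    for line in javacore_text.splitlines():
        tag, _, rest = line.strip().partition(" ")
        rest = rest.strip()
        if tag == "3XMTHREADINFO3" and rest.startswith("Java callstack"):
            current = []
            stacks.append(current)
        elif tag == "4XESTACKTRACE" and current is not None and rest.startswith("at "):
            current.append(rest[3:])
        elif tag.startswith("3XMTHREADINFO"):
            # Any other thread-info line (for example a native stack header)
            # ends the Java stack currently being collected.
            current = None
    return [tuple(stack) for stack in stacks]


def top_frame_counts(javacore_texts):
    """Count how often each top-of-stack frame appears across javacores."""
    counts = Counter()
    for text in javacore_texts:
        for stack in java_callstacks(text):
            if stack:
                counts[stack[0]] += 1
    return counts
```

If one frame dominates the counts across all the dumps, that supports the observation that every crash happens at the same place.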



From the agent logs, the agent appears to stay up and running, but looking deeper I noticed that there were no further calls to:


"org.openqa.selenium.remote.ProtocolHandshake createSession"

The logs also show that the agent suddenly stops communicating with the browser:

 

org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
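The two log checks described above (when the last session was created, and when the browser first became unreachable) can be automated with a small scan. This is only a sketch: it treats the log as plain text lines and searches for the two strings quoted above, because the exact layout of the agent log is not shown here. The function name summarize_selenium_log is hypothetical.

```python
def summarize_selenium_log(lines):
    """Return (last_create_session_line, first_unreachable_line).

    Line numbers are 1-based; a value is None when the marker never appears.
    Assumption: the log is plain text with one message per line.
    """
    last_session = None
    first_unreachable = None
    for number, line in enumerate(lines, start=1):
        if "ProtocolHandshake createSession" in line:
            last_session = number
        elif "UnreachableBrowserException" in line and first_unreachable is None:
            first_unreachable = number
    return last_session, first_unreachable
```

A log in which the last createSession entry is followed by an UnreachableBrowserException and nothing else matches the behavior described in this post.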

So, as a first step, I verified the browser version, as I am aware of some possible compatibility issues with Firefox.
The Firefox release was 52.8, however, so nothing was wrong there.


So I focused on the javacore file.

As anticipated, the exception occurs in the same thread every time:

#INFO: Crashed in application thread 0000000001F57C00.
#INFO: Found 14 JITed methods on Java stack.

This thread is named:

"Session 3047a569-c279-4955-ac92-98c82885ea50 processing inside browser"

and the Java stack shows that the crash occurs while the thread is trying to capture the screen:

Java callstack:
at sun/awt/X11/XRobotPeer.getRGBPixelsImpl(Native Method)
at sun/awt/X11/XRobotPeer.getRGBPixels(XRobotPeer.java:98(Compiled Code))
at java/awt/Robot.createScreenCapture(Robot.java:448(Compiled Code))

When getRGBPixelsImpl is executed, the native call stack shows the following:


abort+0x148 (0x00007FDC47C298E8 [libc.so.6+0x368e8])
(0x00007FDC47C67F47 [libc.so.6+0x74f47])
(0x00007FDC47C6F619 [libc.so.6+0x7c619])
XFree+0x9 (0x00007FDBEEABE4D9 [libX11.so.6+0x454d9])
(0x00007FDC2602A4C9 [libawt_xawt.so+0x454c9])
(0x00007FDC2602A4A0 [libawt_xawt.so+0x454a0])
(0x00007FDC2602A5A3 [libawt_xawt.so+0x455a3])
(0x00007FDC2602AA6B [libawt_xawt.so+0x45a6b])

So the problem occurs while the code is trying to free memory using an invalid pointer:

*** Error in `/opt/ibm/apm/agent/JRE/lx8266/bin/java': free(): invalid pointer: 0x00007fdbf00b3768 ***


======= Backtrace: =========


/lib64/libc.so.6(+0x7c619)[0x7fdc47c6f619]
/lib64/libX11.so.6(XFree+0x9)[0x7fdbeeabe4d9]
/opt/ibm/apm/agent/JRE/lx8266/lib/amd64/libawt_xawt.so(+0x454c9)[0x7fdc2602a4c9]
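To compare backtraces from several crashes, the frames can be split into library, symbol, offset and address. The sketch below assumes each frame has the form shown above, library(symbol+offset)[address], with the symbol sometimes empty. The names parse_backtrace and first_frame_outside are hypothetical helpers.

```python
import re

# Matches frames such as:
#   /lib64/libX11.so.6(XFree+0x9)[0x7fdbeeabe4d9]
#   /lib64/libc.so.6(+0x7c619)[0x7fdc47c6f619]
FRAME = re.compile(
    r"^(?P<lib>[^\s(]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_backtrace(text):
    """Return a list of (library, symbol, offset, address) tuples.

    Lines that do not look like frames (headers, blank lines) are skipped.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((
                match.group("lib"),
                match.group("symbol"),
                int(match.group("offset"), 16),
                int(match.group("address"), 16),
            ))
    return frames


def first_frame_outside(frames, library_fragment="libc.so"):
    """Return the first frame whose library path lacks library_fragment."""
    for frame in frames:
        if library_fragment not in frame[0]:
            return frame
    return None
```

Applied to the backtrace above, first_frame_outside returns the XFree frame in libX11.so.6, which is the call that handed the invalid pointer to free().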

--------------------------------------------

Further investigation identified the root cause as a long-running script.
In this case, the Firefox instance hangs and a cron job in the Synthetic agent kills it.
Although the Firefox instance has been terminated, the script is still running in its session and can no longer communicate with the browser. This causes the Selenium server to exit
the session unexpectedly and to try to capture a screenshot by calling the Sun X11 API.
Since Firefox is no longer there, the API call fails with a severe error (an abort event), which leads to the JVM core dump.
This happens only occasionally.

A valid workaround is to investigate and fix whatever makes the affected script run for so long.
The alternative way to avoid the core dump is to start the Selenium server in quiet mode, so that it does not try to capture a screenshot when it encounters such an API call failure.

These are the steps needed to enable quiet mode:

 
1)  Open the file /opt/ibm/apm/agent/lx8266/sn/bin/run_selenium.sh and change the last line
 
from
 
${JAVA} -Xms64m -Xdump:java:events=user -jar -Djava.util.logging.config.file=${CWD}/logging.properties ${SELENIUM_JAR} -timeout 300 -browserTimeout 300 &
 
to

${JAVA} -Xms64m -Xdump:java:events=user -jar -Djava.util.logging.config.file=${CWD}/logging.properties -Dwebdriver.remote.quietExceptions=true  ${SELENIUM_JAR} -timeout 300 -browserTimeout 300 &
 
2) Restart the Synthetic agent.
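If the same change has to be made on many agents, the edit in step 1 can be scripted. The following is a minimal sketch, not an official procedure: it assumes the launch line contains both -jar and ${SELENIUM_JAR} exactly as shown above, and the function name add_quiet_flag is hypothetical. It works on the script text only, so the caller decides how to read, back up and write the file.

```python
QUIET_FLAG = "-Dwebdriver.remote.quietExceptions=true"


def add_quiet_flag(script_text):
    """Insert QUIET_FLAG before ${SELENIUM_JAR} on the java launch line.

    Lines that already contain the flag are left alone, so running the
    function twice gives the same result as running it once.
    """
    patched = []
    for line in script_text.splitlines(keepends=True):
        is_launch_line = "-jar" in line and "${SELENIUM_JAR}" in line
        if is_launch_line and QUIET_FLAG not in line:
            line = line.replace(
                "${SELENIUM_JAR}", QUIET_FLAG + " ${SELENIUM_JAR}", 1
            )
        patched.append(line)
    return "".join(patched)
```

Keeping a backup copy of run_selenium.sh before writing the patched text back makes it easy to revert, and the agent still needs to be restarted as in step 2.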

-------------
The above change will also be included in the official SN code in version 01.00.05.05 (IF05 for the SN agent), which is planned to be available in March 2019.
For details, refer to APAR IJ13047.

Hope it helps.

 

 


UID

ibm11085265