My dear Debuggers,
Yesterday I worked on a case where a user was not able to start his BPM system. In detail, the AppCluster node ‘Server_APP_X_1’ refused to start without any clear error in the logs. To get it working again, he went through the usual range of fixes for this kind of problem: he restarted the Messaging Engine (ME) cluster, the Support cluster and the Node Agent, all without luck. He also tried clearing the class cache and running osgiCfgInit.sh, but the issue persisted.
For this kind of issue, my first look goes to the server logs (startServer.log).
The NodeAgent is up and running
[3/6/19 20:56:34:601 GST] 00000056 NGUtil$Server I ASND0002I: Detected server nodeagent started on node NODE_X_1
[3/6/19 20:56:34:622 GST] 00000001 WsServerImpl A WSVR0001I: Server nodeagent open for e-business
but the AppServer startup fails
************* End Display Current Environment *************
[3/7/19 10:52:34:600 GST] 00000001 ManagerAdmin I TRAS0017I: The startup trace state is *=info.
[3/7/19 10:52:34:699 GST] 00000001 AdminTool A ADMU0128I: Starting tool with the NODE_X_1 profile
[3/7/19 10:52:34:703 GST] 00000001 AdminTool A ADMU3100I: Reading configuration for server: Server_APP_X_1
[3/7/19 10:52:34:716 GST] 00000001 ImplFactory W WSVR0072W: Ignoring undeclared override of interface, com.ibm.websphere.cluster.topography.DescriptionManager, with implementation, com.ibm.ws.cluster.propagation.bulletinboard.BBDescriptionManager
[3/7/19 10:52:34:909 GST] 00000001 ModelMgr I WSVR0801I: Initializing all server configuration models
[3/7/19 10:52:39:404 GST] 00000001 WorkSpaceMana A WKSP0500I: Workspace configuration consistency check is disabled.
[3/7/19 10:52:39:670 GST] 00000001 AdminTool A ADMU3200I: Server launched. Waiting for initialization status.
[3/7/19 10:52:39:695 GST] 00000001 AdminTool A ADMU3011E: Server launched but failed initialization. Server logs, startServer.log, and other log files under /opt/IBM/SOA/BPM85PS/profiles/NODE_X_1/logs/Server_APP_X_1 should contain failure information.
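If you do not want to scan such a log by eye, the error lines can be filtered out with a few lines of script. Here is a minimal Python sketch, assuming the log line layout seen in the excerpts above (timestamp in brackets, thread ID, component, event type, message ID). The function name find_error_messages is my own invention and not part of any WebSphere tooling. Note that it filters on the trailing letter of the message ID, because ADMU3011E is logged with event type A, not E.

```python
import re

# Assumed layout, taken from the excerpts above:
# [timestamp] threadId component eventType MSGID: text
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<thread>[0-9a-fA-F]{8})\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<event>\S)\s+"
    r"(?P<msg_id>[A-Z]{4,5}\d{4}(?P<severity>[IWE])):\s*"
    r"(?P<text>.*)$"
)


def find_error_messages(lines):
    """Return (timestamp, msg_id, text) for every message whose ID ends in 'E'."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("severity") == "E":
            hits.append(
                (match.group("timestamp"), match.group("msg_id"), match.group("text"))
            )
    return hits


if __name__ == "__main__":
    # Two lines copied from the startServer.log excerpt above.
    demo_lines = [
        "[3/7/19 10:52:39:670 GST] 00000001 AdminTool A ADMU3200I: "
        "Server launched. Waiting for initialization status.",
        "[3/7/19 10:52:39:695 GST] 00000001 AdminTool A ADMU3011E: "
        "Server launched but failed initialization.",
    ]
    for timestamp, msg_id, text in find_error_messages(demo_lines):
        print(timestamp, msg_id, text)
```

In a real case you would pass the lines of startServer.log instead of the two demo lines.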
Next, I checked the FFDC folder. Whenever an abnormal condition occurs, an FFDC log is created.
I could see two different files that had been created by the system.
The exception log referred to a java.io.FileNotFoundException and an org.omg.CORBA.OBJECT_NOT_EXIST error (both logged to a text file):
[3/6/19 19:30:03:433 GST] FFDC Exception:java.io.FileNotFoundException SourceId:com.ibm.wbiserver.xct.resources.bootstrap.WsXctResources ProbeId:51 Reporter:java.lang.Class@caa58bc6
java.io.FileNotFoundException: /opt/IBM/SOA/BPM85PS/profiles/NODE_X_1/config/cells/CELL_X_04/nodes/NODE_X_1/servers/Server_1/server-core.xml (No such file or directory)
at java.io.FileInputStream.open(Native Method)
[3/6/19 19:30:05:025 GST] FFDC Exception:org.omg.CORBA.OBJECT_NOT_EXIST SourceId:com.ibm.rmi.iiop.Connection.doLocateRequestWork:3304 ProbeId:ORB FFDC
org.omg.CORBA.OBJECT_NOT_EXIST: SERVANT_NOT_FOUND (2) for 0x4a4d424900000010721d1fcf0000000000000000000000000000000000000024000000080000000000000000 vmcid: IBM minor code: C12 completed: No
The first FFDC error does not seem to be relevant to our problem (https://www-01.ibm.com/support/docview.wss?uid=swg21662337) and can be ignored, so I turned my attention to the second one.
Because this exception comes from the Common Object Request Broker Architecture (CORBA), I contacted my friends from the WebSphere team to get their help on that issue.
The first curious finding was the size of the server's native_stderr.log, which was really large.
Also, my friend from the WebSphere team knew from experience that on Linux systems, limits can be defined on certain resources, and that these can have a massive influence on functionality.
So next, we asked the user to run the ulimit command to get a general overview of these limits.
What we could see from the output was a FILE SIZE limit of 6291453 (~6 GB, if the value is counted in 1024-byte blocks, as bash's ulimit does by default)!!!
Does that ring a bell?
On the one hand, we have a native_stderr.log with a size of 6 GB; on the other, a file size limit of 6 GB.
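The arithmetic behind the "~6 GB" is worth spelling out. Below is a minimal Python sketch, assuming the value came from bash's ulimit -f, which by default counts in 1024-byte blocks (other shells, or bash in POSIX mode, may count in 512-byte blocks). The helper names blocks_to_gib and describe_limit are made up for illustration. The second part reads the file size limit of the current process through Python's standard resource module (Unix only), which reports the value in bytes.

```python
import resource

BLOCK_SIZE = 1024  # assumption: the unit used by bash's `ulimit -f`


def blocks_to_gib(blocks, block_size=BLOCK_SIZE):
    """Convert a ulimit block count into GiB."""
    return blocks * block_size / 1024**3


def describe_limit(limit_bytes):
    """Render an rlimit value (in bytes) in a readable form."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes} bytes ({limit_bytes / 1024**3:.2f} GiB)"


if __name__ == "__main__":
    # The value from the case above.
    print(f"6291453 blocks = {blocks_to_gib(6291453):.2f} GiB")

    # The limits that apply to this very process.
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    print("soft file size limit:", describe_limit(soft))
    print("hard file size limit:", describe_limit(hard))
```

Keep in mind that the limits which matter are those of the user and shell that start the server, so the check has to run in that same environment.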
Please have a look at the following limit settings suggested by the WebSphere team.
As you can see, the suggested limit for file size is UNLIMITED.
You can do it that way, but you do not have to.
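But what happens when the limit is reached? The file cannot grow any further, and the process writing to it may stop working (on Linux, a write beyond the limit fails and the process is sent the SIGXFSZ signal, which terminates it by default)! This could explain why startServer.log shows a failure within seconds.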
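To spot such a candidate before the next outage, you can compare the files in a log directory against the limit. This is a minimal Python sketch under my own assumptions: the function name files_near_limit and the 95 % threshold are arbitrary choices, and the demo uses a temporary directory with tiny files instead of a real server log folder.

```python
import tempfile
from pathlib import Path


def files_near_limit(log_dir, limit_bytes, threshold=0.95):
    """Return (path, size) for regular files of at least threshold * limit_bytes."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold * limit_bytes:
                hits.append((path, size))
    return hits


if __name__ == "__main__":
    # Demo with a pretend limit of 100 bytes.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "native_stderr.log").write_bytes(b"x" * 100)
        (Path(tmp) / "startServer.log").write_bytes(b"x" * 10)
        for path, size in files_near_limit(tmp, limit_bytes=100):
            print(f"{path.name}: {size} bytes, at or near the limit")
```

For the case above, the directory would be the server's log folder and limit_bytes would be 6291453 * 1024, again assuming 1024-byte blocks.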
So we instructed the user to rename native_stderr.log and restart the server. This time it started smoothly, without any further problems.
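The rename itself is a one-liner in the shell, but if you want to keep the old file with a timestamp for later analysis, it can be sketched in Python as follows. This is an illustration under assumptions: the function name set_aside and the suffix format are my own choices, and the server should be stopped first, because a running process keeps writing to the already opened file even after it has been renamed.

```python
import tempfile
import time
from pathlib import Path


def set_aside(log_path, suffix=None):
    """Rename a log file to '<name>.<suffix>' in the same directory; return the new path."""
    log_path = Path(log_path)
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    target = log_path.with_name(f"{log_path.name}.{suffix}")
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    log_path.rename(target)
    return target


if __name__ == "__main__":
    # Demo on a throwaway file instead of a real server log.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "native_stderr.log"
        log.write_text("some old output\n")
        print("moved to", set_aside(log).name)
```

On the next start the server creates a fresh native_stderr.log, and the old one stays available for finding out what filled it up in the first place.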
What we have learned from that case:
Ensure ulimits are set according to the general WAS recommendations. If you apply an alternative configuration, take care that it is documented for whoever needs it.
And if this does not help, take two of these and call me in the morning.
Your Dr. Debug