Just another rainy day, my morale has absolutely plummeted and what next ?
Again a patient having pains to start his environment after an upgrade. This time, he upgraded from v8.5.6. to 8.5.6. CF2 and is using a MS SQL Server. He can see Java heap dumps generated and discovered the following error in the Node logs:
[04/08/16 13:57:05:657 BST] 000000e1 SystemOut O java.lang.OutOfMemoryError: Java heap space
So he contacted me to get help on that..
I first checked if the upgrade was successful. From this perspective, everything looks fine and the BPMConfig gives the following feedback:
2016-08-03 12:16.13.030 com.ibm.bpm.config.BPMConfig.main(): The 'BPMConfig.bat -upgrade -profile C:\IBM\BPM\v850\profiles\DmgrProfile' command completed successfully.
Let’s take a look at the SystemOut.log of the Node:
[04/08/16 10:20:22:132 BST] 00000116 Heartbeat I CWLLG0561I: Heartbeat thread starting...
[04/08/16 10:22:05:847 BST] 00000116 FfdcProvider W com.ibm.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I: FFDC Incident emitted on C:\IBM\BPM\v850\profiles\Node1\logs\ffdc\DE.SingleCluster.Node1_c930c6bf_16.08.04_10.22.05.8475239436514672546473.txt com.ibm.ws.uow.UOWManagerImpl.runUnderNewUOW 934
[04/08/16 10:22:13:807 BST] 00000116 FfdcProvider W com.ibm.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I: FFDC Incident emitted on C:\IBM\BPM\v850\profiles\Node1\logs\ffdc\DE.SingleCluster.Node1_c930c6bf_16.08.04_10.22.13.807233318241113520547.txt com.lombardisoftware.utility.spring.ProgrammaticTransactionSupport isTransactionFailure-NoRetryForException
[04/08/16 10:22:13:807 BST] 00000116 Heartbeat E CWLLG0563E: An exception occurred while checking the heart beat.
[04/08/16 10:22:13:812 BST] 00000116 FfdcProvider W com.ibm.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I: FFDC Incident emitted on C:\IBM\BPM\v850\profiles\Node1\logs\ffdc\DE.SingleCluster.Node1_c930c6bf_16.08.04_10.22.13.8126235137079804576453.txt com.lombardisoftware.server.scheduler.Heartbeat beat
[04/08/16 10:27:09:454 BST] 000000e1 SystemOut O java.lang.OutOfMemoryError: Java heap space
Many FFDC's were generated before the Out Of Memory (OOM) exception occurs. Ultimately, the logs show that the heartbeat to Process Center. But why does it prevent the server to start and what has it to do with the OOM error ???
I took a deeper look and analyzes the FFDCs. The last FFDC I received from my customer provides more information about the issue:
[04/08/16 10:22:13:812 BST] FFDC
Reporter:com.lombardisoftware.server.scheduler.Heartbeat@886b84c8 com.lombardisoftware.core.TeamWorksException: Lock request time out period exceeded.
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Lock request time out period exceeded.
There is a Lock time out on customer’s MS SQL Server which prevents scheduler.Heartbeat to execute. Also, a FFDC related to the OOM error itself can be found, showing:
[04/08/16 10:27:17:322 BST] FFDC Exception:java.lang.OutOfMemoryError
SourceId:com.ibm.ws.util.ThreadPool.getTask ProbeId:822 Reporter:com.ibm.ws.util.ThreadPool@fb23cafc java.lang.OutOfMemoryError: Java heap space
Locks ultimately seem to lead to the OOM error (“java.util.concurrent.locks”).
I informed the customer about the findings and mentioned that such lock time outs and hung threads might occur because of orphaned transactions on DB-side, see:
“You might see that current transactions are blocked for a long time frame by a transaction with a -2 session ID. This scenario can lead to lock timeouts or hung threads depending on the lock timeout setting.” ( http://www-01.ibm.com/support/docview.wss?uid=swg21633692 )
Based on the findings, the patient checked his MS SQL Server for orphaned transactions. And… right, there were many of them !
So, how to follow up to resolve the issue?
In some cases increasing the lockTimeout value might already help to prevent a lock time out issue
(HowTo see for instance: http://www-01.ibm.com/support/docview.wss?uid=swg21516235 ).
However, in this case we have already identified orphaned transactions causing the locks. Hence, that will not help here.
I mentions that under certain circumstances, MS SQL Server does NOT clean up specific transactions and they can continue to use resources and block other transactions. For Microsoft SQL Server, these transactions stay in the system until the XA timeout on the database resource is reached or the database is restarted.
To resolve the problem, I always advice to restart the database. Alternatively, the user can also individually terminate the transactions on the database. However, to be sure that a transaction is orphaned, he should ensure that it is not an in-doubt (prepared) transaction or one that is currently in the process of completing. Anything that lasts through a restart of all the JVMs for a related environment and a successful recovery cycle should be orphaned.
The following resource provides more details on that:
As described on the Microsoft Support website (http://support.microsoft.com/en-us/kb/3005016) loss of communication between the IBM BPM server and its database (for example, caused by a software or network failure) may result in orphaned transactions that persist in the database and hold resources (such as database locks) active indefinitely. Resolving these orphaned transactions may require manual rollback or restarting the IBM BPM and Database servers. (see: http://www-01.ibm.com/support/docview.wss?uid=swg21578920#mssqldbsupp )
The patient decided to restart the DB and BINGO:
Orphaned transactions were resolved (+ no new orphaned transactions were generated) and the environment could be successfully started!
In the meantime the rain passed off, my customer was satisfied with the result we reached - a positive end of the day.
And if it does not, take two of these and call me in the morning.