Technical Blog Post
Recovering Failed SIBus Transactions in WebSphere Application Server
If you use the WebSphere Service Integration Bus (SIBus), at some point you may have come across WTRN0005W, CWSIT0008E, CWSIT0019E, CWSIT0103E, or CWSIT0009W messages, especially after making configuration changes on the SIB server (the server hosting the messaging engine). Ever wondered how to fix these errors without opening a Problem Management Record (PMR) or Service Request (SR) with IBM?
In this blog entry I provide step-by-step instructions to help you fix the WTRN0005W, CWSIT0008E, CWSIT0019E, CWSIT0103E, and CWSIT0009W errors yourself, without opening a PMR with IBM support.
Now let's look at the scenario that could cause WTRN0005W.
The JVM hosting the JMS application/MDB was terminated abnormally while it was still connected to the messaging engine, AND the messaging engine was then deleted and recreated. When the application JVM was brought back up, WTRN0005W warning messages appeared in its SystemOut.log.
Let's take a look at the warning messages:
The XARecovery stack in these messages tells us that the errors occurred during transaction recovery. When WebSphere Application Server runs a transaction, it writes the transaction information to the transaction log (tranlog). The resources required for that transaction (for example, the messaging engine, JDBC data source, or activation specification) are written to the partner log. The transaction information is removed from the tranlog and the partner log once the transaction completes successfully. If an application server with active transactions is terminated abnormally, then on the next restart the transaction service checks whether there are any indoubt transactions and attempts to recover them. Under certain circumstances, WebSphere Application Server fails to recover the indoubt transactions and logs WTRN0005W warning messages.
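To make the bookkeeping described above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not WebSphere code: the class name `ToyTransactionLogs`, its methods, and the resource descriptions are all made up for illustration, and the real logs are binary files managed by the transaction service.

```python
class ToyTransactionLogs:
    """Toy stand-in for the tranlog / partner log pair (hypothetical, not WebSphere code)."""

    def __init__(self):
        self.tranlog = {}      # transaction id -> recovery IDs of the resources it uses
        self.partner_log = {}  # recovery ID -> description of the resource

    def begin(self, txn_id, resources):
        """Record a transaction and the resources (recovery ID -> description) it needs."""
        self.tranlog[txn_id] = set(resources)
        self.partner_log.update(resources)

    def complete(self, txn_id):
        """Successful completion removes the transaction and its resource entries."""
        for recovery_id in self.tranlog.pop(txn_id):
            # Keep the entry only if another logged transaction still references it.
            if not any(recovery_id in ids for ids in self.tranlog.values()):
                self.partner_log.pop(recovery_id, None)

    def recover(self, available_resources):
        """After a restart, list (txn id, recovery ID) pairs whose resource is gone."""
        failures = []
        for txn_id, recovery_ids in self.tranlog.items():
            for recovery_id in sorted(recovery_ids):
                if self.partner_log[recovery_id] not in available_resources:
                    failures.append((txn_id, recovery_id))
        return failures


logs = ToyTransactionLogs()
logs.begin("txn-A", {1: "datasource"})
logs.complete("txn-A")                       # clean completion: both logs are emptied
logs.begin("txn-B", {2: "old messaging engine"})
# The JVM dies here, so txn-B never completes and stays in the logs.
print(logs.recover(available_resources={"new messaging engine"}))  # [('txn-B', 2)]
```

In this toy run, the leftover entry for `txn-B` points at a resource that is no longer available, which is the situation the rest of this post deals with.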
1. A closer look at the CWSIT0019E error tells us that the application JVM's transaction manager is trying to recover a transaction that is associated with a messaging engine with UUID E1EB838547BFAF96.
2. Now check what the UUID of the currently running messaging engine is.
There are 3 ways to check this:
a. Open the SystemOut.log (should be located under WAS_INSTALL_ROOT\profiles\<appserverprofile>\logs\<servername>) of the server hosting the messaging engine and search for the latest occurrence of the CWSIS1563I message for the messaging engine concerned (in this case XXXNode01.server1-TestBus).
b. Open the sib-engines.xml file (should be located under WAS_INSTALL_ROOT\profiles\Dmgr\config\cells\cellname\nodes\nodename\servers\servername OR WAS_INSTALL_ROOT\profiles\Dmgr\config\cells\cellname\clusters\clustername, depending on whether the bus member is a server or a cluster), look for your messaging engine name (in this example XXXNode01.server1-TestBus), and note the UUID.
c. Open the sib-bus.xml file (should be located under WAS_INSTALL_ROOT\profiles\Dmgr\config\cells\cellname\buses\<yourBus>) and look for your messaging engine name (in this example XXXNode01.server1-TestBus) and note the engineUuid:
<sibresources:SIBus xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" xmlns:sibresources="http://www.ibm.com/websphere/appserver/schemas/6.0/sibresources.xmi" xmi:id="SIBus_1580910735616" name="TestBus" uuid="F52764547710353" secure="false" discardMsgsAfterQueueDeletion="false">
  <busMembers xmi:id="SIBusMember_1580910844471" server="server1" node="XXXNode01">
    <target xmi:id="SIBusMemberTarget_1580910844502" engineUuid="3F99FC2E85ED282E" engineName="XXXNode01.server1-TestBus"/>
As you can see, the UUID of the messaging engine is 3F99FC2E85ED282E but the application JVM's transaction manager is trying to recover a transaction that is associated with a messaging engine with UUID E1EB838547BFAF96. So this confirms that the messaging engine with UUID E1EB838547BFAF96 does not exist.
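If you would rather not read the XML by eye, the lookup in option c and the comparison above can be scripted. The following is a minimal Python sketch: the helper name `find_engine_uuid` is made up, and the sample string is a trimmed-down copy of the excerpt above (closing tags added so that it parses). It relies only on the `engineName` and `engineUuid` attributes shown in that excerpt.

```python
import xml.etree.ElementTree as ET


def find_engine_uuid(sib_bus_xml, engine_name):
    """Return the engineUuid recorded for engine_name, or None if it is not found."""
    root = ET.fromstring(sib_bus_xml)
    # Walk every element instead of assuming a fixed nesting depth.
    for element in root.iter():
        if element.get("engineName") == engine_name:
            return element.get("engineUuid")
    return None


# Trimmed sample modelled on the sib-bus.xml excerpt (not a complete file).
SAMPLE = """<sibresources:SIBus xmlns:xmi="http://www.omg.org/XMI" xmlns:sibresources="http://www.ibm.com/websphere/appserver/schemas/6.0/sibresources.xmi" name="TestBus">
  <busMembers server="server1" node="XXXNode01">
    <target engineUuid="3F99FC2E85ED282E" engineName="XXXNode01.server1-TestBus"/>
  </busMembers>
</sibresources:SIBus>"""

current_uuid = find_engine_uuid(SAMPLE, "XXXNode01.server1-TestBus")
uuid_in_error = "E1EB838547BFAF96"  # taken from the CWSIT0019E message

print(current_uuid)                   # 3F99FC2E85ED282E
print(current_uuid == uuid_in_error)  # False: the logged engine no longer exists
```

To run it against a real configuration, read your own sib-bus.xml into a string and pass it in place of `SAMPLE`.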
Now how do we recover a transaction for a resource that does not even exist?
It is not possible to recover this transaction, since the resource no longer exists, but we can stop the application server from continuously logging WTRN0005W warning messages by deleting the entry for this indoubt transaction from the application JVM's partner log.
The question now is how do we identify the transaction id of this indoubt transaction?
The transaction id can be determined by enabling the "Transaction=all" trace on the application JVM (where the XA recovery errors occurred) and then restarting the JVM.
OK, now we have the transaction trace, what next?
Caution: Follow the below solution ONLY when:
- The messaging engine was deleted and recreated AFTER the remote JVM hosting the JMS application/MDB was terminated with active transactions.
- You see WTRN0005W, CWSIT0008E, CWSIT0019E, CWSIT0103E, and CWSIT0009W errors along with the XARecovery stack trace.
Make sure to remove the property REMOVE_PARTNER_LOG_ENTRY after deleting the entry from the partner log (as discussed in step 7 below).
3. Open the trace.log (should be located under WAS_INSTALL_ROOT\profiles\<appserverprofile>\logs\<servername>) and search for "recovered=false". You should see something like the entry below:
XARecoveryDat > recover Entry
index=1, recoveryID=2, recovered=false, terminating=false,
<useServerSubject=false> <providerEndpoints=null>], , 0,
Please note, there could be multiple transactions needing recovery. In this example, there was only one indoubt transaction and the recoveryID of that indoubt transaction is 2.
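When there are several indoubt transactions, searching the trace by hand gets tedious. Here is a minimal Python sketch that collects every recoveryID marked "recovered=false". The regular expression is an assumption based only on the single trace entry shown above, so adjust it if your trace lines are laid out differently. The function name and the second sample line are hypothetical.

```python
import re

# Assumed layout, taken from the trace entry above: "recoveryID=<n>, recovered=false".
_ENTRY = re.compile(r"recoveryID=(\d+),\s*recovered=false")


def unrecovered_ids(trace_lines):
    """Return the distinct recovery IDs flagged recovered=false, in order of appearance."""
    ids = []
    for line in trace_lines:
        for match in _ENTRY.finditer(line):
            recovery_id = int(match.group(1))
            if recovery_id not in ids:
                ids.append(recovery_id)
    return ids


SAMPLE_TRACE = [
    "XARecoveryDat > recover Entry",
    "index=1, recoveryID=2, recovered=false, terminating=false,",
    "index=2, recoveryID=3, recovered=true, terminating=false,",  # hypothetical line
]

print(unrecovered_ids(SAMPLE_TRACE))  # [2]

# Against a real trace file (the path is up to you):
# with open("trace.log", errors="replace") as trace:
#     print(unrecovered_ids(trace))
```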
We have now identified the transaction id. What do we do next?
4. Now delete recoveryID 2 from the partner log by following the instructions in the product documentation topic "Removing entries from the transaction partner log".
Note: The property REMOVE_PARTNER_LOG_ENTRY is available only from certain fix pack levels of WebSphere Application Server onward. Check APAR PM62977 for the levels that apply to your version.
5. Restart the server in recovery mode.
profile_root/bin> startServer Server1 -recovery
Please refer to the product documentation topic: Restarting an application server in recovery mode
6. Start the server normally.
profile_root/bin> startServer Server1
7. Remove the property REMOVE_PARTNER_LOG_ENTRY.
Important Note: If you do not want to go through the tedious process of enabling trace and identifying the recovery IDs then another simple solution is to set the value for REMOVE_PARTNER_LOG_ENTRY to 0 and follow steps 5-7. A recovery ID of 0 (zero) indicates that ALL resource entries that throw XA Exceptions during recovery are removed from the transaction partner log.
More details can be found in APAR PM62977.
Other Scenarios that could cause WTRN0005W:
1. Enabling/Disabling SIB security
2. Changing SIB ports
3. Changing Permitted transports
4. Deleting and recreating the SIB destinations