Technical Blog Post
Recovering from failed transaction recovery
When WebSphere Application Server is running a transaction, the transaction information is written to log1 and log2 in the tranlog directory. The resources required for that transaction (database name, user, password, etc.) are recorded in log1 and log2 in the partnerlog directory. When a transaction completes, its information is garbage collected from the logs. If the application server abends or is forced down mid-transaction, so that a transaction does not complete, then on the next server restart the transaction service detects the unfinished transaction, attempts to re-establish the resource(s) stored in the partnerlogs, and then tries to complete the transaction(s) stored in the tranlogs.
A potential problem, especially with test servers, arises when a required resource changes while the application server is down. For example, the application server JVM was forced down so that a database password could be changed, or a database outage occurred. Whatever the reason, when the application server restarts, the recorded resource is no longer available, and the server is stuck in a state where it can no longer recover the unfinished transaction.
This situation is easy to spot: the error logged in SystemOut.log reports a failure to establish the required resource (for example, the database connection), with the transaction recovery manager code further down in the same error stack:
In cases where the logged transaction is of no concern, you can simply stop the application server, navigate to the tranlog and partnerlog directories, delete the contents (log1 and log2) of both directories, and then restart the application server. If desired, back up these logs to another location before deleting them from the application server folders. On restart, the transaction service creates new logs.
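The backup-and-delete step can be sketched as a small shell function. This is a minimal sketch rather than an IBM-supplied tool: the function name clear_tran_logs and its three directory arguments are hypothetical, and it assumes the application server is already stopped and that each directory holds files named log1 and log2.

```shell
#!/bin/sh
# Sketch: back up and then clear the transaction and partner logs.
# Run only while the application server is stopped.
# All three arguments are placeholders; use the directories from your own configuration.
clear_tran_logs() {
    tranlog_dir=$1
    partnerlog_dir=$2
    backup_dir=$3

    # Refuse to do anything unless both log directories exist.
    for dir in "$tranlog_dir" "$partnerlog_dir"; do
        if [ ! -d "$dir" ]; then
            echo "Not a directory: $dir" >&2
            return 1
        fi
    done

    # Keep a copy of every log file before removing anything.
    mkdir -p "$backup_dir/tranlog" "$backup_dir/partnerlog" || return 1
    cp -p "$tranlog_dir"/log1 "$tranlog_dir"/log2 "$backup_dir/tranlog/" || return 1
    cp -p "$partnerlog_dir"/log1 "$partnerlog_dir"/log2 "$backup_dir/partnerlog/" || return 1

    # Delete only log1 and log2; the transaction service recreates them on restart.
    rm -f "$tranlog_dir"/log1 "$tranlog_dir"/log2 \
          "$partnerlog_dir"/log1 "$partnerlog_dir"/log2
}
```

The function stops before deleting anything if a directory is missing or a copy fails, so a partial backup never leads to lost logs.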
For reference, unless changed by your configuration, the default directories are typically located in the paths:
Oracle XA Transaction Recovery Failure:
When WebSphere Application Server attempts to recover Oracle database transactions, the transaction service issues the following exception:
WTRN0037W: The transaction service encountered an error on an xa_recover operation.
The resource was com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl@1114a62.
The error code was XAER_RMERR. The exception stack trace follows:
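To check whether a server has logged this warning, the log can be searched for the message ID. The following is a minimal sketch: the function name find_xa_recover_errors is hypothetical, and the log file path is supplied by the caller instead of assuming a default location.

```shell
#!/bin/sh
# Sketch: list occurrences of the WTRN0037W warning in a log file.
# The log path is passed in by the caller; no default location is assumed.
find_xa_recover_errors() {
    log_file=$1
    # -n prints line numbers so the stack trace that follows each hit can be located.
    grep -n 'WTRN0037W' "$log_file"
}
```

The function returns a non-zero status when the message ID does not appear, which makes it usable in a simple if test.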
Oracle requires services such as the WebSphere Application Server transaction service to have special permissions for performing transaction recovery.
As user SYS, run the following commands on your Oracle server:
grant select on pending_trans$ to public;
grant select on dba_2pc_pending to public;
grant select on dba_pending_transactions to public;
grant execute on dbms_system to <user>;
Here, <user> is a user ID in the application server that is authorized to perform transaction recovery for the XA data source. If you have not authorized any user IDs to perform transaction recovery, the application server uses the login alias for the data source as the user ID.
This problem is mentioned under Oracle bug 3979190.
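When the grants have to be applied for several recovery users, the statements can be generated instead of typed by hand. This is a minimal sketch: the function name print_recovery_grants is hypothetical, and it only prints the four statements listed above with the user name filled in. It does not connect to the database.

```shell
#!/bin/sh
# Sketch: print the grant statements for one recovery user.
# The user name is supplied by the caller; nothing is executed against the database.
print_recovery_grants() {
    recovery_user=$1
    if [ -z "$recovery_user" ]; then
        echo "usage: print_recovery_grants <user>" >&2
        return 1
    fi
    # \$ keeps the literal dollar sign in the pending_trans$ table name.
    cat <<EOF
grant select on pending_trans\$ to public;
grant select on dba_2pc_pending to public;
grant select on dba_pending_transactions to public;
grant execute on dbms_system to $recovery_user;
EOF
}
```

The output can be redirected to a file, reviewed, and then run as SYS with your usual SQL client.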
WBI / WPS Servers:
Transaction logs should NOT be deleted if running in a WebSphere Process Server or IBM Business Process Manager environment. These products store process information in these logs, and deleting them can cause unpredictable behavior.
In this case, you must perform the following steps to recover the incomplete transactions:
- Start all servers (in a clustered environment) in recovery mode, using the following command:
profileRoot/bin/startServer.(bat|sh) serverName -recovery
The -recovery option causes the server to start, perform only transaction recovery, and then shut down again.
For clustered environments, make sure that the servers are started in the following order (sample for a Remote Messaging or Remote Support topology, also known as Golden Topology):
- All members of the messaging infrastructure cluster
- All members of the support cluster
- All members of the application deployment cluster
You can use the Ripple start option in the administrative console (Cluster view) if you do not want to shut down your environment completely.
- Start the application servers again.
- Using the administrative console, check whether any in-doubt transactions are left. Navigate to Servers > Application Servers > serverName > Container Services > Transaction service > Runtime tab
If remaining in-doubt transactions are listed, select all of them and initiate a rollback.
- Check whether there are messages on the Retention Queue and Hold Queue of the Business Process Container. Navigate to Servers > Application Servers > serverName > Business process container > Runtime Configuration
If there are any messages on the Retention Queue, the Hold Queue, or both, replay the messages. However, replay all messages from the Retention Queue first.
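The recovery-mode start from the first step can be sketched as a loop that starts each server in the required order and stops at the first failure. This is a minimal sketch under stated assumptions: the function name start_in_recovery_mode is hypothetical, it uses the .sh form of the startServer command shown above, and it assumes every listed server belongs to the same profile on the local host. In a real cluster the members usually live in different profiles or on different nodes, so the function would be run once per profile.

```shell
#!/bin/sh
# Sketch: start servers in recovery mode, one at a time, in the order given.
# profile_root and the server names are placeholders supplied by the caller.
start_in_recovery_mode() {
    profile_root=$1
    shift
    for server in "$@"; do
        echo "Starting $server in recovery mode"
        # With -recovery the server performs transaction recovery and then shuts down.
        "$profile_root/bin/startServer.sh" "$server" -recovery || {
            echo "Recovery start failed for $server" >&2
            return 1
        }
    done
}
```

Pass the server names in the documented order: messaging infrastructure cluster members first, then support cluster members, then application deployment cluster members.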