IBM WebSphere Developer Technical Journal: Automate peer recovery for transactions and messages in WebSphere Application Server V6.0.x

Dramatic enhancements for high availability

Learn how to set up, verify, and troubleshoot an environment for peer recovery of transactions and messages using new high availability features in IBM® WebSphere® Application Server Network Deployment V6.0.x, including the high availability manager, transaction high availability, and messaging high availability.

Li-Fang Lee (lifang@us.ibm.com), Test Strategist, IBM

Li-Fang Lee is a test strategist working for the WebSphere Test & Quality Organization in Rochester, MN. Her current focus area is high availability, understanding customer business requirements, designing customer-like test scenarios across the organization, and leading a team to carry out test scenarios to ensure high availability of WebSphere Application Server.



Kristi Peterson (kristip@us.ibm.com), Software Engineer, IBM

Kristi Peterson is a software engineer from Rochester, MN. She has four years of experience in the field of software testing, and holds a degree in Computer Science and English from Luther College in Decorah, Iowa. Her areas of expertise include WebSphere Application Server, software testing, high availability software, test application development, documentation review, and scenario testing development.



21 September 2005


From the IBM WebSphere Developer Technical Journal.

Introduction

Configuring, verifying, and troubleshooting a highly available environment is inherently complex and confusing. Good news: with important new enhancements, IBM WebSphere Application Server Network Deployment V6.0.x provides easier and more flexible ways to configure high availability solutions for stateful information, such as transactions and messages, with dramatic performance improvements.

For example, WebSphere Application Server Network Deployment V6.0.x (hereafter referred to as Network Deployment):

  • Simplifies the configuration needed to achieve highly available systems. The failover of transactions and messages is now handled by multiple Network Deployment internal components. As a result, no external software (such as clustering software, like IBM HACMP) is required to monitor the application server's transaction and messaging resource availability, to perform resource switching on failure, or to achieve transaction or message high availability.

  • Decreases the time it takes to switch resources from one server to another. In prior versions of Network Deployment, before in-flight transactions or messages from a failed server could be processed, either the original server had to be restarted, or -- with clustering software support -- a new server had to be cold-started and all the necessary resources failed over to it. No mechanism was available to have an active application server ready to process in-flight transactions or messages from a failed peer server. This is no longer the case in Network Deployment V6.0.x. Any active application server can now be configured to be knowledgeable about the resources of its peer servers, and critical application server services can be activated on any application server within the same cluster. As a result, the turnaround time for resource failover is reduced from minutes to seconds.

With the proper configuration, the failover for transactions or messages should be transparent to the user. Work should not be interrupted, performance should not degrade, and transactions and messages should not be lost.

In this article, we will provide an example of how to set up, verify, and troubleshoot an environment for peer recovery of transactions and messages using Network Deployment. In this environment, the in-flight transactions and messages will automatically fail over from a failed application server to a peer server, which will continue processing uninterrupted.

This article assumes familiarity with IBM WebSphere Application Server V5.x and the basic concepts of transactions and messaging. The example described here will show how peer recovery can be used to automatically fail over transactions and messages among clustered application servers, and how to configure the environment, verify the results, and troubleshoot basic problems. This article will not discuss how peer recovery works internally or the design of the server components. For background information on the basic concepts, see Transactional high availability and deployment considerations in WebSphere Application Server V6. Other articles referenced throughout this article are available in the WebSphere Application Server Version 6.0.x Information Center.

The example scenario described in this article is based on applications we developed for software testing and is presented for illustrative purposes only. Although executable files are not included with this article, the concepts described here can be applied to similar environments.


Key terminology

To help you understand some of the key points in our example, here are some important terms that are used throughout this article, several of which are new to Network Deployment V6.0.x:

High availability (HA) manager
A new component introduced in Network Deployment V6.0.x, responsible for managing the availability of critical services within a cell, such as transaction managers and messaging engines for cluster members. The HA manager keeps track of when active members join (server starts) and leave (server stops or dies) the high availability group.

Core group
A statically-defined component in the HA manager that is the boundary for the high availability group. Network Deployment V6.0.x creates a default core group, called DefaultCoreGroup, for each cell. The default core group is sufficient for use in most configurations.

High availability group
A dynamically-created component of a core group, created by WebSphere Application Server for use by WebSphere components such as the transaction manager or messaging engine. A high availability group is directly affected by the core group policy configuration and cannot be configured directly by a user.

Core group policy
A set of rules that determines how many members in a high availability group should be active at any given point in time. A group member can only accept work if it is active. In the case of a messaging engine or transaction manager, the active member must be in the started state. There are several different policies (one-of-n, m-of-n, all active, and so on) and policy attributes (for example, preferred servers only, fail back) that control which member(s) of the group are activated; there are default policies for the transaction manager and messaging engine.

Clustered transaction manager policy and Default service integration bus policy
The default core group policies for the transaction manager and default messaging, respectively. Both are one-of-n policies: the group is activated on one of a set of candidates, meaning that no matter how many members are in the group, only one will be "activated". For example, three cluster member servers, TxS1, TxS2, and TxS3, create three self-declared HA groups, HAGrp1, HAGrp2, and HAGrp3, respectively. Each server is a member of each of the three HA groups. Under the default one-of-n policy, only one transaction server is active for its own HA group, so only TxS1 is activated for HAGrp1. If TxS1 fails, a single other server in the group will be activated, such as TxS2.

Transaction manager
A server component that coordinates commands from application programs and communication resource managers to complete or roll back global transactions.

Service integration bus (SIBus)
A logical communication system that groups one or more interconnected servers as its members to support applications using message-based architectures. Any application can exchange messages with any other application by using a destination to which one application sends, and from which the other application receives.

Messaging engine
A server component that provides the messaging functionality for a service integration bus. Each messaging engine is associated with a server (or a server cluster) that is a service integration bus member. When you add an application server (or a server cluster) as a bus member, a messaging engine is automatically created for the new member.

Network Attached Storage (NAS)
A data server on a network that provides file storage accessible via the network, consolidating storage resources to provide simplified management and scalability.

Network File System (NFS) v4
A collection of protocols developed and licensed by Sun Microsystems® to support file sharing among computers running different operating systems. An NFS server exports file directories for remote clients to access; clients access the exported directories by mounting them remotely. In version 3, file locking was handled by a separate, stateful lock manager. In version 4, the file locking mechanism is integrated into the file access protocol, and locks are granted for a specific time (lease expiration) to simplify the recovery schemes.

Basic requirements

To achieve successful transaction or messaging service peer recovery in Network Deployment V6.0.x, you need to understand these basic transaction and messaging requirements, in addition to the standard WebSphere Application Server environment:

  • For transaction failover to work successfully, all application servers must access the same highly available file system for storing transaction logs. When a server fails, one of its peer servers needs to be able to take over the transaction logs from the failed server to perform peer recovery. IBM SAN File System, Sun Network File System (NFS) v4, or Microsoft® Common Internet File System (CIFS) can be used as the highly available file server for the transaction logs, since these file system protocols provide the essential integrated file locking mechanism. (NFS v4 on NAS is used in this example.)

  • Each messaging engine needs to store its information in a data store to keep the messages persistent. The messaging engine uses an instance of a JDBC data source to interact with the database. The messaging engine runs on an application server with an exclusive lock on the database. If the application server fails, another instance of the messaging engine will be activated on a peer server, and it will reconnect to the database to continue processing the work. (IBM Cloudscape® is the default data store, but for this example, we refer to a remote IBM DB2® database as our messaging engine data store.)


Logical layout

The topology of our example is made up of a deployment manager and two application server nodes configured in a single cell (Figure 1):

  • The two nodes and deployment manager are run on separate systems.

  • We use IBM HTTP Server V6.0.x to distribute work requests between the application servers.

  • A DB2 server is used as the messaging engine data store and for application databases. An LDAP server is used for security.

  • We use IBM WebSphere Studio Workload Simulator to generate client requests to stress the system.

  • The transaction logs for each application server are configured on a remote NFS v4 NAS file system.

  • The NAS file system is mounted on all WebSphere Application Server nodes.

Figure 1. Overall topology

Our example uses two EJB applications:

  • Tech: a message sender; this online bookstore application produces messages.
  • Griffin: a message receiver; this order processing application consumes messages.

How it works:

  • Clients will browse the Tech Web site to search for and order books.
  • The Tech application gets the client order requests, stores the information in two databases, and sends the requests to a queue.
  • The Griffin application listens to that queue and picks up the orders.
  • Griffin then processes the messages and stores information in its database.
  • Two-phase commit transactions are involved when Tech orders books and when Griffin processes books.

On the nodes, we create two clusters with two servers each, one for Tech (TechCluster) and one for Griffin (GriffinCluster). The application servers are scaled horizontally across the nodes (Figure 2).

Figure 2. Application server node topology

Installing and configuring

Let's take a high level look at the general setup and configuration of our example. (Detailed product installation documentation is available in the WebSphere Application Server Version 6.0.x Information Center.)

WebSphere Application Server Network Deployment V6.0.x

  • Install Network Deployment on three systems.

  • Create one deployment manager profile and two custom node profiles on Node1 and Node2. Add the custom profiles to the deployment manager profile.

  • If you use NAS as the shared transaction log file system, all application server node machines must be UNIX® systems so that the NAS file system can be mounted. There is no platform restriction for the deployment manager. One of the combinations we use is a Windows® 2000 deployment manager, a Solaris™ application server node, and an AIX® application server node.
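
The profile creation and federation steps above can also be done from the command line. The sketch below assumes a default UNIX installation root; the wasprofile options shown are from memory and should be verified with wasprofile.sh -help on your installation:

    # On the deployment manager machine: create the deployment manager profile.
    # (Paths and option names below are illustrative; verify with wasprofile.sh -help.)
    /opt/IBM/WebSphere/AppServer/bin/wasprofile.sh -create \
        -profileName Dmgr01 \
        -templatePath /opt/IBM/WebSphere/AppServer/profileTemplates/dmgr

    # On Node1 and Node2: create a custom profile, then federate it into the cell.
    # addNode.sh takes the deployment manager host name and its SOAP port (8879 by default).
    /opt/IBM/WebSphere/AppServer/bin/wasprofile.sh -create \
        -profileName Custom01 \
        -templatePath /opt/IBM/WebSphere/AppServer/profileTemplates/managed
    /opt/IBM/WebSphere/AppServer/profiles/Custom01/bin/addNode.sh dmgrhost.example.com 8879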

DB2

  • Create application databases. We have four DB2 databases for our two applications, each of which accesses two databases. Both applications use two-phase commit transactions.

  • We also create two DB2 databases for our messaging engines. At a minimum, there is one messaging engine per cluster.

  • All of these databases are catalogued on the node machines (a sample catalog session follows this list).

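A minimal sketch of the cataloging step, assuming a remote DB2 server named dbhost.example.com listening on port 50000 and a database named TECHDB (the names and credentials are illustrative):

    # Run as the DB2 instance owner on each application server node.
    # Catalog the remote DB2 server once, then catalog each database against that node entry.
    db2 catalog tcpip node db2node remote dbhost.example.com server 50000
    db2 catalog database TECHDB as TECHDB at node db2node
    db2 terminate

    # Verify that the node can reach the database.
    db2 connect to TECHDB user dbuser using dbpassword
    db2 connect reset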

Cluster

  • Create application server clusters with multiple servers for the two applications.

  • To save time configuring cluster members, we first create only one new cluster member. We modify the non-root properties and the transaction log properties, then we save the changes and create the rest of the cluster members. The non-root properties and transaction log properties will be applied to all subsequent application servers created in the cluster.

  • After the cluster is created, select the Enable high availability for persistent services checkbox on the Cluster properties panel (Figure 3).

    Figure 3. Enabling HA on a cluster for persistent services

Non-root

  • Configure the deployment manager, the node agents, and the application servers to run as non-root users.

  • For peer recovery, the non-root user for the application servers must have the same name, user ID, and group ID on all application server systems. If the non-root users have the same name but different identification numbers, transaction peer recovery will not take place, because the machines will not have the same system permissions to access all of the transaction logs. (A sketch of creating a matching user appears at the end of this section.)

    Figure 4. Transaction logs created as non-root
  • Changing to non-root requires a restart of all members. Note the Run As User and Run As Group fields in Figure 5.


    Figure 5. Configuring non-root in the administrative console
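
The sketch below shows one way to create a matching non-root user on each application server node. It assumes Linux or Solaris style commands (on AIX, mkgroup and mkuser are the equivalents); the user and group names and IDs are illustrative:

    # Run as root on every application server node.
    # The group ID and user ID must be identical on all nodes, or peer servers
    # will not have permission to read each other's transaction logs on NAS.
    groupadd -g 2001 wasgroup
    useradd -u 2001 -g wasgroup -m wasuser

    # Confirm that the IDs match on every node before restarting the servers as this user.
    id wasuser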

Transaction log

  • Prepare the application server nodes to mount to the NAS file system. Create a desired directory structure on NAS. For example, the mount point is /mnt/nas and we create a was60 directory, resulting in a base path of /mnt/nas/was60. If you will be running with non-root, switch to the non-root user when you create subdirectories on the mounted directory.

  • For each cluster member, you will have to change the transaction log location to point to the NAS file system. To ease configuration, create a cluster with one member and change the transaction log location, then create the rest of the servers. Edit the transaction log location of each remaining server to make it unique; for example, /mnt/nas/was60/TechServer1, /mnt/nas/was60/TechServer2, /mnt/nas/was60/TechServer3, and so on. The transaction logs must be unique; otherwise two servers will try to access the same log. WebSphere Application Server will create additional directory structure under the supplied name. Note the Transaction log directory field in Figure 6. (A command-line sketch of this setup follows this section.)


    Figure 6. Changing the transaction log location
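
Here is a sketch of preparing the shared log directories and pointing one server at its directory. The mount syntax shown is the Solaris form and differs on AIX or Linux; the export name is illustrative, and the wsadmin one-liner assumes the transactionLogDirectory attribute of the server's TransactionService configuration object:

    # Mount the NAS export on every application server node (Solaris syntax shown).
    mkdir -p /mnt/nas
    mount -F nfs -o vers=4 nasserver:/export /mnt/nas

    # As the non-root WebSphere user, create one log directory per cluster member.
    su - wasuser -c "mkdir -p /mnt/nas/was60/TechServer1 /mnt/nas/was60/TechServer2"

    # Point each server's transaction service at its own directory
    # (this mirrors the console setting shown in Figure 6).
    ./wsadmin.sh -lang jython -c "ts = AdminConfig.list('TransactionService', AdminConfig.getid('/Server:TechServer1/')); AdminConfig.modify(ts, [['transactionLogDirectory', '/mnt/nas/was60/TechServer1']]); AdminConfig.save()"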

Application

  • Once the clusters and resources are set up, we install the Tech and Griffin applications.

  • During install, we assign the applications to their respective clusters.

  • We verify the correctness of the data source and authentication alias mappings. We check that Griffin's message driven bean is assigned to the activation specification created in the Resources setup.


IBM HTTP Server

  • Install IBM HTTP Server and the Web server plug-in on a remote machine.

  • We create the default Web server named webserver1 and add it to our cell. When we fail over, we drive clients against the Web server. WebSphere will map the existing applications to the new Web server.

  • Generate and propagate the plug-in so that the applications can be accessed through the Web server (a command-line sketch follows this list).

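If you prefer the command line to the administrative console for regenerating the plug-in, a sketch is shown below; the paths are illustrative, and the copy target depends on where your plug-in configuration directory lives:

    # On the deployment manager machine: regenerate the cell-wide plug-in configuration.
    /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/GenPluginCfg.sh

    # Copy the generated plugin-cfg.xml to the Web server's plug-in configuration directory.
    scp /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/config/cells/plugin-cfg.xml \
        webhost:/opt/IBM/HTTPServer/Plugins/config/webserver1/plugin-cfg.xml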

Default messaging

  • Create a new service integration bus named BookstoreBus and add both the TechCluster and GriffinCluster as new bus members (a wsadmin sketch of this setup appears at the end of this section).

  • For each member, provide the JNDI name of the data source created during Resources setup. Note the Data store JNDI name field in Figure 7.

  • Two messaging engines are automatically created, named TechCluster.000-BookstoreBus and GriffinCluster.000-BookstoreBus. The name is based on the name of the bus member and a universal unique identifier (UUID) to provide a unique identity.

    Figure 7. Adding a bus member
  • Next, we create a new destination of type "queue" and assign it to GriffinCluster. The TechCluster will be able to connect to the queue via its messaging engine, since they are members on the same bus.

  • WebSphere Application Server creates a link between the GriffinCluster.000-BookstoreBus messaging engine and the TechCluster.000-BookstoreBus messaging engine. The queue is only created on the Griffin side and communication flows through the link between the messaging engines.
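
The same bus setup can be scripted with wsadmin. The sketch below is based on the AdminTask SIB commands; the parameter names and the data source JNDI names are assumptions from memory, so confirm them with AdminTask.help('createSIBus'), AdminTask.help('addSIBusMember'), and AdminTask.help('createSIBDestination') before use:

    # Create the bus, add both clusters as members, and create the order queue on GriffinCluster.
    # Parameter names and JNDI names are assumptions; verify them with AdminTask.help().
    ./wsadmin.sh -lang jython -c "AdminTask.createSIBus('[-bus BookstoreBus]'); AdminTask.addSIBusMember('[-bus BookstoreBus -cluster TechCluster -datasourceJndiName jdbc/TechME]'); AdminTask.addSIBusMember('[-bus BookstoreBus -cluster GriffinCluster -datasourceJndiName jdbc/GrifME]'); AdminTask.createSIBDestination('[-bus BookstoreBus -name OrderQueue -type Queue -cluster GriffinCluster]'); AdminConfig.save()"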

Security

  • Set up an LDAP user directory. In the administrative console, enter values for LTPA settings and the LDAP user registry. Enable security with LTPA authentication and an LDAP user registry.

  • Restart the cell to run securely. Our application has a form login and secured JSPs that require a login after security is enabled.

HA policy

  • We take advantage of several high availability default settings, such as the DefaultCoreGroup created for the cell, and the default one-of-n core group policies for the transaction managers and messaging engines: "Clustered TM Policy" and "Default SIBus Policy".
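
To confirm which core group policies are in effect, you can list them with wsadmin; a minimal check is sketched below, assuming the OneOfNPolicy configuration type name:

    # List the one-of-n core group policies in the cell; the output should include
    # the default "Clustered TM Policy" and "Default SIBus Policy" entries.
    ./wsadmin.sh -lang jython -c "print AdminConfig.list('OneOfNPolicy')"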


Failover and verification

Of course, adequate testing is required to know for sure whether your environment is going to work. It is also crucial to test failover under a stressful workload. To simulate failover with a client workload, we use IBM's WebSphere Studio Workload Simulator to drive hundreds of clients against our applications.

Before you actually run a test, though, you need to carefully examine your environment and document what the test results should be for all your test scenarios. This is especially important when testing for failovers. In our example, we define various failure test actions and document the expected results for each test. Our test plan explicitly states how to verify the results, including checking database integrity, reviewing application server logs, and verifying output based on our application's business logic.

For example, during test actions 1 through 5, documented below:

  • We expect that the messaging engine will restart on a peer server and continue processing messages after a failure.
  • We expect the transaction manager to initiate peer transaction recovery and to complete without leaving in-doubt transactions.
  • We expect that all data is correctly recorded in the four application databases and that no data is lost during a failure.

For test actions 6 and 7, however, we have different expectations:

  • We expect that the failure of the deployment manager or a node agent will not have any impact on the availability of the application servers.

Figure 8 shows all the systems running in a normal state. Tech-ME and Grif-ME are the messaging engines created when the TechCluster and GriffinCluster were added to the service integration bus. T1-TM, T2-TM, G1-TM and G2-TM are the transaction logs for TechServer1, TechServer2, GriffinServer1 and GriffinServer2.

Figure 8. Example environment, all processes running

These are the test actions we defined and performed to cause various types of failures. We randomly and repeatedly:

  1. Gracefully stop one or more application server cluster members. We use the admin console and the stopServer command to gracefully stop the server(s).

  2. Forcefully kill one or more application server cluster members. We issue kill -9 <process number> to abruptly terminate one or more running servers. You can find the application server process identification number by looking at the <servername>.pid file in the server's log directory (a sketch follows this list).

    Figure 9. Example of failover with GriffinServer1 going down
  3. Disable the network access from one of the application server nodes and then enable it again.

  4. Unplug the power cord from one of the application server nodes, then plug it back in. Restart node agent and application servers.

  5. Reboot one of the application server nodes. Restart node agent and application servers.

    Figure 10. Example of failover with Application Server Node1 going down
  6. Kill the deployment manager and node agents at various times and then restart them. We do not expect the application servers to be affected.

    Figure 11. Example of Node agent1 down
  7. Stop applications from the administrative console and then restart them. We expect no failover to occur and the application should shut down gracefully.
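
As a sketch of the forced kill in test action 2, assuming a default profile layout (the profile path is illustrative):

    # Kill GriffinServer1 abruptly; the .pid file in the server's log directory
    # holds the JVM process ID.
    PROFILE_HOME=/opt/IBM/WebSphere/AppServer/profiles/Custom01
    kill -9 `cat $PROFILE_HOME/logs/GriffinServer1/GriffinServer1.pid`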

The above test actions are not all-inclusive. Other scenarios, such as planned machine down time, system replacement, CPU or memory overload, software updates, and so on, must all be considered when preparing for production. You need to verify the test results based on your documented expectations to make sure that your systems and applications meet your high availability requirements.

During each failure test, we perform the following important checks and compare the actual results with our expected test results to verify that peer recovery is working for both the transaction manager and the messaging engine. (The default location of the application server logs is <installRoot>/profiles/<profileName>/logs/<serverName>.)

Messaging engine restarts

  • Review the application server's SystemOut.log to verify that the messaging engine successfully restarts on another server.

    Log example 1. ME starting on a peer server
    [2/24/05 13:02:01:983 CST] 00000049 SibMessage    I   [BookstoreBus:GriffinClust
    er.000-BookstoreBus] CWSID0016I: Messaging engine GriffinCluster.000-BookstoreBus is 
    in state Starting.
     [2/24/05 13:02:03:485 CST] 00000049 SibMessage    I   [BookstoreBus:GriffinClust
    er.000-BookstoreBus] CWSIS1538I: The messaging engine is attempting to obtain an 
    exclusive lock on the data store.
    [2/24/05 13:02:03:664 CST] 0000004b SibMessage    I   
    [BookstoreBus:GriffinCluster.000-BookstoreBus] CWSIS1537I: The messaging engine has 
    acquired an exclusive lock on the data store.
    [2/24/05 13:02:10:218 CST] 00000049 SibMessage    I   [BookstoreBus:GriffinClust
    er.000-BookstoreBus] CWSIP0212I: messaging engine GriffinCluster.000-BookstoreBus on bus 
    BookstoreBus is starting to reconcile the WCCM destination and link configuration.
    [2/24/05 13:02:10:308 CST] 00000049 SibMessage    I   
    [BookstoreBus:GriffinCluster.000-BookstoreBus] CWSIP0213I: messaging engine 
    GriffinCluster.000-BookstoreBus on bus BookstoreBus has finished reconciling the WCCM 
    destination and link configuration.
    [2/24/05 13:02:11:365 CST] 00000049 SibMessage    I   
    [BookstoreBus:GriffinCluster.000-BookstoreBus] CWSID0016I: Messaging engine 
    GriffinCluster.000-BookstoreBus is in state Started.
    [2/24/05 13:02:13:045 CST] 00000051 WSChannelFram A   CHFW0019I: The Transport Channel 
    Service has started chain chain_1.
    [2/24/05 13:02:13:764 CST] 00000051 SibMessage    I   
    [BookstoreBus:GriffinCluster.000-BookstoreBus] CWSIT0028I: The connection for 
    messaging engine GriffinCluster.000-BookstoreBus in bus BookstoreBus to messaging engine 
    TechCluster.000-BookstoreBus started.
  • While a messaging engine restarts, there may be one or two XARecovery messages reporting "No suitable messaging engine is available in bus BookstoreBus." This can be expected as the transaction manager may try to recover a transaction related to messaging before the messaging engine finishes restarting. The transaction manager will retry in a minute and we should not see this message repeated.

    Log example 2. XA recovery related to ME
    [3/3/05 22:50:12:938 CST] 00000051 XARecoveryDat W   WTRN0005W: The XAResource for a 
    transaction participant could not be recreated and transaction recovery may not be able 
    to complete properly. The resource was 
    [com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceInfo@18562154 
    <busName=BookstoreBus> <meName=GriffinCluster.000-BookstoreBus> 
    <meUuid=9D3344B5EB791118> <userName=null> <password=null>]. The 
    exception stack trace follows: com.ibm.ws.Transaction.XAResourceNotAvailableException: 
    com.ibm.websphere.sib.exception.SIResourceException: CWSIT0019E: No suitable messaging 
    engine is available in bus BookstoreBus 
    at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceFactory.getXAResource
    (SibRaXaResourceFactory.java:99)
            at com.ibm.ws.Transaction.JTA.XARecoveryData.recover(XARecoveryData.java:535)
            at com.ibm.ws.Transaction.JTA.PartnerLogTable.recover(PartnerLogTable.java:512)
            at com.ibm.ws.Transaction.JTA.RecoveryManager.resync(RecoveryManager.java:1721)
            at com.ibm.ws.Transaction.JTA.RecoveryManager.run(RecoveryManager.java:2263)
               Caused by: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0019E: 
                No suitable messaging engine is available in bus BookstoreBus
            at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl2.createConnection..
            at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl2.createConnection..
            at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceInfo.createXaResource..
            at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceFactory.getXAResource..

Messages continue to be received

  • Our application prints messages to the application server's SystemOut.log as it processes each message. After a messaging engine restarts on a peer server, we expect to see these entries in the log.

    Log example 3. Griffin application receiving messages
    [3/3/05 22:58:09:571 CST] 00000069 Griffin    A com.ibm.wspi.bookstore.griffin.GriffinReceiveBean
    onMessage Received a message, getting its text message to send to processing.
    [3/3/05 22:58:09:658 CST] 00000069 Griffin       A 
    com.ibm.wspi.bookstore.griffin.GriffinOrderMQLogicBean createOrder Received an order from Tech 
    and created an order in Griffin DB.
    [3/3/05 22:58:09:702 CST] 00000069 Griffin       A 
    com.ibm.wspi.bookstore.griffin.GriffinReceiveBean onMessage Finished processing message.
  • We can also monitor the queue from the administrative console by going to the queue destination on the bus and clicking on the Runtime tab. We can review the number of messages on the queue and review the contents of messages.

    Figure 12. Queue on the administrative console

Transaction peer recovery

  • Review the application server's SystemOut.log to verify that a peer server has picked up the transaction logs and performed recovery.

    Log Example 4. Transaction logs being recovered
    [2/24/05 15:24:37:272 CST] 0000002a RecoveryDirec A   WTRN0100E: Performing recovery 
    processing for this WebSphere server (FileFailureScope: 
    sitkaCell01\sun1Node01\GriffinServer1 [-1533007270])
    [2/24/05 15:24:37:533 CST] 0000002a RecoveryDirec A   WTRN0100E: All persistant services 
    have been directed to perform recovery processing for this WebSphere server 
    (FileFailureScope: sitkaCell01\sun1Node01\GriffinServer1 [-1533007270])
    [2/24/05 15:24:37:622 CST] 0000002a RecoveryDirec A   WTRN0100E: All persistant services 
    have been directed to perform recovery processing for this WebSphere server 
    (FileFailureScope: sitkaCell01\sun1Node01\GriffinServer1 [-1533007270])
    [2/24/05 15:24:39:137 CST] 0000002b RecoveryManag A   WTRN0028I: Transaction service 
    recovering 1 transaction.
  • Various XA recovery messages may follow if the transaction service is recovering transactions. Some may be expected, such as an XAER_NOTA error. This can occur after a server is killed. The transaction manager retries recovery every minute, and after a few retries, the XAER_NOTA should not reoccur. Other XAER messages may need to be investigated.


    Log Example 5. XAER_NOTA on recovery
    [3/4/05 13:09:08:855 CST] 00000071 RecoveryDirec A   WTRN0100E: Performing recovery 
    processing for a peer WebSphere server (FileFailureScope: 
    sitkaCell01\sizzlerNode01\TechServer5 [233834913])
    [3/4/05 13:09:08:876 CST] 00000071 RecoveryDirec A   WTRN0100E: All persistant services 
    have been directed to perform recovery processing for a peer WebSphere server 
    (FileFailureScope: sitkaCell01\sizzlerNode01\TechServer5 [233834913])
    [3/4/05 13:09:09:155 CST] 00000071 RecoveryDirec A   WTRN0100E: All persistant services 
    have been directed to perform recovery processing for a peer WebSphere server 
    (FileFailureScope: sitkaCell01\sizzlerNode01\TechServer5 [233834913])
    [3/4/05 13:09:09:862 CST] 00000072 RecoveryManag A   WTRN0027I: Transaction service 
    recovering 1 transaction.
    ...
     [3/4/05 13:09:14:968 CST] 00000072 WSRdbXaResour E   DSRA0304E:  XAException occurred. 
    XAException contents and details are: "".
    [3/4/05 13:09:14:982 CST] 00000072 WSRdbXaResour E   DSRA0302E:  XAException occurred.  
    Error code is: XAER_NOTA (-4).  Exception is: XAER_NOTA

Application databases

  • There should be no lingering in-doubt transactions on the application databases after the transaction peer recovery finishes. For DB2, you can connect to each database and use the list indoubt transactions command to verify (a sample session follows this list). The return message "SQL1251W No data returned for heuristic query. SQLSTATE=00000" means that there are no leftover transactions.

  • We can also compare key application database tables to verify that the correct number of entries is added. Both Tech and Griffin should have the same number of additional database entries on the tables involved in the orders.
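
A sample in-doubt check for one of the application databases (the database name and credentials are illustrative):

    # Run from a DB2 command line on a machine where the database is catalogued.
    db2 connect to GRIFDB user dbuser using dbpassword
    db2 list indoubt transactions
    # Expected after a successful peer recovery:
    #   SQL1251W  No data returned for heuristic query.  SQLSTATE=00000
    db2 connect reset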


Troubleshooting

Next, we will describe various configuration problems and run time errors you might encounter, along with some possible solutions.

Trouble accessing the transaction logs

If an application server shuts itself down after you attempt to start it, review the application server's SystemOut.log. It could be that the server does not have the right permissions to access the directories. Review the permissions for the transaction logs, verify that all the application server systems can access all of the logs on the NAS machine, then start the server again.

Log example 6. Server shutting itself down
[10/26/04 8:41:39:100 CDT] 00000029 RecoveryHandl A   WTRN0100E: An attempt to 
acquire a file lock need to perform recovery processing failed. Either the target 
server is active or the recovery log configuration is incorrect
....
[10/26/04 8:42:34:921 CDT] 00000027 HAGroupImpl   I   HMGR0130I: The local member of 
group GN_PS=fwsitkaCell01\fwwsaix1Node01\GriffinServer3,IBM_hc=GriffinCluster,type
=WAS_TRANSACTIONS has indicated that is it not alive. The JVM will be terminated.
[10/26/04 8:42:34:927 CDT] 00000027 SystemOut     O Panic:component requested panic from 
isAlive

Transaction failover did not occur

If no failover occurred for the transaction logs when you expected it:

  1. Verify that your transaction log location points to the correct directory on all cluster members involved (a wsadmin spot check is sketched after this list).

  2. Verify that all the systems have the appropriate access to the transaction logs and that the NAS system is mounted. Network Deployment will create logs if it does not find an existing file structure. For example, if the NAS system is no longer mounted, Network Deployment will do a cold start of the logs and create local logs. Peer recovery will not take place since other systems will not have access to the local logs.

    Log example 7. New log creation
    [2/24/05 12:51:52:522 CST] 00000029 LogHandle     A   WTRN0100E: No existing rec
    overy log files to process. Cold starting the recovery log
    [2/24/05 12:51:52:624 CST] 00000029 LogFileHandle A   WTRN0100E: Creating new recovery 
    log file /mnt/nas/was601/GriffinServer1/tranlog/log1
    [2/24/05 12:51:52:986 CST] 00000029 LogFileHandle A   WTRN0100E: Creating new recovery 
    log file /mnt/nas/was601/GriffinServer1/tranlog/log2
    [2/24/05 12:51:53:282 CST] 00000029 LogHandle     A   WTRN0100E: No existing rec
    overy log files to process. Cold starting the recovery log
    [2/24/05 12:51:53:386 CST] 00000029 LogFileHandle A   WTRN0100E: Creating new recovery 
    log file /mnt/nas/was601/GriffinServer1/partnerlog/log1
    [2/24/05 12:51:53:430 CST] 00000029 LogFileHandle A   WTRN0100E: Creating new recovery 
    log file /mnt/nas/was601/GriffinServer1/partnerlog/log2

    (If you are not creating new logs or did not delete existing logs, you should not see the messages in Log example 7.)

  3. Review the permissions on the transaction log directories.

  4. Confirm that at least one other cluster member is started. If other cluster members are not started, no failover will occur until the original server restarts and finishes any work, or until another server starts and does peer recovery.

  5. Review the other cluster member logs for any errors that might indicate they could not get a lock on the transaction logs.

  6. Review the core group policy settings if you created a new policy. If you have Preferred servers only enabled, review the preferred servers list for the correct members.
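
A quick way to spot-check steps 1 through 3 is to dump every server's configured transaction log directory with wsadmin. A minimal sketch, assuming the Jython scripting client (if splitlines is not available on your wsadmin level, split on the line separator instead):

    # Print each TransactionService configuration object and its log directory;
    # every cluster member should point at its own unique directory on the NAS mount.
    ./wsadmin.sh -lang jython -c "for ts in AdminConfig.list('TransactionService').splitlines(): print ts, AdminConfig.showAttribute(ts, 'transactionLogDirectory')"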

XAER messages

There may be some XAER messages, listed below, after failover. Some of these are acceptable or expected. If a transaction tries to recover before a messaging engine has restarted, there may be a recovery error related to the messaging engine; if so, you can ignore it.

Log example 8: XA recovery related to ME
[3/3/05 22:50:12:938 CST] 00000051 XARecoveryDat W   WTRN0005W: The XAResource for a 
transaction participant could not be recreated and transaction recovery may not be able 
to complete properly. The resource was 
[com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceInfo@18562154 
<busName=BookstoreBus> <meName=GriffinCluster.
000-BookstoreBus> <meUuid=9D3344B5EB791118> <userName=null> 
<password=null>]. The exception stack trace follows: 
com.ibm.ws.Transaction.XAResourceNotAvailableException: 
com.ibm.websphere.sib.exception.SIResourceException: CWSIT0019E: No suitable messaging 
engine is available in bus BookstoreBus
        at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceFactory.getXAResource  at
            com.ibm.ws.Transaction.JTA.XARecoveryData.recover(XARecoveryData.java:535)
        at com.ibm.ws.Transaction.JTA.PartnerLogTable.recover(PartnerLogTable.java:512)
        at com.ibm.ws.Transaction.JTA.RecoveryManager.resync(RecoveryManager.java:1721)
        at com.ibm.ws.Transaction.JTA.RecoveryManager.run(RecoveryManager.java:2263)
            Caused by: com.ibm.websphere.sib.exception.SIResourceException: 
             CWSIT0019E: No suitable messaging engine is available in bus BookstoreBus
        at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl2.createConnection 
        at com.ibm.ws.sib.trm.client.TrmSICoreConnectionFactoryImpl2.createConnection 
        at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceInfo.createXaResource   
        at com.ibm.ws.sib.ra.recovery.impl.SibRaXaResourceFactory.getXAResource

Generally, you should only see XAER and recovery errors once or twice, since they should be resolved the next time the transaction manager retries the recovery actions; the transaction manager retries every minute.

If you continue to see XAER or recovery errors, review the database for in-doubt transactions. During testing, if you encounter repeated problems with transactions that are trying to recover, you can stop the application servers and delete the transaction logs. This will not fix the original problem that caused the transactions to fail, but it will give you a clean start. Other XAER messages may need to be investigated. (See Message Reference for WebSphere Transactions.)

When a server fails over, you may see a XAER_NOTA error; if so, you can ignore it. (See Tips for troubleshooting transactions.)

Log example 9: XAER_NOTA after failover
[3/4/05 13:09:08:855 CST] 00000071 RecoveryDirec A   WTRN0100E: Performing recovery 
processing for a peer WebSphere server (FileFailureScope: 
sitkaCell01\sizzlerNode01\TechServer5 [233834913])
[3/4/05 13:09:08:876 CST] 00000071 RecoveryDirec A   WTRN0100E: All persistant services 
have been directed to perform recovery processing for a peer WebSphere server 
(FileFailureScope: sitkaCell01\sizzlerNode01\TechServer5 [233834913])
[3/4/05 13:09:09:155 CST] 00000071 RecoveryDirec A   WTRN0100E: All persistant services 
have been directed to perform recovery processing for a peer WebSphere server 
(FileFailureScope: sitkaCell01\sizzlerNode01\TechServer5 [233834913])
[3/4/05 13:09:09:862 CST] 00000072 RecoveryManag A   WTRN0027I: Transaction service 
recovering 1 transaction.
...
 [3/4/05 13:09:14:968 CST] 00000072 WSRdbXaResour E   DSRA0304E:  XAException 
occurred. XAException contents and details are: "".
[3/4/05 13:09:14:982 CST] 00000072 WSRdbXaResour E   DSRA0302E:  XAException occurred.  
Error code is: XAER_NOTA (-4).  Exception is: XAER_NOTA

Recovery takes a long time

If peer recovery seems to take longer than it should, the peer servers may be waiting for the HA manager to acknowledge that a server is down. The TCP values on the systems themselves may also have to be changed to detect when a system has gone down. For instance, if a lock remains on a DB2 database for hours before it is released, this can prevent the messaging engine from restarting. In such a case, you may need to add or change the TCP keep-alive parameter, KeepAliveTime, on the machine.

On a Windows machine, we add KeepAliveTime to the Windows registry. To do this, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and add a new DWORD value named "KeepAliveTime" with its base set to decimal. The value is entered in milliseconds; for example, 60000 equates to one minute. In one scenario, we used 300000 for this value. (Also see Transaction failover did not occur.)

Server hangs while starting or stopping

If a server hangs while starting, review the transaction log locations for all cluster members and make sure they are all unique. The server start process waits for the transaction manager to start successfully before completing the start of the server. If two servers are trying to access the same logs, one server may wait forever while trying to get a lock.

If you have a non-default HA policy in use without fail back enabled, the transaction logs may be held by another server, which would prevent the failed server from restarting. Without fail back enabled, the server that performs peer recovery keeps the transaction logs rather than letting them fail back to the restarted server. Enable fail back and save the change to let the logs fail back to the original server.

If the server appears to hang while stopping, review the application server's SystemOut log for transaction timeouts or hang messages. In a testing environment, you can try to stop the server again from the administrative console or kill the server process. The messaging engine or transaction manager can cause a server to take a little longer to stop if it is performing failover from the other servers, which are also stopping.

Incorrect configuration settings

  • Incorrect JNDI name for messaging engine data store

    This will cause the messaging engine to fail to start. You should see this message in the application server's SystemOut.log:

    CWSIS0002E: The messaging engine encountered an exception while starting. Exception: com.ibm.ws.sib.msgstore.MessageStoreRuntimeException: CWSIS1524E: Data source, jdbc/TechMEO, not found

    To fix this, go to the messaging engine's Properties page and select Data store. Change the data store JNDI name to the correct name and restart the cluster.

  • Incorrectly configured HA policy

    If you change the default transaction manager policy, Clustered TM Policy, you may cause the transaction logs to incorrectly attempt recovery. You should see a message in the application server's SystemOut.log about this activity:

    An attempt to halt transactional recovery and forward processing for the local server has been aborted.

    Do not change the default policy; create new policies instead. If you have a custom policy, review the match criteria and other policy settings. (See Creating a policy for a high availability group.)

  • Using a listener port instead of activation specification

    If you are using default messaging, you need to use an activation specification. When using IBM WebSphere MQ, you use a listener port. If you mix these by assigning a default messaging connection factory and destination to a listener port, you will get an error during server start. You will see this message in the application server's SystemOut log:

    Unable to start message-driven bean (MDB) {0} against listener port {1}. It is not valid to specify a default messaging Java Message Service (JMS) resource for a listener port; the MDB must be redeployed against a default messaging JMS activation specification.

    Map your application to an activation specification and restart it.

  • Using an existing messaging engine data store

    Each messaging engine has a unique identification number that it stores in its data store. If you try to use an existing data store that another messaging engine has used, you will get an error when the messaging engine tries to start. To fix this, create a new database, change to a different table schema, or drop the existing tables. You will see this message in the application server's SystemOut log:

    CWSIS1535E: The messaging engine's unique id does not match that found in the data store. ME_UUID={0}, ME_UUID(DB)={1}.



Non-default settings

In this article, we illustrated peer recovery for transactions and messages using the default HA settings. There are several additional parameters that can be modified to change the HA run time behavior.

For example, you can create a user-defined core group instead of using the default one, and use non-default core group policies, such as m-of-n or All active, to control how many members are activated at a given point of time.

You can have a variety of policy settings with a one-of-n policy. Below, we use several simple transaction failover examples to show you different run time HA behaviors based on the different settings selected. In this example:

  • TechServer1 and TechServer2 are application servers in the TechCluster.
  • A one-of-n transaction policy called TechServer1TranPolicy was created.
  • Match criteria are associated with TechServer1.
  • Quorum is disabled for all examples.
  • The kill -9 command is used, and the node agent automatically restarts the server.

Example 1
Policy settings: Fail back: enabled; Preferred servers only: disabled; Preferred servers: TechServer1
Action: Kill TechServer1.
HA behavior: Transaction fails over to TechServer2. TechServer1 restarts and takes back its transaction logs.

Example 2
Policy settings: Fail back: enabled; Preferred servers only: enabled; Preferred servers: TechServer1
Action: Kill TechServer1.
HA behavior: No failover to TechServer2. TechServer1 restarts and finishes the transaction.

Example 3
Policy settings: Fail back: enabled; Preferred servers only: enabled; Preferred servers: TechServer1 and TechServer2
Action: Kill TechServer1.
HA behavior: Transaction fails over to TechServer2. TechServer1 restarts and takes back its transaction logs.

Example 4
Policy settings: Fail back: disabled; Preferred servers only: enabled; Preferred servers: TechServer1 and TechServer2
Action: Kill TechServer1.
HA behavior: Transaction fails over to TechServer2. TechServer1 waits when it restarts because TechServer2 retains the transaction logs (fail back is not enabled). You can enable fail back on the policy, which will let TechServer1 take back its transaction logs and complete its restart.

Figure 13. A transaction policy

Conclusion

IBM WebSphere Application Server Network Deployment V6.0.x introduces the new High Availability Manager component to provide a high availability solution for customers, and to support application server hot peer recovery for transactions and messaging, internally and automatically. Third party clustering software is no longer needed for transaction and messaging failover among clustered application servers. By providing a hot peer failover capability for its critical services, Network Deployment eliminates a single point of failure, significantly improves application server availability, and greatly reduces the system down time.

This article discussed the steps for configuring, verifying, and troubleshooting an environment in which peer application servers take over in-flight messages and in-doubt transactions from failed servers. Peer recovery was described in an example scenario made up of a clustered environment with multiple applications and application servers, default messaging, two-phase commit transactions, non-root users, remote databases for messaging, and remote NAS for transaction logs.

To conduct a complete failover test for transactions and messages using Network Deployment, you need to identify all possible critical failures that could have a great impact on the availability of your environment. The failures we attempted do not cover all possible scenarios; others, such as CPU or memory overload, should all be considered when preparing for production. Applications can also be scaled vertically or split across cells with the use of foreign buses.

The example described here is provided as a learning tool for working with the new WebSphere Application Server Network Deployment V6.0.x features, and for planning high availability environments.
