Ensure higher availability by making your XA transaction manager resilient to resource manager failures

Many middleware and database products adhere to the XA protocol to achieve transactional capabilities while interoperating between middleware and resource managers. There can be scenarios where the availability of the transaction managers could be compromised due to failures (such as transient or long-duration network failures) or non-availability of one of the resource managers. This article looks at some common error scenarios that could affect XA transaction availability, ways to help you determine if your transaction manager will maintain higher availability during failure conditions (and ways to make sure it does), and some tips about resource managers. This content is part of the IBM WebSphere Developer Technical Journal.


Ajay Sood (r1sood@in.ibm.com), Senior Software Engineer, IBM  

Ajay Sood is a Senior Software Engineer at IBM in Bangalore, India. He has been with IBM for more than 13 years, and has been a developer on product development teams such as DB2 DataLinks and TXSeries-CICS. His experience includes work in transaction processing, middleware, and system programming on UNIX platforms.



Tomohiro Taguchi (TOMOTAG@jp.ibm.com), IT Specialist, IBM  

Tomohiro Taguchi is an IT specialist at IBM Japan Systems Engineering Co., Ltd. (ISE) in Makuhari, Japan. He has worked in technical support, mainly for CICS family products including TXSeries and CICS Transaction Gateway.



22 July 2009


Introduction

As described in an earlier article on configuring and using XA in a middleware environment, the XA standards, set forth by the Open Group's X/Open Distributed Transaction Processing (DTP) model, define the interfaces between the transaction manager, application program, and the resource manager to achieve the two-phase commit in a DTP environment:

  • The application program implements the desired business function. It specifies a sequence of operations that involve resources, such as databases. An application program defines the start and end of a global transaction, accesses resources within transaction boundaries, and usually decides whether to commit or roll back each transaction.
  • The transaction manager manages global transactions and coordinates the decision to commit them or roll them back, thus ensuring their atomicity. The transaction manager also coordinates recovery activities of the resource managers when necessary, such as after a component failure.
  • The resource manager manages a certain part of the computer's shared resources. Many other software entities can request access to the resources from time to time, using services that the resource managers provide. Some resource managers manage a communications resource.

In a distributed transaction environment that involves XA-compliant resource managers, an application needs to be especially cognizant of transaction processing, because running in this environment brings additional considerations. Applications must be aware of how the transaction manager behaves if a resource manager fails, and must include logic to handle such failures. It is equally important that the transaction manager itself can handle certain resource manager failures and continue operating when they occur. This helps ensure the availability of the transaction manager and, to some extent, can simplify application logic. As an application programmer, you should test the behavior of the transaction manager, in addition to the application, in these situations so that you can accommodate any special handling that might be needed.

This article looks at the expected behavior of a transaction manager when a resource manager fails. The following sections will discuss common failure conditions, the corresponding XA errors, the behavior of transaction managers when participating resource managers fail, and considerations to help you achieve higher availability of the transaction managers in these scenarios.


Common causes of resource manager failure

There are a few common reasons why a resource manager might fail or otherwise be unavailable. Among them:

  • Runtime inconsistencies on the resource manager side can cause a transaction to fail.
  • A failure in the resource manager's storage can cause the resource manager to go down.
  • A transient or long-duration network outage can make the resource manager unavailable.

There are two major “catastrophic” XA errors that can be returned by a resource manager:

  • XAER_RMERR: This error indicates that a resource manager error occurred in the transaction branch. A few scenarios can result in this error. For example, the resource manager might conclude that it can never commit the branch and cannot hold the branch's resources in a prepared state, or an error might occur while dissociating the transaction branch from the thread of control.
  • XAER_RMFAIL: This error indicates that a general error occurred that has made the resource manager unavailable.
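As a minimal sketch of how a transaction manager might branch on these two errors, the following Python fragment maps an XA return code to a coarse action. The numeric values are the ones defined in the X/Open XA header (xa.h); the function name `classify_xa_result` and the action labels are hypothetical and chosen only for illustration. The mapping itself is an assumption, and real resource managers can return either code for an outage.

```python
# Return-code values as defined in the X/Open XA header (xa.h).
XA_OK = 0
XAER_RMERR = -3   # resource manager error in the transaction branch
XAER_RMFAIL = -7  # resource manager unavailable


def classify_xa_result(rc):
    """Hypothetical helper: map an XA return code to a coarse action.

    The action labels are illustrative assumptions, not part of XA.
    """
    if rc == XA_OK:
        return "ok"
    if rc == XAER_RMFAIL:
        # The resource manager is unavailable: stop sending XA flows to it.
        return "mark_unavailable"
    if rc == XAER_RMERR:
        # The branch hit an error: the transaction must fail.
        return "fail_transaction"
    # Any other code is left to more specific handling.
    return "other_error"


for code in (XA_OK, XAER_RMERR, XAER_RMFAIL):
    print(code, classify_xa_result(code))
```

Keeping this decision in one place makes it easier to see, and to test, which failures affect only a single transaction and which affect every transaction that uses the resource manager.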

The failure of a participating resource manager can happen at different points of execution. The transaction manager will generally detect the failure while making one of the XA related calls. Here are a few example cases and the corresponding desired behavior from the transaction manager:

  • The process executing the transaction detects an error while establishing a connection using XA_OPEN.

    Desired behavior: If XA_OPEN fails, the transaction manager should issue a warning and continue normally. The transaction manager should not service transactions directed to the unavailable resource manager, and there should not be any XA-related flows to the unavailable resource manager.

  • The transaction manager detects an error before starting the transaction.

    Desired behavior: If any of the calls in the flow (such as XA_START, XA_END, XA_PREPARE, and so on) fail, then the transaction should fail. There should not be any further XA flows to the unavailable resource manager.

  • The resource manager fails between prepare and commit processing, and the transaction manager detects the failure.

    Desired behavior: If the failure occurs before commit processing is complete, then the transaction becomes in-doubt. The transaction manager is expected to finish the transaction when the resource manager becomes available again. It is left up to the implementer of the transaction manager to decide whether the thread of execution continues to retry the commit call or simply remains blocked.
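The retry option for an in-doubt transaction can be sketched as a bounded loop. This is an illustration under stated assumptions, not the behavior of any particular transaction manager: `commit_branch` is a hypothetical callable that stands in for the commit call and returns an XA return code, and the set of codes treated as retryable is an assumption that you would adjust for your resource manager.

```python
import time

# Return-code values as defined in the X/Open XA header (xa.h).
XA_OK = 0
XAER_RMFAIL = -7  # resource manager unavailable


def resolve_in_doubt(commit_branch, max_attempts=5, delay_seconds=0.0,
                     retryable=(XAER_RMFAIL,)):
    """Retry the commit of an in-doubt branch a bounded number of times.

    commit_branch is a hypothetical zero-argument callable returning an
    XA return code. Returns True once the commit succeeds, or False if
    the branch is still in doubt after max_attempts.
    """
    for _ in range(max_attempts):
        rc = commit_branch()
        if rc == XA_OK:
            return True
        if rc not in retryable:
            # Not an availability problem: surface it instead of retrying.
            raise RuntimeError(f"unexpected XA return code {rc}")
        # A real implementation would use a non-zero delay or a backoff.
        time.sleep(delay_seconds)
    return False


# Simulated resource manager: unavailable twice, then available again.
codes = iter([XAER_RMFAIL, XAER_RMFAIL, XA_OK])
print(resolve_in_doubt(lambda: next(codes)))
```

Bounding the number of attempts keeps the unit of execution from blocking indefinitely; a transaction that is still in doubt afterward can be handed to a separate recovery task.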


Can your transaction manager promise higher availability?

Here are some considerations you can apply against your transaction manager to make sure that it will continue to be highly available even after a catastrophic XA failure, such as when a resource manager is not available during a distributed XA transaction:

  1. Are you able to run non-XA transactions using the transaction manager after the failure and before the resource manager becomes available?
  2. Are you able to run transactions that do not involve the failed resource manager?
  3. Are you able to submit the next transaction on the transaction manager without any problem, even as the failed resource manager continues to be unavailable?
  4. Does the transaction that involves the failed resource manager get resolved automatically when the resource manager becomes available again?
  5. What is the behavior of the transaction manager before the resource manager becomes available? Does the thread or process executing the transaction to be recovered keep retrying the resource manager, expecting it to become available?
  6. Does the transaction manager store stale connections? Do the first invocations of new transactions fail after the resource manager becomes available?

The next section presents some test cases to help you answer these questions as they relate to the transaction manager, in the event of a resource manager failure at different points in time. If any of these conditions are not satisfied, you should check with your vendor about the capabilities of the transaction manager.


Test cases for XA resiliency

The following four test cases will help you determine whether your transaction manager will be able to perform with higher availability should a resource manager fail. All test cases share these assumptions:

  • The transaction manager can either be a multi-threaded or a multi-process environment. If multi-threaded, the unit executing a transaction is a thread (called a thread of execution, or ToE). If multi-process, the unit executing a transaction is a process (called a process of execution, or PoE).
  • These test cases present a multi-process transaction manager, and so each transaction is executed in a separate process. However, the desired behavior of the transaction manager would be the same even if the unit of execution was a thread. The intent is to show that only the process or the thread executing the transaction is affected, while the overall transaction manager remains unaffected and available for other kinds of work.
  • IBM® DB2® is used here as an example of a resource manager, which will undergo state changes (available or not available) at different points during the tests. The behavior of the transaction manager should not be affected by the state change of the resource manager.
  • During lab testing, debuggers such as dbx (on AIX®) were used to force stopping or starting of the PoE at certain points. You can use the tools or methods of your choice when running these tests.

Case 1: Resource manager fails during transaction

Scenario: The resource manager becomes unavailable during the execution of a transaction that uses the resource manager. See Figure 1.

The test:

  1. The transaction manager is running and available.
  2. A process in the transaction manager executes a DB2-related transaction.
  3. Bring down DB2 while the transaction is in progress.
  4. The transaction issues a commit or a rollback (xa_commit flows from the process).
  5. The transaction (and, therefore, the process) encounters an xa_error (XAER_RMERR) because DB2 is unavailable.
  6. The process executing the transaction handles the error in its desired manner (for example, restart the process of executing the transaction).
  7. Check behavior:
    • Make sure that the transaction manager overall remains available, even after the process detects the failure. Only the process executing the transaction should be affected.
    • If the behavior of the transaction manager is to restart the process in the event of a failure like XAER_RMERR, check if the restarted process of execution (PoE(1) in Figure 1) retries until the failed transaction is recovered. The transaction recovers when the failed resource manager becomes available. This check is relevant only if the failed transaction requires recovery.
    • Other transactions involving the failed DB2 instance might fail until that particular DB2 instance becomes available again.
    • Any transactions not involving DB2 should continue to work normally, even if the failed DB2 instance remains unavailable.
  8. Bring up DB2, making it available.
  9. All freshly submitted transactions should again work normally.

This test will help you verify questions 3, 4, and 5.

Figure 1. Case 1

Case 2: Resource manager fails while connected to a process

Scenario: The process of execution is connected to the resource manager. The resource manager goes down. Submit a transaction involving the failed resource manager. See Figure 2.

The test:

  1. The transaction manager is running and available.
  2. The process for executing transactions is connected to DB2. No transactions are running.
  3. Bring down DB2.
  4. Run a transaction that does not involve DB2. The transaction should execute normally.
  5. Run a transaction that does involve DB2. The transaction should fail and an SQL error should occur.
  6. The PoE should handle the error and take appropriate action.
  7. Check behavior:
    • The transaction manager overall should remain available.
  8. Bring up DB2.
  9. Invoke a transaction involving DB2. An XA_OPEN is issued on the start of the new transaction and a connection is established. The transaction should execute normally.

This test will help you verify question 1.

Figure 2. Case 2

Case 3: Process retaining stale handles after resource manager failure

Scenario: The resource manager goes down. Submit a transaction involving the resource manager after it becomes available. See Figure 3.

The test:

  1. The transaction manager is running.
  2. The process for executing transactions is connected to DB2 using XA_OPEN.
  3. Bring down DB2.
  4. Transactions not involving DB2 should run normally.
  5. Bring up DB2.
  6. Submit a transaction involving DB2. Check if the application gets a connect error because of some stale connection handles in the PoE.
  7. Check behavior:
    • In some transaction managers, the PoE is restarted after the error caused by stale handles, after which a new connection is established and the transaction is resubmitted. This implies that the PoE needs to be recycled to get rid of the old connection handles.

This test will help you verify question 6.

Figure 3. Case 3
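One way to avoid the stale-handle error checked in Case 3 is to validate the cached connection before each transaction and reopen it when it is no longer valid. The sketch below simulates that idea. Both classes are hypothetical stand-ins (they do not represent DB2 or any real transaction manager API), and the "generation" counter is simply a device for making old handles detectable in the simulation.

```python
class FakeResourceManager:
    """Hypothetical stand-in for a resource manager such as DB2."""

    def __init__(self):
        self.generation = 0  # incremented on every restart

    def restart(self):
        # Simulates bringing the resource manager down and up again.
        self.generation += 1

    def open(self):
        # A handle is only valid for the generation that issued it.
        return self.generation

    def is_valid(self, handle):
        return handle == self.generation


class ProcessOfExecution:
    """Hypothetical PoE that caches one connection handle."""

    def __init__(self, rm):
        self.rm = rm
        self.handle = rm.open()
        self.reconnects = 0

    def run_transaction(self):
        # Validate the cached handle first so that a stale handle never
        # surfaces to the application as a connect error.
        if not self.rm.is_valid(self.handle):
            self.handle = self.rm.open()
            self.reconnects += 1
        return "committed"


rm = FakeResourceManager()
poe = ProcessOfExecution(rm)
rm.restart()                   # resource manager goes down and comes back
print(poe.run_transaction())   # first transaction after the restart
print(poe.reconnects)
```

With this approach the first transaction after the resource manager returns pays the cost of one reconnect, and the PoE does not need to be recycled.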

Case 4: Starting a process and submitting a transaction with an unavailable resource manager

Scenario: Start a PoE when the resource manager is unavailable. Submit a transaction when the resource manager is unavailable. See Figure 4.

The test:

  1. The transaction manager is available.
  2. DB2 is unavailable.
  3. Start a new PoE. XA_OPEN fails.
  4. Submit a transaction involving SQL statements directed toward DB2. An SQL error is encountered because DB2 is unavailable. No ax_reg flows (in the case of dynamic registration). A rollback should work fine, because no XA calls are made to the unavailable DB2 (this particular DB2 is not yet registered as a participating XA resource manager).
  5. Check behavior:
    • The transaction manager overall should remain available.
  6. Non-SQL transactions (and transactions not involving DB2) should continue to work normally.
  7. Bring up DB2.
  8. Submit a transaction involving DB2. The connection is established with DB2 and the transaction should succeed.

This test will help you verify question 2.

Figure 4. Case 4
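The reason the rollback in Case 4 succeeds can be illustrated with a small simulation of dynamic registration: a resource manager joins the transaction only when work actually reaches it, so a resource manager that was never reached receives no XA calls at rollback. The `Transaction` class and its method names are hypothetical and only model the bookkeeping; they are not the XA interface itself.

```python
class Transaction:
    """Hypothetical model of a global transaction with dynamic registration."""

    def __init__(self):
        self.registered = []  # resource managers that joined (simulated ax_reg)
        self.xa_calls = []    # record of XA flows sent to resource managers

    def do_work(self, rm_name, rm_available):
        if not rm_available:
            # The work fails before the resource manager can register,
            # so no ax_reg flows for it.
            raise ConnectionError(f"{rm_name} is unavailable")
        self.registered.append(rm_name)

    def rollback(self):
        # Only registered resource managers take part in the rollback.
        for rm_name in self.registered:
            self.xa_calls.append(("xa_rollback", rm_name))
        return "rolled_back"


txn = Transaction()
try:
    txn.do_work("DB2", rm_available=False)
except ConnectionError as err:
    print("SQL error:", err)
print(txn.rollback())
print(txn.xa_calls)
```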

Protecting your transaction manager against resource manager failures

Here are some design considerations to help you make your transaction manager resilient to resource manager failures:

  • Make sure the unavailable resource manager is marked unavailable with the transaction manager, and that the transaction manager is able to continue with other resource managers.
  • The transaction manager should be able to continue with transactions that do not involve the failed resource manager.
  • The transaction manager should be available, even in the event of a catastrophic failure on one of the resource managers participating in the transaction.
  • The transaction manager should be able to detect the availability of the resource manager as it becomes available.
  • The transaction manager should be able to resolve the failed transaction when the failed resource manager becomes available.
  • The transaction manager should be able to continue with new transaction requests involving the failed resource manager when it becomes available.
  • Stale connections should be removed from the transaction manager cache so that first invocations of transactions after the resource manager becomes available do not fail.
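Several of these considerations (marking a resource manager unavailable, continuing with unrelated transactions, and detecting when the resource manager returns) can be sketched as a small availability registry. This is a simplified illustration under assumptions: the class `ResourceManagerRegistry`, its methods, and the probe callables are all hypothetical, and a real transaction manager would combine this with its own recovery processing.

```python
class ResourceManagerRegistry:
    """Hypothetical registry of resource manager availability."""

    def __init__(self, probes):
        # probes maps a resource manager name to a zero-argument callable
        # that returns True when that resource manager is reachable.
        self.probes = probes
        self.unavailable = set()

    def mark_unavailable(self, name):
        self.unavailable.add(name)

    def refresh(self):
        """Re-probe unavailable resource managers; return those recovered."""
        recovered = {n for n in self.unavailable if self.probes[n]()}
        self.unavailable -= recovered
        return recovered

    def can_run(self, needed):
        """True if a transaction that uses the `needed` names can be accepted."""
        return not (set(needed) & self.unavailable)


state = {"DB2": False, "MQ": True}  # simulated reachability
registry = ResourceManagerRegistry({n: (lambda n=n: state[n]) for n in state})

registry.mark_unavailable("DB2")
print(registry.can_run(["MQ"]))         # unrelated work continues
print(registry.can_run(["DB2", "MQ"]))  # work that needs DB2 is rejected
state["DB2"] = True                     # DB2 comes back
print(registry.refresh())
print(registry.can_run(["DB2", "MQ"]))
```

Rejecting work up front for a resource manager that is known to be down avoids sending XA flows to it, while transactions that do not need it are unaffected.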

Finally, here are a few tips to help you work with some popular resource managers:

  • For IBM WebSphere® MQ, ax_reg flows to the transaction manager only when MQPUT with the syncpoint option is issued. This implies that if WebSphere MQ is configured for dynamic registration and a failure happens before MQPUT with syncpoint, WebSphere MQ might not support the XA resiliency feature.
  • With some popular resource managers other than DB2, if the failure happens during xa_end, XAER_PROTO is returned instead of XAER_RMERR. The transaction manager needs to tolerate these differences, because different resource managers return different error codes for similar kinds of failures.
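One way to tolerate such differences is a per-resource-manager table of the return codes that should be treated as a failure for a given call. The sketch below assumes such a table for xa_end. The numeric values come from the X/Open XA header (xa.h), but the table contents, the key "OtherRM", and the function name are hypothetical, so you would populate the table from what you observe in your own tests.

```python
# Return-code values as defined in the X/Open XA header (xa.h).
XAER_RMERR = -3
XAER_PROTO = -6
XAER_RMFAIL = -7

# Assumption: codes that each resource manager has been observed to return
# from xa_end when it has failed. "OtherRM" is a placeholder name.
XA_END_FAILURE_CODES = {
    "DB2": {XAER_RMERR, XAER_RMFAIL},
    "OtherRM": {XAER_PROTO, XAER_RMFAIL},
}


def xa_end_indicates_failure(rm_name, rc):
    """Hypothetical check: does this xa_end return code mean the RM failed?"""
    # Unknown resource managers fall back to the unambiguous code only.
    return rc in XA_END_FAILURE_CODES.get(rm_name, {XAER_RMFAIL})


print(xa_end_indicates_failure("DB2", XAER_RMERR))
print(xa_end_indicates_failure("OtherRM", XAER_PROTO))
print(xa_end_indicates_failure("DB2", XAER_PROTO))
```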

Conclusion

In today’s enterprise, where the availability of core systems is key, it is very important that core components like transaction managers remain available in spite of failures of other related components. The test cases presented in this article can help you proactively determine whether your transaction manager is resilient to certain kinds of failures in enterprise systems.
