Application recovery is the automated continuation of application processing after failover. Application recovery following failover requires careful design. Some applications need to be aware that failover has taken place.
The objective of application recovery is for the application to continue processing with only a short delay. Before continuing with new processing, the application must back out and resubmit the unit of work that it was processing during the failure.
A problem for application recovery is losing the context that is shared between the IBM® MQ MQI client and the queue manager, and stored in the queue manager. The IBM MQ MQI client restores most of the context, but there are some parts of the context that cannot be reliably restored. The following sections describe some properties of application recovery and how they affect the recovery of applications connected to a multi-instance queue manager.
From the perspective of delivering messages, failover does not change the persistent properties of IBM MQ messaging. If messages are persistent, and correctly managed within units of work, then messages are not lost during a failover.
From the perspective of transaction processing, transactions are either backed out or committed after failover.
Uncommitted transactions are rolled back. After failover, a reconnectable application receives an MQRC_BACKED_OUT reason code to indicate that the transaction has failed. It then needs to restart the transaction.
Committed transactions are transactions that have reached the second phase of a two-phase commit, or single phase (message only) transactions that have begun MQCMIT.
If the queue manager is the transaction coordinator and MQCMIT has begun the second phase of its two-phase commit before the failure, the transaction successfully completes. The completion is under the control of the queue manager and continues when the queue manager is running again. In a reconnectable application, the MQCMIT call completes normally.
In a single phase commit, which involves only messages, a transaction that has started commit processing completes normally under the control of the queue manager once it is running again. In a reconnectable application, the MQCMIT completes normally.
Reconnectable clients can use single phase transactions under the control of the queue manager as the transaction coordinator. The extended transactional client does not support reconnection. If reconnection is requested when the transactional client connects, the connection succeeds, but without the ability to be reconnected. The connection behaves as if it is not reconnectable.
Application restart or resume
Failover interrupts an application. After a failure, an application can restart from the beginning, or it can resume processing following the interruption. The latter is called automatic client reconnection. Automatic client reconnection is not supported by IBM MQ classes for Java™.
With an IBM MQ MQI client application, you can set a connection option to reconnect the client automatically. The options are MQCNO_RECONNECT or MQCNO_RECONNECT_Q_MGR. If no option is set, the client does not try to reconnect automatically and the queue manager failure returns MQRC_CONNECTION_BROKEN to the client. You might design the client to try to start a new connection by issuing a new MQCONN or MQCONNX call.
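The manual reconnection design described above might look like the following sketch. It is a Python simulation, not MQI code: `FakeConnection`, `Client`, and `MQError` are hypothetical stand-ins, the reason codes are plain strings rather than the numeric MQI values, and the connection is modelled in memory so that a break can be injected. The sketch assumes the break is detected before the put is attempted, so repeating the put on the new connection is safe.

```python
MQRC_CONNECTION_BROKEN = "MQRC_CONNECTION_BROKEN"      # stand-in strings,
MQRC_Q_MGR_NOT_AVAILABLE = "MQRC_Q_MGR_NOT_AVAILABLE"  # not the MQI values
CONNECTION_LOST = {MQRC_CONNECTION_BROKEN, MQRC_Q_MGR_NOT_AVAILABLE}


class MQError(Exception):
    """Hypothetical exception carrying an MQI-style reason code."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class FakeConnection:
    """Hypothetical connection that breaks after a set number of puts."""
    def __init__(self, puts_before_break):
        self._remaining = puts_before_break

    def open(self, name):                 # plays the role of MQOPEN
        return ("handle", name)

    def put(self, handle, msg, sink):     # plays the role of MQPUT
        if self._remaining == 0:
            raise MQError(MQRC_CONNECTION_BROKEN)
        self._remaining -= 1
        sink.append(msg)


class Client:
    def __init__(self, connect, queue_names):
        self._connect = connect           # callable standing in for MQCONN/MQCONNX
        self._queue_names = queue_names
        self.connects = 0
        self._establish()

    def _establish(self):
        # Create a new connection, then reopen every object the application uses.
        self.conn = self._connect()
        self.handles = {n: self.conn.open(n) for n in self._queue_names}
        self.connects += 1

    def put(self, queue, msg, sink):
        try:
            self.conn.put(self.handles[queue], msg, sink)
        except MQError as err:
            if err.reason not in CONNECTION_LOST:
                raise
            self.handles = {}             # old handles belong to the dead connection
            self._establish()
            self.conn.put(self.handles[queue], msg, sink)


delivered = []
budgets = iter([2, 10])                   # the first connection breaks after two puts
client = Client(lambda: FakeConnection(next(budgets)), ["REPLY.Q"])
for text in ["m1", "m2", "m3"]:
    client.put("REPLY.Q", text, delivered)
print(client.connects, delivered)
```

In this run the third put hits the broken connection, the client connects a second time, reopens `REPLY.Q`, and repeats the put.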
Server programs have to be restarted; they cannot be automatically reconnected by the queue manager at the point they were processing when the queue manager or server failed. IBM MQ server programs are typically not restarted on the standby queue manager instance when a multi-instance queue manager instance fails. You can automate the restart of a server program in either of the following ways:
- Package your server application as a queue manager service. It is restarted when the standby queue manager restarts.
- Write your own failover logic, triggered, for example, by the failover log message written by a standby queue manager instance when it starts. The application instance then needs to call MQCONN or MQCONNX after it starts, to create a connection to the queue manager.
- A messaging application that gets or receives messages over a messaging channel does not normally require the queue manager at the other end of the channel to be running: it is unlikely to be affected if the queue manager at the other end of the channel restarts on a standby instance.
- An IBM MQ MQI client application processes persistent message input from one queue and puts persistent message responses onto another queue as part of a single unit of work: if it handles an MQRC_BACKED_OUT reason code from MQPUT, MQGET or MQCMIT within sync point by restarting the unit of work, then no messages are lost. Additionally, the application does not need to do any special processing to deal with a connection failure.
Losing the browse cursor is one example of how the application context changes following reconnection. Other cases are documented in Recovery of an automatically reconnected client.
- No reconnection
In this pattern, the application stops all processing on the current connection when the connection is broken. For the application to continue processing, it must establish a new connection with the queue manager. The application is entirely responsible for transferring any state information it requires to continue processing on the new connection. Existing client applications that reconnect with a queue manager after losing their connection are written in this way.
The client receives a reason code, such as MQRC_Q_MGR_NOT_AVAILABLE, from the next MQI call after the connection is lost. The application must discard all its IBM MQ state information, such as queue handles, and issue a new MQCONN or MQCONNX call to establish a new connection, and then reopen the IBM MQ objects it needs to process.
The default MQI behavior is for the queue manager connection handle to become unusable after a connection with the queue manager is lost. The default is equivalent to setting the MQCNO_RECONNECT_DISABLED option on MQCONNX to prevent application reconnection after failover.
- Failover tolerant
- Write the application so it is unaffected by failover. Sometimes careful error handling is sufficient to deal with failover.
- Reconnection aware
- Register an MQCBT_EVENT_HANDLER event handler with the queue manager. The event handler is posted with MQRC_RECONNECTING when the client starts to try to reconnect to the server, and MQRC_RECONNECTED after a successful reconnection. You can then run a routine to reestablish a predictable state so that the client application is able to continue processing.
Recovery of an automatically reconnected client
Failover is an unexpected event, and for an automatically reconnected client to work as designed the consequences of reconnection must be predictable.
A major element of turning an unexpected failure into a predictable and reliable recovery is the use of transactions.
The second example in the previous section described an IBM MQ MQI client that uses a local transaction to coordinate MQGET and MQPUT. The client issues an MQCMIT or MQBACK call in response to an MQRC_BACKED_OUT error and then resubmits the backed out transaction. The queue manager failure causes the transaction to be backed out, and the behavior of the client application ensures no transactions, and no messages, are lost.
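The get, put, and commit loop described here can be sketched as follows. This is a Python simulation under stated assumptions rather than MQI code: `FakeQueueManager`, `MQError`, and `process_one` are hypothetical names, the reason code is a plain string, and the queue manager is modelled in memory so that a failover can be injected at commit time.

```python
from collections import deque

MQRC_BACKED_OUT = "MQRC_BACKED_OUT"   # stand-in string, not the numeric MQI value


class MQError(Exception):
    """Hypothetical exception carrying an MQI-style reason code."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class FakeQueueManager:
    """In-memory stand-in that handles one unit of work at a time."""
    def __init__(self, inputs, failovers=0):
        self.input_q = deque(inputs)
        self.output_q = []
        self._got = []                # messages removed under sync point
        self._put = []                # messages put under sync point
        self._failovers = failovers

    def get(self):                    # plays the role of MQGET under sync point
        msg = self.input_q.popleft()
        self._got.append(msg)
        return msg

    def put(self, msg):               # plays the role of MQPUT under sync point
        self._put.append(msg)

    def back(self):                   # plays the role of MQBACK
        self.input_q.extendleft(reversed(self._got))
        self._got, self._put = [], []

    def commit(self):                 # plays the role of MQCMIT
        if self._failovers > 0:
            # Simulated failover: the uncommitted work is rolled back.
            self._failovers -= 1
            self.back()
            raise MQError(MQRC_BACKED_OUT)
        self.output_q.extend(self._put)
        self._got, self._put = [], []


def process_one(qm, max_attempts=5):
    """Run one unit of work, resubmitting it each time it is backed out."""
    for attempt in range(1, max_attempts + 1):
        try:
            request = qm.get()
            qm.put(request.upper())
            qm.commit()
            return attempt
        except MQError as err:
            if err.reason != MQRC_BACKED_OUT:
                raise
            qm.back()                 # nothing left to undo here; mirrors the MQBACK call
    raise RuntimeError("unit of work was backed out on every attempt")


qm = FakeQueueManager(["a", "b"], failovers=1)
attempts = [process_one(qm), process_one(qm)]
print(attempts, qm.output_q, list(qm.input_q))
```

Because the get and the put belong to the same unit of work, the injected failover returns the request to the input queue and discards the uncommitted reply, so the second attempt produces exactly one reply per request.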
Not all program state is managed as part of a transaction, and therefore the consequences of reconnection become harder to understand. You need to know how reconnection changes the state of an IBM MQ MQI client in order to design your client application to survive queue manager failover.
You might decide to design your application without any special failover code, handling reconnection errors with the same logic as other errors. Alternatively, you might choose to recognize that reconnection requires special error processing, and register an event handler with IBM MQ to run a routine to handle failover. The routine might handle the reconnection processing itself, or set a flag to indicate to the main program thread that when it resumes processing it needs to perform recovery processing.
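The flag-based variant mentioned above might be sketched like this. The code is a Python simulation, not the MQI callback interface: `ReconnectMonitor`, `on_event`, and `main_loop` are hypothetical names, the reason codes are plain strings, and the events are fired by the test data rather than by a client library.

```python
import threading

MQRC_RECONNECTING = "MQRC_RECONNECTING"   # stand-in strings, not MQI values
MQRC_RECONNECTED = "MQRC_RECONNECTED"


class ReconnectMonitor:
    """Hypothetical event handler; on_event plays the role of the callback."""
    def __init__(self):
        self.needs_recovery = threading.Event()
        self.history = []

    def on_event(self, reason):
        # Keep the handler short: record the event and flag the main thread.
        self.history.append(reason)
        if reason == MQRC_RECONNECTED:
            self.needs_recovery.set()


def main_loop(monitor, work_items, recover):
    """Process items, running recover() first whenever a reconnect was flagged."""
    done = []
    for item in work_items:
        if monitor.needs_recovery.is_set():
            recover()
            monitor.needs_recovery.clear()
        done.append(item)
    return done


monitor = ReconnectMonitor()
recoveries = []


def items():
    yield "msg-1"
    # Simulate a reconnection happening between two pieces of work.
    monitor.on_event(MQRC_RECONNECTING)
    monitor.on_event(MQRC_RECONNECTED)
    yield "msg-2"
    yield "msg-3"


done = main_loop(monitor, items(), lambda: recoveries.append("reposition browse cursor"))
print(done, recoveries, monitor.history)
```

Setting a flag keeps the handler itself trivial and leaves the recovery routine to run on the main thread, at a point where the application state is known.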
- New, or previously undiagnosed, errors are returned from MQI calls until a consistent new context state is restored by the application program.
An example of receiving a new error is the return code MQRC_CONTEXT_NOT_AVAILABLE when trying to pass context after saving context before the reconnection. The context cannot be restored after reconnection because the security context is not passed to an unauthorized client program. To do so would let a malicious application program obtain the security context.
Typically, applications handle common and predictable errors in a carefully designed way, and relegate uncommon errors to a generic error handler. The error handler might disconnect from IBM MQ and reconnect again, or even stop the program altogether. To improve continuity, you might need to deal with some errors in a different way.
- Non-persistent messages might be lost.
- Transactions are rolled back.
- MQGET or MQPUT calls used outside a sync point might be interrupted with the possible loss of a message.
- Timing-induced errors occur because of a prolonged wait in an MQI call.
- Non-persistent messages are discarded, unless put to a queue with the NPMCLASS(HIGH) option, and the queue manager failure did not interrupt the option of storing non-persistent messages on shutdown.
- A non-durable subscription is lost when a connection is broken. On reconnection, it is re-established. Consider using a durable subscription.
- The get-wait interval is recomputed; if its limit is exceeded it returns MQRC_NO_MSG_AVAILABLE. Similarly, subscription expiry is recomputed to give the same overall expiry time.
- The position of the browse cursor in a queue is lost; it is typically reestablished before the
- MQGET calls that specify MQGMO_BROWSE_MSG_UNDER_CURSOR or MQGMO_MSG_UNDER_CURSOR fail with reason code
- Messages locked for browsing are unlocked.
- Browse marked messages with handle scope are unmarked and can be browsed again.
- Cooperatively browse marked messages are unmarked in most cases.
- Security context is lost. Attempts to use saved message context, such as putting a message with MQPMO_PASS_ALL_CONTEXT, fail with MQRC_CONTEXT_NOT_AVAILABLE.
- Message tokens are lost. MQGET using a message token returns the reason code MQRC_NO_MSG_AVAILABLE. Note: MsgId and CorrelId, as they are part of the message, are preserved with the message during failover, and so MQGET calls that use them work as expected.
- Messages put on a queue under sync point in an uncommitted transaction are no longer available.
- Processing messages in a logical order, or in a message group, results in a return code of
- An MQI call might return MQRC_RECONNECT_FAILED rather than the more general MQRC_CONNECTION_BROKEN that clients typically receive today.
- Reconnection during an MQPUT call outside sync point returns MQRC_CALL_INTERRUPTED if the IBM MQ MQI client does not know whether the message was delivered to the queue manager successfully. Reconnection during MQCMIT behaves similarly.
MQRC_CALL_INTERRUPTED is returned, after a successful reconnect, if the IBM MQ MQI client has received no response from the queue manager to indicate the success or failure of:
- the delivery of a persistent message using an MQPUT call outside sync point.
- the delivery of a persistent message or a message with default persistence using an MQPUT1 call outside sync point.
- the commit of a transaction using an MQCMIT call. The response is only ever returned after a successful reconnect.
- Channels are restarted as new instances (they might also be different channels), and so no channel exit state is retained.
- Temporary dynamic queues are restored as part of the process of recovering reconnectable clients that had temporary dynamic queues open. No messages on a temporary dynamic queue are restored, but applications that had the queue open, or had remembered the name of the queue, are able to continue.
If the queue is being used by an application other than the one that created it, the queue might not be restored quickly enough to be present when it is next referenced. For example, if a client creates a temporary dynamic queue as a reply-to queue, and a reply message is to be placed on the queue by a channel, the queue might not be recovered in time. In this case, the channel would typically place the reply message on the dead letter queue.
If a reconnectable client application opens a temporary dynamic queue by name (because another application has already created it), then when reconnection occurs, the IBM MQ MQI client is unable to re-create the temporary dynamic queue because it does not have the model to create it from. In the MQI, only one application can open the temporary dynamic queue by model. Other applications that wish to use the temporary dynamic queue must use MQPUT1, or server bindings, or be able to try the reconnection again if it fails.
Only non-persistent messages can be put to a temporary dynamic queue, and these messages are lost during failover; the same loss applies to messages being put to a temporary dynamic queue using MQPUT1 during reconnection. If failover occurs during the MQPUT1, the message might not be put, although the MQPUT1 succeeds. One workaround to this problem is to use permanent dynamic queues. Any server bindings application can open the temporary dynamic queue by name because it is not reconnectable.
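As a summary of the reason codes discussed in this section, the following sketch shows one way an application might route them to different recovery paths instead of a single generic handler. It is an illustrative Python model only: the `Action` names and the `recovery_action` function are hypothetical, reason codes are plain strings rather than MQI constants, and the mapping reflects only the cases described above.

```python
from enum import Enum


class Action(Enum):
    RESTART_UNIT_OF_WORK = "back out and resubmit the unit of work"
    OUTCOME_UNKNOWN = "the put or commit may or may not have completed"
    NEW_CONNECTION = "discard handles, connect again, reopen objects"
    CONTEXT_LOST = "saved message context can no longer be passed"
    GET_STATE_RESET = "browse cursor, message token, or wait interval was reset"
    GENERIC_HANDLER = "hand over to the generic error handler"


# Codes whose meaning does not depend on whether a reconnection took place.
ALWAYS = {
    "MQRC_BACKED_OUT": Action.RESTART_UNIT_OF_WORK,
    "MQRC_CALL_INTERRUPTED": Action.OUTCOME_UNKNOWN,
    "MQRC_CONNECTION_BROKEN": Action.NEW_CONNECTION,
    "MQRC_Q_MGR_NOT_AVAILABLE": Action.NEW_CONNECTION,
    "MQRC_RECONNECT_FAILED": Action.NEW_CONNECTION,
}

# Codes that point at lost client context only when a reconnection just happened.
AFTER_RECONNECT = {
    "MQRC_CONTEXT_NOT_AVAILABLE": Action.CONTEXT_LOST,
    "MQRC_NO_MSG_AVAILABLE": Action.GET_STATE_RESET,
}


def recovery_action(reason, reconnected=False):
    """Pick a recovery path for a reason code; unknown codes stay generic."""
    if reason in ALWAYS:
        return ALWAYS[reason]
    if reconnected and reason in AFTER_RECONNECT:
        return AFTER_RECONNECT[reason]
    return Action.GENERIC_HANDLER


print(recovery_action("MQRC_BACKED_OUT").value)
print(recovery_action("MQRC_NO_MSG_AVAILABLE", reconnected=True).value)
```

The `reconnected` argument could be supplied by an event handler that records MQRC_RECONNECTED, so that an ordinary empty-queue MQRC_NO_MSG_AVAILABLE is not mistaken for a lost browse cursor.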