Q Replication recovery

Advice for Q Replication system management

This article discusses best practices in the area of Q Replication recovery management. By examining several recovery scenarios and methods, you'll learn about key recovery techniques and principles that can be employed in similar cases or in more general scenarios. Additionally, this article provides recommendations to help the user with planning, monitoring, and the collection of service diagnostics.

Beth Hamel (hameleb@us.ibm.com), Product Manager for Dynamic Warehouse on z/OS, IBM

Author Photo: Beth HamelBeth Hamel is an IBM senior technical staff member and is currently the worldwide product manager for Dynamic Warehouse on z/OS. Beth has held jobs in IMS, DB2, database replication, and performance. She spent 18 years working in the area of database replication, and has consulted with hundreds of customers worldwide.



27 September 2007

Introduction

Companies have a wide variety of needs for database replication. Some traditional uses and future trends include business continuity or disaster recovery, the incremental loading of data warehouses, workload balancing, distribution or consolidation of data, and application integration. Q Replication is becoming a popular database replication product that can handle all of these customer needs.

Topics covered in this article include:

  • Q Replication basics and operations
  • Q Replication planning and monitoring advice
  • The selective filtering of data or transactions
  • Restarting Q Capture from a given point in the log
  • Recovery from queue full conditions
  • Recovery from a catastrophic loss of queued data
  • Restarting a failed subscription
  • Recommendations for first point of failure diagnostic capture

Reference sites are cited throughout this article where you can find more detailed information on particular topics of interest. For a list of all reference information, refer to the Resources section. For a complete list of Q Replication reference materials, refer to the WebSphere Replication Server (Q replication) page.


An introduction to Q Replication and related terminology

Q Replication uses message queues to transport transactional data changes between source and target databases, as shown in Figure 1. The queuing system employed is IBM's® flagship messaging product, WebSphere® MQ. WebSphere MQ is packaged with WebSphere Replication Server V9 on z/OS® and all distributed platforms.

Figure 1. Q Replication
Q Replication

Q Replication has two primary components, the Q Capture program and the Q Apply program. Q Capture reads data from the database recovery log, and creates messages consisting of selected data changes from committed transactions. Those messages are then written to one or more WebSphere MQ (hereafter referred to as MQ) queues. This means that this process is asynchronous from the source database, and that messages are sent to MQ only after the database changes have been committed to the source database. Each logical message represents a complete and committed database transaction. Q Apply receives the transaction messages from the message queue and applies each message as a transactional unit to the target system.

MQ messaging facilities handle the transmission of messages across heterogeneous systems and network protocols. Under the covers, MQ uses log files and data files, similar to the way a database management system might, to deal with managing integrity and persistence of the messages.

Q Replication allows multi directional replication, that is, multiple copies of data with changes occurring at more than one site and database changes flowing in more than one direction. This also means that conflicts can occur, such as when more than one site updates the same row at the same time. Q Replication provides multiple options for the detection and resolution of those conflicts.

Q Subscriptions define the relationships between the source and target tables, and can be defined with a mapping type of unidirectional, bidirectional, or peer-to-peer. Data changes replicated in either direction between two servers can be defined as bidirectional or peer-to-peer. Bidirectional mappings provide a simple, low-cost method of conflict detection and resolution, by evaluating and comparing data values at the source and target databases.

Peer-to-peer mappings provide more robust conflict detection and resolution, based on version information added to the source application data. The version information is implemented under the covers of Q Replication using triggers in the underlying database engine. This method enables the convergence of two or more databases that can be updated in parallel.

For a more complete introduction to Q Replication, see chapter 1 of "WebSphere Information Integrator Q Replication: Fast Track Implementation Scenarios."


Q Replication operations

The Q Capture program operates independently and asynchronously from the Q Apply program. Any informational exchange between these two programs is performed using MQ messages. The MQ queues to be used in replication are created during product configuration. In addition, a set of control tables is created for each replication program.

The MQ queues

The Q Capture program works with three types of MQ queues:

  • Restart queue
  • Admin queue
  • Send queue

The restart queue is a local queue that contains one message after the Q Capture program has been started successfully at least once. This message contains the information needed to allow the Q Capture program to know at which point to restart within the DB2 logs, and information about committed transactions that have already been successfully processed. The admin queue is a local queue used by the Q Capture program to receive any necessary informational messages from the Q Apply program, such as LOADDONE or CAPSTOP messages. One or more send queues are used to store the transactional data messages that are sent to the Q Apply program for processing. Assuming the typical configuration, where the Q Capture program and Q Apply program are remote from one another, each send queue is the remote definition of a Q Apply receive queue.

The Q Apply program also works with three types of MQ queues:

  • Admin queue
  • Receive queue
  • Spill queue

Assuming the typical configuration, where the Q Capture program and Q Apply program are remote from one another, the admin queue is the remote definition of the Q Capture admin queue. The admin queue is typically very small because very few messages are exchanged from the Q Apply program to the Q Capture program. One or more local receive queues are processed by the Q Apply to receive and apply the transactional data messages sent by the Q Capture program. Spill queues are created dynamically as local queues when necessary, to store changes that are being received at the same time that data is being loaded into target tables. This might occur when target data is initially populated, or during some kind of resynchronization activity (this could be from automatic or manual load operations).

Replication operational control tables

The operations of the Q Capture program are influenced by the parameter settings in the table capture_schema.IBMQREP_CAPPARMS, some of which can be overridden at startup and some of which can be changed during execution. There are also settings at the individual subscription level (capture_schema.IBMQREP_SUBS) and at the individual send queue level (capture_schema.IBMQREP_SENDQUEUES) that can influence the behavior of the program.

Similarly, the operations of the Q Apply program are influenced by the parameter settings in the table apply_schema.IBMQREP_APPLYPARMS, some of which can be overridden at startup and some of which can be changed during execution. There are also settings at the individual subscription level (apply_schema.IBMQREP_TARGETS) and at the individual receive queue level (apply_schema.IBMQREP_RECVQUEUES) that can influence the behavior of the program. The Q Apply records the MQ messages that have been successfully applied in the apply_schema.IBMQREP_DONEMSG table, making this table important to the restart of the Q Apply program and to the clean up of messages from the MQ receive queues.

For a more complete introduction to the control tables and queues used in Q Replication, refer to chapter 1 of the "WebSphere Information Integrator Q Replication: Fast Track Implementation Scenarios."


Q Replication planning

Product versions and compatibility

At the time of publication, there are two current versions of Q Replication: Version 8 and Version 9. On the distributed platform (Linux®, UNIX®, and Windows®), the Q Replication code is installed as part of a DB2 installation, and the typical installation is configured such that the version of Q Replication is the same as the version of DB2 with which it is being used. On the z/OS platform, the Q Replication is a completely separate installation from the DB2 product. Therefore, the two versions may be the same or may be different. DB2 for z/OS Version 7 and DB2 for z/OS Version 8 are both fully supported by WebSphere Replication Server for z/OS Version 8 and Version 9. For all platforms, the Version 8 Q Capture program can coexist with the Version 9 Q Apply program, and the Version 9 Q Capture program can coexist with the Version 8 Q Apply program. This allows source and target servers to be migrated independently.

WebSphere MQ Version 6 is bundled with WebSphere Replication Server Version 9 on all platforms. In WebSphere Replication Server Version 8, WebSphere MQ 5.3 was bundled only on the distributed platform. WebSphere MQ Version 5.3 and WebSphere MQ Version 6 are both compatible with WebSphere Replication Server Version 8 and Version 9 on all platforms. In general, it is recommended that the products be upgraded to the latest maintenance levels available.

MQ recommendations

The typical recommendation from MQ is that a dead letter queue (DLQ) be created for every MQ Queue Manager. However, a different recommendation is made on this subject.

When an MQ Queue Manager is being used strictly for replication purposes, a DLQ is not necessary and it is recommended that there be no DLQ defined.

The MQ DLQ is a repository of messages that could not be successfully delivered to their appropriate destination queue. Reasons for unsuccessful delivery could include an incorrectly coded message destination or a full destination queue. In many MQ applications, it is desirable to continue processing whatever messages arrive at a destination, regardless of their origin or ordering.

Within the replication Q Apply application, messages must be processed in a strict order, as created at the source server by the Q Capture program. A message sequence number is generated to manage this strict processing order. If messages were to be delivered to a DLQ, they definitely need to be moved back to their destination queue or the Q Apply is not able to continue processing once it has reached this part of the sequence of messages. On the other hand, if messages cannot be delivered from a Q Capture program, the Q Capture program can just stop, providing a very clear indication of a problem, and then later be restarted when this problem has been resolved. Below, view the procedure of allowing the Q Capture program to be stopped at delivery failure and then later be restarted, to be a relatively simple recovery procedure. An explanation of the more elaborate alternate steps that might need to be taken in a queue full condition is provided in the Recovery Scenario: Queue full condition section.

On the z/OS platform, it is strongly recommended that an index on MSGID be created for each receive queue. If you are using shared queues on a z/OS platform, an index on CORRELID is recommended on the restart queue.

There are excellent examples of MQ definitions for Q Replication on z/OS documented in the IBM Redbook "WebSphere Replication Server for z/OS Using Q Replication: High Availability Scenarios for the z/OS Platform."

See section 4.5.7 for an example of MQ definitions using shared queues and section 5.5.7 for an example of MQ definitions using a shared disk implementation.

Replication table recommendations

On the z/OS platform, it is recommended that following control tables to be in their own segmented tablespace with large primary and secondary allocation:

IBMQREP_DONEMSG LOCKSIZE ROW
IBMQREP_SPILLEDROW LOCKSIZE ROW
IBMQREP_APPLYMON LOCKSIZE PAGE 
IBMQREP_APPLYTRACE LOCKSIZE PAGE 

IBMQREP_CAPTRACE LOCKSIZE PAGE 
IBMQREP_CAPMON LOCKSIZE PAGE 
IBMQREP_CAPQMON LOCKSIZE PAGE 
IBMQREP_SIGNAL LOCKSIZE ROW 

IBMQREP_MONTRAIL LOCKSIZE PAGE 
IBMQREP_ALERTS LOCKSIZE PAGE 
IBMQREP_MONTRACE LOCKSIZE PAGE

When using DB2 z/OS V8 in new function mode (NFM), it is recommended that the new VOLATILE option be utilized. For example:

ALTER TABLE apply_schema.IBMQREP_DONEMSG VOLATILE CARDINALITY

Note that on certain procedures, triggers, and functions (PTFs) or fix pak levels of the Version 8.2 or Version 9 product that you are running, there are optional new tables capture_schema.IBMQREP_IGNTRAN and capture_schema.IBMQREP_IGNTRANTRC. For more information on these tables, refer to the "IBM DB2 Database for Linux, UNIX, and Windows Information Center."

The usage of these tables is explained in the Recovery scenario: Filtering data and selectively skipping data section of this article.

Replication parameter recommendations

As described in the previous section, there are a number of parameters stored in the replication control tables that influence the behavior of the Q Capture and Q Apply programs. This section discusses several of these parameters with recommendations on their settings.

Note: A number of the time unit values have changed in Version 9 to reflect milliseconds instead of seconds.

  • STARTMODE in capture_schema.IBMQREP_CAPPARMS
    • Indicates how the Q Capture program should be started. WARMSI or WARMNS are the recommended settings.
  • MEMORY_LIMIT in capture_schema.IBMQREP_CAPPARMS
    • Indicates how much memory can be used by the Q Capture program when storing changed data in memory. Generally the default of 32MB is an acceptable starting point. Further tuning can be performed after observing the actual memory used by a workload as reported in the CAPMON table column CURRENT_MEMORY. Also reported in the CAPMON table is TRANS_SPILLED. When this value is greater than zero, it indicates that transactions are larger than can be handled in memory, which may result in decreased performance.
    • On z/OS, it is recommended that MEMORY_LIMIT be 80 percent of the Q Capture REGION size. In z/OS, it is possible to set a MEMORY_LIMIT of 0 (the table constraint must be dropped to allow this setting in V8), in which case the Q Capture program uses a MEMORY_LIMIT of 80 percent of the available memory as dictated by the REGION parameter in the JCL. If the user specifies a non-zero value, the Q Capture program uses the lesser of the specified value and 80 percent of the available memory.
  • NUM_APPLY_AGENTS in apply_schema.IBMQREP_RECVQUEUES
    • Indicates the number of agents to be used by the Q Apply program in processing the receive queue. Throughput for most applications is best at 8-16 agents per queue. However, it is not always the case that more agents provides higher throughput and it is recommended that this value be tuned for any given application. Make sure that DB2 and MQ settings allow for the number of parallel application connections that are created based on this setting.
  • MEMORY_LIMIT in apply_schema.IBMQREP_RECVQUEUES
    • Indicates the amount of memory to be used by the Q Apply program browser to store transactions in preparation for their eventual processing by the apply agents. Note that more memory does not always improve throughput and performance. A "memory full" indication in the APPLYMON table is not necessarily an indication that more memory is needed or desirable — it may just indicate that the browser is reading transactions into memory faster than they are being applied by the agents. Like NUM_APPLY_AGENTS, this setting should be tuned for individual workloads.
  • ERROR_ACTION in apply_schema.IBMQREP_TARGETS
    • Indicates the action to be taken by the Q Apply program when encountering any unexpected error or sqlcode. When there is a single receive queue defined for a Q Apply program, then the recommendation for this value is "S," which tells the Q Apply program to stop on such an error and allows for corrective action to be taken. When there is more than one receive queue, then the choice might be "Q," which tells the Q Apply program to stop processing the receive queue associated with the received error. If the value "Q" is chosen, it is strongly recommended that monitoring be in place so that any error that has stopped queue processing will be noticed such that corrective action can be taken in a timely manner. Without proper monitoring in place, it may be less obvious when a queue is not being processed than when the Q Apply program has stopped completely.
  • CONFLICT_RULE in apply_schema.IBMQREP_TARGETS
    • Indicates what kind of conflict detection is in place. In some cases, this is set based on the type of subscription, for example peer to peer. In other cases, this is chosen during the subscription definition. For unidirectional subscriptions, the only choice available is "K," which means key conflicts is examined. For bidirectional subscriptions, the choices are "K" — check for key conflicts only, "C" — check for changed column conflicts and key conflicts, and "A" — check all columns for conflicts.
  • CONFLICT_ACTION in apply_schema.IBMQREP_TARGETS
    • Indicates the action to be taken when a data conflict is present. Again, in peer to peer this setting is fixed, and the latest timestamp rule is applied in a conflict. In unidirectional replication, there is a trade-off here between sending more data on updates, to allow source actions to be forced at the target, or sending less data on updates and choosing that the conflicts be ignored or that the Q Apply program should stop on such a conflict. In bidirectional replication, it is typical that the two subscriptions be set up so that conflicts are forced on one server (the losing server), and ignored on the other server (the winning server). Any other choice can result in data inconsistencies.
  • OKSQLSTATES in apply_schema.IBMQREP_TARGETS
    • Indicates that certain sqlstates returned when trying to process a subscription allow the Q Apply program to continue, rather than taking the otherwise declared error_action. When such a sqlstate occurs for the subscription, the Q Apply program writes the information about the failed row to the EXCEPTIONS table but otherwise continues applying the transaction as though no failure occurred. Use of this option can result in data inconsistencies since data applied at the source is not applied at the target.

Monitoring

Before putting replication into a production environment, it is critical that a good monitoring strategy be in place. Particularly in environments that rely on very low latency replication for data delivery or workload distribution, proper monitoring techniques are a cornerstone to the overall reliability strategy. WebSphere Replication Server offers many features that can be employed to monitor the replication environment.

The Q Capture and Q Apply programs track operational statistics in memory and, on a regular user-defined monitor interval, this operational data is written to the monitoring control tables capture_schema.IBMQREP_CAPMON, capture_schema.IBMQREP_CAPQMON, and apply_schema.IBMQREP_APPLYMON. The default monitor interval for both programs is five minutes. It is recommended that you set these values depending on the granularity desired in your monitoring strategy. For example, if the desired latency of the replication scenario is a small number of seconds, then it may be advantageous to set the monitor interval to a small number of seconds also, to provide a clear historical or current view of the replication performance. These monitoring control tables provide the basic information used for monitoring by the Alert Monitor, the replication dashboard, the Replication Center performance reports, and any additional queries or tools that you might create.

The replication Alert Monitor is an optional, but highly recommended, program delivered with WebSphere Replication Server. The Alert Monitor is a background task that can be configured in many different ways using the Replication Center or ASNCLP scripts. It can be run on any distributed DB2 or DB2 for z/OS server, and can monitor one or many replication Q Capture or Q Apply programs. You define how often each replication program should be monitored, and at this interval the Alert Monitor connects to the server associated with that program and performs one or more operations to check the health of that replication program. The checks to be performed are user selected and configured to best match the desired goals of the replication scenario. In general, the Alert Monitor is looking for warning messages, error messages, latency that has grown beyond a user-defined threshold value, or queues that have grown beyond a user-defined threshold limit. Alerts can be written to the z/OS system console when running the Alert Monitor on a z/OS server, or sent to a pre-defined e-mail address on a distributed server.

For more information on the replication Alert Monitor, refer to the "IBM DB2 Database for Linux, UNIX, and Windows Information Center."

The replication dashboard is a tool that is documented and available as a Web download from the Download Q Replication Dashboard.

This Java™ tool can be run on any DB2 Windows client, to provide a graphical display of the health, throughput, and latency characteristics of any Q Replication program. In addition to an instant visual health check, this tool provides some very useful drill-down capabilities, such as queries against the message, trace, and exceptions tables maintained by the Q Capture and Q Apply programs.

In the Replication Center, there are also various reports that can be requested as part of the operations support. Reports can be tailored to include various details about errors or performance characteristics, for individual replication programs, and for selected time periods. The SQL that is created to support a report can be viewed and captured, and report data can be exported to be used in spreadsheets or other formats.

Some customers set up their own monitoring queries directly against the replication control tables, to suit their own completely customized monitoring needs. It is strongly advised that all user queries performed against replication control tables be performed with UR, so that no interference occurs between replication programs and user-defined monitoring programs.

There are certainly other monitoring tools that may be used to some advantage in a replication monitoring strategy, including monitor tools from MQ, DB2, Tivoli, and many others.


Recovery scenario: Filtering data and selectively skipping data

Predictive filtering

There are many ways to filter data during replication. When defining a subscription the user can:

  • Select a subset of the source table columns.
  • Select a subset of the source table rows (through a where clause definition)
  • Choose to not replicate deletes.

It is also possible to selectively choose not to replicate data transactions based on certain properties associated with those transactions using the new ignore transaction capability.

This feature was introduced with APAR PK27949 in the z/OS Version 8.2 product. Details of this new feature can be found at PK27949: QCAPTURE AND SQL CAPTURE WILL BE ENHANCED TO IGNORE CERTAIN TRANSACTIONS. NOTE DDL CHG, SEE II11809.

Transactions that can be identified through certain transaction properties can be selectively filtered by inserting these property values in the Q Capture control table cap_schema.IBMQREP_IGNTRAN. On all platforms, you can use the AUTHID value, which is the value identifying the user associated with a transaction. On the z/OS platform, you can also use the plan name or the authorization token.

These methods of filtering are all based on a predictable pattern of data transactions. For example, in building a data warehouse, you might only need certain columns from a source table and never want to replicate deletes, and so you build your subscriptions with these options. Or perhaps your installation runs very large purge jobs once a month, and you would prefer to run these jobs against each database rather than replicate the data. In this case, you could always use a specific and unique user ID for those large purge jobs, and enter that user ID in the IGNTRAN table.

Transactions that have been filtered based on values supplied in IGNTRAN, are tracked in the new table capture_schema.IBMQREP_IGNTRANTRC.

Reactive filtering

There are also occasions when you need to ignore or skip transactional data on a reactive basis. For example, a long-running batch job was unexpected and it is not desirable to capture the data from this transaction, or a row might contain data values that are valid at a source database but are not valid at a target database Both of these cases are examined below.

CAUTION: To ignore a transaction that will ultimately be aborted causes no data integrity issues. To skip or ignore any transaction or row that was committed at the source server will typically cause divergence between source and target servers and other actions will be necessary to synchronize the database servers. Caution should be used when using any of these techniques.

Stop the capture of a problematic batch job.

Consider the long running batch job with no interim commits that eventually is aborted. Such a batch job produces potentially millions of DB2 log records, all of which might need to be examined and stored internally by the Q Capture program before the abort transaction record can be seen and processed, allowing the records for this entire transaction to be removed from memory or disk. If the records are being spilled to disk, this may be causing a slow down on the Q Capture program. If it becomes clear that such a scenario is in progress, the following actions can provide relief.

CAUTION: It is possible to skip a single transaction only using the transaction identifier (TRANSID) startup parameter. Multiple interleaving transactions cannot be skipped in this manner.

  1. Stop the Q Capture program.
  2. Fully identify the problematic transaction.

    Determine the TRANSID for the problematic transaction. The Q Capture program may in fact have issued a warning with this identifier noted. If not, log utilities such as DSN1LOGP on z/OS or the DB2 Recovery Expert for Multiplatforms tool can be used to gather this information. Using either of these tools, what is referred to here as the TRANSID is noted as the unit of recovery ID (URID), or might in some cases be noted as the unit of work ID (UOWID), typically as 12 hexadecimal characters representing a 6 byte field. In z/OS, you also need to note the data sharing member ID in the last 2 bytes of the log record header in the LOGP output. The member ID is 0000 if data sharing is not enabled.

  3. Restart the Q Capture program, providing the TRANSID as a transaction to be ignored, using the new TRANSID Q Capture startup parameter.

TRANSID parameter entry

Although the transID is only 6 bytes on the z/OS platform as well as on the Linux, UNIX, and Windows platforms, the Q Capture program stores it in memory as 10 bytes. The following are the formats in hex required for entering this TRANSID parameter:

  • z/OS: 0000:xxxx: xxxx:xxxx:mmmm

    where xxxx:xxxx:xxxx is the transID and mmmm is the data-sharing member ID (0000 if data-sharing is off)

  • Linux, UNIX, Windows: nnnn:0000:xxxx:xxxx:xxxx

    where nnnn is the node ID in a DPF environment (0000 if not DPF partitioned)

    xxxx:xxxx:xxxx is the transID

The following are example invocations of the Q Capture program using the TRANSID parameter:

asnqcap capture_server=<capture_server>  capture_schema=<capture_schema>  
transid=0000:0000:C2C5:01B5:773A
 //CAPQALLT  EXEC PGM=ASNQCAP,REGION=0M,TIME=NOLIMIT,                   
 // PARM='/CAPTURE_SERVER=SSTR CAPTURE_SCHEMA=QALLTYPE 
 //              TRANSID=0000:BD71:1E23:B089:0001'

Skip a problematic row

There are certain occasions when it can be very useful to allow the Q Apply program to continue, despite the fact that it is receiving any unexpected error condition when trying to apply data, indicating that a certain row cannot be applied. For example, in some cases a data value that is acceptable at the source database, such as hex zeros, is not acceptable at the target database, creating SQLSTATE 12345. Perhaps this data value shows up infrequently, but it is highly desirable that this occurrence NOT stop the Q Apply program. The subscription for this data mapping can be adjusted to allow for this occasional problem.

  1. Update OKSQLSTATES for the subscription(s).

    Update the subscription to allow SQLSTATE 12345 to be accepted by the Q Apply program.

  2. Restart the queue or program as necessary.

    If the Q Apply program has stopped because of this condition, then restart the program. If the Q Apply program has stopped processing the receive queue for this subscription, then issue a STARTQ command.

  3. Manually correct data as necessary.

    Details of rows that were not applied because of the SQLSTATE 12345 are written to the EXCEPTIONS table. These must be processed and manually applied to correct data inconsistencies between the source and target databases. Note that when one row is skipped because of OKSQLSTATES, this can cause ripple effects in processing. For example, if a parent row is not applied, then foreign key rules can keep child rows from being applied.

    You can download a tool to easily format data from the EXCEPTIONS table from the Download the Q replication exceptions table formatter page.

    The previously mentioned Q Replication Dashboard tool also provides this function and can be downloaded from the Download Q Replication Dashboard page.


Recovery scenario: Queue full condition

When an MQ Queue Manager is being used strictly for replication purposes, a DLQ is not necessary and it is recommended that there be no DLQ defined. This recommendation is described in the Q Replication planning section. However, it may not always be possible to follow this recommendation. In some cases, there may be multiple applications sharing an MQ Queue Manager and a DLQ may be desirable for one or more of the other applications using this server. In some environments, it may be a standard practice to create a DLQ for all MQ Queue Managers. It is also strongly recommended that the replication receive queue depth is monitored so that corrective action can be taken when a queue has reached a threshold depth. The previously mentioned Replication Alert Monitor can be used to monitor queue depth.

This recovery scenario assumes that neither of these recommendations have been followed. The steps needed to resolve a failure scenario involving a full receive queue is described below.

The failure scenario

A target tablespace becomes full, making it impossible to apply changes to a replication target table. The Q Apply error_action for the problem subscription is to "S" (stop the Q Apply process) and so the Q Apply process stops in response to this condition. The Q Capture process continues to capture data and put messages on the send queue at the source. Eventually the receive queue becomes full, and at this point messages are redirected to the DLQ at the target MQ Queue Manager. The DLQ becomes full also, at which point the messages from the source server are no longer deliverable to the target server. Finally, the xmit queue at the source server becomes full, and because the error_action on the send queue is set to "S," the Q Capture stops when hitting this queue full condition. To recap, at this point the receive queue is full, the DLQ is full, the xmit queue is full, and both Q Capture and Q Apply processes are stopped.

Steps to recovery

  1. Repair the condition that stopped the Q Apply in the first place.

    In this scenario, there was a target table that became unavailable due to a full tablespace, so this space condition would be relieved in order to make it possible for the Q Apply process to be restarted.

  2. Stop the channel for the xmit queue at the source server.

    Before any attempts are made to restart the Q Apply or recover messages from the DLQ, it might be best to first make sure that no additional messages will be delivered to the target server. The arrival of new messages while recovering the queues at the target server may aggravate the recovery process, so it is recommended that you stop new messages from being sent to the receive queue from the source server by stopping the channel or the queue manager associated with the xmit queue.

  3. Restart the Q Apply process.

    The receive queue is full. Therefore, the DLQ messages cannot be recovered to the receive queue in its current condition. Before the DLQ messages can be recovered, either the max queue depth must be increased for the receive queue or some of the messages must be processed and deleted by the Q Apply. The simplest strategy, if possible, is to allow Q Apply to process the messages.

    It is certainly possible that prior to consuming the entire full receive queue, that the Q Apply hits a point where it is looking for a message that is in the DLQ and outputs a message indicating that it has encountered a message gap and cannot continue until this message has been recovered. In this case, the DLQ messages need to be recovered prior to the completion of this step.

  4. Recover the DLQ messages.

    If the Q Apply was able to process the entire receive queue, then there certainly should be room in the receive queue for the recovered DLQ messages. If Step 3 was interrupted because there is a gap in the sequencing of the messages, then assess whether there is sufficient space for the messages in the receive queue to handle the recovered messages from the DLQ. If necessary, increase the max queue depth for the receive queue in order to allow the full recovery of the messages in the DLQ.

    Running the DLQ handler, with a simple rule to retry the delivery, should suffice:

    DLQ Handler Command
    DLQ handler command on the distributed platform:
    runmqdlq  QueueName QueueManagerName
    
    rules table entry: 
    REASON(MQRC_Q_FULL) ACTION(RETRY) RETRY(5)
  5. Make sure all queue messages have been recovered and applied at the target.

    Verify that the DLQ handler successfully retried the delivery of the messages to the receive queue. Verify that the Q Apply has successfully applied the messages and no longer has any indications of a gap in the message sequence.

  6. Restart the xmit queue at the source.

    At this point, there is space available at the target and the sending of messages can be resumed.

  7. Restart the Q Capture process.

Replication recovery is now complete.


Recovery scenario: Catastrophic loss of queued data

WebSphere MQ is a highly reliable mechanism for data delivery and Q Replication uses only persistent messages to deliver data. Persistent messages are logged and recoverable, in the same way that DB2 data is logged and recoverable. The catastrophic loss of a queue or of queued data is therefore a completely unexpected event in Q Replication. This scenario is included mainly to illustrate a few basic recovery principles and to highlight procedures that might be useful in a variety of conditions. One potential application of this procedure might be the resumption of replication between an existing Q Apply and a new Q Capture that is starting up on a new source site after an HADR or hardware failover of DB2 to that site.

The failure scenario

This is a unidirectional scenario from one source database to one target database, using one receive queue at the target. In this scenario, it is assumed that an unrecoverable disk crash has rendered the receive queue unusable. The synchronization between the source and target databases is at this point unknown, and it is desirable to resynchronize at the lowest possible cost and with the least amount of elapsed time. The guiding principle for this recovery is that even if queued, captured data has been lost, the DB2 recovery logs still contain the original changed data and this data can be recaptured and reapplied.

Steps to recovery

Clearly one possible option is to perform a full refresh of the target database, loading it from the data in the source database. This can be a long and costly process. Another possible option is to run the table differencing and repair utilities for all tables — asntdiff and asntrep. This may also be somewhat laborious and time consuming. What follows is an alternative that may be quite simple, effective, and fast.

  1. Completely redefine or completely clear out the receive queue that was destroyed.

    If the receive queue needs to be recreated, recreate it using the MQSC define command. If the queue is still defined but in a questionable state, clear all messages. The following steps assume that this receive queue is now empty of messages but in an operable state.

  2. Determine a valid restart point for Q Capture.

    Using the APPLYMON table, find the most recent OLDEST_TRANS value. This value represents a source timestamp up to which all transactions were successfully applied. Other newer transactions may well have been applied in addition, but the guarantee here is that all transactions from the source database that have a commit timestamp earlier than this timestamp have definitely been applied. This is an excellent starting point.

    Using log utilities on the source side, find the transaction that provides the best starting point for Q Capture. (Log utilities to be used include DSN1LOGP on z/OS or the DB2 Recovery Expert for Multiplatforms tool or db2pdlog on Linux, UNIX, and Windows) The transaction associated with OLDEST_TRANS is a precise set of log records on the DB2 source recovery log and it is known that all transactions that committed prior to, and including this point, were applied, but there may be some transactions that started earlier than this known point in the log but will commit at a later point in the log. A full analysis against the log data may reveal exactly which transactions might have started earlier than this and the earliest start point of such transactions would provide the necessary restart LSN for Q Capture.

    A reasonable and sufficient alternative would be to review the CAPMON table entries. Find the CAPMON entry where the CURRENT_LOG_TIME is equal to or slightly older than the identified APPLYMON OLDEST_TRANS value. In this row, what is the value of RESTART_SEQ? This indicates the point from which the Q Capture needed to restart when it was currently processing the data that had been known to be committed by the Q Apply program. This is the earliest uncommitted LSN from which Q Capture needs to start in order to recapture any uncommitted data that might not have been captured and/or applied prior to the failure. You will use this value in a later step to restart the Q Capture program.

  3. Alter subscriptions to allow data conflicts to be forced.

    It is likely that some data will be recaptured and resent to the Q Apply program. Setting the subscriptions to allow CONFLICT_ACTION = "F" allows any data that gets recaptured and resent to be re-applied.

  4. Restart the Q Capture at the determined restart point.

    The Q Capture program normally restarts based on the values that are found in the restart queue. The restart queue tracks the restart point within the log, and the maximum committed transaction that has been captured and sent. These values can be overridden with user-supplied alternates, specifically for recovery purposes.

    In this case, supply the restart point that was determined in Step 2, and provide a maximum committed transaction value of all zeros to indicate that all data transactions from the restart point should be captured and transmitted to the Q Apply program.

    The following is an example invocation of the Q Capture program using LSN and MAXCMTSEQ, to override the restart queue:

    asnqcap capture_server=<capture_server>  capture_schema=<capture_schema>  
    LSN=0000:0000:0000:115b:7704 MAXCMTSEQ=0000:0000:0000:0000:0000

    These restart parameters are more completely described in Chapter 14 of the "Replication Guide and Reference," or refer to the "IBM DB2 Database for Linux, UNIX, and Windows Information Center."

    Note: On earlier release levels of replication (prior to V9 fix pak 1, on the distributed platform), the Q Capture program does not start using the "LSN= MAXCMTSEQ=" override parameters, if there is no restart message in the restart queue. This case requires that you start Q Capture COLD once, stop it, and then WARM start again with the chosen restart point.

  5. Reset the DONEMSG table.

    During restart, the Q Apply program first deletes any messages from the receive queue that are known to have been processed and then it starts reading the receive queue from the next message in a known sequence, as provided by the Q Capture program and as remembered through the DONEMSG table. Since you are altering the message flow, and sequencing from the Q Capture program, you need to realign the DONEMSG table to match this new sequence of messages that the Q Capture has started.

    Delete all the existing rows from the DONEMSG table for this receive queue. Use the asnqmfmt utility to retrieve the first message that is now sitting in the receive queue ready to be processed, and note the MQMSGID value from that message. Insert a single row in the DONEMSG table, with the receive queue name and the MQMSGID value of 1.

    The following is an example of an asnqmfmt invocation, to retrieve the first message and the message descriptor:

    asnqmfmt ASN.QM1.DATAQ QM1 -o output.file -mqmd -l 1

    The following is an example message output:

    **** Descriptor for message number: 1
        StrucId         MD   
        MsgId           52dd5045c255110000000000000000000000000000000456
        CorrelId        000000000000000000000000000000000000000000000000
        Version         1 
        Report          0 
        MsgType         8 
        Expiry          -1 
        Feedback        0 
        Encoding        546 
        CodedCharSetId  1252 
        Format          COMPACT  
        Priority        0

    The following is an example insert to the DONEMSG table with a message ID value of one less than the message ID of the first message on the queue.

    INSERT INTO apply_schema.IBMQREP_DONEMSG VALUES ('ASN.QM1.DATAQ',
    x'52dd5045c255110000000000000000000000000000000455')
  6. Restart the Q Apply program.

    Verify that the Q Apply program restarts successfully and begins to process the messages. Expect rows to be sent to the EXCEPTION table for the rows that are forced. (Inserts that receive a -803 are converted to updates, updates that receive +100 are converted to inserts, and deletes that receive +100 are left alone.)

    Replication recovery is now complete.

    NOTE: The target database is returned briefly to a back level state during the process of reapplying the data, from the point where the Q Apply program is restarted, until the data is past the point where the previous Q Apply stopped. It might be reasonable to quiesce user activity against the target database during the initial phase of the recovery to avoid anomalous query results.


Restart subscriptions

When an error occurs during initialization of a replication subscription, or the set of individual subscriptions that comprise a bidirectional or peer-to-peer mapping, the "user response" associated with the error message might suggest the complete redefinition of that subscription. There may be an easier way to reset the subscription(s) and restart.

For example, an MQ channel might not have been started, or the MQ configuration was defined incorrectly. So in this case, the subscription definition might be fine, but what is required is to correct the MQ configuration and restart. The following section is a tip on resetting the subscription for that restart.

Reset a subscription for restart

  1. Address the problematic issue.

    This might involve upgrading software, making small changes to source or target tables, or associated database elements.

  2. Reset the status markers within the control tables.

    At each server, execute these SQL statements:

    UPDATE capture_schema.IBMQREP_SUBS SET STATE='I', STATE_TRANSITION=NULL,
    GROUP_MEMBERS=NULL WHERE SUBNAME='<subname>';
    
    UPDATE apply_schema.IBMQREP_TARGETS SET STATE='I' WHERE SUBNAME='<subname>';
  3. Reactivate subscriptions.

    This could be using CAPSTART signal(s), or at one of the subscription servers the STATE can be set to "N" (new) with a subsequent Q Capture REINIT command to allow the Q Capture to recognize the subscription in the "N" state.


Collect diagnostic data at the first point of failure

When an unexpected error occurs in a test or production environment, it is strongly advised that a reasonable amount of diagnostic data be collected for purposes of identifying the cause of that failure. In some scenarios, replication is a critical element in keeping a business fully operational. Therefore, the primary objective in handling a failure or error might be to restore and recover the system promptly. However, if no diagnostics were gathered prior to that restoration, there may be no way to identify the root cause of the failure, with the resulting concern that there could be an undiagnosed problem that remains in the system. This section provides some guidance on what diagnostic data may be useful to collect.

In the case of a suspected replication issue, gather the following:

  • Q Capture log file output
  • Q Apply log file output
  • When appropriate to the failure, restart the Q Capture or Q Apply program with the "debug" parameter or use the ASNTRC facility to record and dump program trace output.
  • Keep DB2 recovery logs from around the time of the unexpected condition.
  • When appropriate to the failure, use asnqmfmt to print a reasonable set of messages from the Q Apply receive queue.
  • Output from the db2support command.
  • Output from the Q Replication Analyzer tool. Details of this tool are available from the DB2 Information Center.

In the case of a suspected MQ issue, gather the following:

  • The MQ error log
  • Print a reasonable set of messages from the queue, for which there might be an issue.

In the case of an unexpected DB2 return code or other suspected DB2 issue, gather the following:

  • For Linux, UNIX, and Windows, the DB2 diag log
  • Other DB2 service-related items, as recommended for the given return code or DB2 issue

Contributors

Major contributions to this document were provided by Jayanti Mahapatra, Kevin Lau, and Somil Kulkarni. Additional contributors include Carol Rigdon and Andrew Hickson.


Conclusion

This article discussed best practices for recovery management when using the Q Replication feature of Websphere Replication Server. When the recommended planning and monitoring techniques are in place, most failures can be avoided, and it is unlikely that recovery management is necessary, but it is always best to be prepared and to have plans in place for operational recovery.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=258285
ArticleTitle=Q Replication recovery
publish-date=09272007