Bidirectional replication versus peer-to-peer replication for high-availability scenarios

If you are configuring two servers for a high-availability scenario, you can use either bidirectional or peer-to-peer replication. Which type you choose depends on your business needs.

You should consider the following trade-offs:

How quickly the secondary server can take over in case the primary server fails, or how quickly you can switch back to the primary server
The need for manual or automated procedures to control the process of taking over
The overhead required by each method

Bidirectional replication for high-availability scenarios

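You might want to choose bidirectional replication for your high-availability scenario if you want a designated server to win when a conflict occurs, rather than the change with the most recent timestamp. In bidirectional replication, the conflict rule that you set for each server (force or ignore) determines the winner.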

Recommendation: If you choose bidirectional replication for high availability, set up your configuration in the following way:

Set up a primary server that is available for read and write applications.
Set up a secondary server that is available only for read applications.
Set the conflict rule to force for the primary server so that the primary server is the designated loser. Any conflicts coming from the secondary server are forced onto the primary server.
Set the conflict rule to ignore for the secondary server so that the secondary server is the designated winner. Any conflicts coming from the primary server are ignored by the secondary server.
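The effect of the two conflict rules can be illustrated with a minimal sketch. It is a simplified model and not the Q Apply implementation: the `Server` class, the `conflict_action` labels, and the `apply_change` helper are assumptions made for illustration, and a conflict is modeled simply as a mismatch between the source's before-value and the current target value.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    conflict_action: str  # "force" or "ignore" (labels assumed for this sketch)
    rows: dict


def apply_change(target: Server, key, before, after) -> str:
    """Apply one replicated update to the target and report the outcome."""
    if target.rows.get(key) == before:
        target.rows[key] = after      # no conflict: apply normally
        return "applied"
    if target.conflict_action == "force":
        target.rows[key] = after      # designated loser: incoming change wins
        return "forced"
    return "ignored"                  # designated winner: local change is kept


primary = Server("primary", "force", {1: "a"})
secondary = Server("secondary", "ignore", {1: "a"})

# Both servers update the same row before either change is replicated.
primary.rows[1] = "p"
secondary.rows[1] = "s"

result_on_secondary = apply_change(secondary, 1, before="a", after="p")
result_on_primary = apply_change(primary, 1, before="a", after="s")
```

In this model the change from the primary server is ignored at the secondary server, the change from the secondary server is forced onto the primary server, and both copies end up with the secondary server's value.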

With this configuration, applications update the primary server, which replicates those changes to the secondary server. The secondary server is not typically updated by applications. If the primary server becomes unavailable, redirect applications to update the secondary server. When the primary server is available again, the more recent changes from the secondary server are replicated back to the primary server and overwrite the primary server's older changes.

While the primary server is unavailable, some data might be stuck on the failed server because it had not yet been replicated before the server failed. This data will be applied to the secondary server later, when the primary server is available again. Other replicated changes might be in the receive queue at the secondary server but not yet applied to the copies of the tables at the secondary server. These changes might conflict with new changes that are made when applications are redirected to the secondary server. If there are collisions between the queued data and the more recent changes on the secondary server, then the new redirected application activity wins, because the secondary server is set to ignore conflicts from the primary server.

You can implement a procedural takeover that requires the receive queue on the secondary server to be emptied before traffic is redirected to the secondary server. Doing so eliminates the potential for data collisions during failover. When you switch back to the primary server, the old data that was not yet captured from the primary server is captured and applied to the secondary server. This old data loses any collisions because the most recent changes are on the secondary server, where applications were redirected. The data that was captured at the secondary server is then applied to the primary server, and again this newer data wins any collisions.
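A procedural takeover of this kind might be scripted along the following lines. This is only a sketch under stated assumptions: `get_queue_depth` and `redirect_applications` are hypothetical callables that you would supply (for example, a wrapper around your queue monitoring and your connection-routing mechanism); they are not part of Q Replication.

```python
import time
from typing import Callable


def wait_for_drain(get_queue_depth: Callable[[], int],
                   timeout_s: float = 60.0,
                   poll_s: float = 1.0) -> bool:
    """Poll until the receive queue is empty or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_queue_depth() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def take_over(get_queue_depth: Callable[[], int],
              redirect_applications: Callable[[], None],
              timeout_s: float = 60.0,
              poll_s: float = 1.0) -> bool:
    """Redirect applications only after the receive queue has drained."""
    if not wait_for_drain(get_queue_depth, timeout_s, poll_s):
        return False  # queue not empty: leave the decision to an operator
    redirect_applications()
    return True


# Usage with stand-in callables: the queue drains after two polls.
depths = iter([3, 1, 0])
events = []
ok = take_over(lambda: next(depths),
               lambda: events.append("redirected"),
               timeout_s=5, poll_s=0)
```

Returning `False` on timeout instead of redirecting anyway keeps the choice between waiting longer and accepting possible collisions in the hands of an operator.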

Before you redirect applications back to the primary server, you must take steps to avoid the case where new changes from applications at the primary server start losing to older transactions that occurred at the secondary server. Quiesce the database activity at the secondary server, and ensure that all data changes are applied at the primary server.

Peer-to-peer replication for high-availability scenarios

If you choose peer-to-peer replication for high availability, then all servers are available for read and write at any time. (More than two servers can be configured for peer-to-peer replication.) Therefore, if one server becomes unavailable, then the other servers are immediately available to take over or switch back. This configuration provides the most robust conflict detection. However, you must weigh this ease of use, lack of outage time, and robust conflict detection against the additional overhead that the system incurs from the extra versioning columns on each copy of the replicated table and the triggers that are required to maintain those columns.

Considerations for conflict detection in high-availability scenarios

Even the highest level of value-based conflict detection (checking all columns) in bidirectional replication is not as robust as version-based conflict detection in peer-to-peer replication. Some situations might result in a conflict not being detected.
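One way a value-based check can miss a conflict is when a row is changed at the target and then changed back to its original values. The following sketch illustrates that case with a deliberately simplified model: the `Row` class and both check functions are hypothetical, and a plain counter stands in for the versioning columns that peer-to-peer replication maintains.

```python
from dataclasses import dataclass


@dataclass
class Row:
    values: dict
    version: int = 0  # stand-in for the versioning columns (assumption)


def local_update(row: Row, **changes) -> None:
    """Update the row locally; every update bumps the version."""
    row.values.update(changes)
    row.version += 1


def value_based_conflict(target: Row, before_values: dict) -> bool:
    """Compare all column values with the source's before-values."""
    return target.values != before_values


def version_based_conflict(target: Row, before_version: int) -> bool:
    """Compare the target's version with the version the source saw."""
    return target.version != before_version


target = Row({"qty": 10})
before_values = dict(target.values)   # what the source saw before its update
before_version = target.version

# The target row is changed and then changed back before the
# replicated update arrives.
local_update(target, qty=7)
local_update(target, qty=10)

value_check = value_based_conflict(target, before_values)
version_check = version_based_conflict(target, before_version)
```

Here `value_check` is `False` because the column values match again, while `version_check` is `True` because the version records that the row was updated in the meantime.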

Recommendation: If you expect conflicts by application design, then choose a peer-to-peer configuration.

When application developers design an application that involves distributed copies of tables that can be updated at any server, the developers must fully explore the potential for conflicts, the impact of conflicts, and how conflicts will be resolved. Because conflicts can result in loss of durability of a transaction, applications should be designed to minimize the potential for conflicts.

When the Q Apply program detects a conflict for a given row, it acts only on that row. In either bidirectional or peer-to-peer replication, the conflicting row is either accepted as a change to the target, or it is ignored and not applied. Because the goal of peer-to-peer replication is to provide a convergent set of copies of a database table, the conflicting row is acted upon, not the entire transaction. The practice of rejecting or accepting whole transactions is likely to quickly lead to a set of database copies that do not converge.
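The row-level behavior described above can be sketched as follows. The function, the change tuples, and the `is_applied` field are assumptions made for illustration and do not reproduce Q Apply's internal logic: each change in a transaction is checked on its own, and a conflict affects only the row in which it occurs.

```python
def apply_transaction(target: dict, changes: list, conflict_action: str) -> list:
    """Apply a transaction row by row.

    changes is a list of (key, before, after) tuples. A row conflicts
    when the target value differs from the source's before-value.
    Returns one exception record per conflicting row.
    """
    exceptions = []
    for key, before, after in changes:
        if target.get(key) == before:
            target[key] = after               # no conflict on this row
            continue
        applied = conflict_action == "force"  # decide for this row only
        if applied:
            target[key] = after
        exceptions.append({"key": key, "is_applied": applied})
    return exceptions


target = {1: "a", 2: "local", 3: "c"}
txn = [(1, "a", "a2"), (2, "b", "b2"), (3, "c", "c2")]
exceptions = apply_transaction(target, txn, conflict_action="ignore")
```

In this run, rows 1 and 3 are applied, row 2 keeps its local value, and a single exception is recorded for row 2. The rest of the transaction is not rolled back.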

The Q Apply program reports all conflicting rows in the IBMQREP_EXCEPTIONS table. In a peer-to-peer configuration, which might include several servers, Q Replication attempts to report the conflicting row only once. But in some cases a conflict might show up in the IBMQREP_EXCEPTIONS table on more than one server. These duplications are easy to see because the data values and versioning information are identical. To see the complete conflict activity for a peer-to-peer or bidirectional configuration, look at the IBMQREP_EXCEPTIONS tables of all servers.
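To build a combined view, you might collect the exception rows from each server and group the ones whose data values and versioning information are identical. The sketch below works on in-memory rows; the dictionary keys (`table`, `data`, `version`) are hypothetical and do not correspond to actual IBMQREP_EXCEPTIONS column names. In practice you would populate the rows by querying the table on each server.

```python
def merge_exceptions(per_server: dict) -> list:
    """Group identical exception rows that were reported by several servers."""
    merged = {}
    for server, rows in per_server.items():
        for row in rows:
            # Rows with identical data values and versioning information
            # are treated as the same conflict.
            fingerprint = (row["table"],
                           tuple(sorted(row["data"].items())),
                           row["version"])
            entry = merged.setdefault(fingerprint, {"row": row, "servers": []})
            entry["servers"].append(server)
    return list(merged.values())


duplicate = {"table": "T", "data": {"id": 1, "qty": 5}, "version": "v7"}
other = {"table": "T", "data": {"id": 2, "qty": 9}, "version": "v3"}

combined = merge_exceptions({
    "A": [duplicate],
    "B": [dict(duplicate), other],
})
```

The result contains two conflicts: one that was reported on both server A and server B, and one that was reported only on server B.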