Note: Please read the disclaimer before you read the article..
With the release of WebSphere Information Integrator 8.2, IBM has delivered a major advancement in database replication through a function called queue replication, also known as Q replication. This new replication architecture has emerged from the need for high performance: high transaction volume combined with low latency, which is the elapsed time from committing database changes on the source database until they are committed on the target database.
This article describes the new Q replication architecture, discusses what levels of performance to expect and what factors influence that performance, and includes examples drawn from laboratory measurements on both AIX and z/OS platforms. This information will be useful to customers interested in using Q Replication, and curious about its effects on performance.
Why a new architecture?
Data replication technology is not new for IBM or DB2®, having been first introduced a decade ago with a product named DataPropagator™ Relational (DPropR). Since then the names and packages have changed, the product breadth and functionality have expanded, and the resulting set of replication products has experienced substantial commercial success.
The existing replication architecture, known as SQL replication, continues to hold a place alongside Q replication, and will likely continue to be the optimal solution for a variety of customer scenarios.
With SQL replication, database changes are captured and stored temporarily in relational tables called staging tables. Those staging tables are then read from a client interface on the target system and applied to the target tables. SQL Replication is packaged with DB2 for Linux, UNIX®, and Windows®, available as DB2 Data Propagator for z/OS, and is part of all WebSphere Information Integrator offerings that include replication. For some scenarios, such as those involving a non-DB2 database as source or target, SQL Replication may still be the best choice at this point in time.
Despite its success, SQL Replication has faced some technical challenges, particularly as customer systems have grown and expanded their performance requirements. Q Replication has been designed to address these performance needs, along with increased function and improved manageability.
Systems continue to grow in size and stretch throughput limits. At the same time, customers want more real-time access to the replicated changes. These real-time changes often need to flow in more than one direction, creating the need for synchronizing two or more systems, and the need for resolving conflicts when these systems concurrently change the same data.
Customers have a wide variety of needs for database replication, a process that has often required careful planning and scheduling. Some traditional uses and future trends include:
- Maintaining data warehouses. This is one of the most common uses of replication today. Warehouses are isolated from operational data, making them suitable for ad hoc queries without interfering with performance of production applications.
- Business continuity and disaster recovery. This may be the fastest growing requirement for database replication. Today the industry provides a variety of hardware and software implementations, along with synchronous and asynchronous replication techniques, to address the problems of business continuity. Q replication operates asynchronously, meaning that changes are propagated after the source database operations are committed. Customers often choose an asynchronous approach when long distances separate primary and backup databases. Distance increases transmission latency, making synchronous approaches impractical for high performance applications. However, asynchronous replication implies that the application can tolerate some transactions lost in transit in the event of a failure. Smaller delay (lower latency) means fewer lost transactions. One unique advantage of Q Replication over other commonly used hardware and software solutions is that the backup database can be continuously active, minimizing the recovery time.
- Distribution of data for workload balancing. This usage makes sense as long as the added cost of replication is small relative to the overall amount of processing, as is the case for applications that have low rates of modified data relative to retrieval or application processing.
- Geographic distribution or consolidation of data. Performance can be enhanced by keeping appropriate applications “close" to end-users, or co-located with other relevant data.
- Enabling applications that use multiple databases. Many Web-based applications are comprised of more than a single database. A customer making an online trade may work with one database when making a trade, but reference a second database to verify completion of the process. The trading database may be designed for a legacy application and organized for online transaction processing (OLTP) performance, whereas the trade history database may be organized for query performance, and reside on a different platform. It is therefore critical to the seamless operation of some applications to be able to move changed data almost immediately from one database to another.
How does Q replication work?
As you can see in Figure 1 below, Q replication uses message queues to transport transactional data changes between source and target databases. The queuing system employed is IBM’s flagship messaging product, WebSphere MQ. WebSphere MQ is packaged with WebSphere Information Integrator editions for Linux, UNIX, and Windows platforms and is a prerequisite product for the z/OS environment.
Figure 1. Q replication
Q replication has two primary components, Q Capture and Q Apply. Q Capture reads data from the database recovery log, and creates messages consisting of selected data changes from committed transactions. Those messages are then written to one or more MQ queues. This means that this process is asynchronous from the source database, and that messages are sent to MQ only after the database changes have been committed to the source database. Each logical message represents a complete and committed database transaction. Q Apply receives the transaction messages from the message queue and applies each message as a transactional unit to the target system.
WebSphere MQ messaging facilities handle the problems of transmitting messages across heterogeneous systems and network protocols. Under the covers, MQ uses log files and data files, similar to the way a database management system might, to deal with managing integrity and persistence of the messages. These files essentially replace the staging tables and associated database logging functions handled by DB2 as part of SQL replication.
As described earlier, Q replication allows multiple copies of data with changes occurring at more than one site and database changes flowing in more than one direction. This also means that conflicts can occur, for example when more than one site updates the same row. Q replication provides a mechanism for resolving those conflicts.
Q subscriptions define the relationships between the source and target tables, and can be defined with a mapping type of unidirectional, bidirectional, or peer-to-peer. Data changes replicated in either direction between two servers can be defined as bidirectional or peer-to-peer. Bidirectional mappings provide a simple, low-cost method of conflict detection and resolution, by evaluating data values at the source and target databases.
Peer-to-peer mappings provide more robust conflict detection and resolution based on version information added to the source application data. The version information is implemented under the covers of Q replication using triggers in the underlying database engine. While more costly from a performance perspective, this method enables the convergence of two or more databases that can be updated in parallel. The fifth section of this article compares the performance of these alternatives.
How does performance of Q replication compare to SQL replication?
Figures 2 and 3 below illustrate the benefits of Q replication compared to SQL replication for both AIX and z/OS environments. Both sets of measurements for equivalent workloads and configurations show almost three times the throughput for Q replication compared to SQL replication, while maintaining much lower latencies at the high throughput points. In both of these configurations, Q replication is able to replicate more than 12,000 rows per second, with an end-to-end latency of less than 2 seconds, and consistently less than 1 second at lower rates. The workload used in these measurements consisted of INSERTs only, simulating a moderately complex transaction with 10 INSERTs per transaction, and 212-byte rows. This very simple workload is useful in that it focuses on the job that replication performs and minimizes hardware resources used in the test. For example, we don’t need to allocate hardware resources for SELECTs or application processing, as a real application might. Further information about the workloads and configurations used in this paper can be found in the "Workloads and configurations" section.
Also, note that the latency improvements compared to SQL replication may be even better than those shown in these charts. Q replication provides very nice performance monitoring capabilities, including actual measurements of end-to-end latency, meaning the time from committing a transaction on the source until it is committed on the target. In the case of SQL replication, these monitoring capabilities do not exist, and so the benchmark application employed a technique to measure the average row latency. Since there are multiple rows per transaction, the transaction latency of SQL replication could actually be higher than shown in the chart.
For both the z/OS and AIX measurements, the Q replication tests showed replication throughput limits of approximately 12,000 rows per second, at which point various resources become saturated, and cause latencies to increase. Perhaps more significant is the fact that the latencies remain very stable until those saturation points are reached, and it is our experience that most customer requirements will likely fall well below these limits. For peace of mind, it would be advisable to estimate customer throughput requirements in terms of replicated rows per second, and ensure that peak periods remain below these limits. In addition, the limits are not absolute and are subject to workload and configuration variables discussed later.
As you can see in Figure 2, Q replication has higher throughput and lower latency in z/OS....
Figure 2. Performance of Q replication vs. SQL replication - z/OS
...and likewise on AIX.
Figure 3. Performance of Q replication vs. SQL replication - AIX
How does Q replication improve performance?
There are a number of factors that help contribute to improved performance with Q replication, which include:
- Reduced DB2 logging activity associated with staging tables
- Reduced processor usage
- Increased parallelism in Q Apply processing
Reduced DB2 log activity With SQL replication, database changes are captured in DB2 staging tables on the source database. This requires that additional log records be written to the DB2 log. (If all database changes were captured, this would result in additional logging for inserts and deletes of staging table rows, potentially tripling the write activity to the DB2 log.) With Q replication, the staging tables are replaced with MQ queues. Data files and log files provide the underlying storage for MQ queues, and as long as these files are placed on separate disks from the DB2 log, contention on the log files is reduced.
Reduced processor usage For a variety of reasons, Q replication consumes substantially fewer processor cycles than SQL replication on the source server, somewhat higher on the target server, and fewer overall. This is most significant, since the source server is most likely to be running the “production" application, where processor cycles may be scarce.
Figure 4 below shows comparisons of processor utilizations on z/OS for SQL replication and Q replication. The graph shows two measures of processor costs on the source and target systems: microseconds per replicated row and MIPS(1) required for every one thousand rows replicated per second(2)You can use these metrics to understand the differences in processors costs for the different replication methods, as well as a rough “capacity planning" estimate for replication.
The measurements tried to compare measurement points with roughly equal, though not identical, throughputs. In all of these cases, replication is able to keep pace with the workload demand and latencies are less than five seconds.
Figure 4 illustrates the reduced processor consumption with Q replication:
- Q capture CPU reduced 2-3X
- Q capture CPU around 70-80 microseconds per row
- Slightly higher CPU on target (queue management)
Figure 4. Q replication - reduced processor consumption
Here are some observations from these measurements:
- Comparing Q replication with SQL replication, the processor costs per replicated row on the source server are reduced by approximately two thirds. The Q Capture cost is approximately 70-80 microseconds per row, less than 20 MIPS per one thousand rows replicated per second.
- Comparing Q replication to SQL replication, processor costs increase somewhat on the target system, due to the fact that some housekeeping functions, such as “pruning" the message queues, is now being performed predominantly on the target system. However, the overall cost for both systems is typically reduced with Q replication.
Increased parallelism in Q Apply processing Q Apply employs a high degree of parallelism, the key to sustaining the throughput needs of a source application at the replication target. If a source application employs a high degree of parallelism to achieve high throughput, the Q Apply component can use parallelism to keep pace.
As illustrated in Figure 5 below, for every incoming receive queue, Q Apply initiates a browser (think of this process as “browsing a queue"). Q Apply reads transactions from the queue, examines dependencies between transactions, and based on this dependency analysis applies data changes through parallel Q Apply agents, while also ensuring data integrity. One Q Apply program can process multiple message queues (a browser for each queue), and each browser can initiate many agents. Q Apply agents can apply transactions in parallel, emulating the parallelism of the originating source application and thereby dramatically increasing replication throughput.
Figure 5. Q Apply process
In theory, SQL replication can achieve some level of parallelism by initiating several instances of the SQL Apply program, but that requires work by an administrator to divide the tables into appropriate groups and to balance the workload across those groups. In addition, it does not allow for parallelism within a single highly-active table, as can be done with Q replication.
Keep in mind that even with a single message queue and single browser, Q replication employs multiple agents (16 per browser is the default) with no special tuning effort on the part of the administrator. Q Apply is designed to automatically determine the dependencies among transactions so that data changes are applied in the proper sequence. The administrator can define additional queues, but this requires more care, since no transactions can modify tables across message queues. This might be an appropriate thing to do for totally separate applications.
Figure 6 below illustrates the effects of increasing the number of Q Apply agents and browsers. In this measurement, Q Apply was tested with “pre-loaded" message queues, eliminating any constraints on the Q Capture side. In addition, six processors were used on the z/OS LPAR(3) . As you can see, increasing the number of agents improves throughput dramatically, with some additional benefit from an additional browser. Using a single message queue and the default of 16 agents is often adequate for most applications, in this case up to 15 thousand rows per second (a very hefty workload indeed).
Figure 6. Q Apply throughput rates in z/OS
How do bi-directional and peer-to-peer mappings affect performance?
Figure 7 below compares throughput for various replication alternatives: an INSERT workload with no replication, and Q replication using unidirectional, bidirectional, and peer-to-peer mappings. The workload used in these measurements consisted of eight concurrent jobs executing INSERTs, with a short application delay between transactions(4). However, in the case of bi-directional and peer-to-peer measurements, that identical workload is run on two systems, doubling the total workload demand. The workloads are also executed against separate data, avoiding the costs of conflict resolution. In real life, we would expect customers to design their application to avoid conflicts as much as possible, minimizing effects on performance.
Figure 7. How bi-directional and peer-to-peer affect performance
Figure 8 uses the same format as in Figure 4, showing processor costs in microseconds per row and MIPS required per one thousand rows replicated per second.
Figure 8. How bi-directional and peer-to-peer affect CPU and resource consumption
Here are some observations from these two charts:
- When comparing the bidirectional to unidirectional case, notice that the total throughput rates nearly double as we are running a cloned application on the second system. Keep in mind that each system is running its base workload, capturing those changes, and applying the changes for the other system. Although this nearly doubles the processor utilization on each system, the cost per row(5) increases slightly (approximately 12%). Some additional work is required for bidirectional processing to avoid recursion, ensuring that Q Replication does not re-capture the changes being made by Q Apply. The throughput rate on each system declines slightly, perhaps simply due to increased processor utilizations.
- When comparing peer-to-peer to bidirectional tests, throughput on each system declines slightly again (approximately 17%), though CPU costs per row increase more dramatically (70%). This is primarily due to the internal costs of maintaining version information through database triggers. In our tests, the triggers not only created extra processing costs, but also created some added delay because they are synchronous with the application. Also note that the measurements shown here have been taken on DB2 on z/OS environments. The costs associated with triggers are measurably less on DB2/UDB. Also keep in mind that this workload is highly exaggerated (INSERTs only!), and so the percentage delay is highly exaggerated relative to most normal customer workloads that also read and process data. In addition, allowing applications on separate systems to operate concurrently against data replicas may provide users with overall productivity, data availability, and throughput enhancements.
How might these results apply to other workloads and configurations?
Consistent with all performance measurements taken through the history of time, results can vary wildly depending on workloads and configurations. You might be wondering how those factors would affect your intended usage of Q replication. The following are some thoughts on the influence of some of these factors.
Throughput In this paper, we have chosen to measure replication throughput in terms of replicated rows per second. There are other possible metrics, such as transactions per second, but given the wide variance in complexity of transactions, we have avoided using that as a general throughput measure. However, the complexity of a transaction does influence replicated rows per second. As mentioned earlier, most of the measurements taken in this effort have used a workload that is extreme in some respects (INSERTs only), but reasonable in other respects: ten rows per transaction, 212-byte rows, 14 columns.
Figure 9 shows that changing these parameters influences throughput (rows per second). This chart shows the throughput achieved in the capture process when the number of rows per transaction is varied (from 2 rows/transaction up to 20 rows/transactions), and when the size of the rows is varied (from 192 bytes to 2K bytes). From that chart you can see that as number of rows per transaction is increased, replicated rows per second increases. This is due to the fact that replication requires overhead for processing each row as well as overhead for processing each transaction. More rows per second imply fewer commit points, and therefore less overhead. Likewise, replicating more data (larger row sizes) creates more work, and reduces replicated rows per second.
Figure 9. Throughput of Q Capture vs. row and transaction size
Latency In this paper, we have shown some excellent performance results, and in some cases latency values below one second. Although this kind of result is achievable in well-optimized environments, there are other factors that will likely increase replication latency values above what have been previously shown. Some of these include:
- More complex transactions (such as thousands of rows per transaction). We have defined (and Q replication measures) latency as transaction latency, meaning the time from committing a database transaction on the source until it is committed on the target. The process does not start until after the source transaction is committed. Very large transactions are simply more work, and will take longer to process.
- Larger row sizes
- UPDATE transactions. The database management system must do more work on an UPDATE than an INSERT, such as retrieving the row of interest.
- Long network distances (wide area networks)(6)
- Data sharing. Q replication must retrieve ane process multiple log files.
Further information about these monitoring capabilities, as well as a wealth of other performance tuning information can be found in the first resource listed below.
Figure 10. An example with 4-way data sharing
Of course, there are many factors that can affect performance expectations. Results are highly dependent on workload and configuration variables. Although sub-second latencies are achievable, they are certainly not guaranteed. By the standards of today’s technology, replication latency under five seconds should still be considered outstanding, and probably achievable in most normal Q replication scenarios.
Workloads and configurations
The measurements shown in this paper were conducted by the performance analysis team for WebSphere Information Integration at the Silicon Valley lab during the product development of Q Replication. Some of the measurements were taken using product code levels prior to general availability.
The workloads were intended to provide as much stress on Q Replication as possible, while minimizing hardware resources. Consequently, most of the measurements consisted of nothing but database INSERTs. (Since real life workloads would contain both application processing and SELECTs, the resources consumed by the replication components for this INSERT workload are much higher than a customer might expect on a comparable system.)
For the z/OS measurements, the configuration consisted of the following:
- System: Two LPARs of a 2064-216 z900 mainframe. Each LPAR consisted of four processors, except where noted and eight GB of memory. Each processor was roughly a 200MIP machine (800 MIPs total).
- DB2 for z/OS V7
- MQ 5.3.1
These measurements were taken using a 2064-216 mainframe split into 2 LPARs, in this case each LPAR using 4 processors.
- Network: 1 Gb OSA card (ethernet) between LPARs
- Disk: Enterprise Storage Server Model 800
For the AIX measurements, the configuration consisted of the following:
- System: Two LPARs of an IBM pSeries model p690. Each LPAR consisted of 8 processors, 8GB memory.
- Network card: 10/100 Mbit Ethernet
- Software: AIX 5.2, MQ 5.3, DB2 V8.2
- Disk: SSA disk with fast write cache
Performance monitoring and tuning
Although it is important for a product to achieve good performance, it is also important for an analyst to be able to monitor that performance. Q replication provides some useful capabilities not currently available with SQL replication. Perhaps most critically, it is now possible to measure the end-to-end transaction latencies achieved by Q replication. The Replication Center provides reports for Q Capture and Q Apply, including such measures as:
- Replication throughput
- Replication latency
This includes a measure of total end-to-end latency, as well as measures of latency components.
- Memory usage
- Other exceptional condiitons
In addition, the performance statistics are maintained in monitoring tables (CAPMON, CAPQMON, and APPLYMON), which can be queried by other monitoring software.
Further information about these monitoring capabilities, as well as a wealth of other performance tuning information, can be found in the Resources section.
We believe Q replication provides a significant advancement in the technology of database replication, with substantial improvements over SQL replication. In a test environment, we found:
- Comparisons to SQL replication show approximately three times the replication throughput and substantially better replication latency.
- Q replication achieved over 12 thousand replicated rows per second, while sustaining latencies consistently below five seconds, and in many cases below one second.
- Q replication required fewer over processor cycles, with substantial savings on the source server.
These performance enhancements, along with other functional improvements, will enable you to make better use of database replication to solve your business needs.
Various members of the Q Replication development and performance teams have contributed to the material in this document including John Aschoff, Nhut Bui, Ying Chang, Nguyen Dao, and Beth Hamel. Questions and comments can be directed to John Aschoff (firstname.lastname@example.org).
(1) MIPS = Million Instructions per Second, an approximate method of rating relative speed of mainframe processors. Each of the four processors used in these experiments was rated at approximately 200 MIPS. Although MIPS ratings are very approximate measures of relative
(2) The technique used here for calculating the microseconds per row was simply based on the observed total processor utilization on this 4-processor system amortized per row. This technique has the advantage in that it captures all of the time spent, but the drawback that it might be somewhat inflated at low utilizations due to “low utilization effects." On the Q capture side, 100 microseconds per row was deducted, as this was the observed cost per row for a simple INSERT workload without any replication. The MIPs required is strictly based on this one set of observations of this 200 MIP processor.
(3) Most of the other z/OS measurements in this paper used a 4-processor LPAR.
(4) This workload configuration was not designed to find the maximum possible throughputs, but rather to be able to compare various alternatives with an equal workload demand. As a result, we intentionally introduced a short (9 millisecond) delay between transactions.
(5) When calculating the cost per row, we are taking the CPU cost divided by the sum of the captured and applied rows on each system.
(6) Hey, light and electricity can travel only so fast!
The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment do so at their own risk.
The performance information data shown in this document is based on results obtained in a test environment. Individual results and performance may vary depending on a variety of factors. Further information about the workloads and configurations used in this paper can be found in the "Workloads and configurations" section.
- DB2 Information Integrator Tuning for Replication and Event Publishing Performance (SC18-9289-00)
- DB2 Information Integrator Replication and Event Publishing Guide and Reference (SC18-7568-00)
- DB2 Information Integrator Introduction to Replication and Event Publishing (GC18-7567-00)
- To learn more about WebSphere Information Integrator, visit the Websphere Information Integrator page on developerWorks. You'll find technical articles, links to documentation and downloads, and more.