Ensuring transactional consistency with Netezza when using CDC and DataStage
Properly configuring the Netezza Connector
Meeting today's big data challenges provides us with exciting and complicated tasks. Luckily, there are amazing tools to reconcile, maintain, and deliver data and data services for analytic applications. This article walks you through a use case to learn how to ensure transactional consistency with Netezza when IBM’s InfoSphere Data Replication’s Change Data Capture (CDC) is sending transactions to InfoSphere DataStage, with PureData System for Analytics (Netezza) as the target. Learn about Netezza Connector configurations that are and are not supported, and discover how to configure the Netezza Connector to ensure true transaction consistency.
This article is intended for intermediate users of DataStage, novice users of CDC, and new users of PureData System for Analytics (Netezza). Knowledge of DataStage jobs, routines, and job sequences is assumed. Details on creating CDC subscriptions are outside the scope of this article.
What is not supported
First, let's walk through a short use case of what "doesn't work." Doesn't work is a bit misleading: A "not supported" implementation is a much better way to describe the configuration in Figure 1, because both the bookmark link and the data link must go to the same Connector Stage. However, the Netezza Connector can only have one input link. If you separate the two links going to two different Connector Stages, you can lose transactional integrity.
Figure 1. Non-supported implementation
Interestingly, and misleadingly to the DataStage developer, for this particular configuration and when using the CDC Transaction Stage as input, the DataStage job runs, but the changes do not get committed to Netezza until the DataStage job is brought down.
The CDC subscription appears to be successfully passing and writing transactions directly into DataStage via the CDC Transaction Stage, as in Figure 1. When you bring up the CDC subscription, it successfully starts the DataStage job as it is designed to do. The DataStage links turn blue, indicating that the job is running. The Director log also shows that the job is running and waiting for transactions to arrive. The DataStage job actually passes those incoming CDC transactions to Netezza (but won't show that in the log, just yet). But, the rows never get committed to Netezza until you stop the CDC subscription, which subsequently stops the DataStage job (as designed). Then, all transactions commit, and the DataStage log shows that the job successfully updated Netezza with x number of rows.
However, when the bookmark link and data link are separate, you can lose transactional
integrity at some point. If the two writes are in separate transactions (targeting
two distinct Netezza Connector stage instances as shown in Figure 1 by the separate
links), there is no guarantee they will be correctly synchronized. The commit to the
bookmark table could succeed while writing to the target database could fail. This
situation would lead to the
successfully committed bookmark value being
Thus, this is not a supported implementation. Netezza itself is not an OLTP database; it expects batches (in this case micro-batches, if you will) of transactions. Even if transactional integrity could be guaranteed, this use case doesn't make sense because the CDC subscription and the corresponding DataStage job would need to be brought down every so often for the rows to get committed to Netezza.
What is supported
This section describes what works and is supported.
CDC automatically creates the proper .dsx file to import into DataStage when you build a subscription that writes to DataStage using flat files. CDC supplies two routines, a job sequence, and a sample DataStage job in that .dsx file, as shown in Figure 2. However, the DataStage job that is supplied is just an example job with DB2 as the target. You need to build your own job. For the example use case, I built a job that uses Netezza as the target (Figure 2).
Figure 2. Supported implementation
When CDC writes out the transactions to a flat file, it also includes a column indicating the operation type for each affected source row. The operation type indicates whether that row is an insert (I), an update (U), or a delete (D). It is important to know and consider the operation type because the type of transaction, and its proper order, must be understood and respected when writing the transactions to Netezza.
The DataStage job in Figure 2 reads the files in the directory that CDC writes to. This job also writes out all the transactions to a flat file, in a sub-directory where the original flat files land from CDC, as a backup (this additional step is optional). There is a job sequence that controls when to read, and which files to read, as well as when to delete them, once they are fully digested by DataStage. The job sequence correctly detects which flat files to read in the right sequence based on a timestamp in the file name. It also detects when the file has been hardened and is ready to be read. This is all somewhat explained in the job sequence that CDC supplies in the .dsx to be imported by DataStage. (Experienced DataStage users will quickly understand how the job sequence works.)
When using flat files from CDC, the bookmark table is stored where CDC for DataStage is installed, so that's one item you don't need to be concerned about.
Configuring the Netezza Connector
How should the Netezza Connector be configured in order to ensure true transactional consistency? How will Netezza know the type of incoming transactions and, most importantly, how will it keep the order of those incoming transactions correct? Can it ensure inserts occur before updates to the same row, that a delete of a row should not occur before the same row is inserted, and so forth?
Set your options in the DataStage Netezza Connector as shown in Figure 3 to ensure that all inserts, deletes, and updates are committed to Netezza correctly.
Figure 3. Netezza Connector setup
When Atomic mode is set to Yes, all write mode statements are executed in one transaction. That's how Netezza works by default. In this use case, that is not good. If the flat file you are reading is inserting a row, and in the same flat file you have a delete for that same row, the insert will occur but the delete will not. That happens because the insert was not committed before the delete, so the delete is effectively ignored.
When Atomic mode is set to No, each write mode statement (Update, Insert, Delete) for the rows in the same flat file are executed in a separate transaction. In effect, this simulates the behavior of each row being committed one at a time. The option in the Action column, which is being used in Figure 3, effectively breaks the input rows into groups for execution in different SQL statements based on the value of the action column, and other factors, to achieve the same final result as if each row (action) were applied one at a time. The Connector is "smart" enough to detect if an insert and delete occur for the same row (same primary key) in the same flat file batch; the insert will not take place at all and will be ignored.
Of course, if the insert and delete are in two different flat files and Atomic mode is set to Yes, then this won't happen. The insert will occur, then the delete for the same row will occur because the first flat file (with the insert) will be committed before the second flat file (with the delete) is read. For this use case, it is not recommended to set Atomic mode to Yes and to assume that inserts and deletes of the same row will always be written to separate flat files. That would be a bad assumption.
We set Check duplicate rows to Yes along with the Duplicate row action sub-option to Filter them out. If an insert of duplicate primary keys attempts to load into Netezza, the DataStage job aborts. Setting the options as shown in Figure 3 alleviates that issue.
To write to Netezza sequentially, in the Netezza Connector, set the option for Execution mode to Sequential, as in Figure 4. This ensures that transactions are written in the correct order to the temporary work table in Netezza, from DataStage. It is recommended that DataStage be run with a one node configuration so that transactions are not partitioned across multiple nodes.
Figure 4. Advanced Stage properties
In terms of performance, there are no significant performance issues even with options set to Sequential, Atomic mode to No, or running DataStage with a one node configuration (which you might suspect would have an impact).
There aren't that many CDC transactions. In our subscription, we set the option to send a flat file to DataStage when either 300 transactions are made or 1 minute passed. So, the DataStage job writing to Netezza happens very quickly anyway. You can set the CDC option to be much longer, say every 5 or 15 minutes, depending on the volume of changes occurring at the source. It's up to you to determine the optimum time span, or number of transactions, based on the business requirements. The DataStage routines and job sequence that are automatically created by CDC will correctly handle which flat file to read first because they do get time-stamped.
The flat file implementation provides the near-real time experience users might expect. It is the most common implementation when it comes to using Netezza as a target when DataStage is in the mix.
For CDC users, there is another supported implementation to consider. CDC can be set up to write to an audit database. DataStage will read the data from the staging table. The data in the table will only be visible to the DataStage job after a commit, which is effectively equivalent to the hardening of the flat files in the use case in this article. Both use cases should be considered with performance in mind based on your specific business requirements and environment.
I would like to thank Branislav Barnak, Paul Stanley, Glen Sakuth, Elaine Pang, Michael Aldrich, and Camila Hiskey for their input and editing assistance with this article.
- InfoSphere DataStage: Integrate all types of data on distributed and mainframe platforms.
- InfoSphere Change Data Capture: Replicate information in your heterogeneous data stores.
- IBM PureData System for Analytics: Powered by Netezza technology, this is a simple data appliance for serious analytics. Get the highlights from the data sheet.