Ensuring transactional consistency with Netezza when using CDC and DataStage

Properly configuring the Netezza Connector

This article shows you how to configure the Netezza Connector properly when transactions coming from IBM InfoSphere® Data Replication’s Change Data Capture (CDC) are first passed through DataStage® before being written to PureData™ System for Analytics (Netezza). Walk through a use case where a flat file implementation provides near real-time experience. The author highlights Netezza Connector implementations that do and do not work.

Share:

John Bekisz (jbekisz@us.ibm.com), Technical Specialist, North American Team, IBM

John BekiszJohn Bekisz has more than 30 years of experience with a diverse background spanning from hardware, software, operations, sales, business development, and management. Currently, he is on the North American Technical SWAT Team for IBM, where his specialty is data integration and information governance.



06 March 2014

Also available in Russian

Introduction

Meeting today's big data challenges provides us with exciting and complicated tasks. Luckily, there are amazing tools to reconcile, maintain, and deliver data and data services for analytic applications. This article walks you through a use case to learn how to ensure transactional consistency with Netezza when IBM’s InfoSphere Data Replication’s Change Data Capture (CDC) is sending transactions to InfoSphere DataStage, with PureData System for Analytics (Netezza) as the target. Learn about Netezza Connector configurations that are and are not supported, and discover how to configure the Netezza Connector to ensure true transaction consistency.

Prerequisites

This article is intended for intermediate users of DataStage, novice users of CDC, and new users of PureData System for Analytics (Netezza). Knowledge of DataStage jobs, routines, and job sequences is assumed. Details on creating CDC subscriptions are outside the scope of this article.


What is not supported

First, let's walk through a short use case of what "doesn't work." Doesn't work is a bit misleading: A "not supported" implementation is a much better way to describe the configuration in Figure 1, because both the bookmark link and the data link must go to the same Connector Stage. However, the Netezza Connector can only have one input link. If you separate the two links going to two different Connector Stages, you can lose transactional integrity.

Figure 1. Non-supported implementation

Interestingly, and misleadingly to the DataStage developer, for this particular configuration and when using the CDC Transaction Stage as input, the DataStage job runs, but the changes do not get committed to Netezza until the DataStage job is brought down.

The CDC subscription appears to be successfully passing and writing transactions directly into DataStage via the CDC Transaction Stage, as in Figure 1. When you bring up the CDC subscription, it successfully starts the DataStage job as it is designed to do. The DataStage links turn blue, indicating that the job is running. The Director log also shows that the job is running and waiting for transactions to arrive. The DataStage job actually passes those incoming CDC transactions to Netezza (but won't show that in the log, just yet). But, the rows never get committed to Netezza until you stop the CDC subscription, which subsequently stops the DataStage job (as designed). Then, all transactions commit, and the DataStage log shows that the job successfully updated Netezza with x number of rows.

However, when the bookmark link and data link are separate, you can lose transactional integrity at some point. If the two writes are in separate transactions (targeting two distinct Netezza Connector stage instances as shown in Figure 1 by the separate links), there is no guarantee they will be correctly synchronized. The commit to the bookmark table could succeed while writing to the target database could fail. This situation would lead to the successfully committed bookmark value being incorrect.

Thus, this is not a supported implementation. Netezza itself is not an OLTP database; it expects batches (in this case micro-batches, if you will) of transactions. Even if transactional integrity could be guaranteed, this use case doesn't make sense because the CDC subscription and the corresponding DataStage job would need to be brought down every so often for the rows to get committed to Netezza.


What is supported

This section describes what works and is supported.

CDC automatically creates the proper .dsx file to import into DataStage when you build a subscription that writes to DataStage using flat files. CDC supplies two routines, a job sequence, and a sample DataStage job in that .dsx file, as shown in Figure 2. However, the DataStage job that is supplied is just an example job with DB2 as the target. You need to build your own job. For the example use case, I built a job that uses Netezza as the target (Figure 2).

Figure 2. Supported implementation

When CDC writes out the transactions to a flat file, it also includes a column indicating the operation type for each affected source row. The operation type indicates whether that row is an insert (I), an update (U), or a delete (D). It is important to know and consider the operation type because the type of transaction, and its proper order, must be understood and respected when writing the transactions to Netezza.

The DataStage job in Figure 2 reads the files in the directory that CDC writes to. This job also writes out all the transactions to a flat file, in a sub-directory where the original flat files land from CDC, as a backup (this additional step is optional). There is a job sequence that controls when to read, and which files to read, as well as when to delete them, once they are fully digested by DataStage. The job sequence correctly detects which flat files to read in the right sequence based on a timestamp in the file name. It also detects when the file has been hardened and is ready to be read. This is all somewhat explained in the job sequence that CDC supplies in the .dsx to be imported by DataStage. (Experienced DataStage users will quickly understand how the job sequence works.)

When using flat files from CDC, the bookmark table is stored where CDC for DataStage is installed, so that's one item you don't need to be concerned about.


Configuring the Netezza Connector

How should the Netezza Connector be configured in order to ensure true transactional consistency? How will Netezza know the type of incoming transactions and, most importantly, how will it keep the order of those incoming transactions correct? Can it ensure inserts occur before updates to the same row, that a delete of a row should not occur before the same row is inserted, and so forth?

Set your options in the DataStage Netezza Connector as shown in Figure 3 to ensure that all inserts, deletes, and updates are committed to Netezza correctly.

Figure 3. Netezza Connector setup

When Atomic mode is set to Yes, all write mode statements are executed in one transaction. That's how Netezza works by default. In this use case, that is not good. If the flat file you are reading is inserting a row, and in the same flat file you have a delete for that same row, the insert will occur but the delete will not. That happens because the insert was not committed before the delete, so the delete is effectively ignored.

When Atomic mode is set to No, each write mode statement (Update, Insert, Delete) for the rows in the same flat file are executed in a separate transaction. In effect, this simulates the behavior of each row being committed one at a time. The option in the Action column, which is being used in Figure 3, effectively breaks the input rows into groups for execution in different SQL statements based on the value of the action column, and other factors, to achieve the same final result as if each row (action) were applied one at a time. The Connector is "smart" enough to detect if an insert and delete occur for the same row (same primary key) in the same flat file batch; the insert will not take place at all and will be ignored.

Of course, if the insert and delete are in two different flat files and Atomic mode is set to Yes, then this won't happen. The insert will occur, then the delete for the same row will occur because the first flat file (with the insert) will be committed before the second flat file (with the delete) is read. For this use case, it is not recommended to set Atomic mode to Yes and to assume that inserts and deletes of the same row will always be written to separate flat files. That would be a bad assumption.

We set Check duplicate rows to Yes along with the Duplicate row action sub-option to Filter them out. If an insert of duplicate primary keys attempts to load into Netezza, the DataStage job aborts. Setting the options as shown in Figure 3 alleviates that issue.


Additional settings

To write to Netezza sequentially, in the Netezza Connector, set the option for Execution mode to Sequential, as in Figure 4. This ensures that transactions are written in the correct order to the temporary work table in Netezza, from DataStage. It is recommended that DataStage be run with a one node configuration so that transactions are not partitioned across multiple nodes.

Figure 4. Advanced Stage properties

Performance considerations

In terms of performance, there are no significant performance issues even with options set to Sequential, Atomic mode to No, or running DataStage with a one node configuration (which you might suspect would have an impact).

There aren't that many CDC transactions. In our subscription, we set the option to send a flat file to DataStage when either 300 transactions are made or 1 minute passed. So, the DataStage job writing to Netezza happens very quickly anyway. You can set the CDC option to be much longer, say every 5 or 15 minutes, depending on the volume of changes occurring at the source. It's up to you to determine the optimum time span, or number of transactions, based on the business requirements. The DataStage routines and job sequence that are automatically created by CDC will correctly handle which flat file to read first because they do get time-stamped.


Conclusion

The flat file implementation provides the near-real time experience users might expect. It is the most common implementation when it comes to using Netezza as a target when DataStage is in the mix.

For CDC users, there is another supported implementation to consider. CDC can be set up to write to an audit database. DataStage will read the data from the staging table. The data in the table will only be visible to the DataStage job after a commit, which is effectively equivalent to the hardening of the flat files in the use case in this article. Both use cases should be considered with performance in mind based on your specific business requirements and environment.


Acknowledgements

I would like to thank Branislav Barnak, Paul Stanley, Glen Sakuth, Elaine Pang, Michael Aldrich, and Camila Hiskey for their input and editing assistance with this article.

Resources

Learn

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=964898
ArticleTitle=Ensuring transactional consistency with Netezza when using CDC and DataStage
publish-date=03062014