InfoSphere Data Replication's CDC technology supports many deployment models, and DataStage integration is a popular one. The deployment option you select will significantly affect the complexity, performance, and reliability of the implementation. Where possible, the best solution is to use CDC direct replication (that is, do not add DataStage to the mix).
CDC integration with DataStage is the right solution for replication when:
- You need to target a database that CDC does not support directly and that is not appropriate for CDC FlexRep
- You require complex transformations that CDC cannot handle natively, such as complex table look-ups
- You are integrating with MDM
Cons of replicating from CDC to DataStage to an eventual target database:
- Replicating through DataStage (whichever integration option is chosen) is significantly slower than applying changes directly to the database through a CDC target
- The one exception to this performance penalty is Teradata: with DataStage flat-file integration, throughput is higher than with CDC applying directly to Teradata
- Adding DataStage into the replication stream introduces additional points of failure
- Having a resilient CDC installation is more complex if DataStage is also involved
- When integrating with DataStage, there are two independent GUIs for configuration and two places in which to monitor the replication stream
- Each additional table added to replication requires significant effort to develop its DataStage jobs
- Incorrect DataStage job design can negatively affect transactional integrity and cause data corruption
- The maximum number of tables per CDC subscription is lower if targeting DataStage
- The CDC External Refresh does not work when targeting DataStage. A separate process would have to be put in place to de-duplicate the records produced during the "in-doubt" period of a refresh (the captured changes that occurred while the source data was being refreshed).
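The de-duplication process mentioned in the last point could take many forms. The following is only a minimal sketch of one possible approach, not a CDC or DataStage feature: it replays the in-doubt changes with idempotent (upsert-style) semantics so that a row already loaded by the refresh is not duplicated. The function name `apply_in_doubt_changes`, the operation codes "I", "U" and "D", and the `id` key column are all hypothetical and chosen purely for illustration; a real implementation would depend on the record format your DataStage job receives.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]
# Hypothetical change record: (operation, row image), where operation is
# "I" (insert), "U" (update) or "D" (delete). These codes are assumptions,
# not the actual format emitted by CDC.
Change = Tuple[str, Row]


def apply_in_doubt_changes(
    refreshed_rows: Iterable[Row],
    changes: Iterable[Change],
    key: str = "id",
) -> Dict[Any, Row]:
    """Replay in-doubt changes over refreshed rows without creating duplicates."""
    # Index the refreshed rows by key so repeated keys collapse into one row.
    target: Dict[Any, Row] = {row[key]: dict(row) for row in refreshed_rows}

    for op, row in changes:
        row_key = row[key]
        if op in ("I", "U"):
            # Upsert: an insert for a row the refresh already loaded simply
            # overwrites it, and an update for a missing row inserts it.
            target[row_key] = dict(row)
        elif op == "D":
            # Deleting a row that is not present is treated as a no-op.
            target.pop(row_key, None)
        else:
            raise ValueError(f"Unknown operation code: {op!r}")

    return target


if __name__ == "__main__":
    refreshed = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    in_doubt = [
        ("I", {"id": 2, "name": "b"}),   # already loaded by the refresh
        ("U", {"id": 1, "name": "a2"}),
        ("D", {"id": 3, "name": "x"}),   # never reached the target
        ("I", {"id": 4, "name": "d"}),
    ]
    for row in apply_in_doubt_changes(refreshed, in_doubt).values():
        print(row)
```

The sketch assumes that every table has a unique key and that changes are replayed in the order in which they were captured. Without those two properties, upsert semantics alone would not be enough to guarantee a correct result.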