Mapping to Hadoop

The CDC Replication Engine for InfoSphere® DataStage® supports delivery to Apache Hadoop through the WebHDFS table mapping method.

CDC Replication produces one or more flat files that contain the records and database operations captured from the source. If you use the Single Record option when mapping your tables, each record occupies its own line, followed by a delimiter. If you choose the Multiple Records option, each record occupies two lines. For update operations, the flat file contains both the before and after images of the row. The files are saved for processing by Apache Hadoop.
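As an illustration of how a downstream job might consume these files, the following sketch reads a flat file mapped with the Multiple Records option, pairing the before and after images of an update. The pipe delimiter and the operation-type code in the first field are assumptions made for this example, not product defaults.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CdcFlatFileSketch {
        private static final String DELIM = "\\|"; // assumed field delimiter

        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split(DELIM, -1);
                    // Hypothetical convention: the first field carries the operation type.
                    if ("U".equals(fields[0])) {
                        // With Multiple Records, an update's after image is on the next line.
                        String afterLine = in.readLine();
                        if (afterLine == null) break; // truncated file; stop the sketch
                        String[] after = afterLine.split(DELIM, -1);
                        System.out.println("update: " + fields.length + " before fields, "
                                + after.length + " after fields");
                    } else {
                        System.out.println("insert/delete: " + fields.length + " fields");
                    }
                }
            }
        }
    }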

Understanding the workflow

Regardless of the connection mode, flat files are written to HDFS by CDC Replication either when data limits are reached (determined by the Batch Size Threshold settings in the Hadoop Properties dialog box in Management Console, available after you map your tables) or when a refresh or mirroring operation ends. When a refresh or mirroring operation begins, CDC Replication starts writing change information to temporary data files, but only for those tables in the subscription that have changes. Temporary file names are prefixed with an underscore. Once the Batch Size Threshold limits are met, CDC Replication renames the temporary data files at the subscription level, adding timestamps to the file names. The renamed files are ready for consumption by Apache Hadoop.
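A consumer can rely on the underscore prefix to avoid reading files that CDC Replication is still writing. The sketch below uses the standard Hadoop FileSystem API to list a subscription's output directory (the path argument is a placeholder) and select only the renamed, completed files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CompletedFileLister {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // List the subscription's output directory, passed as a path argument.
            for (FileStatus status : fs.listStatus(new Path(args[0]))) {
                String name = status.getPath().getName();
                // Skip in-progress temporary files, which are prefixed with "_".
                if (!name.startsWith("_")) {
                    System.out.println("ready for processing: " + name);
                }
            }
        }
    }

Hadoop's own input formats follow the same convention: FileInputFormat hides paths that begin with an underscore or a dot, so MapReduce jobs that read the directory also ignore the in-progress files.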

For the WebHDFS connection method, CDC Replication communicates with Hadoop by using the HTTP REST API. This method allows much greater flexibility in where the CDC Replication target is installed. With the WebHDFS connection method, you can use simple or Kerberos authentication. The authentication method is configured at the subscription level and applies to all table mappings in the subscription.
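To show the style of interaction involved, the following sketch issues a WebHDFS REST call directly, listing a directory with the LISTSTATUS operation and simple authentication via the user.name query parameter. The host, port, path, and user name are placeholders; on a Kerberos-secured cluster the client would negotiate SPNEGO authentication instead:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsListSketch {
        public static void main(String[] args) throws Exception {
            // WebHDFS exposes HDFS operations as HTTP requests under /webhdfs/v1.
            URL url = new URL("http://namenode.example.com:9870/webhdfs/v1"
                    + "/user/cdc/sub1?op=LISTSTATUS&user.name=cdcuser");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON FileStatuses response
                }
            }
        }
    }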