Adding custom data formatter for generating Parquet format files

The CDC Replication Engine for InfoSphere® DataStage® provides sample custom data formats that you can extend or modify to suit your environment for generating Parquet format files.

The samples are found in samples.jar, which is located in the samples directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The file contains the following samples:

  • SampleParquetTablesawDataConverter.java—Converts a CSV file generated by CDC Replication Engine for InfoSphere DataStage into a Parquet format file by using the third-party Tablesaw API. This sample is located in the com.datamirror.ts.target.publication.userexit.sample.parquet package.

Note the following:

  • To run the sample custom data formats without modifying them, you must specify the fully qualified class name of the compiled custom data format in Management Console. For example, com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
  • Compiled sample custom data formats are located in the java-engine-<cdc_version>.jar file, which is found in the lib directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The compiled custom data formats in the java-engine-<cdc_version>.jar file have a *.class extension.
  • If you want to modify the sample custom data formats, you must compile the custom data formats after you change the source code.
  • The custom data format class must also be listed in your classpath file, <CDC_home>/conf/user-classloader.cp.
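
    For example, a modified sample might be recompiled with javac against the shipped jars. This is an illustrative sketch only; the extraction step, the <cdc_version> placeholder, and the Tablesaw library location are assumptions that you must adapt to your installation:

    ```shell
    # Illustrative sketch only -- adjust all paths and <cdc_version> for your installation.
    cd <CDC_home>

    # Extract the sample source file from samples.jar.
    jar xf samples/samples.jar \
        com/datamirror/ts/target/publication/userexit/sample/parquet/SampleParquetTablesawDataConverter.java

    # Recompile the modified source against the CDC engine jar and the downloaded
    # Tablesaw libraries. Use ';' as the classpath separator on Windows instead of ':'.
    javac -classpath "lib/java-engine-<cdc_version>.jar:<path_to_tablesaw_libs>/*" \
        com/datamirror/ts/target/publication/userexit/sample/parquet/SampleParquetTablesawDataConverter.java
    ```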

Procedure

  1. Stop CDC Replication.
  2. Download the required Tablesaw libraries (see the list of Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files, later in this topic).
  3. Update the classpath file <cdc_home>/conf/user-classloader.cp with these libraries.

    Example:

    If the libraries are downloaded to the folder C:\Users\Administrator\Downloads\pq_tablesaw_libs, update the user-classloader.cp file as shown in the sample classpath below. Replace <cdc_version> with the correct version for your instance; you can check the <cdc_home>/lib directory for the exact version of these jars. Ensure that the logging library slf4j-api-2.0.5.jar is included in your classpath file (see the sample classpath below for the full classpath).

    Note: When using the SampleParquetTablesawDataConverter from CDC, the Hadoop home directory must be set. Install Hadoop v3.4 and set the following environment variable on your CDC installation server:
    HADOOP_HOME=<path_to_hadoop_home_directory>
  4. Set the environment variables for the AWS access key, secret key, and bucket name:
    Access key: AWS_ACCESS_KEY_ID
    Secret key: AWS_SECRET_KEY_ID
    Bucket name: AWS_BUCKET_NAME
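
    On Linux or UNIX systems, the environment settings from steps 3 and 4 might look like the following. All values shown are placeholders, not defaults; the variable names are the ones listed above:

    ```shell
    # Placeholders only -- substitute your own values.
    export HADOOP_HOME=/opt/hadoop-3.4.0           # path to your Hadoop v3.4 installation
    export AWS_ACCESS_KEY_ID=<your_access_key>
    export AWS_SECRET_KEY_ID=<your_secret_key>
    export AWS_BUCKET_NAME=<your_bucket_name>
    ```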
    
  5. Start the CDC instance.
  6. Create a new subscription with Cloud Object Storage as the delivery method.
  7. Choose a source table for replication.
  8. Set the local directory where the Parquet files must be written.
  9. Set the custom table formatter class:

    Example:

    To use a sample shipped with CDC, set the class name as com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
    Note: Do not specify the .class extension.
  10. If required, update the threshold parameters in the subscription's Cloud Object Storage properties.
  11. Start CDC Replication.
    Note: If you plan to use the sample custom data formats in production environments, you must test the samples before they are deployed. IBM does not assume responsibility for adverse results that are caused by modified or customized custom data formats.
Sample classpath (entries are separated by semicolons; line breaks are added here for readability):
lib;
lib/kafka-engine-kcop-11.4.0.5-<cdc_version>.jar;
samples/samples.jar;
lib/parquet-engine-11.4.0.5-<cdc_version>.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-jdk14-2.0.0-alpha5.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-api-2.0.0-alpha5.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw_0.43.1-parquet-0.11.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw-core-0.43.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-common-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-mapreduce-client-core-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-hadoop-1.14.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/guava-32.0.1-jre.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/univocity-parsers-2.9.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/classgraph-4.8.174.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/fastutil-8.5.6.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/woodstox-core-7.0.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-math3-3.6.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-configuration2-2.11.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-compress-1.27.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-text-1.12.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-lang3-3.16.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-collections-3.2.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-auth-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-shaded-guava-1.3.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/stax2-api-4.2.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/snappy-java-1.1.10.6.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-column-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-encoding-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-common-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-format-structures-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/failureaccess-1.0.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-jackson-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs\slf4j-api-2.0.5.jar;
List of Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files:
  • Tablesaw API libraries:
    parquet-format-structures-1.14.2.jar
    parquet-hadoop-1.14.1.jar
    parquet-jackson-1.14.2.jar
    snappy-java-1.1.10.6.jar
    stax2-api-4.2.2.jar
    tablesaw_0.43.1-parquet-0.11.0.jar
    tablesaw-core-0.43.1.jar
    univocity-parsers-2.9.1.jar
    woodstox-core-7.0.0.jar
    fastutil-8.5.6.jar
    guava-32.0.1-jre.jar
    hadoop-auth-3.4.0.jar
    hadoop-common-3.4.0.jar
    hadoop-mapreduce-client-core-3.4.0.jar
    hadoop-shaded-guava-1.3.0.jar
    parquet-column-1.14.2.jar
    parquet-common-1.14.2.jar
    commons-collections-3.2.2.jar
    commons-compress-1.27.1.jar
    commons-configuration2-2.11.0.jar
    commons-lang3-3.16.0.jar
    commons-math3-3.6.1.jar
    failureaccess-1.0.2.jar
    classgraph-4.8.174.jar
    parquet-encoding-1.14.2.jar
Mapping of Parquet logical types to CDC DataStage data types that are used in the sample user exit SampleParquetTablesawDataConverter, which converts files by using the Tablesaw API:

Table 1. CDC DataStage data types

  CDC DataStage data type   Column type in Tablesaw API   Parquet logical type
  CHAR                      ColumnType.STRING             BINARY (STRING)
  VARCHAR                   ColumnType.STRING             BINARY (STRING)
  INT                       ColumnType.INTEGER            INT32
  BIGINT                    ColumnType.INTEGER            INT32
  SMALLINT                  ColumnType.INTEGER            INT32
  DOUBLE                    ColumnType.DOUBLE             DOUBLE
  FLOAT                     ColumnType.FLOAT              FLOAT
  NUMERIC                   ColumnType.FLOAT              FLOAT
  DECIMAL                   ColumnType.FLOAT              FLOAT
  REAL                      ColumnType.FLOAT              FLOAT
  BOOLEAN                   ColumnType.BOOLEAN            BOOLEAN
  DATE                      ColumnType.LOCAL_DATE         INT32 (DATE)
  TIMESTAMP                 ColumnType.LOCAL_DATE_TIME    INT64 (TIMESTAMP: MILLIS, not UTC)
  BINARY                    ColumnType.STRING             BINARY (STRING)
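
To illustrate the kind of conversion the sample performs, the following minimal Java sketch reads a CSV file with Tablesaw and writes it out as a Parquet file through the tablesaw-parquet writer classes (shipped in tablesaw_0.43.1-parquet-0.11.0.jar). The file names are placeholders, and the shipped sample's actual source and column handling may differ; this is a hedged sketch of the Tablesaw API usage, not the sample's implementation.

```java
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.Table;

public class CsvToParquetSketch {
    public static void main(String[] args) throws Exception {
        // Read the CSV produced by the CDC Replication Engine for InfoSphere DataStage.
        // Tablesaw infers column types (STRING, INTEGER, DOUBLE, ...) from the data.
        Table table = Table.read().csv("cdc_output.csv");   // placeholder file name

        // Write the table as a Parquet file; Tablesaw column types are mapped to
        // Parquet logical types along the lines shown in Table 1 above.
        new TablesawParquetWriter().write(
                table,
                TablesawParquetWriteOptions.builder("cdc_output.parquet").build());
    }
}
```

Running this class requires the Tablesaw and Hadoop jars from the classpath list above, plus a valid HADOOP_HOME, which is why the sample user exit has the same prerequisites.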
Limitations:
  • Only live audit table mapping is supported for generating a Parquet file on the target.
  • Partitioning is not supported.
  • The custom data formatter must be used with the IIDR CDC Replication Engine for InfoSphere DataStage target engine.