Adding a custom data formatter for generating Parquet format files
The CDC Replication Engine for InfoSphere® DataStage® provides sample custom data formats for generating Parquet format files that you can extend or modify to suit your environment.
The samples are found in samples.jar, which is located in the samples directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The file contains the following samples:
- SampleParquetTablesawDataConverter.java: Converts a CSV file generated by CDC Replication Engine for InfoSphere DataStage into a Parquet format file by using the third-party Tablesaw API. This sample is located in com.datamirror.ts.userexit.sample.parquet.
Note the following:
- To run the sample custom data formats without modifying them, you must specify the fully qualified path to the compiled custom data formats in Management Console. For example, com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
- Compiled sample custom data formats are located in the java-engine-<cdc_version>.jar file, which is found in the lib directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The compiled custom data formats in the java-engine-<cdc_version>.jar file have a *.class extension.
- If you want to modify the sample custom data formats, you must compile the custom data formats after you change the source code.
- The custom data formatter class must also be listed in your classpath file, <CDC_home>/conf/user-classloader.cp.
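For illustration only, a minimal user-classloader.cp entry that makes the compiled samples available might look like the following (paths are relative to the CDC installation directory; a full sample classpath is shown later in this topic):

```
lib;samples/samples.jar
```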
Procedure
- Stop CDC Replication.
- Download the required Tablesaw libraries (see the list of Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files, later in this topic).
- Update the classpath file <cdc_home>/conf/user-classloader.cp with these libraries.
Example:
If the libraries are downloaded to the folder C:\Users\Administrator\Downloads\pq_tablesaw_libs, update the user-classloader.cp file as shown in the sample classpath later in this topic. Set <cdc_version> to the correct version of your instance; you can check the <cdc_home>/lib directory for the exact version of these jars. Ensure that the logging library slf4j-api-2.0.5.jar is included in your classpath file (see the sample classpath for the full classpath).
Note: When using the SampleParquetTablesawDataConverter from CDC, the Hadoop home directory must be set. Install Hadoop v3.4 and set the following environment variable on your CDC installation server: HADOOP_HOME=<path_to_hadoop_home_directory>
- Set the environment variables for the AWS access key, secret key, and bucket name:
  - Access key: AWS_ACCESS_KEY_ID
  - Secret key: AWS_SECRET_KEY_ID
  - Bucket name: AWS_BUCKET_NAME
- Start the CDC instance.
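On a Linux CDC server, the environment variables above could be set as follows. This is a sketch only: the Hadoop path and the credential values are placeholders, not values from this product.

```shell
# Hadoop home directory required by SampleParquetTablesawDataConverter
# (the path shown here is an example; point it at your Hadoop v3.4 install)
export HADOOP_HOME=/opt/hadoop-3.4.0

# AWS credentials and target bucket for the Cloud Object Storage delivery
# (placeholder values; substitute your own)
export AWS_ACCESS_KEY_ID=EXAMPLE_ACCESS_KEY
export AWS_SECRET_KEY_ID=EXAMPLE_SECRET_KEY
export AWS_BUCKET_NAME=example-bucket
```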
- Create a new subscription with Cloud Object Storage as the delivery method.
- Choose a source table for replication.
- Set the local directory where the Parquet files must be written.
- Set the custom table formatter class:
Example:
To use a sample shipped with CDC, set the class name as com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
Note: Do not specify the .class extension.
- If required, update the threshold parameters in the subscription's Cloud Object Storage properties.
- Start CDC Replication.
Note: If you plan to use the sample custom data formats in production environments, you must test the samples before they are deployed. IBM does not assume responsibility for adverse results that are caused by modified or customized custom data formats.
Sample classpath:
lib;lib/kafka-engine-kcop-11.4.0.5-<cdc_version>.jar;samples/samples.jar;lib/parquet-engine-11.4.0.5-<cdc_version>.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-jdk14-2.0.0-alpha5.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-api-2.0.0-alpha5.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw_0.43.1-parquet-0.11.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw-core-0.43.1.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-common-3.4.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-mapreduce-client-core-3.4.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-hadoop-1.14.1.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/guava-32.0.1-jre.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/univocity-parsers-2.9.1.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/classgraph-4.8.174.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/fastutil-8.5.6.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/woodstox-core-7.0.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-math3-3.6.1.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-configuration2-2.11.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-compress-1.27.1.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-text-1.12.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-lang3-3.16.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-collections-3.2.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-auth-3.4.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-shaded-guava-1.3.0.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/stax2-api-4.2.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/snappy-java-1.1.10.6.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-column-1.14.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-encoding-1.14.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-common-1.14.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-format-structures-1.14.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/failureaccess-1.0.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-jackson-1.14.2.jar;C:\Users\Administrator\Downloads\pq_tablesaw_libs\slf4j-api-2.0.5.jar;

Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files:
- Tablesaw API libraries:
  - parquet-format-structures-1.14.2.jar
  - parquet-hadoop-1.14.1.jar
  - parquet-jackson-1.14.2.jar
  - snappy-java-1.1.10.6.jar
  - stax2-api-4.2.2.jar
  - tablesaw_0.43.1-parquet-0.11.0.jar
  - tablesaw-core-0.43.1.jar
  - univocity-parsers-2.9.1.jar
  - woodstox-core-7.0.0.jar
  - fastutil-8.5.6.jar
  - guava-32.0.1-jre.jar
  - hadoop-auth-3.4.0.jar
  - hadoop-common-3.4.0.jar
  - hadoop-mapreduce-client-core-3.4.0.jar
  - hadoop-shaded-guava-1.3.0.jar
  - parquet-column-1.14.2.jar
  - parquet-common-1.14.2.jar
  - commons-collections-3.2.2.jar
  - commons-compress-1.27.1.jar
  - commons-configuration2-2.11.0.jar
  - commons-lang3-3.16.0.jar
  - commons-math3-3.6.1.jar
  - failureaccess-1.0.2.jar
  - classgraph-4.8.174.jar
  - parquet-encoding-1.14.2.jar
The following table shows the data type mapping that is used by SampleParquetTablesawDataConverter, which converts files by using the Tablesaw API:
| CDC DataStage data type | Column type in Tablesaw API | Parquet logical type |
|---|---|---|
| CHAR | ColumnType.STRING | BINARY (STRING) |
| VARCHAR | ColumnType.STRING | BINARY (STRING) |
| INT | ColumnType.INTEGER | INT32 |
| BIGINT | ColumnType.INTEGER | INT32 |
| SMALLINT | ColumnType.INTEGER | INT32 |
| DOUBLE | ColumnType.DOUBLE | DOUBLE |
| FLOAT | ColumnType.FLOAT | FLOAT |
| NUMERIC | ColumnType.FLOAT | FLOAT |
| DECIMAL | ColumnType.FLOAT | FLOAT |
| REAL | ColumnType.FLOAT | FLOAT |
| BOOLEAN | ColumnType.BOOLEAN | BOOLEAN |
| DATE | ColumnType.LOCAL_DATE | INT32 (DATE) |
| TIMESTAMP | ColumnType.LOCAL_DATE_TIME | INT64 (TIMESTAMP: MILLIS, not UTC) |
| BINARY | ColumnType.STRING | BINARY (STRING) |
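The mapping in the table above can be expressed as a simple lookup, shown here for illustration only. The class and method names below are hypothetical and are not part of the CDC or Tablesaw APIs; the code uses plain strings so that it has no third-party dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative lookup from CDC DataStage data types to Parquet logical types,
// mirroring the conversion table for SampleParquetTablesawDataConverter.
public class ParquetTypeMapping {
    private static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("CHAR", "BINARY (STRING)");
        MAPPING.put("VARCHAR", "BINARY (STRING)");
        MAPPING.put("INT", "INT32");
        MAPPING.put("BIGINT", "INT32");
        MAPPING.put("SMALLINT", "INT32");
        MAPPING.put("DOUBLE", "DOUBLE");
        MAPPING.put("FLOAT", "FLOAT");
        MAPPING.put("NUMERIC", "FLOAT");
        MAPPING.put("DECIMAL", "FLOAT");
        MAPPING.put("REAL", "FLOAT");
        MAPPING.put("BOOLEAN", "BOOLEAN");
        MAPPING.put("DATE", "INT32 (DATE)");
        MAPPING.put("TIMESTAMP", "INT64 (TIMESTAMP: MILLIS, not UTC)");
        MAPPING.put("BINARY", "BINARY (STRING)");
    }

    // Returns the Parquet logical type for a CDC DataStage type,
    // or null for a type that is not in the table.
    public static String parquetTypeFor(String cdcType) {
        return MAPPING.get(cdcType);
    }

    public static void main(String[] args) {
        MAPPING.forEach((cdc, parquet) -> System.out.println(cdc + " -> " + parquet));
    }
}
```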
- Only live audit table mapping is supported for generating a Parquet file on the target.
- Partitioning is not supported.
- The custom data formatter must be used with the IIDR CDC DataStage target engine.