Adding custom data formatter for generating Parquet format files

The CDC Replication Engine for InfoSphere® DataStage® provides sample custom data formats that you can extend or modify to suit your environment for generating Parquet format files.

The samples are found in samples.jar, which is located in the samples directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The file contains the following samples:

  • SampleParquetTablesawDataConverter.java—Converts a CSV file generated by CDC Replication Engine for InfoSphere DataStage into a Parquet format file by using the third-party Tablesaw API. This sample is located in the com.datamirror.ts.target.publication.userexit.sample.parquet package.

Note the following:

  • To run the sample custom data formats without modifying them, you must specify the fully qualified class name of the compiled custom data format in Management Console. For example, com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
  • Compiled sample custom data formats are located in the java-engine-<cdc_version>.jar file, which is found in the lib directory in your CDC Replication Engine for InfoSphere DataStage installation directory. The compiled custom data formats in the java-engine-<cdc_version>.jar file have a *.class extension.
  • If you want to modify the sample custom data formats, you must compile the custom data formats after you change the source code.
  • The custom data format class must also be listed in your classpath file, <CDC_home>/conf/user-classloader.cp.
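
    For example, a modified sample might be recompiled with javac against the shipped jars. This is an illustrative sketch only; the extraction step, the <cdc_version> placeholder, and the Tablesaw library location are assumptions that you must adapt to your installation:

    ```shell
    # Illustrative sketch only -- adjust all paths and <cdc_version> for your installation.
    cd <CDC_home>

    # Extract the sample source file from samples.jar.
    jar xf samples/samples.jar \
        com/datamirror/ts/target/publication/userexit/sample/parquet/SampleParquetTablesawDataConverter.java

    # Recompile the modified source against the CDC engine jar and the downloaded
    # Tablesaw libraries. Use ';' as the classpath separator on Windows instead of ':'.
    javac -classpath "lib/java-engine-<cdc_version>.jar:<path_to_tablesaw_libs>/*" \
        com/datamirror/ts/target/publication/userexit/sample/parquet/SampleParquetTablesawDataConverter.java
    ```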

Procedure

  1. Stop CDC Replication.
  2. Download the required Tablesaw libraries (see the list of Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files, later in this topic).
  3. Update the classpath file <cdc_home>/conf/user-classloader.cp with these libraries.

    Example:

    If the libraries are downloaded to the folder C:\Users\Administrator\Downloads\pq_tablesaw_libs, update the user-classloader.cp file as shown in the sample classpath below. Replace <cdc_version> with the correct version for your instance; you can check the <cdc_home>/lib directory for the exact version of these jars. Ensure that the logging library slf4j-api-2.0.5.jar is included in your classpath file (see the sample classpath below for the full classpath).

    Note: When using the SampleParquetTablesawDataConverter from CDC, the Hadoop home directory must be set. Install Hadoop v3.4 and set the following environment variable on your CDC installation server:
    HADOOP_HOME=<path_to_hadoop_home_directory>
  4. Set the environment variables for the AWS access key, secret key, and bucket name:
    Access key: AWS_ACCESS_KEY_ID
    Secret key: AWS_SECRET_KEY_ID
    Bucket name: AWS_BUCKET_NAME
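
    On Linux or UNIX systems, the environment settings from steps 3 and 4 might look like the following. All values shown are placeholders, not defaults; the variable names are the ones listed above:

    ```shell
    # Placeholders only -- substitute your own values.
    export HADOOP_HOME=/opt/hadoop-3.4.0           # path to your Hadoop v3.4 installation
    export AWS_ACCESS_KEY_ID=<your_access_key>
    export AWS_SECRET_KEY_ID=<your_secret_key>
    export AWS_BUCKET_NAME=<your_bucket_name>
    ```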
    
  5. Start the CDC instance.
  6. Create a new subscription with Cloud Object Storage as the delivery method.
  7. Choose a source table for replication.
  8. Set the local directory where the Parquet files must be written.
  9. Set the custom table formatter class:

    Example:

    To use a sample shipped with CDC, set the class name as com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawDataConverter.
    Note: Do not specify the .class extension.
  10. If required, update the threshold parameters in the subscription's Cloud Object Storage properties.
  11. Start CDC Replication.
    Note: If you plan to use the sample custom data formats in production environments, you must test the samples before they are deployed. IBM does not assume responsibility for adverse results that are caused by modified or customized custom data formats.
Sample classpath (entries are separated by semicolons; line breaks are added here for readability):
lib;
lib/kafka-engine-kcop-11.4.0.5-<cdc_version>.jar;
samples/samples.jar;
lib/parquet-engine-11.4.0.5-<cdc_version>.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-jdk14-2.0.0-alpha5.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/slf4j-api-2.0.0-alpha5.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw_0.43.1-parquet-0.11.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/tablesaw-core-0.43.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-common-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-mapreduce-client-core-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-hadoop-1.14.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/guava-32.0.1-jre.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/univocity-parsers-2.9.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/classgraph-4.8.174.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/fastutil-8.5.6.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/woodstox-core-7.0.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-math3-3.6.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-configuration2-2.11.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-compress-1.27.1.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-text-1.12.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-lang3-3.16.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/commons-collections-3.2.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-auth-3.4.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/hadoop-shaded-guava-1.3.0.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/stax2-api-4.2.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/snappy-java-1.1.10.6.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-column-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-encoding-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-common-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-format-structures-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/failureaccess-1.0.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs/parquet-jackson-1.14.2.jar;
C:\Users\Administrator\Downloads\pq_tablesaw_libs\slf4j-api-2.0.5.jar;
List of Tablesaw libraries that are required to use the sample custom data format for generating Parquet format files:
  • Tablesaw API libraries:
    parquet-format-structures-1.14.2.jar
    parquet-hadoop-1.14.1.jar
    parquet-jackson-1.14.2.jar
    snappy-java-1.1.10.6.jar
    stax2-api-4.2.2.jar
    tablesaw_0.43.1-parquet-0.11.0.jar
    tablesaw-core-0.43.1.jar
    univocity-parsers-2.9.1.jar
    woodstox-core-7.0.0.jar
    fastutil-8.5.6.jar
    guava-32.0.1-jre.jar
    hadoop-auth-3.4.0.jar
    hadoop-common-3.4.0.jar
    hadoop-mapreduce-client-core-3.4.0.jar
    hadoop-shaded-guava-1.3.0.jar
    parquet-column-1.14.2.jar
    parquet-common-1.14.2.jar
    commons-collections-3.2.2.jar
    commons-compress-1.27.1.jar
    commons-configuration2-2.11.0.jar
    commons-lang3-3.16.0.jar
    commons-math3-3.6.1.jar
    failureaccess-1.0.2.jar
    classgraph-4.8.174.jar
    parquet-encoding-1.14.2.jar
Mapping of Parquet logical types to CDC DataStage data types that are used in the sample user exit SampleParquetTablesawDataConverter, which converts files by using the Tablesaw API:

Table 1. CDC DataStage data types

  CDC DataStage data type   Column type in Tablesaw API   Parquet logical type
  CHAR                      ColumnType.STRING             BINARY (STRING)
  VARCHAR                   ColumnType.STRING             BINARY (STRING)
  INT                       ColumnType.INTEGER            INT32
  BIGINT                    ColumnType.INTEGER            INT32
  SMALLINT                  ColumnType.INTEGER            INT32
  DOUBLE                    ColumnType.DOUBLE             DOUBLE
  FLOAT                     ColumnType.FLOAT              FLOAT
  NUMERIC                   ColumnType.FLOAT              FLOAT
  DECIMAL                   ColumnType.FLOAT              FLOAT
  REAL                      ColumnType.FLOAT              FLOAT
  BOOLEAN                   ColumnType.BOOLEAN            BOOLEAN
  DATE                      ColumnType.LOCAL_DATE         INT32 (DATE)
  TIMESTAMP                 ColumnType.LOCAL_DATE_TIME    INT64 (TIMESTAMP: MILLIS, not UTC)
  BINARY                    ColumnType.STRING             BINARY (STRING)
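
To illustrate the kind of conversion the sample performs, the following minimal Java sketch reads a CSV file with Tablesaw and writes it out as a Parquet file through the tablesaw-parquet writer classes (shipped in tablesaw_0.43.1-parquet-0.11.0.jar). The file names are placeholders, and the shipped sample's actual source and column handling may differ; this is a hedged sketch of the Tablesaw API usage, not the sample's implementation.

```java
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.Table;

public class CsvToParquetSketch {
    public static void main(String[] args) throws Exception {
        // Read the CSV produced by the CDC Replication Engine for InfoSphere DataStage.
        // Tablesaw infers column types (STRING, INTEGER, DOUBLE, ...) from the data.
        Table table = Table.read().csv("cdc_output.csv");   // placeholder file name

        // Write the table as a Parquet file; Tablesaw column types are mapped to
        // Parquet logical types along the lines shown in Table 1 above.
        new TablesawParquetWriter().write(
                table,
                TablesawParquetWriteOptions.builder("cdc_output.parquet").build());
    }
}
```

Running this class requires the Tablesaw and Hadoop jars from the classpath list above, plus a valid HADOOP_HOME, which is why the sample user exit has the same prerequisites.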
Limitations:
  • Only live audit table mapping is supported for generating a Parquet file on the target.
  • Partitioning is not supported.
  • The custom data formatter must be used with the IIDR CDC Replication Engine for InfoSphere DataStage target engine.