IBM Support

Steps required to configure the File Connector to use PARQUET or ORC as the file format

Question & Answer


Question

How do I configure the File Connector to use the PARQUET or ORC file formats?

Cause

In IBM InfoSphere Information Server 11.7, the File Connector has been updated to newer versions of the ORC and PARQUET APIs in order to support metadata import of ORC format files and the date / time / timestamp data types in the PARQUET format. To ensure that existing jobs using the ORC or PARQUET file formats are not impacted by these changes, the connector introduces two new variants, ORC (orc-2.1.jar) and PARQUET (parquet-1.9.0.jar), built against the latest versions of the respective APIs.

Users who want to use the new functionality, i.e., metadata import of ORC format files or the date / time / timestamp data types in PARQUET format files, need to follow the steps below to configure the connector.

Answer

In IBM InfoSphere Information Server 11.7, the File Connector added support for metadata import of ORC format files and for the date / time / timestamp data types in the PARQUET format. The connector can be used either at design time (i.e., to import metadata) or at runtime (i.e., in jobs that read or write ORC or PARQUET format files).

Steps to configure the Connector to use the PARQUET / ORC file formats (Job runtime)

1. Set the File format property to either PARQUET or ORC
2. Set the desired compression type and other relevant properties for the selected File format
3. In order to use the latest jars for PARQUET (parquet-1.9.0.jar) or ORC (orc-2.1.jar), the environment variable CC_USE_LATEST_FILECC_JARS needs to be set to the value parquet-1.9.0.jar:orc-2.1.jar
4. In order to use the PARQUET or ORC file formats, the CLASSPATH needs to be configured to add the required jars for the specific format selected.
5. The CLASSPATH environment variable can be configured either at the job level or at the project level.
6. The list of the jars required are provided below.

avro-1.8.1.jar
commons-cli-1.2.jar
commons-collections-3.2.1.jar
commons-configuration-1.6.jar
commons-logging-1.1.3.jar
hadoop-auth-2.7.1.jar
hadoop-common-2.7.1.jar
hadoop-hdfs-2.7.1.jar
hadoop-mapreduce-client-core-2.7.1.jar
hive-exec-2.1.1.jar
javax.ws.rs.jar
jersey-bundle-1.9.1.jar
log4j-1.2.17.jar
org.mortbay.jetty.util.jar
slf4j-api-1.7.19.jar
slf4j-log4j12-1.7.19.jar
snappy-java-1.1.2.1.jar
parquet-avro-1.9.0.jar
parquet-column-1.9.0.jar
parquet-common-1.9.0.jar
parquet-encoding-1.9.0.jar
parquet-format-2.3.1.jar
parquet-hadoop-1.9.0.jar

7. You can download the respective jars from Apache or pull them from the Hadoop distribution being used.
8. Note that the versions of the jars can vary depending on the Hadoop distribution (and version) from which they are picked up.
9. In certain cases the order of the jars is critical, so the CLASSPATH should include the jars in the order listed above.
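The runtime steps above can be sketched as a small shell fragment that assembles the CLASSPATH in the listed order and sets CC_USE_LATEST_FILECC_JARS. This is a minimal sketch, not part of the product: JAR_DIR is an assumed location where the jars were collected, and in practice the variables would be set at the project or job level (for example via the DataStage Administrator or dsenv).

```shell
#!/bin/sh
# Assumed directory holding the jars listed above (adjust to your system).
JAR_DIR="/opt/IBM/filecc-jars"

# The jars, kept in the order given in the document (order can be critical).
JARS="avro-1.8.1.jar commons-cli-1.2.jar commons-collections-3.2.1.jar \
commons-configuration-1.6.jar commons-logging-1.1.3.jar hadoop-auth-2.7.1.jar \
hadoop-common-2.7.1.jar hadoop-hdfs-2.7.1.jar hadoop-mapreduce-client-core-2.7.1.jar \
hive-exec-2.1.1.jar javax.ws.rs.jar jersey-bundle-1.9.1.jar log4j-1.2.17.jar \
org.mortbay.jetty.util.jar slf4j-api-1.7.19.jar slf4j-log4j12-1.7.19.jar \
snappy-java-1.1.2.1.jar parquet-avro-1.9.0.jar parquet-column-1.9.0.jar \
parquet-common-1.9.0.jar parquet-encoding-1.9.0.jar parquet-format-2.3.1.jar \
parquet-hadoop-1.9.0.jar"

# Join the jars into a colon-separated CLASSPATH, avoiding empty entries.
CLASSPATH=""
for jar in $JARS; do
  if [ -z "$CLASSPATH" ]; then
    CLASSPATH="$JAR_DIR/$jar"
  else
    CLASSPATH="$CLASSPATH:$JAR_DIR/$jar"
  fi
done
export CLASSPATH

# Select the new jar variants (only needed for the new API versions).
export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar

echo "$CLASSPATH"
```

Building the path in a loop like this keeps the order deterministic, which matters given that the connector can be sensitive to jar ordering.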

Steps to configure the Connector to use the PARQUET / ORC file formats (Design Time)

1. In order to import the ORC or PARQUET format files, the ASBAgent needs to include the jars mentioned above in the CLASSPATH.
2. Since the connectors are loaded through the ASBAgent, the CLASSPATH needs to be set in the Agent.sh and the ASBAgent should be restarted.

Setting up the CLASSPATH in the Agent.sh

3. Typically, the Agent.sh and NodeAgents.sh can be located under <IS Install Path>/ASBNode/bin
4. Edit the Agent.sh and include the CLASSPATH in the line that is used to start the Agent.
eval exec '"${JAVA_HOME}/bin/java"' '$PLATFORM_OPTIONS' ......................${J2EE_OPTS} -classpath '<CLASSPATH_HERE>
5. In order to use the latest jars for PARQUET (parquet-1.9.0.jar) or ORC (orc-2.1.jar), the environment variable CC_USE_LATEST_FILECC_JARS needs to be set to the value parquet-1.9.0.jar:orc-2.1.jar. The variable can be set in the session where the ASBAgent is started or in the NodeAgents.sh file.

6. Once Agent.sh has been edited, save the file and restart the ASBAgent using NodeAgents.sh.
7. Note that the ASBAgent should be restarted as the root user.
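The design-time steps above amount to commands like the following. This is a sketch under assumptions: the install path shown is a typical default, and the variable is exported in the session before the restart rather than edited into NodeAgents.sh.

```shell
# Run as root. Assumed install path; substitute your <IS Install Path>.
cd /opt/IBM/InformationServer/ASBNode/bin

# Select the new jar variants before the agent starts
# (alternatively, set this inside NodeAgents.sh itself).
export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar

# Restart the ASBAgent so the edited Agent.sh CLASSPATH takes effect.
./NodeAgents.sh stop
./NodeAgents.sh start
```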

Note :
1. There are two versions of the PARQUET and ORC jars

a. parquet.jar (the earlier jar, based on the older version of the Parquet API; uses parquet-avro-1.6.0.jar and avro-1.7.7.jar)
b. parquet-1.9.0.jar (the new jar, based on the Parquet 1.9.0 API; uses the new parquet-avro-1.9.0.jar and avro-1.8.1.jar. Along with these two jars, another five jars (listed above) are required to use the newer version of the PARQUET jar)
c. orc.jar (the earlier jar, based on the older version of the ORC API; uses hive-exec-1.2.1.jar)
d. orc-2.1.jar (the new jar, based on the ORC 2.1 API; uses the new hive-exec-2.1.1.jar)

2. The newer version of the Parquet jars (i.e., parquet-avro-1.9.0.jar) has a compatibility issue when handling string data types. To address the issue, the connector introduces the environment variable FILECC_PARQUET_AVRO_COMPAT_MODE, which should be set to TRUE. This variable is only required at job runtime and is not needed during metadata import.

3. The environment variable CC_USE_LATEST_FILECC_JARS should be set only if the new variants of the PARQUET or ORC jars need to be used with the File Connector.
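Taken together, the two variables from notes 2 and 3 could be exported wherever job environment variables are defined (placing them in the project's dsenv file is one common option; the location is an assumption, not a requirement):

```shell
# Opt in to the new jar variants; omit this line to keep the
# earlier parquet.jar / orc.jar behavior for existing jobs.
export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar

# Work around the string-handling compatibility issue in
# parquet-avro-1.9.0.jar (job runtime only; not needed for metadata import).
export FILECC_PARQUET_AVRO_COMPAT_MODE=TRUE
```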

4. Here is an example CLASSPATH that should work for both parquet-1.9.0.jar and orc-2.1.jar:

<PATH_TO_JARS>/avro-1.8.1.jar:<PATH_TO_JARS>/commons-cli-1.2.jar:<PATH_TO_JARS>/commons-collections-3.2.1.jar:<PATH_TO_JARS>/commons-configuration-1.6.jar:<PATH_TO_JARS>/commons-logging-1.1.3.jar:<PATH_TO_JARS>/hadoop-auth-2.7.1.jar:<PATH_TO_JARS>/hadoop-common-2.7.1.jar:<PATH_TO_JARS>/hadoop-hdfs-2.7.1.jar:<PATH_TO_JARS>/hadoop-mapreduce-client-core-2.7.1.jar:<PATH_TO_JARS>/hive-exec-2.1.1.jar:<PATH_TO_JARS>/javax.ws.rs.jar:<PATH_TO_JARS>/jersey-bundle-1.9.1.jar:<PATH_TO_JARS>/log4j-1.2.17.jar:<PATH_TO_JARS>/org.mortbay.jetty.util.jar:<PATH_TO_JARS>/slf4j-api-1.7.19.jar:<PATH_TO_JARS>/slf4j-log4j12-1.7.19.jar:<PATH_TO_JARS>/snappy-java-1.1.2.1.jar:<PATH_TO_JARS>/parquet-avro-1.9.0.jar:<PATH_TO_JARS>/parquet-column-1.9.0.jar:<PATH_TO_JARS>/parquet-common-1.9.0.jar:<PATH_TO_JARS>/parquet-encoding-1.9.0.jar:<PATH_TO_JARS>/parquet-format-2.3.1.jar:<PATH_TO_JARS>/parquet-hadoop-1.9.0.jar

5. If the earlier version of the parquet.jar is being used with the File Connector, then the CLASSPATH should include parquet-avro-1.6.0.jar and avro-1.7.7.jar

6. If the earlier version of the orc.jar is being used with the File Connector, then the CLASSPATH should include hive-exec-1.2.1.jar.
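Because a misspelled or missing jar on the CLASSPATH typically only surfaces as a runtime failure, a quick sanity check before restarting the agent can help. The helper below is a hypothetical convenience, not part of the product:

```shell
#!/bin/sh
# check_classpath PATHLIST
# Verifies that every colon-separated entry in PATHLIST exists on disk.
# Returns 0 if all entries are present, 1 otherwise.
check_classpath() {
  missing=0
  OLDIFS=$IFS
  IFS=:
  for entry in $1; do
    [ -n "$entry" ] || continue          # skip empty entries (e.g. "::")
    if [ ! -f "$entry" ]; then
      echo "missing: $entry" >&2
      missing=1
    fi
  done
  IFS=$OLDIFS
  return $missing
}
```

Usage: `check_classpath "$CLASSPATH" || echo "fix the CLASSPATH before restarting the ASBAgent"`.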


Document Information

Modified date:
02 January 2020

UID

swg22016857