Question & Answer
Question
Cause
Users who want to use the new functionality, i.e., metadata import of ORC format files or the date/time/timestamp data types in PARQUET format files, need to follow the steps below to configure the connector.
Answer
Steps to configure the Connector to use the PARQUET / ORC file formats (Job runtime)
1. Set the File format property to either PARQUET or ORC
2. Set the desired compression type and other relevant properties for the selected File format
3. To use the latest jars for PARQUET (parquet-1.9.0.jar) or ORC (orc-2.1.jar), set the environment variable CC_USE_LATEST_FILECC_JARS to the value parquet-1.9.0.jar:orc-2.1.jar
4. To use the PARQUET or ORC file formats, configure the CLASSPATH to include the jars required for the selected format.
5. The CLASSPATH environment variable can be configured either at the job level or at the project level.
6. The required jars are listed below.
avro-1.8.1.jar
commons-cli-1.2.jar
commons-collections-3.2.1.jar
commons-configuration-1.6.jar
commons-logging-1.1.3.jar
hadoop-auth-2.7.1.jar
hadoop-common-2.7.1.jar
hadoop-hdfs-2.7.1.jar
hadoop-mapreduce-client-core-2.7.1.jar
hive-exec-2.1.1.jar
javax.ws.rs.jar
jersey-bundle-1.9.1.jar
log4j-1.2.17.jar
org.mortbay.jetty.util.jar
slf4j-api-1.7.19.jar
slf4j-log4j12-1.7.19.jar
snappy-java-1.1.2.1.jar
parquet-avro-1.9.0.jar
parquet-column-1.9.0.jar
parquet-common-1.9.0.jar
parquet-encoding-1.9.0.jar
parquet-format-2.3.1.jar
parquet-hadoop-1.9.0.jar
7. You can download the respective jars from Apache, or pull them from the Hadoop distribution being used.
8. Please note that the versions of the jars could vary depending on the Hadoop distribution (and version) from which the jars are picked up.
9. In certain cases, the order of the jars is critical, so the CLASSPATH should include the jars in the order listed above.
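The job runtime steps above can be sketched as a shell session. This is an illustrative example, not part of the product documentation: JAR_DIR is a hypothetical directory; substitute the location that actually holds the jars on your engine tier.

```shell
# Illustrative sketch: job-level environment for the newer PARQUET/ORC jars.
# JAR_DIR is a hypothetical path; point it at the directory holding the jars.
JAR_DIR=/opt/jars

# Opt in to the newer jar variants
export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar

# Build the CLASSPATH in the order listed above (order can be critical)
CLASSPATH=
for jar in \
    avro-1.8.1.jar commons-cli-1.2.jar commons-collections-3.2.1.jar \
    commons-configuration-1.6.jar commons-logging-1.1.3.jar \
    hadoop-auth-2.7.1.jar hadoop-common-2.7.1.jar hadoop-hdfs-2.7.1.jar \
    hadoop-mapreduce-client-core-2.7.1.jar hive-exec-2.1.1.jar \
    javax.ws.rs.jar jersey-bundle-1.9.1.jar log4j-1.2.17.jar \
    org.mortbay.jetty.util.jar slf4j-api-1.7.19.jar \
    slf4j-log4j12-1.7.19.jar snappy-java-1.1.2.1.jar \
    parquet-avro-1.9.0.jar parquet-column-1.9.0.jar \
    parquet-common-1.9.0.jar parquet-encoding-1.9.0.jar \
    parquet-format-2.3.1.jar parquet-hadoop-1.9.0.jar
do
    CLASSPATH="${CLASSPATH}${CLASSPATH:+:}${JAR_DIR}/${jar}"
done
export CLASSPATH
```

Setting these variables at the project level (via the DataStage Administrator) instead of in the shell achieves the same effect for all jobs in the project.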
Steps to configure the Connector to use the PARQUET / ORC file formats (Design Time)
1. To import ORC or PARQUET format files, the ASBAgent needs the jars mentioned above in its CLASSPATH.
2. Since the connectors are loaded through the ASBAgent, the CLASSPATH needs to be set in Agent.sh and the ASBAgent restarted.
Setting up the CLASSPATH in the Agent.sh
3. Typically, Agent.sh and NodeAgents.sh are located under <IS Install Path>/ASBNode/bin
4. Edit Agent.sh and include the CLASSPATH in the line used to start the Agent.
eval exec '"${JAVA_HOME}/bin/java"' '$PLATFORM_OPTIONS' ......................${J2EE_OPTS} -classpath '<CLASSPATH_HERE>'
5. To use the latest jars for PARQUET (parquet-1.9.0.jar) or ORC (orc-2.1.jar), set the environment variable CC_USE_LATEST_FILECC_JARS to the value parquet-1.9.0.jar:orc-2.1.jar. The variable can be set in the session where the ASBAgent is started or in the NodeAgents.sh file.
6. Once Agent.sh has been edited, save the file and restart the ASBAgent using NodeAgents.sh
7. Please note that the ASBAgent should be restarted as the root user.
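The design-time steps above can be sketched as follows. IS_INSTALL is a placeholder for the actual <IS Install Path>; the restart commands are shown commented out for reference only.

```shell
# Hedged sketch of the design-time setup. IS_INSTALL stands in for the
# actual <IS Install Path> on your engine tier.
IS_INSTALL=/opt/IBM/InformationServer

# Set the variable in the session that starts the ASBAgent
# (or add the export to NodeAgents.sh instead)
export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar

# After editing Agent.sh to add the CLASSPATH, restart the agent as root:
# cd "${IS_INSTALL}/ASBNode/bin"
# ./NodeAgents.sh stop
# ./NodeAgents.sh start
echo "$CC_USE_LATEST_FILECC_JARS"
```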
Note:
1. There are two versions of the PARQUET and ORC jars:
a. parquet.jar (earlier jar, based on the older version of the Parquet API; uses parquet-avro-1.6.0.jar and avro-1.7.7.jar).
b. parquet-1.9.0.jar (newer jar, based on the Parquet 1.9.0 API; uses the new parquet-avro-1.9.0.jar and avro-1.8.1.jar. Along with these two jars, another 5 jars listed above are required to use the newer version of the PARQUET jar).
c. orc.jar (earlier jar, based on the older version of the ORC API; uses hive-exec-1.2.1.jar).
d. orc-2.1.jar (newer jar, based on the ORC 2.1 API; uses the new hive-exec-2.1.1.jar).
2. The newer version of the Parquet jars (i.e., parquet-avro-1.9.0.jar) has a compatibility issue when handling string data types. To address the issue, the connector introduced an environment variable FILECC_PARQUET_AVRO_COMPAT_MODE, whose value should be set to TRUE. This variable is required only at job runtime and is not required during metadata import.
3. The environment variable CC_USE_LATEST_FILECC_JARS should be set only if the new variants of the PARQUET or ORC jars need to be used with the File Connector.
4. Here is an example CLASSPATH that should work for both parquet-1.9.0.jar and orc-2.1.jar:
<PATH_TO_JARS>/avro-1.8.1.jar:<PATH_TO_JARS>/commons-cli-1.2.jar:<PATH_TO_JARS>/commons-collections-3.2.1.jar:<PATH_TO_JARS>/commons-configuration-1.6.jar:<PATH_TO_JARS>/commons-logging-1.1.3.jar:<PATH_TO_JARS>/hadoop-auth-2.7.1.jar:<PATH_TO_JARS>/hadoop-common-2.7.1.jar:<PATH_TO_JARS>/hadoop-hdfs-2.7.1.jar:<PATH_TO_JARS>/hadoop-mapreduce-client-core-2.7.1.jar:<PATH_TO_JARS>/hive-exec-2.1.1.jar:<PATH_TO_JARS>/javax.ws.rs.jar:<PATH_TO_JARS>/jersey-bundle-1.9.1.jar:<PATH_TO_JARS>/log4j-1.2.17.jar:<PATH_TO_JARS>/org.mortbay.jetty.util.jar:<PATH_TO_JARS>/slf4j-api-1.7.19.jar:<PATH_TO_JARS>/slf4j-log4j12-1.7.19.jar:<PATH_TO_JARS>/snappy-java-1.1.2.1.jar:<PATH_TO_JARS>/parquet-avro-1.9.0.jar:<PATH_TO_JARS>/parquet-column-1.9.0.jar:<PATH_TO_JARS>/parquet-common-1.9.0.jar:<PATH_TO_JARS>/parquet-encoding-1.9.0.jar:<PATH_TO_JARS>/parquet-format-2.3.1.jar:<PATH_TO_JARS>/parquet-hadoop-1.9.0.jar
5. If the earlier version of the parquet.jar is being used with the File Connector, then the CLASSPATH should include parquet-avro-1.6.0.jar and avro-1.7.7.jar
6. If the earlier version of the orc.jar is being used with the File Connector, then the CLASSPATH should include hive-exec-1.2.1.jar.
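The variant choice described in the notes above can be sketched as a shell fragment. JAR_DIR and the USE_NEW_JARS toggle are assumptions for this example, not product settings:

```shell
# Illustrative sketch: switching between the earlier and newer jar variants.
# JAR_DIR and USE_NEW_JARS are assumptions for this example.
JAR_DIR=/opt/jars
USE_NEW_JARS=true

if [ "$USE_NEW_JARS" = "true" ]; then
    # Newer variants: parquet-1.9.0.jar / orc-2.1.jar
    export CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar:orc-2.1.jar
    # Work around the string data type compatibility issue (job runtime only)
    export FILECC_PARQUET_AVRO_COMPAT_MODE=TRUE
    VARIANT_JARS="parquet-avro-1.9.0.jar avro-1.8.1.jar hive-exec-2.1.1.jar"
else
    # Earlier variants: parquet.jar / orc.jar
    unset CC_USE_LATEST_FILECC_JARS
    VARIANT_JARS="parquet-avro-1.6.0.jar avro-1.7.7.jar hive-exec-1.2.1.jar"
fi

# Append the variant-specific jars to the CLASSPATH
for jar in $VARIANT_JARS; do
    CLASSPATH="${CLASSPATH:+${CLASSPATH}:}${JAR_DIR}/${jar}"
done
export CLASSPATH
```

The remaining jars from the list earlier in this document are needed in either case; only the Parquet/Avro/Hive jars differ between the two variants.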
Document Information
Modified date:
02 January 2020
UID
swg22016857