Configuring the environment for the Big Data File stage

To access a Hadoop Distributed File System (HDFS) with the Big Data File stage, you must make the libhdfs.so shared library, its required JAR libraries, and its configuration files available to the Big Data File stage on the IBM® InfoSphere® Information Server engine tier system or systems.

You must have IBM InfoSphere DataStage® administrator level access to modify environment variables.
InfoSphere Information Server must be installed on a Linux® or AIX® operating system.
Because the Hadoop library names and location might change if a new version is released, ensure that you use the most recent information. See Java™ libraries used to configure Big Data File stage for use with HDFS, for example with InfoSphere BigInsights®.

To use the Big Data File stage, you must ensure that the stage can access the HDFS libhdfs.so shared library, its associated JAR libraries, and configuration files.

If you are using the Big Data File stage to access an InfoSphere BigInsights HDFS by way of the REST connection method or from an AIX operating system, the InfoSphere BigInsights libhdfs.so library is not required. Access to the library's associated .jar files and configuration files is, however, still required.

If the HDFS and the parallel engine are on different systems, ensure that the Big Data File stage can access the HDFS .jar files, the HDFS configuration directory, and the libhdfs.so shared library (if required).

One method to provide access to these HDFS components is to use NFS to mount directories from the HDFS computer onto the parallel engine computer. For more information about mounting directories by using NFS, see Potential issues sharing libhdfs via NFS mount.

Another method to provide access to the HDFS components is to copy them to the computer that hosts the parallel engine. If you use this method, you must update the HDFS components if the library names or locations change.

If you are using the copy method with an InfoSphere BigInsights HDFS on the Linux operating system, you can use the syncbi.sh tool. You can use the tool to do the initial copy of the needed libraries, .jar files, and configuration files. The tool also sets up the needed CLASSPATH environment variable by using the ishdfs.config file. You can also use the tool to keep the configuration in sync.

Required Hadoop JAR libraries

For the correct functioning of the libhdfs.so library, the Apache Software Foundation recommends including all of the .jar files in the $HADOOP_PREFIX and $HADOOP_PREFIX/lib directories, and the configuration directory containing the hdfs-site.xml file on the CLASSPATH variable. The locations of those directories vary from Hadoop distribution to Hadoop distribution, and might even vary from release to release. Some distributions or releases have even moved some of the jars from those directories into additional directories.

For InfoSphere BigInsights version 2.1.2, the required Hadoop JAR directories are $BIGINSIGHTS_HOME/IHC and $BIGINSIGHTS_HOME/IHC/share/hadoop/common/lib.

Note: If you use the syncbi.sh tool, the .jar files from these directories on the InfoSphere BigInsights system are copied into one directory on the parallel engine system, $DSHOME/../biginsights.

The configuration directory is $BIGINSIGHTS_HOME/hadoop-conf/ (copied to $DSHOME/../biginsights/hadoop-conf/ by the syncbi.sh tool).

For Cloudera CDH version 5.1, the required Hadoop JAR directories are <CDH_ROOT>/lib/hadoop/, <CDH_ROOT>/lib/hadoop/lib/, and <CDH_ROOT>/lib/hadoop/client/. The configuration directory is /etc/hadoop/conf/.

For HortonWorks HDP version 2.1, the required Hadoop JAR directories are /usr/lib/hadoop-hdfs, /usr/lib/hadoop, and /usr/lib/hadoop/lib. The configuration directory is /etc/hadoop/conf/.

On the computer that is running the InfoSphere Information Server engine, log in as the InfoSphere DataStage administrator user.

Configure access to the Hadoop JAR libraries and configuration directory by way of the CLASSPATH environment variable by completing one of the following tasks. You can either use the ishdfs.config file or set the CLASSPATH variable from the environment settings. The first method, using the ishdfs.config file, is recommended because it allows you to specify all of the .jar files in the required directories with ‘*’ syntax. For example, you could specify all of the .jar files in the /opt/ibm/biginsights/IHC directory by using /opt/ibm/biginsights/IHC/*. If you set the CLASSPATH variable without using the ishdfs.config file, you must add each .jar file in the /opt/ibm/biginsights/IHC directory to the CLASSPATH variable individually, for example, /opt/ibm/biginsights/IHC/hadoop-core-2.2.0-mr1.jar:/opt/ibm/biginsights/IHC/hadoop-streaming.jar, and so forth.
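
If you set the CLASSPATH variable from the environment, listing each .jar file by hand is error prone. The following is a minimal sketch, not part of the product, of a shell function that prints every .jar file in one directory as a colon-separated list. The function name jars_in_dir is hypothetical, and the directory in the usage comment is only the example path used in this topic.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the product): print every .jar
# file in a directory as one colon-separated list.
jars_in_dir() {
    dir=$1
    list=""
    for jar in "$dir"/*.jar; do
        # Skip the unexpanded pattern when the directory has no .jar files.
        [ -e "$jar" ] || continue
        list="${list:+$list:}$jar"
    done
    printf '%s\n' "$list"
}

# Example usage (assumed path, taken from the text above):
# jars_in_dir /opt/ibm/biginsights/IHC
```

The output can be pasted into, or assigned to, the CLASSPATH setting. Run the function once for each required JAR directory.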

Tip: When you add .jar files to the CLASSPATH variable, be aware of symbolic links. Many Hadoop distributions provide an unversioned alias name for the .jar files by way of symlink. For example, hadoop-core.jar might be a link to hadoop-core-2.2.0-mr1.jar. There is no need to include the .jar files by more than one name. Also, if you copy the .jar files from the Hadoop system to the parallel engine system, ensure that any symlink names that you use in the CLASSPATH variable still point to valid locations.
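
One way to check the symlink names after a copy is sketched below. The function name broken_jar_links is hypothetical. It assumes a POSIX shell and only reports .jar symlinks in a single directory whose targets are missing; it does not repair them.

```shell
#!/bin/sh
# Hypothetical helper: list .jar symlinks in a directory whose targets
# no longer exist, for example after copying files from the Hadoop system.
broken_jar_links() {
    for jar in "$1"/*.jar; do
        # -L is true for a symlink; -e follows the link and is false
        # when the link target is missing.
        if [ -L "$jar" ] && [ ! -e "$jar" ]; then
            printf '%s\n' "$jar"
        fi
    done
}

# Example usage (assumed path): broken_jar_links /opt/ibm/biginsights/IHC
```

Any name that the function prints should be removed from the CLASSPATH variable or relinked to a valid .jar file.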

Important: If you are setting the CLASSPATH variable from the environment settings, and an existing CLASSPATH variable is being set in the dsenv file for other stages, for example the Oozie stage, do not replace that setting with the Big Data File stage setting. Instead, merge the Big Data File stage CLASSPATH variable with the existing dsenv CLASSPATH variable. Or if you are setting the CLASSPATH variable from the Administrator client, add the existing CLASSPATH variable to the Big Data File stage CLASSPATH being set in the Administrator client.
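
Merging, rather than replacing, can be sketched in shell as follows. This is an illustration under assumptions, not the product's own dsenv content: the function name append_path is hypothetical, and the path in the usage comment is a placeholder taken from the examples in this topic.

```shell
#!/bin/sh
# Hypothetical helper: append a new entry to an existing colon-separated
# value. The existing value is kept, and no leading colon is produced
# when the existing value is empty.
append_path() {
    existing=$1
    addition=$2
    printf '%s\n' "${existing:+$existing:}$addition"
}

# Example for the dsenv file (placeholder path):
# CLASSPATH=$(append_path "$CLASSPATH" "/opt/ibm/biginsights/hadoop-conf")
# export CLASSPATH
```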

Use the ishdfs.config configuration file for the Big Data File stage to define the CLASSPATH environment variable.
Tip: If you are using the InfoSphere BigInsights HDFS, and are using the syncbi.sh tool to obtain the .jar files, the ishdfs.config file is created for you automatically from the ishdfs.config.biginsights file. This ishdfs.config file points to the .jar files that are downloaded and unpacked in the $DSHOME/../biginsights directory.
Set the CLASSPATH variable in the environment. The variable must contain:
1. All of the Hadoop required jars. See the preceding information about required Hadoop .jar libraries for the directories containing the required jars.
2. The Hadoop configuration directory, which is the directory containing hdfs-site.xml.
3. For InfoSphere BigInsights Console REST API access only, add the following jar libraries:
  - JSON4J.jar
  - httpcore-4.2.1.jar
  - httpclient-4.2.1.jar
  - biginsights-restfs-1.0.0.jar
  - cc-http-api.jar
  - cc-http-impl.jar

For example, if you are using InfoSphere BigInsights and require REST API support, the CLASSPATH variable set by way of the environment must contain:

CLASSPATH=${CLASSPATH}:
/opt/ibm/biginsights/IHC/hadoop-core-2.2.0-mr1.jar:
/opt/ibm/biginsights/IHC/hadoop-mr1-examples-2.2.0.jar:
/opt/ibm/biginsights/IHC/hadoop-streaming.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jersey-server-1.9.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/hadoop-auth-2.2.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/asm-3.2.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jackson-jaxrs-1.8.8.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/xmlenc-0.52.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-httpclient-3.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-codec-1.4.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jaxb-api-2.2.2.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jersey-core-1.9.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-net-3.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jackson-mapper-asl-1.8.8.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/avro-1.7.4.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-io-2.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-cli-1.2.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/guava-11.0.2.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/stax-api-1.0.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jettison-1.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jets3t-0.6.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-math-2.2.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jsr305-1.3.9.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jetty-util-6.1.26.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-digester-1.8.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/mockito-all-1.8.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jsch-0.1.42.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/paranamer-2.3.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-el-1.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/xz-1.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/servlet-api-2.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jetty-6.1.26.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-lang-2.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jackson-xc-1.8.8.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jsp-api-2.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/log4j-1.2.17.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-beanutils-1.8.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jersey-json-1.9.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jackson-core-asl-1.8.8.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/zookeeper-3.4.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/netty-3.6.2.Final.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-compress-1.4.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/activation-1.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/guardium-proxy.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/protobuf-java-2.5.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-collections-3.2.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-configuration-1.6.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/hadoop-annotations-2.2.0.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/commons-logging-1.1.1.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/slf4j-api-1.7.5.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/junit-4.8.2.jar:
/opt/ibm/biginsights/hadoop-conf:
/opt/ibm/biginsights/lib/JSON4J.jar:
/opt/IBM/InformationServer/ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar:
/opt/IBM/InformationServer/ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/biginsights-restfs-1.0.0.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/cc-http-api.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/cc-http-impl.jar

If you use the ishdfs.config file, you can specify the same CLASSPATH variable more simply as follows:

CLASSPATH=/opt/ibm/biginsights/IHC/*:
/opt/ibm/biginsights/IHC/share/hadoop/common/lib/*:
/opt/ibm/biginsights/hadoop-conf:
/opt/ibm/biginsights/lib/JSON4J.jar:
/opt/IBM/InformationServer/ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar:
/opt/IBM/InformationServer/ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/biginsights-restfs-1.0.0.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/cc-http-api.jar:
/opt/IBM/InformationServer/Server/PXEngine/java/cc-http-impl.jar

Add the directory that contains the libhdfs.so file to the LD_LIBRARY_PATH variable.

In the following examples, the HDFS files are in the directory named /opt/ibm/.

For InfoSphere BigInsights, add the BigInsightsDirectory/IHC/c++/Linux-amd64-64/lib directory to the LD_LIBRARY_PATH variable. The LD_LIBRARY_PATH variable for InfoSphere Information Server might look like the following example.

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
/opt/IBM/InformationServer/jdk/jre/bin/classic:
/opt/IBM/InformationServer/ASBNode/lib/cpp:
/opt/IBM/InformationServer/ASBNode/apps/proxy/cpp/linux-all-x86_64:
/opt/ibm/biginsights/IHC/c++/Linux-amd64-64/lib:
/opt/IBM/InformationServer/jdk/jre/lib/amd64:
/opt/IBM/InformationServer/jdk/jre/lib/amd64/classic
export LD_LIBRARY_PATH

This directory that contains the libhdfs.so file is not required for the InfoSphere Information Server parallel engine on an AIX system.

For Cloudera or HortonWorks, add the directory that contains the libhdfs.so file and the library directories of the Java Development Kit (JDK) that the distribution uses to the LD_LIBRARY_PATH variable. In the following example, the libhdfs.so file is in the /usr/lib64 directory.

LD_LIBRARY_PATH=/opt/IBM/InformationServer/Server/branded_odbc/lib:
/opt/IBM/InformationServer/Server/DSComponents/lib:
/opt/IBM/InformationServer/Server/DSComponents/bin:
/opt/IBM/InformationServer/Server/DSEngine/lib:
/opt/IBM/InformationServer/Server/DSEngine/uvdlls:
/opt/IBM/InformationServer/Server/PXEngine/lib:
/opt/IBM/InformationServer/jdk/jre/bin:
/opt/IBM/InformationServer/jdk/jre/bin/classic:
/opt/IBM/InformationServer/ASBNode/lib/cpp:
/opt/IBM/InformationServer/ASBNode/apps/proxy/cpp/linux-all-x86_64:
/usr/lib:
.:
/usr/lib64:
/usr/local/jdk1.6.0_21/jre/lib/amd64/server/:
/usr/local/jdk1.6.0_21/jre/lib/amd64
export LD_LIBRARY_PATH
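
To confirm that the library can be found after you change the variable, a check along the following lines might help. It is a sketch that assumes a POSIX shell; the function name find_in_path is hypothetical. It only tests whether a file exists in one of the listed directories and does not verify that the dynamic loader can load it.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains the named file; return 1 if none does.
find_in_path() {
    file=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        # Ignore empty entries such as those produced by "::".
        if [ -n "$dir" ] && [ -e "$dir/$file" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example usage: find_in_path libhdfs.so "$LD_LIBRARY_PATH"
```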

Add the JDK bin directory to the PATH variable.

The PATH environment variable for InfoSphere Information Server might look like the following examples based on which HDFS you are using.

For InfoSphere BigInsights:

PATH=$PATH:
/opt/IBM/InformationServer/jdk/bin:
/opt/IBM/InformationServer/Server/PXEngine/bin:
/usr/lib64/qt-3.3/bin:
/usr/kerberos/bin:/usr/local/bin:
/bin:/home/dsadm/bin:/opt/ibm/biginsights/jdk/bin
export PATH

For Cloudera or HortonWorks:

PATH=$PATH:
/opt/IBM/InformationServer/jdk/bin:
/opt/IBM/InformationServer/Server/PXEngine/bin:
/usr/local/bin:
/bin:
/home/dsadm/bin:
/usr/local/jdk1.6.0_21/bin
export PATH

The JDK level that is required for your computer and the location of the JDK might be different from what is shown in the preceding examples.

Check for the fs.hdfs.impl property in the core-site.xml file. If the property is not present, add the property by using one of the following options.
- If you copied the Hadoop configuration directory to your InfoSphere DataStage system or systems, add the property to the core-site.xml file on that system or those systems.
- If you access the core-site.xml file remotely through an NFS mount, add the property to the core-site.xml file on the InfoSphere BigInsights system.
After you add the property, it looks similar to the following example.
```
<property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
```
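The check for the property can be sketched with grep, as shown below. The function name has_hdfs_impl is hypothetical. Because this is a plain text match and not an XML parse, it also matches a property that is commented out, so inspect the file if the result is unexpected.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given core-site.xml file contains
# a <name> element for the fs.hdfs.impl property (text match only).
has_hdfs_impl() {
    grep -q '<name>[[:space:]]*fs\.hdfs\.impl[[:space:]]*</name>' "$1"
}

# Example usage (assumed path, from the Cloudera and HortonWorks examples):
# has_hdfs_impl /etc/hadoop/conf/core-site.xml || echo "fs.hdfs.impl is missing"
```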
Restart InfoSphere Information Server and IBM WebSphere® Application Server services.