Configuring the environment for the Big Data File stage
To access a Hadoop Distributed File System (HDFS) with the Big Data File stage, you must make the libhdfs.so shared library, its required JAR libraries, and its configuration files available to the Big Data File stage on the IBM® InfoSphere® Information Server engine tier system or systems.
- You must have IBM InfoSphere DataStage® administrator level access to modify environment variables.
- InfoSphere Information Server must be installed on a Linux® or AIX® operating system.
- Because the Hadoop library names and locations might change when a new version is released, ensure that you use the most recent information. See Java™ libraries used to configure Big Data File stage for use with HDFS, for example with InfoSphere BigInsights®.
To use the Big Data File stage, you must ensure that the stage can access the HDFS libhdfs.so shared library, its associated JAR libraries, and configuration files.
If you are using the Big Data File stage to access an InfoSphere BigInsights HDFS by way of the REST connection method or from an AIX operating system, the InfoSphere BigInsights libhdfs.so library is not required. Access to the library's associated .jar files and configuration files is, however, still required.
If the HDFS and the parallel engine are on different systems, ensure that the parallel engine can access the HDFS .jar files, the HDFS configuration directory, and the libhdfs.so shared library (if required).
One method to provide access to these HDFS components is to use NFS to mount directories from the HDFS computer onto the parallel engine computer. For more information about mounting directories by using NFS, see Potential issues sharing libhdfs via NFS mount.
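For example, a minimal sketch of the NFS approach, assuming the HDFS computer is named hadoophost, the engine tier computer is named enginehost, and the Hadoop installation is in /usr/lib/hadoop (all three names are illustrative; substitute the values for your environment):

    # On the HDFS computer, export the Hadoop directories (entries in /etc/exports):
    #   /usr/lib/hadoop   enginehost(ro,sync)
    #   /etc/hadoop/conf  enginehost(ro,sync)

    # On the engine tier computer, mount the exported directories at the same paths:
    mkdir -p /usr/lib/hadoop /etc/hadoop/conf
    mount -t nfs hadoophost:/usr/lib/hadoop /usr/lib/hadoop
    mount -t nfs hadoophost:/etc/hadoop/conf /etc/hadoop/conf

Mounting at the same paths on both computers keeps the CLASSPATH entries identical on both systems.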
Another method to provide access to the HDFS components is to copy them to the computer that hosts the parallel engine. If you use this method, you must update the HDFS components if the library names or locations change.
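A minimal sketch of the copy approach, using the same illustrative host and directory names as the NFS example; rsync is one convenient way to repeat the copy when the HDFS components change:

    # Copy the Hadoop JAR directories and the configuration directory
    # from the HDFS computer to the engine tier computer.
    rsync -av hadoophost:/usr/lib/hadoop/ /usr/lib/hadoop/
    rsync -av hadoophost:/etc/hadoop/conf/ /etc/hadoop/conf/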
If you are using the copy method with an InfoSphere BigInsights HDFS on the Linux operating system, you can use the syncbi.sh tool to do the initial copy of the needed libraries, .jar files, and configuration files, and to keep the configuration in sync afterward. The tool also sets up the needed CLASSPATH environment variable by using the ishdfs.config file.
Required Hadoop JAR libraries
For the correct functioning of the libhdfs.so library, the Apache Software Foundation recommends that the CLASSPATH variable include all of the .jar files in the $HADOOP_PREFIX and $HADOOP_PREFIX/lib directories, plus the configuration directory that contains the hdfs-site.xml file. The locations of these directories vary between Hadoop distributions and can change between releases; some distributions also move some of the JAR files from those directories into additional directories.
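For example, a minimal sketch that builds such a CLASSPATH, assuming illustrative default directory locations; the JAR files are listed individually because the Apache libhdfs documentation notes that wildcard classpath entries do not work with libhdfs.so:

    # Illustrative locations; substitute the directories for your distribution.
    export HADOOP_PREFIX=/usr/lib/hadoop
    HADOOP_CONF_DIR=/etc/hadoop/conf        # directory that contains hdfs-site.xml

    # Start with the configuration directory, then append every JAR file.
    CP="$HADOOP_CONF_DIR"
    for jar in "$HADOOP_PREFIX"/*.jar "$HADOOP_PREFIX"/lib/*.jar; do
        CP="$CP:$jar"
    done
    export CLASSPATH="$CP"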
For Cloudera CDH version 5.1, the required Hadoop JAR directories are <CDH_ROOT>/lib/hadoop/, <CDH_ROOT>/lib/hadoop/lib/, and <CDH_ROOT>/lib/hadoop/client/. The configuration directory is /etc/hadoop/conf/.
For Hortonworks HDP version 2.1, the required Hadoop JAR directories are /usr/lib/hadoop-hdfs, /usr/lib/hadoop, and /usr/lib/hadoop/lib. The configuration directory is /etc/hadoop/conf/.
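For example, the earlier CLASSPATH sketch instantiated with the HDP 2.1 directories that are listed above (again assuming a default installation):

    HADOOP_CONF_DIR=/etc/hadoop/conf
    CP="$HADOOP_CONF_DIR"
    for jar in /usr/lib/hadoop-hdfs/*.jar /usr/lib/hadoop/*.jar /usr/lib/hadoop/lib/*.jar; do
        CP="$CP:$jar"
    done
    export CLASSPATH="$CP"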