Creating the configuration file for the Big Data File stage

To establish connections to distributed file systems, such as a Hadoop Distributed File System (HDFS) in InfoSphere® BigInsights®, you can create the ishdfs.config configuration file to contain the required CLASSPATH information.

You must have IBM® InfoSphere DataStage® administrator level access to create the ishdfs.config file. Also, the Big Data File stage must be installed on a Linux® or AIX® operating system.

The ishdfs.config file sets up the CLASSPATH parameter with Java™ libraries and file system folders that are necessary to import metadata from the HDFS and move data to the HDFS.

If the HDFS and the InfoSphere Information Server engine are not installed on the same computer, you can either copy the HDFS client library and client configuration files to the computer where the InfoSphere Information Server engine is installed, or make them available from the remote HDFS system. Whichever method you use, the HDFS client .jar files and configuration file directories must be accessible to the InfoSphere Information Server engine.

Tip: If you use the copy method with an InfoSphere BigInsights HDFS on the Linux operating system, you can use the syncbi.sh tool to do the initial copy of the needed libraries, .jar files, and configuration files. The tool also sets up the needed CLASSPATH variable by copying the ishdfs.config.biginsights template file to the ishdfs.config file. The resulting file points to the .jar files that are downloaded and unpacked into the $DSHOME/../biginsights directory.

When the Big Data File stage is configured to use the HDFS API to communicate with the HDFS, it uses the CLASSPATH variable in the ishdfs.config configuration file if the configuration file is available. The use of this CLASSPATH variable overrides any other setting of the CLASSPATH variable for the Big Data File stage.
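
The precedence rule above can be pictured with a small shell sketch. It only illustrates the documented behavior and is not the stage's actual implementation. The function name effective_classpath is hypothetical, and two details are assumptions: that blanks around the value are ignored, and that a file without a CLASSPATH line falls back to the other setting.

```shell
# Hypothetical helper, for illustration only.
# $1 = path to ishdfs.config
# $2 = classpath that would apply otherwise (for example "$CLASSPATH")
effective_classpath() {
    config_file=$1
    fallback=$2
    if [ -r "$config_file" ]; then
        # Take the last CLASSPATH= line, then strip the key and any
        # surrounding blanks (assumed behavior, not confirmed by the product).
        value=$(sed -n 's/^[[:space:]]*CLASSPATH=[[:space:]]*//p' "$config_file" |
            tail -n 1 | sed 's/[[:space:]]*$//')
        if [ -n "$value" ]; then
            # The configuration file wins over any other setting.
            printf '%s\n' "$value"
            return 0
        fi
    fi
    # No usable configuration file: the other setting applies.
    printf '%s\n' "$fallback"
}
```

For example, effective_classpath "$DSHOME/ishdfs.config" "$CLASSPATH" prints the value from the configuration file when that file is readable, and the value of the CLASSPATH environment variable otherwise.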

  1. Create an ishdfs.config configuration file with read permissions enabled for all users of the computer where the InfoSphere Information Server engine is installed.
    The ishdfs.config file must be in the IS_HOME/Server/DSEngine directory, where IS_HOME is the InfoSphere Information Server home directory. The configuration file name is case sensitive.
  2. In the ishdfs.config file, add the following parameter to specify the location of the Java libraries, which contain the classes and packages for the HDFS:
    CLASSPATH=hdfs_classpath

    The hdfs_classpath value is a colon-separated list of the file system artifacts that are used to access the HDFS. Each element must be a fully qualified directory path or .jar file path. The example that follows is for InfoSphere BigInsights. If you use another HDFS distribution, your classpath is different.

    The following is an example of the Java libraries and file system folders:
    $DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar
    $DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar
    $DSHOME/../PXEngine/java/biginsights-restfs-1.0.0.jar
    $DSHOME/../PXEngine/java/cc-http-api.jar
    $DSHOME/../PXEngine/java/cc-http-impl.jar
    /opt/IBM/biginsights/IHC/lib/*
    /opt/IBM/biginsights/IHC/*
    /opt/IBM/biginsights/lib/JSON4J.jar
    /opt/IBM/biginsights/hadoop-conf

    $DSHOME is the environment variable that points to the IS_HOME/Server/DSEngine directory. The default directory is /opt/IBM/InformationServer/Server/DSEngine.

    For example, create the ishdfs.config file and enter the following line:
    CLASSPATH=$DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar:$DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar:$DSHOME/../PXEngine/java/biginsights-restfs-1.0.0.jar:$DSHOME/../PXEngine/java/cc-http-api.jar:$DSHOME/../PXEngine/java/cc-http-impl.jar:/opt/IBM/biginsights/IHC/lib/*:/opt/IBM/biginsights/IHC/*:/opt/IBM/biginsights/lib/JSON4J.jar:/opt/IBM/biginsights/hadoop-conf
    If you use the syncbi.sh tool to create the ishdfs.config file, the contents might look like the following example:
    CLASSPATH=$DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar:$DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar:$DSHOME/../PXEngine/java/biginsights-restfs-1.0.0.jar:$DSHOME/../PXEngine/java/cc-http-api.jar:$DSHOME/../PXEngine/java/cc-http-impl.jar:$DSHOME/../biginsights/hadoop-conf:$DSHOME/../biginsights:$DSHOME/../biginsights/*
  3. Save the ishdfs.config file in the IS_HOME/Server/DSEngine directory, where IS_HOME is the InfoSphere Information Server home directory. The default value of IS_HOME is /opt/IBM/InformationServer, so the default location of the file is /opt/IBM/InformationServer/Server/DSEngine.
    If the engine tier in your InfoSphere Information Server installation consists of multiple hosts, this file must be available from the same location on all the hosts. You can make this file available from the same location on all the hosts by configuring the DSEngine directory as a shared network directory.
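
After you save the file, it can help to confirm that every element of the classpath is reachable from the engine host. The following POSIX shell function is a minimal sketch of such a check, not a tool that ships with InfoSphere Information Server. The name check_classpath is hypothetical. The sketch assumes that $DSHOME is the only variable used in the file, that blanks around the value are ignored, and that for a wildcard element such as dir/* it is enough for dir to exist.

```shell
# Hypothetical helper, for illustration only.
# $1 = path to ishdfs.config
# $2 = directory to substitute for a leading $DSHOME
# Prints one line per problem; returns non-zero if any problem is found.
check_classpath() {
    config_file=$1
    dshome=$2
    problems=0
    value=$(sed -n 's/^[[:space:]]*CLASSPATH=[[:space:]]*//p' "$config_file" |
        tail -n 1 | sed 's/[[:space:]]*$//')
    if [ -z "$value" ]; then
        echo "no CLASSPATH value found in $config_file"
        return 1
    fi
    # Split on colons with globbing off, so that elements such as
    # lib/* are kept literally instead of being expanded by the shell.
    old_ifs=$IFS
    set -f
    IFS=:
    set -- $value
    IFS=$old_ifs
    set +f
    for entry in "$@"; do
        if [ -z "$entry" ]; then
            echo "empty element (check for a doubled or leading colon)"
            problems=$((problems + 1))
            continue
        fi
        # Replace a leading literal $DSHOME with the supplied directory.
        case $entry in
            \$DSHOME*) entry=$dshome${entry#\$DSHOME} ;;
        esac
        # Each element must be a fully qualified path.
        case $entry in
            /*) ;;
            *)
                echo "not a fully qualified path: $entry"
                problems=$((problems + 1))
                continue
                ;;
        esac
        # For a wildcard element, test the directory that contains it.
        case $entry in
            */\*) target=${entry%/\*} ;;
            *) target=$entry ;;
        esac
        if [ ! -e "$target" ]; then
            echo "missing: $entry"
            problems=$((problems + 1))
        fi
    done
    [ "$problems" -eq 0 ]
}
```

A call such as check_classpath "$DSHOME/ishdfs.config" "$DSHOME" prints nothing and returns zero when all elements are found. If the engine tier consists of multiple hosts, running the same check on each host is one way to confirm that the file and the libraries are available from the same location everywhere.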