Hadoop Pipes

Apache Hadoop provides an adapter layer called Pipes, which allows C++ application code to be used in MapReduce programs. Applications that require high numerical performance may see better throughput if they are written in C++ and run through Pipes.
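
A Pipes application is a standalone C++ binary that implements its mapper and reducer against the HadoopPipes API; the framework launches the binary and exchanges keys and values with it. The following sketch shows the general shape of such a program, in the style of the wordcount-simple sample compiled later in this procedure. It is illustrative only: the class names WordCountMap and WordCountReduce are placeholders, not the exact source that ships with your distribution.

  #include <cstddef>
  #include <string>
  #include <vector>

  #include "hadoop/Pipes.hh"
  #include "hadoop/TemplateFactory.hh"
  #include "hadoop/StringUtils.hh"

  // Mapper: emit <word, "1"> for each whitespace-separated token in the input value.
  class WordCountMap : public HadoopPipes::Mapper {
  public:
    WordCountMap(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context) {
      std::vector<std::string> words =
          HadoopUtils::splitString(context.getInputValue(), " ");
      for (std::size_t i = 0; i < words.size(); ++i) {
        context.emit(words[i], "1");
      }
    }
  };

  // Reducer: sum the counts emitted for each word.
  class WordCountReduce : public HadoopPipes::Reducer {
  public:
    WordCountReduce(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context) {
      int sum = 0;
      while (context.nextValue()) {
        sum += HadoopUtils::toInt(context.getInputValue());
      }
      context.emit(context.getInputKey(), HadoopUtils::toString(sum));
    }
  };

  int main(int argc, char* argv[]) {
    // Hand control to the Pipes runtime, which communicates with the Java framework.
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
  }

A binary of this form is linked against the libhadooppipes.a and libhadooputils.a libraries built from the Hadoop source tree, uploaded to HDFS, and launched with the pipes command, as described in the procedure that follows.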

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is configured to use one of these distributions:
  • Hadoop 2.7.2.
  • Cloudera Distribution including Hadoop (CDH) 5.5.1.

    To run pipes with Cloudera, download the required packages from the Cloudera web site.

To run Hadoop pipes on IBM PowerLinux systems, build your Hadoop version for PowerLinux and compile the pipes libraries as described in step 1.

About this task

Follow these steps to run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony.

Procedure

  1. If you want to run pipes with Hadoop versions on PowerLinux systems, complete these steps before following the rest of this procedure. Otherwise, go to step 2.
    Note: While pipes is supported with the Hadoop and CDH versions listed in the previous section, pipes on PowerLinux is supported only with Hadoop 0.20.x and 1.0.x.

    To run pipes on PowerLinux, build your Hadoop version for PowerLinux as described in Building Hadoop on IBM PowerLinux (https://www.ibm.com/developerworks/) and compile your pipes libraries as described here.

    These steps compile the following pipes libraries into the $HADOOP_TOP/build/c++/Linux-ppc64-64/lib/ folder, where $HADOOP_TOP is the directory into which you downloaded the source code distribution of your supported Hadoop version:
    • libhadooppipes.a
    • libhadooputils.a
    • libhdfs.la
    • libhdfs.so
    1. If required, edit ${HADOOP_TOP}/src/c++/libhdfs/configure.
      Change:
      case $host_cpu in
      powerpc)
      
      to:
      case $host_cpu in
      powerpc*)
    2. Build the libraries.

      $ cd ${HADOOP_TOP}

      $ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++

      $ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++-libhdfs

    3. Verify that the libraries were built.
      $ ls -l build/c++/Linux-ppc64-64/lib/
      total 643
      -rw-r--r-- 1 user group 735710 Feb  1 16:33 libhadooppipes.a
      -rw-r--r-- 1 user group 288328 Feb  1 16:33 libhadooputils.a
      -rwxr-xr-x 1 user group   1067 Feb  1 16:42 libhdfs.la*
      lrwxrwxrwx 1 user group     16 Feb  1 16:42 libhdfs.so -> libhdfs.so.0.0.0*
      lrwxrwxrwx 1 user group     16 Feb  1 16:42 libhdfs.so.0 -> libhdfs.so.0.0.0*
      -rwxr-xr-x 1 user group 136258 Feb  1 16:42 libhdfs.so.0.0.0*
  2. Compile the Hadoop pipes samples using Apache ANT (download ANT if you do not already have it). For CDH, you also need Apache Maven (download Maven if you do not already have it).
    • For Hadoop versions, follow these steps:
      1. Install ANT in /home/ant:

        tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/

      2. Set the environment variables:

        export ANT_HOME=/home/ant/apache-ant-1.8.2

        export JAVA_HOME=/usr/java/latest

        export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin

      3. Compile Hadoop samples:
        • Version 2.7.2:

          cd $HADOOP_HOME

          ant -Dcompile.c++=yes examples

        • Version 0.21:

          cd $HADOOP_HOME/mapred

          $PMR_HOME/version/os_type/samples/pipes_compile.sh $HADOOP_HOME/mapred

          ant -Dcompile.c++=yes examples

          Note: Running pipes_compile.sh replaces instances of {hadoop-common.version} and {hadoop-hdfs.version} with {hadoop-common.version}-SNAPSHOT and {hadoop-hdfs.version}-SNAPSHOT, respectively, in the ivy.xml and build.xml files under $HADOOP_HOME. If you want to roll back the changes, run the following commands:
          for xml in `find $HADOOP_HOME -name ivy.xml`
          do
              DIR_NAME=`dirname $xml`
              cat $xml | sed 's/{hadoop-common.version}-SNAPSHOT/{hadoop-common.version}/g' | sed "s/{hadoop-hdfs.version}-SNAPSHOT/{hadoop-hdfs.version}/g" > $DIR_NAME/temp.xml
              rm $xml
              mv $DIR_NAME/temp.xml $DIR_NAME/ivy.xml
          done
          
          cat $HADOOP_HOME/build.xml | sed 's:${hadoop-hdfs.version}-SNAPSHOT:${hadoop-hdfs.version}:g' > $HADOOP_HOME/build.temp.xml
          rm $HADOOP_HOME/build.xml
          mv $HADOOP_HOME/build.temp.xml $HADOOP_HOME/build.xml
    • For CDH, follow these steps:
      1. Install ANT in /home/ant:

        tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/

      2. Install Maven in /home/mvn:

        tar -zxvf apache-maven-3.0.4-bin.tar.gz -C /home/mvn

      3. Set the environment variables:

        export ANT_HOME=/home/ant/apache-ant-1.8.2

        export JAVA_HOME=/usr/java/latest

        export MVN_HOME=/home/mvn/apache-maven-3.0.4

        export PATH=$PATH:$MVN_HOME/bin:$ANT_HOME/bin

      4. Compile the samples:

        cd $HADOOP_HOME

        ant -Dcompile.c++=yes examples

    After the pipes samples are compiled, you can find the pipes-sort, wordcount-nopipe, wordcount-part, and wordcount-simple samples in one of the following locations, depending on your version:
    • Hadoop 2.7.2 and CDH 5.5.1:

      $HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/

    • Hadoop 0.21.0:

      $HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/

  3. Run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony. The following steps describe how to run "wordcount-simple":
    1. Upload wordcount-simple to HDFS /pipes.
      • Hadoop 2.7.2 and CDH 5.5.1:

        $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple

      • Hadoop 0.21.0:

        $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple

    2. Create two new text files, "file1" and "file2", with the following content:
      • file1: Hello world
      • file2: hello world
    3. Upload "file1" and "file2" to HDFS /wordcount/input.

      $HADOOP_HOME/bin/hadoop fs -copyFromLocal file1 /wordcount/input/file1

      $HADOOP_HOME/bin/hadoop fs -copyFromLocal file2 /wordcount/input/file2

    4. Run "wordcount-simple":
      • Hadoop 2.7.2 and CDH 5.5.1:

        mrsh pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -D mapred.job.name=wordcount \
        -input hdfs://NameNodeAddress:Port/wordcount/input \
        -output hdfs://NameNodeAddress:Port/wordcount/pipes \
        -program hdfs://NameNodeAddress:Port/pipes/wordcount-simple

      • Hadoop 0.21.0:

        mrsh pipes \
        -D mapreduce.pipes.isjavarecordreader=true \
        -D mapreduce.pipes.isjavarecordwriter=true \
        -D mapred.job.name=wordcount \
        -input hdfs://NameNodeAddress:Port/wordcount/input \
        -output hdfs://NameNodeAddress:Port/wordcount/pipes \
        -program hdfs://NameNodeAddress:Port/pipes/wordcount-simple

    5. Go to the HDFS NameNode web interface (for example, http://NameNode:50070) to check the result.
      The output, written under the HDFS path specified with the -output option (/wordcount/pipes in this example), should contain:
      • Hello 1
      • hello 1
      • world 2