Hadoop Pipes

Apache Hadoop provides an adapter layer called Pipes, which allows C++ application code to be used in MapReduce programs. Applications that require high numerical performance may see better throughput if they are written in C++ and run through Pipes.
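
A Pipes application is a standalone C++ binary that implements its mapper and reducer against the HadoopPipes API; the framework launches the binary and exchanges keys and values with it. The following sketch shows the general shape of such a program, in the style of the wordcount-simple sample compiled later in this procedure. It is illustrative only: the class names WordCountMap and WordCountReduce are placeholders, not the exact source that ships with your distribution.

  #include <cstddef>
  #include <string>
  #include <vector>

  #include "hadoop/Pipes.hh"
  #include "hadoop/TemplateFactory.hh"
  #include "hadoop/StringUtils.hh"

  // Mapper: emit <word, "1"> for each whitespace-separated token in the input value.
  class WordCountMap : public HadoopPipes::Mapper {
  public:
    WordCountMap(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context) {
      std::vector<std::string> words =
          HadoopUtils::splitString(context.getInputValue(), " ");
      for (std::size_t i = 0; i < words.size(); ++i) {
        context.emit(words[i], "1");
      }
    }
  };

  // Reducer: sum the counts emitted for each word.
  class WordCountReduce : public HadoopPipes::Reducer {
  public:
    WordCountReduce(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context) {
      int sum = 0;
      while (context.nextValue()) {
        sum += HadoopUtils::toInt(context.getInputValue());
      }
      context.emit(context.getInputKey(), HadoopUtils::toString(sum));
    }
  };

  int main(int argc, char* argv[]) {
    // Hand control to the Pipes runtime, which communicates with the Java framework.
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
  }

A binary of this form is linked against the libhadooppipes.a and libhadooputils.a libraries built from the Hadoop source tree, uploaded to HDFS, and launched with the pipes command, as described in the procedure that follows.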

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is configured to use one of these distributions:
  • Hadoop 2.7.2.
  • Cloudera Distribution including Hadoop (CDH) 5.5.1.

    To run pipes with Cloudera, download the required packages from the Cloudera web site.

To run Hadoop pipes on IBM PowerLinux systems, build your Hadoop version for PowerLinux and compile the pipes libraries as described in step 1.

About this task

Follow these steps to run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony.

Procedure

  1. If you want to run pipes with Hadoop versions on PowerLinux systems, complete these steps before following the rest of this procedure. Otherwise, go to step 2.
    Note: While pipes is supported with the Hadoop and CDH versions listed in the previous section, pipes on PowerLinux is supported only with Hadoop 0.20.x and 1.0.x.

    To run pipes on PowerLinux, build your Hadoop version for PowerLinux as described in Building Hadoop on IBM PowerLinux (https://www.ibm.com/developerworks/) and compile your pipes libraries as described here.

    These steps compile the following pipes libraries into the $HADOOP_TOP/build/c++/Linux-ppc64-64/lib/ folder, where $HADOOP_TOP is the directory into which you downloaded the source code distribution of your supported Hadoop version:
    • libhadooppipes.a
    • libhadooputils.a
    • libhdfs.la
    • libhdfs.so
    1. If required, edit ${HADOOP_TOP}/src/c++/libhdfs/configure.
      Change:
      case $host_cpu in
      powerpc)
      
      to:
      case $host_cpu in
      powerpc*)
    2. Build the libraries.

      $ cd ${HADOOP_TOP}

      $ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++

      $ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++-libhdfs

    3. Verify that the libraries were built.
      $ ls -l build/c++/Linux-ppc64-64/lib/
      total 643
      -rw-r--r-- 1 user group 735710 Feb  1 16:33 libhadooppipes.a
      -rw-r--r-- 1 user group 288328 Feb  1 16:33 libhadooputils.a
      -rwxr-xr-x 1 user group   1067 Feb  1 16:42 libhdfs.la*
      lrwxrwxrwx 1 user group     16 Feb  1 16:42 libhdfs.so -> libhdfs.so.0.0.0*
      lrwxrwxrwx 1 user group     16 Feb  1 16:42 libhdfs.so.0 -> libhdfs.so.0.0.0*
      -rwxr-xr-x 1 user group 136258 Feb  1 16:42 libhdfs.so.0.0.0*
  2. Compile the Hadoop pipes samples using Apache ANT (download ANT if you do not already have it). For CDH, you also need Apache Maven (download Maven if you do not already have it).
    • For Hadoop versions, follow these steps:
      1. Install ANT in /home/ant:

        tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/

      2. Set the environment variables:

        export ANT_HOME=/home/ant/apache-ant-1.8.2

        export JAVA_HOME=/usr/java/latest

        export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin

      3. Compile Hadoop samples:
        • Version 2.7.2:

          cd $HADOOP_HOME

          ant -Dcompile.c++=yes examples

        • Version 0.21:

          cd $HADOOP_HOME/mapred

          $PMR_HOME/version/os_type/samples/pipes_compile.sh $HADOOP_HOME/mapred

          ant -Dcompile.c++=yes examples

          Note: Running pipes_compile.sh replaces instances of {hadoop-common.version} and {hadoop-hdfs.version} with {hadoop-common.version}-SNAPSHOT and {hadoop-hdfs.version}-SNAPSHOT, respectively, in the ivy.xml and build.xml files under $HADOOP_HOME. If you want to roll back the changes, run the following commands:
          for xml in `find $HADOOP_HOME -name ivy.xml`
          do
              DIR_NAME=`dirname $xml`
              cat $xml | sed 's/{hadoop-common.version}-SNAPSHOT/{hadoop-common.version}/g' | sed "s/{hadoop-hdfs.version}-SNAPSHOT/{hadoop-hdfs.version}/g" > $DIR_NAME/temp.xml
              rm $xml
              mv $DIR_NAME/temp.xml $DIR_NAME/ivy.xml
          done
          
          cat $HADOOP_HOME/build.xml | sed 's:${hadoop-hdfs.version}-SNAPSHOT:${hadoop-hdfs.version}:g' > $HADOOP_HOME/build.temp.xml
          rm $HADOOP_HOME/build.xml
          mv $HADOOP_HOME/build.temp.xml $HADOOP_HOME/build.xml
    • For CDH, follow these steps:
      1. Install ANT in /home/ant:

        tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/

      2. Install Maven in /home/mvn:

        tar -zxvf apache-maven-3.0.4-bin.tar.gz -C /home/mvn

      3. Set the environment variables:

        export ANT_HOME=/home/ant/apache-ant-1.8.2

        export JAVA_HOME=/usr/java/latest

        export MVN_HOME=/home/mvn/apache-maven-3.0.4

        export PATH=$PATH:$MVN_HOME/bin:$ANT_HOME/bin

      4. Compile the samples:

        cd $HADOOP_HOME

        ant -Dcompile.c++=yes examples

    After the pipes samples are compiled, you can find the pipes-sort, wordcount-nopipe, wordcount-part, and wordcount-simple samples in one of the following locations, depending on your version:
    • Hadoop 2.7.2 and CDH 5.5.1:

      $HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/

    • Hadoop 0.21.0:

      $HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/

  3. Run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony. The following steps describe how to run "wordcount-simple":
    1. Upload wordcount-simple to HDFS /pipes.
      • Hadoop 2.7.2 and CDH 5.5.1:

        $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple

      • Hadoop 0.21.0:

        $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple

    2. Create two new text files, "file1" and "file2", with the following content:
      • file1: Hello world
      • file2: hello world
    3. Upload "file1" and "file2" to HDFS /wordcount/input.

      $HADOOP_HOME/bin/hadoop fs -copyFromLocal file1 /wordcount/input/file1

      $HADOOP_HOME/bin/hadoop fs -copyFromLocal file2 /wordcount/input/file2

    4. Run "wordcount-simple":
      • Hadoop 2.7.2 and CDH 5.5.1:

        mrsh pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -D mapred.job.name=wordcount \
        -input hdfs://NameNodeAddress:Port/wordcount/input \
        -output hdfs://NameNodeAddress:Port/wordcount/pipes \
        -program hdfs://NameNodeAddress:Port/pipes/wordcount-simple

      • Hadoop 0.21.0:

        mrsh pipes \
        -D mapreduce.pipes.isjavarecordreader=true \
        -D mapreduce.pipes.isjavarecordwriter=true \
        -D mapred.job.name=wordcount \
        -input hdfs://NameNodeAddress:Port/wordcount/input \
        -output hdfs://NameNodeAddress:Port/wordcount/pipes \
        -program hdfs://NameNodeAddress:Port/pipes/wordcount-simple

    5. Go to the HDFS NameNode web interface (for example, http://NameNode:50070) to check the result.
      The output, written under the HDFS path specified with the -output option (/wordcount/pipes in this example), should contain:
      • Hello 1
      • hello 1
      • world 2