Hadoop Pipes
Apache Hadoop provides an adapter layer called Pipes, which allows C++ application code to be used in MapReduce programs. Applications that require high numerical performance may see better throughput if they are written in C++ and run through Pipes.
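For reference, a pipes application implements its map and reduce logic in C++ against the HadoopPipes API. The following sketch is modeled on the wordcount-simple sample that ships with Hadoop and is used later in this procedure; the header locations (shown here as hadoop/Pipes.hh and related files) and the required link libraries can differ between Hadoop versions, so treat it as an illustration rather than a drop-in source file.
// Sketch of a word-count pipes application, modeled on the
// wordcount-simple sample shipped with Hadoop. Header paths may
// differ between Hadoop versions.
#include <string>
#include <vector>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class WordCountMap : public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Split each input line into words and emit <word, "1"> pairs.
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (unsigned int i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

class WordCountReduce : public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    // Sum the counts for each word and emit the total.
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // Run the task; the framework decides whether this process acts
  // as a mapper or a reducer.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
}
Such a program is linked against the pipes libraries (libhadooppipes.a and libhadooputils.a) and uploaded to HDFS so that the framework can distribute and launch it, as shown in the procedure below.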
Before you begin
- Hadoop 2.7.2.
- Cloudera Distribution including Hadoop (CDH) 5.5.1.
To run pipes with Cloudera, download the required packages from the Cloudera web site.
To run Hadoop pipes on IBM PowerLinux systems, build your Hadoop version for PowerLinux and compile the pipes libraries as described in step 1.
About this task
Follow these steps to run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony.
Procedure
- If you want to run pipes on PowerLinux systems, complete these steps before following the rest of this procedure. Otherwise, go to step 2.
Note: While pipes is supported with the Hadoop and CDH versions listed in the previous section, pipes on PowerLinux is supported only with Hadoop 0.20.x and 1.0.x.
To run pipes on PowerLinux, build your Hadoop version for PowerLinux as described in Building Hadoop on IBM PowerLinux (https://www.ibm.com/developerworks/) and compile your pipes libraries as described here.
These steps compile the following pipes libraries to the $HADOOP_TOP/build/c++/Linux-ppc64-64/lib/ folder, where $HADOOP_TOP is the download directory of the source code distribution for the supported Hadoop versions:
- libhadooppipes.a
- libhadooputils.a
- libhdfs.la
- libhdfs.so
- If required, edit ${HADOOP_TOP}/src/c++/libhdfs/configure. Change:
case $host_cpu in powerpc)
to:
case $host_cpu in powerpc*)
- Build the libraries.
$ cd ${HADOOP_TOP}
$ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++
$ ant -Dlibhdfs=true -Dcompile.c++=true compile-c++-libhdfs
- Check if the libraries are built.
$ ls -l build/c++/Linux-ppc64-64/lib/
total 643
-rw-r--r-- 1 user group 735710 Feb 1 16:33 libhadooppipes.a
-rw-r--r-- 1 user group 288328 Feb 1 16:33 libhadooputils.a
-rwxr-xr-x 1 user group   1067 Feb 1 16:42 libhdfs.la*
lrwxrwxrwx 1 user group     16 Feb 1 16:42 libhdfs.so -> libhdfs.so.0.0.0*
lrwxrwxrwx 1 user group     16 Feb 1 16:42 libhdfs.so.0 -> libhdfs.so.0.0.0*
-rwxr-xr-x 1 user group 136258 Feb 1 16:42 libhdfs.so.0.0.0*
- Compile the Hadoop pipes samples using Apache ANT (download ANT if required). For CDH, use Apache Maven (download Maven if required).
- For Apache Hadoop versions, follow these steps:
- Install ANT in /home/ant:
tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/
- Set the environment variables:
export ANT_HOME=/home/ant/apache-ant-1.8.2
export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin
- Compile Hadoop samples:
- Version 2.7.2:
cd $HADOOP_HOME
ant -Dcompile.c++=yes examples
- Version 0.21:
cd $HADOOP_HOME/mapred
$PMR_HOME/version/os_type/samples/pipes_compile.sh $HADOOP_HOME/mapred
ant -Dcompile.c++=yes examples
Note: Running pipes_compile.sh replaces instances of {hadoop-common.version} and {hadoop-hdfs.version} with {hadoop-common.version}-SNAPSHOT and {hadoop-hdfs.version}-SNAPSHOT, respectively, in the ivy.xml and build.xml files at $HADOOP_HOME. If you want to roll back the changes, run the following commands:
for xml in `find $HADOOP_HOME -name ivy.xml`
do
  DIR_NAME=`dirname $xml`
  cat $xml | sed 's/{hadoop-common.version}-SNAPSHOT/{hadoop-common.version}/g' | sed "s/{hadoop-hdfs.version}-SNAPSHOT/{hadoop-hdfs.version}/g" > $DIR_NAME/temp.xml
  rm $xml
  mv $DIR_NAME/temp.xml $DIR_NAME/ivy.xml
done
cat $HADOOP_HOME/build.xml | sed 's:${hadoop-hdfs.version}-SNAPSHOT:${hadoop-hdfs.version}:g' > $HADOOP_HOME/build.temp.xml
rm $HADOOP_HOME/build.xml
mv $HADOOP_HOME/build.temp.xml $HADOOP_HOME/build.xml
- For CDH, follow these steps:
- Install ANT in /home/ant:
tar -zxvf apache-ant-1.8.2-bin.tar.gz -C /home/ant/
- Install Maven in /home/mvn:
tar -zxvf apache-maven-3.0.4-bin.tar.gz -C /home/mvn
- Set the environment variables:
export ANT_HOME=/home/ant/apache-ant-1.8.2
export JAVA_HOME=/usr/java/latest
export MVN_HOME=/home/mvn/apache-maven-3.0.4
export PATH=$PATH:$MVN_HOME/bin:$ANT_HOME/bin
- Compile the samples:
cd $HADOOP_HOME
ant -Dcompile.c++=yes examples
Once the pipes samples are compiled, you will find the pipes-sort, wordcount-nopipe, wordcount-part, and wordcount-simple samples at this location:
- Hadoop 2.7.2 and CDH 3 update 1, 2, or 5:
$HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/
- Hadoop 0.21.0:
$HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/
- Run a Hadoop pipes sample within the MapReduce framework in IBM Spectrum Symphony. The following steps describe how to run "wordcount-simple":
- Upload wordcount-simple to HDFS /pipes.
- Hadoop 2.7.2 and CDH 3 update 1, 2, or 5:
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple
- Hadoop 0.21.0:
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/mapred/build/c++-examples/Linux-amd64-64/bin/wordcount-simple /pipes/wordcount-simple
- Create two new text files, "file1" and "file2", with the following content:
- file1: Hello world
- file2: hello world
- Upload "file1" and "file2" to HDFS /wordcount/input.
$HADOOP_HOME/bin/hadoop fs -copyFromLocal file1 /wordcount/input/file1
$HADOOP_HOME/bin/hadoop fs -copyFromLocal file2 /wordcount/input/file2
- Run "wordcount-simple":
- Hadoop 2.7.2 and CDH 3 update 1, 2, or 5:
mrsh pipes
-D hadoop.pipes.java.recordreader=true
-D hadoop.pipes.java.recordwriter=true
-D mapred.job.name=wordcount
-input hdfs://NameNodeAddress:Port/wordcount/input
-output hdfs://NameNodeAddress:Port/wordcount/pipes
-program hdfs://NameNodeAddress:Port/pipes/wordcount-simple
- Hadoop 0.21.0:
mrsh pipes
-D mapreduce.pipes.isjavarecordreader=true
-D mapreduce.pipes.isjavarecordwriter=true
-D mapred.job.name=wordcount
-input hdfs://NameNodeAddress:Port/wordcount/input
-output hdfs://NameNodeAddress:Port/wordcount/pipes
-program hdfs://NameNodeAddress:Port/pipes/wordcount-simple
- Go to the HDFS NameNode web interface (for example, http://NameNode:50070) to check the result.
The output file, which is under the HDFS output path that you specified (/wordcount/pipes in this example), should contain:
- Hello 1
- hello 1
- world 2