Apache Pig is a platform for analyzing large data sets. It consists of
a high-level language for expressing data analysis programs, coupled
with the MapReduce infrastructure for evaluating those programs.
Pig provides a scripting language, Pig Latin, which Pig compiles
into MapReduce programs. As a result, using Pig can
reduce the time needed to develop new MapReduce applications.
About this task
Follow these steps to run Pig applications with the MapReduce framework in IBM Spectrum Symphony.
Procedure
- Download and install Pig. The MapReduce framework in IBM Spectrum Symphony is qualified with Pig versions 0.12.1 and 0.13.0.
- Compile the Pig JAR file against Apache Hadoop MRv2 and create symbolic links:
- Set the Pig environment variables:
export PIG_HOME=/root/pig
export HADOOP_HOME=/root/hadoop
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
export PIG_CLASSPATH=$SOAM_SERVERDIR/../lib/*:$PMR_SERVERDIR/../lib/hadoop-2.7.2/*:$PMR_SERVERDIR/../lib/*:$PMR_HOME/conf:$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=/root/hadoop/lib/native:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PMR_SERVERDIR/../lib64/hadoop-2.7.2:$LD_LIBRARY_PATH
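Several of the paths above depend on variables such as $PMR_HOME, $PMR_SERVERDIR, and $SOAM_SERVERDIR, which these exports only reference and do not set. The following is a minimal sketch of a sanity check that could be run once both the cluster environment and the settings above have been sourced. The function name check_pig_env is hypothetical and is not part of Pig or IBM Spectrum Symphony.

```shell
# Hypothetical helper (not part of Pig or IBM Spectrum Symphony):
# report each variable used in the exports above that is unset or that
# does not point at an existing directory.
check_pig_env() {
    missing=0
    for name in PIG_HOME HADOOP_HOME PMR_HOME PMR_SERVERDIR SOAM_SERVERDIR; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            missing=1
        elif [ ! -d "$value" ]; then
            echo "not a directory: $name=$value"
            missing=1
        fi
    done
    # Exit status 0 means every variable is set and resolves to a directory.
    return $missing
}
```

Calling check_pig_env prints nothing and returns 0 when all five variables resolve to directories; otherwise it prints one line per problem and returns 1.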
- Modify the bin/pig script.
- Locate the following section (line 343) in the file:
# run it
if [ -n "$HADOOP_BIN" ]; then
    if [ "$debug" == "true" ]; then
        echo "Find hadoop at $HADOOP_BIN"
    fi
- Change the section to the following:
# run it
if [[ -n "$HADOOP_BIN" && -z "$PMR_HOME" ]]; then
    if [ "$debug" == "true" ]; then
        echo "Find hadoop at $HADOOP_BIN"
    fi
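The effect of the added -z "$PMR_HOME" test is that the Hadoop branch is taken only when a Hadoop launcher was found and PMR_HOME is empty. The sketch below mirrors that decision in isolation so it can be tried without editing bin/pig. The function name pig_launch_path and the labels it prints are hypothetical, and the condition is written with POSIX [ ] tests instead of [[ ]] so that it runs in any sh.

```shell
# Hypothetical illustration of the modified condition in bin/pig.
# Prints "hadoop-launcher" when the Hadoop branch would be taken,
# and "fallback" when the script would fall through to its other branch.
pig_launch_path() {
    if [ -n "$HADOOP_BIN" ] && [ -z "$PMR_HOME" ]; then
        echo "hadoop-launcher"
    else
        echo "fallback"
    fi
}
```

With PMR_HOME set, as it is expected to be in a sourced IBM Spectrum Symphony MapReduce environment, the function prints fallback even when HADOOP_BIN is non-empty.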
- Source the IBM Spectrum Symphony cluster environment file and then the Pig environment settings defined in step 3. From the $EGO_TOP directory, run:
- Create a Pig script named wordcount.pig as follows:
%default INPUT wc-in
%default OUTPUT out/pig-out-1
A = LOAD '$INPUT';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate group, COUNT(C);
F = ORDER E BY group;
STORE F INTO '$OUTPUT' USING PigStorage();
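To check the output of this script against a small sample file, the same logic can be approximated locally with standard text tools. This is a rough sketch, not a replacement for the Pig job. It assumes that the default loader splits each line on tabs (so $0 is the first tab-separated field), that TOKENIZE splits on space, double quote, comma, parentheses, and asterisk, and that '\\w+' must match the entire token. The function name wordcount_local is hypothetical.

```shell
# Hypothetical local approximation of wordcount.pig for small test inputs.
# Reads text on stdin and writes "word<TAB>count" lines sorted by word.
wordcount_local() {
    cut -f1 |                               # $0: first tab-separated field
        tr -s ' ",()*' '\n' |               # TOKENIZE: one token per line
        grep -E '^[A-Za-z0-9_]+$' |         # filter: whole token matches \w+
        LC_ALL=C sort |                     # group identical words together
        uniq -c |                           # COUNT per word
        awk '{ print $2 "\t" $1 }'          # emit word, tab, count
}
```

For example, piping the two lines "the cat, the dog" and "(cat) x-y" through wordcount_local yields cat 2, dog 1, and the 2; the token x-y is dropped because it contains a character outside \w.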
- Run the Pig script:
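As an illustration only, a run with the standard Pig command line might be wrapped as shown below. The sketch assumes that the script was saved as wordcount.pig in the current directory and that the pig launcher modified earlier is on the PATH; the function name run_wordcount is hypothetical. The -param options override the %default values declared at the top of the script.

```shell
# Hypothetical wrapper: run wordcount.pig in MapReduce mode.
# $1 = input path (defaults to wc-in), $2 = output path (defaults to out/pig-out-1).
run_wordcount() {
    input=${1:-wc-in}
    output=${2:-out/pig-out-1}
    # -x mapreduce selects MapReduce execution mode;
    # -param supplies values for $INPUT and $OUTPUT in the script.
    pig -x mapreduce \
        -param INPUT="$input" \
        -param OUTPUT="$output" \
        wordcount.pig
}
```

Calling run_wordcount with no arguments uses the same paths as the defaults in the script; passing two arguments, for example run_wordcount wc-in out/pig-out-2, writes to a different output directory, which avoids a failure when the earlier output directory already exists.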