Apache Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for writing data analysis programs, coupled with the MapReduce infrastructure for evaluating those programs. Pig provides a scripting language, Pig Latin, which Pig compiles into MapReduce programs. As a result, using Pig can reduce the time needed to develop new MapReduce applications.

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is set to use Pig. For the supported versions of Hadoop, see Supported distributed file systems for MapReduce or YARN integration. For the versions of Pig that the MapReduce framework in IBM Spectrum Symphony has been qualified with, see Supported third-party applications for MapReduce.

About this task

Follow these steps to run Pig applications with the MapReduce framework in IBM Spectrum Symphony.

Procedure

Download and install Pig. The MapReduce framework in IBM Spectrum Symphony is qualified with Pig versions 0.12.1 and 0.13.0.
Compile the Pig JAR file against Apache Hadoop MRv2 and create symbolic links:
- For Pig 0.12.1:
```
ln -s pig-0.12.1-withouthadoop-h2.jar pig.jar
```
- For Pig 0.13.0:
```
ln -s pig-0.13.0-withouthadoop-h2.jar pig-h1.jar
```
For Pig documentation, refer to http://pig.apache.org/.
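
The link step can be made a little safer by checking that the compiled JAR exists before linking to it. The following is a minimal sketch, not part of the product: `link_pig_jar` is a hypothetical helper name, and it assumes that you run it from the directory that contains the compiled JAR file.

```shell
# link_pig_jar (hypothetical helper): create or refresh a symbolic link.
# $1 is the compiled JAR file, $2 is the link name that bin/pig should find.
# Fails without touching anything when the JAR file does not exist.
link_pig_jar() {
    if [ ! -e "$1" ]; then
        echo "missing JAR: $1" >&2
        return 1
    fi
    # -s creates a symbolic link; -f replaces a link left by an earlier run.
    ln -sf "$1" "$2"
}

# Example for Pig 0.12.1 (uncomment after the JAR file has been compiled):
# link_pig_jar pig-0.12.1-withouthadoop-h2.jar pig.jar
```

After the call, `ls -l pig.jar` should show the link pointing at the compiled JAR file.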

Set Pig environments:

export PIG_HOME=/root/pig
export HADOOP_HOME=/root/hadoop
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
export PIG_CLASSPATH=$SOAM_SERVERDIR/../lib/*:$PMR_SERVERDIR/../lib/hadoop-2.7.2/*:$PMR_SERVERDIR/../lib/*:$PMR_HOME/conf:$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=/root/hadoop/lib/native:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PMR_SERVERDIR/../lib64/hadoop-2.7.2:$LD_LIBRARY_PATH
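
Before you continue, it can help to confirm that these variables are actually exported in the current shell. The following is a minimal sketch under that assumption only: `check_pig_env` is a hypothetical helper name, and the variable list is taken from the export commands above.

```shell
# check_pig_env (hypothetical helper): report every named variable that is
# missing from the exported environment, or exported but empty.
# Returns 0 when all variables are set, 1 otherwise.
check_pig_env() (
    status=0
    for name in "$@"; do
        # printenv prints the exported value; unset and empty both give "".
        if [ -z "$(printenv "$name")" ]; then
            echo "not set: $name"
            status=1
        fi
    done
    exit "$status"
)

# Check the variables that the export commands above define.
check_pig_env PIG_HOME HADOOP_HOME PIG_CLASSPATH LD_LIBRARY_PATH ||
    echo "export the variables listed above before running Pig"
```

The function body runs in a subshell, so the loop variables do not leak into the calling shell. Add further names to the argument list if your setup depends on them.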

Modify the bin/pig script.

Locate the following section (line 343) in the file:

# run it
if [ -n "$HADOOP_BIN" ]; then
    if [ "$debug" == "true" ]; then
        echo "Find hadoop at $HADOOP_BIN"
    fi

Change the section to the following:

# run it
if [[ -n "$HADOOP_BIN" && -z "$PMR_HOME" ]]; then
    if [ "$debug" == "true" ]; then
        echo "Find hadoop at $HADOOP_BIN"
    fi
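
The effect of the added `-z "$PMR_HOME"` test can be tried in isolation. The sketch below mirrors the modified condition in a hypothetical function, `takes_hadoop_branch`, written with plain `[ ]` tests that behave the same way for these two string checks. The paths are made-up sample values, and it is an assumption here that `PMR_HOME` is set by the cluster environment file.

```shell
# takes_hadoop_branch (hypothetical helper): mirrors the modified test.
# $1 stands in for HADOOP_BIN, $2 stands in for PMR_HOME.
takes_hadoop_branch() {
    # The branch is taken only when HADOOP_BIN is non-empty AND PMR_HOME is empty.
    if [ -n "$1" ] && [ -z "$2" ]; then
        echo "hadoop branch"
    else
        echo "fall through"
    fi
}

takes_hadoop_branch "/usr/bin/hadoop" ""          # prints: hadoop branch
takes_hadoop_branch "/usr/bin/hadoop" "/opt/pmr"  # prints: fall through
takes_hadoop_branch "" ""                         # prints: fall through
```

In other words, once `PMR_HOME` has a value, the script no longer enters this `if` branch even when a Hadoop binary is found.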

Source the IBM Spectrum Symphony cluster environment file, and then apply the Pig environment settings (the export commands) defined in step 3. To source the cluster environment file, run the following command from the $EGO_TOP directory:
```
source profile.platform
```

Create a Pig script named wordcount.pig as follows:

%default INPUT wc-in
%default OUTPUT out/pig-out-1
A = LOAD '$INPUT';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate group, COUNT(C);
F = ORDER E BY group;
STORE F INTO '$OUTPUT' USING PigStorage();
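
To sanity-check the results on a small sample, the same counting logic can be approximated locally with standard tools. This is a rough sketch, not an exact reimplementation: `local_wordcount` is a hypothetical helper, it assumes that TOKENIZE splits on space, double quote, comma, parentheses, and asterisk, and it assumes that the input lines contain no tab characters (the script reads only the first field, `$0`, of each loaded line).

```shell
# local_wordcount (hypothetical helper): approximate wordcount.pig on stdin.
local_wordcount() {
    # 1. Put each token on its own line by turning the assumed TOKENIZE
    #    separators into newlines.
    # 2. Keep only tokens made entirely of word characters, like the
    #    "matches '\\w+'" filter; tokens such as "end." are dropped.
    # 3. Sort in byte order and count each distinct word.
    # 4. Print "word<TAB>count", the layout that PigStorage writes by default.
    tr ' ",()*' '\n' |
        LC_ALL=C grep -E '^[A-Za-z0-9_]+$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

printf 'the cat, the hat\nthe end.\n' | local_wordcount
```

For this sample input, the function prints three tab-separated lines: `cat 1`, `hat 1`, and `the 3`. The token `end.` is filtered out because it contains a character that is not a word character.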

Run the Pig script:
```
$ pig wordcount.pig
```
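
The `%default` declarations in a Pig script only supply fallback values; the `-param name=value` option of the `pig` command overrides them at run time. The sketch below prints such a command line instead of running it, so that it can be reviewed first. `pig_cmd` and the sample paths are hypothetical, and the paths are assumed to contain no whitespace.

```shell
# pig_cmd (hypothetical helper): print a pig command line that overrides
# the INPUT and OUTPUT parameters of wordcount.pig.
# $1 is the input path, $2 is the output path.
pig_cmd() {
    printf 'pig -param INPUT=%s -param OUTPUT=%s wordcount.pig\n' "$1" "$2"
}

# Print the command for a second run with different sample paths.
pig_cmd wc-in-2 out/pig-out-2
```

The printed line is `pig -param INPUT=wc-in-2 -param OUTPUT=out/pig-out-2 wordcount.pig`, which you can copy and run in the shell where the environment files were sourced.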