Apache Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for writing data analysis programs, coupled with the MapReduce infrastructure for evaluating those programs. Pig provides a scripting language, Pig Latin, which Pig compiles into MapReduce programs. As a result, using Pig can reduce the time needed to develop new MapReduce applications.

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is set to use Pig. For the supported versions of Hadoop, see Supported distributed file systems for MapReduce or YARN integration. For the versions of Pig that the MapReduce framework in IBM Spectrum Symphony has been qualified with, see Supported third-party applications for MapReduce.

About this task

Follow these steps to run Pig applications with the MapReduce framework in IBM Spectrum Symphony.

Procedure

  1. Download and install Pig. The MapReduce framework in IBM Spectrum Symphony is qualified with Pig versions 0.12.1 and 0.13.0.
  2. Compile the Pig JAR file against Apache Hadoop MRv2 and create symbolic links:
    • For Pig 0.12.1:
      ln -s pig-0.12.1-withouthadoop-h2.jar pig.jar
    • For Pig 0.13.0:
      ln -s pig-0.13.0-withouthadoop-h2.jar pig-h1.jar

    For Pig documentation, refer to http://pig.apache.org/.

  3. Set the Pig environment variables:
    export PIG_HOME=/root/pig
    export HADOOP_HOME=/root/hadoop
    export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
    export PIG_CLASSPATH=$SOAM_SERVERDIR/../lib/*:$PMR_SERVERDIR/../lib/hadoop-2.7.2/*:$PMR_SERVERDIR/../lib/*:$PMR_HOME/conf:$HADOOP_HOME/etc/hadoop
    export LD_LIBRARY_PATH=/root/hadoop/lib/native:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=$PMR_SERVERDIR/../lib64/hadoop-2.7.2:$LD_LIBRARY_PATH
  4. Modify the bin/pig script.
    1. Locate the following section (line 343) in the file:
      # run it
      if [ -n "$HADOOP_BIN" ]; then
          if [ "$debug" == "true" ]; then
              echo "Find hadoop at $HADOOP_BIN"
          fi
      
    2. Change the section to the following, so that the branch is taken only when PMR_HOME is not set:
      # run it
      if [[ -n "$HADOOP_BIN" && -z "$PMR_HOME" ]]; then
          if [ "$debug" == "true" ]; then
              echo "Find hadoop at $HADOOP_BIN"
          fi
  5. Source the IBM Spectrum Symphony cluster environment file, and then apply the Pig environment settings defined in step 3. To source the cluster environment file, run the following command from the $EGO_TOP directory:
    source profile.platform
  6. Create a Pig script named wordcount.pig with the following content:
    %default INPUT wc-in
    %default OUTPUT out/pig-out-1
    A = LOAD '$INPUT';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = filter B by word matches '\\w+';
    D = group C by word;
    E = foreach D generate group, COUNT(C);
    F = ORDER E BY group;
    STORE F INTO '$OUTPUT' USING PigStorage();
  7. Run the Pig script:
    $ pig wordcount.pig
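
To sanity-check the output of wordcount.pig on a small input, the same counts can be approximated locally with standard shell tools. The following is a minimal sketch and is not part of the IBM Spectrum Symphony or Pig tooling. The function name local_wordcount is hypothetical, and the sketch assumes the default behavior of LOAD (tab-separated fields, so $0 is the first field) and of TOKENIZE (splitting on space, double quote, comma, parentheses, and asterisk).

```shell
# Hypothetical helper: approximates wordcount.pig for a small local text file.
# Usage: local_wordcount FILE
local_wordcount() {
    # -F'\t' and $1     : mimic LOAD, where $0 is the first tab-separated field
    # split(...)        : assumed default TOKENIZE delimiters
    # regex test        : mimic the filter "word matches '\\w+'"
    # c[...]++ and END  : mimic GROUP followed by COUNT
    awk -F'\t' '
        {
            n = split($1, t, /[ ",()*]+/)
            for (i = 1; i <= n; i++)
                if (t[i] ~ /^[A-Za-z0-9_]+$/)
                    c[t[i]]++
        }
        END {
            for (w in c)
                printf "%s\t%d\n", w, c[w]
        }
    ' "$1" | LC_ALL=C sort    # mimic ORDER E BY group
}
```

Running local_wordcount on a local copy of an input file prints one word and its count per line, separated by a tab. These lines can be compared with the files that Pig writes under out/pig-out-1. Treat any difference as a prompt to inspect the input for characters that this approximation handles differently from Pig.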