Running MapReduce workload with HDFS running outside of a Docker container

An IBM® Spectrum Symphony cluster running in a Docker container can support MapReduce workload.

About this task

You can run an IBM Spectrum Symphony cluster within a Docker container and configure that cluster to support MapReduce workload. The user data is located on the host operating system. By mounting this user data in the Kubernetes YAML file, you can run HDFS on the host operating system while the MapReduce workload runs within the Docker container.

Procedure

  1. Install and configure Apache HDFS on all hosts. For details on how to install and configure Apache HDFS, see Installing and configuring Apache HDFS.
    Note: If you are using Hadoop version 2.x, define the name of the MapReduce framework in the mapred-site.xml file as follows:
    <configuration>
      <property>       
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
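    To confirm that HDFS is up on the host before you create the containers, you can run a NameNode report (a quick sanity check; the path assumes the Hadoop installation directory used in the YAML files below):

      /opt/hadoop-2.6.0/bin/hdfs dfsadmin -report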
  2. Create the IBM Spectrum Symphony primary node and compute nodes with Kubernetes YAML files. Kubernetes creates a Docker container from each YAML file and runs the container start command defined within it:
    Note: When configuring the Kubernetes YAML files for the primary and compute nodes, mount HADOOP_HOME, JAVA_HOME, and the host user and group files to the container so that the container has the same Hadoop and egoadmin users as the host operating system, avoiding permission issues. This also ensures that the container and the host operating system share the same HDFS configuration.
    1. Create the IBM Spectrum Symphony primary node using the kubectl command:

      kubectl create -f sym-primary-rc.yaml

      The following is an example of the sym-primary-rc.yaml file:

      kind: ReplicationController
      apiVersion: v1
      metadata:
        name: sym-primary
      spec:
        replicas: 1
        selector:
          component: sym-primary
        template:
          metadata:
            labels:
              component: sym-primary
          spec:
            containers:
              - name: sym-primary
                image: sym711
                command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; egoconfig join `hostname` -f; 
                         egoconfig setentitlement /opt/platform_sym_adv_entitlement.dat;
                         egosh ego start; sudo /usr/sbin/sshd -D"]         
                volumeMounts:
                  - name: hadoop
                    mountPath: /opt/hadoop-2.6.0
                    readOnly: false
                  - name: java
                    mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
                    readOnly: true
                  - name: hadoopdir
                    mountPath: /opt/hadoop
                    readOnly: false
                  - name: hadoopuser
                    mountPath: /etc/passwd
                    readOnly: true
                  - name: hadoopgroup
                    mountPath: /etc/group
                    readOnly: true         
                resources:           
                  requests:
                    memory: 4096M           
                  limits:
                    memory: 8192M
            volumes:
              - name: hadoop         
                hostPath:
                  path: /opt/hadoop-2.6.0
              - name: java
                hostPath:
                  path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
              - name: hadoopdir         
                hostPath:
                  path: /opt/hadoop
              - name: hadoopuser         
                hostPath:
                  path: /etc/passwd
              - name: hadoopgroup         
                hostPath:
                  path: /etc/group
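    You can optionally verify that the replication controller is running and that the mounted /etc/passwd gives the container the same users as the host (a sketch; the pod name placeholder and the egoadmin account are assumptions based on the note above):

      kubectl get rc sym-primary
      kubectl exec <primary-pod> -- id egoadmin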
    2. After the primary node is created and running, get the primary host name and IP address:
      • Run kubectl get pods to get the host name.
        NAME                READY     STATUS    RESTARTS   AGE
        sym-primary-04yws   1/1       Running   0          46m
      • Run kubectl describe pod sym-primary-04yws | grep IP to get the IP address.
        IP:                10.32.0.1
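      If you prefer to capture both values in shell variables for the next step, a minimal sketch (assuming the component=sym-primary label from the YAML file above):

        PRIMARY_POD=$(kubectl get pods -l component=sym-primary -o name | sed 's|^pods\?/||')
        PRIMARY_IP=$(kubectl describe pod $PRIMARY_POD | grep '^IP:' | awk '{print $2}')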
    3. Manually add the IBM Spectrum Symphony primary host name and IP address to the command in the compute node YAML file. For example:
      command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; 
                echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; 
                egosh ego start; sudo /usr/sbin/sshd -D"]
    4. Create the IBM Spectrum Symphony compute node:

      kubectl create -f sym-compute-rc.yaml

      replicationcontroller "sym-compute" created

      The following is an example of the sym-compute-rc.yaml file:

      kind: ReplicationController
      apiVersion: v1
      metadata:
        name: sym-compute
      spec:
        replicas: 2
        selector:
          component: sym-compute
        template:
          metadata:
            labels:
              component: sym-compute
          spec:
            containers:
              - name: sym-compute
                image: sym711
                command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; 
                         echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; 
                         egosh ego start; sudo /usr/sbin/sshd -D"]        
                volumeMounts:
                  - name: hadoop
                    mountPath: /opt/hadoop-2.6.0
                    readOnly: false
                  - name: java
                    mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
                    readOnly: true
                  - name: hadoopdir
                    mountPath: /opt/hadoop
                    readOnly: false
                  - name: hadoopuser
                    mountPath: /etc/passwd
                    readOnly: true
                  - name: hadoopgroup
                    mountPath: /etc/group
                    readOnly: true         
                resources:           
                  requests:
                    memory: 4096M           
                  limits:
                    memory: 8192M
            volumes:
              - name: hadoop         
                hostPath:
                  path: /opt/hadoop-2.6.0
              - name: java
                hostPath:
                  path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
              - name: hadoopdir         
                hostPath:
                  path: /opt/hadoop
              - name: hadoopuser         
                hostPath:
                  path: /etc/passwd
              - name: hadoopgroup         
                hostPath:
                  path: /etc/group
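    After the compute pods start, you can confirm that they joined the cluster by logging in to the primary container and listing EGO resources (a sketch; the container ID placeholder and the default Admin credentials are assumptions and may differ in your cluster):

      docker exec -ti <primary_container_id> /bin/bash
      egosh user logon -u Admin -x Admin
      egosh resource list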
  3. Integrate MapReduce and HDFS on all IBM Spectrum Symphony nodes:
    Note: If MapReduce was already integrated during IBM Spectrum Symphony installation, skip this step.
    1. Modify the ${PMR_HOME}/conf/pmr-env.sh file as follows:
      export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
      export HADOOP_VERSION=2_6_0
      export PMR_EXTERNAL_CONFIG_PATH=/opt/hadoop-2.6.0/etc/hadoop
      export JAVA_LIBRARY_PATH=/opt/hadoop-2.6.0/lib/native:/opt/hadoop-2.6.0/lib/native/Linux-amd64-64/:/opt/hadoop-2.6.0/lib/native/Linux-i386-32/
      export CLOUDERA_HOME=/opt/hadoop-2.6.0
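    To check that these settings resolve inside the container, you can source the file and confirm the paths exist (a quick sanity check, not part of the integration itself):

      source ${PMR_HOME}/conf/pmr-env.sh
      ls $PMR_EXTERNAL_CONFIG_PATH
      $JAVA_HOME/bin/java -version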
    2. Modify the ${PMR_HOME}/conf/core-site.xml.hdfs file as follows:
      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://${HDFS_HOST}.eng.platformlab.ibm.com:9000/</value>
        </property>
      </configuration>
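    With ${HDFS_HOST} replaced by your NameNode host, you can verify that the container reaches HDFS before submitting workload (a sketch using the Hadoop installation mounted earlier):

      /opt/hadoop-2.6.0/bin/hdfs dfs -ls hdfs://${HDFS_HOST}.eng.platformlab.ibm.com:9000/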
  4. Submit MapReduce workload:
    1. Log in to the Docker container that you are submitting MapReduce workload to:

      docker exec -ti $container_id /bin/bash

    2. Put files into HDFS storage:

      ./bin/hdfs dfs -put ./README.txt /readme

      ./bin/hdfs dfs -ls /

      Found 1 item
      -rw-r--r--   1 hadoop supergroup   1366 2016-04-23 21:47 /readme

    3. Run a word count sample:

      mrsh jar /opt/ibm/platform/soam/mapreduce/7.1.1/linux-x86_64/samples/hadoop-mapreduce-examples-2.4.1.jar wordcount /readme /readme.out

    4. When the job finishes, verify that the MapReduce output was written to HDFS:

      ./bin/hdfs dfs -ls /readme.out

      Found 2 items
      -rw-r--r--   1 hadoop supergroup   0 2016-04-24 02:00 /readme.out/_SUCCESS
      -rw-r--r--   1 hadoop supergroup   9 2016-04-24 02:00 /readme.out/part-r-00000
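      To view the word count results, you can read the output file back from HDFS:

        ./bin/hdfs dfs -cat /readme.out/part-r-00000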

  5. If you want to run IBM Spectrum Symphony in Advanced VEM mode, update the following three items:
    1. Before starting Docker, disable SELinux in the /etc/sysconfig/docker Docker configuration file:

      OPTIONS='--selinux-enabled=false'

    2. Before starting Kubernetes, enable root privileges for containers in the /etc/kubernetes/config configuration file:

      KUBE_ALLOW_PRIV="--allow-privileged=true"

    3. When creating replication controllers in the YAML files, add the following security parameter for containers:
      securityContext:
        privileged: true
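      In context, the parameter belongs under each container entry; for example, in sym-compute-rc.yaml (only the relevant lines shown):

        containers:
          - name: sym-compute
            image: sym711
            securityContext:
              privileged: true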

Results

The MapReduce workload runs within the Docker container while HDFS runs on the host operating system.