Running MapReduce workload with HDFS running outside of a Docker container

An IBM® Spectrum Symphony cluster running in a Docker container can support MapReduce workload.

About this task

You can run an IBM Spectrum Symphony cluster within a Docker container and configure that cluster to support MapReduce workload. The user data is located on the host operating system. By mounting this user data in the Kubernetes YAML file, you can run HDFS on the host operating system while the MapReduce workload runs within the Docker container.

Procedure

  1. Install and configure Apache HDFS on all hosts. For details on how to install and configure Apache HDFS, see Installing and configuring Apache HDFS.
    Note: If you are using Hadoop version 2.x, define the name of the MapReduce framework in the mapred-site.xml file as follows:
    <configuration>
      <property>       
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
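    To confirm that HDFS is up on the host before you create the containers, you can run a NameNode report (a quick sanity check; the path assumes the Hadoop installation directory used in the YAML files below):

      /opt/hadoop-2.6.0/bin/hdfs dfsadmin -report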
  2. Create the IBM Spectrum Symphony primary node and compute nodes with Kubernetes YAML files. Kubernetes creates a Docker container from each YAML file and runs the container start command defined within it:
    Note: When configuring the Kubernetes YAML files for the primary and compute nodes, mount HADOOP_HOME, JAVA_HOME, and the host user and group files to the container so that the container has the same Hadoop and egoadmin users as the host operating system, avoiding permission issues. This also ensures that the container and the host operating system share the same HDFS configuration.
    1. Create the IBM Spectrum Symphony primary node using the kubectl command:

      kubectl create -f sym-primary-rc.yaml

      The following is an example of the sym-primary-rc.yaml file:

      kind: ReplicationController
      apiVersion: v1
      metadata:
        name: sym-primary
      spec:
        replicas: 1
        selector:
          component: sym-primary
        template:
          metadata:
            labels:
              component: sym-primary
          spec:
            containers:
              - name: sym-primary
                image: sym711
                command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; egoconfig join `hostname` -f; 
                         egoconfig setentitlement /opt/platform_sym_adv_entitlement.dat;
                         egosh ego start; sudo /usr/sbin/sshd -D"]         
                volumeMounts:
                  - name: hadoop
                    mountPath: /opt/hadoop-2.6.0
                    readOnly: false
                  - name: java
                    mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
                    readOnly: true
                  - name: hadoopdir
                    mountPath: /opt/hadoop
                    readOnly: false
                  - name: hadoopuser
                    mountPath: /etc/passwd
                    readOnly: true
                  - name: hadoopgroup
                    mountPath: /etc/group
                    readOnly: true         
                resources:           
                  requests:
                    memory: 4096M           
                  limits:
                    memory: 8192M
            volumes:
              - name: hadoop         
                hostPath:
                  path: /opt/hadoop-2.6.0
              - name: java
                hostPath:
                  path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
              - name: hadoopdir         
                hostPath:
                  path: /opt/hadoop
              - name: hadoopuser         
                hostPath:
                  path: /etc/passwd
              - name: hadoopgroup         
                hostPath:
                  path: /etc/group
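    You can optionally verify that the replication controller is running and that the mounted /etc/passwd gives the container the same users as the host (a sketch; the pod name placeholder and the egoadmin account are assumptions based on the note above):

      kubectl get rc sym-primary
      kubectl exec <primary-pod> -- id egoadmin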
    2. After the primary node is created and running, get the primary host name and IP address:
      • Run kubectl get pods to get the host name.
        NAME                READY     STATUS    RESTARTS   AGE
        sym-primary-04yws   1/1       Running   0          46m
      • Run kubectl describe pod sym-primary-04yws | grep IP to get the IP address.
        IP:                10.32.0.1
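      If you prefer to capture both values in shell variables for the next step, a minimal sketch (assuming the component=sym-primary label from the YAML file above):

        PRIMARY_POD=$(kubectl get pods -l component=sym-primary -o name | sed 's|^pods\?/||')
        PRIMARY_IP=$(kubectl describe pod $PRIMARY_POD | grep '^IP:' | awk '{print $2}')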
    3. Manually add the IBM Spectrum Symphony primary host name and IP address to the command in the compute node YAML file. For example:
      command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; 
                echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; 
                egosh ego start; sudo /usr/sbin/sshd -D"]
    4. Create the IBM Spectrum Symphony compute node:

      kubectl create -f sym-compute-rc.yaml

      replicationcontroller "sym-compute" created

      The following is an example of the sym-compute-rc.yaml file:

      kind: ReplicationController
      apiVersion: v1
      metadata:
        name: sym-compute
      spec:
        replicas: 2
        selector:
          component: sym-compute
        template:
          metadata:
            labels:
              component: sym-compute
          spec:
            containers:
              - name: sym-compute
                image: sym711
                command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; 
                         echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; 
                         egosh ego start; sudo /usr/sbin/sshd -D"]        
                volumeMounts:
                  - name: hadoop
                    mountPath: /opt/hadoop-2.6.0
                    readOnly: false
                  - name: java
                    mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
                    readOnly: true
                  - name: hadoopdir
                    mountPath: /opt/hadoop
                    readOnly: false
                  - name: hadoopuser
                    mountPath: /etc/passwd
                    readOnly: true
                  - name: hadoopgroup
                    mountPath: /etc/group
                    readOnly: true         
                resources:           
                  requests:
                    memory: 4096M           
                  limits:
                    memory: 8192M
            volumes:
              - name: hadoop         
                hostPath:
                  path: /opt/hadoop-2.6.0
              - name: java
                hostPath:
                  path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
              - name: hadoopdir         
                hostPath:
                  path: /opt/hadoop
              - name: hadoopuser         
                hostPath:
                  path: /etc/passwd
              - name: hadoopgroup         
                hostPath:
                  path: /etc/group
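    After the compute pods start, you can confirm that they joined the cluster by logging in to the primary container and listing EGO resources (a sketch; the container ID placeholder and the default Admin credentials are assumptions and may differ in your cluster):

      docker exec -ti <primary_container_id> /bin/bash
      egosh user logon -u Admin -x Admin
      egosh resource list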
  3. Integrate MapReduce and HDFS on all IBM Spectrum Symphony nodes:
    Note: If MapReduce was already integrated during IBM Spectrum Symphony installation, skip this step.
    1. Modify the ${PMR_HOME}/conf/pmr-env.sh file as follows:
      export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
      export HADOOP_VERSION=2_6_0
      export PMR_EXTERNAL_CONFIG_PATH=/opt/hadoop-2.6.0/etc/hadoop
      export JAVA_LIBRARY_PATH=/opt/hadoop-2.6.0/lib/native:/opt/hadoop-2.6.0/lib/native/Linux-amd64-64/:/opt/hadoop-2.6.0/lib/native/Linux-i386-32/
      export CLOUDERA_HOME=/opt/hadoop-2.6.0
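    To check that these settings resolve inside the container, you can source the file and confirm the paths exist (a quick sanity check, not part of the integration itself):

      source ${PMR_HOME}/conf/pmr-env.sh
      ls $PMR_EXTERNAL_CONFIG_PATH
      $JAVA_HOME/bin/java -version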
    2. Modify the ${PMR_HOME}/conf/core-site.xml.hdfs file as follows:
      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://${HDFS_HOST}.eng.platformlab.ibm.com:9000/</value>
        </property>
      </configuration>
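    With ${HDFS_HOST} replaced by your NameNode host, you can verify that the container reaches HDFS before submitting workload (a sketch using the Hadoop installation mounted earlier):

      /opt/hadoop-2.6.0/bin/hdfs dfs -ls hdfs://${HDFS_HOST}.eng.platformlab.ibm.com:9000/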
  4. Submit MapReduce workload:
    1. Log in to the Docker container that you are submitting MapReduce workload to:

      docker exec -ti $container_id /bin/bash

    2. Put files into HDFS storage:

      ./bin/hdfs dfs -put ./README.txt /readme

      ./bin/hdfs dfs -ls /

      Found 1 item
      -rw-r--r--   1 hadoop supergroup   1366 2016-04-23 21:47 /readme

    3. Run a word count sample:

      mrsh jar /opt/ibm/platform/soam/mapreduce/7.1.1/linux-x86_64/samples/hadoop-mapreduce-examples-2.4.1.jar wordcount /readme /readme.out

    4. When the job finishes, verify that the MapReduce output was written to HDFS:

      ./bin/hdfs dfs -ls /readme.out

      Found 2 items
      -rw-r--r--   1 hadoop supergroup   0 2016-04-24 02:00 /readme.out/_SUCCESS
      -rw-r--r--   1 hadoop supergroup   9 2016-04-24 02:00 /readme.out/part-r-00000
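      To view the word count results, you can read the output file back from HDFS:

        ./bin/hdfs dfs -cat /readme.out/part-r-00000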

  5. If you want to run IBM Spectrum Symphony in Advanced VEM mode, update the following three items:
    1. Before starting Docker, disable SELinux in the /etc/sysconfig/docker Docker configuration file:

      OPTIONS='--selinux-enabled=false'

    2. Before starting Kubernetes, enable root privileges for containers in the /etc/kubernetes/config configuration file:

      KUBE_ALLOW_PRIV="--allow-privileged=true"

    3. When creating replication controllers in the YAML files, add the following security parameter for containers:
      securityContext:
        privileged: true
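      In context, the parameter belongs under each container entry; for example, in sym-compute-rc.yaml (only the relevant lines shown):

        containers:
          - name: sym-compute
            image: sym711
            securityContext:
              privileged: true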

Results

The MapReduce workload runs within the Docker container while HDFS runs on the host operating system.