Running MapReduce workload with HDFS running outside of a Docker container
An IBM® Spectrum Symphony cluster running in a Docker container can support MapReduce workload.
About this task
You can run an IBM Spectrum Symphony cluster within a Docker container and configure that cluster to support MapReduce workload. The user data is located on the host operating system. By mounting this user data in the Kubernetes YAML file, you can run HDFS on the host operating system while the MapReduce workload runs within the Docker container.
Procedure
-
Install and configure Apache HDFS on all hosts. For details, see Installing and configuring Apache HDFS.
Note: If you are using Hadoop version 2.x, define the name of the MapReduce framework in the mapred-site.xml file as follows:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
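Before moving on, you can optionally confirm that the HDFS daemons are running on each host. This check is not part of the original procedure; it assumes the Hadoop 2.6.0 layout under /opt/hadoop-2.6.0 used in the examples below, with a JDK's jps on the PATH:
# Optional check: NameNode and DataNode should appear among the running JVMs
jps
# Print an HDFS capacity and health report
/opt/hadoop-2.6.0/bin/hdfs dfsadmin -report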
-
Create the IBM Spectrum Symphony primary node and compute node with the Kubernetes YAML file. Kubernetes creates a Docker container from the YAML file and runs the container start command defined within the YAML file:
Note: When configuring the Kubernetes YAML files for the primary and compute nodes, mount HADOOP_HOME, JAVA_HOME, and the host's user and group files into the container so that the container has the same Hadoop and egoadmin users as the host operating system, avoiding permission issues. This also ensures that the container and the host operating system share the same HDFS configuration.
-
Create the IBM Spectrum Symphony primary node using the kubectl command:
kubectl create -f sym-primary-rc.yaml
The following is an example of the sym-primary-rc.yaml file:
kind: ReplicationController
apiVersion: v1
metadata:
  name: sym-primary
spec:
  replicas: 1
  selector:
    component: sym-primary
  template:
    metadata:
      labels:
        component: sym-primary
    spec:
      containers:
      - name: sym-primary
        image: sym711
        command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; egoconfig join `hostname` -f; egoconfig setentitlement /opt/platform_sym_adv_entitlement.dat; egosh ego start; sudo /usr/sbin/sshd -D"]
        volumeMounts:
        - name: hadoop
          mountPath: /opt/hadoop-2.6.0
          readOnly: false
        - name: java
          mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
          readOnly: true
        - name: hadoopdir
          mountPath: /opt/hadoop
          readOnly: false
        - name: hadoopuser
          mountPath: /etc/passwd
          readOnly: true
        - name: hadoopgroup
          mountPath: /etc/group
          readOnly: true
        resources:
          requests:
            memory: 4096M
          limits:
            memory: 8192M
      volumes:
      - name: hadoop
        hostPath:
          path: /opt/hadoop-2.6.0
      - name: java
        hostPath:
          path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
      - name: hadoopdir
        hostPath:
          path: /opt/hadoop
      - name: hadoopuser
        hostPath:
          path: /etc/passwd
      - name: hadoopgroup
        hostPath:
          path: /etc/group
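As an optional sanity check (not part of the original steps), you can verify that the mounted /etc/passwd gives the container the same users as the host. This assumes, as in the examples here, that egoadmin and hadoop users exist on the host; substitute the pod name reported by kubectl get pods:
# Both users should resolve to the same IDs as on the host operating system
kubectl exec sym-primary-rc-04yws -- id egoadmin
kubectl exec sym-primary-rc-04yws -- id hadoop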
-
After the primary node is created and running, get the primary host name and IP address:
- Run kubectl get pods to get the host name.
NAME                   READY     STATUS    RESTARTS   AGE
sym-primary-rc-04yws   1/1       Running   0          46m
- Run kubectl describe pod sym-primary-rc-04yws | grep IP to get the IP address.
IP: 10.32.0.1
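Alternatively, depending on your kubectl version (an assumption; this flag is not part of the original steps), you can retrieve the host name and IP address in a single call:
kubectl get pods -o wide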
-
Manually add the IBM Spectrum Symphony primary host name and IP address to the command in the compute node YAML file. For example:
command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; egosh ego start; sudo /usr/sbin/sshd -D"]
-
Create the IBM Spectrum Symphony compute node:
kubectl create -f sym-compute-rc.yaml
replicationcontroller "sym-compute-rc" created
The following is an example of the sym-compute-rc.yaml file:
kind: ReplicationController
apiVersion: v1
metadata:
  name: sym-compute
spec:
  replicas: 2
  selector:
    component: sym-compute
  template:
    metadata:
      labels:
        component: sym-compute
    spec:
      containers:
      - name: sym-compute
        image: sym711
        command: ["/bin/sh", "-c", " source /opt/ibm/platform/profile.platform; sudo chmod 777 /etc/hosts; echo '10.32.0.1 sym-primary-04yws' >> /etc/hosts; egoconfig join sym-primary-04yws -f; egosh ego start; sudo /usr/sbin/sshd -D"]
        volumeMounts:
        - name: hadoop
          mountPath: /home/jmlv/yarn/hadoop-2.6.0
          readOnly: false
        - name: java
          mountPath: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
          readOnly: true
        - name: hadoopdir
          mountPath: /opt/hadoop
          readOnly: false
        - name: hadoopuser
          mountPath: /etc/passwd
          readOnly: true
        - name: hadoopgroup
          mountPath: /etc/group
          readOnly: true
        resources:
          requests:
            memory: 4096M
          limits:
            memory: 8192M
      volumes:
      - name: hadoop
        hostPath:
          path: /home/jmlv/yarn/hadoop-2.6.0
      - name: java
        hostPath:
          path: /pcc/app/IBM_jdk1.7/Linux-ibm-java-x86_64-70/
      - name: hadoopdir
        hostPath:
          path: /opt/hadoop
      - name: hadoopuser
        hostPath:
          path: /etc/passwd
      - name: hadoopgroup
        hostPath:
          path: /etc/group
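As an optional verification (not in the original procedure), you can check from the primary container that the compute hosts have joined the EGO cluster; the pod name is the one reported by kubectl get pods:
# Lists the hosts known to the EGO cluster; the compute pods should appear
kubectl exec sym-primary-rc-04yws -- /bin/sh -c "source /opt/ibm/platform/profile.platform; egosh resource list"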
-
Integrate MapReduce and HDFS on all IBM Spectrum Symphony nodes:
Note: If MapReduce was already integrated during IBM Spectrum Symphony installation, skip this step.
-
Modify the ${PMR_HOME}/conf/pmr-env.sh file as follows:
export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
export HADOOP_VERSION=2_6_0
export PMR_EXTERNAL_CONFIG_PATH=/opt/hadoop-2.6.0/etc/hadoop
export JAVA_LIBRARY_PATH=/opt/hadoop-2.6.0/lib/native:/opt/hadoop-2.6.0/lib/native/Linux-amd64-64/:/opt/hadoop-2.6.0/lib/native/Linux-i386-32/
export CLOUDERA_HOME=/opt/hadoop-2.6.0
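As a quick sanity check (optional; not part of the original steps), confirm that the directories referenced by these settings actually exist on the node:
# Optional: verify the configured Hadoop paths exist
ls /opt/hadoop-2.6.0/etc/hadoop
ls /opt/hadoop-2.6.0/lib/native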
-
Modify the ${PMR_HOME}/conf/core-site.xml.hdfs file as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${HDFS_HOST}.eng.platformlab.ibm.com:9000/</value>
  </property>
</configuration>
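To confirm that a container can actually reach the namenode through this URI (an optional check; replace the host name with your real ${HDFS_HOST} value):
# Optional connectivity check from inside a container; substitute your namenode host
/opt/hadoop-2.6.0/bin/hdfs dfs -ls hdfs://<HDFS_HOST>.eng.platformlab.ibm.com:9000/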
-
Submit MapReduce workload:
-
Log in to the Docker container that you are submitting MapReduce workload to:
docker exec -ti $container_id /bin/bash
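If you do not have the container ID at hand (finding it is not shown in the original steps), you can look it up on the Docker host:
# List running containers created from the sym711 image
docker ps | grep sym711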
-
Put files into HDFS storage:
./bin/hdfs dfs -put ./README.txt /readme
./bin/hdfs dfs -ls /
Found 1 item
-rw-r--r--   1 hadoop supergroup       1366 2016-04-23 21:47 /readme
-
Run a word count sample:
mrsh jar /opt/ibm/platform/soam/mapreduce/7.1.1/linux-x86_64/samples/hadoop-mapreduce-examples-2.4.1.jar wordcount /readme /readme.out
-
When the job is finished, the MapReduce output is written to HDFS:
./bin/hdfs dfs -ls /readme.out
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2016-04-24 02:00 /readme.out/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 9 2016-04-24 02:00 /readme.out/part-r-00000
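To inspect the word counts themselves, you can print the result file with a standard HDFS command (shown here as a convenience; not part of the original steps):
./bin/hdfs dfs -cat /readme.out/part-r-00000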
-
If you want to run IBM Spectrum Symphony in Advanced VEM mode, the following three items need to be updated:
- Before starting Docker, disable SELinux in the /etc/sysconfig/docker Docker configuration file:
OPTIONS='--selinux-enabled=false'
- Before starting Kubernetes, enable root privileges for containers in the /etc/kubernetes/config configuration file:
KUBE_ALLOW_PRIV="--allow-privileged=true"
- When creating replication controllers in the YAML file, add the following security parameter for containers:
securityContext:
  privileged: true
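For context, a minimal sketch (abbreviated from the sym-primary example above; only the securityContext lines are additions) showing where the parameter sits in the container specification:
spec:
  containers:
  - name: sym-primary
    image: sym711
    securityContext:
      privileged: true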