Apache Hadoop 3.0.x Support
Apache Hadoop 3.0.x is supported with HDFS Transparency 3.0.0. When you use Apache Hadoop, the configuration files of HDFS Transparency are located under /var/mmfs/hadoop/etc/hadoop. By default, the logs of HDFS Transparency are located under /var/log/transparency/.
If you want to run Apache Hadoop with HDFS Transparency 3.0, execute the following steps:
- Set the open-file limit (ulimit nofile) to 64K on all the nodes.
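One persistent way to set this limit is through /etc/security/limits.conf; the entries below are an example sketch (adjust the scope and values to your site policy):

```
# Example /etc/security/limits.conf entries raising the open-file
# limit to 64K (65536) for all users on the node.
*    soft    nofile    65536
*    hard    nofile    65536
```

After editing, log in again and verify the limit with ulimit -n.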
- Set up an NTP server to synchronize the time on all nodes.
- Configure password-less root ssh access from the NameNodes to all DataNodes.
For more details, see Passwordless ssh access.
- Install HDFS Transparency 3.0.0-x (gpfs.hdfs-protocol-3.0.0-x.<arch>.rpm) on all HDFS Transparency nodes.
- ssh to TransparencyNode1.
- Update the /var/mmfs/hadoop/etc/hadoop/core-site.xml with your NameNode hostname:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://c8f2n04:8020</value>
  </property>
</configuration>
- Update the /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml according to your configuration:

Configuration                 | Default | Recommendation                     | Comment
dfs.replication               | 1       | 1, 2, or 3                         | Check your file system with mmlsfs <fs-name> -r and update this configuration according to the value from mmlsfs.
dfs.blocksize                 | N/A     | 134217728, 268435456, or 536870912 | Usually, 128 MB (134217728) is used; 512 MB (536870912) might be used for an IBM Storage Scale System file system.
dfs.client.read.shortcircuit  | false   | true                               | See Short-circuit read (SSR).

For all other configurations, keep the default values.
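As a sketch, an hdfs-site.xml combining these recommendations might look like the following (the replication value 2 is only an example; use the value reported by mmlsfs for your file system):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- match the -r value reported by mmlsfs <fs-name> -r -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB = 128 * 1024 * 1024 bytes -->
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value> <!-- see Short-circuit read (SSR) -->
  </property>
</configuration>
```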
- Update the /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml for the gpfs.mnt.dir, gpfs.data.dir and gpfs.storage.type configurations.
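For illustration, a gpfs-site.xml might look like the following sketch; the mount point, data directory, and storage type shown here are placeholders, so substitute the values for your deployment:

```xml
<configuration>
  <property>
    <name>gpfs.mnt.dir</name>
    <value>/ibm/gpfs1</value> <!-- placeholder: mount point of your file system -->
  </property>
  <property>
    <name>gpfs.data.dir</name>
    <value>hadoop</value> <!-- placeholder: data directory under the mount point -->
  </property>
  <property>
    <name>gpfs.storage.type</name>
    <value>shared</value> <!-- example value: set according to your storage model -->
  </property>
</configuration>
```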
- Update the /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh and change export JAVA_HOME= to export JAVA_HOME=<your-real-JDK8-Home-Dir>.
- Update the /var/mmfs/hadoop/etc/hadoop/workers to add DataNodes. One DataNode hostname per line.
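For example, a workers file for a cluster with three DataNodes could look like this (the hostnames are placeholders for illustration):

```
c8f2n05
c8f2n06
c8f2n07
```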
- Synchronize all these changes to the other DataNodes by executing the following command:
/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop
- Start HDFS Transparency by executing mmhadoopctl:
/usr/lpp/mmfs/bin/mmhadoopctl connector start
- Check the service status of HDFS Transparency by executing mmhadoopctl:
/usr/lpp/mmfs/bin/mmhadoopctl connector getstate
Note: If HDFS Transparency is not up on some nodes, log in to those nodes and check the logs located under /var/log/transparency. If you do not see any errors, HDFS Transparency should now be up.

If you want to configure Yarn, execute the following steps:
- Download Apache Hadoop 3.0.x from the Apache Hadoop website.
- Unzip the packages to /opt/hadoop-3.0.x on HadoopNode1.
- Log in to HadoopNode1.
- Copy the hadoop-env.sh, hdfs-site.xml, and workers from /var/mmfs/hadoop/etc/hadoop on HDFS Transparency node to HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop/.
- Copy /usr/lpp/mmfs/hadoop/template/mapred-site.xml.template and /usr/lpp/mmfs/hadoop/template/yarn-site.xml.template from HDFS Transparency node into HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop as mapred-site.xml and yarn-site.xml.
- Update /opt/hadoop-3.0.x/etc/hadoop/mapred-site.xml with the correct path location for the yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env configurations. For example, change the value from HADOOP_MAPRED_HOME=/opt/hadoop-3.0.2 to HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x.
Note: /opt/hadoop-3.0.x is the real installation location of Hadoop.
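As a sketch, the three properties in mapred-site.xml would look like the following, assuming Hadoop is installed under /opt/hadoop-3.0.x:

```xml
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
</property>
```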
- Update /opt/hadoop-3.0.x/etc/hadoop/yarn-site.xml. In particular, configure the correct hostname for yarn.resourcemanager.hostname.
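For example, with the Resource Manager running on a hypothetical host named rmnode1, the property would look like this:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rmnode1</value> <!-- placeholder: replace with your Resource Manager hostname -->
</property>
```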
- Synchronize /opt/hadoop-3.0.x from HadoopNode1 to all other Hadoop nodes and keep the same location on all hosts.
- On the Resource Manager node, run the following commands to start the Yarn service:
#cd /opt/hadoop-3.0.x/sbin/
#export YARN_NODEMANAGER_USER=root
#export YARN_RESOURCEMANAGER_USER=root
#./start-yarn.sh
Note: By default, the logs for the Yarn service are under /opt/hadoop-3.0.x/logs. If you plan to start the Yarn services as another user, change the user root in the above commands to your target user name.
- Run the following commands to submit a word count job:
#/opt/hadoop-3.0.x/bin/hadoop dfs -put /etc/passwd /passwd
#/opt/hadoop-3.0.x/bin/hadoop jar /opt/hadoop-3.0.x/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.2.jar wordcount /passwd /results
If the word count job completes successfully, the Yarn service is working correctly.
For more information, see IBM Storage Scale Hadoop performance tuning guide.