Apache Hadoop 3.0.x Support

Apache Hadoop 3.0.x is supported with HDFS Transparency 3.0.0. When you use Apache Hadoop, the configuration files of HDFS Transparency are located under /var/mmfs/hadoop/etc/hadoop. By default, the logs of HDFS Transparency are located under /var/log/transparency/.

If you want to run Apache Hadoop with HDFS Transparency 3.0, execute the following steps:
  1. Set the ulimit nofile value to 64K on all nodes.
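    For example, on RHEL-based systems you can raise the limit persistently in /etc/security/limits.conf (a minimal sketch; adjust to your distribution):
      # Append to /etc/security/limits.conf on every node, then log in again:
      *  soft  nofile  65536
      *  hard  nofile  65536
      # Verify the new limit:
      ulimit -n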
  2. Set up an NTP server to synchronize the time on all nodes.
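    For example, with ntpd on RHEL 7 (a sketch; substitute chronyd or your own NTP setup as appropriate):
      yum install -y ntp
      systemctl enable --now ntpd
      ntpq -p    # verify that the peers are reachable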
  3. Configure passwordless ssh access for root from the NameNodes to all DataNodes.

    For more details, see Passwordless ssh access.
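
    A minimal sketch, run as root on each NameNode (the hostnames are examples):
      ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # skip if a key already exists
      for host in c8f2n05 c8f2n06 c8f2n07; do
        ssh-copy-id root@$host
      done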

  4. Install HDFS Transparency 3.0.0-x (gpfs.hdfs-protocol-3.0.0-x.<arch>.rpm) on all HDFS Transparency nodes.
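    For example (the package file name is a placeholder; use the actual version and architecture):
      rpm -ivh gpfs.hdfs-protocol-3.0.0-x.<arch>.rpm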
  5. ssh to TransparencyNode1.
  6. Update the /var/mmfs/hadoop/etc/hadoop/core-site.xml with your NameNode hostname.
    
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://c8f2n04:8020</value>
      </property>
    </configuration>
    
  7. Update the /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml according to your configuration.
    dfs.replication
      Default: 1. Recommendation: 1, 2, or 3. Check your file system with
      mmlsfs <fs-name> -r and update this configuration according to the
      value reported by mmlsfs.
    dfs.blocksize
      Default: N/A. Recommendation: 134217728, 268435456, or 536870912.
      Usually, 128 MB (134217728) is used; 512 MB (536870912) might be
      used for an IBM Storage Scale System file system.
    dfs.client.read.shortcircuit
      Default: false. Recommendation: true. See Short-circuit read (SSR).

    For the other configurations, keep the default values.
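
    For example, the three properties above might look like this in hdfs-site.xml (the values are illustrative; set them according to the table):
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>
        <property>
          <name>dfs.blocksize</name>
          <value>134217728</value>
        </property>
        <property>
          <name>dfs.client.read.shortcircuit</name>
          <value>true</value>
        </property>
      </configuration>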

  8. Update the /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml for the gpfs.mnt.dir, gpfs.data.dir and gpfs.storage.type configurations.
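    A sketch of gpfs-site.xml, assuming the file system is mounted at /ibm/gpfs and the Hadoop data goes into the hadoop subdirectory (both values are assumptions; set gpfs.storage.type according to your deployment):
      <configuration>
        <property>
          <name>gpfs.mnt.dir</name>
          <value>/ibm/gpfs</value>      <!-- assumption: your file system mount point -->
        </property>
        <property>
          <name>gpfs.data.dir</name>
          <value>hadoop</value>         <!-- assumption: data directory under the mount point -->
        </property>
        <property>
          <name>gpfs.storage.type</name>
          <value>shared</value>         <!-- local (FPO) or shared -->
        </property>
      </configuration>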
  9. Update the /var/mmfs/hadoop/etc/hadoop/hadoop-env.sh and change export JAVA_HOME= into export JAVA_HOME=<your-real-JDK8-Home-Dir>.
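    For example (the JDK path is an assumption; use your actual JDK 8 location):
      export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk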
  10. Update the /var/mmfs/hadoop/etc/hadoop/workers to add DataNodes. One DataNode hostname per line.
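    For example, a workers file for three DataNodes (the hostnames are illustrative):
      c8f2n05
      c8f2n06
      c8f2n07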
  11. Synchronize all these changes into other DataNodes by executing the following command:
    /usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /var/mmfs/hadoop/etc/hadoop
  12. Start HDFS Transparency by executing mmhadoopctl:
    /usr/lpp/mmfs/bin/mmhadoopctl connector start
  13. Check the service status of HDFS Transparency by executing mmhadoopctl:
    /usr/lpp/mmfs/bin/mmhadoopctl connector getstate
    Note: If HDFS Transparency is not up on some nodes, log in to those nodes and check the logs located under /var/log/transparency. If the getstate command reports no errors, HDFS Transparency is up and running.

If you want to configure YARN, execute the following steps:
    1. Download Apache Hadoop 3.0.x from the Apache Hadoop website.
    2. Log in to HadoopNode1.
    3. Extract the package to /opt/hadoop-3.0.x on HadoopNode1.
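      For example (assuming the downloaded tarball is in the current directory):
        tar -zxf hadoop-3.0.x.tar.gz -C /opt/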
    4. Copy hadoop-env.sh, hdfs-site.xml, and workers from /var/mmfs/hadoop/etc/hadoop on the HDFS Transparency node to HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop/.
    5. Copy /usr/lpp/mmfs/hadoop/template/mapred-site.xml.template and /usr/lpp/mmfs/hadoop/template/yarn-site.xml.template from the HDFS Transparency node to HadoopNode1:/opt/hadoop-3.0.x/etc/hadoop/ as mapred-site.xml and yarn-site.xml.
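      For example, from HadoopNode1 (TransparencyNode1 stands for the HDFS Transparency node):
        scp TransparencyNode1:/var/mmfs/hadoop/etc/hadoop/{hadoop-env.sh,hdfs-site.xml,workers} /opt/hadoop-3.0.x/etc/hadoop/
        scp TransparencyNode1:/usr/lpp/mmfs/hadoop/template/mapred-site.xml.template /opt/hadoop-3.0.x/etc/hadoop/mapred-site.xml
        scp TransparencyNode1:/usr/lpp/mmfs/hadoop/template/yarn-site.xml.template /opt/hadoop-3.0.x/etc/hadoop/yarn-site.xml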
    6. Update /opt/hadoop-3.0.x/etc/hadoop/mapred-site.xml with the correct path location for yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env configurations.
      For example, change the value from HADOOP_MAPRED_HOME=/opt/hadoop-3.0.2 to HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x
      Note: /opt/hadoop-3.0.x stands for the actual Hadoop installation directory.
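      A sketch of the resulting mapred-site.xml entries (/opt/hadoop-3.0.x stands for your actual Hadoop location):
        <property>
          <name>yarn.app.mapreduce.am.env</name>
          <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
        </property>
        <property>
          <name>mapreduce.map.env</name>
          <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
        </property>
        <property>
          <name>mapreduce.reduce.env</name>
          <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.x</value>
        </property>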
    7. Update /opt/hadoop-3.0.x/etc/hadoop/yarn-site.xml. In particular, configure the correct hostname for yarn.resourcemanager.hostname.
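      For example (the hostname is illustrative):
        <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>c8f2n04</value>
        </property>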
    8. Synchronize /opt/hadoop-3.0.x from HadoopNode1 to all other Hadoop nodes, keeping the same location on all hosts.
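      For example, with scp (the node names are illustrative; rsync works as well):
        for host in c8f2n05 c8f2n06 c8f2n07; do
          scp -r /opt/hadoop-3.0.x $host:/opt/
        done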
    9. On the Resource Manager node, run the following commands to start the YARN service:
      #cd /opt/hadoop-3.0.x/sbin/
      #export YARN_NODEMANAGER_USER=root
      #export YARN_RESOURCEMANAGER_USER=root
      #./start-yarn.sh
      
      Note: By default, the YARN service logs are under /opt/hadoop-3.0.x/logs. If you plan to start the YARN services as a different user, change the user root in the commands above to your target user name.
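      To verify that the daemons came up, you can run jps from your JDK, for example:
        $JAVA_HOME/bin/jps    # should list ResourceManager (and NodeManager on the worker nodes)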
    10. Run the following commands to submit a word count job:
      #/opt/hadoop-3.0.x/bin/hadoop fs -put /etc/passwd /passwd
      #/opt/hadoop-3.0.x/bin/hadoop jar /opt/hadoop-3.0.x/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.x.jar wordcount /passwd /results
      

      If the word count job completes successfully, the YARN service is working correctly.
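
      You can inspect the job output, for example:
        #/opt/hadoop-3.0.x/bin/hadoop fs -ls /results
        #/opt/hadoop-3.0.x/bin/hadoop fs -cat /results/part-r-00000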

      For more information, see IBM Storage Scale Hadoop performance tuning guide.