Installing and configuring Apache HDFS

If you plan to use the Hadoop Distributed File System (HDFS) with MapReduce (available only on Linux® 64-bit hosts) and have not already installed HDFS, follow these steps. We strongly recommend that you set up Hadoop before installing IBM® Spectrum Symphony to avoid manual configuration. If you plan to install HDFS after installing IBM Spectrum Symphony, configure Hadoop for the MapReduce framework in IBM Spectrum Symphony. If you are using other supported distributed file systems, install and configure them to work with the MapReduce framework.

Before you begin

IBM Spectrum Symphony supports the HDFS versions included with Hadoop MapReduce API 2.7.2 within its MapReduce framework.

Note: The MapReduce framework in IBM Spectrum Symphony and Hadoop MapReduce can coexist on the same cluster using one HDFS provided each host in the cluster has enough memory, CPU slots, and disk space configured for both workloads.
Ensure that the hosts in your cluster meet the following software requirements:
  • Oracle Java™ version 1.6.0_21 or higher installed on all hosts in your cluster (required). For example, to install it from an rpm package under /usr/java/latest, enter:

    rpm -Uvh jre-6u27-linux-i586.rpm

  • Dedicated Hadoop system user (recommended). To add a system user (for example, hadoop) to your local machine, enter:

    useradd hadoop

  • SSH configured for internode communication (recommended). To configure SSH keys for the dedicated Hadoop system user (for example, hadoop):
    1. Switch to the hadoop user:

      su - hadoop

    2. Create an RSA key pair with an empty password:

      ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    3. Enable SSH access to your local machine with this newly created key:

      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    4. Restrict access to the authorized_keys file:

      chmod 600 ~/.ssh/authorized_keys

    5. Test the setup and save your local machine’s host key fingerprint to the hadoop user's known_hosts file:

      ssh localhost
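
    For a multihost cluster, the hadoop user's public key must also be authorized on every other host so that the HDFS start scripts can reach them without a password. A minimal sketch, assuming the example compute hosts db03b11 and db03b12 that are used later in this procedure:

      # Run as the hadoop user on the NameNode host. The host names are
      # examples; substitute the hosts in your cluster.
      ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@db03b11
      ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@db03b12

      # Confirm that passwordless login works:
      ssh hadoop@db03b11 hostname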

About this task

Note: Unless otherwise stated, this procedure uses commands based on the Hadoop 2.7.2 distribution. While you can use these commands on all supported Hadoop versions, you might see deprecation warnings. To avoid these warnings, use the command that corresponds to your Hadoop version.
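
For example, on Hadoop 2.x distributions several hadoop subcommands print a deprecation warning and have hdfs equivalents. The pairs shown here are illustrative:

  hadoop namenode -format      # deprecated form
  hdfs namenode -format        # Hadoop 2.x equivalent

  hadoop dfsadmin -report      # deprecated form
  hdfs dfsadmin -report        # Hadoop 2.x equivalent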

Procedure

  1. Download the Hadoop .tar.gz package from the Apache Hadoop download website.
    Important: Although the Hadoop package is available as both .rpm and .tar.gz files, this procedure describes installing HDFS from the .tar.gz package. When HDFS is installed from the .rpm package, the directory structure is different, where:
    • configuration files are under /etc/hadoop/
    • bin files are under /usr/bin/
    • libraries are under /usr/lib64/
    • script files are under /usr/sbin/
    • the service binary is under /etc/rc.d/init.d/
    • other files are under /usr/share/hadoop/.
    Be aware that hadoop/bin does not exist in the .rpm installation. Because of this change in directory structure, some sample applications fail to run.
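    For example, to fetch the 2.7.2 tarball from the Apache archive (the URL shown is illustrative; substitute the mirror and version that you use):
    wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz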
  2. Select a directory in which to install Hadoop and extract the package tarball in that directory. For example, to extract the 2.4.x distribution, enter:
    cd /opt
    tar xzvf hadoop-2.4.x.tar.gz
    ln -s hadoop-2.4.x hadoop
  3. Select a host to be the NameNode (for example, db03b10).
  4. As user hadoop, configure the environment:
    The following example is for the 2.4.x distribution:
    export JAVA_HOME=/usr/java/latest
    export HADOOP_HOME=/opt/hadoop-2.4.x
    export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
    export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HADOOP_HOME/bin
    export HADOOP_VERSION=2_4_x
  5. Set up some basic configuration. The examples that follow reference configuration files under ${HADOOP_HOME}/conf; in Hadoop 2.x distributions, the same files are under ${HADOOP_HOME}/etc/hadoop. A quick way to verify these settings is shown at the end of this step.
    1. Specify the Java version used by Hadoop using the ${HADOOP_HOME}/conf/hadoop-env.sh script file. Change, for example:
      # The Java implementation to use. Required.
      # export JAVA_HOME=
      
      to:
      # The Java implementation to use. Required.
      export JAVA_HOME=/usr/java/latest
      
    2. Configure the directory where Hadoop will store its data files using the ${HADOOP_HOME}/conf/core-site.xml file. For example:
      <configuration>
              <property>
                      <name>hadoop.tmp.dir</name>
                      <value>/opt/hadoop/data</value>
              </property> 
              <property>
                      <name>fs.default.name</name>
                      <!-- NameNode host -->
                      <value>hdfs://db03b10:9000/</value>
              </property>
      </configuration>
      
    3. Ensure that the base directory that you specified for hadoop.tmp.dir exists and has the required ownership and permissions. For example:
      mkdir -p /opt/hadoop/data
      chown hadoop:hadoop /opt/hadoop/data
      chmod 755 /opt/hadoop/data
      Note: For a multiple user configuration, ensure other OS users (belonging to the same group as the hadoop user) have write permission to the base temporary directory. For example:

      chmod 775 /opt/hadoop/data

    4. Set values for the following properties in the ${HADOOP_HOME}/conf/mapred-site.xml file.
      • Set the MapReduce framework name. This ensures that MapReduce jobs run using the IBM Spectrum Symphony MapReduce framework (rather than in local MapReduce mode). For example:
        <configuration>
                <property>
                        <name>mapreduce.framework.name</name>
                        <value>yarn</value>
                </property>
        </configuration>
        
      • Set the maximum number of map tasks run simultaneously per TaskTracker. For example:
        <configuration>
                <property>
                        <name>mapred.tasktracker.map.tasks.maximum</name>
                        <value>7</value>
                </property>
        </configuration>
        
      • Set the maximum number of reduce tasks run simultaneously per TaskTracker. For example:
        <configuration>
                <property>
                        <name>mapred.tasktracker.reduce.tasks.maximum</name>
                        <value>3</value>
                </property>
        </configuration>
        
    5. Set the default block replication in the ${HADOOP_HOME}/conf/hdfs-site.xml file. For example:
      <configuration>
              <property>
                      <name>dfs.replication</name>
                      <value>1</value>
              </property>
      </configuration>
      
    6. If you plan to use the High-Availability HDFS feature (not available in the Developer Edition), configure the shared file system directory to be used for HDFS NameNode metadata storage in the hdfs-site.xml file. For example:
      <configuration>
              <property>
                      <name>dfs.name.dir</name>
                      <value>/share/hdfs/name</value>
              </property>
      </configuration>
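
    To confirm that HDFS picks up the edited values, you can query them with the getconf tool (available in Hadoop 2.x distributions; the keys shown match the examples above):

      # Print the configured NameNode URI and replication factor
      hdfs getconf -confKey fs.default.name
      hdfs getconf -confKey dfs.replication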
      
  6. Repeat steps 1 to 5 on every compute host.
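    Rather than repeating the edits by hand on each host, you can copy the installation directory from the NameNode to every compute host. A minimal sketch using rsync, assuming the same installation path on each host and the example compute hosts db03b11 and db03b12:
    # Run as the hadoop user on the NameNode (db03b10). Adjust the host
    # names and paths to match your cluster.
    for host in db03b11 db03b12; do
        rsync -a /opt/hadoop-2.4.x/ ${host}:/opt/hadoop-2.4.x/
    done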
  7. Specify primary and compute hosts. For example, in the 2.4.x distribution, the compute file is under the ${HADOOP_HOME}/etc/hadoop directory. Here is an example of editing the file to specify the hosts db03b11 and db03b12 as compute hosts:
    $ vim compute
    $ cat compute
    db03b11
    db03b12
    
    Note: For the 1.1.1 distribution, you specify the primary and compute hosts in files under the ${HADOOP_HOME}/conf directory.
  8. Format HDFS on the NameNode:
    cd ${HADOOP_HOME}/bin
    ./hadoop namenode -format
  9. Start up HDFS. Run the following commands on the NameNode (in Hadoop 2.x distributions, the start scripts are under ${HADOOP_HOME}/sbin; in 1.x distributions, they are under ${HADOOP_HOME}/bin):
    su hadoop
    cd ${HADOOP_HOME}/sbin
    ./start-dfs.sh

    To start both HDFS and Hadoop MapReduce daemons, use start-all.sh instead.

    Note: Commands such as jps or ./hadoop dfsadmin -report display information about the running system and are useful for verifying that the daemons started correctly.

    For more information, refer to your Hadoop documentation.
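
    For example, to run the checks that the previous note mentions (the list of daemons that jps reports depends on which hosts and services you configured):

      # On the NameNode, expect processes such as NameNode and SecondaryNameNode;
      # on compute hosts, expect DataNode.
      jps

      # Summarize configured capacity and the DataNodes that registered.
      ${HADOOP_HOME}/bin/hadoop dfsadmin -report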

What to do next

You can access HDFS data in many different ways, two of which are listed here.
FS shell
The hadoop fs utility is the command-line interface to HDFS and provides Linux-like shell commands. The utility is located under $HADOOP_HOME/bin and uses the following syntax:
hadoop fs <options>
where <options> include, among others, the following (a short sample session follows this list):
  • -ls <path> lists the contents under the specified path
  • -mkdir <path> creates a new directory
  • -rmr <path> recursively removes the specified path
  • -put <localsrc> <dst> copies files from the local file system into HDFS
  • -get <src> <localdst> copies files from HDFS to the local file system
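For example, a short session that creates a directory in HDFS, copies a local file into it, lists the result, and copies the file back (the paths shown are illustrative):

  # Create a directory in HDFS (-p creates parent directories in Hadoop 2.x)
  hadoop fs -mkdir -p /user/hadoop/input

  # Copy a local file into HDFS and list the directory contents
  hadoop fs -put /etc/hosts /user/hadoop/input
  hadoop fs -ls /user/hadoop/input

  # Copy the file back to the local file system
  hadoop fs -get /user/hadoop/input/hosts /tmp/hosts.copy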
Web interfaces
Hadoop provides the following web interfaces:
  • Use the HDFS NameNode web interface to view the general health of the HDFS, health of the data nodes, the filesystem browser, and logs. By default, the NameNode interface is available at http://localhost:50070/.
  • Use the MapReduce Job Tracker web interface to view jobtracker state, job, task, and attempt drilldowns, scheduling information, history, and logs. By default, the MapReduce Job Tracker interface is available at http://localhost:50030/.
  • Use the Task Tracker web interface to view running and non-running tasks and logs. By default, the Task Tracker interface is available at http://localhost:50060/.
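If you want to confirm from the command line that an interface is responding, an HTTP request against its port is enough (the ports shown are the defaults listed above and may differ in your configuration):

  # Check that the NameNode web interface is up on its default port
  curl -sf http://localhost:50070/ > /dev/null && echo "NameNode web interface is up"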