If you plan to use the Hadoop Distributed File System (HDFS) with MapReduce (available only on Linux® 64-bit hosts) and have not already installed HDFS, follow these steps. We strongly recommend that you set up Hadoop before installing IBM® Spectrum Symphony to avoid manual configuration. If you plan to install HDFS after installing IBM Spectrum Symphony, configure Hadoop for the MapReduce framework in IBM Spectrum Symphony. If you are using other supported distributed file systems, install and configure them to work with the MapReduce framework.
Before you begin
The MapReduce framework in IBM Spectrum Symphony supports the HDFS versions included with the Hadoop MapReduce API 2.7.2.
Note: The MapReduce framework in IBM Spectrum Symphony and Hadoop MapReduce can coexist on the same cluster using one HDFS, provided each host in the cluster has enough memory, CPU slots, and disk space configured for both workloads. Ensure that the hosts in your cluster meet these resource requirements.
About this task
Note: Unless otherwise stated, this procedure uses commands based on the Hadoop 2.7.2 distribution. While you can use these commands on all supported Hadoop versions, you may see deprecation warnings. To avoid these warnings, use the command that corresponds to your Hadoop version.
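For example, on Hadoop 2.x the hdfs command supersedes several hadoop subcommands; the older forms still work, but print a deprecation warning:
hadoop namenode -format     # Hadoop 1.x style; deprecated on 2.x
hadoop dfsadmin -report     # Hadoop 1.x style; deprecated on 2.x
hdfs namenode -format       # equivalent Hadoop 2.x style
hdfs dfsadmin -report       # equivalent Hadoop 2.x style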
Procedure
- Download the Hadoop .tar.gz package from the Apache Hadoop download website.
Important: While the Hadoop package is available as .rpm and .tar.gz files, this procedure describes how to download the .tar.gz package to install HDFS. When HDFS is installed from the .rpm package, the directory structure is different, where:
- configuration files are under /etc/hadoop/
- bin files are under /usr/bin/
- libraries are under /usr/lib64/
- script files are under /usr/sbin/
- the service binary is under /etc/rc.d/init.d/
- other files are under /usr/share/hadoop/
Be aware that hadoop/bin does not exist in the .rpm installation. Because of this difference in directory structure, some sample applications fail to run.
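For illustration, the following commands fetch and check the 2.7.2 tarball; the archive URL is an example and may differ for your mirror and version:
cd /tmp
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz.mds
# Compare this checksum against the value published in the .mds file
sha256sum hadoop-2.7.2.tar.gz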
- Select a directory in which to install Hadoop and extract the tar ball in that directory. For example, to extract the 2.4.x distribution under /opt, enter:
cd /opt
tar xzvf hadoop-2.4.x.tar.gz
ln -s hadoop-2.4.x hadoop
- Select a host to be the NameNode (for example, db03b10).
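The remaining steps assume a dedicated hadoop OS user that can reach every host over passwordless SSH, because the start-dfs.sh script starts remote daemons over SSH. If your site does not already provide such a user, a minimal sketch follows; the user, group, and host names are examples:
# On each host, as root: create the hadoop group and user
groupadd hadoop
useradd -m -g hadoop hadoop
# On the NameNode, as user hadoop: create a key and copy it to the other hosts
su - hadoop
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@db03b11
ssh-copy-id hadoop@db03b12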
- As user hadoop, configure the environment. The following example is for the 2.4.x distribution:
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/hadoop-2.4.x
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HADOOP_HOME/bin
export HADOOP_VERSION=2_4_x
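To confirm that the environment is in effect, you can run, for example:
hadoop version    # should report the release that matches $HADOOP_HOME
echo $JAVA_HOME   # should point to your Java installation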
- Set up some basic configuration:
- Specify the Java version used by Hadoop in the ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh script file. For example, change:
# The Java implementation to use. Required.
# export JAVA_HOME=
to:
# The Java implementation to use. Required.
export JAVA_HOME=/usr/java/latest
- Configure the directory where Hadoop stores its data files in the ${HADOOP_HOME}/etc/hadoop/core-site.xml file. For example:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data</value>
</property>
<property>
<name>fs.default.name</name>
<!-- NameNode host -->
<value>hdfs://db03b10:9000/</value>
</property>
</configuration>
- Ensure that the base directory you specified for other temporary directories exists and has the required ownership and permissions. For example:
mkdir -p /opt/hadoop/data
chown hadoop:hadoop /opt/hadoop/data
chmod 755 /opt/hadoop/data
Note: For a multiuser configuration, ensure that other OS users (belonging to the same group as the hadoop user) have write permission to the base temporary directory. For example:
chmod 775 /opt/hadoop/data
- Set values for the following properties in the ${HADOOP_HOME}/etc/hadoop/mapred-site.xml file.
- Set the MapReduce framework name. This setting ensures that MapReduce jobs run using the IBM Spectrum Symphony MapReduce framework (rather than in local MapReduce mode). For example:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Set the maximum number of map tasks run simultaneously per TaskTracker. For example:
<configuration>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
</property>
</configuration>
- Set the maximum number of reduce tasks run simultaneously per TaskTracker. For example:
<configuration>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
</property>
</configuration>
- Set the default block replication in the ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml file. For example:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
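A replication factor of 1 keeps a single copy of each block, which suits a test cluster; production clusters typically use 3. Once HDFS is running, you can confirm the effective value with, for example:
hdfs getconf -confKey dfs.replication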
- If you plan to use the High-Availability HDFS feature (not available in the Developer Edition), configure the shared file system directory to be used for HDFS NameNode metadata storage in the hdfs-site.xml file. For example:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/share/hdfs/name</value>
</property>
</configuration>
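Before starting HDFS, it is worth confirming that the shared directory is mounted and writable by the hadoop user on every NameNode candidate; a quick check using the example path above:
su - hadoop
mkdir -p /share/hdfs/name
touch /share/hdfs/name/.write_test && rm /share/hdfs/name/.write_test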
- Repeat steps 1 to 5 on every compute host.
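Rather than repeating the steps by hand, you can copy the extracted tree to each compute host; a minimal sketch assuming the example hosts db03b11 and db03b12 and the 2.4.x layout:
for host in db03b11 db03b12; do
  scp -r /opt/hadoop-2.4.x ${host}:/opt/
  ssh ${host} "ln -s /opt/hadoop-2.4.x /opt/hadoop"
done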
- Specify the primary and compute hosts. For example, in the 2.4.x distribution, the compute file is under the ${HADOOP_HOME}/etc/hadoop directory. Here is an example of editing the file to specify the hosts db03b11 and db03b12 as compute hosts:
$ vim compute
$ cat compute
db03b11
db03b12
Note: For the 1.1.1 distribution, you specify the primary and compute hosts in the subdirectories under the ${HADOOP_HOME}/conf directory.
- Format HDFS from the NameNode:
cd ${HADOOP_HOME}/bin
./hadoop namenode -format
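A successful format writes the initial metadata under the NameNode directory, which defaults to dfs/name under hadoop.tmp.dir. With the example settings above (and dfs.name.dir not overridden), you can verify it:
ls /opt/hadoop/data/dfs/name/current
# Expect files such as VERSION and an fsimage file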
- Start up HDFS. Run the following commands on the NameNode:
su hadoop
cd ${HADOOP_HOME}/bin
./start-dfs.sh
To start both the HDFS and Hadoop MapReduce daemons, use start-all.sh instead.
Note: Commands such as jps or ./hadoop dfsadmin -report display useful information about the running system. For more information, refer to your Hadoop documentation.
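For example, to verify the daemons after startup:
jps                          # on the NameNode, expect NameNode and SecondaryNameNode
./hadoop dfsadmin -report    # shows configured capacity and the live DataNodes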
What to do next
You can access HDFS data in many ways; two are described here.
- FS shell
- The hadoop fs utility is the command-line interface to HDFS and provides Linux-like shell commands. The utility is located under $HADOOP_HOME/bin and uses the following syntax (a sample session follows the list of options):
hadoop fs options
where options include, among others:
- -ls path: lists the contents under the specified path
- -mkdir path: creates a new directory
- -rmr path: recursively removes the specified path
- -put local path: copies from the local file system
- -get local path: copies to the local file system
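As a quick illustration on Hadoop 2.x (the file and directory names are arbitrary):
hadoop fs -mkdir -p /user/hadoop/input    # -p creates parent directories as needed
hadoop fs -put /etc/hosts /user/hadoop/input
hadoop fs -ls /user/hadoop/input
hadoop fs -get /user/hadoop/input/hosts /tmp/hosts.copy
hadoop fs -rmr /user/hadoop/input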
- Web interfaces
- Hadoop provides the following web interfaces:
- Use the HDFS NameNode web interface to view the general health of HDFS, the health of the DataNodes, the file system browser, and logs. By default, the NameNode interface is available at http://localhost:50070/.
- Use the MapReduce JobTracker web interface to view JobTracker state; job, task, and attempt drilldowns; scheduling information; history; and logs. By default, the JobTracker interface is available at http://localhost:50030/.
- Use the TaskTracker web interface to view running and non-running tasks and logs. By default, the TaskTracker interface is available at http://localhost:50060/.
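If you are checking from another machine, replace localhost with the host name. For example, to confirm that the NameNode interface on the example host db03b10 is reachable:
curl -s -o /dev/null -w "%{http_code}\n" http://db03b10:50070/
# 200 indicates that the interface is up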