Cloudera's distribution including Apache Hadoop

The MapReduce framework in IBM® Spectrum Symphony can work with Cloudera's Distribution including Apache Hadoop (CDH). The MapReduce cluster can run alongside a Cloudera HDFS to provide improved performance, high availability, a robust framework, and the ability to reuse existing data in the Cloudera HDFS. Optionally, you can activate high availability (HA) within the MapReduce framework to automate a failover process for all HDFS daemons on all nodes (NameNode, Secondary NameNode, and DataNodes).

Before you begin

Ensure the following tasks are complete:

  1. The Cloudera CDH5 cluster is installed.
  2. The MapReduce cluster is set to use CDH version 5.0.2.
  3. If you require failover, HA for the HDFS NameNode is activated. Use the HA feature for HDFS in a Cloudera HDFS cluster to start all HDFS processes, monitor process health, and provide failover in case of process or host failure. With the HA feature, MapReduce jobs continue to run even after an HDFS failover.

About this task

If you have not yet installed IBM Spectrum Symphony, you can configure the Cloudera HDFS as the distribution file system while installing IBM Spectrum Symphony.

If IBM Spectrum Symphony is already installed, complete the following steps to run the MapReduce framework with the Cloudera HDFS:

Procedure

  1. Generate a client configuration package so that the MapReduce framework can recognize the existing CDH5 configuration when submitting jobs that access the Cloudera HDFS.
    1. Log in to the Cloudera SCM Admin Console.
    2. Under Services, select the HDFS service and click Generate Client Configuration.

      A confirmation screen is displayed.

    3. Confirm generation of the client configuration file; click Generate Client Configuration.
    4. Download the compressed client configuration file; click Download Result Data.
    5. Save the compressed file to your local drive; click Save in the File Download dialog.
    6. Unpack the compressed file to ensure that it contains the following configuration files:
      • core-site.xml
      • hadoop-env.sh
      • hdfs-site.xml
      • log4j.properties
      • ssl-client.xml.example
    7. Set the HADOOP_CONF_DIR environment variable to the location where you saved the configuration files.
    8. If your Cloudera HDFS was installed by the SCM Express utility, add the Cloudera native library path to LD_LIBRARY_PATH. For example, with a 64-bit library path, run the following command:
      export LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH
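    The environment setup in steps 6 through 8 can be sketched as a short shell snippet. The directory name and the unzip command are assumptions for illustration; substitute the location where you actually saved and unpacked the client configuration package:

    ```shell
    # Location where the client configuration package was unpacked
    # (assumed path for illustration).
    CONF_DIR="${TMPDIR:-/tmp}/cdh5-client-conf"
    mkdir -p "$CONF_DIR"
    # In practice, unpack the downloaded package here, for example:
    #   unzip hdfs-clientconfig.zip -d "$CONF_DIR"

    # Step 7: point the MapReduce framework at the client configuration.
    export HADOOP_CONF_DIR="$CONF_DIR"

    # Step 8 (SCM Express installations only): add the Cloudera native
    # library path, here the 64-bit path from the example above.
    export LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH

    echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
    ```

    Add these exports to the profile of the user that submits MapReduce jobs so that they persist across sessions.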
  2. (Optional) If you have configured HA, start the service on the Cloudera HDFS NameNode using the following command:
    egosh service start NameNode

    Services on the Secondary NameNode and the DataNodes start automatically.

  3. Submit the WordCount sample job. For example:
    1. Source the environment in the IBM Spectrum Symphony installation directory:
      . $PMR_HOME/../../profile.platform
    2. Edit the enabled application profile to set the environment variable HADOOP_VERSION to the correct Cloudera version and register the profile.
      For example:
      1. Open $PMR_SERVERDIR/../profile/MapReduceversion.xml.
      2. Set HADOOP_VERSION. For example, for CDH5:
        <env name="HADOOP_VERSION">cdh5_0_2</env>
        
      3. Set CLOUDERA_HOME. For example, for CDH5:
        <env name="CLOUDERA_HOME">/opt/cloudera</env>
        
      4. Register the profile:
        soamreg $PMR_SERVERDIR/../profile/MapReduceversion.xml
    3. On the client side, set the following values in $PMR_HOME/conf/pmr-env.sh:
      • Set the value of HADOOP_VERSION to the correct Cloudera version.
      • Set CLOUDERA_HOME to the installed Cloudera path, such as /opt/cloudera.
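      The client-side settings might look like the following fragment of pmr-env.sh, using the CDH 5.0.2 values shown earlier in this procedure; your version string and installation path may differ:

      ```shell
      # Sketch of the pmr-env.sh settings for CDH 5.0.2; adjust the
      # version string and path to match your Cloudera installation.
      export HADOOP_VERSION=cdh5_0_2
      export CLOUDERA_HOME=/opt/cloudera
      ```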
    4. Submit the job to work with the Cloudera HDFS using this syntax:
      mrsh jar /usr/lib/hadoop/hadoop-examples.jar wordcount hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/input hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/output
      where:
      • HDFS_NAMENODE specifies the existing Cloudera HDFS NameNode address.
      • HDFS_PORT_NUMBER specifies the existing Cloudera HDFS port number.
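      As a concrete illustration, with a hypothetical NameNode host and port (both are assumptions; substitute your cluster's values), the input and output URIs expand as follows:

      ```shell
      # Hypothetical values -- replace with your Cloudera HDFS NameNode
      # address and port.
      HDFS_NAMENODE=namenode01.example.com
      HDFS_PORT_NUMBER=8020

      # The URIs that the mrsh command resolves to:
      echo "hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/input"
      echo "hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/output"

      # Full submission, shown for reference (requires a running cluster):
      # mrsh jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
      #   hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/input \
      #   hdfs://${HDFS_NAMENODE}:${HDFS_PORT_NUMBER}/output
      ```

      The /input directory must already exist in HDFS and contain the text to count; the /output directory must not exist before the job runs.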