IBM Spectrum Scale file placement optimizer

IBM Spectrum Scale is a high-performance cluster file system that provides concurrent access to one or more file systems from multiple nodes. Within IBM Spectrum Scale, the File Placement Optimizer (FPO) feature provides a flexible file system that can be deployed as an alternative to HDFS, bringing full Portable Operating System Interface (POSIX) file system compliance, no single point of failure, and enhanced security. FPO is available with IBM Spectrum Scale on Linux®. Follow these steps to install IBM Spectrum Scale and run MapReduce jobs in your IBM Spectrum Scale FPO cluster.

Procedure

Install IBM Spectrum Scale on Linux 64-bit or on Linux on POWER.

Refer to the IBM Spectrum Scale documentation for instructions on completing the following steps.
1. Install IBM Spectrum Scale on all nodes in your cluster, as described in the IBM Spectrum Scale Concepts, Planning, and Installation Guide.
2. Configure IBM Spectrum Scale FPO, as described in the IBM Spectrum Scale Administration and Programming Reference and the IBM Spectrum Scale Advanced Administration Guide.
3. Mount your IBM Spectrum Scale file system on all nodes (for example: /mnt/gpfs).
Install IBM® Spectrum Symphony:
1. Install Hadoop on the same hosts as your IBM Spectrum Scale cluster, as described in Hadoop documentation.
2. Install IBM Spectrum Symphony on the same hosts as well.
  Note: Remember the following details for your IBM Spectrum Symphony installation:
  - Do not set up the DFS_GUI_HOSTNAME and DFS_GUI_PORT environment variables. Both variables apply only to the Hadoop Distributed File System (HDFS).
  - Do not activate the high availability (HA) feature for the HDFS NameNode. This feature applies only to HDFS.
3. On all hosts, change the ownership of the IBM Spectrum Scale mount point to the IBM Spectrum Symphony cluster administrator. For example:
  
  # chown -R $CLUSTERADMIN:$CLUSTERADMIN /mnt/gpfs
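After the ownership change, a quick check on each host can confirm that the mount point now belongs to the cluster administrator. The following is a minimal sketch, not part of the product: the function names path_owner and mount_owned_by are hypothetical, and the check looks only at the top-level directory, not at everything that chown -R touched.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with IBM Spectrum Scale or Symphony).

# Print the owning user of a path, taken from the third field of `ls -ld`.
path_owner() {
    ls -ld "$1" | awk '{ print $3 }'
}

# Succeed only if the path is owned by the expected user.
# $1 = path (for example /mnt/gpfs), $2 = expected owner.
mount_owned_by() {
    [ "$(path_owner "$1")" = "$2" ]
}

# Example use on a host, assuming CLUSTERADMIN is set as in the step above:
#   mount_owned_by /mnt/gpfs "$CLUSTERADMIN" || echo "ownership not changed on $(hostname)"
```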
Configure your IBM Spectrum Scale FPO cluster:
1. Configure all the hosts in your cluster to integrate the IBM Spectrum Scale FPO connector with IBM Hadoop, as described in the IBM Spectrum Scale FPO Connector readme file included in the FPO distribution. The readme file is available at /usr/lpp/mmfs/fpo/hadoop-distribution_level/.
2. As the cluster administrator, run the following command on all hosts in your cluster:
  
  ln -s /usr/lpp/mmfs/fpo/hadoop-distribution_level/*.jar $PMR_SERVERDIR/../lib
3. Based on your IBM Spectrum Scale FPO file system configuration, configure the ComputeHosts resource group to include all hosts that have IBM Spectrum Scale data disks:
  1. Access the cluster management console, which is available by default at https://host_name:8443/platform.
    If you are not sure which host the cluster management console is running on, run the egosh service list command and check the resource for the WEBGUI service.
  2. Log in as the cluster administrator with the default credentials (user name Admin, password Admin).
    For security in a production environment, ensure that you change the password of the Admin account.
  3. Go to Resources > Resource Planning > Resource Groups.
  4. In the Resource Group Name column, click ComputeHosts.
    By default, all hosts that are not in the management host group are manually allocated to the ComputeHosts resource group. This allocation is done in the Filter display of member hosts section, by adding the query select(!mg) to the Host filtered by resource requirement option.
  5. Decide whether the list of hosts in the ComputeHosts resource group meets your requirements:
    - To view the list of hosts in the ComputeHosts resource group, click Refresh Host List. If the list is correct, leave it as is. Otherwise, change the hosts in the list.
    - To change the hosts in the ComputeHosts resource group, choose Static (List of Names) from the Resource Selection Method list.
      The page refreshes to display a list of possible hosts. Select the hosts that you want from the list and click Apply.
4. If you are setting the mapred.local.dir property to a GPFS directory, add the following properties to the $PMR_HOME/conf/pmr-site.xml configuration file on all your hosts:
  - pmr.io.read.enhancement: Used internally by the system. Valid values are true and false.
```
<property>
  <name>pmr.io.read.enhancement</name>
  <value>false</value>
</property>
```
  - pmr.io.enhancement: Specifies whether memory caching in the shuffle service is enabled. Valid values are normal, socket, fc, and fs. Set the value to normal so that the intermediate data of a map task is saved in the same way as in Hadoop.
```
<property>
  <name>pmr.io.enhancement</name>
  <value>normal</value>
</property>
```
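Because the two properties must be present on every host, it can help to read them back from pmr-site.xml after editing. The sketch below is an illustration only: pmr_property_value is a hypothetical helper, and it assumes the simple layout shown above, with each name element followed by its value element on the same line or a later one. It is a line-oriented scan, not a real XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style properties file such as pmr-site.xml.
# $1 = path to the file, $2 = property name.
pmr_property_value() {
    awk -v name="$2" '
        # Remember when the exact <name>...</name> element has been seen.
        index($0, "<name>" name "</name>") { found = 1 }
        # Print the first <value> element at or after that point, then stop.
        found && match($0, /<value>[^<]*<\/value>/) {
            # Skip the 7 characters of "<value>" and drop the 8 of "</value>".
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$1"
}

# Example use, assuming PMR_HOME is set on the host:
#   pmr_property_value "$PMR_HOME/conf/pmr-site.xml" pmr.io.enhancement
```

An empty result means that the property was not found in the file.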
Submit MapReduce jobs:
1. Submit MapReduce jobs by specifying the "gpfs:///" scheme in the data input path, output path, or both.
  For example, if the input data files in IBM Spectrum Scale are under the local path /mnt/gpfs/input/ and you want to store the result of the MapReduce job in IBM Spectrum Scale under the local path /mnt/gpfs/output/, submit the job by using a command similar to the following:
  # mrsh jar jarfile gpfs:///input gpfs:///output
  
  Note: Do not include the IBM Spectrum Scale mount directory (/mnt/gpfs in this example) in the input or output path. This directory is added automatically.
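The relationship between a local path and its gpfs:/// form can be sketched as a small shell function. This is an illustration under stated assumptions: to_gpfs_uri is a hypothetical helper, and the default mount point /mnt/gpfs is taken from the example above, so adjust it to match your cluster.

```shell
#!/bin/sh
# Hypothetical helper: convert a local path under the IBM Spectrum Scale
# mount point into the gpfs:/// form, dropping the mount directory.
# $1 = local path, $2 = mount point (defaults to /mnt/gpfs).
to_gpfs_uri() {
    local_path=$1
    mount_point=${2:-/mnt/gpfs}
    mount_point=${mount_point%/}    # tolerate a trailing slash
    case "$local_path" in
        "$mount_point")
            printf 'gpfs:///\n' ;;
        "$mount_point"/*)
            # Keep everything after the mount point, including its leading slash.
            printf 'gpfs://%s\n' "${local_path#"$mount_point"}" ;;
        *)
            printf 'error: %s is not under %s\n' "$local_path" "$mount_point" >&2
            return 1 ;;
    esac
}

# Example use with the command from the step above:
#   mrsh jar jarfile "$(to_gpfs_uri /mnt/gpfs/input)" "$(to_gpfs_uri /mnt/gpfs/output)"
```

With the default mount point, /mnt/gpfs/input maps to gpfs:///input, which matches the example command. A path outside the mount point is rejected with a nonzero exit status.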