Hadoop connector

This section describes how the Hadoop connector enables Hadoop applications to run on IBM® Spectrum Scale.

Hadoop connector support for IBM Spectrum Scale

The IBM Spectrum Scale™ Hadoop connector, which must be installed on each Hadoop node, implements the Hadoop FileSystem API and the FileContext class so that Hadoop applications can access the IBM Spectrum Scale file system.

The IBM Spectrum Scale Hadoop connector is composed of these parts:
  • hadoop-gpfs-*.jar
  • libgpfshadoop.so
  • gpfs-connector-daemon
Install the connector on each node in the Hadoop cluster.
Figure 1. GPFS Hadoop connector overview
This figure is explained in the paragraph that precedes it.
Note: Use the /usr/lpp/mmfs/bin/mmhadoopctl command to deploy the connector on the Hadoop nodes and to control the connector daemon.
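Because the connector implements the standard Hadoop FileSystem API, applications need no IBM Spectrum Scale-specific code. The following minimal sketch writes and reads a file through that API on a node where the connector is deployed; it assumes that core-site.xml already points the default file system at IBM Spectrum Scale, and the path /bigfs/demo.txt is hypothetical.

    // Minimal sketch: accessing IBM Spectrum Scale through the Hadoop FileSystem API.
    // Assumption: the connector is installed and core-site.xml resolves the default
    // file system to IBM Spectrum Scale; the path below is hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SpectrumScaleHadoopDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);       // resolves to the connector's file system

            Path file = new Path("/bigfs/demo.txt");    // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from Hadoop on IBM Spectrum Scale\n".getBytes(StandardCharsets.UTF_8));
            }

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }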

Hadoop support for the IBM Spectrum Scale storage mode

IBM Spectrum Scale has two storage modes that Hadoop can access:
  • Centralized Storage Mode
  • Local Storage Mode, or File Placement Optimizer (FPO) Mode
IBM Spectrum Scale allows Hadoop applications to access data on centralized storage, such as an IBM SAN Volume Controller system, which means that the storage is attached to a dedicated storage server (or is accessed directly through the SAN Volume Controller). All Hadoop nodes can access the storage as IBM Spectrum Scale clients. You can share the cluster between Hadoop and any other application.
Figure 2. Hadoop on centralized storage
This figure is explained in the paragraph that precedes it.
IBM Spectrum Scale also provides a local storage mode for Hadoop, File Placement Optimizer (FPO). FPO is an implementation of a shared-nothing architecture that enables each node to operate independently, which reduces the impact of failure events that occur across multiple nodes.
Figure 3. Hadoop on Local storage (FPO)
This figure is explained in the paragraph that precedes it.

IBM Spectrum Scale cluster planning for Hadoop applications

When you create an IBM Spectrum Scale file system for a Hadoop application, consider the following points:

  • IBM Spectrum Scale allows Hadoop to run on a file system with multiple storage pools. For storage pool planning, the pools can be either generic or FPO (with allowWriteAffinity=yes). With centralized storage, IBM Spectrum Scale can run Hadoop on a typical file system configuration that has a mixed storage pool with both metadata and data.
  • You must consider replica and block size planning, as summarized in Table 1 (a hedged sketch of checking these values through the Hadoop API follows this list):

    Table 1. Replica and block size planning

    Mode                  Data type  Block size                                  Replicas
    --------------------  ---------  ------------------------------------------  --------
    FPO mode              Metadata   A small size is suggested; for example,     2 or 3
                                     256 KB.
    FPO mode              Data       A large size is suggested; a typical        3
                                     value is 2 MB.
    Central Storage mode  Metadata   Depends on the application I/O pattern      1 or 2
                                     and on whether the application uses RAID.
    Central Storage mode  Data       Depends on the application I/O pattern      1 or 2
                                     and on whether the application uses RAID.

    Notes:
      • For FPO metadata, fewer replicas apply to disks with hardware protection such as RAID.
      • For FPO data, 2 replicas are considered only with extra disk protection such as RAID.
  • You can use any advanced features in IBM Spectrum Scale, such as Local Read-Only Cache (LROC), to improve performance.
  • Hadoop can run on a remote file system that is mounted from another cluster.
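After the file system is created, a Hadoop application can read back the block size and replication that it sees for a file through the standard API. This is a hedged sketch: how the connector maps IBM Spectrum Scale block size and replication onto the Hadoop FileStatus fields can vary by connector version, and the path used here is hypothetical.

    // Hedged sketch: inspect the block size and replication reported to Hadoop
    // for a file on the IBM Spectrum Scale file system. The path is hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockPlanningCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/bigfs/data/part-00000"));
            System.out.println("block size  : " + status.getBlockSize() + " bytes");
            System.out.println("replication : " + status.getReplication());
        }
    }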

Hadoop cluster planning

In an IBM Spectrum Scale shared storage environment, do not use an NSD server as a compute node, because NSD servers generally carry a heavy I/O workload.
Remember: In a centralized storage environment, any node can be used as a Hadoop node.

IBM Spectrum Scale functions corresponding to HDFS

To enable HDFS-4685 access control lists (ACLs):
  1. Enable GPFS™ POSIX ACL support:
    # mmchfs bigfs -k posix        
    # mmlsfs bigfs -k          
    
    flag                value                    description     
    ----                -----                    -----------   
    -k                  posix                    ACL semantics in effect 
  2. Install the dependent packages:
    • acl
    • libacl
    • libacl-devel (required by Hadoop 2.4 connector only)
In this example, the IBM Spectrum Scale file system name is bigfs.
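With POSIX ACL support enabled and the dependent packages installed, Hadoop applications can use the HDFS-4685 style ACL calls of the Hadoop FileSystem API against the file system. The following is a minimal sketch under the assumption that your connector release supports these ACL operations; the directory path and the user name hive are hypothetical.

    // Hedged sketch: HDFS-4685 style ACL operations through the Hadoop FileSystem API.
    // Assumes the connector release supports ACL calls; path and user name are hypothetical.
    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class AclExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/bigfs/shared");

            // Grant the user "hive" read and execute access on the directory.
            AclEntry entry = new AclEntry.Builder()
                    .setScope(AclEntryScope.ACCESS)
                    .setType(AclEntryType.USER)
                    .setName("hive")
                    .setPermission(FsAction.READ_EXECUTE)
                    .build();
            fs.modifyAclEntries(dir, Collections.singletonList(entry));

            // Read the ACL back.
            System.out.println(fs.getAclStatus(dir));
        }
    }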

To enable the HDFS-2006 ability to store extended attributes per file, install the dependent libattr package.
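With libattr installed, extended attributes can be set and read through the Hadoop FileSystem xattr API that HDFS-2006 introduced. This is a hedged sketch assuming that your Hadoop release and connector expose these calls; the path and attribute name are hypothetical.

    // Hedged sketch: HDFS-2006 extended attribute calls through the Hadoop FileSystem API.
    // Assumes the Hadoop release and connector support xattrs; path and attribute name are hypothetical.
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class XAttrExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/bigfs/data/report.csv");

            // Store a user-namespace extended attribute on the file.
            fs.setXAttr(file, "user.owner-team", "analytics".getBytes(StandardCharsets.UTF_8));

            // Read it back.
            byte[] value = fs.getXAttr(file, "user.owner-team");
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
    }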

To enable the remote file system in the Hadoop connector, add the following property:

<property>
  <name>gpfs.remote.cluster.enabled</name>
  <value>true</value>
</property>
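For illustration only, the same property can also be set on a Hadoop Configuration object when you build a client programmatically. Whether your connector release reads the property from a programmatic Configuration, rather than only from its XML configuration file, is an assumption here; the property name itself comes from this section.

    // Hedged sketch: setting the remote-cluster property programmatically.
    // Assumption: the connector honors the property on the effective Hadoop configuration;
    // in most deployments it is simply added to the connector's XML configuration.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RemoteClusterConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("gpfs.remote.cluster.enabled", "true");  // property name from this section
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default file system: " + fs.getUri());
        }
    }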