Hadoop connector
This section describes how the Hadoop connector enables Hadoop applications to run on IBM® Spectrum Scale.
Hadoop connector support for IBM Spectrum Scale
The IBM Spectrum Scale™ Hadoop connector, which must be installed on each Hadoop node, implements the Hadoop file system APIs and the FileContext class so that Hadoop applications can access IBM Spectrum Scale.
The IBM Spectrum Scale Hadoop connector is composed of these parts:
- hadoop-gpfs-*.jar
- libgpfshadoop.so
- gpfs-connector-daemon
Figure 1. GPFS Hadoop connector overview
Note: Use the /usr/lpp/mmfs/bin/mmhadoopctl command to deploy the connector on the Hadoop nodes and to control the connector daemon.
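As a sketch of typical usage, the connector daemon might be started, checked, and stopped on a node as shown below. The exact set of mmhadoopctl subcommands varies by IBM Spectrum Scale release, so verify them against the command help on your installation before use:

```shell
# Deploy/synchronize the connector against the local Hadoop configuration,
# then manage the connector daemon (subcommand names assumed; check
# "mmhadoopctl --help" on your release).
/usr/lpp/mmfs/bin/mmhadoopctl connector syncconf /etc/hadoop/conf
/usr/lpp/mmfs/bin/mmhadoopctl connector start
/usr/lpp/mmfs/bin/mmhadoopctl connector getstate
/usr/lpp/mmfs/bin/mmhadoopctl connector stop
```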
Hadoop support for the IBM Spectrum Scale storage mode
IBM Spectrum Scale has two storage modes that Hadoop can access:
- Centralized Storage Mode
- Local Storage Mode, or File Placement Optimizer (FPO) Mode
IBM Spectrum Scale allows Hadoop applications to access data on centralized storage, such as SAN Volume Controller, which means that the storage is attached to a dedicated storage server (or accessed directly by SAN Volume Controller). All Hadoop replica nodes can access the storage as IBM Spectrum Scale clients. You can share a cluster between Hadoop and any other application.
Figure 2. Hadoop on centralized storage
IBM Spectrum Scale also provides a local storage mode, File Placement Optimizer (FPO), for Hadoop. FPO is an implementation of a shared-nothing architecture that enables each node to operate independently, which reduces the impact of failure events that span multiple nodes.
Figure 3. Hadoop on local storage (FPO)
IBM Spectrum Scale cluster planning for Hadoop applications
When you create an IBM Spectrum Scale file system for a Hadoop application, consider the following:
- IBM Spectrum Scale allows Hadoop to run on a file system with multiple storage pools. For purposes of storage pool planning, these storage pools can be either generic or FPO (with allowWriteAffinity=yes). With central storage, IBM Spectrum Scale can run Hadoop on a typical file system configuration that has a mixed storage pool with both metadata and data.
- You must consider replica and block size planning:
Table 1. Replica and block size planning

| Mode | Data type | Block size | Replica |
|------|-----------|------------|---------|
| FPO mode | Metadata | A small size is suggested; for example, 256 KB. | 2 or 3. Fewer replicas apply to disks with hardware protection such as RAID. |
| FPO mode | Data | A large size is suggested; a typical value is 2 MB. | 3. Replica 2 is only considered with extra disk protection such as RAID. |
| Central Storage mode | Metadata | Depends on the application I/O pattern and whether the application is using RAID. | 1 or 2 |
| Central Storage mode | Data | Depends on the application I/O pattern and whether the application is using RAID. | 1 or 2 |

- You can use any advanced features in IBM Spectrum Scale, such as Local Read-Only Cache (LROC), to improve performance.
- Hadoop can run on a remote file system that is mounted from another cluster.
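As an illustration of the FPO-mode guidance above, an FPO data pool is typically defined through an NSD stanza file and the replica counts are set when the file system is created. The following is a minimal sketch; the file system name (bigfs), pool name (datapool), and stanza file path are placeholders, and the stanza attribute values should be verified against the mmcrnsd and mmcrfs documentation for your release:

```shell
# Example pool stanza in the NSD stanza file (e.g., /tmp/nsd.stanza):
#   %pool:
#     pool=datapool
#     blockSize=2M            # large data block size, per Table 1
#     layoutMap=cluster
#     allowWriteAffinity=yes  # marks the pool as FPO
#     writeAffinityDepth=1
#     blockGroupFactor=128
#
# Create the file system with 3 data replicas and 3 metadata replicas
# (-m/-M metadata default/max, -r/-R data default/max):
mmcrfs bigfs -F /tmp/nsd.stanza -m 3 -M 3 -r 3 -R 3
```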
Hadoop cluster planning
In an IBM Spectrum Scale shared storage environment, do not use an NSD server as a compute node, because NSD servers generally carry a heavy I/O workload.
Remember: In a Scale Central Storage (SCS) environment, any node can be used as a Hadoop replica node.
IBM Spectrum Scale functions corresponding to HDFS
To enable HDFS-4685 access control lists (ACLs):
- Enable GPFS™ Posix ACL support:
  # mmchfs bigfs -k posix
  # mmlsfs bigfs -k
  flag  value  description
  ----  -----  -----------
  -k    posix  ACL semantics in effect
- Install the dependent packages:
- acl
- libacl
- libacl-devel (required by Hadoop 2.4 connector only)
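With ACL support enabled, Hadoop applications can manage ACLs through the standard Hadoop file system shell. The user name and path below are hypothetical examples:

```shell
# Grant a user read/write/execute on a directory, then display its ACL
# ("alice" and "/bigfs/data" are placeholders).
hadoop fs -setfacl -m user:alice:rwx /bigfs/data
hadoop fs -getfacl /bigfs/data
```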
To enable the HDFS-2006 ability to store extended attributes per file, install the dependent libattr package.
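Once libattr is installed, extended attributes can be set and read through the Hadoop file system shell. The attribute name, value, and path below are hypothetical examples:

```shell
# Attach an extended attribute to a file, then dump all of its
# user-namespace attributes ("/bigfs/data/file1" is a placeholder).
hadoop fs -setfattr -n user.project -v analytics /bigfs/data/file1
hadoop fs -getfattr -d /bigfs/data/file1
```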
To enable the remote file system in the Hadoop connector, add the following property:
<property>
<name>gpfs.remote.cluster.enabled</name>
<value>true</value>
</property>