For HDFS Transparency version 2.7.0-x

Short-circuit local read can be enabled only on Hadoop 2.7.0. HDFS Transparency version 2.7.0-x does not support this feature on Hadoop 2.7.1/2.7.2. IBM® BigInsights® IOP 4.1 uses Hadoop version 2.7.1. Therefore, short-circuit read cannot be enabled over IBM BigInsights IOP 4.1 if HDFS Transparency 2.7.0-x is used. For more information on how to enable short-circuit read on other Hadoop versions, contact scale@us.ibm.com.

Configuring short-circuit local read

To configure short-circuit local reads, enable libhadoop.so and use the DFS client shipped with IBM Storage Scale HDFS Transparency. The package name is gpfs.hdfs-protocol. You cannot use the standard HDFS DFS client to enable short-circuit mode over HDFS Transparency.

To enable libhadoop.so, either compile the native library on the target machine or use the library shipped with IBM Storage Scale HDFS Transparency. To compile the native library on the target machine, do the following steps:

  1. Download the Hadoop source code from the Hadoop community. Unpack the package and cd into that directory.
  2. Build with mvn: $ mvn package -Pdist,native -DskipTests -Dtar
  3. Copy hadoop-dist/target/hadoop-2.7.1/lib/native/libhadoop.so.* to $YOUR_HADOOP_PREFIX/lib/native/

    To use the libhadoop.so delivered with HDFS Transparency, copy /usr/lpp/mmfs/hadoop/lib/native/libhadoop.so to $YOUR_HADOOP_PREFIX/lib/native/libhadoop.so.

    The shipped libhadoop.so is built separately for x86_64, ppc64, and ppc64le; use the build that matches your platform. Both options are consolidated in the sketch below.
    Note: This step must be performed on all nodes running the Hadoop tasks.
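
Putting the two options together, here is a minimal shell sketch. The source archive URL and $YOUR_HADOOP_PREFIX are assumptions; adjust them to your environment. Option A requires a JDK, Maven, protobuf, and cmake on the build machine.

Option A - compile the native library from the Hadoop 2.7.1 source:
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1-src.tar.gz
$ tar xzf hadoop-2.7.1-src.tar.gz && cd hadoop-2.7.1-src
$ mvn package -Pdist,native -DskipTests -Dtar
$ cp hadoop-dist/target/hadoop-2.7.1/lib/native/libhadoop.so.* $YOUR_HADOOP_PREFIX/lib/native/

Option B - copy the library shipped with HDFS Transparency:
$ cp /usr/lpp/mmfs/hadoop/lib/native/libhadoop.so $YOUR_HADOOP_PREFIX/lib/native/libhadoop.so

Either way, verify that Hadoop detects the native library:
$ hadoop checknative -a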

Enabling DFS client

To enable the DFS client, perform the following procedure:

  1. On each node that accesses IBM Storage Scale in short-circuit mode, back up hadoop-hdfs-2.7.0.jar: $ mv $YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-hdfs-2.7.0.jar $YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-hdfs-2.7.0.jar.backup
  2. Link hadoop-gpfs-2.7.0.jar into the classpath: $ ln -s /usr/lpp/mmfs/hadoop/share/hadoop/hdfs/hadoop-gpfs-2.7.0.jar $YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-gpfs-2.7.0.jar
  3. Update the core-site.xml file with the following information:
    <property>
      <name>fs.hdfs.impl</name>
      <value>org.apache.hadoop.gpfs.DistributedFileSystem</value>
    </property>
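
After these three steps, you can sanity-check that the Transparency client is the one being loaded. Both commands below are standard Hadoop CLI calls; the output shown is illustrative:

$ ls -l $YOUR_HADOOP_PREFIX/share/hadoop/hdfs/hadoop-gpfs-2.7.0.jar
$ hdfs getconf -confKey fs.hdfs.impl
org.apache.hadoop.gpfs.DistributedFileSystem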

Short-circuit reads make use of a UNIX domain socket, which is a special path in the file system that allows the client and the DataNodes to communicate. You need to set a path to this socket. The DataNode must be able to create this path, but users other than the HDFS user or root must not be able to create it. Therefore, paths under the /var/run or /var/lib directories are often used.

The client and the DataNode exchange information through a shared memory segment on the /dev/shm path. Short-circuit local reads need to be configured on both the DataNode and the client. Here is an example configuration.
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>
Synchronize these changes across the entire cluster and, if needed, restart the services.
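One way to push the updated configuration to every node, assuming the properties live in the usual core-site.xml and hdfs-site.xml under $YOUR_HADOOP_PREFIX/etc/hadoop, passwordless SSH, and a hypothetical nodes.txt host list (one hostname per line):

$ for host in $(cat nodes.txt); do
>   scp $YOUR_HADOOP_PREFIX/etc/hadoop/core-site.xml \
>       $YOUR_HADOOP_PREFIX/etc/hadoop/hdfs-site.xml \
>       $host:$YOUR_HADOOP_PREFIX/etc/hadoop/
> done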
Note: The /var/lib/hadoop-hdfs directory and the socket file specified by dfs.domain.socket.path must be created manually by the root user before running short-circuit read. The /var/lib/hadoop-hdfs directory must be owned by the root user; otherwise, the DataNode service fails at startup. In the following commands, dn_socket matches the dfs.domain.socket.path value in the example above.
# mkdir -p /var/lib/hadoop-hdfs
# chown root:root /var/lib/hadoop-hdfs
# touch /var/lib/hadoop-hdfs/dn_socket
# chmod 666 /var/lib/hadoop-hdfs/dn_socket

Permission control for short-circuit reads is similar to common user access in HDFS: if you have permission to read a file, you can also access it through short-circuit read.
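
For example (illustrative path, users, and listing), a user who passes the normal HDFS permission check can also read the file short-circuit:

$ hadoop fs -ls /user/alice/data.txt
-rw-r-----   3 alice hadoop  134217728 2016-01-01 00:00 /user/alice/data.txt

Here alice and members of the hadoop group can read the file, short-circuit or not; other users are denied in both cases.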