Create filesets for MapReduce intermediate and temporary data

To efficiently store MapReduce intermediate and temporary data, use filesets and policies to better emulate local disk behavior.

Note: If MapReduce intermediate and temporary data is not stored on IBM Storage Scale (that is, if mapred.cluster.local.dir in MRv1, or yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs in Hadoop YARN, does not point to an IBM Storage Scale directory), you can skip this section.

Create an independent fileset

Consider using --inode-space new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] to create an independent fileset. This can improve fileset performance but requires calculating MaxNumInodes and NumInodesToPreallocate: MaxNumInodes must be eight times the number of files expected in the fileset, and NumInodesToPreallocate must be half the value of MaxNumInodes. See the mmcrfileset man page to understand this option.
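For illustration, assuming a fileset expected to hold about one million files (a hypothetical figure), MaxNumInodes would be 8 x 1,000,000 = 8,000,000 and NumInodesToPreallocate would be 4,000,000, so the creation command in the next step could instead be:

# mmcrfileset gpfs-fpo-fs mapred-local-fileset --inode-space new --inode-limit 8000000:4000000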

Use the mmcrfileset command to create two filesets, one for local intermediate data and one for temporary data:

# mmcrfileset gpfs-fpo-fs mapred-local-fileset
# mmcrfileset gpfs-fpo-fs mapred-tmp-fileset

After the filesets are created, they must be linked to directories under the IBM Storage Scale file system mount point. This example uses /mnt/gpfs/mapred/local for intermediate data and /mnt/gpfs/tmp for temporary data. Because /mnt/gpfs/mapred/local is a nested directory, its parent directory structure must exist before the fileset is linked. These two directories are required for configuring Hadoop.

# mkdir -p $(dirname /mnt/gpfs/mapred/local)
# mmlinkfileset gpfs-fpo-fs mapred-local-fileset -J /mnt/gpfs/mapred/local 
# mmlinkfileset gpfs-fpo-fs mapred-tmp-fileset -J /mnt/gpfs/tmp 

Use the mmlsfileset command to display fileset information:

# mmlsfileset gpfs-fpo-fs -L

The next step in setting up the filesets is to apply an IBM Storage Scale policy so that the filesets act like local directories on each node. This policy instructs IBM Storage Scale not to replicate the data in these two filesets, and because the filesets are stored in the data pool, they can use FPO features that keep local writes on local disks. Metadata must still be replicated three times, which can result in performance overhead. File placement policies are evaluated in the order they are entered, so ensure that the rules for these filesets appear before the default rule.

# cat policyfile
rule 'R1' SET POOL 'datapool' REPLICATE (1,1) FOR FILESET ('mapred-local-fileset') 
rule 'R2' SET POOL 'datapool' REPLICATE (1,1) FOR FILESET ('mapred-tmp-fileset') 
rule default SET POOL 'datapool'
# mmchpolicy gpfs-fpo-fs policyfile -I yes

Use the mmlspolicy command to display the currently active rule definition:

# mmlspolicy gpfs-fpo-fs -L

In each of these filesets, create a subdirectory for each node that runs Hadoop jobs. Based on the sample environment, this script creates these subdirectories:

# cat mk_gpfs_local_dirs.sh
#!/bin/sh
# Create a per-node subdirectory in each fileset
for nodename in $(mmlsnode -N all); do
    mkdir -p /mnt/gpfs/tmp/${nodename}
    mkdir -p /mnt/gpfs/mapred/local/${nodename}
done
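Because IBM Storage Scale is a shared file system, this script needs to run only once, from any node where /mnt/gpfs is mounted. Assuming the script is saved in the current working directory:

# chmod +x mk_gpfs_local_dirs.sh
# ./mk_gpfs_local_dirs.sh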

After that, on each node ${nodename}, create symbolic links from /mnt/gpfs/tmp/${nodename} to /hadoop/tmp and from /mnt/gpfs/mapred/local/${nodename} to /hadoop/local. Then configure /hadoop/tmp as hadoop.tmp.dir on all Hadoop nodes, and configure /hadoop/local as mapred.cluster.local.dir in MRv1, or as yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs in Hadoop YARN.
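A minimal sketch of the per-node linking, assuming each node's hostname matches the node name reported by mmlsnode (verify this assumption in your environment):

# mkdir -p /hadoop
# ln -s /mnt/gpfs/tmp/$(hostname) /hadoop/tmp
# ln -s /mnt/gpfs/mapred/local/$(hostname) /hadoop/local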

To check that the rules are working properly, you can write some test files and verify their replication settings. For example:

Create some files:

# echo "test" > /mnt/gpfs/mapred/local/testRep1 
# echo "test" > /mnt/gpfs/testRep3 

Use the mmlsattr command to check the replication settings:

# mmlsattr /mnt/gpfs/mapred/local/testRep1
replication factors
metadata(max) data(max) file [flags]
------------- --------- ---------------
      1 (  3)   1 (  3) /mnt/gpfs/mapred/local/testRep1

# mmlsattr /mnt/gpfs/testRep3
replication factors
metadata(max) data(max) file [flags]
------------- --------- ---------------
      3 (  3)   3 (  3) /mnt/gpfs/testRep3
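Once the replication factors match expectations (1 for files in the fileset, 3 elsewhere), the test files can be removed:

# rm /mnt/gpfs/mapred/local/testRep1 /mnt/gpfs/testRep3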