Create filesets for MapReduce intermediate and temporary data
To efficiently store MapReduce intermediate and temporary data, use filesets and policies to better emulate local disk behavior.
Create an independent fileset
Consider using --inode-space new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] to create an independent fileset. This can improve performance for the fileset, but it requires calculating MaxNumInodes and NumInodesToPreallocate: MaxNumInodes must be eight times the number of files expected in the fileset, and NumInodesToPreallocate must be half the value of MaxNumInodes. See the mmcrfileset man page to understand this option.
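For example, if you expect roughly 1,000,000 files in the intermediate-data fileset (the expected file count here is illustrative, not from this environment), MaxNumInodes is 8 x 1,000,000 = 8,000,000 and NumInodesToPreallocate is 4,000,000, so the fileset could be created as an independent fileset like this:
# mmcrfileset gpfs-fpo-fs mapred-local-fileset --inode-space new --inode-limit 8000000:4000000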
Use the mmcrfileset command to create two filesets, one for local intermediate data and one for temporary data:
# mmcrfileset gpfs-fpo-fs mapred-local-fileset
# mmcrfileset gpfs-fpo-fs mapred-tmp-fileset
After the filesets are created, they must be linked to directories under the IBM Storage Scale file system mount point. This example uses /mnt/gpfs/mapred/local for intermediate data and /mnt/gpfs/tmp for temporary data. Because /mnt/gpfs/mapred/local is a nested directory, its parent directory structure must exist before the fileset is linked. These two directories are required for configuring Hadoop.
# mkdir -p $(dirname /mnt/gpfs/mapred/local)
# mmlinkfileset gpfs-fpo-fs mapred-local-fileset -J /mnt/gpfs/mapred/local
# mmlinkfileset gpfs-fpo-fs mapred-tmp-fileset -J /mnt/gpfs/tmp
Use the mmlsfileset command to display fileset information:
# mmlsfileset gpfs-fpo-fs -L
The next step in setting up the filesets is to apply an IBM Storage Scale policy so that the filesets act like local directories on each node. The policy instructs IBM Storage Scale not to replicate the data in these two filesets, and because the filesets are stored in the data pool, they can use FPO features that keep local writes on local disks. Metadata is still replicated three times, which can result in performance overhead. File placement policies are evaluated in the order they are entered, so ensure that the rules for these filesets appear before the default rule.
# cat policyfile
rule 'R1' SET POOL 'datapool' REPLICATE (1,1) FOR FILESET ('mapred-local-fileset')
rule 'R2' SET POOL 'datapool' REPLICATE (1,1) FOR FILESET ('mapred-tmp-fileset')
rule default SET POOL 'datapool'
# mmchpolicy gpfs-fpo-fs policyfile -I yes
Use the mmlspolicy command to display the currently active rule definition:
# mmlspolicy gpfs-fpo-fs -L
In each of these filesets, create a subdirectory for each node that runs Hadoop jobs. Based on the sample environment, this script creates those subdirectories:
# cat mk_gpfs_local_dirs.sh
#!/bin/sh
for nodename in $(mmlsnode -N all); do
mkdir -p /mnt/gpfs/tmp/${nodename}
mkdir -p /mnt/gpfs/mapred/local/${nodename}
done
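Run the script once from any node that has the file system mounted; a minimal usage sketch (the script name is taken from the listing above):
# sh mk_gpfs_local_dirs.sh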
After the subdirectories exist, on each node ${nodename} link /mnt/gpfs/tmp/${nodename} to /hadoop/tmp and /mnt/gpfs/mapred/local/${nodename} to /hadoop/local, as shown in the sketch below. Then, in the Hadoop cluster configuration, set hadoop.tmp.dir to /hadoop/tmp on all Hadoop nodes, and use /hadoop/local as mapred.cluster.local.dir in MRv1, or as yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in Hadoop YARN.
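A per-node sketch of the link step, assuming each node's IBM Storage Scale node name matches its hostname (an assumption; substitute the actual node name if they differ):
# ln -s /mnt/gpfs/tmp/$(hostname) /hadoop/tmp
# ln -s /mnt/gpfs/mapred/local/$(hostname) /hadoop/local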
To check that the rules are working properly, write some test files and verify their replication settings. A file created in the mapred-local fileset should show a data replication factor of 1, while a file created elsewhere in the file system should show the default of 3. For example:
Create some files:
# echo "test" > /mnt/gpfs/mapred/local/testRep1
# echo "test" > /mnt/gpfs/testRep3
Use the mmlsattr command to check the replication settings:
# mmlsattr /mnt/gpfs/mapred/local/testRep1
replication factors
metadata(max) data(max) file [flags]
-------------------------------------
1 ( 3) 1 ( 3) /mnt/gpfs/mapred/local/testRep1
# mmlsattr /mnt/gpfs/testRep3
replication factors
metadata(max) data(max) file [flags]
-------------------------------------
3 ( 3) 3 ( 3) /mnt/gpfs/testRep3