DFSIO

DFSIO (TestDFSIO) ships with the Hadoop distribution as part of the hadoop-mapreduce-client-jobclient JAR.

Run TestDFSIO without arguments to print its options:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO

Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] |
-write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N]
[-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

Usually we care only about the -read, -write, -nrFiles, -size, and -bufferSize options.

The -read option evaluates read performance, and -write evaluates write performance. The -nrFiles option sets the number of files to generate, and -size sets the size of each file. (-bufferSize is covered in the 3rd tuning guide below.)

The total data that TestDFSIO reads or writes is therefore (nrFiles * size). DFSIO's logic is simple: it launches nrFiles map tasks, one per file, across the whole Hadoop/YARN cluster.
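
For example, a minimal write-then-read run might look like the following (-nrFiles 64 and -size 1GB are illustrative values, not recommendations). Run -write before -read, because the read test reads the files that the write test created; -clean removes the test data afterward:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO -write -nrFiles 64 -size 1GB
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO -read -nrFiles 64 -size 1GB
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO -clean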

1st tuning guide: nrFiles and task number

When evaluating with TestDFSIO, take YARN's configuration into account. If the cluster can run at most TotalMapTaskPerWave map tasks concurrently (one "wave"), set nrFiles to TotalMapTaskPerWave so that all map tasks complete in a single wave.
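
As a worked example (illustrative figures, assuming memory is the limiting resource): with 10 NodeManagers, yarn.nodemanager.resource.memory-mb = 96000, and mapreduce.map.memory.mb = 4096, each node can run floor(96000 / 4096) = 23 concurrent map tasks, so TotalMapTaskPerWave = 10 * 23 = 230 and you would pass -nrFiles 230.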

With IBM Storage Scale FPO, -size should be at least 512MB (try 1GB, 2GB, and 4GB). With shared storage or IBM Storage Scale System, -size should be at least 1GB (again, try 1GB, 2GB, and 4GB).

In our experience, use as many map tasks as possible for DFSIO read. For DFSIO write, size the number of map tasks to the number of logical processors, even if free memory would allow more.

2nd tuning guide: nrFiles * size

The total data size (nrFiles * size) should be at least 4 times the total physical memory of all HDFS nodes, so that the page cache cannot absorb the workload and mask disk performance. For example, when comparing DFSIO results between native HDFS and IBM Storage Scale: with 10 native HDFS DataNodes and 100GB of physical memory per DataNode, (nrFiles * size) over native HDFS should be 4 * 100GB * 10 = 4000GB. Then use the same (nrFiles * size) for IBM Storage Scale.
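
Tying the two guides together (illustrative numbers): with -nrFiles 230 from the earlier wave example, each file needs a -size of at least 4000GB / 230 ≈ 17.4GB, so a write run might be:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO -write -nrFiles 230 -size 18GB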

3rd tuning guide: -bufferSize

Try setting -bufferSize to the block size of IBM Storage Scale. This is the I/O buffer size, in bytes, that each task uses to read and write data.
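
For example, if the IBM Storage Scale file system block size were 2MiB (an illustrative value; check yours with mmlsfs <device> -B), you would pass -bufferSize 2097152:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.jar TestDFSIO -read -nrFiles 230 -size 18GB -bufferSize 2097152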