TeraGen

To run the TeraGen benchmark, execute the following command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=<JOB_MAP_NUMBER> -Ddfs.blocksize=<BLOCKSIZE> \
  <DATA_RECORDS> /<OUTPUT_PATH>

In the above command, <DATA_RECORDS> specifies the number of records to generate. Each record is 100 bytes, so 10,000,000,000 records produce 10,000,000,000 * 100 bytes of data, which is close to 1 TB.
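For example, the following invocation generates 10,000,000,000 records (about 1 TB). The map count of 159 and the output path /benchmarks/teragen-1tb are illustrative assumptions, not prescribed values; the sizing example further below shows how such a map count might be derived.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=159 10000000000 /benchmarks/teragen-1tb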

<OUTPUT_PATH> is the directory where the generated data is written. Change it to suit your environment.

TeraGen has no reduce tasks, so <JOB_MAP_NUMBER> is the value you need to plan carefully. Refer to yarn.nodemanager.resource.memory-mb, yarn.scheduler.minimum-allocation-vcores, yarn.app.mapreduce.am.resource.cpu-vcores, and yarn.app.mapreduce.am.resource.mb in Table 1.

Set <JOB_MAP_NUMBER> to MaxTaskPerNode_mem * YarnNodeManagerNumber - 1. Also, the average data size per map task, ((<DATA_RECORDS> * 100) / (1024 * 1024)) / <JOB_MAP_NUMBER> (in MB), should not be very small; it should be close to dfs.blocksize (expressed in MB) or a multiple of it. If this value is very small, your <DATA_RECORDS> is too small for your cluster. A worked example follows below.
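As a purely illustrative sizing sketch (the cluster figures below are assumptions, not values from Table 1), suppose the cluster has 10 NodeManagers, MaxTaskPerNode_mem is 16, and <DATA_RECORDS> is 10,000,000,000:

  <JOB_MAP_NUMBER>  = 16 * 10 - 1 = 159
  data per map (MB) = ((10000000000 * 100) / (1024 * 1024)) / 159 ≈ 953,674 / 159 ≈ 5,998 MB

At roughly 5,998 MB per map, each map writes many multiples of a typical 128 MB block, so this record count is not too small for such a cluster.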

If yarn.scheduler.capacity.resource-calculator has been changed by enabling CPU scheduling in Ambari, the smaller of MaxTaskPerNode_mem and MaxTaskPerNode_vcore takes effect. In that situation, try to make MaxTaskPerNode_vcore and MaxTaskPerNode_mem close to each other; if they differ widely, it usually indicates free resources that are never utilized.
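For illustration only, and assuming MaxTaskPerNode_mem and MaxTaskPerNode_vcore are obtained by dividing a NodeManager's memory and vcores by the per-map container requests (all figures here are hypothetical):

  MaxTaskPerNode_mem   = 98304 MB / 6144 MB  = 16
  MaxTaskPerNode_vcore = 12 vcores / 1 vcore = 12
  effective tasks per node = min(16, 12)     = 12

Here the vcore limit caps each node at 12 concurrent maps, leaving about 24 GB of NodeManager memory idle; adjusting the per-container memory or vcores would bring the two limits closer.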

If you keep the cluster's default block size (dfs.blocksize from hdfs-site.xml) for your TeraGen job, you do not need to specify -Ddfs.blocksize=<BLOCKSIZE>. If you want the job to use a different block size, specify the -Ddfs.blocksize=<BLOCKSIZE> option, as in the example below.
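For example, to run the same illustrative 1 TB job with a 256 MB block size (268435456 bytes) for this job only, where the map count and output path remain hypothetical:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=159 -Ddfs.blocksize=268435456 \
  10000000000 /benchmarks/teragen-1tb-256m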