TeraGen
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=<JOB_MAP_NUMBER> -Ddfs.blocksize=<BLOCKSIZE> \
  <DATA_RECORDS> /<OUTPUT_PATH>
In the above command, <DATA_RECORDS> specifies the number of records to generate for your evaluation. Each record is 100 bytes, so 10,000,000,000 records correspond to 10,000,000,000 * 100 bytes, which is roughly 1 TB.
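As a quick sanity check, you can derive the record count from a target data size (assuming the decimal definition of 1 TB = 10^12 bytes):

# Target size of ~1 TB at 100 bytes per record
echo $(( 1000000000000 / 100 ))   # 10000000000 records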
<OUTPUT_PATH> is the output directory for the generated data. Change it to suit your environment.
The <JOB_MAP_NUMBER> should be set to (MaxTaskPerNode_mem * YarnNodeManagerNumber) - 1. In addition, the per-map data size in MB, ((<DATA_RECORDS> * 100) / (1024*1024)) / <JOB_MAP_NUMBER>, should not be very small; it should be close to dfs.blocksize or a multiple of dfs.blocksize. If this value is very small, your <DATA_RECORDS> is too small for your cluster. A worked example follows below.
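For example, assuming an illustrative cluster of 4 NodeManagers with MaxTaskPerNode_mem = 8 and a run of 10,000,000,000 records (these numbers are assumptions, not values taken from the command above):

# <JOB_MAP_NUMBER> = (8 * 4) - 1 = 31
echo $(( 8 * 4 - 1 ))
# Per-map data in MB: (10,000,000,000 * 100) / (1024*1024) / 31
echo $(( 10000000000 * 100 / (1024*1024) / 31 ))   # about 30,763 MB
# With a 128 MB block size this is roughly 240 blocks per map task,
# i.e. a multiple of dfs.blocksize, so the record count is large enough.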
If yarn.scheduler.capacity.resource-calculator has been changed by enabling CPU scheduling in Ambari, the smaller of MaxTaskPerNode_mem and MaxTaskPerNode_vcore takes effect. In that case, try to make MaxTaskPerNode_vcore and MaxTaskPerNode_mem close to each other; if they differ significantly, some cluster resources will remain unused.
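One common way to estimate these two values is from the standard YARN and MapReduce container settings; the concrete numbers below are illustrative assumptions (32 GB and 16 vcores per NodeManager, 4 GB and 1 vcore per map task):

# MaxTaskPerNode_mem   = yarn.nodemanager.resource.memory-mb  / mapreduce.map.memory.mb
# MaxTaskPerNode_vcore = yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores
echo $(( 32768 / 4096 ))   # MaxTaskPerNode_mem   = 8
echo $(( 16 / 1 ))         # MaxTaskPerNode_vcore = 16
# Here memory is the bottleneck (8 < 16), so half of the vcores would stay idle
# unless the container sizes are rebalanced.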
If you use the cluster's default block size (dfs.blocksize from hdfs-site.xml) for your TeraGen job, you do not need to specify -Ddfs.blocksize=<BLOCKSIZE>. If you want to use a different block size for this job, specify the -Ddfs.blocksize=<BLOCKSIZE> option.
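Putting it together, a fully filled-in invocation could look like the following; the map count, block size (256 MB, given in bytes), record count, and output path are illustrative and should be replaced with values derived for your own cluster:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=31 -Ddfs.blocksize=268435456 \
  10000000000 /benchmarks/teragen-1tb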