Cache-aware scheduling
MapReduce jobs that repeatedly read the same input splits benefit from cache-aware scheduling. The first iteration of a job caches its input splits; subsequent iterations reuse those cached splits as input. As a result, later iterations do not need to retrieve the same data again from a distributed file system such as HDFS. Cache-aware scheduling enables a job to get its input splits from either the on-disk or the in-memory cache.
Cache-aware scheduling supports applications that are submitted through the new Hadoop API, which is indicated by the mapred.mapper.new-api property on the job configuration page. This API is part of the org.apache.hadoop.mapreduce package and uses the org.apache.hadoop.mapreduce.Job class to submit jobs.
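For illustration, the following driver is a minimal sketch of an iterative job submitted through the org.apache.hadoop.mapreduce.Job class, the submission path that cache-aware scheduling supports. The class name, argument layout, and the use of the identity Mapper and Reducer as placeholders are assumptions for the example, not part of the product.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);        // the same input splits are read in every iteration
        Path outputBase = new Path(args[1]);
        int iterations = Integer.parseInt(args[2]);

        for (int i = 0; i < iterations; i++) {
            // Submitting through org.apache.hadoop.mapreduce.Job uses the new API,
            // which appears as mapred.mapper.new-api on the job configuration page.
            Job job = Job.getInstance(conf, "cache-aware-iteration-" + i);
            job.setJarByClass(IterativeJobDriver.class);

            // Identity mapper and reducer used as placeholders; a real application
            // would supply its own Mapper and Reducer implementations here.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // The first iteration reads the splits from HDFS; with cache-aware scheduling,
            // later iterations can be served from the on-disk or in-memory cache instead.
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, new Path(outputBase, "iteration-" + i));

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}

Because every iteration points at the same input path, the scheduler can recognize the repeated splits and serve them from the cache rather than from HDFS.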