Configuring cache-aware scheduling
To enable a MapReduce job to retrieve its input splits from the cache, set the caching options in the MRSS definition file (mrss.xml). When the configuration is complete, submit a job, and then use the cluster management console to view the number of map tasks that retrieved their input splits from the cache.
Before you begin
For the greatest performance gains, configure cache-aware scheduling as follows:
- Enable services to be pre-started through the preStartApplication element in the Consumer section of the application profile.
- Configure PMR_MRSS_INPUTCACHE_MAX_MEMSIZE_MB to use all your free memory. To determine your total free memory, use this formula:
  memory available - memory used by SI - memory used by DataNode - memory used by intermediate data = total free memory
- Run iterative jobs with large input files, which can occupy most of the free memory on each compute host.
- Run iterative jobs with more than a few iterations.
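The free-memory formula in the list above can be worked through as a small calculation. The following is a minimal sketch only: the class name, method name, and all memory figures are hypothetical placeholders, and the real values must be measured on your own compute hosts.

```java
// Sketch only: CacheMemoryBudget and every figure below are illustrative
// assumptions, not values taken from the product.
final class CacheMemoryBudget {

    /**
     * Applies the formula:
     * memory available - SI - DataNode - intermediate data = total free memory.
     * All arguments and the result are in megabytes.
     */
    static long totalFreeMb(long availableMb, long siMb, long dataNodeMb, long intermediateMb) {
        long free = availableMb - siMb - dataNodeMb - intermediateMb;
        if (free < 0) {
            throw new IllegalArgumentException("Memory consumers exceed available memory");
        }
        return free;
    }

    public static void main(String[] args) {
        // Hypothetical host: 64 GB available, 8 GB for SI,
        // 2 GB for the DataNode, 4 GB for intermediate data.
        long free = totalFreeMb(65536, 8192, 2048, 4096);
        System.out.println("PMR_MRSS_INPUTCACHE_MAX_MEMSIZE_MB=" + free);
    }
}
```

With these assumed figures, the result (51200 MB) is the value you would consider for PMR_MRSS_INPUTCACHE_MAX_MEMSIZE_MB on that host.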
Note: To uniquely identify an input split and to name the index of the cache, cache-aware scheduling uses the string that is returned by the toString() method of the input split subclass, which is specified at job submission. Therefore, ensure that you name each input file uniquely so that this string is unique for every split.
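To illustrate why the toString() output matters, here is a minimal, self-contained sketch of a split-like class whose string form combines the file path, start offset, and length. RangeSplit and allKeysUnique are hypothetical names introduced for illustration; a real input split would extend Hadoop's InputSplit class, which is omitted here to keep the sketch free of dependencies.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: RangeSplit is a hypothetical stand-in for an input split subclass.
final class RangeSplit {
    private final String path;
    private final long start;
    private final long length;

    RangeSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    // The string form acts as the identifying key, so it includes every
    // field that distinguishes one split from another.
    @Override
    public String toString() {
        return path + ":" + start + "+" + length;
    }

    // Returns true only if no two splits produce the same toString() key.
    static boolean allKeysUnique(List<RangeSplit> splits) {
        Set<String> keys = new HashSet<>();
        for (RangeSplit split : splits) {
            if (!keys.add(split.toString())) {
                return false;
            }
        }
        return true;
    }
}
```

Two splits that cover different byte ranges of the same file yield different keys in this sketch, whereas two splits with an identical path, offset, and length collide. A check like allKeysUnique can be run before submission to catch such collisions.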