How cache-aware scheduling works

Cache-aware scheduling enables a MapReduce job to get its input split from the cache. When you submit jobs using the same input file (for example, using Kmeans jobs), cache-aware scheduling eliminates loading the same data multiple times from a distributed file system (such as from HDFS), and therefore, improves job performance. IBM® Spectrum Symphony intelligently allocates hosts, based on the locality of the input split cache for tasks, and schedules waiting tasks, on idle hosts, with the best cache match.

Before you enable ache-aware scheduling for MapReduce jobs, note the following points:
  1. Each host holds the input split of all its running map tasks in memory. When you configure caching, ensure that each host has sufficient memory to cache the data input splits.
  2. Cache-aware scheduling enforces the write-once protocol, which executes a write-update on the first write, and a write-invalidate on all subsequent writes.
With cache-aware scheduling, IBM Spectrum Symphony saves data input splitting for a map task to system memory (in-memory caching), and to a file on the local disk (on-disk caching). Configure the location on the local disk to store the map file of the input split using the PMR_MRSS_CACHE_PATH environment variable in the $EGO_ESRVDIR/esc/conf/services/mrss.xml file. Specify memory size using the PMR_MRSS_INPUTCACHE_MAX_MEMSIZE_MB environment variable. IBM Spectrum Symphony uses the following logic is for memory cache:
  • If the size of the total memory cache does not exceed the configured size, cache files are mapped to system memory and used as in-memory cache.
  • If the size of the total memory cache exceeds the configured size, the cache files are not mapped to system memory, and are instead, used as on-disk cache.

If two hosts have the specified split cache entry for a map task, IBM Spectrum Symphony chooses the host with in-memory cache over the host with on-disk cache. If the split cache is not accessed by any job of any application for a duration, the input split and its local disk files are deleted, as defined in the PMR_MRSS_INPUTCACHE_CLEAN_INTERVAL environment variable.