How cache-aware scheduling works
Cache-aware scheduling enables a MapReduce job to get its input split from the cache. When you submit jobs that use the same input file (for example, iterative K-means jobs), cache-aware scheduling avoids loading the same data multiple times from a distributed file system (such as HDFS), which improves job performance. IBM® Spectrum Symphony intelligently allocates hosts based on the locality of the input split cache for tasks, and schedules waiting tasks on the idle hosts with the best cache match.
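The benefit of reusing the split cache across jobs can be illustrated with a minimal sketch. The `SplitCache` class and the dictionary-backed file system below are hypothetical, not the Symphony API; the point is that repeated jobs over the same input splits trigger only one read per split from the distributed file system.

```python
# Illustrative sketch (hypothetical names, not the Symphony implementation):
# repeated jobs over the same input file hit the split cache instead of
# re-reading the distributed file system.
class SplitCache:
    def __init__(self):
        self._cache = {}      # split id -> cached bytes
        self.dfs_reads = 0    # counts simulated distributed-file-system reads

    def read_split(self, split_id, dfs):
        if split_id in self._cache:
            return self._cache[split_id]   # cache hit: no DFS access
        self.dfs_reads += 1
        data = dfs[split_id]               # cache miss: load once from the DFS
        self._cache[split_id] = data
        return data

dfs = {"part-0": b"points-a", "part-1": b"points-b"}
cache = SplitCache()
# Two K-means-style iterations over the same input splits:
for _ in range(2):
    for split in dfs:
        cache.read_split(split, dfs)
print(cache.dfs_reads)  # prints 2: each split is loaded from the DFS only once
```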
- Each host holds the input splits of all its running map tasks in memory. When you configure caching, ensure that each host has sufficient memory to cache its input splits.
- Cache-aware scheduling enforces a write-once protocol: the first write to a cache entry performs a write-update, and all subsequent writes perform a write-invalidate.
- If the total size of the cache does not exceed the configured limit, cache files are mapped into system memory and used as an in-memory cache.
- If the total size of the cache exceeds the configured limit, cache files are not mapped into system memory and are instead used as an on-disk cache.
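The write-once protocol and the memory-budget check above can be sketched as follows. The class, function, and the `MAX_MEMORY_CACHE` limit are illustrative assumptions, not Symphony internals:

```python
# Illustrative sketch (hypothetical names): a write-once cache entry and the
# in-memory versus on-disk decision based on a configured size limit.
MAX_MEMORY_CACHE = 100  # assumed configured limit, in bytes

class CacheEntry:
    def __init__(self):
        self.data = None
        self.valid = False

    def write(self, data):
        if self.data is None:
            self.data = data      # first write: write-update
            self.valid = True
        else:
            self.valid = False    # any subsequent write: write-invalidate

def cache_tier(total_cache_size):
    # Within the configured limit, cache files are memory-mapped;
    # beyond it, they remain on disk.
    return "in-memory" if total_cache_size <= MAX_MEMORY_CACHE else "on-disk"

entry = CacheEntry()
entry.write(b"split-0")
print(entry.valid)        # prints True: the first write updates the cache
entry.write(b"split-0b")
print(entry.valid)        # prints False: a second write invalidates the entry
print(cache_tier(80))     # prints in-memory
print(cache_tier(120))    # prints on-disk
```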
If two hosts have the specified split cache entry for a map task, IBM Spectrum Symphony chooses the host with in-memory cache over the host with on-disk cache. If the split cache is not accessed by any job of any application for the interval defined by the PMR_MRSS_INPUTCACHE_CLEAN_INTERVAL environment variable, the input split and its local disk files are deleted.
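Both rules can be sketched in a few lines. The data structures and the interval value are assumptions for illustration; only the environment variable name comes from the text above:

```python
# Illustrative sketch (hypothetical structures): prefer a host whose copy of
# the split is in memory, and expire entries idle longer than the clean
# interval (modeled on PMR_MRSS_INPUTCACHE_CLEAN_INTERVAL).
CLEAN_INTERVAL = 600  # assumed interval value, in seconds

def choose_host(candidates):
    # candidates: list of (host, tier) pairs that hold the required
    # split cache entry, where tier is "in-memory" or "on-disk".
    in_memory = [host for host, tier in candidates if tier == "in-memory"]
    return in_memory[0] if in_memory else candidates[0][0]

def expired(last_access, now):
    # When True, the input split and its local disk files are deleted.
    return now - last_access > CLEAN_INTERVAL

print(choose_host([("hostA", "on-disk"), ("hostB", "in-memory")]))  # prints hostB
print(expired(last_access=0, now=700))  # prints True
```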