Distributed cache
Some applications need side data for processing: extra read-only data that a job requires in order to process the main dataset. This data is stored in the distributed cache.
Overview
To save network bandwidth, files are normally copied to a compute node only once per job. To improve performance further, the distributed cache can also be shared by multiple jobs running on the same node, so that the same set of cache files does not have to be downloaded or unarchived separately for each job. With cross-job distributed cache sharing, a file is downloaded from, for example, HDFS, to the local store by the first job that needs it on a compute node. The distributed cache can be used by a job setup task, a mapper, a reducer, or a job cleanup task. When a job needs a set of distributed cache files, it first checks whether a local copy has already been downloaded by another job on the same node. If so, it uses that copy, avoiding the cost of downloading the files from HDFS and unarchiving any archived files.
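For example, a mapper can read a side-data file from the distributed cache once, during task setup, and then use it while processing records. The following sketch assumes the standard Hadoop MapReduce API that MRSS jobs use; the class name SideDataMapper, the tab-separated side-data format, and the lookup logic are illustrative assumptions only.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative mapper that loads a side-data file from the distributed
    // cache during task setup and uses it while processing records.
    public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      protected void setup(Context context) throws IOException {
        // Local paths of the cache files that were copied to this compute node.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null && cacheFiles.length > 0) {
          BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
          try {
            String line;
            while ((line = reader.readLine()) != null) {
              String[] parts = line.split("\t", 2);   // assumed "key<TAB>value" format
              if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
              }
            }
          } finally {
            reader.close();
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Enrich each input record with the side data loaded in setup().
        String enriched = lookup.get(value.toString());
        context.write(value, new Text(enriched == null ? "UNKNOWN" : enriched));
      }
    }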
Public and private distributed cache
The distributed cache can be shared by all jobs running on the same node, or only by jobs running under the same user on that node. The former is known as host-level or public distributed cache, because jobs running under any user on the node can use it; the latter is known as user-level or private distributed cache, because only jobs running under the specified user can use it. Whether a distributed cache file is public or private is determined by its permissions on the file system, typically HDFS, to which the files are uploaded: the file is public if it is readable by all users and every directory on the path leading to it is executable (searchable) by all users; otherwise, the file is private. The corresponding cache directories on a compute node are:
- Public distributed cache: SymMR_work_dir/distcache/_public_/
- Private distributed cache: SymMR_work_dir/distcache/$user/
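As a sketch of the public/private rule described above, the following illustrative Java method classifies a file on HDFS as public when the file itself is world-readable and every ancestor directory is world-executable. The class and method names are assumptions for illustration; this is not the MRSS implementation.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;

    // Illustrative check only: a cache file is public when it is readable by all
    // users and every directory on the path to it is executable (searchable) by
    // all users; otherwise it is private.
    public class CacheVisibility {

      public static boolean isPublic(FileSystem fs, Path file) throws IOException {
        // The file must be world-readable.
        if (!fs.getFileStatus(file).getPermission().getOtherAction().implies(FsAction.READ)) {
          return false;
        }
        // Every ancestor directory must be world-executable.
        for (Path dir = file.getParent(); dir != null; dir = dir.getParent()) {
          if (!fs.getFileStatus(dir).getPermission().getOtherAction().implies(FsAction.EXECUTE)) {
            return false;
          }
        }
        return true;
      }
    }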
If mapred.local.dir defines a comma-separated list of directories on different disks, Symphony chooses among those local disks when creating the distributed cache directories. MRSS establishes a symbolic link from the task working directory to the distributed cache directory.
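The sketch below illustrates this layout: splitting a comma-separated mapred.local.dir value, creating a cache directory on one of the listed disks, and linking it into a task working directory. All directory names and the choice of the first disk are assumptions for illustration; the actual disk-selection policy and paths are internal to Symphony.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustration only: place a public cache directory on one of the
    // mapred.local.dir disks and link it from a task working directory.
    public class CacheDirLayout {

      public static void main(String[] args) throws IOException {
        // Example mapred.local.dir value listing directories on different disks.
        String mapredLocalDir = "/data1/symphony/local,/data2/symphony/local";
        String[] localDirs = mapredLocalDir.split(",");

        // Assume the first disk is chosen; the real selection policy is Symphony's.
        Path cacheDir = Paths.get(localDirs[0], "distcache", "_public_");
        Files.createDirectories(cacheDir);

        // Link the cache directory into the task working directory (paths are illustrative).
        Path taskWorkDir = Paths.get("/data1/symphony/work/task_0001");
        Files.createDirectories(taskWorkDir);
        Files.createSymbolicLink(taskWorkDir.resolve("distcache"), cacheDir);
      }
    }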
Distributed cache lifecycle management
MRSS manages the distributed cache lifecycle, including creation, usage, cleanup, and failover recovery.
When a job is submitted, Symphony copies the files specified by the -files and -archives options, and by the -Dmapred.cache.files and -Dmapred.cache.archives options, to the SSM's file system (normally HDFS). Then, before any task runs on a compute node, the first task of the job on that node copies the files from the SSM's file system to a local disk (the cache) so that all tasks of the job can access them.
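A job can declare its cache files on the command line with -files/-archives (or -Dmapred.cache.files/-Dmapred.cache.archives), or programmatically. The sketch below uses the standard Hadoop DistributedCache API; the HDFS URIs, host names, and job name are placeholders, and the default identity mapper and reducer are assumed for brevity.

    import java.net.URI;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative job configuration that registers distributed cache entries
    // before submission; the HDFS paths are placeholders.
    public class SubmitWithCache {

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitWithCache.class);
        conf.setJobName("side-data-example");

        // Equivalent to -Dmapred.cache.files=hdfs://.../lookup.txt
        DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/user/demo/lookup.txt"), conf);
        // Equivalent to -Dmapred.cache.archives=hdfs://.../dict.zip
        DistributedCache.addCacheArchive(new URI("hdfs://namenode:9000/user/demo/dict.zip"), conf);

        // Input and output paths; mapper and reducer classes would normally be set here too.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }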
MRSS maintains a reference count for each file in the cache, tracking the number of jobs that are using it. Before a job's first task runs on the node, the reference count of each cache file the job uses is incremented by one; after the job's last task has run, the count is decremented by one. A file becomes eligible for deletion only when its count reaches zero, meaning that no job on the node is using it. The count must remain at zero for 600 seconds before the file is deleted.
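The following sketch models this reference-counting behavior: the count is incremented when a job starts using a cache file on the node, decremented when the job's last task finishes, and the file becomes a deletion candidate only after the count has stayed at zero for the 600-second retention period. The class and method names are illustrative, not the MRSS implementation.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Illustrative model of per-file reference counting for cross-job cache sharing.
    public class CacheRefCounter {

      private static final long RETENTION_MILLIS = 600 * 1000L;  // 600 seconds at zero before deletion

      private static class Entry {
        int refCount;    // number of jobs on this node currently using the file
        long zeroSince;  // time at which the count last dropped to zero
      }

      private final ConcurrentMap<String, Entry> entries = new ConcurrentHashMap<String, Entry>();

      // Called before the first task of a job runs on the node.
      public synchronized void acquire(String cacheFile) {
        Entry e = entries.get(cacheFile);
        if (e == null) {
          e = new Entry();
          entries.put(cacheFile, e);
        }
        e.refCount++;
      }

      // Called after the last task of a job on the node has run.
      public synchronized void release(String cacheFile) {
        Entry e = entries.get(cacheFile);
        if (e != null && --e.refCount == 0) {
          e.zeroSince = System.currentTimeMillis();
        }
      }

      // A file may be deleted only after its count has stayed at zero for the retention period.
      public synchronized boolean isEligibleForDeletion(String cacheFile, long now) {
        Entry e = entries.get(cacheFile);
        return e != null && e.refCount == 0 && now - e.zeroSince >= RETENTION_MILLIS;
      }
    }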
During failover, MRSS uses a footprint file to re-create the distributed cache entries, which define the relationship between jobs and distributed cache files, for all local cache files.
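As an illustration of this recovery step, the sketch below rebuilds in-memory cache entries from a footprint file. The footprint format used here (one "jobId<TAB>localCachePath" pair per line) is purely hypothetical; the actual MRSS footprint format is internal.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration only: rebuild job-to-cache-file relationships after failover
    // from a footprint file with a hypothetical "jobId<TAB>localCachePath" format.
    public class FootprintRecovery {

      public static Map<String, List<String>> recover(String footprintFile) throws IOException {
        Map<String, List<String>> cacheEntries = new HashMap<String, List<String>>();
        BufferedReader reader = new BufferedReader(new FileReader(footprintFile));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length != 2) {
              continue;  // skip malformed lines
            }
            List<String> files = cacheEntries.get(parts[0]);
            if (files == null) {
              files = new ArrayList<String>();
              cacheEntries.put(parts[0], files);
            }
            files.add(parts[1]);
          }
        } finally {
          reader.close();
        }
        return cacheEntries;
      }
    }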