Clean up intermediate data for MapReduce jobs
Within the MapReduce framework in IBM® Spectrum Symphony, you can remove intermediate data that remains when a job is aborted or terminated. Intermediate data consists of the files that map tasks generate on local disk and that reduce tasks use as input.
When you trigger a cleanup for aborted and terminated jobs, intermediate files are removed from
compute hosts, thus saving disk space. The intermediate data that is removed includes:
- Data related to map and reduce tasks, stored at the location defined by the following property (specific to your Hadoop version) in $PMR_HOME/conf/pmr-site.xml:
- 0.21: mapreduce.cluster.local.dir
- 2.7.2: mapred.local.dir
- Data at the host level for each job, stored under $PMR_HOME/work/userdata/application/job_ID/.
- A footprint file for each job, created each time the job is submitted, stored under $PMR_HOME/work/footprint/application/job_ID/.
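Before triggering a cleanup, you can check how much disk space the per-job directories above consume. This is a minimal sketch, not part of the product; "MapReduce" and "job_0001" are hypothetical placeholders for the application name and job ID segments of the documented paths, and PMR_HOME must be set in your environment.

```shell
#!/bin/sh
# Sketch: derive and inspect the per-job directories listed above.

pmr_job_dirs() {
    # $1 = application name, $2 = job ID (both placeholders here)
    echo "$PMR_HOME/work/userdata/$1/$2"
    echo "$PMR_HOME/work/footprint/$1/$2"
}

# Report how much disk each per-job directory consumes, if it exists.
pmr_job_dirs MapReduce job_0001 | while read -r d; do
    if [ -d "$d" ]; then du -sh "$d"; fi
done
```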
Note: You cannot use IBM Spectrum Symphony Developer Edition to clean
up job data on hosts inside the production cluster. The cleanup process requires the service
controller library (libesc.so), which does not exist in the IBM Spectrum Symphony
Developer Edition environment.
To clean up intermediate data for aborted or terminated jobs, from the command line,
run:
mrsh cleanup
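If you want to see how much disk space a cleanup recovers, you can wrap the documented command in a small reporting function. This is a sketch under assumptions: PMR_HOME is set, mrsh is on the PATH, and the wrapper itself is not part of the product.

```shell
#!/bin/sh
# Sketch: run the documented "mrsh cleanup" command and report how much
# disk space it frees under $PMR_HOME/work.

cleanup_and_report() {
    before=$(du -sk "$PMR_HOME/work" 2>/dev/null | cut -f1)
    mrsh cleanup || return 1    # documented command: removes data for aborted/terminated jobs
    after=$(du -sk "$PMR_HOME/work" 2>/dev/null | cut -f1)
    echo "freed $((${before:-0} - ${after:-0})) KB"
}

# Example:
#   cleanup_and_report
```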
Attention: To reduce overhead for the SD daemon, the job status for all applications is
stored in an SD cache with an expiry limit of 15 minutes. This caching means that if a job's
status changed within the last 15 minutes, you cannot clean up its data: the cached status
still shows the job in an open state, such as running or suspended.
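Given the 15-minute cache expiry, a script can compute how long to wait before attempting cleanup. A minimal sketch; the last-status-change timestamp is a hypothetical input that you would obtain from your own job tracking.

```shell
#!/bin/sh
# Sketch: compute how many seconds remain in the 15-minute SD cache window
# before cleanup can see the job's final (closed) status.

CACHE_TTL=900   # 15-minute SD cache expiry, in seconds

seconds_until_cleanup() {
    # $1 = epoch seconds when the job last changed status (hypothetical input)
    now=$(date +%s)
    remaining=$((CACHE_TTL - (now - $1)))
    if [ "$remaining" -gt 0 ]; then echo "$remaining"; else echo 0; fi
}

# Example: sleep out the window, then run the documented cleanup command.
#   sleep "$(seconds_until_cleanup "$last_change")" && mrsh cleanup
```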