Cleaning up data for MapReduce jobs
Remove intermediate data that remains when a MapReduce job is aborted or killed. Intermediate data refers to the files that map tasks generate on local disk and that reduce tasks use as input.
When you trigger a cleanup for aborted and killed jobs, intermediate files are removed from
compute hosts, thus saving disk space. The intermediate data that is removed includes:
- Data related to map and reduce tasks, stored at the location defined by the mapreduce.cluster.local.dir property in $PMR_HOME/conf/pmr-site.xml.
- Host-level data for each job, stored under $PMR_HOME/work/userdata/application/job_id/.
- The footprint file for each job, which is created every time a job is submitted, stored under $PMR_HOME/work/footprint/application/job_id/.
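For example, for a hypothetical application named wordcount with job ID 1001 (both names are illustrative), a cleanup would remove directories such as:
soamlogon -u user_name -x password
$PMR_HOME/work/userdata/wordcount/1001/
$PMR_HOME/work/footprint/wordcount/1001/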
To clean up intermediate data for aborted jobs, killed jobs, or both from the command line:
- Ensure that the service director (SD) and MapReduce shuffle service (MRSS) daemons are running by using the egosh service list command. The command output should show SD and MRSS in the Run state.
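For example, assuming a UNIX shell with grep available, you can filter the output to just the two daemons:
egosh service list | grep -E 'SD|MRSS'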
- Log on to the system by using the soamlogon command; the logon remains in effect for the duration of the command session. For example:
soamlogon -u user_name -x password
where:
- -u user_name specifies the name of the user for this command session.
- -x password specifies the user password for this command session.
- Clean up intermediate data by using the mrsh cleanup command.
Attention: To reduce overhead on the SD daemon, the job status for all applications is stored in an SD cache with an expiry limit of 15 minutes. Because of this caching, if a job's status changed within the last 15 minutes, the cache can still report the job in an open state, such as running or suspended, and its data cannot be cleaned up until the cache entry expires.
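For example, to log on and then trigger the cleanup (user_name and password are placeholders; the mrsh cleanup command may accept additional options to target aborted or killed jobs, so check the mrsh command reference for your version):
soamlogon -u user_name -x password
mrsh cleanup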