Hadoop properties supported for MapReduce jobs in IBM Spectrum Symphony
This section lists the Hadoop properties supported within the MapReduce framework in IBM® Spectrum Symphony. You can set these properties in the following ways:
- For a single job:
- from the mrsh utility, use the -D option during job submission.
- from the cluster management console, use the Add button on the Submit Job window during job submission.
- For all jobs submitted from the host, use the pmr-site.xml configuration file.
Important: Any Hadoop parameter defined in pmr-site.xml takes precedence over the corresponding parameter defined in Hadoop configuration files (such as mapred-site.xml or core-site.xml). Use pmr-site.xml to define Hadoop parameters only if you did not set HADOOP_HOME (before installing IBM Spectrum Symphony) or PMR_EXTERNAL_CONFIG_PATH (after installing IBM Spectrum Symphony). Both environment variables import settings from your Hadoop installation to IBM Spectrum Symphony.
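For a quick orientation, the following is a minimal sketch of a pmr-site.xml file that sets one such property for all jobs submitted from the host; the property and value are illustrative only. For a single job, the same property could instead be passed on the mrsh command line with the -D option (for example, -Dmapred.reduce.tasks.speculative.execution=false), assuming the job accepts generic options.

```xml
<?xml version="1.0"?>
<!-- pmr-site.xml: minimal illustrative sketch; substitute your own properties -->
<configuration>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
```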
hadoop.tmp.dir
A base for other temporary directories.
Default: /tmp/hadoop-${user.name}
Supported Hadoop versions: 2.7.2: hadoop.tmp.dir
mapreduce.job.committer.setup.cleanup.needed
Specifies whether a job requires setup and cleanup. Valid values are true or false. This property is not supported within the MapReduce framework in IBM Spectrum Symphony. In Hadoop, a value of false for this property means that setup and cleanup tasks are not created for a job. In IBM Spectrum Symphony, irrespective of this property's value, setup and cleanup tasks for a job are always created.
Default: true
Supported Hadoop versions: 2.7.2: mapreduce.job.committer.setup.cleanup.needed
io.sort.factor
Specifies the number of streams to merge at once while sorting files. This value determines the number of open file handles.
Default: 10
Supported Hadoop versions: 2.7.2: mapreduce.task.io.sort.factor
io.sort.mb
Specifies the size of the buffer memory (in megabytes) to use while sorting files.
Default: 100
Supported Hadoop versions: 2.7.2: mapreduce.task.io.sort.mb
io.sort.spill.percent
Specifies the soft limit in either the serialization buffer or the record collection buffer. Once the limit is reached, a thread begins to spill the contents to disk in the background.
Default: 0.80
Supported Hadoop versions: 2.7.2: mapreduce.map.sort.spill.percent
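Taken together, the three sort properties above control map-side sorting: with the defaults, a background spill begins once about 100 MB × 0.80 = 80 MB of the sort buffer fills. The following sketch, with illustrative values only, shows how a job with large map outputs might enlarge the buffer and merge more streams per pass inside the <configuration> element of pmr-site.xml:

```xml
<!-- Illustrative values: spill starts at about 256 MB x 0.80 = 205 MB -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<!-- Merge up to 20 streams (and open up to 20 file handles) per pass -->
<property>
  <name>io.sort.factor</name>
  <value>20</value>
</property>
```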
mapred.local.dir
Specifies the local directory where intermediate data files are stored in the MapReduce framework. The path may be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.
Default: ${hadoop.tmp.dir}/mapred/local
Supported Hadoop versions: 2.7.2: mapreduce.cluster.local.dir
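For example, to spread intermediate-file I/O across two physical disks, you might list one directory per device inside the <configuration> element of pmr-site.xml; the paths here are illustrative:

```xml
<!-- Illustrative paths: one local directory per physical disk -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```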
mapred.job.priority
Specifies the scheduling priority for a MapReduce job. Valid values map to the following numeric priorities:
- VERY_LOW=1
- LOW=2500
- NORMAL=5000
- HIGH=7500
- VERY_HIGH=10000
Default: NORMAL
Supported Hadoop versions: 2.7.2: mapreduce.job.priority
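As a sketch, a host whose jobs should run ahead of routine work might set a higher default priority in pmr-site.xml; a single job could equally pass -Dmapred.job.priority=HIGH at submission. The value here is illustrative:

```xml
<!-- Illustrative: all jobs from this host run at HIGH (numeric priority 7500) -->
<property>
  <name>mapred.job.priority</name>
  <value>HIGH</value>
</property>
```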
mapred.job.map.memory.mb
Specifies the maximum virtual memory for a map task. If the task's memory usage exceeds the limit, the task is killed. If this limit is not configured, the value configured for mapred.task.maxvmem is used.
Default: -1
Supported Hadoop versions: 2.7.2: mapreduce.map.memory.mb
mapred.job.reduce.memory.mb
Specifies the maximum virtual memory for a reduce task. If the task's memory usage exceeds the limit, the task is killed. If this limit is not configured, the value configured for mapred.task.maxvmem is used.
Default: -1
Supported Hadoop versions: 2.7.2: mapreduce.reduce.memory.mb
mapred.map.tasks
Specifies the default number of map tasks per job. Ignored when mapreduce.jobtracker.address is local.
Default: None (set based on user examples)
Supported Hadoop versions: 2.7.2: mapreduce.job.maps
mapred.reduce.tasks
Specifies the default number of reduce tasks per job. Typically, this property is set to 99 percent of the cluster's reduce capacity, so that if a node fails, the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is local.
Default: None (set based on user examples)
Supported Hadoop versions: 2.7.2: mapreduce.job.reduces
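To illustrate the 99 percent rule of thumb: a cluster that can run 40 reduce tasks concurrently would use 0.99 × 40 ≈ 39 reduces, leaving enough headroom that losing one node's tasks still fits in a single wave. A hypothetical setting:

```xml
<!-- Illustrative: 39 reduces for a cluster with roughly 40 concurrent reduce slots -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>39</value>
</property>
```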
mapred.map.max.attempts
(For advanced users) Specifies the maximum number of attempts per map task. The framework tries to execute a map task this many times before giving up on it.
Default: 4
Supported Hadoop versions: 2.7.2: mapreduce.map.maxattempts
mapred.reduce.max.attempts
(For advanced users) Specifies the maximum number of attempts per reduce task. The framework tries to execute a reduce task this many times before giving up on it.
Default: 4
Supported Hadoop versions: 2.7.2: mapreduce.reduce.maxattempts
mapred.reduce.parallel.copies
Specifies the default number of parallel transfers run by reducers to fetch output from maps during the copy (shuffle) phase.
Default: 5
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.parallelcopies
mapred.map.child.log.level
Specifies the logging level for a map task. Valid values are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL.
Default: INFO
Supported Hadoop versions: 2.7.2: mapreduce.map.log.level
mapred.reduce.child.log.level
Specifies the logging level for a reduce task. Valid values are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL.
Default: INFO
Supported Hadoop versions: 2.7.2: mapreduce.reduce.log.level
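For example, when diagnosing a failing mapper you might temporarily raise the map-task log level while leaving reduce tasks at the default; the value here is illustrative:

```xml
<!-- Illustrative: verbose logging for map tasks only -->
<property>
  <name>mapred.map.child.log.level</name>
  <value>DEBUG</value>
</property>
```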
mapred.job.shuffle.merge.percent
Specifies the threshold at which an in-memory merge is initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs (as defined by mapred.job.shuffle.input.buffer.percent).
Default: 0.90
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.merge.percent
mapred.job.shuffle.input.buffer.percent
Specifies the percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle phase.
Default: 0.90
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.input.buffer.percent
mapred.job.reduce.input.buffer.percent
Specifies the percentage of memory to be allocated from the maximum heap size for retaining map outputs during the reduce phase. When the shuffle ends, any remaining map outputs in memory must consume less memory than this threshold before the reduce can begin.
Default: 0.0
Supported Hadoop versions: 2.7.2: mapreduce.reduce.input.buffer.percent
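These three percentages interact. As a rough worked example with the defaults above and a 1000 MB reduce-task heap: up to 1000 × 0.90 = 900 MB of heap can hold fetched map outputs, an in-memory merge starts once 900 × 0.90 = 810 MB of that space fills, and because mapred.job.reduce.input.buffer.percent defaults to 0.0, all retained outputs are flushed to disk before the reduce begins. A sketch that instead keeps some map outputs in memory through the reduce (the value is illustrative):

```xml
<!-- Illustrative: let retained map outputs occupy up to 60% of the heap during the reduce -->
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.60</value>
</property>
```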
mapred.map.tasks.speculative.execution
Specifies whether multiple instances of some map tasks may be executed in parallel. Valid values are true or false.
Default: true
Supported Hadoop versions: 2.7.2: mapreduce.map.speculative
mapred.reduce.tasks.speculative.execution
Specifies whether multiple instances of some reduce tasks may be executed in parallel. Valid values are true or false.
Default: true
Supported Hadoop versions: 2.7.2: mapreduce.reduce.speculative
map.sort.class
Specifies the default sort class for sorting keys.
Default: org.apache.hadoop.util.QuickSort
Supported Hadoop versions: 2.7.2: map.sort.class
jobclient.output.filter
Specifies the filter that controls which task user logs are sent to the cluster management console of the JobClient. Valid values are NONE, KILLED, FAILED, SUCCEEDED, or ALL.
Default: FAILED
Supported Hadoop versions: 2.7.2: mapreduce.client.output.filter
mapred.reduce.slowstart.completed.maps
Specifies the fraction of the number of maps in the job that must be completed before reduces are scheduled for the job.
Default: 0.05
Supported Hadoop versions: 2.7.2: mapreduce.job.reduce.slowstart.completedmaps
mapred.output.compress
Specifies whether the final job output must be compressed. Valid values are true or false.
Default: false
Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress
mapred.output.compression.type
If the job emits SequenceFiles and the final job output is to be compressed, specifies the type of compression, which can be NONE, RECORD (to compress individual records), or BLOCK (to compress groups of records).
Default: RECORD
Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress.type
mapred.output.compression.codec
If the final job output is to be compressed, specifies the class name of the compression codec.
Default: org.apache.hadoop.io.compress.DefaultCodec
Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress.codec
mapred.compress.map.output
Specifies whether map output must be compressed (using SequenceFile) as it is being written to disk. Valid values are true or false.
Default: false
Supported Hadoop versions: 2.7.2: mapreduce.map.output.compress
mapred.map.output.compression.codec
If the map output is to be compressed, specifies the class name of the compression codec.
Default: org.apache.hadoop.io.compress.DefaultCodec
Supported Hadoop versions: 2.7.2: mapreduce.map.output.compress.codec
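Putting the compression properties together, a job that compresses both its intermediate map output and its final SequenceFile output might use settings like the following; the codec choice is illustrative, and any codec class available on your cluster works:

```xml
<!-- Illustrative: compress intermediate map output to reduce shuffle I/O -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<!-- Illustrative: compress the final output in blocks of records with gzip -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```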
mapreduce.client.completion.pollinterval
Specifies the interval (in milliseconds) at which the JobClient polls the application manager for updates about job status. Set this to a lower value to make tests run faster on a single-node system. Adjusting this value in production may lead to unwanted client-server traffic.
Default: 5000
Supported Hadoop versions: 2.7.2: mapreduce.client.completion.pollinterval
mapreduce.client.progressmonitor.pollinterval
Specifies the interval (in milliseconds) at which the JobClient reports status to the cluster management console and checks for job completion. Set this to a lower value to make tests run faster on a single-node system. Adjusting this value in production may lead to unwanted client-server traffic.
Default: 1000
Supported Hadoop versions: 2.7.2: mapreduce.client.progressmonitor.pollinterval
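For example, on a single-node test system you might shorten both polling intervals so that short test jobs report completion sooner; the values are illustrative and, as noted above, not recommended for production:

```xml
<!-- Illustrative test-only values: poll more frequently on a single-node system -->
<property>
  <name>mapreduce.client.completion.pollinterval</name>
  <value>1000</value>
</property>
<property>
  <name>mapreduce.client.progressmonitor.pollinterval</name>
  <value>200</value>
</property>
```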
mapreduce.reduce.shuffle.connect.timeout
(For advanced users) Specifies the maximum duration (in milliseconds) that a reduce task spends trying to connect to the task tracker to fetch map output.
Default: 180000
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.connect.timeout
mapreduce.reduce.shuffle.read.timeout
(For advanced users) Specifies the maximum duration (in milliseconds) that a reduce task waits for map output data to be available for reading after obtaining a connection.
Default: 180000
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.read.timeout
mapreduce.reduce.shuffle.copy.mapout.unit
(For advanced users) Specifies the number of map output data units that the reduce gets in one connection.
Default: 5
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.copy.mapout.unit
mapreduce.reduce.shuffle.retry-delay.max.ms
Specifies the length of time (in milliseconds) that the reducer delays before retrying to fetch intermediate data from MapReduce output files.
Default: 60000
Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.retry-delay.max.ms
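On a congested or high-latency network, the shuffle connect and read timeouts above can be raised together; this sketch simply doubles the defaults, and the values are illustrative:

```xml
<!-- Illustrative: double the shuffle connect/read timeouts for slow networks -->
<property>
  <name>mapreduce.reduce.shuffle.connect.timeout</name>
  <value>360000</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.read.timeout</name>
  <value>360000</value>
</property>
```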