Hadoop properties supported for MapReduce jobs in IBM Spectrum Symphony

This section lists the Hadoop properties supported within the MapReduce framework in IBM® Spectrum Symphony.

Make Hadoop properties take effect in any of the following ways (a configuration example follows this list):
  • For a single job:
    • from the mrsh utility, use the -D option during job submission.
    • from the cluster management console, use the Add button on the Submit Job window during job submission.
  • For all jobs submitted from the host, use the pmr-site.xml configuration file.
    Important: Any Hadoop parameter defined in pmr-site.xml takes precedence over the corresponding parameter defined in Hadoop configuration files (such as mapred-site.xml or core-site.xml). Use pmr-site.xml to define Hadoop parameters only if you did not set HADOOP_HOME (before installing IBM Spectrum Symphony) or PMR_EXTERNAL_CONFIG_PATH (after installing IBM Spectrum Symphony). Both environment variables import settings from your Hadoop installation to IBM Spectrum Symphony.
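
For example, to apply a property to every job submitted from a host, add a property element to pmr-site.xml. The sketch below assumes pmr-site.xml uses the standard Hadoop configuration XML format; the property and directory value shown are illustrative only:

  <?xml version="1.0"?>
  <!-- pmr-site.xml: properties set here apply to all jobs submitted from this host
       and take precedence over mapred-site.xml and core-site.xml -->
  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/var/tmp/hadoop-${user.name}</value>
    </property>
  </configuration>

For a single job, you would instead pass the same property with the mrsh -D option at submission (typically in the form -Dproperty=value).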

hadoop.tmp.dir

A base for other temporary directories.

Default: /tmp/hadoop-${user.name}

Supported Hadoop versions: 2.7.2: hadoop.tmp.dir

mapreduce.job.committer.setup.cleanup.needed

Specifies whether a job requires setup and cleanup. Valid values are true or false. This property is not supported within the MapReduce framework in IBM Spectrum Symphony. In Hadoop, a value of false for this property means that setup and cleanup tasks are not created for a job. In IBM Spectrum Symphony, irrespective of this property's value, setup and cleanup tasks for a job are always created.

Default: true

Supported Hadoop versions: 2.7.2: mapreduce.job.committer.setup.cleanup.needed

io.sort.factor

Specifies the number of streams to merge at once while sorting files. This value determines the number of open file handles.

Default: 10

Supported Hadoop versions: 2.7.2: mapreduce.task.io.sort.factor

io.sort.mb

Specifies the size of the buffer memory (in megabytes) to use while sorting files.

Default: 100

Supported Hadoop versions: 2.7.2: mapreduce.task.io.sort.mb

io.sort.spill.percent

Specifies the soft limit in either the serialization buffer or the record collection buffer. Once the limit is reached, a thread begins to spill the contents to disk in the background.

Default: 0.80

Supported Hadoop versions: 2.7.2: mapreduce.map.sort.spill.percent
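
The three sort-related properties above are commonly tuned together. The following pmr-site.xml sketch shows one combination; the values are illustrative, not recommendations:

  <configuration>
    <!-- Merge up to 20 spill files per pass (more open file handles, fewer merge rounds) -->
    <property>
      <name>io.sort.factor</name>
      <value>20</value>
    </property>
    <!-- 200 MB sort buffer: fewer, larger spills to disk -->
    <property>
      <name>io.sort.mb</name>
      <value>200</value>
    </property>
    <!-- Begin background spilling when the buffer is 85% full -->
    <property>
      <name>io.sort.spill.percent</name>
      <value>0.85</value>
    </property>
  </configuration>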

mapred.local.dir

Specifies the local directory where intermediate data files are stored in the MapReduce framework. The path may be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.

Default: ${hadoop.tmp.dir}/mapred/local

Supported Hadoop versions: 2.7.2: mapreduce.cluster.local.dir
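
For example, listing one directory per physical disk spreads intermediate I/O across devices. This sketch assumes /disk1 and /disk2 are mount points for separate devices; nonexistent directories are simply ignored:

  <configuration>
    <property>
      <name>mapred.local.dir</name>
      <value>/disk1/mapred/local,/disk2/mapred/local</value>
    </property>
  </configuration>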

mapred.job.priority

Specifies the scheduling priority for a MapReduce job.
The MapReduce framework supports up to 10,000 levels of prioritization, where 1 is the lowest priority level. Set your job priority to any of the following predefined settings, where:
  • VERY_LOW=1
  • LOW=2500
  • NORMAL=5000
  • HIGH=7500
  • VERY_HIGH=10000

Default: NORMAL

Supported Hadoop versions: 2.7.2: mapreduce.job.priority
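
For instance, to schedule a job ahead of NORMAL-priority (5000) jobs, submit it with one of the higher predefined settings; the choice of HIGH here is illustrative:

  <configuration>
    <property>
      <name>mapred.job.priority</name>
      <value>HIGH</value>
    </property>
  </configuration>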

mapred.job.map.memory.mb

Specifies the maximum virtual memory (in megabytes) for a map task. If the task's memory usage exceeds this limit, the task is killed. If this limit is not configured, the value configured for mapred.task.maxvmem is used.

Default: -1

Supported Hadoop versions: 2.7.2: mapreduce.map.memory.mb

mapred.job.reduce.memory.mb

Specifies the maximum virtual memory (in megabytes) for a reduce task. If the task's memory usage exceeds this limit, the task is killed. If this limit is not configured, the value configured for mapred.task.maxvmem is used.

Default: -1

Supported Hadoop versions: 2.7.2: mapreduce.reduce.memory.mb
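
As a sketch, the following entries cap map tasks at 1536 MB and reduce tasks at 3072 MB of virtual memory; the limits shown are illustrative and should be sized to your workload:

  <configuration>
    <property>
      <name>mapred.job.map.memory.mb</name>
      <value>1536</value>
    </property>
    <!-- Reduce tasks often need more headroom for shuffle and merge buffers -->
    <property>
      <name>mapred.job.reduce.memory.mb</name>
      <value>3072</value>
    </property>
  </configuration>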

mapred.map.tasks

Specifies the default number of map tasks per job. Ignored when mapreduce.jobtracker.address is local.

Default: None (set based on user examples)

Supported Hadoop versions: 2.7.2: mapreduce.job.maps

mapred.reduce.tasks

Specifies the default number of reduce tasks per job. Typically, this property is set to 99 percent of the cluster's reduce capacity, so if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is local.

Default: None (set based on user examples)

Supported Hadoop versions: 2.7.2: mapreduce.job.reduces
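
As a worked example of the 99-percent guideline above: on a cluster with 20 reduce slots (an assumed figure, for illustration), 0.99 × 20 rounds down to 19 reduce tasks:

  <configuration>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>19</value>
    </property>
  </configuration>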

mapred.map.max.attempts

(For advanced users) Specifies the maximum number of attempts per map task. The framework tries to execute a map task this many times before giving up on it.

Default: 4

Supported Hadoop versions: 2.7.2: mapreduce.map.maxattempts

mapred.reduce.max.attempts

(For advanced users) Specifies the maximum number of attempts per reduce task. The framework tries to execute a reduce task this many times before giving up on it.

Default: 4

Supported Hadoop versions: 2.7.2: mapreduce.reduce.maxattempts

mapred.reduce.parallel.copies

Specifies the default number of parallel transfers run by reducers to fetch output from maps during the copy (shuffle) phase.

Default: 5

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.parallelcopies

mapred.map.child.log.level

Specifies the logging level for a map task. Valid values are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, or ALL.

Default: INFO

Supported Hadoop versions: 2.7.2: mapreduce.map.log.level

mapred.reduce.child.log.level

Specifies the logging level for a reduce task. Valid values are OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, or ALL.

Default: INFO

Supported Hadoop versions: 2.7.2: mapreduce.reduce.log.level

mapred.job.shuffle.merge.percent

Specifies the usage threshold at which an in-memory merge is initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs (as defined by mapred.job.shuffle.input.buffer.percent).

Default: 0.90

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.merge.percent

mapred.job.shuffle.input.buffer.percent

Specifies the percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle phase.

Default: 0.90

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.input.buffer.percent

mapred.job.reduce.input.buffer.percent

Specifies the percentage of memory to be allocated from the maximum heap size for retaining map outputs during the reduce phase. When the shuffle ends, any remaining map outputs in memory must consume memory less than this threshold before the reduce can begin.

Default: 0.0

Supported Hadoop versions: 2.7.2: mapreduce.reduce.input.buffer.percent
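
The three buffer-percent properties above work together. This hedged sketch buffers map outputs in 70% of the reduce heap during the shuffle, merges to disk at 66% usage, and lets retained outputs use up to half the heap during the reduce; all three values are illustrative:

  <configuration>
    <property>
      <name>mapred.job.shuffle.input.buffer.percent</name>
      <value>0.70</value>
    </property>
    <property>
      <name>mapred.job.shuffle.merge.percent</name>
      <value>0.66</value>
    </property>
    <property>
      <name>mapred.job.reduce.input.buffer.percent</name>
      <value>0.50</value>
    </property>
  </configuration>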

mapred.map.tasks.speculative.execution

Specifies whether multiple instances of some map tasks may be executed in parallel. Valid values are true or false.

Default: true

Supported Hadoop versions: 2.7.2: mapreduce.map.speculative

mapred.reduce.tasks.speculative.execution

Specifies whether multiple instances of some reduce tasks may be executed in parallel. Valid values are true or false.

Default: true

Supported Hadoop versions: 2.7.2: mapreduce.reduce.speculative
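
A common case is disabling speculative execution for tasks with external side effects (for example, tasks that write to an external system), so that duplicate attempts never run in parallel:

  <configuration>
    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>false</value>
    </property>
    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>false</value>
    </property>
  </configuration>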

map.sort.class

Specifies the default sort class for sorting keys.

Default: org.apache.hadoop.util.QuickSort

Supported Hadoop versions: 2.7.2: map.sort.class
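
For example, Hadoop also ships org.apache.hadoop.util.HeapSort as an alternative key sorter; swapping it in, as sketched below, is a judgment call rather than a recommendation:

  <configuration>
    <property>
      <name>map.sort.class</name>
      <value>org.apache.hadoop.util.HeapSort</value>
    </property>
  </configuration>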

jobclient.output.filter

Specifies the filter that controls the output of task user logs sent to the cluster management console by the JobClient. Valid values are NONE, KILLED, FAILED, SUCCEEDED, or ALL.

Default: FAILED

Supported Hadoop versions: 2.7.2: mapreduce.client.output.filter

mapred.reduce.slowstart.completed.maps

Specifies the fraction of the number of maps in the job that must be completed before reduces are scheduled for the job.

Default: 0.05

Supported Hadoop versions: 2.7.2: mapreduce.job.reduce.slowstart.completedmaps
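
For example, raising the fraction delays reduce scheduling until most maps are done, which keeps slots free for maps early in the job; the 0.80 value here is illustrative:

  <configuration>
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.80</value>
    </property>
  </configuration>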

mapred.output.compress

Specifies whether the final job output must be compressed. Valid values are true or false.

Default: false

Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress

mapred.output.compression.type

If the job emits SequenceFiles and the final job output is to be compressed, specifies the type of compression to use. Valid values are NONE, RECORD (compress individual records), or BLOCK (compress groups of records).

Default: RECORD

Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress.type

mapred.output.compression.codec

If the final job output is to be compressed, specifies the class name of the compression codec.

Default: org.apache.hadoop.io.compress.DefaultCodec

Supported Hadoop versions: 2.7.2: mapreduce.output.fileoutputformat.compress.codec
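
The three output-compression properties above are typically set together. This sketch compresses SequenceFile output in blocks with gzip; BLOCK and GzipCodec are illustrative choices:

  <configuration>
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compression.type</name>
      <value>BLOCK</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>
  </configuration>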

mapred.compress.map.output

Specifies whether map output must be compressed (using SequenceFile) as it is being written to disk. Valid values are true or false.

Default: false

Supported Hadoop versions: 2.7.2: mapreduce.map.output.compress

mapred.map.output.compression.codec

If the map output is to be compressed, specifies the class name of the compression codec.

Default: org.apache.hadoop.io.compress.DefaultCodec

Supported Hadoop versions: 2.7.2: mapreduce.map.output.compress.codec
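
For instance, compressing intermediate map output reduces shuffle I/O at some CPU cost. SnappyCodec is shown as an assumption; it requires the Hadoop native libraries to be available on the compute hosts:

  <configuration>
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
  </configuration>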

mapreduce.client.completion.pollinterval

Specifies the interval (in milliseconds) at which the JobClient polls the application manager for updates about job status. Set this to a lower value to make tests run faster on a single-node system. Adjusting this value in production may lead to unwanted client-server traffic.

Default: 5000

Supported Hadoop versions: 2.7.2: mapreduce.client.completion.pollinterval

mapreduce.client.progressmonitor.pollinterval

Specifies the interval (in milliseconds) at which the JobClient reports status to the cluster management console and checks for job completion. Set this to a lower value to make tests run faster on a single-node system. Adjusting this value in production may lead to unwanted client-server traffic.

Default: 1000

Supported Hadoop versions: 2.7.2: mapreduce.client.progressmonitor.pollinterval
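
On a single-node test system, both polling intervals can be shortened for faster feedback; the values below are illustrative and, per the notes above, not suited to production:

  <configuration>
    <property>
      <name>mapreduce.client.completion.pollinterval</name>
      <value>1000</value>
    </property>
    <property>
      <name>mapreduce.client.progressmonitor.pollinterval</name>
      <value>200</value>
    </property>
  </configuration>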

mapreduce.reduce.shuffle.connect.timeout

(For advanced users) Specifies the maximum duration (in milliseconds) that a reduce task tries connecting to the task tracker to fetch map output.

Default: 180000

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.connect.timeout

mapreduce.reduce.shuffle.read.timeout

(For advanced users) Specifies the maximum duration (in milliseconds) that a reduce task waits for map output data to become available for reading after a connection is established.

Default: 180000

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.read.timeout

mapreduce.reduce.shuffle.copy.mapout.unit

(For advanced users) Specifies the number of map output data units that a reduce task fetches in one connection.

Default: 5

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.copy.mapout.unit

mapreduce.reduce.shuffle.retry-delay.max.ms

Specifies the maximum length of time (in milliseconds) that the reducer delays before retrying to fetch intermediate data from map output files.

Default: 60000

Supported Hadoop versions: 2.7.2: mapreduce.reduce.shuffle.retry-delay.max.ms
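
As a final sketch, the shuffle timeout and retry properties above can be tightened so that fetch failures surface sooner; the 60-second and 30-second values are illustrative only:

  <configuration>
    <property>
      <name>mapreduce.reduce.shuffle.connect.timeout</name>
      <value>60000</value>
    </property>
    <property>
      <name>mapreduce.reduce.shuffle.read.timeout</name>
      <value>60000</value>
    </property>
    <property>
      <name>mapreduce.reduce.shuffle.retry-delay.max.ms</name>
      <value>30000</value>
    </property>
  </configuration>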