Data compression
Data compression offers two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network and to or from disk.
Within the MapReduce framework in IBM® Spectrum Symphony, you can compress the output from map tasks as well as final job output (reducer output) before storing them. The benefits of compressing the map output as it is written to disk include faster writing to disk, disk space savings, and reduction in the amount of data to transfer to the reducer.
You can choose which compression algorithm (codec) to use. If the job output uses the SequenceFile type, you can also choose the compression type: RECORD (each record is compressed individually), BLOCK (groups of records are compressed together), or NONE (no compression).
You can configure data compression either in a configuration file or on the command line when you submit a job. By default, neither map task output nor job output is compressed.
Configuring data compression from the mrsh utility
To configure map output compression from the command line, add the following option to your job submission command:
-Dmapred.compress.map.output=true
To configure the codec for map output compression, add one of the following options to your job submission command:
- -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
- -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
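The two map output options above can be combined in a single submission. The following is a minimal sketch, not a definitive command: the jar name, class name, and input and output paths are hypothetical placeholders, and it assumes that the -D options follow the program name, as they do for Hadoop generic options. The script prints the assembled command for review instead of submitting a job.

```shell
#!/bin/sh
# Sketch only: collect the map output compression options in one variable.
# "my-job.jar", "MyJob", "/input", and "/output" are hypothetical placeholders.
MAP_OPTS="-Dmapred.compress.map.output=true"
MAP_OPTS="$MAP_OPTS -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

# Print the assembled command for review instead of submitting the job.
# Assumption: the -D options follow the program name, as with Hadoop generic options.
MAP_CMD="mrsh jar my-job.jar MyJob $MAP_OPTS /input /output"
echo "$MAP_CMD"
```

Setting only the codec option without mapred.compress.map.output=true leaves map output uncompressed, because compression is off by default.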
To configure job output compression, add the following option to your job submission command:
-Dmapred.output.compress=true
To configure the compression type for job output compression, add the following option to your job submission command, specifying exactly one of BLOCK, RECORD, or NONE (the default compression type is RECORD):
-Dmapred.output.compression.type=BLOCK | RECORD | NONE
To configure the codec for job output compression, add one of the following options to your job submission command:
- -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
- -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
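The three job output options above can likewise be given together. This is a hedged sketch under the same assumptions as any mrsh example here: the jar name, class name, and paths are hypothetical placeholders, the -D options are assumed to follow the program name, and the command is printed rather than run. BLOCK is chosen only for illustration; the compression type applies when the job output uses the SequenceFile type.

```shell
#!/bin/sh
# Sketch only: collect the job output compression options in one variable.
# "my-job.jar", "MyJob", "/input", and "/output" are hypothetical placeholders.
OUT_OPTS="-Dmapred.output.compress=true"
OUT_OPTS="$OUT_OPTS -Dmapred.output.compression.type=BLOCK"
OUT_OPTS="$OUT_OPTS -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

# Print the assembled command for review instead of submitting the job.
# Assumption: the -D options follow the program name, as with Hadoop generic options.
OUT_CMD="mrsh jar my-job.jar MyJob $OUT_OPTS /input /output"
echo "$OUT_CMD"
```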
Configuring data compression in a configuration file
- Open the mapred-site.xml configuration file at $HADOOP_HOME/conf.
- Add a property parameter for each compression option you want to configure. For example:
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
- If you did not set the HADOOP_HOME variable to your Hadoop configuration directory before installing IBM Spectrum Symphony or if you did not set PMR_EXTERNAL_CONFIG_PATH to your Hadoop configuration directory after installing IBM Spectrum Symphony, copy the mapred-site.xml file to $PMR_HOME/conf.
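The copy described in the last step could be scripted roughly as follows. This is a sketch under stated assumptions: copy_mapred_site is a hypothetical helper, not part of IBM Spectrum Symphony, and the script assumes that the edited file is in $HADOOP_HOME/conf and that both HADOOP_HOME and PMR_HOME are set in the current shell. If either variable is unset, the script does nothing.

```shell
#!/bin/sh
# Sketch only: copy mapred-site.xml from a Hadoop configuration directory
# into a target configuration directory.
# copy_mapred_site is a hypothetical helper, not a product command.
copy_mapred_site() {
    src_dir=$1
    dst_dir=$2
    # Fail with a message if the source file is missing.
    if [ ! -f "$src_dir/mapred-site.xml" ]; then
        echo "mapred-site.xml not found in $src_dir" >&2
        return 1
    fi
    cp "$src_dir/mapred-site.xml" "$dst_dir/"
}

# Run the copy only when both variables are set in the current shell.
if [ -n "${HADOOP_HOME:-}" ] && [ -n "${PMR_HOME:-}" ]; then
    copy_mapred_site "$HADOOP_HOME/conf" "$PMR_HOME/conf"
fi
```

If the edited file is somewhere else, the helper can be called directly with that directory as the first argument and "$PMR_HOME/conf" as the second.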