HDFS erasure coding

HDFS erasure coding is an alternative storage strategy to traditional HDFS three-way block replication. The key advantage of erasure coding is that files consume less storage space.

You can enable erasure coding on a per directory basis. For more information, see HDFS erasure coding.
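As a sketch of how per-directory enablement works, the following commands use the hdfs ec subcommand. The directory path /warehouse/sales and the RS-6-3-1024k policy name are illustrative assumptions; substitute the path and policy appropriate for your cluster.

```shell
# List the erasure coding policies that the cluster supports
hdfs ec -listPolicies

# Enable the chosen policy if it is not already enabled
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory (hypothetical path shown);
# files written under it afterward are erasure coded
hdfs ec -setPolicy -path /warehouse/sales -policy RS-6-3-1024k

# Verify which policy is in effect on the directory
hdfs ec -getPolicy -path /warehouse/sales
```

Note that setting a policy affects only files written after it is set; existing files keep their original layout unless they are rewritten.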

Db2® Big SQL is compatible with erasure coding. For example, the HDFS files that underlie Db2 Big SQL tables can use erasure coding or three-way block replication, or a combination of both strategies.
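For example, you can confirm which strategy a given table's data uses by querying the policy on its storage directory. The warehouse paths below are hypothetical; substitute the actual HDFS locations of your Db2 Big SQL tables.

```shell
# A directory with an erasure coding policy reports that policy,
# while a traditionally replicated directory reports no policy
hdfs ec -getPolicy -path /warehouse/tablespace/external/hive/sales_ec
hdfs ec -getPolicy -path /warehouse/tablespace/external/hive/sales_rep
```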

There are performance considerations when choosing erasure coding. Although this approach uses only about half the HDFS space of three-way replication, the additional CPU and network overhead of erasure coding can impact Db2 Big SQL performance. When table data is stored by using HDFS erasure coding (instead of three-way block replication), Db2 Big SQL query workloads can take 10% or more longer, on average, to complete, and individual queries can take several times longer. This appears to be associated with a small increase in CPU resource consumption, but a significant increase in network traffic. The performance impact of switching to erasure coding is also both workload-specific and cluster-specific.

Before widely adopting this storage strategy, it is recommended that you take the following actions:
  • For Intel-based clusters, ensure that the ISA-L native libraries are enabled. Use the hadoop checknative command to confirm this.
  • Assess the impact of erasure coding on performance by testing a subset of your tables and workloads.
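A minimal check for ISA-L support might look like the following. The exact report varies by Hadoop version, but it includes an ISA-L line that should read true when the native library is loaded.

```shell
# Print which native libraries Hadoop can load; look for an
# "ISA-L: true" entry in the report
hadoop checknative -a

# Optionally filter the report down to the ISA-L line
hadoop checknative -a 2>/dev/null | grep -i isa-l
```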