Major compaction and data locality

Major compactions are necessary for StoreFile cleanup. Ideally, you should run a major compaction after each start of HBase. A major compaction ensures that all data that is owned by a region server is local to that server.

What are data locality and compaction?

The term data locality refers to placing data close to where it is needed. To get data locality, the Hadoop datanode and the HBase Region Server must run on the same cluster nodes. The Hadoop datanode stores the data that the Region Server is managing, and that data migrates to the Region Server's node when the Region is compacted and rewritten. Once the data of a Region is local, it stays local for as long as that Region Server serves that Region.
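One way to observe data locality (shown here only as a sketch; the percentFilesLocal metric is exposed through the Region Server's JMX endpoint, and the web UI port can vary by version, 16030 on recent releases and 60030 on older ones) is to query the Region Server metrics:

# Sketch: report the percentage of store file blocks that are local to this
# Region Server. Replace <regionserver-host> with an actual host name.
curl -s http://<regionserver-host>:16030/jmx | grep -i percentFilesLocal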

When data is written to HBase, it is first written to an in-memory store called the MemStore. When the MemStore reaches a certain size, it is flushed to disk as a StoreFile. The store files that are created on disk are immutable. Periodically, store files are merged together by a process called compaction.
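As an illustration, the size at which the MemStore is flushed is controlled by the hbase.hregion.memstore.flush.size property in hbase-site.xml. The 128 MB value below is the usual default, shown only as an example rather than a tuning recommendation:

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>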

There are two kinds of compaction: minor and major. Minor compactions merge a small number of store files. A major compaction merges all of the store files within each store of a region, and also removes deleted cells and expired versions.
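For reference, the number of store files in a store that triggers a minor compaction is governed by the hbase.hstore.compactionThreshold property (the value 3 below is the usual default, shown only as an example):

<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
</property>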

Why tune compaction?

By default, major compactions run every 24 hours and merge together all store files into one. After a major compaction runs, there is a single StoreFile for each store. Compactions can cause HBase to block writes to prevent JVM heap exhaustion. Major compactions can be triggered manually, which is the recommended procedure. You can schedule major compactions to occur automatically when the usage on the cluster is low.
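For example, one way to schedule a manual major compaction during a quiet period is a cron entry that pipes the command into the HBase shell. This is only a sketch; the table name t1, the 03:00 schedule, and the /usr/local/hbase installation path are placeholders for your environment:

# Run a major compaction of table t1 every day at 03:00.
# Replace /usr/local/hbase with your actual HBase installation directory.
0 3 * * * echo "major_compact 't1'" | /usr/local/hbase/bin/hbase shell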

After a compaction, if a new store file is greater than a certain size (based on the property hbase.hregion.max.filesize), the region is split into two new regions.

MapReduce tasks run close to the data that they process. This data locality is possible because large files in the distributed file system (DFS) are broken into smaller blocks. Each block maps to a task that is run to process the contained data. Larger block sizes mean fewer map tasks, because the number of mappers is driven by the number of blocks that need processing. Hadoop knows where blocks are located and runs each map task directly on the node that hosts its block. For HBase, the mappers run on the nodes that host the regions they scan. Therefore, keeping a Region's data local to the Region Server that serves it is essential for real locality.

Modifying the configuration to affect compaction

Disable automatic major compactions by updating the hbase-site.xml file:

<property> 
  <name>hbase.hregion.majorcompaction</name> 
  <value>0</value> 
</property>
Decrease the maximum region size by updating the hbase-site.xml file. The value 1073741824 bytes equals 1 GB:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
Because Db2 Big SQL determines the number of mappers based on the number of regions, smaller regions result in more mappers.
You can trigger major compactions by running one of the commands in Table 1 from the HBase shell. Open the HBase shell:
cd $HBASE_HOME
./bin/hbase shell
Table 1. Triggering major compactions

Compact all regions in a table called t1:
  hbase> major_compact 't1'

Compact an entire region called r1:
  hbase> major_compact 'r1'

Compact a single column family named c1 within a region called r1:
  hbase> major_compact 'r1', 'c1'

Compact a single column family named c1 within a table called t1:
  hbase> major_compact 't1', 'c1'
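After you issue one of these commands, you can check whether the compaction is still running. On recent HBase versions, the shell provides a compaction_state command for this purpose; it returns NONE, MINOR, MAJOR, or MAJOR_AND_MINOR. For example, for table t1:

hbase> compaction_state 't1'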