Properties that can optimize HBase table scans

You can optimize scans of HBase tables by tuning the following properties.

hbase.client.scanner.caching

This parameter, which is set in the hbase-site.xml file, is the number of rows that are fetched when next is called on a scanner and the request cannot be served from local client memory. Higher caching values make scans faster but use more memory, and individual calls to next can take longer when the cache is empty.

This value is important if the data in your HBase table is read without HBase row key based lookups, or when your queries perform wide range scans (wide row key lookups).

You can modify this parameter in these ways:

As a property in hbase-site.xml:

<property>
  <name>hbase.client.scanner.caching</name>
  <value>10000</value>
</property>

From a SET command:

SET HADOOP PROPERTY hbase.client.scanner.caching=10000

Region size tuning (hbase.hregion.max.filesize)

HBase region size is important because, when HBase data is accessed, each MapReduce split corresponds to one region.

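If the region size is too large, the MapReduce jobs do not have sufficient parallelism. If the region size is too small, many cycles are wasted in creating and tearing down the MapReduce tasks. For example, assuming a hypothetical 100 GB table, 10 GB regions yield about 10 splits for a full scan, whereas 1 GB regions yield about 100.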

An optimal size depends on the workload and the cluster configuration.
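
As a sketch, the region size might be set in hbase-site.xml in the same way as the scanner caching property. The value shown here (10737418240 bytes, or 10 GB) is only an illustrative assumption, not a recommendation. Choose a value that suits your own workload and cluster:

<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- Illustrative value only: 10 GB expressed in bytes -->
  <value>10737418240</value>
</property>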

Region size is important if the data in your HBase table is read without HBase row key based lookups, or when your queries perform wide range scans (wide row key lookups).