Best practices for loading data in HBase

The HBase record table is where you persist the record data. When you create the record table, several settings can affect performance.

The considerations are:

Number of regions per HBase record table
HBase row key design - generating UUIDs
Number of column families
Splitting
Durability

If any of these elements are not set properly, you might see poor performance when loading the data. You might also see hotspotting, which is when only one node shows high CPU utilization even though the cluster has multiple nodes. Hotspotting can occur when loading data and also when running Big Match batch jobs like Derive, Compare, or Link.

Number of regions per HBase record table

HBase tables are divided into regions. At a high level, a region is where HBase keeps a table's data, stored in HFiles. When you create an HBase table, you can either define the number of regions explicitly or let HBase decide internally. It is suggested that you define the number of regions explicitly. Explicit definition can improve the performance and stability of the other steps. It also helps the Hadoop cluster in general because it can assist in preventing issues like hotspotting. Size the number of regions from the number of logical CPUs on the region servers. An easy way to find the CPU count of a server is through the lscpu command.

To calculate the number of regions, use the following formula: (number of region servers - 1) * number of logical CPUs per region server.

For example, if you have one master node and two region servers, each with 8 logical CPUs, then create (2 - 1) * 8 = 8 regions:

create 'testin', {NAME => 'pf', COMPRESSION => 'SNAPPY'}, {NUMREGIONS => 8, SPLITALGO => 'HexStringSplit', DURABILITY => 'ASYNC_WAL'}
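
The sizing formula can also be expressed as a small helper, which is convenient when the cluster layout changes. This is a minimal sketch: the class and method names (RegionCount, regions) are hypothetical and not part of HBase, and the check that rejects fewer than two region servers is an assumption, added because the formula would otherwise yield zero regions.

```java
// Hypothetical helper, not part of HBase: applies the sizing formula
// (number of region servers - 1) * number of logical CPUs per region server.
final class RegionCount {

  static int regions(int regionServers, int logicalCpusPerServer) {
    // Assumption: with fewer than 2 region servers the formula gives 0 regions,
    // so such input is rejected here.
    if (regionServers < 2 || logicalCpusPerServer < 1) {
      throw new IllegalArgumentException(
          "Need at least 2 region servers and 1 logical CPU per server");
    }
    return (regionServers - 1) * logicalCpusPerServer;
  }

  public static void main(String... args) {
    // Two region servers with 8 logical CPUs each, as in the example above.
    System.out.println(regions(2, 8)); // prints 8
  }
}
```

The result is the value to pass as NUMREGIONS in the create command.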

HBase row key design - generating UUIDs

To create UUIDs that avoid hotspotting, follow the HBase row key design patterns that are outlined in the row key design link in the related links at the end of this topic.

A hotspot makes one node do all the work, which results in a long loading process.

One of the most important best practices for generating UUIDs is to generate them randomly for each record. A randomly generated UUID has 128 bits in total: 2 bits indicate the RFC 4122 variant, 4 bits indicate the version (such as 0100 = randomly generated), and the remaining 122 bits are random. The chance of any given two UUIDs having the same value is therefore 1 in 2^122, or roughly 1 in 5.3 x 10^36.
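
The bit layout can be checked directly with java.util.UUID. The following is a minimal sketch; the class name UuidBits and the method randomValueCount are hypothetical names used only for illustration.

```java
import java.math.BigInteger;
import java.util.UUID;

// Hypothetical illustration class: inspects the fixed bits of a random UUID
// and computes how many values the 122 random bits can take.
final class UuidBits {

  // Number of distinct values that 122 random bits can take: 2^122.
  static BigInteger randomValueCount() {
    return BigInteger.valueOf(2).pow(122);
  }

  public static void main(String... args) {
    UUID id = UUID.randomUUID();
    System.out.println("version = " + id.version()); // 4 for randomly generated UUIDs
    System.out.println("variant = " + id.variant()); // 2 for the RFC 4122 layout
    System.out.println("2^122   = " + randomValueCount()); // roughly 5.3 x 10^36
  }
}
```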

It is suggested that you use the Java built-in utility (java.util.UUID) on your Linux environment to generate random UUIDs, which can then be merged with the data files so that each record gets one UUID. Another approach is to use the record number, converted into bytes, as the reference from which the UUID is generated. This second method also ensures that the UUIDs are unique because each one is tied to a distinct reference point (the record number). After the UUID files are generated, you can use the paste command to combine the UUID files with the data file, and the split -l command to split one large file into multiple smaller files.

An example of calling the UUID utility is as follows:

[code]
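import java.util.UUID;

// Prints the requested number of random UUIDs, one per line.
// Usage: java GenerateUUID <count>
public class GenerateUUID {

  public static void main(String... args) {
    if (args.length != 1) {
      System.err.println("Usage: java GenerateUUID <count>");
      System.exit(1);
    }
    long count = Long.parseLong(args[0]);
    for (long i = 0L; i < count; i++) {
      log(UUID.randomUUID());
    }
  }

  // Writes one value per line to standard output.
  private static void log(Object aObject) {
    System.out.println(String.valueOf(aObject));
  }
}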
[code]
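
The second approach, deriving the UUID from the record number, might look like the following sketch. It assumes that UUID.nameUUIDFromBytes is an acceptable way to turn the record number bytes into a UUID; this topic does not name a specific method, so treat that choice as an assumption. The class name RecordNumberUuid and the default count of 5 are also hypothetical. Note that nameUUIDFromBytes produces name-based (version 3) UUIDs rather than random (version 4) ones.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical illustration class: derives one UUID per record number.
final class RecordNumberUuid {

  // Converts the record number to 8 bytes and derives a name-based
  // (version 3, MD5) UUID from them. The same record number always
  // yields the same UUID.
  static UUID fromRecordNumber(long recordNumber) {
    byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(recordNumber).array();
    return UUID.nameUUIDFromBytes(bytes);
  }

  public static void main(String... args) {
    // Assumption: default to 5 records when no count is given.
    long count = args.length > 0 ? Long.parseLong(args[0]) : 5L;
    for (long recordNumber = 1L; recordNumber <= count; recordNumber++) {
      System.out.println(fromRecordNumber(recordNumber));
    }
  }
}
```

Because the UUID is derived from an MD5 hash of the record number, consecutive record numbers do not produce consecutive row keys, and rerunning the generator reproduces the same UUIDs.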

Number of column families

There are various approaches that you can take to define the number of column families. The most common approach is to have one column family that holds all the columns. Alternatively, you can put all your inquiry columns in one column family and all the non-essential columns in another column family.

Splitting

Splitting is another way of improving performance in HBase. To manually define splitting, you must know your data well. If you do not, then you can split by using a default splitting approach that is provided by HBase called “HexStringSplit”. HexStringSplit divides the row key space into evenly sized ranges, one per region, and is intended for row keys that are hexadecimal strings. Again, if you are not splitting properly, then you might face hotspotting. You can also use another approach that is called Bucketing to split data. See the Apache HBase related link for more details.
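
To give a feel for how a hex-based pre-split can spread row keys across regions, the following sketch computes evenly spaced boundaries over an 8-digit hexadecimal key space. It is an illustration only and not HBase's implementation: the class name HexSplitSketch, the method splitPoints, and the 00000000 to ffffffff range are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not HBase code: evenly spaced split points
// over an assumed 8-digit hexadecimal key space.
final class HexSplitSketch {

  // Returns the (numRegions - 1) boundaries that divide the key space
  // 00000000..ffffffff into numRegions ranges of (almost) equal size.
  static List<String> splitPoints(int numRegions) {
    if (numRegions < 2) {
      throw new IllegalArgumentException("Need at least 2 regions to split");
    }
    long keySpace = 0xFFFFFFFFL;
    long step = keySpace / numRegions;
    List<String> points = new ArrayList<>();
    for (int i = 1; i < numRegions; i++) {
      points.add(String.format("%08x", step * i));
    }
    return points;
  }

  public static void main(String... args) {
    // Eight regions give seven boundaries; the first one is 1fffffff.
    splitPoints(8).forEach(System.out::println);
  }
}
```

Row keys that begin with uniformly distributed hexadecimal digits, such as randomly generated UUIDs, fall into these ranges in roughly equal numbers, which is what keeps a single region from receiving most of the writes.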

Durability

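HBase has the concept of Write Ahead Logs (WAL). Each change is recorded in the WAL and written to the MemStore before it is flushed to StoreFiles, so that changes that are still held in memory can be recovered if a region server fails. By default, a write operation waits for its WAL entry to be written, and the write operation fails if the WAL write fails. If you set the DURABILITY property to asynchronous mode (ASYNC_WAL), the write operation does not wait for the log to be synchronized, which improves performance. The trade-off is that the most recent changes can be lost if a region server fails before the log is synchronized. For loading data, keep it in ASYNC mode.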