Best practices for loading data in HBase
The HBase record table is where you persist the record data. When you create the record table, several design choices can affect load performance:
- Number of regions per HBase record table
- HBase row key design - generating UUIDs
- Number of column families
- Splitting
- Durability
Number of regions per HBase record table
HBase tables are divided into regions. At a high level, a region is where HBase keeps a contiguous range of a table's rows, stored in HFiles. When you create an HBase table, you can either define the number of regions explicitly or let HBase decide when to split the table internally. Define the number of regions explicitly. Doing so can improve the performance and stability of the later loading steps, and it helps the Hadoop cluster in general because it spreads writes across the region servers from the start, which helps prevent issues such as hotspotting. Size the number of regions according to the number of logical CPUs that the region servers have. To find the number of logical CPUs on a server, run the lscpu command.
To calculate the number of regions, use the following formula: (number of region servers - 1) * number of logical CPUs. For example, the following HBase shell command creates a table with 8 regions:
[code]
create 'testin', {NAME => 'pf', COMPRESSION => 'SNAPPY'}, {NUMREGIONS => 8, SPLITALGO => 'HexStringSplit', DURABILITY => 'ASYNC_WAL'}
[code]
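The formula can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and it assumes that "number of logical CPUs" means the count that lscpu reports on each region server. With 3 region servers and 4 logical CPUs each, it yields the 8 regions used in the create command above.

```java
// Sketch only: the class and method names are illustrative, not part of HBase.
final class RegionCountSketch {

    // Applies the formula from the text: (region servers - 1) * logical CPUs.
    // Assumption: logicalCpusPerServer is the value lscpu reports on one region server.
    static int suggestedRegions(int regionServers, int logicalCpusPerServer) {
        if (regionServers < 2 || logicalCpusPerServer < 1) {
            throw new IllegalArgumentException(
                "need at least 2 region servers and 1 logical CPU");
        }
        return (regionServers - 1) * logicalCpusPerServer;
    }

    public static void main(String[] args) {
        // Assumed example cluster: 3 region servers with 4 logical CPUs each.
        System.out.println(suggestedRegions(3, 4)); // prints 8
    }
}
```

Rejecting fewer than 2 region servers is a choice made for this sketch, because the formula would otherwise return zero regions.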
HBase row key design - generating UUIDs
To create UUIDs that avoid hotspotting, follow the HBase row key design patterns that are outlined in the row key design link in the related links at the end of this topic.
A hotspot concentrates the writes on a single node, so that one node does all the work and the loading process takes much longer.
One of the most important best practices for generating UUIDs is to generate them randomly for each record. A random UUID has 122 random bits (out of a total of 128 bits, 2 bits indicate the RFC 4122 variant and 4 bits indicate the version, such as 0100 = randomly generated), so the chance that any two given UUIDs have the same value is 1 in 2^122, or roughly 1 in 5.3 x 10^36.
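The bit layout can be checked with the standard java.util.UUID class. The following is a minimal sketch (the class and method names are hypothetical) that prints the version and variant of a random UUID and computes the size of the 122-bit random space.

```java
import java.math.BigInteger;
import java.util.UUID;

// Sketch only: the class and method names are illustrative.
final class UuidBitsSketch {

    // Number of distinct values that 122 random bits can take: 2^122.
    static BigInteger randomUuidSpace() {
        return BigInteger.ONE.shiftLeft(122);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id.version()); // 4 = randomly generated
        System.out.println(id.variant()); // 2 = RFC 4122 layout
        System.out.println(randomUuidSpace()); // a 37-digit number, about 5.3 x 10^36
    }
}
```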
One suggested approach is to use the built-in Java utility (java.util.UUID) in your Linux environment to generate random UUIDs, which can then be merged with the data files so that each record gets one UUID. Another approach is to use the record number (converted into bytes) as the reference from which the UUID is generated. Because each record number is distinct, this second method also yields a distinct UUID for each record. After the UUID files are generated, you can use the paste command to combine the UUID files with the data file, and the split -l command to split one large file into multiple smaller files. The following program shows the first approach:
[code]
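import java.util.UUID;

/** Prints the requested number of random (version 4) UUIDs, one per line. */
public class GenerateUUID {
    public static void main(String... args) {
        if (args.length != 1) {
            System.err.println("Usage: java GenerateUUID <count>");
            System.exit(1);
        }
        long count = Long.parseLong(args[0]);
        for (long i = 0L; i < count; i++) {
            // Each call returns an independent, randomly generated UUID.
            log(UUID.randomUUID());
        }
    }

    private static void log(Object aObject) {
        System.out.println(String.valueOf(aObject));
    }
}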
[code]
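The second approach, deriving the UUID from the record number, might look like the following sketch. It assumes that the record number fits in a long and uses the standard UUID.nameUUIDFromBytes method, which returns a name-based (version 3, MD5) UUID. The class and method names are hypothetical. The same record number always produces the same UUID; a collision between two different record numbers is possible in principle but extremely unlikely.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Sketch only: the class and method names are illustrative.
final class RecordNumberUuidSketch {

    // Converts the record number to 8 bytes and derives a name-based UUID from them.
    static UUID fromRecordNumber(long recordNumber) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(recordNumber).array();
        return UUID.nameUUIDFromBytes(bytes);
    }

    public static void main(String[] args) {
        // Assumed example: print UUIDs for record numbers 1 to 3.
        for (long record = 1; record <= 3; record++) {
            System.out.println(record + "\t" + fromRecordNumber(record));
        }
    }
}
```

Because the UUID is a hash of the record number, consecutive record numbers do not produce consecutive row keys, which is the property that helps to avoid hotspotting.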
Number of column families
There are various approaches that you can take to define the number of column families. The most common approach is to have one column family that contains all the columns. Alternatively, you can put all the columns that you query (the inquiry columns) in one column family and all the non-essential columns in another column family.
Splitting
Splitting is another way of improving performance in HBase. To define the split points manually, you must know your data well. If you do not, you can use a split algorithm that is provided by HBase called "HexStringSplit". HexStringSplit divides the row key space into evenly sized ranges and is intended for row keys that are uniformly distributed hexadecimal strings, such as randomly generated UUIDs. If the table is not split properly, you might again face hotspotting. You can also use another approach, called bucketing, to split the data. See the Apache HBase related link for more details.
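To make the idea of evenly sized hexadecimal ranges concrete, the following sketch computes split points for a given number of regions. It is not the HBase implementation: it assumes an 8-digit hexadecimal key prefix (a key space of 2^32 values), the class and method names are hypothetical, and the exact boundaries that HexStringSplit chooses may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: illustrates even splitting of a hex key space, not HBase's own code.
final class HexSplitSketch {

    // Returns regions - 1 split points that divide the 8-digit hex key space evenly.
    static List<String> splitPoints(int regions) {
        if (regions < 2) {
            throw new IllegalArgumentException("need at least 2 regions");
        }
        long keySpace = 1L << 32; // assumed key space: 00000000 to ffffffff
        List<String> points = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            long boundary = keySpace * i / regions;
            points.add(String.format("%08x", boundary));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(splitPoints(4)); // prints [40000000, 80000000, c0000000]
    }
}
```

With randomly generated UUIDs as row keys, each of the resulting ranges receives roughly the same share of the rows.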
Durability
HBase uses Write Ahead Logs (WALs). Before any change in data is committed to StoreFiles, it is written to a MemStore, and the WAL records the change so that it can be recovered. If a change cannot be written to the WAL, the write operation to HBase fails. If you set the durability property to asynchronous mode (ASYNC_WAL), the log write is not synchronized with the write operation, which improves performance, although the most recent changes can be lost if a region server fails before the log is written. Hence, for loading data, keep durability in ASYNC_WAL mode.