Hints for designing HBase tables

Column families

An HBase table is made of column families which are the logical and physical grouping of columns. The columns in one family are stored separately from the columns in another family. If you have data that is not often queried, assign that data to a separate column family.

The column family and column qualifier names are repeated for each row. Therefore, keep the names as short as possible to reduce the amount of data that HBase stores and reads. For example, use f:q instead of mycolumnfamily:mycolumnqualifier.

Because column families are stored in separate HFiles, keep the number of column families as small as possible. You also want to reduce the number of column families to reduce the frequency of MemStore flushes, and the frequency of compactions. And, by using the smallest number of column families possible, you can improve the LOAD time and reduce disk consumption.

The following example illustrates a best practice for defining column families:


CREATE HBASE TABLE ORDERS 
 (O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS CHAR(1),
  O_TOTALPRICE FLOAT, O_ORDERDATE string, O_ORDERPRIORITY CHAR(15),
  O_CLERK CHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
 )
COLUMN MAPPING (  
  KEY MAPPED BY
    (O_ORDERDATE,O_CUSTKEY,O_ORDERKEY),  
    f:d MAPPED BY
     (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,
      O_SHIPPRIORITY,O_COMMENT,O_CLERK)
  )...;

Encoding

You map data in an HBase table to a logical SQL table. This is called column mapping. Column mapping assigns a schema and types to the data, which ties the data to a relational model. As part of the column mapping, you decide how to interpret the bytes and convert them into the types that are assigned through the mapping, which is called encoding. The default encoding is STRING. You can also define BINARY encoding or USE SERDE encoding, which defines a custom serialization/deserialization process.

STRING encoding: STRING encoding is portable and is easy to use when you want to map existing data. Delimited data is a very common form of storing data and it can be easily mapped using Db2 Big SQL string encoding. However, parsing strings is expensive and queries with data encoded as strings can be slow. Because HBase uses a lexicographic ordering, numeric data is not collated correctly.
BINARY encoding: BINARY encoding is sortable. So numeric data collate properly. It handles separators internally and avoids issues of separators that exist within data by escaping the data. Queries on data encoded as BINARY have faster response times.
custom encoding: By using a serializer/deserializer (SerDe), you can read in data, and write it back out to the HBase table in any custom format. However, be aware that by using custom encoding, you cannot use some optimization techniques. Indexes cannot be defined on custom encoded columns. Predicates on custom encoded columns cannot be evaluated by the region server.

Column mapping: one to many

You can map a single HBase entity (row key or a column) to multiple SQL columns. This kind of mapping is called one to many. HBase stores a lot of information for each value. If you stored each SQL column individually, the required storage space would be very large. For the best performance, put columns that are queried together into a single dense HBase column to help reduce the data that is fetched from HBase. A dense column is a single HBase column that maps to multiple SQL columns.

For example, if table T1 has nine columns with 1.5 million rows. and you use a one-to-one mapping, this table requires 522 MB of storage. However, if table T1 uses a one-to-many mapping, the table requires only 276 MB of storage.

Queries on columns that are mapped one-to-many perform better than one-to-one queries. A one-to-many model maximizes predicate pushdown. When querying back, the query returns the entire key for each value. For example, assume the following query:

SELECT * FROM T1;

Table 1. Comparing column mapping models
Column mapping	Response time
One to one mapping	1m 31 s
One to many mapping	1m 2s

For the best performance, put the column that you use the most in a query predicate as the leading part of the column, since only SQL predicates that are mapped to a leading HBase column are pushed down. If you have other columns on which you usually query, define indexes on those other columns. An index can be defined on a single column or a set of columns.

Design your rowkeys in such a way that query predicates can be pushed down. Only predicates on the leading rowkeys are pushed down.

Do not use monotonically increasing values or time series data as row keys. For example, in the ORDERS table, do not use the O_ORDERKEY as the leading rowkey. This design can cause a single region to be a hot spot since all of the key values are in close proximity to each other and will belong to the same region.