Hints for designing HBase tables
You can optimize your HBase queries with the proper design layout.
Column families
An HBase table is made of column families which are the logical and physical grouping of columns. The columns in one family are stored separately from the columns in another family. If you have data that is not often queried, assign that data to a separate column family.
The column family and column qualifier names are repeated for each row. Therefore, keep the names as short as possible to reduce the amount of data that HBase stores and reads. For example, use f:q instead of mycolumnfamily:mycolumnqualifier.
Because column families are stored in separate HFiles, keep the number of column families as small as possible. You also want to reduce the number of column families to reduce the frequency of MemStore flushes, and the frequency of compactions. And, by using the smallest number of column families possible, you can improve the LOAD time and reduce disk consumption.
CREATE HBASE TABLE ORDERS
(O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS CHAR(1),
O_TOTALPRICE FLOAT, O_ORDERDATE string, O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
COLUMN MAPPING (
KEY MAPPED BY
(O_ORDERDATE,O_CUSTKEY,O_ORDERKEY),
f:d MAPPED BY
(O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,
O_SHIPPRIORITY,O_COMMENT,O_CLERK)
)...;Encoding
- STRING encoding
-
STRING encoding is portable and is easy to use when you want to map existing data. Delimited data is a very common form of storing data and it can be easily mapped using Db2 Big SQL string encoding. However, parsing strings is expensive and queries with data encoded as strings can be slow. Because HBase uses a lexicographic ordering, numeric data is not collated correctly.
- BINARY encoding
-
BINARY encoding is sortable. So numeric data collate properly. It handles separators internally and avoids issues of separators that exist within data by escaping the data. Queries on data encoded as BINARY have faster response times.
- custom encoding
-
By using a serializer/deserializer (SerDe), you can read in data, and write it back out to the HBase table in any custom format. However, be aware that by using custom encoding, you cannot use some optimization techniques. Indexes cannot be defined on custom encoded columns. Predicates on custom encoded columns cannot be evaluated by the region server.
Column mapping: one to many
You can map a single HBase entity (row key or a column) to multiple SQL columns. This kind of mapping is called one to many. HBase stores a lot of information for each value. If you stored each SQL column individually, the required storage space would be very large. For the best performance, put columns that are queried together into a single dense HBase column to help reduce the data that is fetched from HBase. A dense column is a single HBase column that maps to multiple SQL columns.
For example, if table T1 has nine columns with 1.5 million rows. and you use a one-to-one mapping, this table requires 522 MB of storage. However, if table T1 uses a one-to-many mapping, the table requires only 276 MB of storage.
SELECT * FROM T1;
| Column mapping | Response time |
|---|---|
| One to one mapping | 1m 31 s |
| One to many mapping | 1m 2s |
Design your rowkeys in such a way that query predicates can be pushed down. Only predicates on the leading rowkeys are pushed down.
Do not use monotonically increasing values or time series data as row keys. For example, in the ORDERS table, do not use the O_ORDERKEY as the leading rowkey. This design can cause a single region to be a hot spot since all of the key values are in close proximity to each other and will belong to the same region.