Hints for designing indexes

In HBase, the row key provides the same data retrieval benefits as a primary index. So, when you create a secondary index, use elements that are different from the row key.

Secondary indexes allow you to have a secondary way to read an HBase table. They provide a way to efficiently access records by means of some piece of information other than the primary key. Secondary indexes require additional cluster space and processing because the act of creating a secondary index requires both space and processing cycles to update.

The LOAD USING command can be used to bulk load data into HBase tables in the distributed file system (DFS) by using the appropriate target table properties. However, any indexes that exist prior to the LOAD USING process are not updated. Consider dropping your indexes before you issue the LOAD USING command. You can then re-create the indexes after the LOAD USING command is complete. This action ensures that map reduce is used to populate the indexes. The benefits of using the parameter hbase.load.method on your LOAD USING command exceed any inconvenience of dropping and then recreating the index.

When you insert a large amount of data into a table, Db2 Big SQL automatically uses a feature called bulk insert. This feature lets Db2 Big SQL populate tables and any indexes on those tables more efficiently.

You cannot create an index on the following elements:
  • An external table.
  • Custom encoded columns.
  • Columns that are the leading part or parts of a row key.
You can use a SPLIT option on the CREATE INDEX statement to decide when and how to split the indexed data for better performance. There are several options for splitting the data. The following examples show two of the options:
Using an algorithm
The following statement creates an index by using the SPLIT algorithm. The algorithm uses the input value of 4 to evenly divide the space of possible keys by four regions.
 
CREATE INDEX idx1 ON TABLE hbtableX (C1) AS 'hbase' 
  SPLITALGO [ 'UniformSplit', 4];
The UniformSplit algorithm is useful when the keys are nearly uniform random bytes. The results use less space, but they might not be as readable.
Using an explicit SPLIT
In the following example, the splits are defined explicitly. Each value is a key on which to split. The value can be a string or a numeric value.

CREATE INDEX idx2 ON TABLE hbtable(name,score) as 'hbase'
  SPLITS [5, 10, 15, 20];
This statement results in five splits; the number of explicit splits plus one.
You can create an index on a row key column, but consider the following restrictions:
  • You cannot create an index on a column that is mapped entirely to the key of an HBase table.
  • You cannot create an index on columns that are defined in the same order as the row key in the COLUMN MAPPING clause. Since these columns form the leading part of the row key, the data retrieval benefits are already available.
The following examples, which include a CREATE TABLE statement and a CREATE INDEX statement, illustrates some of the suggested strategies of creating indexes:
CREATE HBASE TABLE rkindex 
  (k0 varchar, k1 int, k2 string, k3 int, 
   c0 string, c1 int, c2 int)
COLUMN MAPPING (
  KEY MAPPED BY (k0, k1, k2, k3), 
  cf:cq MAPPED BY (c0, c1, c2));
If the index is made of several columns that are mapped by a row key, but the row key mapping columns are in a different order from the index columns, then you can create an index on the row key columns.
CREATE INDEX ix_k0k2k3k1 ON TABLE rkindex (k0,k2,k3,k1) AS 'HBASE';
In the next example of CREATE INDEX, based on the same table as in the previous example, the index is valid because the row key column that is used in the index is only a small part of the set of columns that are mapped to the row key. Therefore, the row key column k0 used alone in the index is not enough to determine uniqueness.
CREATE INDEX ix_k0c0c1c2 ON TABLE rkindex (k0,c0,c1,c2) AS 'HBASE';