HBase basics

HBase is a database management system that is modeled as a sparse, distributed, persistent, multidimensional sorted map and that runs on top of a distributed file system (HDFS). Learn about the architecture and terminology.

HBase is built on top of the distributed file system (DFS), which can store large files. HBase provides fast record lookups and updates for large tables.

Figure. Diagram of the HBase architecture.

The ZooKeeper cluster acts as a coordination service for the entire HBase cluster.

HBase contains two primary services:
Master server
The master server coordinates the cluster and performs administrative operations, such as assigning regions and balancing the load.
Region server
The region servers do the real work. Each region server handles a subset of the data of each table. Clients talk to region servers to access data in HBase.
Regions

Region servers manage a set of regions.

An HBase table is made up of a set of regions. Regions are the basic unit of work in HBase, and a region is what the MapReduce framework uses as an input split. Each region contains store objects that correspond to column families, with one store instance for each column family. Store objects create one or more StoreFiles, which are wrappers around the actual storage file, called the HFile.

The region also contains a MemStore for each store, which is in-memory storage that is used as a write cache. Rows are written to the MemStore, and the data in the MemStore is kept sorted. When the MemStore becomes full, it is persisted to an HFile on disk.
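The write path described above can be illustrated with a small sketch. This is a toy model in Python, not HBase code: the class name MemStoreSketch, the flush_threshold parameter, and the use of a cell count as the "full" condition are all assumptions made for illustration (HBase itself decides when to flush based on memory size).

```python
import bisect


class MemStoreSketch:
    """Toy write cache: keeps keys sorted and flushes to a 'file' when full."""

    def __init__(self, flush_threshold):
        # Hypothetical limit, counted in cells for simplicity.
        self.flush_threshold = flush_threshold
        self.keys = []           # row keys, kept in sorted order
        self.values = {}         # row key -> value
        self.flushed_files = []  # each flush produces one immutable, sorted "HFile"

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)  # insert while preserving sort order
        self.values[key] = value
        if len(self.keys) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.keys:
            return
        # Persist a sorted snapshot, then start over with an empty cache.
        snapshot = tuple((k, self.values[k]) for k in self.keys)
        self.flushed_files.append(snapshot)
        self.keys = []
        self.values = {}


store = MemStoreSketch(flush_threshold=3)
for row in ["22222", "11111", "33333", "00001"]:
    store.put(row, "v-" + row)
```

Rows arrive out of order, but the flushed snapshot is sorted by row key, which is what allows the file on disk to be indexed for fast lookups.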

To improve performance, it is important to get an even distribution of data among regions, which ensures the best parallelism in map tasks.
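The effect of row key choice on distribution can be sketched as follows. The example assumes that each region covers a contiguous range of row keys that begins at a start key. The boundary values in REGION_START_KEYS and the helper names are hypothetical and chosen only for illustration.

```python
import bisect
from collections import Counter

# Hypothetical region boundaries: each region starts at one of these row keys.
# The first region starts at the empty key, so every row key maps to a region.
REGION_START_KEYS = ["", "25000", "50000", "75000"]


def region_index(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def rows_per_region(row_keys):
    """Count how many of the given row keys land in each region."""
    counts = Counter(region_index(k) for k in row_keys)
    return [counts.get(i, 0) for i in range(len(REGION_START_KEYS))]


# 100 consecutive keys: 00000 .. 00099
sequential_keys = [f"{n:05d}" for n in range(100)]
# 100 keys spread over the key space: 00000, 01000, ..., 99000
spread_keys = [f"{n:05d}" for n in range(0, 100000, 1000)]

skewed = rows_per_region(sequential_keys)
balanced = rows_per_region(spread_keys)
```

In this sketch, the consecutive keys all land in the first region, so a job that creates one map task per region would have one busy task and three idle ones. The spread keys give each region the same share of rows.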

HFiles

HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.

HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups.

Everything in HBase is stored as bytes; there are no data types. There is no fixed column schema because each row in HBase can have a different set of columns.
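Because values are untyped bytes, the client decides how to encode and decode them. The following Python sketch shows one possible convention. The helper names and the choice of an 8-byte big-endian integer encoding are assumptions for this example. The column names and values are taken from Table 1.

```python
import struct


def encode_int(value):
    """Encode an integer as 8 big-endian bytes (a convention chosen for this sketch)."""
    return struct.pack(">q", value)


def decode_int(raw):
    """Decode 8 big-endian bytes back to an integer."""
    return struct.unpack(">q", raw)[0]


def encode_str(value):
    """Encode text as UTF-8 bytes."""
    return value.encode("utf-8")


# A row is modeled as a mapping from column-name bytes to value bytes.
# Nothing in the stored data records whether a value was an integer or a string.
row_11111 = {
    b"cfd:cqnm": encode_str("name1"),
    b"cfd:cqv": encode_int(1111),
    b"cfi:cqdesc": encode_str("desc11111"),
}

# A second row with a different set of columns: no cfi:cqdesc entry exists at all.
row_22222 = {
    b"cfd:cqnm": encode_str("name2"),
    b"cfd:cqv": encode_int(2013),
}
```

The reader of cfd:cqv has to know that the bytes represent an integer. Decoding them with the wrong convention would not raise an error in the store; it would just produce a meaningless value.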

An HBase table contains column families, which are the logical and physical grouping of columns. Each column family contains column qualifiers, which are the columns themselves, and each column holds timestamped versions of its value. Columns exist only when they are inserted, which makes HBase a sparse database.

All column members of the same column family have the same column family prefix. The column family and column qualifier names are case sensitive. Therefore, cf:cq is not the same as cf:CQ.

Each column value is identified by a key. The row key is the implicit primary key, and rows are sorted by the row key.

An HBase column can be specified by using the following format:
hbase-family:hbase-column-name
For example, cfd:cqnm and cfd:cqv in the following table are both members of the cfd column family.
Table 1. Logical representation of an HBase table

Row key    Value
11111      cfd: {‘cqnm’: ‘name1’, ‘cqv’: 1111}
           cfi: {‘cqdesc’: ‘desc11111’}
22222      cfd: {‘cqnm’: ‘name2’, ‘cqv’: 2013 @ ts = 2013, ‘cqv’: 2012 @ ts = 2012}

HFile (column family = cfd)

11111  cfd  cqnm    name1      @ ts1
11111  cfd  cqv     1111       @ ts1
22222  cfd  cqnm    name2      @ ts1
22222  cfd  cqv     2013       @ ts1
22222  cfd  cqv     2012       @ ts2

HFile (column family = cfi)

11111  cfi  cqdesc  desc11111  @ ts1

The table that is shown in Table 1 has two column families: cfd and cfi. The cfd family has two columns, with the qualifiers cqnm and cqv. The cfi column family has one column: cqdesc. A column in HBase is referenced by using family:qualifier.
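The mapping from the logical table to one file per column family can be sketched in Python. This is an illustrative model only: the function names are hypothetical, and the integer timestamps are stand-ins in which a larger number means a newer version. Within each family, cells are ordered by row key, then qualifier, then newest version first.

```python
from collections import defaultdict

# Cells from Table 1 as (row key, "family:qualifier", value, timestamp).
# They are listed out of order on purpose; timestamps are illustrative integers.
CELLS = [
    ("22222", "cfd:cqv", 2012, 2012),
    ("11111", "cfi:cqdesc", "desc11111", 1),
    ("22222", "cfd:cqnm", "name2", 1),
    ("11111", "cfd:cqv", 1111, 1),
    ("22222", "cfd:cqv", 2013, 2013),
    ("11111", "cfd:cqnm", "name1", 1),
]


def split_column(column):
    """Split 'family:qualifier' at the first colon; case is preserved."""
    family, qualifier = column.split(":", 1)
    return family, qualifier


def group_by_family(cells):
    """Build one sorted list of (row, qualifier, timestamp, value) per column family."""
    files = defaultdict(list)
    for row, column, value, ts in cells:
        family, qualifier = split_column(column)
        files[family].append((row, qualifier, ts, value))
    for entries in files.values():
        # Row key ascending, qualifier ascending, newest timestamp first.
        entries.sort(key=lambda e: (e[0], e[1], -e[2]))
    return dict(files)


def latest(files, row, column):
    """Return the newest value for a cell, or None if the cell was never inserted."""
    family, qualifier = split_column(column)
    for r, q, ts, value in files.get(family, []):
        if r == row and q == qualifier:
            return value  # first match is the newest because of the sort order
    return None


files = group_by_family(CELLS)
```

In this model, latest(files, "22222", "cfd:cqv") returns the newer of the two versions, while a lookup of cfd:CQV or of cfi:cqdesc for row 22222 returns None, because qualifiers are case sensitive and absent columns are simply not stored.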