HBase basics
HBase is a sparse, distributed, persistent multidimensional sorted map database management system that runs on top of a distributed file system (HDFS). Learn about the architecture and terminology.
HBase is built on top of the distributed file system (DFS), which can store large files.
HBase provides fast record lookups and updates for large tables.
The ZooKeeper cluster acts as a coordination service for the entire HBase cluster.
- Master server
- The master server coordinates the cluster and performs administrative operations, such as assigning regions and balancing load.
- Region server
- The region servers do the real work. A subset of the data of each table is handled by each region server. Clients talk to region servers to access data in HBase.
- Regions
- Region servers manage a set of regions.
An HBase table is made up of a set of regions. Regions are the basic unit of work in HBase, and the MapReduce framework uses a region as its input split. Each region contains Store objects that correspond to column families, with one Store instance for each column family. Store objects create one or more StoreFiles, which are wrappers around the actual storage file, called the HFile.
The region also contains a MemStore, which is in-memory storage that is used as a write cache. Rows are written to the MemStore, and the data in the MemStore is kept sorted. When the MemStore becomes full, it is persisted to an HFile on disk.
To improve performance, distribute data evenly among regions. An even distribution gives the best parallelism across map tasks.
- HFiles
- HFiles are the physical representation of data in HBase. Clients do not read HFiles directly but go through region servers to get to the data.
HBase internally puts the data in indexed StoreFiles that exist on HDFS for high-speed lookups.
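The MemStore behavior described under Regions (sorted in-memory writes that are persisted to an HFile when the MemStore is full) can be sketched roughly as follows. This is a toy model and not HBase code: the `MemStore` class, its `put` and `flush` methods, and the count-based `flush_threshold` are all hypothetical names chosen for illustration, and real HBase decides fullness by size in bytes rather than by cell count.

```python
import bisect


class MemStore:
    """Toy write cache: keeps cells sorted and flushes them when it is 'full'."""

    def __init__(self, flush_threshold=3):
        # Assumption: a cell count stands in for the real size-based limit.
        self.flush_threshold = flush_threshold
        self.cells = []   # sorted list of (row, qualifier, value) tuples
        self.hfiles = []  # each flush appends one immutable, sorted "HFile"

    def put(self, row, qualifier, value):
        # insort keeps the in-memory data ordered on every write.
        bisect.insort(self.cells, (row, qualifier, value))
        if len(self.cells) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the sorted cells as one file and start with an empty cache.
        if self.cells:
            self.hfiles.append(tuple(self.cells))
            self.cells = []


store = MemStore(flush_threshold=3)
store.put(b"22222", b"cqnm", b"name2")
store.put(b"11111", b"cqv", b"1111")
store.put(b"11111", b"cqnm", b"name1")  # third cell triggers a flush
store.put(b"22222", b"cqv", b"2013")    # stays in memory until the next flush
```

Because the cells are already sorted in memory, each flushed file is sorted as well, even though the writes arrived out of order.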
HBase stores everything as bytes and has no data types. There is no fixed column schema, because each row in HBase can have a different set of columns.
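As an illustration of untyped, sparse storage, the following minimal sketch models a table as nested maps of raw bytes. It is only a mental model and does not use the HBase client API: the `table` dictionary and the `get` and `scan` helpers are hypothetical names, and the sample values are borrowed from the example table in this topic.

```python
# Logical view: row key -> column family -> qualifier -> value, all raw bytes.
table = {
    b"11111": {
        b"cfd": {b"cqnm": b"name1", b"cqv": b"1111"},
        b"cfi": {b"cqdesc": b"desc11111"},
    },
    # This row has no cfi cells at all; absent cells take up no space.
    b"22222": {
        b"cfd": {b"cqnm": b"name2", b"cqv": b"2013"},
    },
}


def get(row, family, qualifier):
    """Return the stored bytes, or None when the cell does not exist."""
    return table.get(row, {}).get(family, {}).get(qualifier)


def scan():
    """Yield (row key, row data) pairs in sorted row-key order."""
    for row in sorted(table):
        yield row, table[row]


# The store only sees bytes; the client decides how to interpret them.
year = int(get(b"22222", b"cfd", b"cqv").decode("ascii"))
missing = get(b"22222", b"cfi", b"cqdesc")  # None: this row has no such column
```

The type conversion happens entirely on the client side, which is why two rows can hold different columns without any schema change.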
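A column is referenced by its column family and its column name (qualifier), in the following form:
hbase-family:hbase-column-name
For example, cfd:cqnm and cfd:cqv in the following table (Table 1) are both members of the cfd column family.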
Row key | Value |
---|---|
11111 | cfd: {‘cqnm’: ‘name1’, ‘cqv’: 1111} cfi: {‘cqdesc’: ‘desc11111’} |
22222 | cfd: {‘cqnm’: ‘name2’, ‘cqv’: 2013 @ ts = 2013, ‘cqv’: 2012 @ ts = 2012} |
- HFile, column family=cfd
  - 11111 cfd cqnm name1 @ ts1
  - 11111 cfd cqv 1111 @ ts1
  - 22222 cfd cqnm name2 @ ts1
  - 22222 cfd cqv 2013 @ ts1
  - 22222 cfd cqv 2012 @ ts2
- HFile, column family=cfi
  - 11111 cfi cqdesc desc11111 @ ts1
The table shown in Table 1 has two column families: cfd and cfi. The cfd family has two columns, with qualifiers cqnm and cqv. The cfi family has one column, cqdesc. A column in HBase is referenced by using family:qualifier.
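To make the relationship between the logical table and the per-family HFiles concrete, here is a small sketch that groups cells by column family and sorts them the way the HFile listings show. The `layout_by_family` function is a hypothetical helper, not part of HBase, and the integer timestamps 2013 and 2012 are assumed stand-ins for ts1 and ts2. Versions of the same column are ordered newest first.

```python
from collections import defaultdict

# (row, family, qualifier, value, timestamp) cells in arbitrary arrival order.
# Assumption: 2013 stands in for ts1 and 2012 for ts2.
cells = [
    (b"22222", b"cfd", b"cqv", b"2012", 2012),
    (b"11111", b"cfi", b"cqdesc", b"desc11111", 2013),
    (b"11111", b"cfd", b"cqv", b"1111", 2013),
    (b"22222", b"cfd", b"cqnm", b"name2", 2013),
    (b"11111", b"cfd", b"cqnm", b"name1", 2013),
    (b"22222", b"cfd", b"cqv", b"2013", 2013),
]


def layout_by_family(cells):
    """Group cells per column family, sorted by row, qualifier, newest first."""
    files = defaultdict(list)
    for row, family, qualifier, value, ts in cells:
        files[family].append((row, qualifier, ts, value))
    for entries in files.values():
        # Negating the timestamp puts the most recent version first.
        entries.sort(key=lambda c: (c[0], c[1], -c[2]))
    return dict(files)


hfiles = layout_by_family(cells)
for family, entries in sorted(hfiles.items()):
    print(family.decode(), [(r.decode(), q.decode(), v.decode(), ts)
                            for r, q, ts, v in entries])
```

Each column family ends up in its own sorted file, so a read that touches only cfi never has to look at cfd data.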