HBase is a scalable, distributed, column-oriented dynamic-schema database for structured data. It manages large-scale data (petabytes and beyond) distributed across thousands of commodity servers reliably and efficiently. Modeled after Google's Bigtable database, HBase is a subproject of the Apache Software Foundation's Hadoop project.
Note: At the time of this writing, the latest release of HBase was V0.19.3. The information in this article applies to that release.
HBase data is modeled as a multidimensional map in which values (the table cells) are indexed by four keys:
value = Map(TableName, RowKey, ColumnKey, Timestamp)
where:
- TableName is a string
- RowKey and ColumnKey are binary values (Java type byte[])
- Timestamp is a 64-bit integer (Java type long)
- value is an uninterpreted array of bytes (Java™ type byte[])
Binary data is encoded in Base64 for transmission over the wire.
The row key is the primary key of the table and is typically a string. Rows are sorted by row key in lexicographic order.
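Because row keys are compared as byte strings rather than as numbers, fixed-width keys such as 000001 keep the stored order aligned with the numeric order. The following is a minimal Python sketch of that ordering; the sample keys are made up for illustration.

```python
# Row keys are compared byte by byte, not numerically.
unpadded = [b"1", b"2", b"10"]
padded = [b"000001", b"000002", b"000010"]

# Python compares bytes objects lexicographically, which mimics the
# row ordering described above.
sorted_unpadded = sorted(unpadded)
sorted_padded = sorted(padded)

print(sorted_unpadded)  # b"10" sorts before b"2"
print(sorted_padded)    # zero padding preserves numeric order
```

With unpadded keys, a scan over a key range would visit rows in an order that surprises anyone who thinks of the keys as numbers.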
Information stored in a table is structured into column families, which you
can think of as categories. Each column family can have an arbitrary number
of members identified by labels (or qualifiers). The
column key is the concatenation of the family name,
the : symbol, and the label. For example, for family
info and a member date, the
column key is info:date.
An HBase table schema defines the column families, but applications can create new members on the fly when they insert a row into the table. Within a column family, different rows in the table can have a different number of members. In other words, HBase supports a dynamic schema model.
Table 1 shows a simple example of an HBase table called
Persons with two column families: name and
contact.
Table 1. Persons table with two column families
| Row key | Timestamp | Column family name | Column family contact |
|---|---|---|---|
| 000001 | t3 | | contact:http research.google.com/people/jeff/ |
| | t2 | name:first Jeffrey | |
| | t1 | name:last Dean | |
| 000002 | t5 | name:first Gabriel | |
| | t4 | name:last Mateescu | |
An empty cell has no value associated with the cell's key. In Table 1,
the cell associated with the key (000002, contact:http, t4)
is empty. Empty cells are not stored in HBase; reading an empty cell is similar
to extracting from a map a value by a nonexistent key. HBase tables are thus
suited for sparse rows.
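One way to picture this model is as nested maps in which only non-empty cells exist. The Python sketch below is a toy in-memory stand-in, not HBase code: the class name SparseTable and its methods are hypothetical, and the integers 1 through 5 stand in for the symbolic timestamps t1 through t5 of Table 1.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for one table: {row_key: {column_key: {timestamp: value}}}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, label, timestamp, value):
        column_key = f"{family}:{label}"  # family name, ':' symbol, label
        self._rows[row_key][column_key][timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        # dict.get() never triggers the defaultdict factory, so reading an
        # empty cell does not create one.
        versions = self._rows.get(row_key, {}).get(column_key, {})
        if timestamp is None:
            # No timestamp given: return the most recent version, if any.
            return versions[max(versions)] if versions else None
        return versions.get(timestamp)

    def column_keys(self, row_key):
        return sorted(self._rows.get(row_key, {}))


# The contents of Table 1.
persons = SparseTable()
persons.put("000001", "contact", "http", 3, b"research.google.com/people/jeff/")
persons.put("000001", "name", "first", 2, b"Jeffrey")
persons.put("000001", "name", "last", 1, b"Dean")
persons.put("000002", "name", "first", 5, b"Gabriel")
persons.put("000002", "name", "last", 4, b"Mateescu")

print(persons.get("000001", "name:first"))       # a stored cell
print(persons.get("000002", "contact:http", 4))  # an empty cell
print(persons.column_keys("000002"))             # row 000002 has no contact member
```

Reading the empty cell returns None and leaves the table unchanged, which is the map-lookup behavior the text describes.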
For any row, you can access only one member of one column family at a time (unlike a relational database, where one query can access cells from multiple columns in a row). You can view the members of a column family in a row as subrows.
Tables are decomposed into table regions, the equivalent of Bigtable tablets. A region contains the rows in a certain range of row keys. Decomposing a table into regions is a key mechanism for efficiently handling large tables.
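The idea of range partitioning can be sketched in a few lines of Python. This is not how HBase itself locates regions; the region names and start keys are hypothetical, and the sketch only shows how a sorted list of start keys maps a row key to the region that holds it.

```python
from bisect import bisect_right

# Hypothetical start keys of three regions. Each region holds the rows from
# its start key up to (but not including) the start key of the next region.
region_start_keys = [b"", b"000100", b"000200"]
region_names = ["region-0", "region-1", "region-2"]


def find_region(row_key: bytes) -> str:
    """Return the name of the region whose key range contains row_key."""
    index = bisect_right(region_start_keys, row_key) - 1
    return region_names[index]


print(find_region(b"000001"))
print(find_region(b"000100"))
print(find_region(b"000250"))
```

Because rows are sorted by key, a binary search over the start keys is enough to route a request, no matter how many rows each region contains.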
Consider the problem of representing information about scientific articles. The articles and their authors are resources. In the Resource Description Framework (RDF), knowledge about resources is represented by assertions (see Resources), where an assertion is a triple:
(subject, predicate, object).
The predicate defines a relation between the subject (the resource the assertion is referring to) and the object. For example, you could represent the statement, "The article has the title Bigtable," as:
(The article, has title, Bigtable).
The subject of an assertion is a resource that must be identified by a URI. The predicate must be defined in a vocabulary, so it is associated with the namespace URI of the vocabulary. The object of an assertion can be identified by a URI or by a literal; if it is the subject of another assertion, it must be identified by a URI.
You express knowledge about an article as assertions and represent the assertions in RDF/XML.
RDF description of a journal article
Consider the journal article about Bigtable:
F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A Distributed Storage System for Structured Data", ACM Trans. Comput. Syst. 26 (2), June 2008.
You can describe this article through a set of statements, such as:
- The Bigtable journal article has the title "Bigtable: A Distributed Storage System for Structured Data."
- The Bigtable journal article is written by Fay Chang.
where:
- The Bigtable journal article is the subject in both statements.
- has the title is the predicate in the first statement.
- "Bigtable: A Distributed Storage System for Structured Data" is the object in the first statement.
- is written by is the predicate in the second statement.
- Fay Chang is the object in the second statement.
To represent these statements in RDF/XML, you must determine the URI of the subject and the names of the predicates in an appropriate namespace. For the article URI, use the Digital Object Identifier (DOI) URI of the Bigtable paper, http://doi.acm.org/10.1145/1365815.1365816, and reformulate the first statement as follows:
The article with the URI "http://doi.acm.org/10.1145/1365815.1365816" has the title "Bigtable: A Distributed Storage System for Structured Data."
For the predicates, use terms from the vocabularies in Table 2.
Table 2. Namespace URIs and prefixes
| Prefix | Namespace URI | Description |
|---|---|---|
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF vocabulary terms |
| dc | http://purl.org/dc/elements/1.1/ | Dublin Core elements |
| dcterms | http://purl.org/dc/terms/ | Dublin Core terms |
| eprint | http://purl.org/eprint/terms/ | Eprints terms |
| foaf | http://xmlns.com/foaf/0.1/ | FOAF vocabulary terms |
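As an illustration of how a prefixed term relates to its namespace URI, the Python sketch below expands names such as dc:title with the prefixes from Table 2 and builds the title statement as a triple. The names Triple and expand are hypothetical helpers introduced only for this example.

```python
from typing import NamedTuple

# Prefixes and namespace URIs copied from Table 2.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "eprint": "http://purl.org/eprint/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def expand(prefixed_name: str) -> str:
    """Turn 'dc:title' into the full URI of the term."""
    prefix, local_name = prefixed_name.split(":", 1)
    return NAMESPACES[prefix] + local_name


ARTICLE_URI = "http://doi.acm.org/10.1145/1365815.1365816"
title_statement = Triple(
    ARTICLE_URI,
    expand("dc:title"),
    "Bigtable: A Distributed Storage System for Structured Data",
)
print(title_statement.predicate)
```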
Based on these vocabularies, you can formulate a statement about the Bigtable journal article in RDF/XML as shown in Listing 1.
Listing 1. Article description in RDF/XML
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/" >
<rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816">
<dc:title>Bigtable: A Distributed Storage System for Structured Data</dc:title>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/>
</rdf:Description>
</rdf:RDF>
Here, the object for the <dc:title> predicate
is a literal, while the object for <dc:creator>
is a URI.
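To see the two kinds of objects side by side, you can pull the triples back out of the description with Python's standard xml.etree.ElementTree module. This is a sketch that handles only the flat shape used in Listing 1 (one rdf:Description with simple property elements), not RDF/XML in general; the function name extract_triples is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# The description from Listing 1.
LISTING_1 = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816">
    <dc:title>Bigtable: A Distributed Storage System for Structured Data</dc:title>
    <dc:creator
      rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for each property element."""
    triples = []
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}localname"; joining the
            # two parts gives the full URI of the predicate.
            predicate = prop.tag[1:].replace("}", "")
            resource = prop.get(f"{{{RDF}}}resource")
            # A URI object sits in rdf:resource, a literal in the element text.
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(LISTING_1)
for triple in triples:
    print(triple)
```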
The complete description of the article, with information about all the authors,
the publication date, and the publisher's name is shown in
Listing 2, where the <dc:type>
predicate defines the article type.
Listing 2. Full RDF/XML article description
$ cat rdf/Bigtable.xml
<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/" >
<rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816">
<dc:title>Bigtable: A Distributed Storage System for Structured Data</title>
<dc:type>http://purl.org/eprint/type/JournalArticle</dc:type>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Jeffrey_Dean"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Sanjay_Ghemawat"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Wilson_Hsieh"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Deborah_Wallach"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Mike_Burrows"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Tushar_Chandra"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Andrew_Fikes"/>
<dc:creator
rdf:resource="http://purl.org/sweb/Authors/google/research/Robert_Gruber"/>
<dc:publisher>ACM, New York, NY, USA</dc:publisher>
<dcterms:issued>2008-06</dcterms:issued>
<dc:subject>distributed databases</dc:subject>
<dcterms:isPartOf rdf:resource="urn:ISSN:0734-2071" />
<dcterms:bibliographicCitation>
ACM Trans. Comput. Syst., 26 (2) 26 pages (2008)
</dcterms:bibliographicCitation>
</rdf:Description>
</rdf:RDF>
You use the Friend of a Friend (FOAF) vocabulary for describing the authors. Listing 3 shows assertions about the subject Jeffrey Dean.
Listing 3. RDF/XML description of an author
$ cat rdf/Jeffrey.xml
<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:eprint="http://purl.org/eprint/terms/" >
<rdf:Description
rdf:about="http://purl.org/sweb/Authors/google/research/Jeffrey_Dean">
<foaf:Person>
<foaf:givenname>Jeffrey</foaf:givenname>
<foaf:family_name>Dean</foaf:family_name>
<foaf:homepage rdf:resource="http://research.google.com/people/jeff/" />
</foaf:Person>
<eprint:affiliatedInstitution>Google, Inc.</eprint:affiliatedInstitution>
</rdf:Description>
</rdf:RDF>
Modeling a semantic Web with HBase
The first step in modeling a semantic Web with HBase is mapping RDF to HBase tables. To store the RDF/XML descriptions of the articles and authors, you create two tables called articles and authors. Design these tables keeping in mind that you want to support queries about the affiliation of the authors.
The row keys of the Articles table are derived from the DOI of the article. For
example, the row key of the Bigtable paper is
doi.org.acm_10.1145_1365815_1365816. The schema
has three column families:
- info for information such as the title, publication name, and date of publication
- authors for the URIs of the authors
- affiliations for the authors' affiliations
The row keys of the authors table are derived from the URIs of the authors. For
example, the URI of Jeffrey Dean (see Listing 3) is converted
to the key google_research_Jeffrey_Dean. The schema
has two column families: info for storing information
about the authors, such as the name and home page, and affiliations
for the author's affiliation history.
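The article gives the resulting author key but not the exact conversion rule. One plausible rule, assumed here purely for illustration, is to drop the common URI prefix and replace the remaining slashes with underscores. The function name author_row_key is hypothetical.

```python
# Assumed common prefix of the author URIs used in Listings 1 through 3.
AUTHORS_PREFIX = "http://purl.org/sweb/Authors/"


def author_row_key(uri: str) -> str:
    """Derive a row key from an author URI (assumed rule, see the lead-in)."""
    if not uri.startswith(AUTHORS_PREFIX):
        raise ValueError(f"unexpected author URI: {uri}")
    return uri[len(AUTHORS_PREFIX):].replace("/", "_")


jeffrey_key = author_row_key(
    "http://purl.org/sweb/Authors/google/research/Jeffrey_Dean"
)
print(jeffrey_key)
```

Whatever rule you choose, it must be deterministic, because the same key is needed again whenever another table refers to the author.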
One way to interact with HBase is via the REST API. Create the tables with the HTTP requests shown in Listing 4.
Listing 4. Creating the articles and authors tables
$ cat tables/Articles.xml
<?xml version="1.0" encoding="UTF-8" ?>
<table>
<name>Articles</name>
<columnfamilies>
<columnfamily>
<name>info</name>
</columnfamily>
<columnfamily>
<name>authors</name>
</columnfamily>
<columnfamily>
<name>affiliations</name>
</columnfamily>
</columnfamilies>
</table>
$ cat tables/Authors.xml
<?xml version="1.0" encoding="UTF-8" ?>
<table>
<name>Authors</name>
<columnfamilies>
<columnfamily>
<name>info</name>
</columnfamily>
<columnfamily>
<name>affiliations</name>
</columnfamily>
</columnfamilies>
</table>
$ cat tables/Articles.xml | curl -X POST -T - http://localhost:60010/api/
$ cat tables/Authors.xml | curl -X POST -T - http://localhost:60010/api/
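If you would rather generate the schema documents than write them by hand, a small helper can build the same XML structure. The Python sketch below only produces the document text and sends nothing to the server; the function name table_schema_xml is hypothetical, and the element names are taken from Listing 4.

```python
import xml.etree.ElementTree as ET


def table_schema_xml(table_name, families):
    """Build a <table> document with one <columnfamily> per family name."""
    table = ET.Element("table")
    ET.SubElement(table, "name").text = table_name
    container = ET.SubElement(table, "columnfamilies")
    for family in families:
        column_family = ET.SubElement(container, "columnfamily")
        ET.SubElement(column_family, "name").text = family
    return ET.tostring(table, encoding="unicode")


articles_xml = table_schema_xml("Articles", ["info", "authors", "affiliations"])
authors_xml = table_schema_xml("Authors", ["info", "affiliations"])
print(articles_xml)
```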
Inserting data into the tables
Populate the authors and articles tables with the information assembled in the section "RDF description of a journal article." Listing 5 shows how to insert information about the author Jeffrey Dean into the authors table (values are in Base64).
Listing 5. Populating the authors table
$ more rows/Jeffrey_Dean_info.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
<name>info:name</name>
<value>SmVmZnJleSBEZWFuCg==</value>
</column>
$ more rows/Jeffrey_Dean_affiliation.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
<name>affiliations:</name>
<value>R29vZ2xlCg==</value>
</column>
$ cat rows/Jeffrey_Dean_info.xml | \
curl -X POST -T - http://localhost:60010/api/Authors/row/google_research_Jeffrey_Dean
$ cat rows/Jeffrey_Dean_affiliation.xml | \
curl -X POST -T - http://localhost:60010/api/Authors/row/google_research_Jeffrey_Dean
Perform a POST request for each insertion and
omit the timestamps because HBase assigns default timestamps.
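The Base64 values in these listings can be produced with Python's standard base64 module. The sketch below builds a column document of the same shape; the function name column_xml is hypothetical. Note that the values shown in the listings decode to text that ends with a newline character, so the sketch encodes the newline as well in order to reproduce them exactly.

```python
import base64
import xml.etree.ElementTree as ET


def column_xml(column_key: str, value: bytes) -> str:
    """Build a <column> document whose value is Base64-encoded."""
    column = ET.Element("column")
    ET.SubElement(column, "name").text = column_key
    ET.SubElement(column, "value").text = base64.b64encode(value).decode("ascii")
    return ET.tostring(column, encoding="unicode")


# The trailing "\n" matches the encoded values used in the listings.
jeffrey_info = column_xml("info:name", b"Jeffrey Dean\n")
google_value = base64.b64encode(b"Google\n").decode("ascii")
print(jeffrey_info)
print(google_value)
```

If the stored value should not contain the newline, encode the bytes without it; the resulting Base64 string will then differ from the one in the listings.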
Listing 6 shows how you populate the articles table with
information about the Bigtable journal article.
Listing 6. Populating the articles table
$ more rows/Bigtable_info.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
<name>info:title</name>
<value>QmlndGFibGU6IEEgRGlzdHJpYnV0ZWQgU3RvcmFnZSBTeXN0ZW0gZm9yIFN0cnVjdHVyZWQgRGF0
YQo=
</value>
</column>
$ more rows/Bigtable_author_2.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
<name>authors:2</name>
<value>SmVmZnJleSBEZWFuCg==</value>
</column>
$ cat rows/Bigtable_info.xml | curl -X POST \
-T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581
$ cat rows/Bigtable_author_2.xml | curl -X POST \
-T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581
Once you have populated the authors and articles tables, you can perform batch
operations on the data. Here, you're looking for the affiliation of the authors of
the Bigtable paper. A batch process will build this information in the column
family affiliations of the articles table by scanning
the tables and extracting information from the same column family in the authors
table. For the Bigtable paper, the action of the batch process is equivalent to the
code shown in Listing 7, where all the authors of the article
are affiliated with Google.
Listing 7. Adding processed information to the articles table
$ more rows/Bigtable_affiliations.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
<name>affiliations:</name>
<value>R29vZ2xlCg==</value>
</column>
$ cat rows/Bigtable_affiliations.xml | curl -X POST \
-T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581
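The batch step itself can be sketched in Python over plain dictionaries that stand in for the two tables. This is an illustration of the de-normalization idea, not a MapReduce job or HBase client code. It assumes that each authors: cell of an article holds the row key of the author in the authors table, and the function name add_affiliations and the sample rows are hypothetical.

```python
# Toy stand-ins for the tables: {row_key: {column_key: value}}.
authors = {
    "google_research_Fay_Chang": {
        "info:name": "Fay Chang",
        "affiliations:": "Google",
    },
    "google_research_Jeffrey_Dean": {
        "info:name": "Jeffrey Dean",
        "affiliations:": "Google",
    },
}
articles = {
    "doi.org.acm_10.1145_1365815_1365816": {
        "info:title": "Bigtable: A Distributed Storage System for Structured Data",
        "authors:1": "google_research_Fay_Chang",
        "authors:2": "google_research_Jeffrey_Dean",
    },
}


def add_affiliations(articles, authors):
    """Copy the distinct affiliations of each article's authors into the article row."""
    for columns in articles.values():
        author_keys = [v for k, v in columns.items() if k.startswith("authors:")]
        found = set()
        for author_key in author_keys:
            for column_key, value in authors.get(author_key, {}).items():
                if column_key.startswith("affiliations:"):
                    found.add(value)
        if found:
            # Join distinct affiliations into one cell (a choice made for this sketch).
            columns["affiliations:"] = ", ".join(sorted(found))


add_affiliations(articles, authors)
print(articles["doi.org.acm_10.1145_1365815_1365816"]["affiliations:"])
```

After the batch step, a single read of the article row answers the affiliation question without touching the authors table.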
To get the affiliation of the authors of the Bigtable paper, perform a
GET on the affiliations:
column of the Bigtable paper row.
Listing 8. Extracting information from the articles table
$ curl -X GET http://localhost:60010/api/Articles/row/\
doi.org.acm_10.1145_1365815_136581?column=affiliations:
<?xml version="1.0" encoding="UTF-8" ?>
<row>
<count>
1
</count>
<column>
<name>
YWZmaWxpYXRpb25zOg==
</name>
<value>
R29vZ2xlCg==
</value>
<timestamp>
1250049020108
</timestamp>
</column>
Decoding the Base64 values gives affiliations: for
YWZmaWxpYXRpb25zOg== and
Google for R29vZ2xlCg==.
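The decoding can be done programmatically. The Python sketch below parses a response of the shape shown in Listing 8 and decodes the name and value of each column. The closing row tag, which the listing does not show, is added here only so that the sample parses; the function name decode_row is hypothetical. The decoded value is Google followed by a newline character.

```python
import base64
import xml.etree.ElementTree as ET

# Sample response modeled on Listing 8 (closing </row> added for parsing).
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<row>
 <count> 1 </count>
 <column>
  <name> YWZmaWxpYXRpb25zOg== </name>
  <value> R29vZ2xlCg== </value>
  <timestamp> 1250049020108 </timestamp>
 </column>
</row>"""


def decode_row(xml_bytes):
    """Return (column_key, value, timestamp) for every column in the row."""
    cells = []
    for column in ET.fromstring(xml_bytes).findall("column"):
        name = base64.b64decode(column.findtext("name").strip()).decode("utf-8")
        value = base64.b64decode(column.findtext("value").strip()).decode("utf-8")
        timestamp = int(column.findtext("timestamp").strip())
        cells.append((name, value, timestamp))
    return cells


cells = decode_row(RESPONSE)
for name, value, timestamp in cells:
    print(name, value.rstrip("\n"), timestamp)
```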
You can use HBase to answer more difficult questions than the one discussed in this
article. For example, you can process a "Jeopardy!" clue such as, "The
author of this journal article about Bigtable has worked for the World Health
Organization," and figure out the response. To this end, create a table called
Keywords whose row keys are the keywords of the journal articles. This
table includes a column family journal_articles used
to store the DOIs of the articles in which the keyword occurs.
After storing the keywords for the Bigtable paper, the keywords table will include a
row with the row key Bigtable, the column key
journal_articles:1, and the cell value
doi.org.acm_10.1145_1365815_1365816. To answer
the quiz:
1. Look up the keyword Bigtable in the Keywords table, and get the DOI of the article.
2. Look up the article in the articles table by the DOI obtained in step 1, and get the URIs of the authors.
3. Look up the authors in the authors table by URI, and extract the affiliations, both present and past.
4. From the result set obtained in step 3, select the row whose column family affiliations has a member World Health Organization.
The response is, "Who is Jeffrey Dean?"
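The chain of lookups can be sketched in Python with dictionaries standing in for the three tables. The table contents are illustrative sample data, and the helper names members and authors_with_affiliation are hypothetical. The sketch assumes that cells in one table store the row keys of the table they point to.

```python
ARTICLE_KEY = "doi.org.acm_10.1145_1365815_1365816"

# Toy stand-ins for the tables: {row_key: {column_key: value}}.
keywords = {"Bigtable": {"journal_articles:1": ARTICLE_KEY}}
articles = {
    ARTICLE_KEY: {
        "authors:1": "google_research_Fay_Chang",
        "authors:2": "google_research_Jeffrey_Dean",
    },
}
authors = {
    "google_research_Fay_Chang": {"affiliations:1": "Google"},
    "google_research_Jeffrey_Dean": {
        "affiliations:1": "Google",
        "affiliations:2": "World Health Organization",
    },
}


def members(row, family):
    """Return the values of all members of one column family in a row."""
    prefix = family + ":"
    return [value for key, value in sorted(row.items()) if key.startswith(prefix)]


def authors_with_affiliation(keyword, institution):
    matches = []
    # Step 1: keyword -> article keys.
    for article_key in members(keywords.get(keyword, {}), "journal_articles"):
        # Step 2: article -> author keys.
        for author_key in members(articles.get(article_key, {}), "authors"):
            # Step 3: author -> affiliations, present and past.
            affiliations = members(authors.get(author_key, {}), "affiliations")
            # Step 4: keep authors with the requested affiliation.
            if institution in affiliations:
                matches.append(author_key)
    return matches


result = authors_with_affiliation("Bigtable", "World Health Organization")
print(result)
```

Each step is a lookup by row key, so the whole question is answered without a join.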
HBase and Bigtable promote a new way of thinking about the data-processing pipeline. The SQL-like process of extracting and transforming the data in a monolithic system is replaced with a divide-and-conquer approach, in which the database supports Create, Read, Update, Delete (CRUD) operations, while complex transformations are delegated to external components designed for parallel processing. For example, parallel processing could be done with MapReduce applications, and high throughput could be obtained with a distributed and replicated file system, such as the Hadoop Distributed File System (HDFS) or the Google File System.
In the absence of table joins, de-normalization is often used in HBase to keep related information in one table. This article illustrated the approach.
Although HBase still needs performance improvement, it shows real promise of becoming a mainstream solution.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample XML | os-hbase-source_hbase.zip | 11KB | HTTP |
Resources
Learn
- Check out Bigtable: A Distributed Storage System for Structured Data (F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, ACM, 2008) to read about the simple data model that Bigtable provides.
- "An introduction to RDF" offers a solid overview of RDF/XML.
- IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated, tested and pre-configured, no-charge download for anyone who wants to experiment with and learn about Hadoop.
- Find free courses on Hadoop fundamentals, stream computing, text analytics, and more at Big Data University.
- Check out the Apache HBase Project.
- Find answers to your HBase questions at the HBase Wiki.
- Visit the HBase architecture wiki to learn more about the underlying HBase architecture.
- Check out the HBase REST wiki for answers to your HBase API questions.
- "HBasics: An Introduction to Hadoop HBase" offers a good introduction to HBase from the HBase User Group Meeting.
- Check out the FOAF Vocabulary Specification 0.91 to learn more about FOAF.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Follow developerWorks on Twitter.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that turns large, complex volumes of data into insight by combining Apache Hadoop with unique technologies and capabilities from IBM.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.
Gabriel Mateescu builds distributed systems for managing and executing data- and compute-intensive applications, such as bioinformatics and high-energy physics simulations. He has worked on several projects, including the LHC Computing Grid, the Distributed European Infrastructure for Supercomputing Applications (DEISA), GridCanada, and NIH MIDAS. You can reach Gabriel at gabriel@vt.edu.



