Finding the way through the semantic Web with HBase

Use HBase and Bigtable to create and mine a semantic Web

The Hadoop Database (HBase) is well suited for creating a semantic Web and for extracting existing or computed knowledge. Learn how to represent RDF/XML assertions in an HBase database for scientific articles, and discover how HBase and Bigtable are promoting a new approach to storing and processing data.

Gabriel Mateescu, Senior Developer, Virginia Bioinformatics Institute at Virginia Tech

Gabriel Mateescu builds distributed systems for managing and executing data- and compute-intensive applications, such as bioinformatics and high-energy physics simulations. He has worked on several projects, including the LHC Computing Grid, the Distributed European Infrastructure for Supercomputing Applications (DEISA), GridCanada, and NIH MIDAS. You can reach Gabriel at gabriel@vt.edu.



15 September 2009


HBase is a scalable, distributed, column-oriented dynamic-schema database for structured data. It manages large-scale data (petabytes and beyond) distributed across thousands of commodity servers reliably and efficiently. Modeled after Google's Bigtable database, HBase is a subproject of the Apache Software Foundation's Hadoop project.

Frequently used acronyms

  • API: Application programming interface
  • DOI: Digital Object Identifier
  • HTTP: Hypertext Transfer Protocol
  • REST: Representational State Transfer
  • SQL: Structured Query Language
  • URI: Uniform Resource Identifier
  • XML: Extensible Markup Language

Note: At the time of this writing, the latest release of HBase was V0.19.3. The information in this article applies to that release.

The HBase data model

HBase data is modeled as a multidimensional map in which values (the table cells) are indexed by four keys:

value = Map(TableName, RowKey, ColumnKey, Timestamp)

where:

  • TableName is a string
  • RowKey and ColumnKey are binary values (Java type byte[])
  • Timestamp is a 64-bit integer (Java type long)
  • value is an uninterpreted array of bytes (Java™ type byte[])

Binary data is encoded in Base64 for transmission over the wire.

The row key is the primary key of the table and is typically a string. Rows are sorted by row key in lexicographic order.
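To make the four-key map concrete, here is a minimal Python sketch that models it with an ordinary dictionary. It is an in-memory illustration only, not the HBase client API: the helper names put, get_latest, and sorted_row_keys are hypothetical, and the sample rows (including row 000010) are invented for the demonstration.

```python
# In-memory sketch of the HBase data model (not the HBase client API).
# Cells are indexed by (table, row key, column key, timestamp).
cells = {}

def put(table, row, column, timestamp, value):
    """Store one cell version; the value is an uninterpreted byte string."""
    cells[(table, row, column, timestamp)] = value

def get_latest(table, row, column):
    """Return the value with the highest timestamp, or None if the cell is empty."""
    versions = [(ts, v) for (t, r, c, ts), v in cells.items()
                if (t, r, c) == (table, row, column)]
    return max(versions)[1] if versions else None

def sorted_row_keys(table):
    """Row keys of a table in byte-wise lexicographic order."""
    return sorted({r for (t, r, _, _) in cells if t == table})

# Hypothetical sample data.
put("Persons", b"000002", b"name:first", 5, b"Gabriel")
put("Persons", b"000001", b"name:first", 2, b"Jeffrey")
put("Persons", b"000010", b"name:first", 6, b"Fay")
put("Persons", b"000001", b"name:first", 7, b"Jeff")  # newer version of the same cell

print(get_latest("Persons", b"000001", b"name:first"))
print(sorted_row_keys("Persons"))
```

Because the order is lexicographic rather than numeric, an unpadded key such as 10 would sort before 2; zero-padded keys like those in the sketch keep numeric identifiers in numeric order.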

Information stored in a table is structured into column families, which you can think of as categories. Each column family can have an arbitrary number of members identified by labels (or qualifiers). The column key is the concatenation of the family name, the : symbol, and the label. For example, for family info and a member date, the column key is info:date.

An HBase table schema defines the column families, but applications can create new members on the fly when they insert a row into the table. For a column family, different rows in the table can have a different number of members. In other words, HBase supports a dynamic schema model.
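The following Python sketch illustrates how column keys are formed and how new members can appear without a schema change. The schema set and the helper names column_key and split_column_key are assumptions made for this illustration; they are not part of HBase.

```python
# Sketch: a column key is "<family>:<label>"; only the families are fixed by the schema.
SCHEMA_FAMILIES = {"info"}  # hypothetical schema with a single column family

def column_key(family, label=""):
    """Join a family name and a label into a column key."""
    if family not in SCHEMA_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    return f"{family}:{label}"

def split_column_key(key):
    """Split a column key at the first ':' into (family, label)."""
    family, _, label = key.partition(":")
    return family, label

row = {}
row[column_key("info", "date")] = b"2008-06"
row[column_key("info", "title")] = b"Bigtable"  # new member, no schema change needed

print(sorted(row))
print(split_column_key("info:date"))
```

An empty label is allowed in this sketch, which yields a key such as info: that refers to the family without a qualifier.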

An HBase table example

Table 1 shows a simple example of an HBase table called Persons with two column families: name and contact.

Table 1. Persons table with two column families
Row key   Timestamp   Column family name    Column family contact
000001    t3                                contact:http research.google.com/people/jeff/
          t2          name:first Jeffrey
          t1          name:last Dean
000002    t5          name:first Gabriel
          t4          name:last Mateescu

An empty cell has no value associated with the cell's key. In Table 1, the cell associated with the key (000002, contact:http, t4) is empty. Empty cells are not stored in HBase; reading an empty cell is similar to extracting from a map a value by a nonexistent key. HBase tables are thus suited for sparse rows.
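As a small illustration, the Python sketch below stores the cells of Table 1 in a dictionary keyed by (row key, column key, timestamp). The helper names read_cell and members are hypothetical.

```python
# Table 1 as a sparse map: only the five non-empty cells are stored.
persons = {
    ("000001", "contact:http", "t3"): "research.google.com/people/jeff/",
    ("000001", "name:first", "t2"): "Jeffrey",
    ("000001", "name:last", "t1"): "Dean",
    ("000002", "name:first", "t5"): "Gabriel",
    ("000002", "name:last", "t4"): "Mateescu",
}

def read_cell(row, column, timestamp):
    """Return the stored value, or None for an empty cell."""
    return persons.get((row, column, timestamp))

def members(row):
    """Column keys that actually exist in the given row."""
    return sorted({c for (r, c, _) in persons if r == row})

print(read_cell("000002", "contact:http", "t4"))  # empty cell, nothing stored
print(members("000001"))
print(members("000002"))
```

Row 000002 simply has no contact:http entry, so it occupies no storage for that column.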

For any row, you can access only one member of one column family at a time (unlike a relational database, where one query can access cells from multiple columns in a row). You can view the members of a column family in a row as subrows.

Tables are decomposed into regions, the HBase equivalent of Bigtable tablets. A region contains the rows in a certain range of row keys. Decomposing a table into regions is a key mechanism for efficiently handling large tables.


RDF and the semantic Web

Consider the problem of representing information about scientific articles. The articles and their authors are resources. In the Resource Description Framework (RDF), knowledge about resources is represented by assertions (see Resources), where an assertion is a triple:

(subject, predicate, object).

The predicate defines a relation between the subject (the resource the assertion is referring to) and the object. For example, you could represent the statement, "The article has the title Bigtable," as:

(The article, has title, Bigtable).

The subject of an assertion is a resource that must be identified by a URI. The predicate must be defined in a vocabulary, so it is associated with the namespace URI of the vocabulary. The object of an assertion can be identified by a URI or by a literal; if it is the subject of another assertion, it must be identified by a URI.

You express knowledge about an article as assertions and represent the assertions in RDF/XML.

RDF description of a journal article

Consider the journal article about Bigtable:

F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A Distributed Storage System for Structured Data", ACM Trans. Comput. Syst. 26 (2), June 2008.

You can describe this article through a set of statements, such as:

  • The Bigtable journal article has the title "Bigtable: A Distributed Storage System for Structured Data."
  • The Bigtable journal article is written by Fay Chang.

where:

  • The Bigtable journal article is the subject in both statements.
  • has the title is the predicate in the first statement.
  • "Bigtable: A Distributed Storage System for Structured Data" is the object in the first statement.
  • is written by is the predicate in the second statement.
  • Fay Chang is the object in the second statement.

To represent these statements in RDF/XML, you must determine the URI of the subject and the names of the predicates in an appropriate namespace. For the article URI, use the Digital Object Identifier (DOI) URI of the Bigtable paper, http://doi.acm.org/10.1145/1365815.1365816, and reformulate the first statement as follows:

The article with the URI "http://doi.acm.org/10.1145/1365815.1365816" has the title "Bigtable: A Distributed Storage System for Structured Data."

For the predicates, use terms from the vocabularies in Table 2.

Table 2. Namespace URIs and prefixes
Prefix    Namespace URI                                  Description
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#    RDF vocabulary terms
dc        http://purl.org/dc/elements/1.1/               Dublin Core elements
dcterms   http://purl.org/dc/terms/                      Dublin Core terms
eprint    http://purl.org/eprint/terms/                  Eprints terms
foaf      http://xmlns.com/foaf/0.1/                     FOAF vocabulary terms

Based on these vocabularies, you can formulate a statement about the Bigtable journal article in RDF/XML as shown in Listing 1.

Listing 1. Article description in RDF/XML
<rdf:RDF 

   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
   xmlns:dc="http://purl.org/dc/elements/1.1/" > 

   <rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816"> 
       <dc:title>Bigtable: A Distributed Storage System for Structured Data</dc:title> 
       <dc:creator 
            rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/> 

  </rdf:Description> 

</rdf:RDF>

Here, the object for the <dc:title> predicate is a literal, while the object for <dc:creator> is a URI.
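To see the two assertions as (subject, predicate, object) triples, you can read the description with a short Python sketch such as the one below. It uses only the standard library and handles just the flat rdf:Description shape used in this article, not RDF/XML in general; the function name triples is hypothetical, and the embedded document is a copy of Listing 1 with a well-formed title element.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

LISTING_1 = """<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/">
   <rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816">
       <dc:title>Bigtable: A Distributed Storage System for Structured Data</dc:title>
       <dc:creator
            rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/>
   </rdf:Description>
</rdf:RDF>"""

def triples(rdf_xml):
    """Yield (subject, predicate, object) for flat rdf:Description elements."""
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree names tags "{namespace}local"; joining both parts
            # gives the predicate URI.
            predicate = prop.tag[1:].replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # The object is a URI if rdf:resource is present, else a literal.
            obj = resource if resource is not None else (prop.text or "").strip()
            yield subject, predicate, obj

for triple in triples(LISTING_1):
    print(triple)
```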

The complete description of the article, with information about all the authors, the publication date, and the publisher's name, is shown in Listing 2, where the <dc:type> predicate defines the article type.

Listing 2. Full RDF/XML article description
$ cat rdf/Bigtable.xml 
<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF 

  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
  xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:dcterms="http://purl.org/dc/terms/" > 

  <rdf:Description rdf:about="http://doi.acm.org/10.1145/1365815.1365816"> 
    <dc:title>Bigtable: A Distributed Storage System for Structured Data</dc:title>
    <dc:type>http://purl.org/eprint/type/JournalArticle</dc:type> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Fay_Chang"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Jeffrey_Dean"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Sanjay_Ghemawat"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Wilson_Hsieh"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Deborah_Wallach"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Mike_Burrows"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Tushar_Chandra"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Andrew_Fikes"/> 
    <dc:creator 
      rdf:resource="http://purl.org/sweb/Authors/google/research/Robert_Gruber"/> 
    <dc:publisher>ACM, New York, NY, USA</dc:publisher> 
    <dcterms:issued>2008-06</dcterms:issued> 
    <dc:subject>distributed databases</dc:subject> 
    <dcterms:isPartOf rdf:resource="urn:ISSN:0734-2071" /> 
    <dcterms:bibliographicCitation>
ACM Trans. Comput. Syst., 26 (2) 26 pages (2008)
    </dcterms:bibliographicCitation> 

  </rdf:Description> 

</rdf:RDF>

You use the Friend of a Friend (FOAF) vocabulary for describing the authors. Listing 3 shows assertions about the subject Jeffrey Dean.

Listing 3. RDF/XML description of an author
$ cat rdf/Jeffrey.xml
<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF 

  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
  xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:foaf="http://xmlns.com/foaf/0.1/" 
  xmlns:eprint="http://purl.org/eprint/terms/" > 

  <foaf:Person 
       rdf:about="http://purl.org/sweb/Authors/google/research/Jeffrey_Dean"> 
    <foaf:givenname>Jeffrey</foaf:givenname> 
    <foaf:family_name>Dean</foaf:family_name> 
    <foaf:homepage rdf:resource="http://research.google.com/people/jeff/" /> 
    <eprint:affiliatedInstitution>Google, Inc.</eprint:affiliatedInstitution> 

  </foaf:Person> 

</rdf:RDF>

Modeling a semantic Web with HBase

The first step in modeling a semantic Web with HBase is mapping RDF to HBase tables. To store the RDF/XML descriptions of the articles and authors, you create two tables called articles and authors. Design these tables keeping in mind that you want to support queries about the affiliation of the authors.

The row keys of the Articles table are derived from the DOI of the article. For example, the row key of the Bigtable paper is doi.org.acm_10.1145_1365815_1365816. The schema has three column families:

  • info for information such as the title, publication name, and date of publication
  • authors for the URIs of the authors
  • affiliations for the authors' affiliations

The row keys of the authors table are derived from the URIs of the authors. For example, the URI of Jeffrey Dean (see Listing 3) is converted to the key google_research_Jeffrey_Dean. The schema has two column families: info for storing information about the authors, such as the name and home page, and affiliations for the author's affiliation history.
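The article gives only one example of the URI-to-key conversion, so the rule in the following Python sketch (drop the path prefix up to Authors/ and replace each slash with an underscore) is an assumption inferred from that example. The function name author_row_key and the constant AUTHORS_PREFIX are hypothetical.

```python
from urllib.parse import urlparse

AUTHORS_PREFIX = "/sweb/Authors/"  # path prefix of the author URIs used in the listings

def author_row_key(uri):
    """Derive a row key by dropping the prefix and replacing '/' with '_'.

    The rule is an assumption inferred from the single example in the text.
    """
    path = urlparse(uri).path
    if not path.startswith(AUTHORS_PREFIX):
        raise ValueError(f"not an author URI: {uri}")
    return path[len(AUTHORS_PREFIX):].replace("/", "_")

print(author_row_key("http://purl.org/sweb/Authors/google/research/Jeffrey_Dean"))
```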

Creating the HBase tables

One way to interact with HBase is via the REST API. Create the tables with the HTTP requests shown in Listing 4.

Listing 4. Creating the articles and authors tables
$ cat tables/Articles.xml
<?xml version="1.0" encoding="UTF-8" ?>
<table>
  <name>Articles</name>
  <columnfamilies>
    <columnfamily>
       <name>info</name>
    </columnfamily>
    <columnfamily>
        <name>authors</name>
    </columnfamily>
    <columnfamily>
      <name>affiliations</name>
    </columnfamily>
  </columnfamilies>
</table>

$ cat tables/Authors.xml
<?xml version="1.0" encoding="UTF-8" ?>
<table>
  <name>Authors</name>
  <columnfamilies>
    <columnfamily>
       <name>info</name>
    </columnfamily>
    <columnfamily>
      <name>affiliations</name>
    </columnfamily>
  </columnfamilies>
</table>

$ cat tables/Articles.xml | curl  -X POST -T -  http://localhost:60010/api/
$ cat tables/Authors.xml | curl  -X POST -T -  http://localhost:60010/api/
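If you prefer to generate the schema documents rather than write them by hand, a sketch along the following lines could produce the same element structure as Listing 4. The function name table_schema_xml is hypothetical, the output omits the XML declaration, and the sketch only builds the document; it does not send it to the server.

```python
import xml.etree.ElementTree as ET

def table_schema_xml(table_name, families):
    """Build a <table> document like those in Listing 4 for the given families."""
    table = ET.Element("table")
    ET.SubElement(table, "name").text = table_name
    column_families = ET.SubElement(table, "columnfamilies")
    for family in families:
        column_family = ET.SubElement(column_families, "columnfamily")
        ET.SubElement(column_family, "name").text = family
    return ET.tostring(table, encoding="unicode")

print(table_schema_xml("Authors", ["info", "affiliations"]))
```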

Inserting data into the tables

Populate the authors and articles tables with the information assembled in the section "RDF description of a journal article." Listing 5 shows how to insert information about the author Jeffrey Dean into the authors table (values are in Base64).

Listing 5. Populating the authors table
$ more rows/Jeffrey_Dean_info.xml 
<?xml version="1.0" encoding="UTF-8" ?>

<column>  
   <name>info:name</name>
   <value>SmVmZnJleSBEZWFuCg==</value>
</column>

$ more rows/Jeffrey_Dean_affiliation.xml 
<?xml version="1.0" encoding="UTF-8" ?>
<column>
   <name>affiliations:</name>
   <value>R29vZ2xlCg==</value>
</column>

$ cat rows/Jeffrey_Dean_info.xml |  \
    curl -X POST  -T - http://localhost:60010/api/Authors/row/google_research_Jeffrey_Dean

$ cat rows/Jeffrey_Dean_affiliation.xml |  \
    curl -X POST -T - http://localhost:60010/api/Authors/row/google_research_Jeffrey_Dean
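The Base64 values in Listing 5 can be produced and checked with a few lines of Python. The sketch below is an illustration only: the function name column_xml is hypothetical, and it builds the column document without sending it.

```python
import base64
import xml.etree.ElementTree as ET

def column_xml(column_key, value):
    """Build a <column> document whose value is Base64-encoded."""
    column = ET.Element("column")
    ET.SubElement(column, "name").text = column_key
    ET.SubElement(column, "value").text = base64.b64encode(value).decode("ascii")
    return ET.tostring(column, encoding="unicode")

# Decode the value used for info:name in Listing 5.
print(base64.b64decode("SmVmZnJleSBEZWFuCg=="))

# Rebuild the same column document from the decoded bytes.
print(column_xml("info:name", b"Jeffrey Dean\n"))
```

Note that the values in Listing 5 decode to text that ends with a newline character, so the stored cell contains that newline as well.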

Perform a POST request for each insertion and omit the timestamps because HBase assigns default timestamps. Listing 6 shows how you populate the articles table with information about the Bigtable journal article.

Listing 6. Populating the articles table
$ more rows/Bigtable_info.xml 
<?xml version="1.0" encoding="UTF-8" ?>
<column>
  <name>info:title</name>
  <value>QmlndGFibGU6IEEgRGlzdHJpYnV0ZWQgU3RvcmFnZSBTeXN0ZW0gZm9yIFN0cnVjdHVyZWQgRGF0
YQo=
  </value>
</column>

$ more rows/Bigtable_author_2.xml
<?xml version="1.0" encoding="UTF-8" ?>
<column>
  <name>authors:2</name>
  <value>SmVmZnJleSBEZWFuCg==</value>
</column>

$ cat rows/Bigtable_info.xml  |  curl -X POST \
      -T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581
$ cat rows/Bigtable_author_2.xml  | curl -X POST \
      -T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581

Mining the tables

Download the code

The source code for this article is available from Download.

Once you have populated the authors and articles tables, you can perform batch operations on the data. Here, you're looking for the affiliation of the authors of the Bigtable paper. A batch process will build this information in the column family affiliations of the articles table by scanning the tables and extracting information from the same column family in the authors table. For the Bigtable paper, the action of the batch process is equivalent to the code shown in Listing 7, where all the authors of the article are affiliated with Google.

Listing 7. Adding processed information to the articles table
$ more rows/Bigtable_affiliations.xml 
<?xml version="1.0" encoding="UTF-8" ?>
<column>
  <name>affiliations:</name>
  <value>R29vZ2xlCg==</value>
</column>

$ cat rows/Bigtable_affiliations.xml  |  curl -X POST \
      -T - http://localhost:60010/api/Articles/row/doi.org.acm_10.1145_1365815_136581
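The batch process itself is not shown in this article. As a rough sketch of the idea, the following Python code runs the same denormalization step over in-memory dictionaries that stand in for the two tables. Everything here is an assumption made for illustration: the function name fill_affiliations, the use of author row keys as the values of the authors family, and the Fay Chang row key.

```python
# In-memory stand-ins for the two tables (row key -> column key -> value).
authors = {
    "google_research_Jeffrey_Dean": {"info:name": "Jeffrey Dean", "affiliations:": "Google"},
    "google_research_Fay_Chang": {"info:name": "Fay Chang", "affiliations:": "Google"},
}
articles = {
    "doi.org.acm_10.1145_1365815_1365816": {
        "info:title": "Bigtable: A Distributed Storage System for Structured Data",
        "authors:1": "google_research_Fay_Chang",
        "authors:2": "google_research_Jeffrey_Dean",
    },
}

def fill_affiliations(articles, authors):
    """Copy the affiliations of each article's authors into the article row."""
    for row in articles.values():
        found = set()
        for column, author_key in list(row.items()):
            if column.startswith("authors:"):
                author = authors.get(author_key, {})
                found.update(value for col, value in author.items()
                             if col.startswith("affiliations:"))
        if found:
            row["affiliations:"] = ", ".join(sorted(found))

fill_affiliations(articles, authors)
print(articles["doi.org.acm_10.1145_1365815_1365816"]["affiliations:"])
```

On a real cluster this scan would more plausibly run as a parallel job over table regions; the loop above only shows the logic applied to each row.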

To get the affiliation of the authors of the Bigtable paper, perform a GET on the affiliations: column of the Bigtable paper row.

Listing 8. Extracting information from the articles table
$ curl -X GET  http://localhost:60010/api/Articles/row/\
doi.org.acm_10.1145_1365815_136581?column=affiliations:
<?xml version="1.0" encoding="UTF-8" ?>
<row>
  <count>
1
  </count>
  <column>
  <name>
YWZmaWxpYXRpb25zOg==
  </name>
  <value>
R29vZ2xlCg==
  </value>
  <timestamp>
1250049020108
  </timestamp>
 </column>
</row>

Decoding the Base64 values gives affiliations: for YWZmaWxpYXRpb25zOg== and Google for R29vZ2xlCg==.
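You can do the decoding programmatically. The Python sketch below parses a copy of the response from Listing 8 (closed with a </row> tag so that it is well formed) and decodes the column name and value. The function name decode_row is hypothetical, and the sketch strips the trailing newline that the stored value carries.

```python
import base64
import xml.etree.ElementTree as ET

RESPONSE = """<row>
  <count>
1
  </count>
  <column>
  <name>
YWZmaWxpYXRpb25zOg==
  </name>
  <value>
R29vZ2xlCg==
  </value>
  <timestamp>
1250049020108
  </timestamp>
 </column>
</row>"""

def decode_row(xml_text):
    """Return a list of (column key, value, timestamp) tuples from a row response."""
    cells = []
    for column in ET.fromstring(xml_text).findall("column"):
        name = base64.b64decode(column.findtext("name").strip()).decode("utf-8")
        value = base64.b64decode(column.findtext("value").strip()).decode("utf-8")
        timestamp = int(column.findtext("timestamp").strip())
        cells.append((name, value.rstrip("\n"), timestamp))
    return cells

print(decode_row(RESPONSE))
```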


A simple example

You can use HBase to answer more difficult questions than the one discussed so far in this article. For example, you can process a "Jeopardy!" clue such as, "The author of this journal article about Bigtable has worked for the World Health Organization," and figure out the response. To this end, create a table called Keywords whose row keys are the keywords of the journal articles. This table includes a column family journal_articles used to store the DOIs of the articles in which the keyword occurs.

After storing the keywords for the Bigtable paper, the keywords table will include a row with the row key Bigtable, the column key journal_articles:1, and the cell value doi.org.acm_10.1145_1365815_1365816, which is the row key of the paper in the articles table. To answer the quiz:

  1. Look up the keyword Bigtable in the Keywords table, and get the DOI of the article.
  2. Look up the article in the articles table by the DOI obtained in step 1, and get the URIs of the authors.
  3. Look up the authors in the authors table by URI, and extract the affiliations, both present and past.
  4. From the result set obtained in step 3, select the row whose column family affiliations has a member World Health Organization.

The response is, "Who is Jeffrey Dean?"
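The four steps can be sketched in Python over in-memory dictionaries that stand in for the three tables. This is an illustration under stated assumptions, not the article's code: the function name answer, the numbered affiliations labels, the Fay Chang row, and the use of author row keys in the authors family are all hypothetical.

```python
# In-memory stand-ins for the tables (row key -> column key -> value).
ARTICLE_KEY = "doi.org.acm_10.1145_1365815_1365816"

keywords = {"Bigtable": {"journal_articles:1": ARTICLE_KEY}}
articles = {
    ARTICLE_KEY: {
        "authors:1": "google_research_Fay_Chang",
        "authors:2": "google_research_Jeffrey_Dean",
    },
}
authors = {
    "google_research_Fay_Chang": {
        "info:name": "Fay Chang",
        "affiliations:1": "Google",
    },
    "google_research_Jeffrey_Dean": {
        "info:name": "Jeffrey Dean",
        "affiliations:1": "Google",
        "affiliations:2": "World Health Organization",
    },
}

def answer(keyword, institution):
    """Names of authors of articles about `keyword` who were affiliated with `institution`."""
    names = []
    for article_key in keywords.get(keyword, {}).values():           # step 1
        article = articles.get(article_key, {})
        author_keys = [value for column, value in article.items()
                       if column.startswith("authors:")]              # step 2
        for author_key in author_keys:
            author = authors.get(author_key, {})
            affiliations = {value for column, value in author.items()
                            if column.startswith("affiliations:")}    # step 3
            if institution in affiliations:                           # step 4
                names.append(author["info:name"])
    return names

print(answer("Bigtable", "World Health Organization"))
```

Each step is a key lookup in one table, which is the kind of access HBase is designed to serve without joins.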


Conclusion

HBase and Bigtable promote a new way of thinking about the data-processing pipeline. The SQL-like process of extracting and transforming the data in a monolithic system is replaced with a divide-and-conquer approach, in which the database supports Create, Read, Update, Delete (CRUD) operations, while complex transformations are delegated to external components designed for parallel processing. For example, parallel processing could be done with MapReduce applications, and high throughput could be obtained with a distributed and replicated file system, such as the Hadoop Distributed File System (HDFS) or the Google File System.

In the absence of table joins, de-normalization is often used in HBase to keep related information in one table. This article illustrated the approach.

Although HBase still needs performance improvement, it shows real promise of becoming a mainstream solution.


Download

Description   Name                         Size
Sample XML    os-hbase-source_hbase.zip    11KB

Resources

Learn

Get products and technologies

Discuss
