Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Java theory and practice: Make database queries without the database

Borrowing a data model can simplify development

Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix
Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications.

Summary:  When you have a hammer, everything looks like a nail (as the old saying goes). But what if you don't have a hammer? Well, sometimes, you can borrow a hammer. Then, hammer in hand, you can bang the virtual nails with the borrowed hammer, return it, and no one is the wiser. In this month's Java theory and practice, Brian Goetz demonstrates how data manipulation hammers such as SQL or XQuery can be applied to ad-hoc data.

View more content in this series

Date:  31 May 2005
Level:  Introductory
Also available in:   Russian  Japanese

Activity:  404 views
Comments:  

I recently advised on a project that involved a fair bit of Web crawling. As the program crawled the various Web sites, it would build a database of the sites and pages crawled, the links each page contained, the results of analysis on each pages, and so on. The end result was to be a set of reports detailing which sites and pages were crawled, which were followed, which links were broken, which pages had errors, page metrics that were calculated, and so on. At the outset, no one knew exactly what reports were going to be needed or what format they should take -- just that there would need to be reports. This suggested that the report development phase was going to be iterative, with several rounds of feedback, tweaking, and possibly starting over with a different organization. The only specified reporting requirement was that the reports should be presented as XML, or possibly HTML. As a result, the process of developing and modifying the reports would have to be lightweight, because the reporting requirements were going to be "discovered dynamically," rather than specified up-front.

No database needed

The "obvious" approach to the problem was to put everything in a SQL database -- pages, links, metrics, HTTP result codes, timing results, and other metadata. The problem lent itself well to a relational representation, especially as it was not necessary to store the contents of the visited pages, just their structure and metadata.

So far, this project looked like a typical database application, and there was no shortage of candidate persistence strategies. But maybe it was possible to avoid the complexity of using a database for persistence -- the crawler was only going to visit a few tens of thousand of pages. This number was few enough that it was possible to keep the whole database in memory, and persistence, if needed, could be achieved through serialization. (Yes, the load and save operations would take a long time, but they did not have to be performed frequently.) Laziness had yielded a dividend -- not having to deal with persistence can greatly reduce the amount of time to develop an application, resulting in a significant savings in terms of development effort. Building and manipulating an in-memory data structure is just a lot easier than having to go to the database for every operation that adds, fetches, or analyzes data. No matter which persistence model you choose, it always constrains how you structure any code that touches the data.

The in-memory data structure had a sort of tree structure, as shown in Listing 1, rooted at the home pages of the various sites being crawled, so the Visitor pattern was ideal for searching it or extracting data from it. (It was not too hard to build a base Visitor class that prevented getting stuck in cycles of links -- A links to B, B links to C, C links back to A.)


Listing 1. Simplified schema for Web crawler

public class Site {
    Page homepage;
    Collection<Page> pages;
    Collection<Link> links;
}

public class Page {
    String url;
    Site site;
    PageMetrics metrics;
}

public class Link {
    Page linkFrom;
    Page linkTo;
    String anchorText;
}

The crawler application had a dozen or so Visitors, which did things like selecting pages for further analysis, selecting pages that had unfollowed links, selecting pages with errors, tabulating the "most linked" pages, and so on. Because all of these operations were so simple, the Visitor pattern, shown in Listing 2, worked quite well, and because the data structure could fit into memory, even exhaustive searches were cheap enough:


Listing 2. Visitor pattern for Web crawler database

public interface Visitor {
    public void visitSite(Site site);
    public void visitLink(Link link);
}

Oops, forgot about reporting

The Visitor strategy worked just great for accessing the data, until it was time to run the reports. One of the benefits of using a database for persistence is that the power of SQL really shines when it comes to generating reports -- you can pretty much let the database do all the work. Even better, prototyping reports with SQL is easy -- you run your prototype report, and if the results aren't what you need, you tweak the SQL query or write a new query, and try again. The edit-compile-run cycle, when all you are changing is a SQL query, can be pretty quick. If the SQL isn't stored in the program, you can even skip the compile part of the cycle, which gives the ability to rapidly prototype your reports. Once you've found the reports you want, then it is easy enough to build them into your application.

So, the in-memory data structure, which was great for adding new results, finding a specific result, and all sorts of ad-hoc traversals, had become a liability when it came to reporting. For any report whose structure was not similar to that of the database, a Visitor would have to create a whole new data structure to hold the report data. Therefore, each type of report would require its own report-specific intermediate data structure for holding the results, a visitor for populating the intermediate data structure, and then post-processing code for turning the intermediate data structure into the final report. This approach seemed like a lot of work, especially when most of the prototyped reports were going to be thrown away. For example, let's say you wanted a report that listed all the pages from other sites linked to from a given site, and for each of those external pages, a list of pages on the site that link to it, and then sort the report by number of links, so the most-linked pages appear first. This plan basically turns the data structure inside-out. To accomplish this sort of data transformation with a Visitor, you need to take the list of external page links reachable from a given site and sort them by the linked-to page, as shown in Listing 3:


Listing 3. Visitor tabulates most-linked pages, along with pages that link to them

public class InvertLinksVisitor {
    public Map<Page, Set<Page>> map = ...;
    
    public void visitLink(Link link) {
        if (link.linkFrom.site.equals(targetSite) 
            && !link.linkTo.site.equals(targetSite)) {
            if (!map.containsKey(link.linkTo))
                map.put(link.linkTo, new HashSet<Page>());
            map.get(link.linkTo).add(link.linkFrom);
        }
    }
}

The Visitor in Listing 3 produces a map that associates each external page with a list of internal pages that link to it. To prepare the report, you would then have to sort the entries by the size of the associated page set, and then create the report. While none of these steps is all that difficult, the amount of report-specific code for each report is significant, and given that rapid report prototyping was an important goal (because there were no stated reporting requirements), the overhead of trying out a new report was higher than ideal. Many reports required multiple passes on the data to select, aggregate, and sort the data.


My kingdom for a data model

At this point, the lack of a formal data model, which describes the data gathered and against which you could more easily express selection and aggregation queries, was starting to look like a liability. Perhaps laziness was not as efficient as first hoped. But, while this application had no formal data model, maybe we could "borrow one" by streaming the data into an in-memory database and querying against that. Two possibilities immediately sprang to mind; HSQLDB, the open-source in-memory SQL database, and XQuery. I didn't need the persistence that a database offered, but I did want the query language.

HSQLDB is an embeddable database engine written in the Java™ language. It has table types for both in-memory and disk-based tables, and was designed for being embedded entirely within an application, eliminating the administrative overhead associated with most real databases. To load the data into HSQLDB, it would only be necessary to write a Visitor that traversed the in-memory data structure and generated the appropriate INSERT statements for each entity to be stored. Then you could execute SQL queries against the in-memory database tables to do the reports and throw away the "database" when finished.

Oops, forgot how annoying relational databases are

The HSQLDB approach was workable, but it quickly became clear that I would have to pay the object-relational impedance mismatch penalty not once, but twice -- once when converting the tree-structured database into a relational data model, and again when trying to turn the results of flat relational queries into structured XML or HTML result sets. And post-processing the JDBC ResultSet into a DOM representation of an XML or HTML document was still nontrivial and would require some customized coding for each report. So while an in-memory SQL database did enable simplifying the queries, the extra coding required to get the data in and out of the database would have eaten up all the savings.


XQuery to the rescue

The other data query alternative that was easily available was XQuery. XQuery has the advantage that it is designed for producing XML or HTML documents as the result of its queries, so no postprocessing would be required on query results. This idea was attractive -- only one layer of coding for each report, rather than two or more. So the first task was to build an XML document that represented the whole data set. It was straightforward to design a simple XML data model and to write a Visitor that traversed the data structure and appended each element to a DOM Document. (There was never a need to write this document out. It could be kept in memory for querying and then discarded when finished. When the underlying data changed, it could simply be regenerated.) Then all that was needed was to write XQuery queries, which would both select and aggregate the data for the report, and format them into the final desired format (XML or HTML). The queries could be stored in separate files for rapid prototyping and so that multiple report formats could be supported. The code to evaluate the query using Saxon is shown in Listing 4:


Listing 4. Code to execute XQuery query and serialize results to an XML or HTML document

  String query = readFile(queryFile + ".xq");
  Configuration c = new Configuration();
  StaticQueryContext qp = new StaticQueryContext(c);
  XQueryExpression xe = qp.compileQuery(query);
  DynamicQueryContext dqc = new DynamicQueryContext(c);
  dqc.setContextNode(new DocumentWrapper(document, z.getName(), c));
  List result = xe.evaluate(dqc);

  FileOutputStream os = new FileOutputStream(fileName);
  XMLSerializer serializer = new XMLSerializer (os, format);
  serializer.asDOMSerializer();

  for(Iterator i = result.iterator(); i.hasNext(); ) {
      Object o = i.next();
      if (o instanceof Element)
          serializer.serialize((Element) o);
      else if (o instanceof Attr) {
          Element e = document.createElement("scalar");
          e.setTextContent(((Attr) o).getNodeValue());
          serializer.serialize(e);
      }
      else {
          Element e = document.createElement("scalar");
          e.setTextContent(o.toString());
          serializer.serialize(e);
      }
  }
  os.close(); 

The structure of the XML document representing the database was a little different from the in-memory data structure; each <site> element had nested <page> elements, each <page> element had nested <link> elements, and each <link> element had <link-to> and <link-from> elements. This representation turned out to be convenient enough for most reports.

Listing 5 shows a sample XQuery report that handles selection of links, sorting, and presentation. It has several advantages over the Visitor approach -- not only is it less code (because the query language supports selection, aggregation, and sorting), but all the code for the report -- selection, aggregation, sorting, and presentation -- is in one place.


Listing 5. XQuery code that produces the entire most-linked-pages report

<html>
<head><title>Most-linked pages</title></head>
<body>
<ul>
{
  let $links := //link[link-to/@siteUrl ne $targetSite
                       and link-from/@siteUrl eq $targetSite]
  for $page in distinct-values($links/link-to/@url)
  let $linkingPages := $links[link-to/@url eq $page]/link-from/@url
  order by count($linkingPages)
  return 
    <li>Page {$page}, {count($linkingPages)} links 
    <ul> {
      for $p in $linkingPages return <li>Linked from {$p/@url}</li>
    }
    </ul></li>
}
</ul> </body> </html>


Summary

From a development costs perspective, the XQuery approach turned out to be a significant savings. The tree-structure was ideal for building and searching the data, but not for reporting; the XML approach was well-suited for reporting (because we could leverage the power of XQuery), but would have been both more inconvenient and worse performing for implementing the entire application. Because the data set was a manageable size -- no more than a few tens of megabytes -- it was possible to convert the data from one form to another as was most convenient from a development perspective. A larger data set, one that could not be stored entirely in memory easily, would have required the entire application to be built around a database. While many good tools for dealing with persistent data are available, they all involve a lot more work than simply manipulating an in-memory data structure. If your data set is of a suitable size, you can have the best of both worlds.


Resources

  • Brian first got bitten by the XQuery bug a few months ago, as chronicled in "Screen-scraping with XQuery" (developerWorks, March 2005) . Read the complete Java theory and practice series by Brian Goetz, one of developerWorks' longest-running and most popular columns.

  • Howard Katz's "An introduction to XQuery" (developerWorks, June 2001) covers the basics and history of the XQuery standardization effort.

  • The tutorial, "Process XML using XML Query" (developerWorks, September 2002), by Nicholas Chase dives deeper into the uses and syntax of XQuery.

  • Check out the Saxon XQuery and XSLT processor.

  • You can try out the free community edition of the Mark Logic server, a content database that lets you search large document bases with XQuery.

  • To learn more about Java technology, visit the developerWorks Java zone. You'll find technical documentation, how-to articles, education, downloads, product information, and more.

  • Visit the New to Java technology site for the latest resources to help you get started with Java programming.

  • Get involved in the developerWorks community by participating in developerWorks blogs.

  • Browse for books on these and other technical topics.

About the author

Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Java technology, XML
ArticleID=84030
ArticleTitle=Java theory and practice: Make database queries without the database
publish-date=05312005

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers