 | Level: Introductory Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix
31 May 2005 When you have a hammer, everything looks like a nail (as the old saying goes). But what if you don't have a hammer? Well, sometimes, you can borrow a hammer. Then, hammer in hand, you can bang the virtual nails with the borrowed hammer, return it, and no one is the wiser. In this month's Java theory and practice, Brian Goetz demonstrates how data manipulation hammers such as SQL or XQuery can be applied to ad-hoc data.
I recently advised on a project that involved a fair bit of Web
crawling. As the program crawled the various Web sites, it would build
a database of the sites and pages crawled, the links each page
contained, the results of analysis on each pages, and so on. The end
result was to be a set of reports detailing which sites and pages were
crawled, which were followed, which links were broken, which pages had
errors, page metrics that were calculated, and so on. At the outset,
no one knew exactly what reports were going to be needed or what
format they should take -- just that there would need to be
reports. This suggested that the report development phase was going to
be iterative, with several rounds of feedback, tweaking, and possibly
starting over with a different organization. The only specified
reporting requirement was that the reports should be presented as XML,
or possibly HTML. As a result, the process of developing and modifying
the reports would have to be lightweight, because the reporting
requirements were going to be "discovered dynamically," rather than
specified up-front.
No database needed
The "obvious" approach to the problem was to put everything in a SQL
database -- pages, links, metrics, HTTP result codes, timing results,
and other metadata. The problem lent itself well to a relational
representation, especially as it was not necessary to store the
contents of the visited pages, just their structure and metadata.
So far, this project looked like a typical database application, and
there was no shortage of candidate persistence strategies. But maybe
it was possible to avoid the complexity of using a database for
persistence -- the crawler was only going to visit a few tens of
thousand of pages. This number was few enough that it was possible to
keep the whole database in memory, and persistence, if needed, could
be achieved through serialization. (Yes, the load and save operations
would take a long time, but they did not have to be performed
frequently.) Laziness had yielded a dividend -- not having to deal
with persistence can greatly reduce the amount of time to develop an
application, resulting in a significant savings in terms of
development effort. Building and manipulating an in-memory data
structure is just a lot easier than having to go to the database for
every operation that adds, fetches, or analyzes data. No matter which
persistence model you choose, it always constrains how you structure
any code that touches the data.
The in-memory data structure had a sort of tree structure, as shown in
Listing 1, rooted at the home pages of the various sites being
crawled, so the Visitor pattern was ideal for searching it or extracting
data from it. (It was not too hard to build a base Visitor class that
prevented getting stuck in cycles of links -- A links to B, B links to C,
C links back to A.)
Listing 1. Simplified schema for Web crawler
public class Site {
Page homepage;
Collection<Page> pages;
Collection<Link> links;
}
public class Page {
String url;
Site site;
PageMetrics metrics;
}
public class Link {
Page linkFrom;
Page linkTo;
String anchorText;
}
|
The crawler application had a dozen or so Visitors, which did things like
selecting pages for further analysis, selecting pages that had
unfollowed links, selecting pages with errors, tabulating the "most
linked" pages, and so on. Because all of these operations were so
simple, the Visitor pattern, shown in Listing 2, worked quite well, and because the data
structure could fit into memory, even exhaustive searches were cheap
enough:
Listing 2. Visitor pattern for Web crawler database
public interface Visitor {
public void visitSite(Site site);
public void visitLink(Link link);
}
|
Oops, forgot about reporting
The Visitor strategy worked just great for accessing the data, until
it was time to run the reports. One of the benefits of using a
database for persistence is that the power of SQL really shines when
it comes to generating reports -- you can pretty much let the database
do all the work. Even better, prototyping reports with SQL is easy --
you run your prototype report, and if the results aren't what you
need, you tweak the SQL query or write a new query, and try again. The
edit-compile-run cycle, when all you are changing is a SQL query, can
be pretty quick. If the SQL isn't stored in the program, you can even
skip the compile part of the cycle, which gives the ability to rapidly
prototype your reports. Once you've found the reports you want, then
it is easy enough to build them into your application.
So, the in-memory data structure, which was great for adding new
results, finding a specific result, and all sorts of ad-hoc traversals,
had become a liability when it came to reporting. For any report whose
structure was not similar to that of the database, a Visitor would
have to create a whole new data structure to hold the report
data. Therefore, each type of report would require its own
report-specific intermediate data structure for holding the results, a
visitor for populating the intermediate data structure, and then
post-processing code for turning the intermediate data structure into
the final report. This approach seemed like a lot of work, especially
when most of the prototyped reports were going to be thrown away. For
example, let's say you wanted a report that listed all the pages from
other sites linked to from a given site, and for each of those
external pages, a list of pages on the site that link to it, and then
sort the report by number of links, so the most-linked pages appear
first. This plan basically turns the data structure inside-out. To
accomplish this sort of data transformation with a Visitor, you need
to take the list of external page links reachable from a given site
and sort them by the linked-to page, as shown in Listing 3:
Listing 3. Visitor tabulates most-linked pages, along with pages that link to them
public class InvertLinksVisitor {
public Map<Page, Set<Page>> map = ...;
public void visitLink(Link link) {
if (link.linkFrom.site.equals(targetSite)
&& !link.linkTo.site.equals(targetSite)) {
if (!map.containsKey(link.linkTo))
map.put(link.linkTo, new HashSet<Page>());
map.get(link.linkTo).add(link.linkFrom);
}
}
}
|
The Visitor in Listing 3 produces a map that associates each external
page with a list of internal pages that link to it. To prepare the
report, you would then have to sort the entries by the size of the
associated page set, and then create the report. While none of these
steps is all that difficult, the amount of report-specific code for
each report is significant, and given that rapid report prototyping
was an important goal (because there were no stated reporting
requirements), the overhead of trying out a new report was higher than
ideal. Many reports required multiple passes on the data to select,
aggregate, and sort the data.
My kingdom for a data model
At this point, the lack of a formal data model, which describes the
data gathered and against which you could more easily express selection
and aggregation queries, was starting to look like a
liability. Perhaps laziness was not as efficient as first hoped. But,
while this application had no formal data model, maybe we could "borrow
one" by streaming the data into an in-memory database and querying
against that. Two possibilities immediately sprang to mind; HSQLDB,
the open-source in-memory SQL database, and XQuery. I didn't need the
persistence that a database offered, but I did want the query
language.
HSQLDB is an embeddable database engine written in the Java™
language. It has table types for both in-memory and disk-based tables,
and was designed for being embedded entirely within an application,
eliminating the administrative overhead associated with most real
databases. To load the data into HSQLDB, it would only be necessary to
write a Visitor that traversed the in-memory data structure and
generated the appropriate INSERT statements for each entity to be
stored. Then you could execute SQL queries against the in-memory
database tables to do the reports and throw away the "database" when
finished.
Oops, forgot how annoying relational databases are
The HSQLDB approach was workable, but it quickly became clear that I
would have to pay the object-relational impedance mismatch penalty not
once, but twice -- once when converting the tree-structured database
into a relational data model, and again when trying to turn the
results of flat relational queries into structured XML or HTML result
sets. And post-processing the JDBC ResultSet into a DOM representation
of an XML or HTML document was still nontrivial and would require some
customized coding for each report. So while an in-memory SQL database
did enable simplifying the queries, the extra coding required
to get the data in and out of the database would have eaten up all the
savings.
XQuery to the rescue
The other data query alternative that was easily available was
XQuery. XQuery has the advantage that it is designed for producing XML
or HTML documents as the result of its queries, so no postprocessing
would be required on query results. This idea was attractive -- only
one layer of coding for each report, rather than two or more. So the
first task was to build an XML document that represented the whole
data set. It was straightforward to design a simple XML data model and
to write a Visitor that traversed the data structure and appended each
element to a DOM Document. (There was never a need to write this
document out. It could be kept in memory for querying and then
discarded when finished. When the underlying data changed, it could
simply be regenerated.) Then all that was needed was to write XQuery
queries, which would both select and aggregate the data for the
report, and format them into the final desired format (XML or
HTML). The queries could be stored in separate files for rapid
prototyping and so that multiple report formats could be
supported. The code to evaluate the query using Saxon is shown in
Listing 4:
Listing 4. Code to execute XQuery query and serialize results to an XML or HTML document
String query = readFile(queryFile + ".xq");
Configuration c = new Configuration();
StaticQueryContext qp = new StaticQueryContext(c);
XQueryExpression xe = qp.compileQuery(query);
DynamicQueryContext dqc = new DynamicQueryContext(c);
dqc.setContextNode(new DocumentWrapper(document, z.getName(), c));
List result = xe.evaluate(dqc);
FileOutputStream os = new FileOutputStream(fileName);
XMLSerializer serializer = new XMLSerializer (os, format);
serializer.asDOMSerializer();
for(Iterator i = result.iterator(); i.hasNext(); ) {
Object o = i.next();
if (o instanceof Element)
serializer.serialize((Element) o);
else if (o instanceof Attr) {
Element e = document.createElement("scalar");
e.setTextContent(((Attr) o).getNodeValue());
serializer.serialize(e);
}
else {
Element e = document.createElement("scalar");
e.setTextContent(o.toString());
serializer.serialize(e);
}
}
os.close();
|
The structure of the XML document representing the database was a
little different from the in-memory data structure; each <site>
element had nested <page> elements, each <page> element
had nested <link> elements, and each <link> element had
<link-to> and <link-from> elements. This representation
turned out to be convenient enough for most reports.
Listing 5 shows a sample XQuery report that handles selection of
links, sorting, and presentation. It has several advantages over the
Visitor approach -- not only is it less code (because the query
language supports selection, aggregation, and sorting), but all the
code for the report -- selection, aggregation, sorting, and
presentation -- is in one place.
Listing 5. XQuery code that produces the entire most-linked-pages report
<html>
<head><title>Most-linked pages</title></head>
<body>
<ul>
{
let $links := //link[link-to/@siteUrl ne $targetSite
and link-from/@siteUrl eq $targetSite]
for $page in distinct-values($links/link-to/@url)
let $linkingPages := $links[link-to/@url eq $page]/link-from/@url
order by count($linkingPages)
return
<li>Page {$page}, {count($linkingPages)} links
<ul> {
for $p in $linkingPages return <li>Linked from {$p/@url}</li>
}
</ul></li>
}
</ul> </body> </html>
|
Summary
From a development costs perspective, the XQuery approach turned out
to be a significant savings. The tree-structure was ideal for building
and searching the data, but not for reporting; the XML approach was
well-suited for reporting (because we could leverage the power of
XQuery), but would have been both more inconvenient and worse
performing for implementing the entire application. Because the data
set was a manageable size -- no more than a few tens of megabytes --
it was possible to convert the data from one form to another as was
most convenient from a development perspective. A larger data set, one
that could not be stored entirely in memory easily, would have
required the entire application to be built around a database. While
many good tools for dealing with persistent data are available, they
all involve a lot more work than simply manipulating an in-memory data
structure. If your data set is of a suitable size, you can have the
best of both worlds.
Resources - Participate in the discussion forum.
- Brian first got bitten by the XQuery bug a few months ago, as
chronicled in
"Screen-scraping
with XQuery" (developerWorks, March 2005) . Read the complete Java theory and practice series by Brian Goetz, one of developerWorks' longest-running and most popular columns.
- Howard Katz's
"An
introduction to XQuery" (developerWorks, June 2001) covers the
basics and history of the XQuery standardization effort.
- The tutorial,
"Process
XML using XML Query" (developerWorks, September 2002), by Nicholas
Chase dives deeper into the uses and syntax of XQuery.
- Check out the Saxon XQuery and XSLT processor.
- You can try out the free
community edition of the Mark Logic server, a content database
that lets you search large document bases with XQuery.
- To learn more about Java technology, visit the
developerWorks Java zone. You'll find technical
documentation, how-to articles,
education, downloads, product information, and more.
- Visit the New to Java
technology site for the latest resources to help you get started with Java
programming.
- Get involved in the developerWorks community by
participating in
developerWorks blogs.
- Browse for books on these and other technical topics.
About the author  | |  | Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications. |
Rate this page
|  |