I recently advised on a project that involved a fair bit of Web crawling. As the program crawled the various Web sites, it would build a database of the sites and pages crawled, the links each page contained, the results of analysis on each pages, and so on. The end result was to be a set of reports detailing which sites and pages were crawled, which were followed, which links were broken, which pages had errors, page metrics that were calculated, and so on. At the outset, no one knew exactly what reports were going to be needed or what format they should take -- just that there would need to be reports. This suggested that the report development phase was going to be iterative, with several rounds of feedback, tweaking, and possibly starting over with a different organization. The only specified reporting requirement was that the reports should be presented as XML, or possibly HTML. As a result, the process of developing and modifying the reports would have to be lightweight, because the reporting requirements were going to be "discovered dynamically," rather than specified up-front.
The "obvious" approach to the problem was to put everything in a SQL database -- pages, links, metrics, HTTP result codes, timing results, and other metadata. The problem lent itself well to a relational representation, especially as it was not necessary to store the contents of the visited pages, just their structure and metadata.
So far, this project looked like a typical database application, and there was no shortage of candidate persistence strategies. But maybe it was possible to avoid the complexity of using a database for persistence -- the crawler was only going to visit a few tens of thousand of pages. This number was few enough that it was possible to keep the whole database in memory, and persistence, if needed, could be achieved through serialization. (Yes, the load and save operations would take a long time, but they did not have to be performed frequently.) Laziness had yielded a dividend -- not having to deal with persistence can greatly reduce the amount of time to develop an application, resulting in a significant savings in terms of development effort. Building and manipulating an in-memory data structure is just a lot easier than having to go to the database for every operation that adds, fetches, or analyzes data. No matter which persistence model you choose, it always constrains how you structure any code that touches the data.
The in-memory data structure had a sort of tree structure, as shown in Listing 1, rooted at the home pages of the various sites being crawled, so the Visitor pattern was ideal for searching it or extracting data from it. (It was not too hard to build a base Visitor class that prevented getting stuck in cycles of links -- A links to B, B links to C, C links back to A.)
Listing 1. Simplified schema for Web crawler
public class Site {
Page homepage;
Collection<Page> pages;
Collection<Link> links;
}
public class Page {
String url;
Site site;
PageMetrics metrics;
}
public class Link {
Page linkFrom;
Page linkTo;
String anchorText;
}
|
The crawler application had a dozen or so Visitors, which did things like selecting pages for further analysis, selecting pages that had unfollowed links, selecting pages with errors, tabulating the "most linked" pages, and so on. Because all of these operations were so simple, the Visitor pattern, shown in Listing 2, worked quite well, and because the data structure could fit into memory, even exhaustive searches were cheap enough:
Listing 2. Visitor pattern for Web crawler database
public interface Visitor {
public void visitSite(Site site);
public void visitLink(Link link);
}
|
The Visitor strategy worked just great for accessing the data, until it was time to run the reports. One of the benefits of using a database for persistence is that the power of SQL really shines when it comes to generating reports -- you can pretty much let the database do all the work. Even better, prototyping reports with SQL is easy -- you run your prototype report, and if the results aren't what you need, you tweak the SQL query or write a new query, and try again. The edit-compile-run cycle, when all you are changing is a SQL query, can be pretty quick. If the SQL isn't stored in the program, you can even skip the compile part of the cycle, which gives the ability to rapidly prototype your reports. Once you've found the reports you want, then it is easy enough to build them into your application.
So, the in-memory data structure, which was great for adding new results, finding a specific result, and all sorts of ad-hoc traversals, had become a liability when it came to reporting. For any report whose structure was not similar to that of the database, a Visitor would have to create a whole new data structure to hold the report data. Therefore, each type of report would require its own report-specific intermediate data structure for holding the results, a visitor for populating the intermediate data structure, and then post-processing code for turning the intermediate data structure into the final report. This approach seemed like a lot of work, especially when most of the prototyped reports were going to be thrown away. For example, let's say you wanted a report that listed all the pages from other sites linked to from a given site, and for each of those external pages, a list of pages on the site that link to it, and then sort the report by number of links, so the most-linked pages appear first. This plan basically turns the data structure inside-out. To accomplish this sort of data transformation with a Visitor, you need to take the list of external page links reachable from a given site and sort them by the linked-to page, as shown in Listing 3:
Listing 3. Visitor tabulates most-linked pages, along with pages that link to them
public class InvertLinksVisitor {
public Map<Page, Set<Page>> map = ...;
public void visitLink(Link link) {
if (link.linkFrom.site.equals(targetSite)
&& !link.linkTo.site.equals(targetSite)) {
if (!map.containsKey(link.linkTo))
map.put(link.linkTo, new HashSet<Page>());
map.get(link.linkTo).add(link.linkFrom);
}
}
}
|
The Visitor in Listing 3 produces a map that associates each external page with a list of internal pages that link to it. To prepare the report, you would then have to sort the entries by the size of the associated page set, and then create the report. While none of these steps is all that difficult, the amount of report-specific code for each report is significant, and given that rapid report prototyping was an important goal (because there were no stated reporting requirements), the overhead of trying out a new report was higher than ideal. Many reports required multiple passes on the data to select, aggregate, and sort the data.
At this point, the lack of a formal data model, which describes the data gathered and against which you could more easily express selection and aggregation queries, was starting to look like a liability. Perhaps laziness was not as efficient as first hoped. But, while this application had no formal data model, maybe we could "borrow one" by streaming the data into an in-memory database and querying against that. Two possibilities immediately sprang to mind; HSQLDB, the open-source in-memory SQL database, and XQuery. I didn't need the persistence that a database offered, but I did want the query language.
HSQLDB is an embeddable database engine written in the Java™ language. It has table types for both in-memory and disk-based tables, and was designed for being embedded entirely within an application, eliminating the administrative overhead associated with most real databases. To load the data into HSQLDB, it would only be necessary to write a Visitor that traversed the in-memory data structure and generated the appropriate INSERT statements for each entity to be stored. Then you could execute SQL queries against the in-memory database tables to do the reports and throw away the "database" when finished.
Oops, forgot how annoying relational databases are
The HSQLDB approach was workable, but it quickly became clear that I would have to pay the object-relational impedance mismatch penalty not once, but twice -- once when converting the tree-structured database into a relational data model, and again when trying to turn the results of flat relational queries into structured XML or HTML result sets. And post-processing the JDBC ResultSet into a DOM representation of an XML or HTML document was still nontrivial and would require some customized coding for each report. So while an in-memory SQL database did enable simplifying the queries, the extra coding required to get the data in and out of the database would have eaten up all the savings.
The other data query alternative that was easily available was XQuery. XQuery has the advantage that it is designed for producing XML or HTML documents as the result of its queries, so no postprocessing would be required on query results. This idea was attractive -- only one layer of coding for each report, rather than two or more. So the first task was to build an XML document that represented the whole data set. It was straightforward to design a simple XML data model and to write a Visitor that traversed the data structure and appended each element to a DOM Document. (There was never a need to write this document out. It could be kept in memory for querying and then discarded when finished. When the underlying data changed, it could simply be regenerated.) Then all that was needed was to write XQuery queries, which would both select and aggregate the data for the report, and format them into the final desired format (XML or HTML). The queries could be stored in separate files for rapid prototyping and so that multiple report formats could be supported. The code to evaluate the query using Saxon is shown in Listing 4:
Listing 4. Code to execute XQuery query and serialize results to an XML or HTML document
String query = readFile(queryFile + ".xq");
Configuration c = new Configuration();
StaticQueryContext qp = new StaticQueryContext(c);
XQueryExpression xe = qp.compileQuery(query);
DynamicQueryContext dqc = new DynamicQueryContext(c);
dqc.setContextNode(new DocumentWrapper(document, z.getName(), c));
List result = xe.evaluate(dqc);
FileOutputStream os = new FileOutputStream(fileName);
XMLSerializer serializer = new XMLSerializer (os, format);
serializer.asDOMSerializer();
for(Iterator i = result.iterator(); i.hasNext(); ) {
Object o = i.next();
if (o instanceof Element)
serializer.serialize((Element) o);
else if (o instanceof Attr) {
Element e = document.createElement("scalar");
e.setTextContent(((Attr) o).getNodeValue());
serializer.serialize(e);
}
else {
Element e = document.createElement("scalar");
e.setTextContent(o.toString());
serializer.serialize(e);
}
}
os.close();
|
The structure of the XML document representing the database was a little different from the in-memory data structure; each <site> element had nested <page> elements, each <page> element had nested <link> elements, and each <link> element had <link-to> and <link-from> elements. This representation turned out to be convenient enough for most reports.
Listing 5 shows a sample XQuery report that handles selection of links, sorting, and presentation. It has several advantages over the Visitor approach -- not only is it less code (because the query language supports selection, aggregation, and sorting), but all the code for the report -- selection, aggregation, sorting, and presentation -- is in one place.
Listing 5. XQuery code that produces the entire most-linked-pages report
<html>
<head><title>Most-linked pages</title></head>
<body>
<ul>
{
let $links := //link[link-to/@siteUrl ne $targetSite
and link-from/@siteUrl eq $targetSite]
for $page in distinct-values($links/link-to/@url)
let $linkingPages := $links[link-to/@url eq $page]/link-from/@url
order by count($linkingPages)
return
<li>Page {$page}, {count($linkingPages)} links
<ul> {
for $p in $linkingPages return <li>Linked from {$p/@url}</li>
}
</ul></li>
}
</ul> </body> </html>
|
From a development costs perspective, the XQuery approach turned out to be a significant savings. The tree-structure was ideal for building and searching the data, but not for reporting; the XML approach was well-suited for reporting (because we could leverage the power of XQuery), but would have been both more inconvenient and worse performing for implementing the entire application. Because the data set was a manageable size -- no more than a few tens of megabytes -- it was possible to convert the data from one form to another as was most convenient from a development perspective. A larger data set, one that could not be stored entirely in memory easily, would have required the entire application to be built around a database. While many good tools for dealing with persistent data are available, they all involve a lot more work than simply manipulating an in-memory data structure. If your data set is of a suitable size, you can have the best of both worlds.
- Brian first got bitten by the XQuery bug a few months ago, as
chronicled in
"Screen-scraping
with XQuery" (developerWorks, March 2005) . Read the complete Java theory and practice series by Brian Goetz, one of developerWorks' longest-running and most popular columns.
- Howard Katz's
"An
introduction to XQuery" (developerWorks, June 2001) covers the
basics and history of the XQuery standardization effort.
- The tutorial,
"Process
XML using XML Query" (developerWorks, September 2002), by Nicholas
Chase dives deeper into the uses and syntax of XQuery.
- Check out the Saxon XQuery and XSLT processor.
- You can try out the free
community edition of the Mark Logic server, a content database
that lets you search large document bases with XQuery.
- To learn more about Java technology, visit the
developerWorks Java zone. You'll find technical
documentation, how-to articles,
education, downloads, product information, and more.
- Visit the New to Java
technology site for the latest resources to help you get started with Java
programming.
- Get involved in the developerWorks community by
participating in
developerWorks blogs.
- Browse for books on these and other technical topics.
Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications.





