Java theory and practice: Screen-scraping with XQuery

XQuery makes light work of HTML extraction and transformation

XQuery is a W3C standard for extracting information from XML documents, currently spanning 14 working drafts. While the majority of interest in XQuery is centered around querying large bases of semi-structured document data, XQuery can be surprisingly effective for some much more mundane uses as well. In this month's Java theory and practice, columnist Brian Goetz shows you how XQuery can be used effectively as an HTML screen-scraping engine.

Brian Goetz (brian@quiotix.com), Principal Consultant, Quiotix

Brian Goetz has been a professional software developer for over 18 years. He is a Principal Consultant at Quiotix, a software development and consulting firm located in Los Altos, California, and he serves on several JCP Expert Groups. See Brian's published and upcoming articles in popular industry publications.



22 March 2005

Last month, Java™ technology guru Sam Pullara was showing me his latest Java-enabled phone, the Nokia 6630. It is crammed full of technology -- an embedded JVM, GPRS, Bluetooth -- but it suffers from the same problem that plagues all smart phones -- limited screen real estate. Some Web sites have support for phone-based browsers, and embedded browsers try to render pages effectively on small screens, but trying to view a typical Web page on a phone screen is a lot like trying to squeeze an elephant into the back seat of your car (to the dissatisfaction of everyone -- you, the car, and the elephant). Sam had built a simple, elegant solution for screen-scraping data from his favorite Web sites and reformatting them for small-screen display.

A novel approach

You can use a number of approaches to extract data from HTML documents. I really liked the approach Sam took, which was to use XQuery as both a screen-scraping tool (to extract the relevant data from the pages) and as a stylesheet tool (to reformat the data so it fits nicely on the page without scrolling). With a small amount of infrastructure and some pretty simple XQuery expressions, it became possible to extract the relevant data -- such as traffic, weather, and financial quotes -- out of numerous data sources and display it nicely on the phone.

I've often been in the situation where screen-scraping HTML pages seemed a sensible solution for a particular problem, but there are very few Java-based toolkits for screen scraping. Many HTML parsing tools are available, but they generally lack sufficient abstractive capability (making screen-scraping code messy), are limited by the widespread use of poorly conforming HTML, and deal poorly with dynamically generated pages whose structure may change over time.

To bridge the gap between poor-quality HTML and the rich set of XML-processing tools, you first need to convert the HTML into XML. A number of tools can help you do this; the JTidy toolkit does a good job and makes it easy. JTidy is designed to read in typical-quality (that is to say, bad) HTML and output something cleaner (you have a choice of options). It also provides a DOM interface to the cleaned-up document, so the result can be traversed directly or fed to XML tools. The code in Listing 1 reads an HTML document from an InputStream and generates a DOM representation of the document:

Listing 1. Code to convert HTML into an XML-compatible DOM with JTidy
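import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Parse possibly malformed HTML from inputStream into a W3C DOM tree
Tidy tidy = new Tidy();
tidy.setQuiet(true);            // suppress JTidy's summary output
tidy.setShowWarnings(false);    // suppress per-problem warnings
Document tidyDOM = tidy.parseDOM(inputStream, null);  // null: don't write the tidied HTML anywhere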

With this simple transformation, you can process almost any Web page as an XML document and apply your favorite XML tools for extracting data -- SAX, XSL, XPath, you name it. XSL might seem the obvious choice, since it is designed for extracting information from XML documents and transforming it for presentation, but it has a significant learning curve if you don't already know it, and even the simplest XSL transformations can be annoyingly complicated. XPath, which both XSL and XQuery use for content selection, is a good candidate for the extraction part: you could easily use XPath to pull out the data you need and then format the HTML yourself. XQuery makes it even easier.
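
To make the "XPath plus do-it-yourself formatting" option concrete, here is a minimal sketch that uses only the XPath engine bundled with the JDK. The sample markup, the class name, and the helper names are all invented for illustration, and the input is assumed to be well-formed already (as it would be after a pass through JTidy).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathThenFormat {
    // Hypothetical, already well-formed markup standing in for tidied HTML.
    static final String SAMPLE =
        "<html><body><h1>News</h1>"
        + "<div class='story'><h2>First headline</h2><p>Body text</p></div>"
        + "<div class='story'><h2>Second &amp; last headline</h2><p>More</p></div>"
        + "</body></html>";

    /** Parses well-formed markup into a DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    /** Extraction step: the text of every node the XPath expression selects. */
    static List<String> select(Document doc, String xpath) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath: " + xpath, e);
        }
    }

    /** Formatting step, done by hand: escape each value and wrap it in <li>. */
    static String toList(List<String> items) {
        StringBuilder html = new StringBuilder("<ul>");
        for (String item : items) {
            String escaped = item.replace("&", "&amp;")
                                 .replace("<", "&lt;")
                                 .replace(">", "&gt;");
            html.append("<li>").append(escaped).append("</li>");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        // prints <ul><li>First headline</li><li>Second &amp; last headline</li></ul>
        System.out.println(toList(select(parse(SAMPLE), "//div[@class='story']/h2")));
    }
}
```

The extraction is a one-line XPath expression; nearly everything else is plumbing and hand-written output formatting, which is the part XQuery takes over.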


XQuery: A (ridiculously) brief tour

XQuery was designed for extracting data from potentially very large XML datasets. The input can be a single XML document, but it need not be: it could also be a collection of documents that have been indexed and stored in an XML database, or even a set of tables in a relational database. Like SQL, XQuery contains functions for extracting, summarizing, aggregating, and joining data from multiple datasets.

Just like presentation template languages, such as JSP, ASP, or Velocity, XQuery combines elements from two domains -- the presentation domain and a computational domain -- into a single combined syntax. The result is that any XML document is already a valid XQuery expression, which evaluates to itself. There are also language statements, such as "for" and "let," which can be intermixed with XML elements.

Listing 2 shows a sample XML document, bib.xml, which represents a bibliography of books. I'll show you a few quick XQuery expressions to give you a flavor of what XQuery can do, and then move on to the screen-scraping examples. Covering the syntax and use cases of XQuery could take hundreds of pages -- see the Resources section for more detailed reference material and examples.

Listing 2. Example XML bibliography
<bib>
    <book year="1994">
        <title>TCP/IP Illustrated</title>
        <author><last>Stevens</last><first>W.</first></author>
        <publisher>Addison-Wesley</publisher>
        <price> 65.95</price>
    </book>
    . . .  more books . . . 
</bib>

Listing 3 shows an XQuery expression that selects all books published by Addison-Wesley after 1991, extracts their titles, and formats the titles into a bulleted (<ul>) list. A mode switch from "presentation mode" (data that will be passed directly to the output, such as the <ul> and <li> tags) to "code mode" is indicated by curly braces; an implicit mode switch from "code mode" to "presentation mode" occurs immediately after the return clause.

Listing 3. XQuery expression to select book titles according to query criteria
<ul>
{
  for $b in doc("bib.xml")/bib/book
  where $b/publisher = "Addison-Wesley" and $b/@year > 1991
  return
    <li>{ data($b/title) }</li>
}
</ul>

The query syntax, introduced with "for" and often called a FLWOR expression (pronounced "flower," after its for, let, where, order by, and return clauses), selects a sequence of XML nodes from a document. In this case, an XPath expression selects the set of <book> nodes from the bib.xml document, and the where clause keeps only those nodes that match the query criteria (the publisher is Addison-Wesley, and the publication year is after 1991). For each remaining node, the query evaluates the expression in the return clause, which here is a mix of markup (the <li> tags) and code (extracting the contents of the <title> element of each <book> node).

This simple XQuery example illustrates several aspects of XQuery -- the mixing of presentation and code in one document, the use of XPath, the use of substitution (the $b references), a nontrivial query expression, an XQuery function (data()), and the fact that the structure of the output document need not match the structure of the input document. That's a lot of processing power in a pretty compact and not-so-hard-to-read query.
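
For readers who think more naturally in Java, the sketch below spells out roughly what the FLWOR expression in Listing 3 does, as an explicit loop, filter, and output step. It uses the JDK's built-in XPath support rather than an XQuery engine. The class name is invented, and the second and third books in the sample data are hypothetical entries added only so that the filter has something to reject. It also reads only the first publisher of each book, which is a simplification of XQuery's comparison rules.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FlworInJava {
    // First entry follows Listing 2; the other two are made up for illustration.
    static final String BIB =
        "<bib>"
        + "<book year='1994'><title>TCP/IP Illustrated</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book year='1990'><title>Hypothetical Older Title</title>"
        + "<publisher>Addison-Wesley</publisher></book>"
        + "<book year='2000'><title>Hypothetical Other Title</title>"
        + "<publisher>Example Press</publisher></book>"
        + "</bib>";

    /** Hand-written equivalent of Listing 3 (titles assumed to need no escaping). */
    static String titlesAsList(String bibXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(bibXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // for $b in doc("bib.xml")/bib/book
            NodeList books = (NodeList) xpath.evaluate("/bib/book", doc, XPathConstants.NODESET);

            StringBuilder out = new StringBuilder("<ul>");
            for (int i = 0; i < books.getLength(); i++) {
                Element b = (Element) books.item(i);

                // where $b/publisher = "Addison-Wesley" and $b/@year > 1991
                String publisher = xpath.evaluate("publisher", b);
                int year = Integer.parseInt(b.getAttribute("year"));
                if (publisher.equals("Addison-Wesley") && year > 1991) {
                    // return <li>{ data($b/title) }</li>
                    out.append("<li>").append(xpath.evaluate("title", b)).append("</li>");
                }
            }
            return out.append("</ul>").toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        // prints <ul><li>TCP/IP Illustrated</li></ul>
        System.out.println(titlesAsList(BIB));
    }
}
```

The comparison is a little unfair to Java, since the sample data is inlined, but it shows how much bookkeeping the eight-line query replaces.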

Listing 4 shows an even simpler XQuery expression, which outputs the number of distinct publishers in the bibliography in a single <count> element. Like the previous example, it uses an XPath expression to select a set of nodes, and it applies XQuery functions to select the distinct values and count them. The expression inside the braces evaluates to a number -- the number of distinct publishers in the bib.xml document -- which becomes the content of the <count> element.

Listing 4. XQuery expression to count distinct publishers
<count>
{
  let $d := distinct-values(doc("bib.xml")/bib/book/publisher)
  return count($d)
}
</count>

These examples barely scratch the surface of the types of queries that can be performed by XQuery -- they are intended to simply give you the flavor of the sort of thing you can do with it, and to suggest how you can use XQuery for transforming XML documents into the format of your choosing. While much of its power is aimed at querying large bases of documents or other data sources, you can use a very simple subset of XQuery to screen-scrape HTML documents to extract the parts you want for a variety of applications, such as displaying the relevant data on a screen-limited device such as a cell phone, or creating a do-it-yourself portal where data from multiple sites is aggregated and presented.


Screen-scraping with XQuery

One of the (many) challenges of screen-scraping Web pages is that they usually have no self-identifying structure, and their structure may change as the site content is edited, or even as different dynamic content (such as ad content) is interpolated into the page in different requests. As a result, you often have to guess as to which portions of the page correspond to the data you want to extract.

Stock prices

Let's start by extracting the current price of IBM stock from the Yahoo! Finance page (http://finance.yahoo.com/q?s=IBM). There's a lot of stuff on this page -- news headlines, ads, financial data -- but I want the stock price data, which is in a table cell next to the cell that contains "Last Trade." The query in Listing 5 selects all <td> nodes whose text contains "Last Trade," and for each one (I expect only one), outputs a table row containing the contents of the following <td> node. The contents are extracted with the data() function in the return clause; otherwise, I'd get more than just the text in the <td> node, I'd get all the markup, too. (The only tricky part in the query is the text()[1] part. The text() step matches all the text nodes within the <td> element -- in this case there is only one, but XQuery doesn't know that -- so I must tell it to select the first text node before trying to do text matching on it.) As long as the page contains a table cell with the text "Last Trade" in it, and the following cell contains the stock price, then the structure of the page can change arbitrarily without causing the query to fail.

Listing 5. XQuery expression for extracting stock quotes from Yahoo! Finance
<table>
{
  for $d in //td
  where contains($d/text()[1], "Last Trade")
  return <tr><td> { data($d/following-sibling::td) } </td></tr>
}
</table>

Weather

Let's try another page. The Yahoo! Weather page contains a number of portlet panels, and I want to extract the names, temperatures, and icons for the cities listed. (The Yahoo! Weather page, http://weather.yahoo.com, will show weather for the cities you've selected in your My Yahoo! if you are logged into Yahoo!, or for a sampling of major cities if you are not.) Listing 6 shows a query that looks for the sub-panel containing the text "New York, NY" and then navigates up to the enclosing table and selects all the rows:

Listing 6. XQuery expression for extracting weather information from Yahoo! Weather
<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>

Then, for each row, it extracts the three relevant data columns -- city name, temperature, and icon -- and outputs a simpler table containing only this information. The result is a compact display of the weather information for the cities you care about, suitable for display on a small screen. The results are shown below:

City               Temperature   Icon
Chicago, IL        49...63 F     Partly Cloudy
London, UK         32...41 F     Fair
New York, NY       36...44 F     Cloudy
San Francisco, CA  52...67 F     Partly Cloudy

This query is not quite as robust as the query in Listing 5. It assumes that the text "New York, NY" will be inside a small element (which is the sort of markup that could easily change the next time Yahoo! redesigns their pages). Also, "New York, NY" could easily appear more than once on a page devoted to weather. However, these elements of risk can be mitigated by spending more effort developing the queries; as with many development options, there is a tradeoff between query complexity and query stability.
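
As one illustration of that tradeoff, the sketch below uses a more tolerant selection than Listing 6: it matches on the whole text of a cell, so it no longer cares whether the city name sits inside a small element, it takes only the first matching cell, and it ignores outer layout cells that merely contain the whole panel. It runs on the JDK's XPath 1.0 engine rather than Saxon, and the sample markup and class name are invented; the real page structure may well differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TolerantWeatherQuery {
    // Hypothetical panel nested in a layout table; note <b> where Listing 6 expects <small>.
    static final String SAMPLE =
        "<html><body><table><tr><td>"
        + "<table>"
        + "<tr><td><a href='#'>Chicago, IL</a></td><td>49...63 F</td></tr>"
        + "<tr><td><a href='#'><b>New York, NY</b></a></td><td>36...44 F</td></tr>"
        + "</table>"
        + "</td></tr></table></body></html>";

    // Innermost cell mentioning the city, first match only, then the rows of the
    // nearest enclosing table.
    static final String ROWS =
        "(//td[not(.//td) and contains(normalize-space(.), 'New York, NY')])[1]"
        + "/ancestor::table[1]/tr";

    /** Returns one "city | temperature" string per row of the matched panel. */
    static List<String> rows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(ROWS, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                out.add(xpath.evaluate("normalize-space(td[1])", row)
                        + " | " + xpath.evaluate("normalize-space(td[2])", row));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        // prints "Chicago, IL | 49...63 F" and then "New York, NY | 36...44 F"
        for (String row : rows(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

The selection is harder to read than the one in Listing 6, which is exactly the cost being traded for stability.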

The queries shown in Listing 5 and Listing 6 are not the only way these queries could be cast. Using a more complicated XPath syntax, the two for clauses in Listing 6 could be folded into a single XPath expression, and the entirety of Listing 5 could be cast as an XPath expression instead of using the FLWOR syntax. If you are an XPath guru, you will probably find it easier to use a more XPath-oriented approach, whereas those with more SQL experience will probably find the FLWOR syntax more appealing.
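
Here is a sketch of the second claim: the selection from Listing 5 written as a single XPath expression and evaluated with the XPath 1.0 engine that ships with the JDK. The table fragment and the price in it are invented sample data, not the actual Yahoo! Finance markup, and the class name is hypothetical. The expression adds a [1] after following-sibling::td to pick only the cell immediately after the label.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class LastTradeXPath {
    // Hypothetical fragment; the real page is far larger and messier.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Last Trade:</td><td><b>91.50</b></td></tr>"
        + "<tr><td>Trade Time:</td><td>4:00PM ET</td></tr>"
        + "</table></body></html>";

    // Listing 5's "for" and "where" clauses folded into one path expression.
    static final String QUERY =
        "//td[contains(text()[1], 'Last Trade')]/following-sibling::td[1]";

    /** Returns the text of the cell after the "Last Trade" label, or "" if absent. */
    static String lastTrade(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Evaluating to a string yields the text content of the first match,
            // which plays the role of data() in Listing 5.
            return XPathFactory.newInstance().newXPath().evaluate(QUERY, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lastTrade(SAMPLE));  // prints 91.50
    }
}
```

What the pure-XPath version gives up is the output template: the result is a bare string, and wrapping it in a table row is left to the caller.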

Tools

A remarkably small amount of code is needed to execute XQuery expressions against HTML pages. The JTidy library cleans up an HTML document and represents it as a DOM object (see Listing 1), and I used the Saxon XQuery engine to compile and execute the query against that DOM object. Compiling and executing an XQuery expression against a DOM representation of a document requires only six lines of code, as shown in Listing 7:

Listing 7. Code to compile and execute an XQuery expression with Saxon
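// query: the XQuery source text; tidyDOM: the DOM from Listing 1; url: the page's URL
Configuration c = new Configuration();
StaticQueryContext qp = new StaticQueryContext(c);
XQueryExpression xe = qp.compileQuery(query);              // compile the query text
DynamicQueryContext dqc = new DynamicQueryContext(c);
dqc.setContextNode(new DocumentWrapper(tidyDOM, url, c));  // wrap the DOM so Saxon can query it
List result = xe.evaluate(dqc);                            // run the query against the page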

The result of the query evaluation is a List of DOM Elements, and you can use your favorite DOM manipulation technique (OK, your least-unfavorite DOM manipulation technique) to turn the query results into a document.
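
One such technique, sketched with nothing but the JDK's DOM and transformer classes, is to import each result node into a fresh document under a wrapper element and then serialize it. The sketch assumes, as described above, that the results are org.w3c.dom nodes. The class and method names are invented, and the sample list is built by hand rather than produced by Saxon.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class ResultAssembler {
    /** Wraps the result nodes in a new root element and serializes the document. */
    static String toDocument(List<? extends Node> results, String rootName) {
        try {
            Document out = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = out.createElement(rootName);
            out.appendChild(root);
            for (Node n : results) {
                // importNode deep-copies a node owned by another document into this one
                root.appendChild(out.importNode(n, true));
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(out), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not assemble results", e);
        }
    }

    /** Hand-built stand-in for query results: one <table> like Listing 5's output. */
    static List<Node> sampleResults() {
        try {
            Document src = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element table = src.createElement("table");
            Element tr = src.createElement("tr");
            Element td = src.createElement("td");
            td.setTextContent("91.50");  // made-up value
            tr.appendChild(td);
            table.appendChild(tr);
            List<Node> results = new ArrayList<>();
            results.add(table);
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample results", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toDocument(sampleResults(), "div"));
    }
}
```

The wrapper element name is passed in so that the same helper can produce a page fragment for one query or a container holding the results of several.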

Lots of other implementations of XQuery are available, some free, some commercial -- see Resources for some places to look.


Summary

While XQuery was designed for querying large document bases, it serves as a fine tool for transforming simple documents as well. Whether simplifying complex pages for display on small screens, or extracting elements from multiple pages to aggregate them together on a home-grown portal, or simply extracting data from Web pages because there's no other programmatic way to get the data, XQuery offers a relatively easy way to scrape HTML pages for the data you need.

Resources

  • Howard Katz's An introduction to XQuery (developerWorks, June 2001) covers the basics and history of the XQuery standardization effort.
  • The tutorial, Process XML using XML Query (developerWorks, September 2002), by Nicholas Chase dives deeper into the uses and syntax of XQuery.
  • You can read about Sam Pullara's cell phone in his blog.
  • Download JTidy from its home on SourceForge.
  • Check out the Saxon XQuery and XSL implementation.
  • You can try out the free community edition of the Mark Logic server, a content database which lets you search large document bases with XQuery.
  • The official specifications for XQuery can be downloaded from the W3C site; this page also hosts a list of XQuery implementations.
  • This set of slides from an XQuery tutorial offers a lot of good examples of what XQuery is good for and how to use it.
  • To learn more about Java technology, visit the developerWorks Java zone. You'll find technical documentation, how-to articles, education, downloads, product information, and more.
  • To learn more about XML, visit the developerWorks XML zone. As with the Java zone, you'll find technical documentation, how-to articles, education, downloads, product information, and more.
  • Visit the New to Java technology site for the latest resources to help you get started with Java programming.
  • Get involved in the developerWorks community by participating in developerWorks blogs.
  • Browse for books on these and other technical topics.
