Query XML documents outside an XML database

Learn how by analyzing Maven projects using XQuery

Processing XML in Java usually requires a lot of code and overhead. If you use XQuery, you can do a lot more with a lot less code, even when the XML is stored outside of XML databases. Learn how to use XQuery with Java technology by extracting the hidden information from XML-based Maven POM files.

Adriaan de Jonge, Software Professional, Freelance

Adriaan de Jonge is a software professional working for the ANWB in The Netherlands. He currently spends most of his spare time investigating the Google App Engine and plans to write about it. Adriaan has written XML-related articles for IBM developerWorks and Amazon. You can reach him at adriaandejonge@gmail.com.



13 July 2010


Frequently used acronyms

  • API: Application Programming Interface
  • HTTP: Hypertext Transfer Protocol
  • POM: Project Object Model
  • SQL: Structured Query Language
  • URL: Uniform Resource Locator
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language
  • XSLT: Extensible Stylesheet Language Transformations

Most XML data is not stored in XML databases. Although XQuery is well known for its database querying capabilities, you do not need a database for XQuery to be powerful. You can also use XQuery to investigate and analyze XML documents that reside in other kinds of storage.

For example, Java™ developers use Maven Project Object Model (POM) files to build their projects and manage dependencies. Maven can best be described as a Java build tool that specializes in dependency management, or, as the Maven website defines it, "a software project management and comprehension tool." Maven is typically the only program using the information contained in a Maven POM file. But, because POM files are XML, you can just as easily read them using XQuery and do your own analysis. You can ask questions such as, "How many of my projects are still using the old version of Log4J?"

Using XQuery


A common misconception about XQuery is that it is merely the SQL of native XML databases. Although querying databases is one of its uses, you can also use XQuery as a full programming language without an XML database.

Functional programming

In the Java community, there is a lot of buzz about functional languages. Early adopters are experimenting with Clojure, Scala, and F#. Although XQuery is a functional language, this fact is not always recognized in the functional language hype.

XQuery has been an official W3C recommendation since 2007, and it's widely supported. Traditional databases like IBM® DB2®, Oracle, and Microsoft® SQL Server® all support XQuery for XML processing. Native XML databases like eXist and MarkLogic Server are the natural playing ground for XQuery. A large number of small commercial, open source, and academic XQuery implementations are also available on the Internet, and some are even bundled with a native XML database. Others are stand-alone XQuery processors not tied to a database.

Stand-alone implementations

A well-known stand-alone XQuery processor is Saxon Home Edition (Saxon-HE), the open source edition of the Saxon processor from Saxonica. This processor supports XQuery 1.0, XSLT 2.0, and XPath 2.0 and is optimized for performance. This article uses Saxon to demonstrate XQuery outside database environments.

You can use XQuery to process any XML document, regardless of where it's stored. A lot of XML is stored on local file systems or as binary large objects (BLOBs) in traditional relational databases. Restricting XQuery to native XML databases leaves all of that XML untouched.

With some effort, data does not even need to be in XML format to be processed using XQuery. For example, if you use an extension function to retrieve non-XML data, XQuery is well suited to generating XML output in an efficient way. Here is one of the simplest possible XQuery queries that does not require a database:

<sum>{2+3}</sum>

The result of this query is:

<sum>5</sum>

Another misconception about XQuery is that it requires FLWOR expressions (FLWOR stands for for, let, where, order by, return), similar to the misconception that SQL requires SELECT-FROM-WHERE expressions. Although FLWOR expressions are powerful and used in many queries, you can do many things without having to resort to them. Every valid XPath 2.0 expression is also a valid XQuery 1.0 expression. XQuery adds powerful features to the language, making it capable of larger operations, whereas XPath by itself is most appropriate for small queries and expressions that fit in a single line of code.
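The JDK does not include an XQuery processor, but it does include an XPath 1.0 engine, which is enough to try the idea of evaluating an expression with no database and no input document. The following is a minimal sketch under that assumption, not the article's own code: the class and method names are made up for illustration, and the element constructor is imitated with plain string concatenation because XPath cannot construct elements.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical helper; uses only the XPath 1.0 engine bundled with the JDK.
class StandaloneExpression {

    // Evaluates a numeric XPath 1.0 expression against an empty in-memory document.
    static double evaluateNumber(String expression) {
        try {
            Document empty = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            return (Double) XPathFactory.newInstance().newXPath()
                .evaluate(expression, empty, XPathConstants.NUMBER);
        } catch (ParserConfigurationException | XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    // Imitates the element constructor of the one-line query by hand.
    static String sumElement() {
        return "<sum>" + (long) evaluateNumber("2 + 3") + "</sum>";
    }
}
```

The amount of setup needed for a single addition is a fair illustration of why a one-line XQuery is attractive.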

Running the example

You can launch XQuery code in many ways. In XML databases, the possibilities are different from product to product. Outside XML databases, you can use command-line programs or write a few lines of Java code to invoke XQuery.

The XQuery API for Java (XQJ), defined by Java Specification Request (JSR) 225, is the standard Java API for invoking XQuery code. Listing 1 provides a simple way for you to run the example using XQJ.

The example consists of a method call taken from an HTTP Servlet. First, you set up a connection to an XQuery processor. You read an XQuery from a text file (simple.xqy) that is included on the class path as a resource in the same package as the Java class. After executing the query, you loop through the result set (which should only contain a single result) and write that result to the OutputStream of the servlet.

Listing 1. Using the XQJ API to run XQuery code
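import java.io.IOException;
import java.io.InputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQException;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;
import net.sf.saxon.xqj.SaxonXQDataSource; // package name in Saxon 9.x at the time of writing

protected void doGet(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, 
        IOException {
    try {
        // Saxon's XQJ data source; no real database connection is opened
        XQDataSource dataSource = new SaxonXQDataSource();
        XQConnection connection = dataSource.getConnection();
        // simple.xqy sits on the class path next to this class
        InputStream inputStream = 
            this.getClass().getResourceAsStream("simple.xqy");
        XQPreparedExpression expression =
            connection.prepareExpression(inputStream);
        XQResultSequence result = expression.executeQuery();
        // Serialize each result item to the servlet response
        while (result.next()) {
            result.writeItem(response.getOutputStream(), null);
        }
        // Closing the connection also closes the expression and result sequence
        connection.close();
    } catch (XQException e) {
        throw new ServletException(e);
    }
}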

Using XQJ feels like using Java Database Connectivity (JDBC): Connect to a database, prepare a query, execute it, and read the results. This is great for querying XML databases. Also, the API for reading the results helps integrate XQuery code with Java code.

When using the Saxon XQuery processor instead of an XML database, the XQJ naming is a bit misleading. With Saxon, you aren't actually making a connection. Instead, you're invoking a processor. Listing 2 shows a different way to invoke XQuery code using the Saxon API.

Listing 2. Using Saxon to run XQuery code
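import java.io.IOException;
import java.io.InputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;

protected void doGet(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, 
        IOException {
    try {
        // false: no licensed (commercial edition) features are required
        Processor processor = new Processor(false);
        XQueryCompiler compiler = processor.newXQueryCompiler();
        // The serializer writes the query result to the servlet response
        Serializer serializer = new Serializer();
        serializer.setOutputStream(response.getOutputStream());
        InputStream inputStream = 
            this.getClass().getResourceAsStream("simple.xqy");
        // Compile once, then load an evaluator for this single run
        XQueryExecutable executable = compiler.compile(inputStream);
        XQueryEvaluator evaluator = executable.load();
        evaluator.setDestination(serializer);
        evaluator.run();
    } catch (SaxonApiException e) {
        throw new ServletException(e);
    }
}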

The Saxon API makes it relatively simple to chain multiple processing steps into an XML pipeline. These steps might be other XQuery queries, but they can also be XSLT transformations or Java code.

For the examples presented in this article, both APIs work equally well. In general, if your program heavily relies on Java code, you might prefer XQJ for both its syntax and the portability to other XQuery implementations. In contrast, if your program heavily relies on XML technologies like XQuery or XSLT and the Java code is just the glue, then you might prefer the Saxon-specific API.


XQuery and Maven POM files

Maven POM files are a good example of XML that lives outside an XML database. In most cases, the only tool that actually reads the POM files is Maven itself. Consider the possibilities of an XML-based POM file: It is open, can be read on any platform, and contains a lot of unused information about your software projects. And yet, it's rarely used outside Maven.

That can be changed easily! Using XQuery, it takes only a few lines of code to retrieve valuable information from your Maven POM files. Before you do this, though, take a moment to think about what kind of information is hidden inside Maven POM files. What are the maintenance issues you deal with during daily development? How can XQuery analysis of Maven POM files help resolve such issues?

Retrieving basic dependencies

If you have multiple software projects in your company, it is wise to achieve some uniformity in the use of libraries. For example, do you know which versions of Log4J your projects are using? Find out by querying a single POM file, as shown in Listing 3.

Listing 3. Determine the Log4J version
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $doc := doc("http://krank.googlecode.com/svn/trunk/pom.xml")
return <results>{$doc//m:dependency[m:artifactId eq 'log4j']/m:version/text()}</results>

The result of this query is:

<results>1.2.131.2.13</results>

Note: This POM file is just a random file taken from Google Code, because it is publicly available. The downside is that the file might change or disappear after this article goes live. You can easily fix this by searching Google Code for pom.xml files if you want another example.

The result of this query (at the time of writing, at least) is two version texts printed directly after each other. If you know the contents of this specific POM file, this result is logical. Two dependency declarations refer to Log4J: one for the code and one for a plug-in.
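For comparison, the same lookup can be sketched with nothing but the DOM and XPath 1.0 APIs that ship with the JDK. This is an illustration only and not part of the article's download: the class name PomVersions, the method versionsOf, and the inline SAMPLE_POM (a cut-down stand-in for a POM with two Log4J declarations, one for the code and one for a plug-in) are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical JDK-only counterpart of the Log4J version query.
class PomVersions {
    static final String POM_NS = "http://maven.apache.org/POM/4.0.0";

    // Made-up, cut-down POM: one Log4J dependency for the code, one for a plug-in.
    static final String SAMPLE_POM =
        "<project xmlns='" + POM_NS + "'>"
        + "<dependencies><dependency>"
        + "<artifactId>log4j</artifactId><version>1.2.13</version>"
        + "</dependency></dependencies>"
        + "<build><plugins><plugin><dependencies><dependency>"
        + "<artifactId>log4j</artifactId><version>1.2.13</version>"
        + "</dependency></dependencies></plugin></plugins></build>"
        + "</project>";

    // Returns the version text of every dependency with the given artifactId.
    static List<String> versionsOf(String pomXml, String artifactId) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // POM elements live in a namespace
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(pomXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Binds the prefix "m", like the "declare namespace" line in the query
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "m".equals(prefix) ? POM_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            // Supplies $id so the artifactId is not pasted into the expression
            xpath.setXPathVariableResolver(
                name -> "id".equals(name.getLocalPart()) ? artifactId : null);

            NodeList nodes = (NodeList) xpath.evaluate(
                "//m:dependency[m:artifactId = $id]/m:version",
                doc, XPathConstants.NODESET);
            List<String> versions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                versions.add(nodes.item(i).getTextContent());
            }
            return versions;
        } catch (Exception e) {
            throw new IllegalStateException("Could not query POM", e);
        }
    }
}
```

Calling PomVersions.versionsOf(PomVersions.SAMPLE_POM, "log4j") returns the two version strings as separate list entries, which is the Java counterpart of the two text nodes that the query prints back to back.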

Writing the versions in separate elements makes it easier to distinguish them. For this example, you can use a simple for expression (see Listing 4).

Listing 4. Creating versions in separate elements
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $doc := doc("http://krank.googlecode.com/svn/trunk/pom.xml")
let $versions := $doc//m:dependency[m:artifactId eq 'log4j']/m:version/text()
return <results>
{
    for $version in $versions
    return <version>{$version}</version>
}
</results>

The result of this query is:

<results>
   <version>1.2.13</version>
   <version>1.2.13</version>
</results>

Strictly speaking, this result does not answer the question: It simply repeats the same version number. The question was, which versions of Log4J are your projects using? After the first <version> element, it was already clear that version 1.2.13 is used. The second <version> element is only relevant when the versions are different.

You can easily fix this discrepancy using the distinct-values() function, as in Listing 5.

Listing 5. Removing duplicates from the version elements
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $doc := doc("http://krank.googlecode.com/svn/trunk/pom.xml")
let $versions := distinct-values($doc//m:dependency[m:artifactId eq 'log4j']/m:version)
return <results>
{
    for $version in $versions
    return <version>{$version}</version>
}
</results>

The result of this query is:

<results>
   <version>1.2.13</version>
</results>
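
In Java terms, distinct-values() behaves roughly like copying the values into a set. The sketch below is an analogy with a hypothetical class and method name, not how any XQuery processor is implemented. It keeps first-seen order by using a LinkedHashSet, whereas XQuery 1.0 leaves the order of the distinct-values() result implementation-dependent, so a query should not rely on it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical analogy for distinct-values() on a list of version strings.
class DistinctVersions {

    // Drops repeated values and keeps the order in which values first appear.
    static List<String> distinct(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }
}
```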

The next step is to query a set of documents instead of a single document. Doing so requires the small change in Listing 6.

Listing 6. Querying multiple Maven POMs at once
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $docs :=  (doc("http://q4e.googlecode.com/svn/trunk/pom.xml"),
               doc("http://gmaps4jsf.googlecode.com/svn/trunk/pom.xml"),
               doc("http://java-twitter.googlecode.com/svn/trunk/pom.xml"),
               doc("http://xmlzen.googlecode.com/svn/trunk/pom.xml"),
               doc("http://krank.googlecode.com/svn/trunk/pom.xml"))
let $versions := distinct-values($docs//m:dependency[m:artifactId eq 'log4j']/m:version)
return <results>
{
    for $version in $versions
    return <version>{$version}</version>
}
</results>

The result of this query is:

<results>
   <version>1.2.14</version>
   <version>1.2.13</version>
</results>

You might have expected an extra for loop. Instead, the example loads the documents into a sequence and applies a single path expression to all of them. Depending on the XQuery implementation, this construction may be inefficient in terms of memory consumption. Performance and resource usage are discussed at the end of this article.

Another question is whether you want to hard-code all these document references into your XQuery code. Of course, you can also put these URLs into an XML file and pass it as input to the query: It's just a matter of personal preference.

Retrieving more information

Right now, the query only tells you which versions of Log4J are used across your projects. It does not tell you which projects use which version. After querying for Log4J dependencies, it's a small step to retrieve the <name> element in the same document. The XPath ancestor axis lets you navigate from a matching dependency up to its enclosing <project> element, as in Listing 7.

Listing 7. Adding the project name to the query results
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $docs :=  (doc("http://q4e.googlecode.com/svn/trunk/pom.xml"),
               doc("http://gmaps4jsf.googlecode.com/svn/trunk/pom.xml"),
               doc("http://java-twitter.googlecode.com/svn/trunk/pom.xml"),
               doc("http://xmlzen.googlecode.com/svn/trunk/pom.xml"),
               doc("http://krank.googlecode.com/svn/trunk/pom.xml"))
return <results>
{
    for $dependency in $docs//m:dependency[m:artifactId eq 'log4j']
    let $name := $dependency/ancestor::m:project/m:name/text()
    let $version := $dependency/m:version/text()
    return
        <result>
            <project-name>{$name}</project-name>
            <version>{$version}</version>
        </result>
}
</results>

The result of this query is:

<results>
   <result>
      <project-name>XML Zen</project-name>
      <version>1.2.14</version>
   </result>
   <result>
      <project-name>Crank :: ROOT</project-name>
      <version>1.2.13</version>
   </result>
   <result>
      <project-name>Crank :: ROOT</project-name>
      <version>1.2.13</version>
   </result>
</results>

There is a problem, though: The duplicate results are back. This shouldn't be a surprise, because this query no longer contains the distinct-values() function. And that was the fix to the duplication problem. Why remove it, then?

The distinct-values() function returns atomic values, not XML nodes. An atomic value has no parent, so you cannot navigate from it to its ancestors. To remove the duplicates again, the query grows a little, as Listing 8 shows.

Listing 8. Removing duplicates from the versions again
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $docs :=  (doc("http://q4e.googlecode.com/svn/trunk/pom.xml"),
               doc("http://gmaps4jsf.googlecode.com/svn/trunk/pom.xml"),
               doc("http://java-twitter.googlecode.com/svn/trunk/pom.xml"),
               doc("http://xmlzen.googlecode.com/svn/trunk/pom.xml"),
               doc("http://krank.googlecode.com/svn/trunk/pom.xml"))
let $artifactId := 'log4j'
return <results>
{
     for $doc in $docs[//m:artifactId = $artifactId]
     let $name := $doc/m:project/m:name/text()
     return
         <result>
             <project-name>{$name}</project-name>
             {
                 let $versions := distinct-values(
                     $doc//m:dependency[m:artifactId eq $artifactId]/m:version)
                 for $version in $versions
                 return <version library="{$artifactId}">{$version}</version>
             }
         </result>
}
</results>

The result of this query is:

<results>
   <result>
      <project-name>XML Zen</project-name>
      <version library="log4j">1.2.14</version>
   </result>
   <result>
      <project-name>Crank :: ROOT</project-name>
      <version library="log4j">1.2.13</version>
   </result>
</results>

Note the difference between the equal sign (=) and the eq operator. The eq operator is a value comparison: each operand must yield at most one atomic value. If either operand is a sequence of more than one item, the query fails with a type error. The = sign is a general comparison that works on sequences: it is true when any item on the left equals any item on the right. Just remember: Every eq operator is a potential error and should only be used when you can be fairly sure it is dealing with single values.
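The "any item matches" behavior of = can be tried with the XPath 1.0 engine bundled with the JDK, because XPath 1.0 applies the same rule when a node set is compared with a string. The sketch below is illustrative: the class name, the sample XML, and its element names are made up. The eq operator belongs to XPath 2.0 and XQuery, so it cannot be shown with this engine.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo of the "=" comparison over several nodes.
class GeneralComparison {

    // Made-up sample: two artifactId elements, so the left operand has two items.
    static final String SAMPLE =
        "<deps><artifactId>log4j</artifactId><artifactId>junit</artifactId></deps>";

    // True if at least one artifactId element has the wanted string value.
    static boolean anyArtifactIdEquals(String xml, String wanted) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Supplies $wanted so the value is not pasted into the expression
        xpath.setXPathVariableResolver(name -> wanted);
        try {
            return (Boolean) xpath.evaluate("/deps/artifactId = $wanted",
                new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the sample above, the comparison is true for both "log4j" and "junit" even though the left side selects two nodes. In XQuery, writing the same comparison with eq would raise a type error for that two-item operand.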

The last example contains an extra for loop and an extra variable, $artifactId. The extra variable avoids duplication and pays off in another way in the next step. There are no rules against using extra for loops. However, it is good practice to avoid them where possible to keep the code concise.

In some cases, one extra for loop means a lot of added value. For example, say that you want to add the versions of JUnit to the query result. It's just a small change (see Listing 9).

Listing 9. Determine which versions of JUnit are in use
declare namespace m = "http://maven.apache.org/POM/4.0.0";
let $docs :=  (doc("http://q4e.googlecode.com/svn/trunk/pom.xml"),
               doc("http://gmaps4jsf.googlecode.com/svn/trunk/pom.xml"),
               doc("http://java-twitter.googlecode.com/svn/trunk/pom.xml"),
               doc("http://xmlzen.googlecode.com/svn/trunk/pom.xml"),
               doc("http://krank.googlecode.com/svn/trunk/pom.xml"))
let $artifactIds := ('log4j', 'junit')
return <results>
{
     for $doc in $docs[//m:artifactId = $artifactIds]
     let $name := $doc/m:project/m:name/text()
     return
         <result>
             <project-name>{$name}</project-name>
             {
                 for $artifactId in $artifactIds
                     let $versions := distinct-values(
                         $doc//m:dependency[m:artifactId eq $artifactId]/m:version)
                     for $version in $versions
                     return <version library="{$artifactId}">{$version}</version>
             }
         </result>
}
</results>

The result of this query is:

<results>
   <result>
      <project-name>GMaps4JSF Project</project-name>
      <version library="junit">3.8.1</version>
   </result>
   <result>
      <project-name>java-twitter</project-name>
      <version library="junit">4.5</version>
   </result>
   <result>
      <project-name>XML Zen</project-name>
      <version library="log4j">1.2.14</version>
      <version library="junit">4.6</version>
   </result>
   <result>
      <project-name>Crank :: ROOT</project-name>
      <version library="log4j">1.2.13</version>
   </result>
</results>

Performance considerations

This article shows that you don't need a native XML database to use XQuery. For other purposes, it is good to be careful, though. Maven analysis is something you do only occasionally. Even if the query took a minute to process, you might be willing to wait for it.

As the number of Maven documents in your query increases, it might become problematic to load all the documents into memory at once. A clever XQuery implementation can avoid memory problems, and the structure of your query can help. For example, you might consider replacing the $docs variable with a small XML snippet containing the URL references instead of directly calling the doc() function multiple times. Calling the doc() function within a for loop might reduce overall memory usage. XML databases provide an advantage here, because they only load the elements you specifically ask for.
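The same principle, handling one document at a time and keeping only the values you need, can be sketched on the Java side with the streaming StAX parser in the JDK. This is a rough counterpart written for illustration, not a replacement for the queries in this article: the class and method names are hypothetical, the POMs are passed in as strings to keep the sketch self-contained, and elements are matched by local name only, ignoring the POM namespace.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical streaming scan: no document is ever held in memory as a tree.
class StreamingPomScan {

    // Collects the distinct versions declared for artifactId across all POMs.
    static Set<String> versionsOf(Iterable<String> pomXmls, String artifactId) {
        Set<String> versions = new LinkedHashSet<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        for (String pomXml : pomXmls) {
            try {
                XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(pomXml));
                scan(reader, artifactId, versions);
                reader.close();
            } catch (XMLStreamException e) {
                throw new IllegalStateException("Malformed POM", e);
            }
        }
        return versions;
    }

    private static void scan(XMLStreamReader reader, String artifactId,
            Set<String> versions) throws XMLStreamException {
        int depth = 0;            // nesting depth of the current element
        int dependencyDepth = -1; // depth of the open <dependency>, or -1 if none
        String id = null;
        String version = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                String name = reader.getLocalName();
                if (dependencyDepth < 0 && name.equals("dependency")) {
                    dependencyDepth = depth;
                    id = null;
                    version = null;
                } else if (dependencyDepth >= 0 && depth == dependencyDepth + 1) {
                    // Only direct children count, so <exclusion> entries are skipped.
                    // getElementText() consumes the end tag, hence the depth--.
                    if (name.equals("artifactId")) {
                        id = reader.getElementText().trim();
                        depth--;
                    } else if (name.equals("version")) {
                        version = reader.getElementText().trim();
                        depth--;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (depth == dependencyDepth) {
                    if (artifactId.equals(id) && version != null) {
                        versions.add(version);
                    }
                    dependencyDepth = -1;
                }
                depth--;
            }
        }
    }
}
```

The bookkeeping needed to track element depth by hand is exactly the kind of overhead that a path expression hides.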

When you are creating a solution for production environments, it might not be wise to retrieve data directly from external URLs frequently. An alternative is to have a background process copy the data periodically to a local store. If the local store is an XML database, the XQuery processing might be even faster because of the indices in the database.
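
A background copy process of this kind can be sketched with the JDK scheduler. Everything in the sketch is an assumption made for illustration: the class name PomMirror, the choice of a plain directory as the local store, and the map from local file names to source locations. A real setup would also want proper logging and a shutdown hook.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical mirror that refreshes local copies of remote POM files.
class PomMirror {

    // Copies one document into targetDir, replacing any earlier copy.
    static Path copyOnce(URI source, Path targetDir, String fileName) {
        try {
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(fileName);
            try (InputStream in = source.toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Refreshes every copy at a fixed interval on one background thread.
    // The caller is responsible for calling shutdown() on the returned scheduler.
    static ScheduledExecutorService start(Map<String, URI> sources,
            Path targetDir, long periodMinutes) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            for (Map.Entry<String, URI> entry : sources.entrySet()) {
                try {
                    copyOnce(entry.getValue(), targetDir, entry.getKey());
                } catch (RuntimeException e) {
                    // An uncaught exception would cancel all future runs,
                    // so a failed download is reported and skipped.
                    System.err.println("Could not refresh " + entry.getKey()
                        + ": " + e.getMessage());
                }
            }
        }, 0, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

The queries can then point doc() at the local copies instead of the remote URLs.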


Conclusion

You have seen a number of XQuery queries analyzing data hidden in Maven POM files. The complexity of the queries slowly increases when your requirements become more specific. Nevertheless, the resulting XQuery code remains concise and does not contain the overhead you'd have when using a different language to process XML.

In this article, I chose to use Maven, because most Java developers are familiar with it, and most Maven files are not stored in XML databases. You can use XQuery for many other purposes. Within the Java domain, think of Spring configurations or deployment descriptors. Outside the Java domain, the possibilities are endless.


Download

Description                     Name                      Size
Sample files for this article   maven-xquery-src-v2.zip   1952KB
