Managing XML data: eXist -- an open source native XML database

This database offers a variety of features, but is it ready for prime time?

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the Jaxen XPath engine. You can contact him at elharo@metalab.unc.edu.

Summary:  As XML gains popularity, more and more users are finding themselves with a lot of XML documents to manage. Native XML databases are being developed to meet this obvious need. This article examines one such database, the open source eXist. eXist has the tools you need to manage data, and it benefits from broad API support -- but it's still in beta, and big performance and functionality holes need to be filled before it can be called solid.

Date:  27 Jun 2005
Level:  Intermediate

Wolfgang Meier's open source eXist database is probably the most popular native XML database available today (which is not at all the same thing as saying it's the best). eXist is written in the Java™ programming language and runs on most major platforms. Programs interface with eXist through its bundled HTTP server. SOAP, XML-RPC, and RESTful interfaces are all provided, and through these you can submit XPath, XQuery, and XUpdate requests to the core server. Command-line and GUI clients are also available.

Install eXist

eXist requires Java 1.4 or later; otherwise, all necessary dependencies are bundled (a nice touch). In fact, installing eXist is shockingly easy for a server-side open source project. A lot of other projects, closed and open source, might learn from it. The installer is built with IzPack. The distribution is a single JAR archive. To install eXist, just run the archive like so:

$ java -jar eXist-1.0b2-build-1107.jar

The installer brings up a GUI that asks you where you want to install the eXist directory. I put it in /home/elharo/eXist. The eXist/bin directory contains the necessary startup scripts. To launch the server, execute startup.sh (UNIX®) or startup.bat (Microsoft® Windows®):

$ ./startup.sh

This command runs the server on port 8080 and begins serving the files in /eXist. You can connect to eXist from any Web browser. For instance, I installed eXist on eliza.elharo.com, so I can connect to it at the following URL:

http://eliza.elharo.com:8080/exist/

(Don't try this at home -- my firewall will block you. You'll have to connect to your own server.)

Initially, you'll see the eXist documentation, as well as some samples that you can try out.


Load data into eXist

eXist isn't really a Web server; it just uses one as a convenient interface to the underlying database server. The package also includes independent GUI clients and programming APIs that you can use to perform various operations. You can even browse it from Microsoft Windows Explorer using WebDAV. For initial experimentation, it's probably easiest to use the simple GUI client. To launch the client, execute client.sh (UNIX) or client.bat (Windows) from the eXist/bin directory:

$ ./client.sh

As you can see in Figure 1, by default the client tries to connect to an eXist database running on localhost, port 8080. You can specify a different host and port in the URL text field. The same window also asks you for a username and a password. By default, the username is admin; you can leave the password field blank. (Haven't software companies learned by now not to ship servers with default usernames and passwords?)


Figure 1. Connect to eXist

Yes, you can administer users, set passwords, set privileges, assign users to different groups with different access permissions, and all the other necessary tasks in a full production environment. However, the brevity of this article forces me to skip over many such options in order to get to the meat of the database.

After you've logged in, the client displays the GUI shown in Figure 2. Initially, eXist comes with one collection, called system, in which the user information is stored. You want to stay out of this collection for now. Instead, create a new collection for your documents by selecting File > New Collection. I created a collection named books. To open the collection, double-click it in the GUI. After you open a collection, to upload documents, click the icon that looks like a bent piece of paper with a plus sign next to it.


Figure 2. The eXist admin client

I first uploaded a couple of small documents, and the database accepted them without complaint. I then tried to upload the complete text of my book Processing XML with Java. This operation failed silently, with no error message. Uploading through the Web interface instead of the GUI client also failed. However, that interface showed me a stack trace to help debug the problem. It turned out that eXist didn't resolve the relative URL used in the document type declaration. To load documents with external DTD subsets, you must manually install the DTDs on the server's filesystem and edit a catalog file to tell the database where they are; then, you have to restart the database server to make it reload the catalog file. This is a major hassle, although you normally only need to install each different DTD once. eXist works best if your documents either don't use DTDs or use only a small number of infrequently changed DTDs.


Query eXist

eXist supports both XPath and XQuery (see Resources for more information on both). eXist uses the XQuery syntax from the November 2003 XQuery working draft. Work is ongoing to update the database to use the syntax from more recent working drafts. The differences between the drafts for basic For-Let-Where-Order-Return (FLWOR) queries aren't large.

To enter queries against a collection, click the little binoculars icon in the GUI client to bring up the window shown in Figure 3.


Figure 3. eXist query window

Annoyingly, copy and paste functions don't work in this interface, so you have to manually type in all queries. Of course, this program is really just for testing and experiments -- you wouldn't use it for heavy-duty interaction with the database any more than you'd type raw SQL into an Oracle database. After you have a fairly good idea of the queries that you want to run, you can write programs that generate and submit the queries algorithmically, as I discuss next.


Write programs that interface with eXist

IBM®, Oracle, and the other members of the JSR 225 expert group are currently working to define an API that will do for XQuery what JDBC does for SQL. However, until this process is finished and the API is implemented in eXist, it will be necessary to use eXist's native API. You can access this API through SOAP, XML-RPC, WebDAV, or HTTP interfaces. Any API that supports one of these protocols can communicate with eXist. For instance, you can use JAX-RPC to talk to eXist over SOAP or java.net to talk to it over HTTP.

The RESTful HTTP interface is the simplest and most broadly available of the options. For example, suppose you want to find all para elements in the books collection that contain the word "XSLT." The XQuery in Listing 1 locates all such elements.


Listing 1. A sample XQuery
for $p in //para 
where contains($p, "XSLT") 
return $p

You GET this query from the following URL:

http://eliza.elharo.com:8080/exist/servlet/db/books/

Here, eliza.elharo.com is the network host on which the database is running; 8080 is the port; /exist/servlet/db identifies the Web app, the servlet, and the database, respectively; and books is the specific collection you're querying in that database. eXist allows nested collections. For instance, the books collection might contain separate fiction and nonfiction collections, which are available at the following URLs:

http://eliza.elharo.com:8080/exist/servlet/db/books/fiction/
http://eliza.elharo.com:8080/exist/servlet/db/books/nonfiction/

For the purposes of this article, however, you want to query all the books, both fiction and nonfiction. The XQuery is sent as the value of the _query field in the URL's query string (the part of the URL after a question mark). It must be percent-encoded in the usual way (for example, spaces become %20, the double quotation mark becomes %22, and so forth). Thus, you can send the query in Listing 1 to the server by GETting the following URL:

http://eliza.elharo.com:8080/exist/servlet/db/books/?_query=
for%20$p%20in%20//para%20where%20contains($p,%20%22XSLT%22)%20return%20$p
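
The percent-encoding step is easy to get wrong by hand, so here is a minimal Java sketch that builds the same kind of URL. The class and method names are illustrative, and the localhost address is an assumption; substitute your own server. It uses the `URLEncoder.encode(String, Charset)` overload, which requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryUrl {
    /** Appends the XQuery to the collection URL as a percent-encoded _query value. */
    static String build(String collectionUrl, String xquery) {
        // URLEncoder produces form encoding, which writes a space as "+".
        // A literal "+" in the input is already emitted as %2B, so every
        // remaining "+" is a space and can safely be rewritten as %20.
        String encoded = URLEncoder.encode(xquery, StandardCharsets.UTF_8)
                                   .replace("+", "%20");
        return collectionUrl + "?_query=" + encoded;
    }

    public static void main(String[] args) {
        String xquery = "for $p in //para where contains($p, \"XSLT\") return $p";
        // Hypothetical local server; replace with your own host and collection.
        System.out.println(build("http://localhost:8080/exist/servlet/db/books/", xquery));
    }
}
```

URLEncoder escapes more characters than the hand-written URL does (for example, `$` becomes %24 and `/` becomes %2F). That is harmless, because the server decodes the value before it parses the query.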

The server sends back the query results wrapped in an exist:result element like the one in Listing 2.


Listing 2. Results of sample query
<exist:result xmlns:exist="http://exist.sourceforge.net/NS/exist" 
  exist:hits="148" exist:start="1" exist:count="10">
<para><quote>HTML? You must be joking</quote> said the fourth, a computer
science professor on sabbatical from MIT, who was engrossed in an XSLT
stylesheet ...</para>
<para>XSLT and the TrAX API</para>
<para>Combine functional XSLT transforms with traditional imperative Java code</para>
<para>The TrAX API for XSLT processing</para>
<para>Once you're comfortable with one or more of these APIs, you
  can read Chapters 16 and 17 on XPath and XSLT.
  However, those APIs and chapters do require some knowledge of at least one
  of the three major APIs.</para>
...</exist:result>
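
The attributes on the wrapper element tell you how many matches exist and which slice of them you received. As a rough sketch, the following Java class reads them with the DOM parser bundled in the JDK (rather than XOM, only to keep the example free of external libraries). The class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ResultSummary {
    static final String EXIST_NS = "http://exist.sourceforge.net/NS/exist";

    /** Returns {hits, start, count} read from the exist:result wrapper element. */
    static int[] summarize(String resultXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getAttributeNS to work
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(resultXml)))
                .getDocumentElement();
            return new int[] {
                Integer.parseInt(root.getAttributeNS(EXIST_NS, "hits")),
                Integer.parseInt(root.getAttributeNS(EXIST_NS, "start")),
                Integer.parseInt(root.getAttributeNS(EXIST_NS, "count"))
            };
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable exist:result document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<exist:result xmlns:exist=\"http://exist.sourceforge.net/NS/exist\""
            + " exist:hits=\"148\" exist:start=\"1\" exist:count=\"10\">"
            + "<para>XSLT and the TrAX API</para></exist:result>";
        int[] s = summarize(xml);
        System.out.println(s[0] + " hits; this response holds " + s[2]
            + " starting at position " + s[1]);
    }
}
```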

Other optional query string variables control whether the results are pretty printed, what elements wrap the results, how many matches are returned (by default, eXist returns only the first 10 hits), and so forth.
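
Because only the first 10 hits come back by default, fetching all 148 matches from Listing 2 means requesting the results page by page. The sketch below computes the starting position of each page. The parameter names `_start` and `_howmany` are my assumption about what the REST interface calls these variables; check the documentation for your eXist version before relying on them. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class Paging {
    /** 1-based start positions for each page of a result set with the given hit count. */
    static List<Integer> pageStarts(int hits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 1; start <= hits; start += pageSize) {
            starts.add(start);
        }
        return starts;
    }

    /** Query-string suffix for one page; the parameter names are assumed, not verified. */
    static String pageParams(int start, int pageSize) {
        return "&_start=" + start + "&_howmany=" + pageSize;
    }

    public static void main(String[] args) {
        // 148 hits at 10 per page, as reported in Listing 2
        for (int start : pageStarts(148, 10)) {
            System.out.println(pageParams(start, 10));
        }
    }
}
```

Each suffix would be appended to a query URL after the `_query` value.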

Because this is all done with HTTP GET, you can make this query simply by typing the appropriate URL into a Web browser. Of course, any software library that speaks HTTP can also send this query and get back the result as a stream of XML. If you were to write this query in the Java language, you might use the URLEncoder class to encode the query string, the URL class to submit it, and XOM to process the results, as shown in Listing 3.


Listing 3. Query eXist in Java code
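import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import nu.xom.Builder;
import nu.xom.Document;

String xquery = "for $p in //para"
  + " where contains($p, \"XSLT\")"
  + " return $p";
// Name the encoding explicitly; the one-argument encode() is deprecated
String encodedQuery = URLEncoder.encode(xquery, "UTF-8");
URL u = new URL("http://eliza.elharo.com:8080/exist/servlet/db/books/?_query="
  + encodedQuery);
InputStream in = u.openStream();
Document doc = (new Builder()).build(in);
// work with the document...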

An HTTP interface like this one is completely language independent. You can easily reproduce the functionality in Listing 3 in Perl, Python, C, C#, or any other language that has a simple HTTP library and some XML support. One of the most effective ways to query such a database is to write an XSLT stylesheet that formats the results.
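
To illustrate the stylesheet approach, here is a small sketch that uses the TrAX API in the JDK to turn an exist:result document into plain text, one line per para element. The stylesheet is embedded as a string only to keep the example self-contained, and the class and method names are illustrative. In a real program, the input would be the stream returned by the server rather than a string literal.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ResultFormatter {
    // Emits the whitespace-normalized text of each para child, one per line.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:exist='http://exist.sourceforge.net/NS/exist'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/exist:result'>"
      + "<xsl:for-each select='para'>"
      + "<xsl:value-of select='normalize-space(.)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String format(String resultXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(resultXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not transform the result", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<exist:result xmlns:exist='http://exist.sourceforge.net/NS/exist'"
            + " exist:hits='2' exist:start='1' exist:count='2'>"
            + "<para>XSLT and the TrAX API</para>"
            + "<para>The TrAX API for\n  XSLT processing</para>"
            + "</exist:result>";
        System.out.print(format(xml));
    }
}
```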

Insert documents

XQuery allows you to get information out of the database. But what about putting data in? This is even easier. Instead of sending a GET request, you send a PUT request. The URL where you PUT the data is the URL where the document will be placed inside the database; the body of the request is the document to store. For example, the Java code in Listing 4 grabs the RSS feed from the Cafe con Leche Web site and puts it in the syndication collection with the name 20050401.


Listing 4. Insert documents into eXist with Java code
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Source feed to copy into the database
URL source = new URL("http://www.cafeaulait.org/today.rss");
InputStream in = source.openStream();
// Target URL: where the document will live inside the database
URL target = new URL("http://eliza.elharo.com:8080/exist/servlet/db/syndication/20050401");
HttpURLConnection conn = (HttpURLConnection) target.openConnection();
conn.setDoOutput(true);
conn.setRequestMethod("PUT");
conn.setRequestProperty("Content-Type", "application/xml");
OutputStream out = conn.getOutputStream();
for (int c = in.read(); c != -1; c = in.read()) {
  out.write(c);
}
out.flush();
out.close();
in.close();
// read the response...

PUTting new documents into the database typically requires authentication. eXist's REST interface supports HTTP Basic authentication. The Java language supports this through the java.net.Authenticator class. Complete details would take this discussion a little too far afield; but in brief, you have to subclass Authenticator with a class that knows (or knows how to ask for) the user name and password for the database, and then install an instance of this subclass as the system default authenticator.
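
As a minimal sketch of that idea, the following subclass simply holds a fixed user name and password. The class name and the credentials shown are placeholders; a real program would more likely prompt for them or read them from configuration.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

class ExistAuthenticator extends Authenticator {
    private final String user;
    private final char[] password;

    ExistAuthenticator(String user, String password) {
        this.user = user;
        this.password = password.toCharArray();
    }

    /** Called by java.net whenever a server answers with an authentication challenge. */
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(user, password);
    }

    /** Installs an instance as the system-wide default authenticator. */
    static void install(String user, String password) {
        Authenticator.setDefault(new ExistAuthenticator(user, password));
    }

    public static void main(String[] args) {
        // Placeholder credentials; do not hard-code real ones.
        install("admin", "secret");
        System.out.println("Default authenticator installed");
    }
}
```

Once the authenticator is installed, connections opened through URL and HttpURLConnection consult it automatically. Note that this simple version answers every challenge with the same credentials, including challenges from other hosts; a more careful implementation would check getRequestingHost() first.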

Delete documents

Need to remove a document from the collection? Just send a DELETE request to the appropriate URL, as shown in Listing 5.


Listing 5. Delete a document in eXist
import java.net.HttpURLConnection;
import java.net.URL;

URL u = new URL("http://eliza.elharo.com:8080/exist/servlet/db/syndication/20050401");
HttpURLConnection conn = (HttpURLConnection) u.openConnection();
conn.setRequestMethod("DELETE");
conn.connect();
// read the response (for example, with conn.getResponseCode())...

Again, in practice you also need to supply a username and a password via an Authenticator object.

Update documents

The final and trickiest operation is to modify information in the database. For example, suppose I change my e-mail address from elharo@metalab.unc.edu to elharo@macfaq.com. Therefore, I want to change all <email>elharo@metalab.unc.edu</email> elements to <email>elharo@macfaq.com</email>. XQuery doesn't provide this capability, so eXist uses XUpdate instead. The XUpdate query in Listing 6 makes the change.


Listing 6. Using XUpdate to update documents in eXist
<xupdate:modifications version="1.0"
  xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:update
    select="//email[.='elharo@metalab.unc.edu']">elharo@macfaq.com</xupdate:update>
</xupdate:modifications>

Because this operation changes a resource, you need to use the POST method to send it to the server. You post to the URL of the document you want to change and give the XUpdate instructions in the body of the request.
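
Here is a rough sketch of how that POST might look in Java. It builds the XUpdate document as a string and sends it with HttpURLConnection. It wraps the update instruction in an xupdate:modifications root element, which is how I understand the XUpdate working draft to define a complete document. The class and method names are illustrative, the addresses are inserted without any escaping (so they must not contain quotation marks or markup characters), and the document URL you pass to post() is yours to supply.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class EmailUpdate {
    /** Builds an XUpdate document that replaces one e-mail address with another. */
    static String buildXUpdate(String oldAddress, String newAddress) {
        return "<xupdate:modifications version=\"1.0\""
             + " xmlns:xupdate=\"http://www.xmldb.org/xupdate\">"
             + "<xupdate:update select=\"//email[.='" + oldAddress + "']\">"
             + newAddress
             + "</xupdate:update></xupdate:modifications>";
    }

    /** POSTs the XUpdate document to the given URL and returns the HTTP status code. */
    static int post(String documentUrl, String xupdate) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(documentUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/xml");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xupdate.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Only prints the request body; call post() against your own server.
        System.out.println(buildXUpdate("elharo@metalab.unc.edu", "elharo@macfaq.com"));
    }
}
```

As with PUT and DELETE, a real server will normally expect credentials along with this request.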

I've just hit the highlights of the REST interface. It also includes instructions to create and drop collections, to specify how the query results are formatted, and to supply user credentials. Nor is HTTP the only interface to eXist. eXist also has native APIs for Perl, PHP, and the Java language, along with generic WebDAV, SOAP, and XML-RPC interfaces. Broad API support is one of the particular strengths of eXist.


Performance, robustness, and stability

eXist is not the fastest database on the planet. You can easily use a stopwatch to measure the time it takes to load a medium-sized document, even on fast hardware connecting to a local database. Query speed is of similar quality. Complex queries over moderately large collections give you enough time to brew a cup of coffee. To improve both document loading and query times, you can give eXist more memory. The default configuration that ships with eXist specifies settings that are appropriate for machines with about 256 MB of memory. If you have a beefier server, you can modify the conf.xml file to allocate more memory.

To tune the database, you can add indexes. By default, eXist indexes element and attribute nodes as well as the full text of the document. You can specify additional range indexes for particular node-sets that are likely to occur in your queries. For instance, if you know that you are likely to run a lot of queries that look at para elements, you can define an index on //para. This tells eXist to precompute and store the values of all the para elements in the document because they're likely to be needed later.

Still, eXist is mostly suitable for small collections where speed isn't critical. If you have gigabyte-sized documents or you process thousands of transactions per hour, plan to look elsewhere.

Similarly, I'm not sure I'm ready to trust my critical data to eXist. I haven't personally experienced any database corruption. However, the developers are still finding and fixing database corruption problems more frequently than I'm comfortable with. On the plus side, eXist does make it quite easy to back up the database. Most importantly, the backup format saves the contents in real textual XML, not some proprietary binary format; this means that in a worst-case scenario, you can fix problems with a text editor. If you make frequent archival backups, eXist is unlikely to do anything that makes the data irretrievable.

Feature-wise, eXist suffices for basic needs and includes some unexpected lagniappes such as XInclude support. Transactions, rollover, fallback, and similar enterprise-level features are all missing (transactions are on the "to do" list); but many applications don't need such advanced functionality.

One of my biggest concerns about eXist (or any other XQuery-based native XML database, for that matter) is the stability of the underlying standards and APIs. This article is based on the latest beta of eXist, from November 2004, which is based on the XQuery drafts from November 2003. The version of eXist now in CVS has made quite a few backwards-incompatible changes that are not yet fully documented. More changes will come in the future, both in eXist and in the W3C specs it depends on. Do not put eXist into production unless you're comfortable with frequent updates that will require you to retest and rewrite some of your own code.


Summary

The more data you have, the more important it becomes to use some sort of database system to manage it. If the data is XML, a solid native XML database is an obvious choice. Is eXist such a solid system? Sadly, the answer is no. eXist is an interesting research project that might develop into a useful tool in a year or two. However, it's hard to recommend in its current state. Documentation is incomplete and often misleading. Error messages are nonexistent. (Note to programmers everywhere: Exception stack traces don't count as decent error messages -- and sometimes eXist doesn't even give you those.) GUIs violate user interface standards at every turn. Basic features like copy and paste are omitted. During the very basic testing I did for this article, I encountered multiple bugs.

eXist isn't finished yet. It's currently classified as a beta. Many of the problems I encountered might be fixed before version 1.0 ships, but that won't happen tomorrow. I know some people use eXist for real work today, and that worries me. Either they're very lucky, or they carefully craft their queries and documents to avoid eXist's bugs. If you're interested in contributing to an open source project, eXist is a worthwhile candidate. However, the same incompleteness that makes it a fun project for programmers with time on their hands makes it unsuitable for production systems.


Resources
