The basic tasks of data management do not change simply because the data are stored in XML. In essence, the four tasks are: store, search, retrieve, and display. Of course, all four are intertwined, and how you approach any one of them affects the other three.
You have to put the data somewhere. In 2005, desktop drives can routinely store several hundred gigabytes; RAID arrays easily cross the terabyte threshold; and even laptop hard drives have tens of gigabytes, so this problem is not as big as it once was. You can toss pretty much anything you want into a file system, but what happens when you need to get the data out? The difficulties of search and retrieval mean that even if you use a file system as your primary data store (and it's very possible that you shouldn't do this), you need to give some thought to how the files are laid out and organized.
Storing XML in any other way than a precise one-document-to-one-file correspondence has received surprisingly little attention. At best, a lot of books pay lip service to the idea that entities or XInclude may be used to break up a document into several files. The possibility of storing multiple documents in a single file or other container is rarely mentioned, though it turns out to be crucial for many use cases (such as logging). There are other storage mechanisms besides one file per document. Indeed, some of the most promising storage mechanisms don't even use files in the traditional sense.
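As a minimal sketch of the multiple-documents-in-one-file idea, assuming a log in which every line holds exactly one small, complete XML document: the entry element, its level attribute, and both function names below are hypothetical choices made for illustration, not an established format.

```python
import io
import xml.etree.ElementTree as ET

def append_entry(stream, level, message):
    """Append one complete XML document to the stream as a single line."""
    entry = ET.Element("entry", {"level": level})  # hypothetical vocabulary
    entry.text = message
    # ET.tostring escapes markup characters. Literal newlines in the text are
    # replaced with a character reference so one line is always one document.
    line = ET.tostring(entry, encoding="unicode").replace("\n", "&#10;")
    stream.write(line + "\n")

def read_entries(stream):
    """Parse each non-empty line as an independent XML document."""
    return [ET.fromstring(line) for line in stream if line.strip()]

log = io.StringIO()  # stands in for a real log file opened in append mode
append_entry(log, "info", "server started")
append_entry(log, "error", "disk <sda1> is full")
log.seek(0)
entries = read_entries(log)
```

Because each line is well-formed on its own, a writer can append without reparsing the file, and a reader can recover every earlier entry even if the last line was cut short by a crash. The container as a whole is not a single well-formed XML document, which is the trade-off this layout accepts.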
By far, the thorniest problem in data management is how to find information. If the only thing you know about the datum you're looking for is that it lies in one of the hundreds of thousands of unindexed files with names like "p095893t.xml" on a particular hard disk, you're in trouble, even if the disk and CPU are really, really fast. If you need to know the contents of not one file in that collection but several files, you're in bigger trouble, especially if you don't know how many files contain information relevant to your query until you've looked in all of them. And if you need to make a query like this several dozen times a second, well, you're done for. It can't be done -- not in a collection of flat files. And the overhead of parsing XML only exacerbates the problem. This is why databases have indexes -- so that a typical query doesn't require reading each and every byte in the database.
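To make the difference concrete, here is a rough sketch, assuming a tiny in-memory collection in place of files on disk. The documents, the second and third file names, and the species and sighting element names are all invented for illustration. The index parses each document once; the scan reparses everything for every query.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical collection: file name -> document text (made-up sample data)
collection = {
    "p095893t.xml": "<sighting><species>Barred Owl</species><year>1951</year></sighting>",
    "p095894t.xml": "<sighting><species>Wood Duck</species><year>1963</year></sighting>",
    "p095895t.xml": "<sighting><species>Barred Owl</species><year>1978</year></sighting>",
}

def build_index(docs, tag):
    """Parse every document once; map each value of <tag> to the files that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for element in ET.fromstring(text).iter(tag):
            index[(element.text or "").strip()].add(name)
    return index

def scan(docs, tag, value):
    """The unindexed alternative: reparse the whole collection for each query."""
    return {
        name
        for name, text in docs.items()
        if any((e.text or "").strip() == value for e in ET.fromstring(text).iter(tag))
    }

species_index = build_index(collection, "species")
hits = species_index.get("Barred Owl", set())  # a dictionary lookup, no parsing
```

Both approaches return the same file names. The cost of the index is paid once up front (and again whenever a document changes), while the cost of the scan is paid on every query and grows with the size of the collection.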
The mathematical underpinnings of relational databases combined with decades of experience optimizing these queries make search and query an area in which SQL databases really shine. However, XML documents rarely fit relational structures very well. While it's possible to shred documents and store them in relational tables, it's rather like parking a car by disassembling it into its component parts, placing each part on carefully labeled shelves, then reassembling the car when you're ready to drive away. It also takes about as long. Efficient, practical indexing of XML documents requires that you take advantage of the natural structure of those documents, which is rarely anything like the rows and columns of a relational database. Fortunately, a lot of work is being done in precisely this area, and the results look quite promising.
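The following sketch shows what shredding can look like in its simplest form, assuming one row per element in an in-memory SQLite database. The node table, its columns, and the shred and reassemble functions are hypothetical, and the sketch deliberately ignores attributes, mixed content, comments, and namespaces, all of which a real mapping would have to handle.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, root):
    """Store one row per element; return the row id of the root element."""
    def insert(element, parent_id, position):
        cur = conn.execute(
            "INSERT INTO node (parent_id, position, tag, text) VALUES (?, ?, ?, ?)",
            (parent_id, position, element.tag, element.text),
        )
        node_id = cur.lastrowid
        for i, child in enumerate(element):
            insert(child, node_id, i)
        return node_id
    return insert(root, None, 0)

def reassemble(conn, node_id):
    """Rebuild the element tree, issuing queries for every single element."""
    tag, text = conn.execute(
        "SELECT tag, text FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    element = ET.Element(tag)
    element.text = text
    children = conn.execute(
        "SELECT id FROM node WHERE parent_id = ? ORDER BY position", (node_id,)
    ).fetchall()
    for (child_id,) in children:
        element.append(reassemble(conn, child_id))
    return element

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "position INTEGER, tag TEXT, text TEXT)"
)
original = ET.fromstring(
    "<car><engine>V6</engine><wheels><wheel>front left</wheel>"
    "<wheel>front right</wheel></wheels></car>"
)
root_id = shred(conn, original)
rebuilt = reassemble(conn, root_id)
```

Even this five-element document becomes five rows, and getting it back takes two queries per element. That per-element overhead is the disassembled car in miniature.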
You've successfully stored the data. You've even found the place in which you put the data. Now, you need to get the data back out again. Retrieval of the data from the store is not normally your biggest problem (unless you stored your documents by breaking them up into tiny pieces and now you need to put Humpty Dumpty back together again). Still, doing so is not without challenges of its own. For instance, what happens if the physical media you stored the data on go bad? Backups become really important. What if the network fails, or the entire server? Do you have redundant systems in place in different locations? If so, how do you keep them in sync? And just who's allowed to see these data coming out of your store anyway? You certainly don't want to show the salary records for the citizen team members to the H-1B hires who are paid 20% less for the same job -- whether in XML or in any other format. Access control is critical (though in fairness, that's an issue for storage and query as well).
The final step is showing the user the document you found, successfully queried, and retrieved. This is beginning to return to some well-covered territory. A lot has been written about how to use XSLT, CSS, XHTML, SVG, and other technologies to show a single document to a single person or even multiple people. I don't intend to rehash all that. However, there's still some unexplored territory on the fringes here. For instance, suppose you don't have just one document to view but several -- or several dozen or several thousand. How do you summarize multiple XML documents in an easily accessible fashion?
Fundamentally, these four basic tasks are the same tasks that are performed by relational databases, network databases, flat files, and index cards. (One database I've recently begun working with, the Louisiana State University Museum of Natural History (LSUMNH) Bird Observation Collection, has records going back more than 50 years and is stored completely on paper.) People have managed large data collections at least since the time of Ptolemy II of Egypt and the founding of the Library of Alexandria, and at least one Babylonian library dates to about 1000 years earlier. I hope today's librarians know a little more about how to manage data than the Alexandrians did, but sometimes it's surprising how little has changed. Indeed, one way you can tell what's fundamental and what's merely this year's ephemeral hype is to see what's really new and what's merely an old idea wrapped in a new disguise. (I say "merely," but those old, proven ideas are worth learning and hanging on to. The new ideas have a tendency to burn very brightly for a short time, then burn out.)
Although humans have managed data for millennia, the new forms data take as technology changes do necessitate new tools. And the new tools can also make the tasks of searching and storing the data much easier. Index cards were state-of-the-art technology in 1947, when the LSUMNH card file database was initiated, but they don't work so well in the age of the Internet; indeed, plans are under way to computerize the collection. As more and more data are authored, exchanged, and made available in XML, informaticians need tools that make use of the XML structures for storage, search, access control, and more. Some of these tools will store the actual XML documents. Others will simply use the explicit structures in the XML documents to inform their decisions about how to store the content in other formats. But the XML structure does convey meaning beyond the unmarked-up text; and it makes sense to take advantage of that carefully laid-out structure when it's available.
For the past two decades, relational databases have enjoyed an unprecedented run of success managing data. They have the distinct advantage of being backed by a really sound mathematical theory. They have the even more important advantage of working well in practice (though the impressive performance of real-world relational systems tends to come at the expense of precisely implementing the mathematical model -- a fact that incenses relational purists no end). In fact, relational databases have been so successful that developers sometimes forget about other ways of storing data. Contrary to popular belief, most of the world's data are not stored in relational databases. Quite a lot of data are still locked up in old-style network and hierarchical databases like IDMS and IMS. Even more are stored in things you probably wouldn't recognize as a database at all: plain vanilla text files, spreadsheets, e-mail messages, Microsoft® Office documents, and the like -- all stored in file systems on various computers. Much (though not all) of this sort of data is far more easily, efficiently, and effectively stored using XML and XML data management techniques than by using traditional relational approaches.
In the weeks and months ahead, I'll explore the issues I've identified here. I'll look at a lot of topics that are almost trivial when considering one document at a time, but become much more important when managing collections of documents. Topics I'll explore include:
- File name extensions and MIME media types
- Using catalogs to manage central repositories of schemas and stylesheets
- Native XML databases
- XQuery
- XUpdate
- Storing XML in relational databases
- Storing relational tables in XML documents
- Managing malformed HTML
- Serving XML documents on the Web
- Content management systems
- XML streams and logging
- Version control
I'll explore both the theory and the practice of managing collections of XML documents. Planned installments of this column will cover how to analyze needs, how to configure existing systems for better performance and maintainability, and reviews of new technology that you might want to consider in the future. I'll look at old issues, such as backups and version control, with a particular focus on the changes that new XML-style data require in these systems. I'll also look at new issues that may not have arisen in the past, such as making sure that an organization keeps all its schemas in sync across multiple, independent computers in branches located around the world.
I have a lot to say about all this, and it will take me more than a few articles to say it. But at the risk of making my task even larger than it is, I invite you to send in your own questions, comments, and war stories about managing XML data. What problems do you struggle with in your organizations? What's caused you grief in the past, and what holds you back with respect to moving more data into XML in the future? Most of all, what do you need to know that nobody's talked about yet? Let me know. I'll sort through the responses. And, as the common themes bubble to the top, I'll address them here in the coming months.
- Visit the LSU Museum of Natural History in Baton Rouge, Louisiana. (You'll need to make an appointment if you want to see the bird observation catalog.)
- Learn about the Library of Alexandria on Wikipedia.
- Read Cliff Stoll's book Silicon Snake Oil, which makes a very cogent argument that paper databases, especially library card catalogs, have distinct advantages compared to computerized databases.
- Check out An Introduction to Database Systems by Chris Date, who is probably the best advocate of the "relational is the one true data model" position. This book is the standard introduction to the relational model. The eighth edition now includes (grudgingly) a chapter about XML written by IBM's Nick Tindall. The XML chapter is available from the book's Web site.
- Explore Ron Bourret's solid introduction to using XML with various types of database systems.

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.