Level: Intermediate
Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
11 Apr 2005
Much has been written about how to process XML documents, including how to search them with XPath, transform them with XSLT, style them with CSS, and create them with DOM. But as XML becomes increasingly popular and begins to pervade your systems (whether you want it to or not), a larger problem arises: How do you manage collections of XML documents? When you've got thousands, tens of thousands, or even millions of XML documents to hunt through, how do you find what you're looking for? How do you organize, index, search, store, serve, cross-reference, update, and otherwise manage medium-to-large collections of XML data? This column will attempt to provide useful answers to these questions.
Four tasks of data management
The basic tasks of data management do not change simply because the data are stored in XML.
In essence, the four tasks are: store, search, retrieve, and display. Of course, all four are
intertwined, and how you approach any one of them affects the other three.
Store
You have to put the data somewhere. In 2005, desktop drives can routinely store
several hundred gigabytes; RAID arrays easily cross the terabyte threshold; and
even laptop hard drives have tens of gigabytes, so this problem is not as big as it
once was. You can toss pretty much anything you want into a file system, but what
happens when you need to get the data out? The difficulties of search and retrieval
mean that even if you use a file system as your primary data store (and it's very
possible that you shouldn't do this), you need to give some thought to how the files
are laid out and organized.
Storing XML in any other way than a precise one-document-to-one-file correspondence
has received surprisingly little attention. At best, a lot of books pay lip service to the idea
that entities or XInclude may be used to break up a document into several files. The
possibility of storing multiple documents in a single file or other container is rarely
mentioned, though it turns out to be crucial for many use cases (such as logging). There
are other storage mechanisms besides one file per document. Indeed, some of the most
promising storage mechanisms don't even use files in the traditional sense.
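To make the multiple-documents-per-file idea concrete, here is a minimal sketch of a log-style container that holds one complete XML document per line. It is an illustration under stated assumptions, not a mechanism this column prescribes: the function names append_document and read_documents are hypothetical, and the framing assumes that the documents contain no comments or processing instructions with line breaks in them.

```python
import io
import xml.etree.ElementTree as ET


def append_document(stream, element):
    """Append one complete XML document to the container as a single line."""
    text = ET.tostring(element, encoding="unicode")
    # Line breaks inside text content would break the one-document-per-line
    # framing, so write them as character references instead.
    # Assumption: no comments or processing instructions contain line breaks.
    text = text.replace("\r", "&#13;").replace("\n", "&#10;")
    stream.write(text + "\n")


def read_documents(stream):
    """Yield each document in the container as a parsed element."""
    for line in stream:
        line = line.strip()
        if line:
            yield ET.fromstring(line)


# Usage: an in-memory stream stands in for a log file opened in append mode.
log = io.StringIO()
first = ET.Element("entry", {"level": "info"})
first.text = "server started"
second = ET.Element("entry", {"level": "error"})
second.text = "disk full\nretrying"
append_document(log, first)
append_document(log, second)

log.seek(0)
entries = list(read_documents(log))
```

Because each line is a well-formed document in its own right, a writer can append new entries without rewriting or re-parsing anything already in the file, which is the property that logging needs.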
Search
By far, the thorniest problem in data management is how to find information. If the only
thing you know about the datum you're looking for is that it lies in one of the hundreds
of thousands of unindexed files with names like "p095893t.xml" on a particular hard
disk, you're in trouble, even if the disk and CPU are really, really fast. If you need to know
the contents of not one file in that collection but several files, you're in bigger trouble,
especially if you don't know how many files contain information relevant to your query
until you've looked in all of them. And if you need to make a query like this several dozen
times a second, well, you're done for. It can't be done -- not in a collection of flat files.
And the overhead of parsing XML only exacerbates the problem. This is why databases
have indexes -- so that a typical query doesn't require reading each and every byte in
the database.
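The difference between scanning and indexing can be sketched in a few lines. The example below is a toy under stated assumptions: the documents live in an in-memory dictionary rather than on disk, the file names and contents are invented, and the helper names (terms, scan, build_index, lookup) are hypothetical. A real index would also record element names and positions, not just words.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def terms(xml_text):
    """Return the set of lowercased words in a document's text content."""
    root = ET.fromstring(xml_text)
    words = set()
    for text in root.itertext():
        words.update(word.lower() for word in text.split())
    return words


def scan(documents, term):
    """Brute force: parse every document again for every single query."""
    return sorted(
        name for name, text in documents.items() if term.lower() in terms(text)
    )


def build_index(documents):
    """Parse each document once and map every word to the files containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in terms(text):
            index[word].add(name)
    return index


def lookup(index, term):
    """Answer a query from the index without touching the documents."""
    return sorted(index.get(term.lower(), ()))


# Invented sample collection; the names imitate the opaque file names above.
documents = {
    "p095893t.xml": "<sighting><species>Barred Owl</species></sighting>",
    "p095894t.xml": "<sighting><species>Snowy Egret</species></sighting>",
    "p095895t.xml": "<sighting><species>Great Horned Owl</species></sighting>",
}
index = build_index(documents)
owl_files = lookup(index, "owl")
```

Both approaches return the same answer. The difference is that scan pays the full parsing cost on every query, while lookup pays it once, when the index is built.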
The mathematical underpinnings of relational databases combined with decades of
experience optimizing these queries make search and query an area in which SQL
databases really shine. However, XML documents rarely fit relational structures very
well. While it's possible to shred documents and store them in relational tables, it's
rather like parking a car by disassembling it into its component parts, placing each
part on carefully labeled shelves, then reassembling the car when you're ready
to drive away. It also takes about as long. Efficient, practical indexing of XML
documents requires that you take advantage of the natural structure of those documents,
which is rarely anything like the rows and columns of a relational database.
Fortunately, a lot of work is being done in precisely this area, and the results look
quite promising.
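For readers who have not seen shredding, the following sketch shows the disassemble-and-reassemble cycle using Python's standard sqlite3 module. It is one simple way to do it, under stated assumptions: the two-table schema (node and attribute), the function names shred and reassemble, and the sample order document are all hypothetical, and namespaces, comments, and processing instructions are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one row per element, one row per attribute.
SCHEMA = """
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),
    position INTEGER NOT NULL,
    tag TEXT NOT NULL,
    text TEXT,
    tail TEXT
);
CREATE TABLE attribute (
    node_id INTEGER NOT NULL REFERENCES node(id),
    name TEXT NOT NULL,
    value TEXT NOT NULL
);
"""


def shred(conn, element, parent_id=None, position=0):
    """Store an element and its descendants as rows; return the element's row id."""
    cursor = conn.execute(
        "INSERT INTO node (parent_id, position, tag, text, tail) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, position, element.tag, element.text, element.tail),
    )
    node_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO attribute (node_id, name, value) VALUES (?, ?, ?)",
        [(node_id, name, value) for name, value in element.attrib.items()],
    )
    for index, child in enumerate(element):
        shred(conn, child, node_id, index)
    return node_id


def reassemble(conn, node_id):
    """Rebuild an element tree from its rows, one query set per element."""
    tag, text, tail = conn.execute(
        "SELECT tag, text, tail FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    element = ET.Element(tag)
    element.text, element.tail = text, tail
    for name, value in conn.execute(
        "SELECT name, value FROM attribute WHERE node_id = ? ORDER BY rowid",
        (node_id,),
    ):
        element.set(name, value)
    child_ids = conn.execute(
        "SELECT id FROM node WHERE parent_id = ? ORDER BY position", (node_id,)
    ).fetchall()
    for (child_id,) in child_ids:
        element.append(reassemble(conn, child_id))
    return element


# Usage with an invented document and an in-memory database.
SOURCE = (
    '<order id="7"><item sku="a1">Widget</item><item sku="b2">Gadget</item>'
    "<note>Rush <b>please</b>!</note></order>"
)
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
original = ET.fromstring(SOURCE)
root_id = shred(conn, original)
rebuilt = reassemble(conn, root_id)
```

Even this five-element document turns into five node rows and three attribute rows, and putting it back together takes several queries per element. That per-element overhead is the cost the car-parking analogy describes.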
Retrieve
You've successfully stored the data. You've even found the place in which you
put the data. Now, you need to get the data back out again. Retrieval of the data
from the store is not normally your biggest problem (unless you stored your documents
by breaking them up into tiny pieces and now you need to put Humpty Dumpty back
together again). Still, doing so is not without challenges of its own. For instance,
what happens if the physical media you stored the data on go bad? Backups
become really important. What if the network fails? Or the entire server? Do
you have redundant systems in place in different locations? If so, how do you keep
them in sync? And just who's allowed to see these data coming out of your store
anyway? You certainly don't want to show the salary records for the citizen team
members to the H-1B hires who are paid 20% less for the same job -- whether
in XML or in any other format. Access control is critical (though in fairness,
that's an issue for storage and query as well).
Display
The final step is showing the user the document you found, successfully queried,
and retrieved. This is beginning to return to some well-covered territory. A lot has
been written about how to use XSLT, CSS, XHTML, SVG,
and other technologies to show a single document to a single
person or even multiple people. I don't intend to rehash all that. However, there's still
some unexplored territory on the fringes here. For instance, suppose you don't have
just one document to view but several -- or several dozen or several
thousand. How do you summarize multiple XML documents in an easily accessible fashion?
A little history
Fundamentally, these four basic tasks are the same tasks that are performed by relational databases,
network databases, flat files, and index cards. (One database I've recently begun working
with, the Louisiana State University Museum of Natural History (LSUMNH) Bird Observation
Collection, has records going back more than 50 years and is stored completely on paper.)
People have managed large data collections at least since the time of Ptolemy II of Egypt
and the founding of the Library of Alexandria, and at least one Babylonian library
dates to about 1000 years earlier. I hope today's librarians know a little more about how to manage
data than the Alexandrians did, but sometimes it's surprising how little has changed. Indeed,
one way you can tell what's fundamental and what's merely this year's ephemeral hype is
to see what's really new and what's merely an old idea wrapped in a new disguise. (I
say "merely," but those old, proven ideas are worth learning and hanging on to. The new ideas have a tendency to burn very brightly for a short time, then
burn out.)
Although humans have managed data for millennia, the new forms data
take as technology changes do necessitate new tools. And the new tools can also make
the tasks of searching and storing the data much easier. Index cards were state-of-the-art
technology in 1947, when the LSUMNH card file database was initiated, but they don't work
so well in the age of the Internet; indeed, plans are under way to computerize the collection.
As more and more data are authored, exchanged, and made available in XML, informaticians
need tools that make use of the XML structures for storage, search, access control, and more.
Some of these tools will store the actual XML documents. Others will simply use the explicit
structures in the XML documents to inform their decisions about how to store the content in
other formats. But the XML structure does convey meaning beyond the unmarked-up text;
and it makes sense to take advantage of that carefully laid-out structure when it's available.
For the past two decades, relational databases have enjoyed an unprecedented run of
success managing data. They have the distinct advantage of being backed by a really
sound mathematical theory. They have the even more important advantage of working
well in practice (though the impressive performance of real-world
relational systems tends to come at the expense of precisely implementing the mathematical
model -- a fact that incenses relational purists no end). In fact, relational databases have
been so successful that developers sometimes forget about other ways of storing data.
Contrary to popular belief, most of the world's data are not stored in relational databases.
Quite a lot of data are still locked up in old-style network and hierarchical databases
like IDMS and IMS. Even more are stored in things you probably wouldn't recognize as a
database at all: plain vanilla text files, spreadsheets, e-mail messages, Microsoft®
Office documents, and the like -- all stored in file systems on various computers.
Much (though not all) of this sort of data is far more easily, efficiently, and effectively stored
using XML and XML data management techniques than by using traditional relational
approaches.
Looking forward
In the weeks and months ahead, I'll explore the issues I've identified here. I'll look at
a lot of topics that are almost trivial when considering one document at a time, but
become much more important when managing collections of documents. Topics I'll
explore include:
- File name extensions and MIME media types
- Using catalogs to manage central repositories of schemas and stylesheets
- Native XML databases
- XQuery
- XUpdate
- Storing XML in relational databases
- Storing relational tables in XML documents
- Managing malformed HTML
- Serving XML documents on the Web
- Content management systems
- XML streams and logging
- Version control
I'll explore both the theory and the practice of managing collections of XML documents.
Planned installments of this column will cover how to analyze needs, how to configure existing systems
for better performance and maintainability, and reviews of new technology that you might
want to consider in the future. I'll look at old issues, such as backups and version
control, with a particular focus on the changes that new XML-style data require in these systems.
I'll be looking at new issues that may not have arisen in the past, such as making sure that an
organization keeps all its schemas in sync across multiple, independent computers in
branches located around the world.
I have a lot to say about all this, and it will take me more than a few articles to
say it. But at the risk of making my task even larger than it is, I invite you to send
in your own questions, comments, and war stories about managing XML data. What problems
do you struggle with in your organizations? What's caused you grief in the past, and what
holds you back with respect to moving more data into XML in the future? Most of all, what
do you need to know that nobody's talked about yet? Let me know. I'll sort through the
responses. And, as the common themes bubble to the top, I'll address them here in the
coming months.
Resources
- Visit the LSU Museum of Natural History in Baton Rouge, Louisiana. (You'll need to make an appointment if you want to see the bird observation catalog.)
- Learn about the Library of Alexandria on Wikipedia.
- Read Cliff Stoll's book Silicon Snake Oil, which makes a very cogent argument that paper databases, especially library card catalogs, have distinct advantages compared to computerized databases.
- Check out An Introduction to Database Systems by Chris Date, who is probably the best advocate of the "relational is the one true data model" position. This book is the standard introduction to the relational model. The eighth edition now includes (grudgingly) a chapter about XML written by IBM's Nick Tindall. The XML chapter is available from the book's Web site.
- Explore Ron Bourret's solid introduction to using XML with various types of database systems.
- Find out more about DB2, the IBM software solution for information management. At its core is a powerful family of relational database management system (RDBMS) servers.
- Find hundreds more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
About the author
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a
decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn
with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his
mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where
he teaches Java technology and object-oriented programming.
His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his
spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include
Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible.
He's currently working on the XOM API for processing XML and the
XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.