The emergence of the World Wide Web in the 1990s was a seminal event in human culture. Suddenly, as if overnight, a significant fraction of the world’s computers were connected, not only by a physical network but also by a common protocol for exchanging information. The Web offered an unprecedented opportunity to make information truly ubiquitous. It seemed to promise that people would no longer need to move physically to places and times where information was available, since all information would be everywhere, all the time.
Realizing this promise required some organizing principle for the exchange of information. This principle had to be independent of any particular language or application and easily extensible to new and unanticipated kinds of information. At present, the leading candidate for this organizing principle is the Extensible Markup Language, XML. XML provides a neutral notation for labeling the parts of a body of information and representing the relationships among these parts. Since XML does not attach any semantic meaning to its labels, applications are free to interpret them as they see fit. Applications that agree on a common vocabulary can use XML for data interchange. Since XML does not mandate any particular storage technique, it can be used as a common interchange format among systems that store data in file systems, relational databases, object repositories, and many other storage formats.
Since XML is emerging as a universal format for data interchange among disparate applications, it is natural for queries that cross application boundaries to be framed in terms of the XML representation of data. In other words, if an application is viewed as a source of information in XML format, it is logical to pose queries against that XML format. This is the basic reason why a query language for XML data is extremely important in a connected world.
Recognizing the importance of an XML query language, the World Wide Web Consortium (W3C) organized a query language workshop called QL ’98, which was held in Boston in December 1998. The workshop attracted nearly a hundred participants and fostered sixty-six papers investigating various aspects of querying XML data. One of the long-term outcomes of the workshop was the creation of a W3C working group for XML Query. This working group, chaired by Paul Cotton, met for the first time in September 1999. Its initial charter called for the specification of a formal data model and query language for XML, to be coordinated with existing W3C standards such as XML Schema, XML Namespaces, XML Information Set, and XSLT. The purpose of the new query language was to provide a flexible facility to extract information from real and virtual XML documents. Approximately forty participants became members of the working group, representing about twenty-five different companies, along with a W3C staff member to provide logistical support.
One of the earliest activities of the Query working group was to draw up a formal statement of requirements for an XML query language. This document was quickly followed by a set of use cases that described diverse usage scenarios for the new language, including specific queries and expected results. The XML Query Working Group undertook to define a language with two alternative syntaxes: a keyword-based syntax called XQuery, optimized for human reading and writing, and an XML-based syntax called XQueryX, optimized for machine generation. This chapter describes only the keyword-based XQuery syntax, which has been the major focus of the working group.
Creating a new query language is a serious business. Many person-years have been spent in defining XQuery, and many more will be spent on its implementation. If the language is successful, developers of Web-based applications will use it for many years to come. A successful query language can enhance productivity and serve as a unifying influence in the growth of an industry. On the other hand, a poorly designed language can inhibit the acceptance of an otherwise promising technology. The designers of XQuery took their responsibilities very seriously, not only in the interest of their individual companies but also in order to make a contribution to the industry as a whole.
The purpose of this chapter is to discuss the major influences on the design of the XQuery language. Some of the influences on XQuery were principles of computer language design. Others were related languages, interfaces, and standards. Still others were "watershed issues" that were debated by the working group and resolved in ways that guided the evolution of the language. We discuss several of these watershed issues in detail, including the alternatives that were considered and the reasons for the final resolution.
This chapter is based on the most recent XQuery specification at the time of publication. At this time, the broad outline of the language can be considered reasonably stable. However, readers should be cautioned that XQuery is still a work in progress, and the design choices discussed here are subject to change until the language has been approved and published as a W3C Recommendation.
Early in its history, the XML Query Working Group confronted the question of whether XML is sufficiently different from other data formats to require a query language of its own. The SQL language is a very well established standard for retrieving information from relational databases and has recently been enhanced with new facilities called "structured types" that support nested structures similar to the nesting of elements in XML. If SQL could be further extended to meet XML query requirements, developers could leverage their considerable investment in SQL implementations, and users could apply the features of these robust and mature systems to their XML databases without learning a completely new language.
Given these incentives, the working group conducted a study of the differences between XML data and relational data from the point of view of a query language. Some of the significant differences between the two data models are summarized below.
- Relational data is "flat" -- that is, organized in the form of a two-dimensional array of rows and columns. In contrast, XML data is "nested", and its depth of nesting can be irregular and unpredictable. Relational databases can represent nested data structures by using structured types or tables with foreign keys, but it is difficult to search these structures for objects at an unknown depth of nesting. In XML, on the other hand, it is very natural to search for objects whose position in a document hierarchy is unknown. An example of such a query might be "Find all the red things", represented in the XPath language by the expression //*[@color = "Red"]. This query would be much more difficult to represent in a relational query language.
- Relational data is regular and homogeneous. Every row of a table has the same columns, with the same names and types. This allows metadata -- information that describes the structure of the data -- to be removed from the data itself and stored in a separate catalog. XML data, on the other hand, is irregular and heterogeneous. Each instance of a Web page or a book chapter can have a different structure and must therefore describe its own structure. As a result, the ratio of metadata to data is much higher in XML than in a relational database, and in XML the metadata is distributed throughout the data in the form of tags rather than being separated from the data. In XML, it is natural to ask queries that span both data and metadata, such as "What kinds of things in the 2002 inventory have color attributes?", represented in XPath by the expression /inventory[@year = "2002"]/*[@color]. In a relational language, such a query would require a join that might span several data tables and system catalog tables.
- Like a stored table, the result of a relational query is flat, regular, and homogeneous. The result of an XML query, on the other hand, has none of these properties. For example, the result of the query "Find all the red things" may contain a cherry, a flag, and a stop sign, each with a different internal structure. In general, the result of an expression in an XML query may consist of a heterogeneous sequence of elements, attributes, and primitive values, all of mixed type. This set of objects might then serve as an intermediate result used in the processing of a higher-level expression. The heterogeneous nature of XML data conflicts with the SQL assumption that every expression inside a query returns an array of rows and columns. It also requires a query language to provide constructors that are capable of creating complex nested structures on the fly -- a facility that is not needed in a relational language.
- Because of its regular structure, relational data is "dense" -- that is, every row has a value in every column. This gave rise to the need for a "null value" to represent unknown or inapplicable values in relational databases. XML data, on the other hand, may be "sparse." Since all the elements of a given type need not have the same structure, information that is unknown or inapplicable can simply not appear. This gives an XML query language additional degrees of freedom for dealing with missing data.
- In a relational database, the rows of a table are not considered to have an ordering other than the orderings that can be derived from their values. XML documents, on the other hand, have an intrinsic order that can be important to their meaning and cannot be derived from data values. This has several implications for the design of a query language. It means that queries must at least provide an option in which the original order of elements is preserved in the query result. It means that facilities are needed to search for objects on the basis of their order, as in "Find the fifth red object" or "Find objects that occur after this one and before that one." It also means that we need facilities to impose an order on sequences of objects, possibly at several levels of a hierarchy. The importance of order in XML contrasts sharply with the absence of intrinsic order in the relational data model.
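The differences listed above can be made concrete with a small sketch. It uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath, so it approximates the two XPath expressions quoted above rather than running them verbatim. The sample document, its element names, and its values are invented for illustration and do not come from the working group's use cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element names and values are assumptions
# made up for this sketch.
INVENTORY = """
<inventory year="2002">
  <fruit color="Red"><name>cherry</name></fruit>
  <flag color="Red"><country>Canada</country><size>large</size></flag>
  <tool><name>hammer</name></tool>
  <signs>
    <sign color="Red"><text>STOP</text></sign>
    <sign color="Yellow"><text>YIELD</text></sign>
  </signs>
</inventory>
"""

root = ET.fromstring(INVENTORY)

# "Find all the red things" at any depth of nesting.
# Approximates //*[@color = "Red"]; note that ElementTree's ".//*"
# visits descendants of the root but not the root element itself.
red_things = root.findall(".//*[@color='Red']")
red_tags = [e.tag for e in red_things]

# The result is heterogeneous: each red thing has its own structure.
red_structure = {e.tag: [child.tag for child in e] for e in red_things}

# "What kinds of things in the 2002 inventory have color attributes?"
# Approximates /inventory[@year = "2002"]/*[@color]. ElementTree cannot
# apply a predicate to the root element, so that test is done in Python.
colored_kinds = []
if root.tag == "inventory" and root.get("year") == "2002":
    colored_kinds = [e.tag for e in root.findall("./*[@color]")]

# Sparse data: the tool simply has no color attribute, so no null
# placeholder is stored in the document.
tool_color = root.find("tool").get("color")

# Intrinsic order: results come back in document order, so "the second
# red thing" is well defined without sorting on any data value.
second_red = red_things[1].tag

print(red_tags)       # ['fruit', 'flag', 'sign']
print(red_structure)  # {'fruit': ['name'], 'flag': ['country', 'size'], 'sign': ['text']}
print(colored_kinds)  # ['fruit', 'flag']
print(tool_color)     # None
print(second_red)     # flag
```

Neither query names a table or a column in advance. The element names in the results are themselves part of the answer, which is the sense in which these queries span both data and metadata.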
The significant data model differences summarized above led the working group to decide that the objectives of XML queries could best be served by designing a new query language rather than by extending a relational language. Designing a query language for XML, however, is not a small task, precisely because of the complexity of XML data. An XML "value," computed by a query expression, may consist of zero, one, or many items, each of which may be an element, an attribute, or a primitive value. Therefore, each operator in an XML query language must be well defined for all these possible inputs. The result is likely to be a language with a more complex semantic definition than that of a relational language such as SQL.
The XML Query Working Group did not draw up a formal list of the principles that guided the design of XQuery. Nevertheless, throughout the design process, a reasonably stable consensus existed in the working group about at least some of the principles that should underlie the design of an XML query language. Some of these principles were mandated by the charter of the working group, and others arose from strongly held convictions of its members. The following list is my own attempt to enumerate the basic ideas and principles that were most influential in shaping the XQuery language. Tension exists among some of these principles, and several design decisions were the result of an attempt to find a reasonable compromise among conflicting principles.
- Compositionality: Perhaps the longest-standing principle in the design of XQuery is that XQuery should be a functional language incorporating the principle of compositionality. This means that XQuery consists of several kinds of expressions, such as path expressions, conditional expressions, and element constructors, that can be composed with full generality. The result of any expression can be used as the operand of another expression. No syntactic constraints are imposed on the ways in which expressions can be composed (though the language does have some semantic constraints). Each expression returns a value that depends only on the operands of the expression, and no expression has any side effects. The value returned by the outermost expression in a query is the result of the query.
- Closure: XQuery is defined as a transformation on a data model called the Query data model. The input and output of every query or subexpression within a query each form an instance of the Query data model. This is what is meant by the statement that XQuery is closed under the Query data model. The working group spent considerable time on the definition of the Query data model and on how instances of this model can be constructed from input XML documents and/or serialized in the form of output XML documents.
- Schema conformance: Since XML Schema has recently been adopted as a W3C Recommendation, the working group considered it highly desirable for XQuery to be based on the type system of XML Schema. This constraint strongly influenced the design of XQuery by providing a set of primitive types, a type-definition facility, and an inheritance mechanism. The validation process defined by XML Schema also strongly influenced the XQuery facilities for constructing new elements and assigning their types. Nevertheless, members of the working group attempted to modularize the parts of the language that are related to type definition and validation, so that XQuery could potentially be used with an alternative schema language at some future time.
- XPath compatibility: Because of the widespread usage of XPath in the XML community, a strong effort was made to maintain compatibility between XQuery and XPath Version 1.0. Despite the importance of this goal, it was necessary in a few areas to compromise compatibility in order to conform to the type system of XML Schema, because the design of XPath Version 1.0 was based on a much simpler type system.
- Simplicity: Many members of the working group considered simplicity of expression and ease of understanding to be primary goals of our language design. These goals were often in conflict with other goals, resulting in some painful compromises.
- Completeness: The working group attempted to design a language that would be complete enough to express a broad range of queries. The existence of a well-motivated use case was considered a strong argument for inclusion of a language feature. The expressive power of XQuery is comparable to the criterion of "relational completeness" defined for database query languages, though no such formal standard has been defined for an XML data model. Informally, XQuery is designed to be able to construct any XML document that can be computed from input XML documents using the power of the first-order predicate calculus. In addition, recursive functions add significant expressive power to the language.
- Generality: XQuery is intended for use in many different environments and with many kinds of input documents. The language should be applicable to documents that are described by a schema, or by a Document Type Definition, or by neither. It should be usable in strongly typed environments where input and output types are well known and rigorously enforced, as well as in more dynamic environments where input and output types may be discovered at execution time and some data may be untyped. It should accommodate input documents from a variety of sources, including XML files discovered on the Web, repositories of pre-validated XML documents, streaming data sources such as stock tickers, and XML data synthesized from databases.
- Conciseness: In the interest of conciseness, the semantics of the XQuery operators were defined to include certain implicit operations. For example, arithmetic operators such as +, when applied to an element, automatically extract the numeric value of the element. Similarly, comparison operators such as =, when applied to sequences of values, automatically iterate over the sequences, looking for a pair of values that satisfies the comparison (this process is called existential quantification). These implicit operations are consistent with XPath Version 1.0 and were preferred over a design that would require each operation to be explicitly specified by the user.
- Static analysis: From the beginning, the processing of a query was assumed to consist of two phases, called query analysis and query evaluation (roughly corresponding to compilation and execution of a program). The analysis phase was viewed as an opportunity to perform optimization and to detect certain kinds of errors. A great deal of effort went into defining the kinds of checks that could be performed during the analysis phase and in deciding which of these checks should be required and which should be permitted.
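The implicit operations mentioned in the list above (value extraction by arithmetic operators and existential comparison by =) can be imitated in a few lines of Python. This is a simplified sketch, not the XQuery semantics: the helper names atomize, add, and general_equals are hypothetical, every element is treated as untyped text, and the type rules that XML Schema brings to the real language are ignored.

```python
import xml.etree.ElementTree as ET

def atomize(item):
    """Return the string content of an element, or the item itself
    if it is already a plain value (hypothetical helper)."""
    if isinstance(item, ET.Element):
        return "".join(item.itertext())
    return item

def add(left, right):
    """Imitate '+': implicitly extract a numeric value from each operand."""
    return float(atomize(left)) + float(atomize(right))

def general_equals(left, right):
    """Imitate '=' on sequences: true if ANY pair of values, one from
    each side, is equal (existential quantification)."""
    right_values = [atomize(r) for r in right]
    return any(atomize(l) == r for l in left for r in right_values)

# Hypothetical sample data, invented for this sketch.
order = ET.fromstring(
    "<order>"
    "<price>10.50</price><tax>0.84</tax>"
    "<color>Red</color><color>Blue</color>"
    "</order>"
)

# Elements are passed directly; no explicit extraction step is written.
total = add(order.find("price"), order.find("tax"))

# A sequence of two color elements compared with a sequence of strings.
has_blue = general_equals(order.findall("color"), ["Blue"])
has_green = general_equals(order.findall("color"), ["Green", "Black"])

# An empty sequence never satisfies an existential comparison.
empty_match = general_equals([], ["Blue"])

print(round(total, 2))  # 11.34
print(has_blue)         # True
print(has_green)        # False
print(empty_match)      # False
```

Without the implicit behavior, the caller would have to write the extraction and the iteration by hand at every use, which is the verbosity the design sought to avoid.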
Don Chamberlin, an IBM Fellow at Almaden Research Center, is one of IBM's representatives in the W3C XML Query Working Group. He is also a co-author of the Quilt language proposal, which formed the basis for the XQuery design. Don is best known as co-inventor of the SQL database language and as author of two books on the DB2 database system. He holds a B.S. from Harvey Mudd College and a Ph.D. from Stanford University. He is also an ACM Fellow and a member of the National Academy of Engineering.