Level: Introductory
Don Chamberlin (chamberl@almaden.ibm.com), IBM Fellow, IBM
03 Sep 2003
IBM's own XQuery pioneer Don Chamberlin discusses the emergence of the XQuery language -- specifically, the need for a query language for XML data, and the basic principles behind it. This excerpt is from Chapter 2 of the newly released Addison-Wesley book XQuery from the Experts.
The emergence of the World Wide Web in the 1990s was a seminal
event in human culture. Suddenly, as if overnight, a significant fraction
of the world’s computers were connected, not only by a physical network
but also by a common protocol for exchanging information. The
Web offered an unprecedented opportunity to make information truly
ubiquitous. It seemed to promise that people would no longer need to
move physically to places and times where information was available,
since all information would be everywhere, all the time.
Realizing this promise required some organizing principle for the
exchange of information. This principle had to be independent of any
particular language or application and easily extensible to new and
unanticipated kinds of information. At present, the leading candidate for this
organizing principle is the Extensible Markup Language, XML.
XML provides a neutral notation for labeling the parts of a body of information
and representing the relationships among these parts. Since
XML does not attach any semantic meaning to its labels, applications are
free to interpret them as they see fit. Applications that agree on a common
vocabulary can use XML for data interchange. Since XML does not
mandate any particular storage technique, it can be used as a common
interchange format among systems that store data in file systems, relational
databases, object repositories, and many other storage formats.
XQuery from the Experts
This book excerpt is from XQuery from the Experts by Howard Katz, Don
Chamberlin, Denise Draper, Mary Fernandez, Michael Kay, Jonathan Robie,
Michael Rys, Jerome Simeon, Jim Tivy, and Philip Wadler (ISBN 0321180607),
copyright 2004. All rights reserved. Chapter 2, titled "Influences on the
Design of XQuery," is written by Don Chamberlin. It is posted with
permission from Addison-Wesley.
Since XML is emerging as a universal format for data interchange among
disparate applications, it is natural for queries that cross application
boundaries to be framed in terms of the XML representation of data. In
other words, if an application is viewed as a source of information in
XML format, it is logical to pose queries against that XML format. This
is the basic reason why a query language for XML data is extremely
important in a connected world.
Recognizing the importance of an XML query language, the World
Wide Web Consortium (W3C) organized a query language
workshop called QL ’98, which was held in Boston in December
1998. The workshop attracted nearly a hundred participants and fostered
sixty-six papers investigating various aspects of querying XML data. One
of the long-term outcomes of the workshop was the creation of a W3C
working group for XML Query. This working group, chaired
by Paul Cotton, met for the first time in September 1999. Its initial charter
called for the specification of a formal data model and query language
for XML, to be coordinated with existing W3C standards such as XML
Schema, XML Namespaces, XML Information
Set, and XSLT. The purpose of the new query language
was to provide a flexible facility to extract information from real and
virtual XML documents. Approximately forty participants became members
of the working group, representing about twenty-five different companies,
along with a W3C staff member to provide logistical support.
One of the earliest activities of the Query working group was to draw up
a formal statement of requirements for an XML query language.
This document was quickly followed by a set of use cases that
described diverse usage scenarios for the new language,
including specific queries and expected results. The XML Query Working
Group undertook to define a language with two alternative syntaxes:
a keyword-based syntax called XQuery, optimized for
human reading and writing, and an XML-based syntax called XQueryX,
optimized for machine generation. This chapter describes only
the keyword-based XQuery syntax, which has been the major focus of the
working group.
Creating a new query language is a serious business. Many person-years
have been spent in defining XQuery, and many more will be spent on its
implementation. If the language is successful, developers of Web-based
applications will use it for many years to come. A successful query language
can enhance productivity and serve as a unifying influence in the growth of
an industry. On the other hand, a poorly designed language can inhibit the
acceptance of an otherwise promising technology. The designers of
XQuery took their responsibilities very seriously, not only in the interest of
their individual companies but also in order to make a contribution to the
industry as a whole.
The purpose of this chapter is to discuss the major influences on the design
of the XQuery language. Some of the influences on XQuery were principles of computer
language design. Others were related languages, interfaces, and standards.
Still others were "watershed issues" that were debated by the working
group and resolved in ways that guided the evolution of the language. We
discuss several of these watershed issues in detail, including the alternatives
that were considered and the reasons for the final resolution.
This chapter is based on the most recent XQuery specification at the
time of publication. At this time, the broad outline of the language can be
considered to be reasonably stable. However, readers should be cautioned
that XQuery is still a work in progress, and the design choices discussed
here are subject to change until the language has been approved
and published as a W3C recommendation.
The need for an XML Query language
Early in its history, the XML Query Working Group confronted the
question of whether XML is sufficiently different from other data formats
to require a query language of its own. The SQL language
is a very well established standard for retrieving information from relational
databases and has recently been enhanced with new facilities called
"structured types" that support nested structures similar to the nesting
of elements in XML. If SQL could be further extended to meet XML
query requirements, developers could leverage their considerable investment
in SQL implementations, and users could apply the features of
these robust and mature systems to their XML databases without learning
a completely new language.
Given these incentives, the working group conducted a study of the differences
between XML data and relational data from the point of view of
a query language. Some of the significant differences between the two
data models are summarized below.
- Relational data is "flat" -- that is, organized in the form of a
two-dimensional array of rows and columns. In contrast, XML data is
"nested", and its depth of nesting can be irregular and unpredictable.
Relational databases can represent nested data structures
by using structured types or tables with foreign keys, but it is difficult
to search these structures for objects at an unknown depth of
nesting. In XML, on the other hand, it is very natural to search for
objects whose position in a document hierarchy is unknown. An
example of such a query might be "Find all the red things", represented
in the XPath language by the expression
//*[@color = "Red"].
This query would be much more difficult
to represent in a relational query language.
- Relational data is regular and homogeneous. Every row of a table
has the same columns, with the same names and types. This allows
metadata -- information that describes the structure of the data --
to be removed from the data itself and stored in a separate catalog.
XML data, on the other hand, is irregular and heterogeneous.
Each instance of a Web page or a book chapter can have a different
structure and must therefore describe its own structure. As a
result, the ratio of metadata to data is much higher in XML than
in a relational database, and in XML the metadata is distributed
throughout the data in the form of tags rather than being separated
from the data. In XML, it is natural to ask queries that span
both data and metadata, such as "What kinds of things in the 2002
inventory have color attributes," represented in XPath by the
expression
/inventory[@year = "2002"]/*[@color].
In a relational language, such a query would require a join that might span
several data tables and system catalog tables.
- Like a stored table, the result of a relational query is flat, regular,
and homogeneous. The result of an XML query, on the other
hand, has none of these properties. For example, the result of the
query "Find all the red things" may contain a cherry, a flag, and a
stop sign, each with a different internal structure. In general, the
result of an expression in an XML query may consist of a heterogeneous
sequence of elements, attributes, and primitive values, all of
mixed type. This set of objects might then serve as an intermediate
result used in the processing of a higher-level expression. The
heterogeneous nature of XML data conflicts with the SQL assumption
that every expression inside a query returns an array of rows
and columns. It also requires a query language to provide constructors
that are capable of creating complex nested structures on the
fly -- a facility that is not needed in a relational language.
- Because of its regular structure, relational data is "dense" -- that is,
every row has a value in every column. This gave rise to the need
for a "null value" to represent unknown or inapplicable values in
relational databases. XML data, on the other hand, may be
"sparse." Since all the elements of a given type need not have the
same structure, information that is unknown or inapplicable can
simply not appear. This gives an XML query language additional
degrees of freedom for dealing with missing data.
- In a relational database, the rows of a table are not considered to
have an ordering other than the orderings that can be derived
from their values. XML documents, on the other hand, have an
intrinsic order that can be important to their meaning and cannot
be derived from data values. This has several implications for the
design of a query language. It means that queries must at least
provide an option in which the original order of elements is preserved
in the query result. It means that facilities are needed to
search for objects on the basis of their order, as in "Find the fifth
red object" or "Find objects that occur after this one and before
that one." It also means that we need facilities to impose an order
on sequences of objects, possibly at several levels of a hierarchy.
The importance of order in XML contrasts sharply with the
absence of intrinsic order in the relational data model.
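Several of these differences can be tried out in a few lines of code. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module, which supports only a limited subset of XPath. The sample inventory document and all of its element names are invented for illustration. The two path expressions are adapted from the ones quoted above: ElementTree evaluates paths relative to a context element, so the leading `//` becomes `.//`, and the test on the root element's `year` attribute is written out in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for illustration.
DOC = """
<inventory year="2002">
  <fruit color="Red"><name>cherry</name></fruit>
  <box>
    <flag color="Red"><country>Canada</country></flag>
    <crate>
      <sign color="Red" shape="octagon"/>
      <sign color="Yellow" shape="diamond"/>
    </crate>
  </box>
  <paint color="Blue"/>
  <tool><name>hammer</name></tool>
</inventory>
""".strip()

root = ET.fromstring(DOC)

# "Find all the red things": //*[@color = "Red"]
# The matches sit at different depths and have different internal structures.
red_elements = root.findall(".//*[@color='Red']")
red_things = [e.tag for e in red_elements]

# "What kinds of things in the 2002 inventory have color attributes":
# /inventory[@year = "2002"]/*[@color]
# The answer is made of element names, that is, of metadata.
kinds = []
if root.tag == "inventory" and root.get("year") == "2002":
    kinds = [e.tag for e in root.findall("./*[@color]")]

# Sparse data: the tool element simply has no color attribute at all.
tool_color = root.find("tool").get("color")

# Order: findall returns matches in document order, so a positional
# question such as "find the second red object" has a definite answer.
second_red = red_elements[1].tag

print(red_things, kinds, tool_color, second_red)
```

Note that the list of red things mixes a fruit, a flag, and a sign, each with its own substructure, which is the kind of heterogeneous result that a table of rows and columns does not accommodate directly.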
The significant data model differences summarized above led the working
group to decide that the objectives of XML queries could best be served by
designing a new query language rather than by extending a relational language.
Designing a query language for XML, however, is not a small task,
precisely because of the complexity of XML data. An XML "value," computed
by a query expression, may consist of zero, one, or many items, each
of which may be an element, an attribute, or a primitive value. Therefore,
each operator in an XML query language must be well defined for all these
possible inputs. The result is likely to be a language with a more complex
semantic definition than that of a relational language such as SQL.
Basic principles
The XML Query Working Group did not draw up a formal list of the
principles that guided the design of XQuery. Nevertheless, throughout
the design process, a reasonably stable consensus existed in the working
group about at least some of the principles that should underlie the
design of an XML query language. Some of these principles were mandated
by the charter of the working group, and others arose from
strongly held convictions of its members. The following list is my own
attempt to enumerate the basic ideas and principles that were most influential
in shaping the XQuery language. Tension exists among some of
these principles, and several design decisions were the result of an
attempt to find a reasonable compromise among conflicting principles.
- Compositionality: Perhaps the longest-standing principle in the
design of XQuery is that XQuery should be a functional language
incorporating the principle of compositionality. This means that
XQuery consists of several kinds of expressions, such as path
expressions, conditional expressions, and element constructors,
that can be composed with full generality. The result of any
expression can be used as the operand of another expression. No
syntactic constraints are imposed on the ways in which expressions
can be composed (though the language does have some
semantic constraints). Each expression returns a value that
depends only on the operands of the expression, and no expression
has any side effects. The value returned by the outermost
expression in a query is the result of the query.
- Closure: XQuery is defined as a transformation on a data model
called the Query data model. The input and output of every
query or subexpression within a query each form an instance of
the Query data model. This is what is meant by the statement that
XQuery is closed under the Query data model. The working group
spent considerable time on the definition of the Query data model
and on how instances of this model can be constructed from input
XML documents and/or serialized in the form of output XML
documents.
- Schema conformance: Since XML Schema has recently been adopted
as a W3C Recommendation, the working group considered it
highly desirable for XQuery to be based on the type system of
XML Schema. This constraint strongly influenced the design of
XQuery by providing a set of primitive types, a type-definition
facility, and an inheritance mechanism. The validation process
defined by XML Schema also strongly influenced the XQuery
facilities for constructing new elements and assigning their types.
Nevertheless, members of the working group attempted to modularize
the parts of the language that are related to type definition
and validation, so that XQuery could potentially be used with an
alternative schema language at some future time.
- XPath compatibility: Because of the widespread usage of XPath in
the XML community, a strong effort was made to maintain compatibility
between XQuery and XPath Version 1.0. Despite the
importance of this goal, it was necessary in a few areas to compromise
compatibility in order to conform to the type system of XML
Schema, because the design of XPath Version 1.0 was based on a
much simpler type system.
- Simplicity: Many members of the working group considered
simplicity of expression and ease of understanding to be primary goals
of our language design. These goals were often in conflict with
other goals, resulting in some painful compromises.
- Completeness: The working group attempted to design a language
that would be complete enough to express a broad range of queries.
The existence of a well-motivated use case was considered a strong
argument for inclusion of a language feature. The expressive power
of XQuery is comparable to the criterion of "relational completeness"
defined for database query languages, though no
such formal standard has been defined for an XML data model.
Informally, XQuery is designed to be able to construct any XML
document that can be computed from input XML documents using
the power of the first-order predicate calculus. In addition, recursive
functions add significant expressive power to the language.
- Generality: XQuery is intended for use in many different
environments and with many kinds of input documents. The language
should be applicable to documents that are described by a schema,
or by a Document Type Definition, or by neither. It
should be usable in strongly typed environments where input and
output types are well known and rigorously enforced, as well as in
more dynamic environments where input and output types may
be discovered at execution time and some data may be untyped. It
should accommodate input documents from a variety of sources,
including XML files discovered on the Web, repositories of pre-validated
XML documents, streaming data sources such as stock
tickers, and XML data synthesized from databases.
- Conciseness: In the interest of conciseness, the semantics of the
XQuery operators were defined to include certain implicit operations.
For example, arithmetic operators such as +, when applied to an element,
automatically extract the numeric value of the
element. Similarly, comparison operators such as =, when applied to
sequences of values, automatically iterate over the sequences, looking for a
pair of values that satisfies the comparison (this process is
called existential quantification). These implicit operations are
consistent with XPath Version 1.0 and were preferred over a
design that would require each operation to be explicitly specified
by the user.
- Static analysis: From the beginning, the processing of a query was
assumed to consist of two phases, called query analysis and query
evaluation (roughly corresponding to compilation and execution
of a program). The analysis phase was viewed as an opportunity to
perform optimization and to detect certain kinds of errors. A
great deal of effort went into defining the kinds of checks that
could be performed during the analysis phase and in deciding
which of these checks should be required and which should be
permitted.
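The implicit operations described under Conciseness can be modeled outside XQuery to make their behavior concrete. The following Python sketch is an illustration only and is not how the XQuery specification defines these operators: the helper names `atomize`, `general_eq`, and `add` are hypothetical, the sample `order` document is invented, and values are simplified to Python lists of elements and strings.

```python
import xml.etree.ElementTree as ET

def atomize(item):
    """Return the text content of an element; pass other values through."""
    if isinstance(item, ET.Element):
        return "".join(item.itertext())
    return item

def general_eq(left, right):
    """Existential comparison: true if any pair of values compares equal."""
    return any(atomize(a) == atomize(b) for a in left for b in right)

def add(a, b):
    """Addition that first extracts the numeric value of each operand."""
    return float(atomize(a)) + float(atomize(b))

# Hypothetical sample data; the element names are invented for illustration.
order = ET.fromstring(
    "<order><price>12.50</price><tax>1.25</tax>"
    "<tag>red</tag><tag>sale</tag></order>"
)

# Arithmetic applied directly to elements, without an explicit conversion.
total = add(order.find("price"), order.find("tax"))

# Comparison applied to sequences: one matching pair is enough.
has_sale = general_eq(order.findall("tag"), ["sale"])
has_blue = general_eq(order.findall("tag"), ["blue"])

# An empty sequence contains no pair, so the comparison is false.
empty_match = general_eq([], ["sale"])

print(total, has_sale, has_blue, empty_match)
```

The alternative design mentioned in the text would have the user write the value extraction and the iteration over pairs explicitly in every query; here they are hidden inside the operators, at the cost of making each operator's definition more involved.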
About the author
Don Chamberlin, an IBM Fellow at Almaden Research Center, is one of IBM's representatives in the W3C XML Query Working Group. He is also a co-author of the Quilt language proposal, which formed the basis for the XQuery design. Don is best known as co-inventor of the SQL database language and as author of two books on the DB2 database system. He holds a B.S. from Harvey Mudd College and a Ph.D. from Stanford University. He is also an ACM Fellow and a member of the National Academy of Engineering.