
Preserving XML queries during schema evolution

A guide to writing queries that behave well across XML schema changes

Mirella Moura Moro (mirella@inf.ufrgs.br), Research Assistant, Universidade Federal do Rio Grande do Sul
Mirella Moro is a researcher at the Universidade Federal do Rio Grande do Sul, Brazil. She received a PhD from the University of California, Riverside, and Masters and Bachelor degrees from Universidade Federal do Rio Grande do Sul. She was an intern at IBM T.J. Watson Research Center in the summer of 2006.
Susan Malaika (malaika@us.ibm.com), Senior Technical Staff Member, IBM
Susan Malaika is a senior technical staff member in IBM's Information Management Group (part of IBM Software Group). Her specialties include XML, the Web, and databases. She has developed standards that support data for grid environments at the Global Grid Forum. In addition to working as an IBM product software developer, she has also worked as an Internet specialist, a data analyst, and an application designer and developer. She has also co-authored a book on the Web and published articles on transaction processing and XML. She is a member of the IBM Academy of Technology.
Lipyeow Lim (liplim@us.ibm.com), Research Staff Member, IBM Japan, Software Group
Lipyeow Lim is a research staff member at IBM T. J. Watson Research Center. He obtained his Ph.D. from Duke University in Durham, North Carolina. His research interests lie in the area of database technology—in particular, XML databases, statistics collection, and query optimization.

Summary: 

The always evolving context of the Web imposes the challenge of how to accommodate new functionalities and new data types in the database that underlies a Web application or service. For XML databases, new schema versions can be released as frequently as once every six months. This article extends a taxonomy of changes that might apply to an XML schema during its evolution. It then examines the impact of those changes on the schema validation process (both forward and backward validations) and query evaluation. Based on the cases studied, this article proposes guidelines for XML schema evolution and for writing queries so they continue to operate as expected across evolving schemas.

As a followup to reader comments, we made updates to the XQuery sections of Listings 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 18, and 19 plus the Results sections of Listings 5, 15, 16, 18, and 19.

Date:  05 Jun 2007 (Published 07 May 2007)
Level:  Intermediate

As XML gains widespread use as an information exchange standard, the ability to persist, validate, and query XML documents becomes increasingly important. Moreover, with the proliferation of Web services and mash-ups, Web application developers increasingly need to query and transform XML messages, where such messages come directly from a Web service or indirectly from a database in which they are persisted.

Most commercial database management systems already support XML persistence in some form. For example, IBM's DB2® pureXML™ (see Resources for a link) provides support for storing XML documents natively in XML typed columns, validating XML documents against XML schemas, and querying XML documents using XQuery and SQL/XML query languages. You can store and query well-formed XML documents together in the same column even if they conform to different schemas.

Furthermore, the always evolving context of the Web imposes the challenge of accommodating new functionalities (an expansion of an organization's business, for instance) and new data types (RSS feed messages, perhaps) in the database. New schema versions can be released as frequently as once every six months—and sometimes as often as once every two weeks. Information technology architects, application developers, and database administrators often find the management of applications that operate on a column of XML documents from different schema versions confusing. Furthermore, the applications on top of the database and its users still need to interact with it regardless of its schema evolution.

For all these reasons, schema evolution is an important topic that has been researched in the context of relational, object-oriented, and XML databases. For XML databases in particular, schema evolution remains one of the more demanding use cases. Whereas most previous work considers the technical aspects of schema evolution (that is, how to store different schemas efficiently while preserving data consistency across the changes), the goal of this article is to provide guidelines that preserve query integrity on evolving schemas. If you follow the guidelines outlined here, your database can keep up with the evolving scenarios of Web applications, and the queries through which those applications interact with the database should keep working as well. The contributions of this work are summarized as follows:

  • We consider the most common types of changes that can happen during XML schema evolution and propose a taxonomy for them.
  • Based on that taxonomy, we overview the impact of each change on schema validation. We consider both cases of forward compatibility (that is, documents from the old schema are compatible with the new schema) and backward compatibility (that is, documents from the new schema are compatible with the previous schema).
  • We present and discuss the impact of schema evolution on query formulation. In other words, queries that are evaluated against the documents from the original schema might not work on the new one (and vice versa).
  • We introduce guidelines for managing XML schema evolution that describe how to control the schema changes and how to write queries across schema versions.

We begin, in "Background," by summarizing some basic concepts of XML and XQuery. In "Taxonomy of XML Schema changes," we present the taxonomy of changes that may happen during XML schema evolution. "Impact of XML Schema evolution on queries" discusses the impact and shows the interactions between schema changes, queries, and query results through examples from the insurance industry. "Guidelines for managing XML schema evolution" provides guidelines to manage schema evolution in the context of writing queries so that they continue to return expected results when the schema evolves.

Background

XML (Extensible Markup Language) is a markup language derived from SGML, and its specification is maintained by the W3C. The use of XML can facilitate the development of applications in multiple contexts. XML is widely used in Web services applications; RSS feeds are a typical example. Web services also benefit from the structured data in message-passing environments. Furthermore, XML can contribute to the establishment of open standards in file formats, facilitating the exchange of data between multiple applications on different platforms. Additionally, a common XML data structure can improve the integration between e-commerce players, allowing them to trade orders, stock information, and shipping details more efficiently. XML data encapsulation also provides improved privacy and reliability when accessing data, either through the Web or desktop applications.

Finally, you can implement efficient document search methods for structured XML data. Unlike binary data, XML is highly structured, and many parsers and tools are optimized to search and query XML data. Listing 1 illustrates a simple XML document for a library.


Listing 1. An example of a simple XML document
                
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book edition="5">
    <title>Fundamentals of Database Systems</title>
    <author>Ramez Elmasri</author>
    <author>Shamkant B. Navathe</author>
    <year>2006</year>
    <publisher>Benjamin/Cummings</publisher>
    <category>Computer Science
      <subcategory>Databases</subcategory>
    </category>
  </book>
  <book edition="3" format="hardcover">
    <title>Mists of Avalon</title>
    <info>Marion Zimmer Bradley, published by Del Rey</info>
    <category>Fiction</category>
  </book>
</library>

As the example shows, unlike relational data, XML data is usually schemaless and self-describing, while exhibiting an inherent structure. The structure is defined mostly by elements (<library> and <book>, in this example) and attributes (edition and format). Each element can have many attributes (with their respective values), other elements nested within, and text content.

Usually, an XML document must be well formed and valid for an application to process it. The document is well formed when it conforms to XML's syntax rules. For example, an XML document must have exactly one root element. The document is valid when it conforms to its XML schema. The schema defines rules for the organization and content of the document. For example, each <book> element in the library must have a <title> element and may not be nested within another <book> element. Of the many languages for defining XML schemas, one of the most common is straightforwardly called XML Schema.
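The well-formedness rules can be checked mechanically. The following is a minimal sketch in Python, assuming only the standard library's xml.etree.ElementTree parser; the helper name is_well_formed is ours. It checks syntax only. Checking validity would need a schema-aware validator, which is not shown here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One root element with properly nested children: well formed.
print(is_well_formed("<library><book/></library>"))   # True
# Two root elements: rejected by the parser.
print(is_well_formed("<library/><library/>"))         # False
# A start tag that is never closed: rejected as well.
print(is_well_formed("<book><title></book>"))         # False
```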

With information specified as an XML document, you can retrieve data with an XML query language, like XQuery. XQuery uses XPath expressions for accessing specific parts of the document. A simple XPath expression that retrieves the <subcategory> elements of books published in 2006 is shown in Listing 2.


Listing 2. An example of a simple XPath expression
                
    /library/book[year="2006"]//subcategory

XPath expressions are then incorporated into XQuery statements, which are typically written as FLWOR (For, Let, Where, Order by, Return) expressions. For example, the XQuery expression in Listing 3 returns the titles of all books published by Benjamin/Cummings, ordered by year.


Listing 3. An example XQuery
                
    for $b in doc("mylibrary.xml")//book
    where $b/publisher="Benjamin/Cummings"
    order by $b/year
    return $b/title
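For readers more at home in a general-purpose language, the clauses of Listing 3 map roughly onto a loop, a filter, a sort, and a projection. The sketch below is a Python approximation using the standard library's xml.etree.ElementTree. The inlined document is a trimmed copy of Listing 1, and the function name titles_by_publisher is our own invention, not part of any XQuery API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Listing 1 (only the fields the query touches).
LIBRARY_XML = """
<library>
  <book edition="5">
    <title>Fundamentals of Database Systems</title>
    <year>2006</year>
    <publisher>Benjamin/Cummings</publisher>
  </book>
  <book edition="3" format="hardcover">
    <title>Mists of Avalon</title>
    <info>Marion Zimmer Bradley, published by Del Rey</info>
  </book>
</library>
"""

def titles_by_publisher(root, publisher):
    # for $b in //book
    books = root.iter("book")
    # where $b/publisher = publisher (books without a publisher drop out)
    matching = [b for b in books if b.findtext("publisher") == publisher]
    # order by $b/year (a missing year sorts first in this sketch)
    matching.sort(key=lambda b: b.findtext("year", default=""))
    # return $b/title (here: the title text rather than the element)
    return [b.findtext("title") for b in matching]

root = ET.fromstring(LIBRARY_XML)
print(titles_by_publisher(root, "Benjamin/Cummings"))
```

Note that the second book simply drops out of the result because it has no <publisher> element. The same effect shows up later when queries meet documents from different schema versions.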

Taxonomy of XML Schema changes

During schema evolution, any component of a schema can change from one version to the next. In this section, we'll introduce different types of schema changes with respect to validation. First, we'll introduce a taxonomy of basic and complex changes. Then we'll present their impact on validating documents from evolving schemas.

Basic changes

A taxonomy of five types of schema changes has been proposed by the FpML Architecture Working Group in the context of FpML, the financial products markup language that specifies protocols for sharing information on, and dealing in, swaps, derivatives, and other structured products (see Resources for a link with more information about this taxonomy). Table 1 presents the taxonomy with the most common changes applicable to the schema definition. It also describes how each change affects whether document instances from the original schema are valid according to the new schema (forward compatibility) and whether document instances from the new schema are valid according to the original schema (backward compatibility).

Note that, of the basic changes, only refinement and removal might present compatibility problems, which will become clear as we analyze some examples. The complex changes introduced later in this section (Table 2) reflect real-world cases found in different industries. However, it is important to realize that, as XML schema languages (such as DTD and XML Schema) are extended, other types of changes might appear as well. Moreover, even though some of the complex changes can be reduced to combinations of the basic changes (for example, removal followed by refinement), they are presented individually to make the next section easier to follow. With most of the complex changes, document instances from an older schema typically will not validate against a newer schema, or will require some support from the application on top of the database. The examples in the next section illustrate these cases.


Table 1. Basic schema changes and their impact on validation
| Change name | Definition | Forward compatibility | Backward compatibility |
| --- | --- | --- | --- |
| Refinement | Adds optional or required elements to the schema. | Documents created from the original schema will be compatible with new schema only if elements inserted are optional (or within a choice element). | Documents created from the new schema will be compatible with original schema only if they do not instantiate the new elements. |
| Removal | Deletes elements from the schema. | Compatible only if the elements deleted were optional and not present on the instances. | Same as forward compatibility. |
| Extension | Adds new constructs to schema without major impact on current structures. Adding complex types is an example. | Compatible. | Compatible only if documents created from the new schema do not refer to the new constructs. |
| Reinterpretation | Changes the semantics of an element without changing its structure (that is, its syntax definition). | Syntactically compatible. | Same as forward compatibility. |
| Redefinition | Updates the schema definition without changing the document instances format. An example is an update that factored out common constructions. | Compatible if redefinition preserves names and types. | Same as forward compatibility. |
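To make the refinement row concrete, here is a toy sketch in Python. It is not a schema validator. The content models are plain dictionaries that we made up for illustration (child name mapped to whether it is required), and the check ignores order, cardinality, and data types. It uses this article's definitions: forward compatibility tests an old document against the new schema, and backward compatibility tests a new document against the old schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical content models for <Person>: child name -> required?
ORIGINAL = {"FirstName": True, "LastName": True, "Gender": True}
REFINED_OPTIONAL = {**ORIGINAL, "SSN": False}   # refinement with an optional element
REFINED_REQUIRED = {**ORIGINAL, "SSN": True}    # refinement with a required element

def conforms(person, model):
    """Toy check: all required children present, no children outside the model."""
    present = {child.tag for child in person}
    required = {name for name, is_required in model.items() if is_required}
    return required <= present and present <= set(model)

old_doc = ET.fromstring(
    "<Person><FirstName>Alan</FirstName><LastName>Bird</LastName>"
    "<Gender>MALE</Gender></Person>")
new_doc = ET.fromstring(
    "<Person><FirstName>Carl</FirstName><LastName>Devon</LastName>"
    "<Gender>MALE</Gender><SSN>606-23-0987</SSN></Person>")

# Forward compatibility: old document against the refined schemas.
print(conforms(old_doc, REFINED_OPTIONAL))  # True: the new element is optional
print(conforms(old_doc, REFINED_REQUIRED))  # False: the new element is required
# Backward compatibility: new document against the original schema.
print(conforms(new_doc, ORIGINAL))          # False: it instantiates <SSN>
```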

Complex changes

The changes in Table 1 are quite common, but they do not cover all possible cases. Hence, this article extends that taxonomy by including several more complex change types, as shown in Table 2.


Table 2. Complex schema changes and their impact on validation
| Change name | Definition | Forward compatibility | Backward compatibility |
| --- | --- | --- | --- |
| Element composition | Groups related elements under a new element. An example is grouping <LastName> and <FirstName> under a new element called <personName>. | Probably not compatible. | Probably not compatible. |
| Element decomposition | The opposite of element composition: ungroups subelements into individual elements. | Probably not compatible. | Probably not compatible. |
| Renaming | Updates the name of an element or attribute. | Not compatible unless the application running on top of the database compensates, for example by specifying rules for synonyms. | Same as forward compatibility. |
| Optionality | Changes the participation semantics of an element from optional to required, or vice versa. | From optional to required: Compatible only if element is always present. From required to optional: Compatible. | From optional to required: Compatible. From required to optional: Compatible only if element is always present. |
| Renumbering | Changes the cardinality of an element (that is, its multiplicity specification). An example is a move from singleton to multiple elements. | From singleton to multiple: Compatible. From multiple to singleton: Compatible only if all elements have one instance. | From singleton to multiple: Compatible only if all elements have one instance. From multiple to singleton: Compatible. |
| Retyping | Modifies the data type of an element. This includes extending or restricting an element using facets. | Not compatible unless the application (on top of the database) specifies translation rules. | Same as forward compatibility. |
| Default values | Changes the default value of an attribute or an element. | Compatible if it does not change the type of the value. | Same as forward compatibility. |
| Namespaces | Changes the namespace. | Not compatible. | Not compatible. |
| Reordering | Changes the order of the elements inside a complex type. | Sequence to all: Compatible. All to sequence: Not compatible, usually. Re-order within sequence: Not compatible. | Sequence to all: Not compatible in general. All to sequence: Compatible. Re-order within sequence: Not compatible. |

Impact of XML Schema evolution on queries

Over and above the impact of schema changes on validation, each change can also affect query formulation: Queries that are evaluated against the documents from the original schema might not work on the new schema (and vice versa). This section discusses the impact of schema evolution on queries and their results through a series of illustrative examples based on the ACORD (Association for Cooperative Operations Research and Development) schema (see Resources for a link). ACORD is an association that provides standards for the insurance, reinsurance, and related financial services industries. In the real world, a new ACORD schema is released each month, on average; the changes we use in our examples are for illustration purposes only, though, and don't reflect changes to the real schema. Queries are based on the personal information of the parties involved in a financial transaction rather than financial information from the ACORD standard.

General queries

We'll begin by presenting a few general queries that do not access any modified schema elements. Each query has its specification, the equivalent XQuery expression, and the results based on the instance documents presented in Listing 4 (which illustrates a document instance from the original schema and another from the new schema). For clarity, the listing shows the document instances themselves rather than the schema specification.

Listings 5, 6, and 7 show that the results are similar if the queries do not access the modified elements. Specifically, Listing 5 retrieves the whole transaction from each instance with no constraints on the query; Listing 6 retrieves one specific element (the last name) from the transactions; and Listing 7 is the same as Listing 6 except that it groups the result by transaction ID. The examples are simple and show that there is no problem when the evolving part of the schema is not evaluated on any query. However, this is not always the case, as the next set of examples demonstrates.


Listing 4. Document instances on schema update: Refinement of <SSN> element and removal of <BirthDate> element

Document instance from original schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1001">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution </TransType>
  <OLifE>
    <Party id="ID1101">
      <Person>
        <FirstName>Alan</FirstName>
        <LastName>Bird</LastName>
        <Gender>MALE</Gender>
      </Person>
    </Party>
    <Party id="ID1102">
      <Person>
        <FirstName>Anthony</FirstName>
        <LastName>Bell</LastName>
        <Gender>MALE</Gender>
        <BirthDate>1975-11-14</BirthDate>
      </Person>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>


Document instance from new schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1002">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution </TransType>
  <OLifE>
      <Party id="ID1103">
        <Person>
          <FirstName>Carl</FirstName>
          <LastName>Devon</LastName>
          <Gender>MALE</Gender>
          <SSN>606-23-0987</SSN>
        </Person>
      </Party>
      <Party id="ID1104">
        <Person>
          <FirstName>Cinthia</FirstName>
          <LastName>Din</LastName>
          <Gender>FEMALE</Gender>
        </Person>
      </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>


Listing 5. Retrieve all transactions for Schema Evolution examples
XQuery
for $trans in //TXLifeRequest return $trans

Result
The result is a sequence of TXLifeRequest elements including the transactions from both the original and the new schemas.

Listing 6. Retrieve values for all <LastName> elements
XQuery
for $trans in //TXLifeRequest//LastName 
return <result>{$trans}</result>

Result
<result><LastName>Bird</LastName></result>
<result><LastName>Bell</LastName></result>
<result><LastName>Devon</LastName></result>
<result><LastName>Din</LastName></result>


Listing 7. Retrieve values for all <LastName> elements grouped by transaction
XQuery
for $trans in //TXLifeRequest 
return 
<TX>
  {$trans/@id}
  <Names>
    {$trans//LastName}
  </Names>
</TX>

Result
<TX id="TXLifeRequest1001">
  <Names>
    <LastName>Bird</LastName>
    <LastName>Bell</LastName>
  </Names>
</TX>
<TX id="TXLifeRequest1002">
  <Names>
    <LastName>Devon</LastName>
    <LastName>Din</LastName>
  </Names>
</TX>

Basic changes

Out of the five types of basic changes previously specified, our discussion will concentrate on refinement and removal because of their impact on query formulation and results. The other changes do not significantly affect query evaluation (extension causes no problem unless new elements are defined using the new construct—which is then equivalent to refinement; reinterpretation does not affect the structure of an element and neither does redefinition). Hence, only refinement and removal are relevant for the next sections. All examples discussed in this section illustrate points that serve as the base for the guidelines that we outline later.

Refinement

The first basic change presented in Listing 4 is a refinement of the ACORD schema: an optional element is added for a party's Social Security number (<SSN>). Listing 8 returns the last names of all parties from the transactions and the SSN of those who have one. Therefore, the names of four parties (from both the original and new documents) and one SSN are retrieved. Even though the query accesses a new element, that access is limited to the return clause, so it does not restrict the result. Listing 10, on the other hand, calls fn:exists in its where clauses to request the names of only those parties who have an SSN. Query processing is therefore effectively limited to document instances from the new schema, because every instance from the previous schema evaluates the where clause as false. Listing 10 also gives a hint of how not to write queries when the goal is to keep them working across different schemas.


Listing 8. Retrieve last name and SSN grouped by party and transaction
XQuery
for $trans in //TXLifeRequest
return 
<TX>
  { $trans/@id }
  { for $P in $trans//Party 
    return 
    <Party> 
      {$P//LastName}
      {$P//SSN} 
    </Party> } 
</TX>

Result
<TX id="TXLifeRequest1001">
  <Party><LastName>Bird</LastName></Party>
  <Party><LastName>Bell</LastName></Party>
</TX>
<TX id="TXLifeRequest1002">
  <Party>
        <LastName>Devon</LastName>
        <SSN>606-23-0987</SSN>
  </Party>
  <Party><LastName>Din</LastName>
  </Party>
</TX>
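The difference between projecting a new element (as Listing 8 does) and filtering on it (as Listing 10 does) can also be seen outside XQuery. The following Python sketch uses the standard library's xml.etree.ElementTree on trimmed copies of the two Listing 4 instances. The function names are ours, and the code is an approximation of the two queries, not a translation of them.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two Listing 4 instances (only the fields used here).
ORIGINAL = ET.fromstring("""
<TXLifeRequest id="TXLifeRequest1001">
  <Party id="ID1101"><Person><LastName>Bird</LastName></Person></Party>
  <Party id="ID1102"><Person><LastName>Bell</LastName></Person></Party>
</TXLifeRequest>""")
NEW = ET.fromstring("""
<TXLifeRequest id="TXLifeRequest1002">
  <Party id="ID1103"><Person><LastName>Devon</LastName><SSN>606-23-0987</SSN></Person></Party>
  <Party id="ID1104"><Person><LastName>Din</LastName></Person></Party>
</TXLifeRequest>""")

def names_and_ssn(trans):
    """Non-restrictive: SSN is only projected, so parties without one stay."""
    return [(p.findtext(".//LastName"), p.findtext(".//SSN"))
            for p in trans.iter("Party")]

def names_having_ssn(trans):
    """Restrictive: SSN is used as a filter, so parties without one vanish."""
    return [p.findtext(".//LastName")
            for p in trans.iter("Party")
            if p.find(".//SSN") is not None]

for trans in (ORIGINAL, NEW):
    print(trans.get("id"), names_and_ssn(trans), names_having_ssn(trans))
```

The projecting function returns every party from both schema versions, while the filtering function returns nothing at all for the original instance.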
 


Listing 9. Document instances on schema update: Element composition

Document instance from original schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1001">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1101">
      <Person>
        <FirstName>Alan</FirstName>
        <LastName>Bird</LastName>
        <Gender>MALE</Gender>
      </Person>
      <Address>
        <Line1>998 Mamaroneck Ave</Line1>
        <City>White Plains</City>
        <AddressStateTC tc="60">NY</AddressStateTC>
        <Zip>10605</Zip>
      </Address>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>


Document instance from new schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1002">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1103">
      <Person>
        <FirstName>Carl</FirstName>
        <LastName>Devon</LastName>
        <Gender>MALE</Gender>      
      </Person>
      <Address>20 Fifth Ave, New York, NY 10011</Address>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>

Removal

The second change illustrated in Listing 4 is the removal of the <BirthDate> element from the ACORD schema. Listing 11 returns a list of birth dates for the first document instance and an empty list for the second. If you need to retrieve only those transactions that contain birth date elements, add the fn:exists function to the where clause, as Listing 10 does for <SSN>. In that case, only documents from the original schema contribute to the result.


Listing 10. Retrieve last names grouped by party and transaction from only those parties that have an SSN
XQuery
for $trans in //TXLifeRequest 
where fn:exists($trans//SSN)
return
<TX>
  {$trans/@id}
  { for $P in $trans//Party 
  where fn:exists($P//SSN)
  return 
    <Party> {$P//LastName}</Party> } 
</TX>

Result
<TX id="TXLifeRequest1002">
  <Party>
    <LastName>Devon</LastName>
  </Party>
</TX>


Listing 11. Retrieve transaction IDs and their list of birth dates
XQuery
for $trans in //TXLifeRequest 
return
<TX>
  {$trans/@id}
  { for $P in $trans//Party//BirthDate 
    return $P} 
</TX>

Result
<TX id="TXLifeRequest1001">
  <BirthDate>1975-11-14</BirthDate>
</TX>
<TX id="TXLifeRequest1002"/>


Listing 12. Retrieve TXLifeRequest ID and list of addresses
XQuery
for $trans in //TXLifeRequest 
return
<TX>
  {$trans/@id}
  { for $P in $trans//Address return $P}
</TX>

Result
<TX id="TXLifeRequest1001">
  <Address>
    <Line1>998 Mamaroneck Ave</Line1>
    <City>White Plains</City>
    <AddressStateTC tc="60">NY</AddressStateTC>
    <Zip>10605</Zip>
  </Address>
</TX>
<TX id="TXLifeRequest1002">
  <Address>20 Fifth Ave, New York,
        NY 10011</Address>
</TX>

Complex changes

This section discusses the impact of the complex changes on query formulation. As with the basic changes, each query has its specification, the equivalent XQuery statement, and its result.

Element composition

This change is described in Listing 9, where the subelements of <Address> are composed in one plain element (with no substructure). Listing 12 presents an interesting example of retrieving the addresses because it returns the address information from both instances, but the results present different structures.

Two problems with this kind of change are:

  • First, the new composed element might have a different name from the original structured element. This change has consequences similar to those of renaming.
  • Second, the query might refer to a specific substructure of the original schema. For example, Listing 14 requests only the addresses' zip code. Therefore, the result returns only the addresses from the first instance because the <Zip> element does not exist in the second instance. The impact on the query in this last case is extremely important because, even though no data is lost from one document to the other, the qualification of the data does not exist anymore.
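One way an application might cope with the second problem is to fall back to parsing the composed text when the substructure is missing. The Python sketch below illustrates that idea with the standard library only. The function name zip_code and the regular expression are our assumptions: they presume that a composed address ends with a US-style zip code, which holds for the instance in Listing 9 but not for free-form addresses in general.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a composed address ends with a 5-digit zip, optionally ZIP+4.
ZIP_AT_END = re.compile(r"\d{5}(?:-\d{4})?\s*$")

def zip_code(address):
    """Return the zip code of a structured or a composed <Address> element."""
    zip_elem = address.find("Zip")
    if zip_elem is not None and zip_elem.text:
        # Original schema: the qualification is explicit in the markup.
        return zip_elem.text.strip()
    # New schema: the qualification is gone, so guess from the text.
    match = ZIP_AT_END.search(address.text or "")
    return match.group(0).strip() if match else None

structured = ET.fromstring(
    "<Address><Line1>998 Mamaroneck Ave</Line1><City>White Plains</City>"
    "<Zip>10605</Zip></Address>")
composed = ET.fromstring("<Address>20 Fifth Ave, New York, NY 10011</Address>")

print(zip_code(structured))  # 10605
print(zip_code(composed))    # 10011
```

Such a fallback is fragile by nature, which is why losing the qualification of the data matters even when no data is lost.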

Listing 13. Document instances on schema update: Renaming

Document instance from original schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1001">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1101">
      <Person>
        <FirstName>Alan</FirstName>
        <LastName>Bird</LastName>
        <Gender>MALE</Gender>
      </Person>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>


Document instance from new schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1002">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1103">
      <Person>
        <fName>Carl</fName>
        <lName>Devon</lName>
        <Gender>MALE</Gender>
      </Person>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>


Listing 14. Retrieve TXLifeRequest ID and list of zip codes of addresses
XQuery
for $trans in //TXLifeRequest 
return 
<TX>
  {$trans/@id}
  { for $P in $trans//Address//Zip return $P } 
</TX>

Result
<TX id="TXLifeRequest1001">
  <Zip>10605</Zip>
</TX>
<TX id="TXLifeRequest1002"/>

Element decomposition

The example here is the inverse of the situation in Listing 9: We invert the original and new documents. If the name of the structure that is decomposed remains the same, then the only difference in the queries is that the results from the new schema are structured. The same problems from composition apply inversely to decomposition.

Renaming

Listing 13 illustrates an example where the elements <FirstName> and <LastName> were renamed to <fName> and <lName>. The impact on the queries depends on whether or not these element names are referred to directly. For example, Listing 15 requests persons per transaction, so the results from both instances are retrieved. However, if the query specifically returned //Person/LastName, then only the persons from the first instance would be retrieved.

One way to solve this problem is for the application that runs on top of the database to keep a table of synonyms. Another way is to specify all names in the query. This solution is less elegant and requires specific knowledge about each schema version. An example is presented in Listing 16.
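A table of synonyms kept by the application could look like the following Python sketch. The table layout and the helper name find_by_synonym are hypothetical. The element names come from Listing 13.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table: canonical name -> names used across schema versions.
SYNONYMS = {
    "FirstName": ("FirstName", "fName"),
    "LastName": ("LastName", "lName"),
}

def find_by_synonym(person, canonical):
    """Return the text of the first child matching any synonym of the name."""
    for name in SYNONYMS.get(canonical, (canonical,)):
        elem = person.find(name)
        if elem is not None:
            return elem.text
    return None

old_person = ET.fromstring(
    "<Person><FirstName>Alan</FirstName><LastName>Bird</LastName>"
    "<Gender>MALE</Gender></Person>")
new_person = ET.fromstring(
    "<Person><fName>Carl</fName><lName>Devon</lName>"
    "<Gender>MALE</Gender></Person>")

print(find_by_synonym(old_person, "LastName"))  # Bird
print(find_by_synonym(new_person, "LastName"))  # Devon
```

The application keeps asking for the canonical name, and only the table has to change when another schema version renames the element again.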

Optionality

This change is in some ways related to refinement and removal. Changing an element's optionality from optional to required is related to refinement for those optional elements that are not instantiated in the original documents. For example, changing the <SSN> element from optional to required has the same impact presented in Listing 8 and Listing 10, considering that the original instances do not present a value for it. Likewise, changing an element's optionality from required to optional affects the results of the queries that reference the formerly required elements; the result might even be empty. Changing the element <BirthDate> from required to optional has the same impact presented in Listing 11, considering that the new instances do not present a value for it. The same concerns from refinement and removal therefore apply here as well.


Listing 15. Retrieve TXLifeRequest ID and list of persons
XQuery
for $trans in //TXLifeRequest
return 
<TX>
  {$trans/@id}
  { for $P in $trans//Person return $P}
</TX>

Result
<TX id="TXLifeRequest1001">
  <Person>
    <FirstName>Alan</FirstName>
    <LastName>Bird</LastName>
    <Gender>MALE</Gender>
  </Person>
</TX>
<TX id="TXLifeRequest1002">
  <Person>
    <fName>Carl</fName>
    <lName>Devon</lName>
    <Gender>MALE</Gender>
  </Person>
</TX>


Listing 16. Retrieve TXLifeRequest ID and list of last names under both element names
XQuery
for $trans in //TXLifeRequest 
return 
<TX>
  {$trans/@id}
  { for $P in $trans//Person/LastName return $P}
  { for $P in $trans//Person/lName return $P}
</TX>                    

Result
<TX id="TXLifeRequest1001">
    <LastName>Bird</LastName>
</TX>
<TX id="TXLifeRequest1002">
    <lName>Devon</lName>
</TX>

Renumbering

A schema also specifies the cardinality of an element, that is, whether the element appears once (singleton) or several times under the same parent. Changing from singleton to multiple elements means that results can contain more instances of the same element. Changing from multiple elements to singleton is more straightforward because it just reduces the cardinality of the results. Listing 17 illustrates an example in which the new document instance has three phone numbers, as opposed to only one in the original document. A query that retrieves the phones of the parties (similar to Listing 12) returns only one number for the parties from the original schema, and multiple numbers for the new one.
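In application code, the practical consequence is to treat such elements as sequences even while the schema still says singleton. The Python sketch below contrasts the two habits on trimmed copies of the Listing 17 persons. The function names are ours.

```python
import xml.etree.ElementTree as ET

ORIGINAL = ET.fromstring(
    "<Person><LastName>Birch</LastName><Phone>212 123 4567</Phone></Person>")
NEW = ET.fromstring(
    "<Person><LastName>Devon</LastName><Phone>212 234 4567</Phone>"
    "<Phone>212 234 5678</Phone><Phone>212 234 6789</Phone></Person>")

def first_phone(person):
    # Assumes a singleton: silently drops any additional numbers.
    return person.findtext("Phone")

def all_phones(person):
    # Works for both cardinalities: always returns a list.
    return [phone.text for phone in person.findall("Phone")]

print(first_phone(NEW))       # 212 234 4567 (two numbers are lost)
print(all_phones(ORIGINAL))   # ['212 123 4567']
print(all_phones(NEW))        # ['212 234 4567', '212 234 5678', '212 234 6789']
```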


Listing 17. Document instances on schema updates: Renumbering and retyping

Document instance from original schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1001">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1101">
      <Person>
        <FirstName>Alan</FirstName>
        <LastName>Birch</LastName>
        <Gender>MALE</Gender>
        <Phone>212 123 4567</Phone>
      </Person>
      <Address>
        <Line1>202 W 101st St</Line1>
        <City>New York</City>
        <AddressStateTC tc="60">NY</AddressStateTC>
        <Zip>10025</Zip>
      </Address>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>
   


Document instance from new schema
                            
<TXLife> 
<TXLifeRequest id="TXLifeRequest1002">
  <TransRefGUID>2006-0712-1001</TransRefGUID>
  <TransType tc="103">Schema Evolution</TransType>
  <OLifE>
    <Party id="ID1103">
      <Person>
        <FirstName>Carl</FirstName>
        <LastName>Devon</LastName>
        <Gender>MALE</Gender>   
        <Phone>212 234 4567</Phone>
        <Phone>212 234 5678</Phone>   
        <Phone>212 234 6789</Phone>
      </Person>
      <Address>
        <Line1>80 W 109th St</Line1>
        <City>New York</City>
        <AddressStateTC tc="60">NY</AddressStateTC>
        <Zip>10025-2638</Zip>
      </Address>
    </Party>
  </OLifE>
</TXLifeRequest>
</TXLife>
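
On the application side, renumbering is easiest to absorb when the code always treats the element as a sequence. The following Python sketch (standard xml.etree.ElementTree module) uses trimmed-down Person fragments based on Listing 17; the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Person fragments trimmed from the two document instances in Listing 17.
ORIGINAL = ("<Person><LastName>Birch</LastName>"
            "<Phone>212 123 4567</Phone></Person>")
NEW = ("<Person><LastName>Devon</LastName>"
       "<Phone>212 234 4567</Phone>"
       "<Phone>212 234 5678</Phone>"
       "<Phone>212 234 6789</Phone></Person>")


def all_phones(person_xml):
    """Return every Phone value; works for one phone or many."""
    person = ET.fromstring(person_xml)
    return [p.text for p in person.findall("Phone")]


def first_phone(person_xml):
    """Return only the first Phone value; silently ignores the others."""
    return ET.fromstring(person_xml).findtext("Phone")


print(all_phones(ORIGINAL))
print(all_phones(NEW))
print(first_phone(NEW))
```

Code written like first_phone keeps running after the change but drops the additional numbers, which is the kind of silent difference in results that renumbering introduces.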

Retyping

This is probably the trickiest change from the point of view of compatibility and query impact, because the queries must be explicitly adapted to keep working. For example, consider a query that specifies an integer comparison for a value v in the where clause; later, the type of v is changed to string. In this case, casting functions (like $elem cast as xs:integer) are needed to keep the queries working.
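
To see why the cast matters, the following Python sketch compares a value as a string and as an integer. The Order and Quantity elements are hypothetical and stand in for the value v; they are not part of the schemas discussed in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: Quantity was an integer in the old schema
# and is declared as a string in the new one.
DOC = "<Order><Quantity>9</Quantity></Order>"


def below_as_string(xml_text, limit):
    """Compare the raw text with the limit as strings (no cast)."""
    return ET.fromstring(xml_text).findtext("Quantity") < str(limit)


def below_as_integer(xml_text, limit):
    """Cast the text to an integer first, like `cast as xs:integer`."""
    return int(ET.fromstring(xml_text).findtext("Quantity")) < limit


print(below_as_string(DOC, 10))   # string comparison: "9" < "10" is False
print(below_as_integer(DOC, 10))  # integer comparison: 9 < 10 is True
```

The query does not fail without the cast; it simply applies string ordering, which is why a retyping change can go unnoticed.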

This type of change also includes scenarios in which an element is extended or restricted using facets. The constraining facets include length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minExclusive, minInclusive, totalDigits, and fractionDigits. From the query formulation perspective, these changes affect queries with comparison constraints. In addition, the results for the affected elements can also change. For example, Listing 17 illustrates a change in the zip code pattern from <pattern value='[0-9]{5}'/> to <pattern value='[0-9]{5}(-[0-9]{4})?'/>; in other words, the new pattern adds the optional four-digit extension. General queries that simply retrieve the element contents, like Listing 18, work with no problem. On the other hand, comparison constraints can easily break the query, as illustrated in Listing 19.

Default values

Schema validation adds default values for omitted attributes and empty elements in a document instance. Changing the default value does not affect the query formulation if the data type is the same, but it affects the results of queries on those values. Also, special attention is needed for queries that retrieve all values except the default ones by using comparison constraints. In this case, you should use different queries to access documents from different schemas.


Listing 18. Retrieve TXLifeRequest ID and list of addresses
XQuery
for $trans in //TXLifeRequest
return 
<TX>
  {$trans/@id}
  { for $P in $trans//Address return $P}
</TX>

Result
<TX id="TXLifeRequest1001">
  <Address><Line1>202 W 101st St</Line1>
  <City>New York</City>
  <AddressStateTC tc="60">NY</AddressStateTC>
  <Zip>10025</Zip></Address>
</TX>
<TX id="TXLifeRequest1002">
  <Address>   <Line1>80 W 109th St </Line1>
    <City>New York</City>
    <AddressStateTC tc="60">NY</AddressStateTC>
  <Zip>10025-2638</Zip></Address>
</TX>


Listing 19. Retrieve TXLifeRequest ID and list of addresses with the zip code 10025
XQuery
for $trans in //TXLifeRequest 
return 
<TX>
  {$trans/@id}
  { for $P in $trans//Address 
    where $P/Zip="10025"
    return $P}
</TX>

Result
<TX id="TXLifeRequest1001">
  <Address>   <Line1>202 W 101st St</Line1> 
  <City>New York</City>
  <AddressStateTC tc="60">NY</AddressStateTC> 
  <Zip>10025</Zip>
  </Address>
</TX>
<TX id="TXLifeRequest1002"></TX>
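
One possible way to keep such a comparison working across both zip code patterns is to compare only the first five digits (in XQuery, something like substring($P/Zip, 1, 5) = "10025"). Whether that matches the intended semantics is an application decision. The following Python sketch shows both behaviors on trimmed-down documents; the function name and the exact flag are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down documents: one zip code per schema version.
DOCS = """<TXLife>
  <TXLifeRequest id="TXLifeRequest1001">
    <Address><Zip>10025</Zip></Address>
  </TXLifeRequest>
  <TXLifeRequest id="TXLifeRequest1002">
    <Address><Zip>10025-2638</Zip></Address>
  </TXLifeRequest>
</TXLife>"""


def requests_in_zip(xml_text, zip5, exact=False):
    """Return ids of requests that have an address in the given zip code.

    exact=True compares the whole value, as in Listing 19;
    exact=False compares only the five-digit prefix.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for request in root.iter("TXLifeRequest"):
        zips = [(z.text or "").strip() for z in request.iter("Zip")]
        if exact:
            matched = zip5 in zips
        else:
            matched = any(z[:5] == zip5 for z in zips)
        if matched:
            hits.append(request.get("id"))
    return hits


print(requests_in_zip(DOCS, "10025", exact=True))
print(requests_in_zip(DOCS, "10025"))
```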

Namespaces

In some cases of XML schema evolution, different namespaces denote different versions of the schema. Namespace-sensitive queries silently return no results on documents whose namespace differs from the one specified in the query. If the desired behavior is to return the same results across versions, write the query with a wildcard for the namespace (for example, *:TXLifeRequest).
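
The effect can be reproduced outside an XML database. The following Python sketch uses the standard xml.etree.ElementTree module, which accepts {*} as a namespace wildcard in Python 3.8 and later. The namespace URIs are hypothetical; they only stand in for two schema versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per schema version.
V1 = ('<TXLife xmlns="http://example.org/txlife/v1">'
      '<TXLifeRequest id="TXLifeRequest1001"/></TXLife>')
V2 = ('<TXLife xmlns="http://example.org/txlife/v2">'
      '<TXLifeRequest id="TXLifeRequest1002"/></TXLife>')

# Namespace-sensitive path: only matches the v1 namespace.
STRICT = ".//{http://example.org/txlife/v1}TXLifeRequest"
# Wildcard path: matches TXLifeRequest in any namespace.
WILDCARD = ".//{*}TXLifeRequest"


def request_ids(xml_text, path):
    """Return the id attribute of every element selected by `path`."""
    root = ET.fromstring(xml_text)
    return [r.get("id") for r in root.findall(path)]


print(request_ids(V1, STRICT))    # found
print(request_ids(V2, STRICT))    # silently empty
print(request_ids(V2, WILDCARD))  # found again
```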

Re-ordering

In general, most XML queries are not order sensitive. However, positional predicates are sometimes used in XML queries, like so:

//OLifE/Party[2]

If a schema change re-orders the elements, such a query still runs but can return the wrong result, which might have serious consequences for the application functionality.
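
The following Python sketch contrasts a positional predicate with a selection by identifier, which does not depend on element order. The Party ids come from the earlier document instances; the two OLifE fragments and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The same two parties in two different orders (illustrative fragments).
ORIGINAL_ORDER = '<OLifE><Party id="ID1101"/><Party id="ID1103"/></OLifE>'
REORDERED = '<OLifE><Party id="ID1103"/><Party id="ID1101"/></OLifE>'


def second_party(xml_text):
    """Positional selection, like //OLifE/Party[2]."""
    return ET.fromstring(xml_text).find("Party[2]").get("id")


def party_by_id(xml_text, party_id):
    """Order-independent selection by the id attribute."""
    party = ET.fromstring(xml_text).find(f"Party[@id='{party_id}']")
    return None if party is None else party.get("id")


print(second_party(ORIGINAL_ORDER))
print(second_party(REORDERED))
print(party_by_id(REORDERED, "ID1103"))
```

The positional version returns a different party after re-ordering without raising any error, whereas the selection by id is unaffected.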

Guidelines for managing XML schema evolution

In this section, we'll discuss practical issues involved in controlling schema evolution and present a compact set of guidelines for writing queries across schema versions.

Controlling schema changes

Keeping queries working as schemas evolve raises the related question of how to manage the evolving schemas themselves. How to manage XML schema evolution depends on several factors, such as:

  • Who has control over the schema changes
  • Who has control over the application semantics (for example, queries, updates, schema validation)
  • What types of schema changes are allowed
  • What types of applications (the type of query) are allowed

Table 3 shows the possible combinations of control over schema changes and control over applications, with examples of the roles that fall into each of the four possibilities.

Of these four roles, the XML data and application architects have the most freedom in how they design their schema changes and applications to manage schema evolution. At the opposite end, the plain XML user can do virtually nothing to manage schema evolution. In between, application architects have control only over how they write the queries in their applications. In general, they need to decide when their applications may break and how the queries in those applications break. The options for when the applications can break are:

  • The application works across all versions of the XML schema
  • The application breaks for major version changes
  • The application breaks even for minor version changes

Table 3. Different options to control applications and schema changes

                             | Control over schema changes | No control over schema changes
Control over applications    | XML data and application architects, who design and develop both the XML schema and the application. | XML application architects, who design and develop applications that consume XML data conforming to some standardized XML schema (not designed by the same architects).
No control over applications | XML data architects or standards organizations, who design XML schemas. | XML users, who run off-the-shelf applications (for example, XML schema validation) over standardized XML formats.

Likewise, evaluating queries on different schemas might result in the following outcomes:

  • The query returns the correct results (where correctness is defined by the semantics of the application)
  • The query returns no results
  • The query returns incorrect results

When an application encounters an XML document from a new version of the schema, the application architect needs to decide which of these three outcomes the application requires and how to process the outcomes.

In general, write queries for very generic information (such as message identifiers or social security numbers) so that the correct result is returned across all schema versions. In applications that are highly dependent on major version changes, design the queries to either silently return no results or detect the major version changes and flag an exception. Likewise, in applications that are sensitive to minor version changes, design the queries to silently return no results or flag an exception. Leaving a version change undetected and allowing incorrect results to be processed is generally discouraged.

This leads to the question of what constitutes a major or minor version change, which is partly determined by the application semantics. However, because the application program needs to detect the major or minor version change, the architect must also consider how the changes are encoded. Deciding how to encode the version and whether a new schema is a minor or major version change falls to the XML data architect, who needs to consider two aspects. First, there is the problem of how to encode a schema version, which is usually solved with one of two approaches:

  • The namespace encodes the version
  • An explicit version element or attribute is included in the XML documents

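Each method has its advantages and shortcomings. For example, a namespace change causes namespace-sensitive queries to silently return no results, whereas an explicit version element or attribute has to be checked by the application or the query itself.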
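
Both encodings can be detected by an application before it runs any query. The following Python sketch is one way to do that, assuming a version attribute named version on the root element and a hypothetical namespace URI; neither name is taken from an actual standard schema.

```python
import xml.etree.ElementTree as ET


def schema_version(xml_text):
    """Return (encoding, value) describing how the version is encoded."""
    root = ET.fromstring(xml_text)
    explicit = root.get("version")  # assumed attribute name
    if explicit is not None:
        return ("attribute", explicit)
    if root.tag.startswith("{"):
        # ElementTree stores qualified names as {namespace}local-name.
        return ("namespace", root.tag[1:].split("}", 1)[0])
    return ("unknown", None)


def require_major(xml_text, supported_majors):
    """Flag an exception when the major version is not supported."""
    encoding, value = schema_version(xml_text)
    if encoding != "attribute":
        raise ValueError("document carries no explicit version")
    major = value.split(".")[0]
    if major not in supported_majors:
        raise ValueError(f"unsupported major version: {major}")
    return major


print(schema_version('<TXLife version="2.1"/>'))
print(schema_version('<TXLife xmlns="http://example.org/txlife/v2"/>'))
print(require_major('<TXLife version="2.1"/>', {"2"}))
```

Raising an exception corresponds to the "flag an exception" option discussed earlier; returning an empty result instead would correspond to silently returning no results.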

Second, the XML data architect needs to have some understanding of the different applications or consumers of the XML data (how the XML documents will be processed, for example) in order to decide whether a schema change is a major or minor change.

Writing queries across schema versions

This article assumes that a Web application or service must work across all versions of the XML schema and that the query must return the correct results across all versions as well. In this context, and based on the examples from the previous sections, we developed the following initial set of guidelines to keep queries working across schema versions.

  • Do not add required elements (or attributes) to the middle of the hierarchy. If such an addition is unavoidable, write the queries with the ancestor or descendant axis in XPath; queries that spell out the full path will not work on the new schema.

  • Likewise, do not delete required elements (or attributes) from the middle of the hierarchy.

  • Do not change the order of the elements on the schema when the queries consider ordered predicates.

  • Do not change atomic types (string, integer, etc.) if queries are strongly typed or have value comparisons.

  • Do not change the name of elements that are referred to in any query. If changing an element name is unavoidable, then consider including a synonym control in the application such that the old and the new names refer to the same element.

  • Likewise, do not change the type or facet of elements that are referred to in any query. If retyping is unavoidable, then you must update the queries as well, by adding the respective cast functions.

  • If you use namespaces to distinguish schema versions, then consider adding wildcards (*) to the namespace specification within the query.

  • Certain XPath functions, such as exists, need special attention, because they can restrict the query evaluation to the set of versions that have a specific element. Consider the use of such functions carefully.

  • Querying composed or decomposed elements returns results with different structures. On the other hand, if composed or decomposed elements are evaluated in comparisons (within a where clause) or in a path expression, then the query will probably not consider document instances from all schema versions. Review queries that access composed or decomposed elements to ensure the appropriate behavior.
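
The first guideline can be illustrated with a short Python sketch. The Parties wrapper element in the new document is hypothetical; it only stands in for a required element added to the middle of the hierarchy.

```python
import xml.etree.ElementTree as ET

OLD = ("<TXLifeRequest><OLifE><Party><Person>"
       "<LastName>Birch</LastName>"
       "</Person></Party></OLifE></TXLifeRequest>")
# Hypothetical new version: a Parties wrapper is added below OLifE.
NEW = ("<TXLifeRequest><OLifE><Parties><Party><Person>"
       "<LastName>Devon</LastName>"
       "</Person></Party></Parties></OLifE></TXLifeRequest>")

# Full path from the root: depends on every intermediate element.
FULL_PATH = "OLifE/Party/Person/LastName"
# Descendant axis: only depends on the Person/LastName relationship.
DESCENDANT = ".//Person/LastName"


def last_names(xml_text, path):
    """Return the text of every element selected by `path`."""
    return [e.text for e in ET.fromstring(xml_text).findall(path)]


print(last_names(OLD, FULL_PATH))
print(last_names(NEW, FULL_PATH))   # empty: the wrapper breaks the path
print(last_names(NEW, DESCENDANT))
```

The descendant form trades precision for robustness: it also matches a Person/LastName pair that appears elsewhere in the document, so it fits best when the element names are unambiguous.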

Summary

This article has presented an extended taxonomy of the changes that can happen during XML schema evolution. We discussed in detail the impact of schema evolution on validation, query formulation, and query results. Based on that discussion, we proposed an initial set of guidelines to preserve queries across different schema versions. This work is not intended to exhaust the subject but rather to cover the most common situations an XML designer faces during schema evolution. Moreover, in some cases, queries should not work across major versioning changes. Those cases are not covered in this study and will be covered in a future article.


