Skip to main content

XML Matters: DTDs and XML documents from SQL queries

David Mertz, Ph.D. (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
Photo of David Mertz
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo prompter hoc. You can contact David at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Summary:  This column discusses the public-domain sql2dtd and sql2xml utilities that allow RDBMS-independent generation of portable XML result sets. SQL queries that extract data from relational databases can provide very practical ad hoc document-type information for the representation of query results in XML.

View more content in this series

Date:  01 May 2001
Level:  Introductory
Activity:  1216 views

The previous "XML Matters" column discussed some of the theory and advantages underlying various data models. One conclusion of that column was that RDBMSs are here to stay (with good reasons), and that XML is best seen in this context as a means of transporting data between various DBMSs, rather than as something to replace them. XPath and XSLT are useful for certain "data querying" purposes, but their application is far less broad and general than that of RDBMSs, and SQL, in particular. However, for lack of space, I am deferring a discussion of the specific capabilities (and limits) of XPath and XSLT until a later column.

A number of recent RDBMSs, including at least DB2, Oracle, and probably others, come with built-in (or at least optional) tools for exporting XML. However, the tools discussed in this column are intended to be generic; in particular, the DTDs generated by these tools will remain identical for the same query performed against different RDBMSs. I hope this will further goals of data transparency.

Simplifying too much

What you might imagine as the most obvious way to convert relational database data to XML is also generally a bad idea. That is, it would be simple enough -- conceptually and practically -- to do a table-by-table dump of all the contents of an RDBMS into corresponding XML documents. For example, a LIBRARY database (to continue the simple example from the previous column) might have one table called BOOKPRICE, with contents like:



SELECT * FROM BOOKPRICE;
+------------+-------+
| ISBN       | Price |
+------------+-------+
| 2994927282 | 34.99 |
| 3920202049 | 47.50 |
+------------+-------+

I could straightforwardly convert this into a document BOOKPRICE.xml that looked like:


Listing 1. The BOOKPRICE.xml document

<?xml version="1.0"?>
<SQL>
  <row>
    <ISBN>2994927282</ISBN>
    <Price>34.99</Price>
  </row>
  <row>
    <ISBN>3920202049</ISBN>
    <Price>47.50</Price>
  </row>
</SQL>

Once you have created similar XML documents for every table within a database, you'll have a complete snapshot of the database.


Simplifying only enough

There are two basic problems with the approach sketched above. The first is that the XML dump is hugely inefficient. XML, as you well know, is an extremely verbose format, and an XML dump of a large database would generate an even larger collection of XML documents. Furthermore, major DBMSs already provide compact and efficient flat-file or packed-record-style dumps. The XML sketched provides no additional functionality to those simple formats, but it significantly multiplies the size of transport files.

The second problem is more interesting than the first. In a sense, "DASD is cheap," and bandwidth is also getting cheaper ("DASD" is an old IBM way of saying "hard disks"; and the phrase is a design cliche). So mere efficiency is not necessarily an overriding factor. More importantly, you rarely want to communicate the entire contents of a database to partners/departments/users/etc. Sometimes some of the contents are private; and almost all the time, most contents are simply not relevant to a particular recipient.

Dumping the results of useful SQL queries to XML would be more interesting than dumping raw tables to XML. And associating DTDs with these useful queries in order to regularize expectations about the exact parsing and processing required for XML transactions or feeds would be still better. These two interesting actions are precisely what the public-domain Python utilities sql2dtd and sql2xml allow you to accomplish.

Suppose that A and B each has its own internal data storage strategy (for example, in different RDBMSs). Each maintains all sort of related information that is not relevant to the interaction between A and B, but they also both have some information they would like to share. Suppose, along these lines, that A needs to communicate a particular kind of data set to B on a recurrent basis. One thing A and B can do is agree that A will periodically send B a set of XML documents, each of which will conform to a DTD agreed to in advance. The specific data in one transmission will vary with time, but the validity rules have been specified in advance. Both A and B can carry out their programming, knowing the protocol between them.


Generating the DTDs

One way to develop this communication between A and B is to develop DTDs (or schemas) that match the specific needs of A and B. Then A will need to develop custom code to export data into the agreed DTDs from A's current RDBMS; and B will need to develop custom code to import the same data (into a differently structured database). Then, finally, the communication channel can be opened.

However, a quicker way -- a way that is likely to leverage existing export/import procedures -- usually exists. The Standard Query Language (SQL) is a wonderfully compact means of expressing exactly what data interests you within an RDBMS database. Trying to bolt XML native techniques like XPath or XSLT onto a relational model will probably feel unnatural, although they can certainly express querying functions within XML's basically hierarchical model.

Many organizations have already developed well-tested sets of SQL statements for achieving known tasks. Often, in fact, RDBMSs provide means for optimizing stored queries. While there are certainly cases where designing rich DTDs for data exchanges makes sense, in many or most cases, using the structuring information implicit in SQL queries as an (automatic) basis for XML data transmissions can be a good solution.

While SQL queries can combine table data in complex ways, the result from any SQL query is a rather simple row-and-column arrangement. Query output has a fixed number of columns, with each row filling in values for every fixed column. (That is, as well as not changing in number, neither the value type nor the names of columns change within a SQL result -- even though both these things could change in XML documents.) The potential of XML to represent complex nesting patterns of elements is just simply not going to be deeply exercised in representing SQL results. Nonetheless, several important aspects of an SQL query can and should be represented in an XML DTD beyond simply row/column positions.


What to represent from an SQL query?

I would argue that in our contemplated data exchange between A and B, two aspects should be separated out. On the one hand, there is the internal organization of A's data -- its normalization and denormalized optimizations, for example. B is not, and need not, be concerned about the A-internal aspect. On the other hand, there is metadata that describes what is actually transmitted. Of course, separating these aspects is not necessarily easy.

In writing sql2dtd (and in helping plan Scott Hathaway's sql2xml), I made several decisions about what is part of a data transmission and what is internal to the sender's setup (and need not be represented in the DTD). A sample XML (with the DTD as an internal set) output, shown in Listing 2, helps illustrate these decisions (this output is generated entirely from the SQL query contained as an attribute; when run against a suitable database, of course):


Listing 2. A sample XML output with the DTD as an internal set


<?xml version="1.0"?>
<!DOCTYPE SQL [
<!ELEMENT SQL (row)*>
<!ATTLIST SQL
  GROUP_BY NMTOKEN #FIXED "AuthID"
  query CDATA #FIXED "SELECT AuthID AS SSN,COUNT(GroupID)
                      FROM AUTHGROUP GROUP BY AuthID
                      ORDER BY AuthID"
>
<!ELEMENT row (SSN, column2)>
<!ATTLIST row num ID #IMPLIED>
<!ELEMENT SSN (#PCDATA)>
<!ELEMENT column2 (#PCDATA)>
<!ATTLIST column2 CALC CDATA #FIXED "COUNT(GroupID)">
]>
<SQL>
  <row num="1">
    <SSN>111-22-3333</SSN>
    <column2>1</column2>
  </row>
  <row num="2">
    <SSN>333-22-4444</SSN>
    <column2>2</column2>
  </row>
  <row num="3">
    <SSN>666-44-5555</SSN>
    <column2>1</column2>
  </row>
</SQL>

This simple XML document may actually contain more metadata than one notices initially. Of course, by including the SQL itself as an attribute of the root node, one can reconstruct anything implicit in the SQL. But doing that requires reparsing the SQL, which sql2dtd has represented in the document, so it is not generally necessary.

The specification of the CALC attribute contains the fact that the XML contains calculated elements. Since a calculated expression can be long, and contain characters illegal for XML tags, calculated columns are named simply by their position. However, the particular calculation that went into the element contents is contained as an attribute of the tag. In order to avoid repeating the attribute throughout the XML body, it is specified as #FIXED in the DTD.

Frequently, if calculated columns are used, the calculation reflects grouping of columns with the "GROUP BY" modifier. Any such groupings are listed in the GROUP_BY attribute of the root element.

Furthermore, if an "ORDER BY" clause is used, each <row> tag carries a num attribute that specifies the sequence of output data. However, if a result set is unordered, no num attribute is used.

Let us consider what is not represented within the DTD, and see that it really does pertain to A's internal data representation, not to the transmitted message.

Beyond the embedded original SQL query, no representation is kept of the table or tables that were queried for data (the "FROM" clause). The particular table organization is simply not something B needs to be interested in. In fact, A may very well modify its database design after the transmission protocol is in place; but B need not worry about that as long as the same fields (columns) are extracted by some means. In particular, since the "AS" clause overrides the actual table column name, it is possible to continue sending an XML element that no longer has any direct literal meaning in A's database.

Most importantly in the design of sql2dtd, "WHERE" and "HAVING" clauses are ignored (and also JOIN, DISTINCT and ALL modifiers). As with table names, the particular joins and filters necessary to get the data out of A's tables is not something B should have to worry about. If A happens to need to join together a few tables to get some datum, that is simply a normalization strategy by A. B may or may not use any analogous strategy (for a different data subset), and either way does not care about A's actions. Filters (mostly using the "WHERE" clause or "DISTINCT" modifier) are ignored for a related, but slightly different reason. If, for whatever business reasons, A only needs to inform B about those woozles whose whatzit is more than 25, from B's perspective that is just in the nature of woozles. That is, A may be interested in a subclass of woozles that B does not care about; but the specific fact that A needed to use a filter to get the interesting ones (as opposed to not having them, or putting them in different tables) is not B's worry. Sub-selects, in this respect, are just another kind of filter.


Wrapping up

I haven't shown anything on the specific usage of sql2dtd and sql2xml. Not much is needed here, since both are well documented internally. In general, sql2dtd can generate the DTD from an SQL query but does not itself query any database. sql2xml peforms queries via ODBC and optionally utilizes sql2dtd to get a DTD (or it can generate DTD-less XML documents).

These tools help with only approximately half the process contemplated between A and B. A and B can quickly arrive at DTDs using these tools, and A can equally quickly generate the output XML documents conforming with these DTDs. But B, at its end, still needs to do all the work involved in parsing, storing and processing these received documents. Later columns will discuss B's job in some more detail.


Resources

  • My fellow XML zone columnist, Uche Ogbuji has written an interesting article covering somewhat different elements of RDBMS/XML connections for LinuxWorld. As well as writing a good article, Uche includes an extensive list of relevant resources.

  • My sql2dtd is available online.

  • Matt Sergeant's DBIx::XML_RDB is a Perl module whose purpose is substantially similar to sql2xml. However, the XML documents generated by DBIx::XML_RDB lack an accompanying DTD.

  • IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data..

  • Find other articles in David Mertz's XML Matters column.

About the author

Photo of David Mertz

David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo prompter hoc. You can contact David at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=11996
ArticleTitle=XML Matters: DTDs and XML documents from SQL queries
publish-date=05012001
author1-email=mertz@gnosis.cx
author1-email-cc=dwxed@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Rate a product. Write a review.

Special offers