Skip to main content

If you don't have an IBM ID and password, register here.

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

XML Matters: Using Python modules xml2sql and dtd2sql

Generating SQL statements from DTDs and XML documents

David Mertz (mertz@gnosis.cx), Format manipulator, Gnosis Software, Inc.
Author photo: David Mertz
David Mertz believes that without data, life itself would be impossible ... well, OK, at least the Internet would get less hype. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Summary:  An earlier column explored the generation of XML documents from SQL queries. Now David Mertz shows that the reverse direction of translating an XML document and DTD back into an RDBMS storage format is equally possible, but with its own set of constraints and complications. The Python public-domain utilities xml2sql and dtd2sql discussed here generate SQL statements to create and fill a database in a consistent and reversible fashion. Seven code samples demonstrate the techniques.

View more content in this series

Date:  01 Jun 2001
Level:  Introductory

Comments:  

Both XML documents and RDBMSs are going to be around for a good while longer. This is my earlier (and current) conclusion after investigating various data models and studying how XML fits into the picture. The discussion of utilities here continues my efforts to make data and format transitions a little bit easier for developers. By using dtd2sql and xml2sql, programmers can automatically move the contents of XML documents into SQL databases (and get the information back out later).

Caveats

The utilities presented -- here and in the earlier discussion of sql2dtd and sql2xml -- will, I hope, ease the work of developers looking for quick and actual transformations between XML and SQL. But do not assume that these utilities substitute for database analysts. Optimization, normalization, and de-normalization are complex matters requiring knowledge and experience. The tools I present provide good, workable, and easily deployed solutions; but these solutions are only sometimes substitutes for custom development.

Moreover, xml2sql and dtd2sql work best with XML documents that contain table-oriented information to start with. They do a far less stellar job of transforming prose-oriented and linear-oriented XML documents. The limits imposed in this regard are essentially identical to the limits imposed by the xml_objectify library that xml2sql utilizes. However, this particular focus is not really so much of a limitation: There is really no way that any technique -- even custom development -- can produce a very natural representation of prose data and linear data within an RDBMS framework. The models are just different, and xml2sql does about as good job as any utility could.

Finally, do not expect xml2sql to be particularly fast or efficient; this utility is designed primarily for portability and generality. Therefore, rather than call any database library -- whether for a specific RDBMS or via a mechanism like ODBC -- xml2sql and dtd2sql create simple textual SQL statements. The module dtd2sql generates a list of CREATE TABLE statements, while xml2sql generates a list of INSERT INTO statements. It's up to the user or programmer to feed these statements into an actual RDBMS. The beauty of this arrangement is that developers can feed these statements into any vendor's RDBMS equally well.


Dirty hands

The modules xml2sql and dtd2sql are probably best introduced with practical examples. I will circle back to some theoretical issues in their design later.

For purposes of testing, I created a simple test script. For development purposes, I used the popular open-source RDBMS mySQL. While mySQL has some limitations compared to more sophisticated RDBMSs, it provides a nice test platform. The sample DTD and XML documents belong to a developerWorks tutorial I wrote (see Resources for the collection). My script, in Listing 1, is an OS/2 command file, but it should work equally well under Windows, and with only slight modification under a Unix-like system.

echo drop database test;       > test.sql
echo create database test;    >> test.sql
echo use test;                >> test.sql
python dtd2sql.py dwtut.dtd   >> test.sql
python xml2sql.py haskell.xml >> test.sql
mysql -u root -pPASSWORD test  < test.sql

The first few lines of the script in Listing 1 simply wipe out and restore the test database in mySQL. In real life, you would rarely want to drop an existing database but merely INSERT and DELETE from tables. I used drop because it's a good idea to start fresh when testing.

When you run dtd2sql, it takes a DTD on either STDIN or from the filename given on the command line. This DTD can be the internal subset of an XML document, if you wish, or it can be an external file. However, the tool is limited to reading from a single source at a time, and it will not currently handle complex multifile DTDs, parametric entities, or an internal subset that partially overrides an external definition. As mentioned, dtd2sql simply produces a set of CREATE TABLE statements.

In Listing 2, take a look at one output line from dtd2sql and then see how to pick apart its pieces (some wrapping added for presentation).

CREATE TABLE a (
    primary_key BIGINT UNSIGNED PRIMARY KEY,
    seq INT UNSIGNED,
    href BLOB,
    PCDATA BLOB,
    _XML BLOB,
    foreign_key_p BIGINT UNSIGNED,
    foreign_key_li BIGINT UNSIGNED,
    foreign_key_prompt BIGINT UNSIGNED,
    foreign_key_response BIGINT UNSIGNED
);

Some of the column names are easily explained by the definition of the a element itself; for others you need to look a little deeper.


<!--A hyperlink to some other resource.-->
<!ELEMENT a (#PCDATA | code)* >
<!ATTLIST a href CDATA #REQUIRED  >


Every CREATE TABLE statement includes a primary_key and seq column. The seq column sometimes has little meaning (other than indicating the absence of sequentiality pertaining to some rows). Running such a CREATE TABLE statement -- by a user at the command line or by a custom application -- creates these columns within each table. The href column comes straight from the identically named XML tag attribute. And the PCDATA and _XML columns are places to hold the actual element contents (either with or without any embedded character-level markup).

The most interesting thing in the CREATE TABLE statement shown is the several foreign_key_* columns. I consider those below.


Creating relations

According to the relational model, different tables are connected to each other via a primary key/foreign key identity. A JOIN in SQL is simply a way of saying that a field in one table must correspond to another field in a different table. A primary key is a special thing in the relational model: It must be unique to each record (row) of a table. Most of the time, a database analyst looks at the deep structure of the data and figures out -- in consultation with application programmers and end users -- what the best candidates are to serve as primary keys. These keys can be a concatenation of multiple columns, and often identifiers like Social Security numbers, employee IDs, ISBNs, or part IDs are used in these roles.

Obviously enough, dtd2sql cannot do all the background research that a database analyst does: All it has is a DTD (or maybe an XML document). On that basis, there is no real way of determining which attributes or element contents will be unique (if any). Fortunately, dtd2sql can take the road of some commercial RDBMSs and simply eschew "natural" primary keys. Instead, the modules are able to choose wholly artificial primary keys -- ones that isolate the uniqueness requirement while refusing any data-representation role. I would argue that such a pattern actually achieves better orthogonality in database design than does the more common strategy of identifying suitable "real world" data. And the approach has the additional advantage of giving every table's primary key an identical and predictable name and format.

The primary keys used are random 18-digit integers. Assuming that Python's random module is quite good, the risk of collision is pretty small. However, the code creates no strict guarantee of noncollision (maybe in a later version). So far, so good.

The next step is making these primary keys usable for purposes of SQL JOINs. To do that, we need to make sure that when an XML element can contain a subelement, the subelement contains a foreign key corresponding to the parent's primary key. Admittedly, the most parsimonious manner of achieving the above goal is to create a single foreign_key column for every non-root XML element. In that case, the SQL user who queried the resultant database would need to know what JOINs would produce results (say by reading the original DTD).

Throwing away parsimony, I opt for explicitness. I create a separate foreign_key_* column for every element that might be a parent of the XML element to which a table corresponds. So in the above CREATE TABLE example, dtd2sql identified the DTD element definitions shown in Listing 4.

<!ELEMENT p        (#PCDATA | code | img | br | i | b | a)* >
<!ELEMENT li       (#PCDATA | code | img | br | i | b | a)* >
<!ELEMENT prompt   (#PCDATA | code | img | br | i | b | a)* >
<!ELEMENT response (#PCDATA | code | img | br | i | b | a)* >

An advantage of the explicit approach of dtd2sql is that the table structures created inherently contain a large part of the information in the DTD (but not all of it, since quantifiers are not differentiated thereby).


Walking the winding roads

Putting data into the tables created by dtd2sql is the job of xml2sql. Well, technically neither tool actually puts any data anywhere; each just specifies what the data is. You need to use the tools that come with your RDBMS to actually load the data.

The truth is that xml2sql does remarkably little. It has fewer than 50 lines of code at its heart (the walkNodes() function). Furthermore even those few lines are well documented and make no effort to achieve conciseness through programming tricks. Of course, most of what xml2sql does is actually done by xml_objectify. The first step for xml2sql is to create a "Pythonic" object using xml_objectify. After that, it is simple to recurse through all the nested attributes, outputting INSERT INTO SQL statements as it goes. Users of an older version of xml_objectify, however, will need to grab the latest one because of a slight change in the way XML element names are "mangled" along the way.

Once you run xml2sql you get a bunch of SQL statements in return. This collection of statements can be redirected and piped according to normal STDOUT behavior, which makes combining xml2sql with RDBMS command-line tools straightforward. If you want to use xml2sql as a support module, you can get the set of SQL statements as a Python list (which might be handy for use with some database modules). A typical produced statement looks like the example in Listing 5 (wrapped for presentation).

INSERT INTO p
       (primary_key, seq, foreign_key_text__column, PCDATA)
VALUES (15447926390024014, 0, 527610371062647168,
        "Navigating through the tutorial is easy:");

You can see from the form of the INSERT INTO in Listing 5 that a table had been created to correspond with the <p> element in our DTD. Actually, the element is only really known to occur in the XML document; validation of the XML document against the DTD is a job that needs to be handled outside of these modules. But assuming the XML document is valid, the right tables and column were created by dtd2sql.

You can also see in the INSERT INTO that this particular <p> tag is nested inside a <text-column> element (some name mangling is needed to get valid SQL column names), to wit, that element that has the primary key 527610371062647168. It turns out also that this <p> element has some PCDATA contents, and that it has a seq column value of zero. The meaning of that list bit is that this <p> element is all alone within its container; had multiple <p> elements occurred in the same <text-column> they would be sequenced, beginning with one.


Putting it together

Once you have a bunch of data inside an RDBMS, you usually want to get it back out in structured and useful ways. Fortunately, with a basic understanding of the primary key and foreign key strategy employed, you can find anything you want. In fact, in many ways, the flexibility you have at this point is better than you would have had with an XPath query syntax. Look at an example in Listing 6.

SELECT "Paragraph", p.seq, p._XML
  FROM title,panel,body,text__column TC,p
 WHERE title.foreign_key_panel = panel.primary_key
   AND body.foreign_key_panel = panel.primary_key
   AND TC.foreign_key_body = body.primary_key
   AND p.foreign_key_text__column = TC.primary_key
   AND title.PCDATA="About Haskell"
 ORDER BY p.seq
;

A word or two of explanation is worthwhile. The JOINs are all of the same form: a foreign_key_X field is JOIN'ed with the primary_key of some table X. Once all the JOINs are in place, you can add a more substantive condition, along with ORDER'ing, GROUP'ing, and the like. In this case you want to look at all the paragraphs (<p> elements) of the <panel> whose <title> is "About Haskell." The result looks something like Listing 7.

C:\mysql2\bin>mysql -u root -pgnosis test < haskell.sql
Paragraph  seq  _XML
Paragraph  1    Haskell is just one of a number of functional programming...
Paragraph  2    Among functional languages, Haskell is in many ways the...
Paragraph  3    On a minor note, Haskell is syntactically easier to get...


Conclusion

This column illustrates the command-line usage of dtd2sql and xml2sql. For quick testing and actual shell usage, this is probably the approach you'll want to use. However, as with most things in Python, it is very simple to reuse the modules within your own code. The self-test code (the command-line usage) provides a straightforward template for any importing module to follow. I look forward to hearing about the wonderful uses readers manage to put these modules to -- as you have done for so many others.


Resources

About the author

Author photo: David Mertz

David Mertz believes that without data, life itself would be impossible ... well, OK, at least the Internet would get less hype. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in

If you don't have an IBM ID and password, register here.


Forgot your IBM ID?


Forgot your password?
Change your password


By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)


By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12008
ArticleTitle=XML Matters: Using Python modules xml2sql and dtd2sql
publish-date=06012001
author1-email=mertz@gnosis.cx
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).