You love XML and the flexibility and interoperability that it offers, but you can do some things to make your interaction with XML, and with the tools that you use to work with it, significantly easier. Picking up some basic good habits when you work with XML will ensure that you get the most efficient use out of your XML documents and applications.
Adopt 10 good habits
Here are the top 10 good XML habits to adopt:
- Define your XML and encoding
- Use a DTD or XSD
- Remember to validate
- Validation isn't always the answer
- XML structure versus attributes
- Use XPath to find information
- You don't always need a parser to extract information
- When to use SAX over DOM parsing
- When to use DOM over SAX parsing
- Use a good XML editor
Define your XML and encoding
When you create an XML document quickly, it can be very tempting to create the basic structure and eschew the normal XML document requirements of specifying the XML declaration and the encoding type of the data that the XML document contains.
Consider the XML document in Listing 1.
Listing 1. XML document minus the XML declaration and data encoding type
<phrases>
  <phrase lang="en">Hello</phrase>
  <phrase lang="it">Buongiorno</phrase>
  <phrase lang="fr">Salut!</phrase>
</phrases>
As a human, you can look at that document and identify it as XML, but it is more difficult for a computer to achieve the same determination. You can make the process more explicit and identifiable by adding the XML declaration to the top of the file. This is a single line that specifies that the document is XML, and also describes a version number and the character encoding used in the XML data. For example:
<?xml version="1.0" encoding="us-ascii"?>
The encoding specification must be accurate, too. XML parsers use the encoding to ensure that each character is decoded correctly from the XML document. For example, continuing the phrase-based example in Listing 1, adding a Russian entry to your document would cause a problem, because the us-ascii encoding specified in the declaration does not support the Cyrillic characters required by the Russian phrase for hello.
Specifying the wrong encoding might mean that parsers process the document incorrectly; for example, reading a multibyte extended character as just a sequence of individual bytes might lead to corrupt data and bad output.
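The effect of a wrong encoding declaration can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree module. This is only an illustration: the parse_phrase helper and the Russian sample phrase are assumptions made for the example, and the bytes of the document are always UTF-8 while only the declaration changes.

```python
import xml.etree.ElementTree as ET

PHRASE = "Привет"  # sample Russian greeting; not representable in US-ASCII


def parse_phrase(declared_encoding):
    """Parse a one-phrase document whose bytes are always UTF-8 but whose
    XML declaration claims `declared_encoding` (hypothetical helper)."""
    doc = (
        f'<?xml version="1.0" encoding="{declared_encoding}"?>'
        f'<phrases><phrase lang="ru">{PHRASE}</phrase></phrases>'
    ).encode("utf-8")
    root = ET.fromstring(doc)
    return root.find("phrase").text


# Declaration matches the bytes: the phrase survives intact.
correct = parse_phrase("utf-8")

# Declaration claims a single-byte encoding: every byte of each multibyte
# character is read as a separate character, so the text is corrupted.
garbled = parse_phrase("iso-8859-1")

# Declaration claims US-ASCII: the parser rejects the non-ASCII bytes.
try:
    parse_phrase("us-ascii")
    ascii_rejected = False
except ET.ParseError:
    ascii_rejected = True
```

The second case is the more dangerous one in practice, because the parse succeeds and the corruption is only noticed later in the output.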
Use a DTD or XSD
Once you have the XML declaration in place, you should then ensure that the valid structure of your XML file is defined with a DTD or an XSD. Either solution allows XML parsers to check and confirm that the contents of the XML file match the structure appropriate for the data that you are trying to model.
For example, given a simple XML structure for a contact database, you want to define a structure that allows each contact's name, address, and phone numbers to be specified. Using a DTD means that you can map out the structure and ensure that each contact within the structure matches the layout.
Listing 2 shows a DTD for the contacts database.
Listing 2. A DTD for the contacts database
<!ELEMENT phone (#PCDATA)>
<!ATTLIST phone type (home | work | mobile) #REQUIRED>
<!ELEMENT contact (#PCDATA | name | phone | address)*>
<!ELEMENT contacts (#PCDATA | contact)*>
<!ELEMENT country (#PCDATA)>
<!ELEMENT road (#PCDATA)>
<!ELEMENT address (#PCDATA | road | city | state | postcode | country)*>
<!ATTLIST address type (home | work) #REQUIRED>
<!ELEMENT state (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT city (#PCDATA)>
The DTD defines the elements and attributes (along with the supported values of those attributes) required to describe a contact. You can see in Listing 2, for example, that a phone element has a type attribute, and that the address element also has a type attribute, along with child elements for the components of the address.
Use of a DTD helps to ensure that the structure is valid and, when used in combination with a validation process, can identify any problems. When used with an XML-capable editor, DTDs can also help with editing and automated completion of the content.
XSDs, or schemas, perform many of the same functions as DTDs, but can be useful in different ways. For example, while some XML editors require a DTD for automated completion of content, schemas can provide more flexibility in the design of the actual hierarchy for the document. The tool you choose will depend on your own circumstances.
Remember to validate
Looking at Listing 3, can you spot the problem?
Listing 3. A validation example
<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
    <phone type="mobile">123 456 7890</phone>
    <phone type="work">123 456 7890</phone>
    <address type="home">
      <road>Home road</road>
      <city>Home city</city>
      <state>Home state</state>
      <zipcode>12434</zipcode>
      <country>USA</country>
    </address>
  </contact>
  <contact>
    <name>Sharon</name>
    <phone type="work">234 567 8901</phone>
    <phone>234 567 8901</phone>
    <address type="home">
      <road>Other home road</road>
      <city>Other city</city>
      <state>Other state</state>
      <zipcode>39487</zipcode>
      <country>USA</country>
    </address>
    <address type="work>
      <road>Work building, work road</road>
      <city>Work city</city>
      <state>Work state</state>
      <zipcode>12347</zipcode>
      <country>USA</country>
    </address>
  </contact>
</contacts>
Finding the problem by hand is tedious. Instead, run the file through xmllint, a free tool that checks the content and structure of an XML file. Listing 4 shows the output when xmllint is run against this file.
Listing 4. Output after running Listing 3 through xmllint
$ xmllint contacts.xml
contacts.xml:27: parser error : Unescaped '<' not allowed in attributes values
<road>Work building, work road</road>
^
contacts.xml:27: parser error : attributes construct error
<road>Work building, work road</road>
^
contacts.xml:27: parser error : Couldn't find end of Start Tag address line 26
<road>Work building, work road</road>
^
contacts.xml:32: parser error : Opening and ending tag mismatch: contact line 15 and address
</address>
^
contacts.xml:33: parser error : Opening and ending tag mismatch: contacts line 1 and contact
</contact>
^
contacts.xml:34: parser error : Extra content at the end of the document
</contacts>
Although this output looks very complicated compared to the original problem (the type attribute on one address element is missing its closing quotation mark), it does give you a place to start.
Incidentally, xmllint supports a number of command-line options that control how it checks a document and what it reports. One of the most useful is the --noout option, which prevents xmllint from echoing the content when the file is parsed. Echoing a short file is not a problem, but with longer files the output can be overwhelming.
If you are using a DTD, then use the --postvalid option to tell xmllint to validate the content against the DTD, which confirms that the content is not only well-formed XML but also matches the structure that the DTD defines. If you add the DTD from Use a DTD or XSD (Listing 2) to the file and correct the missing quotation mark in the attribute, xmllint reports a different error, as seen here in Listing 5.
Listing 5. xmllint finds a different error
$ xmllint --noout --postvalid contacts.xml
contacts.xml:9: element address: validity error : Element zipcode is not declared in address list of possible children
contacts.xml:21: element address: validity error : Element zipcode is not declared in address list of possible children
contacts.xml:28: element address: validity error : Element zipcode is not declared in address list of possible children
Document contacts.xml does not validate
Using xmllint in this way is a quick, convenient way to confirm the structure of a document is valid. xmllint is available as part of the libxml2 toolkit, which is bundled with Linux, UNIX®, and Mac OS X, but requires a separate download for Windows®. For more information on xmllint and libxml2, see the Resources.
Validation isn't always the answer
Using xmllint and similar tools, particularly in combination with a DTD, is a great way to check the structure of your XML files. The solution, however, does have its limitations. What about the data inside the XML file, for instance?
With a DTD, you can restrict an attribute to a fixed list of available values, but you cannot constrain the text inside an element beyond declaring it as #PCDATA. An XSD goes further, with data types and pattern restrictions for element content, but even a schema cannot express every rule that your application depends on.
For example, in the contacts example, the phone element contains numbers and spaces. But nothing in the DTD stops a user from adding alphabetic characters to that element. Doing so won't bring up an error during validation using xmllint, and editors and other XML-aware solutions that rely on the DTD won't identify the problem. The failure of your application because it encountered unexpected data might be the way you actually learn about the problem.
In short, validation against a DTD only ensures that the structure is correct, not the data.
The easiest way to address this is to write a parser that reads the XML file and actually validates the data content. Don't go overboard in verifying the content though; you only need to go as far as ensuring that the data meets the requirements of your application.
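As a rough sketch of such a check, the following Python snippet reads a contacts document with the standard library's xml.etree.ElementTree module and reports any phone element whose text is not made up of digits and spaces. The rule encoded in PHONE_PATTERN, the find_bad_phones helper, and the sample data are all assumptions for illustration; your application's real requirements may be stricter or looser.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: a phone number is groups of digits separated by single spaces.
PHONE_PATTERN = re.compile(r"\d+( \d+)*")


def find_bad_phones(xml_text):
    """Return the text of every phone element that breaks the assumed rule."""
    root = ET.fromstring(xml_text)
    bad = []
    for phone in root.iter("phone"):
        value = (phone.text or "").strip()
        if not PHONE_PATTERN.fullmatch(value):
            bad.append(value)
    return bad


# Hypothetical sample: structurally fine, but one number contains letters.
SAMPLE = """<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
    <phone type="mobile">call me maybe</phone>
  </contact>
</contacts>"""

bad_phones = find_bad_phones(SAMPLE)
```

The check deliberately stops at what the application needs; it does not try to decide whether a number is actually reachable.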
XML structure versus attributes
Opinion is divided on whether it is better to use attributes or elements to describe the information that you want to represent in the XML file.
As a general rule, you should use elements (that is, the data between the tags) to define the information contained within a file. Attributes should be used to provide extended qualification of the data that you describe.
Both elements and attributes have limitations. An attribute, for example, cannot be repeated within a tag, which is a classic case where elements have the advantage: because elements can repeat, they are very practical for lists of information. On the other hand, using elements to qualify the data can sometimes make the document more complex to process.
The phone numbers in the contacts example provide a good illustration of the benefits. In the example, shown here in Listing 6, attributes qualify the type of phone number (such as work, home, or mobile).
Listing 6. Qualifying the type of phone numbers
<phone type="home">123 456 7890</phone>
<phone type="mobile">123 456 7890</phone>
<phone type="work">123 456 7890</phone>
With this structure, it is easy to pick out numbers as a whole (by ignoring attributes), or to pick out a specific phone number type (by using the attribute).
Compare that structure to one designed using only elements in Listing 7.
Listing 7. Using only elements to qualify the phone number
<phone>
  <type>home</type>
  <number>123 456 7890</number>
</phone>
<phone>
  <type>mobile</type>
  <number>123 456 7890</number>
</phone>
<phone>
  <type>work</type>
  <number>123 456 7890</number>
</phone>
Now it is difficult to see the wood for the trees. Although, in theory, any XML parser or a suitable XPath expression can pull out the information you want, you have gained very little while making the XML document more difficult to read.
Use XPath to find information
When working with XML data, finding the information you want can be complex. You can, of course, write a parser to pick out the material that you need, but sometimes, you really just need to find a small fragment of the information in the file very quickly.
For example, if you wanted to extract a list of all the countries in your contacts XML file so that you could see how widely spread your contacts were, you could use XPath to pick out the information.
XPath enables you to pull out the data from an XML file by using the structure of the XML file as part of the query. You can, for example, extract the data for a specific element by giving the path to the element within the XML file:
$ xpath contacts.xml '//contact/address/country'
You can dissect the content like this:
- The initial double slash (//) specifies to look anywhere within the document for the specified element (contact).
- The next slash and element name specify the next element to pick out (address); that is, look for the address element within the contact element.
- The final one repeats the process, this time picking the country.
Note that in the example, you did not qualify the type of address to select the information from, so the query picks all addresses. You can see the result of the XPath query in Listing 8.
Listing 8. Result of the XPath query
$ xpath contacts.xml '//contact/address/country'
Found 3 nodes:
-- NODE --
<country>USA</country>-- NODE --
<country>USA</country>-- NODE --
<country>USA</country>
If you want to pick out more specific data, you can specify the element contents, or attribute contents that you want to match. For example, to select only mobile phone numbers, you need to specify the attribute type and value. To do this, use an at sign (@), which specifies that you want to search an attribute, and then specify the value you want to match (see Listing 9).
Listing 9. Selecting only mobile phone numbers
$ xpath contacts.xml '//contact/phone[@type="mobile"]'
Found 1 nodes:
-- NODE --
<phone type="mobile">123 456 7890</phone>
Listings 8 and 9 use a command-line tool. Many XML toolkits also provide native methods for evaluating XPath expressions, so you can extract data with XPath directly in your applications without writing your own parsing code.
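As one example of toolkit support, Python's standard xml.etree.ElementTree module understands a limited subset of XPath. The sketch below runs the two queries from Listings 8 and 9 against a cut-down, hypothetical version of the contacts document; the leading dot in each expression is how ElementTree expresses a search relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the contacts file used in this article.
CONTACTS = """<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
    <phone type="mobile">123 456 7890</phone>
    <address type="home"><country>USA</country></address>
  </contact>
  <contact>
    <name>Sharon</name>
    <phone type="work">234 567 8901</phone>
    <address type="home"><country>USA</country></address>
    <address type="work"><country>USA</country></address>
  </contact>
</contacts>"""

root = ET.fromstring(CONTACTS)

# Every country, whatever the address type (compare Listing 8).
countries = [c.text for c in root.findall(".//contact/address/country")]

# Only phone elements whose type attribute is "mobile" (compare Listing 9).
mobiles = [p.text for p in root.findall(".//contact/phone[@type='mobile']")]
```

For full XPath 1.0 support you would need a more complete toolkit; the subset shown here is enough for simple path and attribute queries.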
You don't always need a parser to extract information
Although it seems counter-intuitive, you don't always need to use a full XML parser employing SAX, DOM or other techniques like XPath or XQuery to pull out the information that you want from XML files.
XML files contain data in a structured format, and sometimes you need that information in its structured form. More often than not, though, when you are quickly looking for a piece of information, a simpler solution will work.
Often you can get away with just using grep, or Perl, or something similar to extract the data you want without actually parsing the structure or content of the document as an XML file.
For example, you can pick out phone numbers using grep (see Listing 10).
Listing 10. Picking out phone numbers using grep
$ grep '<phone' contacts.xml
<phone type="home">123 456 7890</phone>
<phone type="mobile">123 456 7890</phone>
<phone type="work">123 456 7890</phone>
<phone type="work">234 567 8901</phone>
<phone>234 567 8901</phone>
You've picked out the information you want, without worrying about the fact that it is XML, or indeed concerning yourself with the structure.
When all you want is a quick piece of information, simplified processing techniques are just as capable of finding the information you want, without the overhead associated with a traditional parsing solution.
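The same quick-and-dirty idea works inside a script. The following sketch uses a regular expression rather than an XML parser to pull out the phone numbers; the pattern and the sample text are assumptions for illustration, and the approach only holds up for simple, predictable markup (it would be confused by comments, CDATA sections, or nested elements inside phone).

```python
import re

# Hypothetical, shortened version of the contacts file used in this article.
CONTACTS = """<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
    <phone type="mobile">123 456 7890</phone>
  </contact>
  <contact>
    <name>Sharon</name>
    <phone type="work">234 567 8901</phone>
  </contact>
</contacts>"""

# Match an opening phone tag (with or without attributes), capture the text
# up to the next '<', then require the closing tag.
PHONE_TEXT = re.compile(r"<phone\b[^>]*>([^<]*)</phone>")

numbers = PHONE_TEXT.findall(CONTACTS)
```

Treat this as a search tool, not as a replacement for parsing: once the result feeds into other processing, a real parser is the safer choice.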
When to use SAX over DOM parsing
When you build a parser for your documents to pull out the information that you want, it is often difficult to determine when to use a SAX-based processor, and when to use a DOM-based processor.
The easiest way to make the decision is to consider both the complexity of the documents and what you want to do with the information. If you convert or translate documents, or the document is particularly large, then SAX is your best choice.
SAX parses the document element by element, triggering a method or function to be called when the element is identified. If you convert an XML document to another format, for example translating XML to HTML, then SAX is the most efficient way. You don't have to load the entire document into memory, just react to the elements and structure being identified.
The downside of SAX appears when you need to save or record the structure, or to understand the document as a whole and pick out individual elements from it (for example, selecting a single contact in its entirety). To do this, you need to build complex processes that load the XML, record the data into a structure of your own, and can then identify the elements that belong in the output.
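A minimal sketch of the event-driven style, using Python's standard xml.sax module, is shown below. It converts the name elements of a contacts document into an HTML list as the parser reports them, without building a tree. The NameToHtml class and the sample data are hypothetical names chosen for this example.

```python
import xml.sax
from xml.sax.saxutils import escape

# Hypothetical, shortened version of the contacts file used in this article.
CONTACTS = """<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
  </contact>
  <contact>
    <name>Sharon</name>
    <phone type="work">234 567 8901</phone>
  </contact>
</contacts>"""


class NameToHtml(xml.sax.ContentHandler):
    """Collect each name element as an HTML list item while parsing."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.buffer = []
        self.items = []

    def startElement(self, name, attrs):
        # Called when the parser sees an opening tag.
        if name == "name":
            self.in_name = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_name:
            self.buffer.append(content)

    def endElement(self, name):
        # Called when the parser sees a closing tag.
        if name == "name":
            self.items.append("<li>" + escape("".join(self.buffer)) + "</li>")
            self.in_name = False


handler = NameToHtml()
xml.sax.parseString(CONTACTS.encode("utf-8"), handler)
html = "<ul>" + "".join(handler.items) + "</ul>"
```

Notice that the handler has to track its own state (the in_name flag); that bookkeeping is what grows quickly when you need to understand more of the document's structure.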
When to use DOM over SAX parsing
DOM processing loads the entire document and its structure into memory and allows you to refer to and use the structure of your XML document within your application. For example, with the contacts example, you could read the entire contacts database into memory, and then select all the phone numbers by iterating over the contacts, and then within each contact, iterate over each phone number.
Because DOM retains the structure, and more importantly understands and works with the structure, you can easily work with the document as a whole or with individual elements. Staying with the contacts example, inserting a new contact with SAX would be complex. But with DOM, you can just insert a new XML element representing the new contact into the existing XML document.
The limitation of DOM is that processing the file in a stream—for example, translating to HTML—is made more complex, because you have to process the document by iterating over each element individually within the structure.
Furthermore, because DOM loads the entire XML document into memory during parsing, DOM parsers can be slower and require more memory. That in-memory structure does bring some benefits, however; for example, you can process a DOM-parsed document multiple times from a single parse. With SAX, you have to repeat the parse each time to achieve the same result.
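The following sketch uses Python's standard xml.dom.minidom module to show both points: a new contact is inserted into the in-memory tree, and the same tree is then queried again without reparsing. The sample document and the new contact's details are invented for the example.

```python
from xml.dom import minidom

# Hypothetical, shortened version of the contacts file used in this article.
CONTACTS = """<contacts>
  <contact>
    <name>Martin</name>
    <phone type="home">123 456 7890</phone>
  </contact>
</contacts>"""

# One parse loads the whole document into memory.
doc = minidom.parseString(CONTACTS)

# Build a new contact element (invented data) and attach it to the tree.
contact = doc.createElement("contact")

name = doc.createElement("name")
name.appendChild(doc.createTextNode("Alex"))
contact.appendChild(name)

phone = doc.createElement("phone")
phone.setAttribute("type", "mobile")
phone.appendChild(doc.createTextNode("345 678 9012"))
contact.appendChild(phone)

doc.documentElement.appendChild(contact)

# Query the same in-memory tree again; no second parse is needed.
contact_count = len(doc.getElementsByTagName("contact"))
numbers = [p.firstChild.data for p in doc.getElementsByTagName("phone")]
```

Calling doc.toxml() at this point would serialize the updated document, new contact included.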
See Resources to find out more details and examples of using DOM and SAX.
Use a good XML editor
If you regularly write and use XML, then a good XML editor is a must. XML editors differ from standard text editors because they understand the structure and layout of XML. They can offer a whole range of features that make it easier to work with XML, including:
- Completion: Start typing the characters for a closing element, and the editor can finish typing the rest for you.
- Content completion: If you are using an XML file with a DTD, then the editor can fill in and format parts of the content for you. For example, in the contacts DTD, the type attribute of the phone element is a required attribute. With an intelligent XML editor, the attribute (with an empty value) is automatically added to the text when you create the phone tag.
- Inline formatting: An editor can make your XML easier to read and easier to understand. This can either be done live while you edit, or by a separate format command. The end result is XML you can understand and identify more quickly.
- Built-in validation: The editor can validate XML documents for errors while you type, highlighting different issues right in the editor so you know what to fix.
- Built-in translation and conversion: Some XML editors include interfaces to XPath, XQuery, and in some cases XSLT and other transformations, so you can see the results within your editing environment.
- Learning and manipulation: Sometimes you create the XML structure before the DTD. In those cases, an editor that can read your XML file, learn the structure, and then create a DTD for validation can save you a lot of time and energy.
Examples of good XML editors include Eclipse and oXygenXML, but plenty of other choices are out there.
Learning good habits in XML can make all the difference between taking advantage of the functionality offered by XML and struggling against the XML standard to get the basics of validation and parsing right. This article should help you to adopt 10 good habits that improve your effectiveness and efficiency as you work with XML documents and data.
Resources
- The Java XPath API (Elliotte Rusty Harold, developerWorks, June 2007): Learn all about XPath, an XML technology for navigating and selecting information from XML documents.
- Understanding DOM (Nicholas Chase, developerWorks, March 2007): Get a solid introduction to the Document Object Model (DOM) and the structure of a DOM document, including how to use Java technology to create a Document from an XML file, make changes to it, and retrieve the output.
- Understanding SAX (Nicholas Chase, developerWorks, July 2003): Learn about an alternative to the DOM known as the Simple API for XML (SAX). SAX allows you to process a document as it's being read, which avoids the need to wait for all of it to be stored before taking action.
- Introduction to XML (Doug Tidwell, developerWorks, August 2002): Get a basic grounding in XML from this classic tutorial.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- The technology bookstore: Browse for books on these and other technical topics.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- libxml2: Get the xmllint tool, the XML C parser and toolkit developed for the Gnome project, which is bundled with Linux, UNIX, and Mac OS X. It requires a separate download for Windows.
- oXygenXML: Get an XML editor with a rich environment to create, validate, and process XML documents.
- The Eclipse IDE: Get the popular development platform, which includes an XML editing function designed to work with the applications you are developing for XML processing.
- IBM trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- developerWorks XML zone: Share your thoughts: After you read this article, post your comments and thoughts in this forum. The XML zone editors moderate the forum and welcome your input.
- developerWorks blogs: Check out these blogs and get involved in the developerWorks community.