What is XML?
XML, or Extensible Markup Language, is a markup language that you can use to create your own tags. It was created by the World Wide Web Consortium (W3C) to overcome the limitations of HTML, the Hypertext Markup Language that is the basis for all Web pages. Like HTML, XML is based on SGML -- Standard Generalized Markup Language. Although SGML has been used in the publishing industry for decades, its perceived complexity intimidated many people who otherwise might have used it (SGML also stands for "Sounds great, maybe later"). XML was designed with the Web in mind.
HTML is the most successful markup language of all time. You can view the simplest HTML tags on virtually any device, from palmtops to mainframes, and you can even convert HTML markup into voice and other formats with the right tools. Given the success of HTML, why did the W3C create XML? To answer that question, take a look at this document:
<p><b>Mrs. Mary McGoon</b> <br> 1401 Main Street <br> Anytown, NC 34829</p>
The trouble with HTML is that it was designed with humans in mind. Even without viewing the above HTML document in a browser, you and I can figure out that it is someone's postal address. (Specifically, it's a postal address for someone in the United States; even if you're not familiar with all the components of U.S. postal addresses, you could probably guess what this represents.)
As humans, you and I have the intelligence to understand the meaning and intent of most documents. A machine, unfortunately, can't do that. While the tags in this document tell a browser how to display this information, the tags don't tell the browser what the information is. You and I know it's an address, but a machine doesn't.
To render HTML, the browser merely follows the instructions in the HTML document. The paragraph tag tells the browser to start rendering on a new line, typically with a blank line beforehand, while the two break tags tell the browser to advance to the next line without a blank line in between. While the browser formats the document beautifully, the machine still doesn't know this is an address.
To wrap up this discussion of the sample HTML document, consider the task of extracting the postal code from this address. Here's an (intentionally brittle) algorithm for finding the postal code in HTML markup:
If you find a paragraph with two <br> tags, the postal code is the second word after the first comma in the text following the second break tag.
Although this algorithm works with this example, there are any number of perfectly valid addresses worldwide for which this simply wouldn't work. Even if you could write an algorithm that found the postal code for any address written in HTML, there are any number of paragraphs with two break tags that don't contain addresses at all. Writing an algorithm that looks at any HTML paragraph and finds any postal codes inside it would be extremely difficult, if not impossible.
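To make the brittleness concrete, here is a minimal Python sketch of the rule above. The function name brittle_postal_code and the second sample paragraph are made up for illustration; only the address markup comes from this article.

```python
import re

html = "<p><b>Mrs. Mary McGoon</b> <br> 1401 Main Street <br> Anytown, NC 34829</p>"

def brittle_postal_code(paragraph):
    """Apply the rule literally: second word after the first comma
    in the text that follows the second <br> tag."""
    parts = re.split(r"<br\s*/?>", paragraph)
    if len(parts) != 3:  # the rule only covers paragraphs with exactly two <br> tags
        return None
    last = re.sub(r"<[^>]+>", "", parts[2])  # drop leftover tags such as </p>
    _, comma, after = last.partition(",")
    if not comma:
        return None
    words = after.split()
    return words[1] if len(words) >= 2 else None

print(brittle_postal_code(html))  # prints 34829

# A paragraph with two <br> tags that is not an address still "matches".
note = "<p>Eggs, milk and bread <br> Pick up at noon <br> Thanks, see you soon</p>"
print(brittle_postal_code(note))  # prints you
```

The second call shows the problem: the markup gives the code nothing to distinguish an address from any other paragraph with the same shape.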
Now let's look at a sample XML document. With XML, you can assign some meaning to the tags in the document. More importantly, it's easy for a machine to process the information as well. You can extract the postal code from this document by simply locating the content surrounded by the <postal-code> and </postal-code> tags, technically known as the <postal-code> element:
<address>
  <name>
    <title>Mrs.</title>
    <first-name> Mary </first-name>
    <last-name> McGoon </last-name>
  </name>
  <street> 1401 Main Street </street>
  <city>Anytown</city>
  <state>NC</state>
  <postal-code> 34829 </postal-code>
</address>
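As a rough illustration of how little code that extraction takes, the following sketch uses Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions for this example; any XML parser would serve the same purpose.

```python
import xml.etree.ElementTree as ET

xml_doc = """
<address>
  <name>
    <title>Mrs.</title>
    <first-name> Mary </first-name>
    <last-name> McGoon </last-name>
  </name>
  <street> 1401 Main Street </street>
  <city>Anytown</city>
  <state>NC</state>
  <postal-code> 34829 </postal-code>
</address>
"""

root = ET.fromstring(xml_doc)

# Look the element up by name; no guessing based on commas or line breaks.
postal_code = root.findtext("postal-code").strip()
print(postal_code)  # prints 34829
```

Unlike the HTML rule, this lookup does not depend on where the postal code sits in the address or how the address is laid out.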
There are three common terms used to describe parts of an XML document: tags, elements, and attributes. Here is a sample document that illustrates the terms:
<address>
  <name>
    <title>Mrs.</title>
    <first-name> Mary </first-name>
    <last-name> McGoon </last-name>
  </name>
  <street> 1401 Main Street </street>
  <city state="NC">Anytown</city>
  <postal-code> 34829 </postal-code>
</address>
- A tag is the text between the left angle bracket (<) and the right angle bracket (>). There are starting tags (such as <name>) and ending tags (such as </name>).
- An element is the starting tag, the ending tag, and everything in between. In the sample above, the <name> element contains three child elements: <title>, <first-name>, and <last-name>.
- An attribute is a name-value pair inside the starting tag of an element. In this example, state is an attribute of the <city> element; in earlier examples, <state> was an element (see A sample XML document).
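The difference between elements and attributes also shows up in how a parser exposes them. The sketch below, again using Python's standard xml.etree.ElementTree as an assumed tool, reads the child elements of <name> and the state attribute of <city> from the sample document.

```python
import xml.etree.ElementTree as ET

doc = """
<address>
  <name>
    <title>Mrs.</title>
    <first-name> Mary </first-name>
    <last-name> McGoon </last-name>
  </name>
  <street> 1401 Main Street </street>
  <city state="NC">Anytown</city>
  <postal-code> 34829 </postal-code>
</address>
"""

root = ET.fromstring(doc)

# Elements: <name> has three child elements.
name = root.find("name")
child_tags = [child.tag for child in name]
print(child_tags)  # prints ['title', 'first-name', 'last-name']

# Attributes: state is a name-value pair on the <city> starting tag.
city = root.find("city")
print(city.text)          # prints Anytown
print(city.get("state"))  # prints NC

# In this version of the document there is no <state> element at all.
print(root.find("state") is None)  # prints True
```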
Now that you've seen how developers can use XML to create documents with self-describing data, let's look at how people are using those documents to improve the Web. Here are a few key areas:
- XML simplifies data interchange. Because different organizations (or even different parts of the same organization) rarely standardize on a single set of tools, it can take a significant amount of work for applications to communicate. Using XML, each group creates a single utility that transforms their internal data formats into XML and vice versa. Best of all, there's a good chance that their software vendors already provide tools to transform their database records (or LDAP directories, or purchase orders, and so forth) to and from XML.
- XML enables smart code. Because XML documents can be structured to identify every important piece of information (as well as the relationships between the pieces), it's possible to write code that can process those XML documents without human intervention. The fact that software vendors have spent massive amounts of time and money building XML development tools means writing that code is a relatively simple process.
- XML enables smart searches. Although search engines have improved steadily over the years, it's still quite common to get erroneous results from a search. If you're searching HTML pages for someone named "Chip," you might also find pages on chocolate chips, computer chips, wood chips, and lots of other useless matches. Searching XML documents for <first-name> elements that contained the text Chip would give you a much better set of results.
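The smart-search point can be sketched in a few lines of Python. The <people> document below, including its <person> and <note> elements and the names in it, is invented for this example; only the <first-name> element and the search for Chip come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set: two people, one of whom merely mentions chips.
people = ET.fromstring("""
<people>
  <person>
    <first-name>Chip</first-name>
    <last-name>Douglas</last-name>
    <note>Likes chess</note>
  </person>
  <person>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
    <note>Bakes chocolate chip cookies</note>
  </person>
</people>
""")

# Plain text search: anything that mentions "chip" anywhere is a hit.
text_hits = [
    p.findtext("last-name")
    for p in people.iter("person")
    if "chip" in "".join(p.itertext()).lower()
]
print(text_hits)  # prints ['Douglas', 'McGoon']

# Structured search: only <first-name> elements whose text is Chip.
element_hits = [
    p.findtext("last-name")
    for p in people.iter("person")
    if (p.findtext("first-name") or "").strip() == "Chip"
]
print(element_hits)  # prints ['Douglas']
```

The text search returns both records, while the search restricted to <first-name> returns only the person actually named Chip.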
I'll also discuss real-world uses of XML in Case studies.