The Extensible Markup Language, or XML for short, is a new technology for Web applications that has the official recommendation of the World Wide Web Consortium (W3C). XML is a descendant of the Standard Generalized Markup Language (SGML), a markup standard created by former IBMer Dr. Charles Goldfarb, and like SGML it lets you create your own tags.
When people first hear about XML, they often ask why we need another markup language. Everybody's browser supports HTML today, so why create more tags? Given that lots of HTML tags haven't been implemented the same way by the big browser vendors, why let anybody and everybody create their own tags?
The answer to these questions is that HTML and XML serve different functions: HTML tags describe how to render things on the screen, while XML tags describe what things are. Put another way, HTML tags are designed for the interaction between humans and computers; XML tags are designed for the interaction between two computers.
To see this difference, look at the HTML and XML versions of a short document. Listing 1 shows the HTML version.
Listing 1. The HTML version of an address
<p><b>Mrs. Mary McGoon</b>
<br>
1401 Main Street
<br>
Anytown, NC 34829</p>
When this document is rendered in a browser, it looks something like this:
Mrs. Mary McGoon
1401 Main Street
Anytown, NC 34829
Anyone familiar with postal addresses in the United States will recognize this document as someone's address. Even if you're from another country where postal codes and other conventions are different, you can still surmise that this is someone's address. Imagine writing code to interpret this document, however. To extract the zip code from this address, our algorithm might look like this: Given a <p> tag that contains two <br> tags, take the text that follows the second <br> tag. In that text, everything up to the comma is the name of the city, the two-character token following the comma is the name of the state, and the final token is the zip code.
While this algorithm would work for our sample HTML document, it's easy to think of a perfectly valid address that breaks our algorithm. We've also completely sidestepped the issue of distinguishing a <p> tag that contains an address from any other <p> tag. While the address formats beautifully in a browser, our HTML markup isn't nearly as well suited for use by another program.
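To make the fragility concrete, here is a minimal Python sketch of the heuristic described above. The function name `zip_from_html` and the apartment-line example are illustrative assumptions, not part of the original listing; the point is only that a perfectly valid address with one more line no longer fits the pattern.

```python
import re

HTML_ADDRESS = ("<p><b>Mrs. Mary McGoon</b> <br> 1401 Main Street "
                "<br> Anytown, NC 34829</p>")

def zip_from_html(html):
    """Apply the heuristic from the text: take whatever follows the
    second <br>, split it at the comma, and use the final token."""
    body = re.sub(r"</?p>", "", html)          # drop the enclosing <p> tags
    parts = re.split(r"<br\s*/?>", body)       # split on the <br> tags
    if len(parts) != 3:
        return None                            # not the two-<br> shape we expect
    city, _, rest = parts[2].partition(",")
    tokens = rest.split()
    # Expect a two-character state followed by the zip code.
    if len(tokens) != 2 or len(tokens[0]) != 2:
        return None
    return tokens[1]

# Hypothetical variation: an apartment line adds a third <br>.
WITH_APARTMENT = ("<p><b>Mrs. Mary McGoon</b> <br> 1401 Main Street "
                  "<br> Apt. 4 <br> Anytown, NC 34829</p>")

print(zip_from_html(HTML_ADDRESS))    # 34829
print(zip_from_html(WITH_APARTMENT))  # None
```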
Now let's take a look at an XML version of the same document in Listing 2.
Listing 2. The XML version of the same address
<?xml version="1.0"?>
<address>
  <name>
    <title>Mrs. </title>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <zipcode>34829</zipcode>
</address>
As with our HTML document, anyone familiar with U.S. postal addresses will recognize this document as an address. More importantly, a computer can recognize the parts of this address as well. Here's a much more robust algorithm for finding the zip code in our XML document:
The zip code is the text of the <zipcode> tag.
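As a rough illustration, that one-line rule might be coded like this in Python with the standard library's `xml.etree.ElementTree` parser. The function name `zip_from_xml` is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

XML_ADDRESS = """<?xml version="1.0"?>
<address>
  <name>
    <title>Mrs. </title>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <zipcode>34829</zipcode>
</address>"""

def zip_from_xml(xml_text):
    """Return the text of the <zipcode> child of the root element."""
    root = ET.fromstring(xml_text)
    return root.findtext("zipcode")

print(zip_from_xml(XML_ADDRESS))  # 34829
```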
This algorithm is much simpler to code, and it would be difficult, if not impossible, to write a valid address that breaks this algorithm. A computer can understand all of the parts of the address and how they relate to each other, and the computer can decide the best way to render that data. For example, the XML document might be rendered like this:
Mrs. Mary McGoon
1401 Main Street
Anytown, NC 34829
In rendering the XML tags in this style, you could convert them into HTML markup that's virtually identical to the earlier HTML document. If you want to print a mailing label for this address, you might render the document like this:
Figure 1. Mailing label with bar code and address
In this case, you print Mrs. McGoon's zip code as a bar code for the benefit of the scanners at the post office. The most important concept here is that content and presentation are separate. The data and its structure are tagged in a presentation-independent way, and the decision of how to render it is delayed as long as possible.
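The separation of content and presentation can be sketched in a few lines of Python: one function reads the data, and two independent functions decide how to show it. The function names are hypothetical, and the label renderer only prints a bracketed placeholder where a real application would draw the bar code.

```python
import xml.etree.ElementTree as ET
from html import escape

XML_ADDRESS = """<?xml version="1.0"?>
<address>
  <name>
    <title>Mrs. </title>
    <first-name>Mary</first-name>
    <last-name>McGoon</last-name>
  </name>
  <street>1401 Main Street</street>
  <city>Anytown</city>
  <state>NC</state>
  <zipcode>34829</zipcode>
</address>"""

def read_address(xml_text):
    """Pull the address fields out of the XML, ignoring presentation."""
    root = ET.fromstring(xml_text)
    def text(path):
        return (root.findtext(path) or "").strip()
    return {
        "name": " ".join(text(p) for p in
                         ("name/title", "name/first-name", "name/last-name")),
        "street": text("street"),
        "city": text("city"),
        "state": text("state"),
        "zipcode": text("zipcode"),
    }

def render_html(a):
    """Presentation 1: HTML markup like the earlier HTML listing."""
    return (f"<p><b>{escape(a['name'])}</b><br>{escape(a['street'])}<br>"
            f"{escape(a['city'])}, {escape(a['state'])} {escape(a['zipcode'])}</p>")

def render_label(a):
    """Presentation 2: plain-text mailing label (bar code is a placeholder)."""
    return "\n".join([
        a["name"].upper(),
        a["street"].upper(),
        f"{a['city']} {a['state']} {a['zipcode']}".upper(),
        f"[bar code for {a['zipcode']}]",
    ])

address = read_address(XML_ADDRESS)
print(render_html(address))
print(render_label(address))
```

Neither renderer touches the XML itself, so adding a third presentation would not require any change to the data or to the code that reads it.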
Everyone wants to know how XML will change the Web. First of all, XML will not replace HTML. The two markup languages are designed for different purposes, and they will coexist on the Web for many years to come.
Now that I've laid to rest Web developers' biggest concern about XML, let's consider what impact XML will have on the Web. XML will establish a universal data format on the Web. Better business-to-business communication, better agents, and better searches will all be made possible by XML.
If you look at the Web today, you'll find several universal technologies, including TCP/IP, HTML, and Java.
- TCP/IP is the universal connectivity protocol; everything from mainframes to laptops to cellular phones can connect to the Web using it.
- HTML is the universal rendering language. Although not all browsers support all functions, there is a core set of HTML tags that can be rendered on any browser.
- Finally, Java's promise of "write once, run anywhere" makes supporting the wide variety of devices on the Web much easier.
Because of these ubiquitous technologies, it's relatively straightforward to create a Web application that runs on any platform. XML completes the picture by enabling universal data. You can build an XML document that describes a data structure, and that structured data can be sent anywhere across the Web. XML will change the Web because of its power and flexibility as a data interchange format.
One of the challenges in conducting e-business is communicating with other organizations, whether they are partners, suppliers, competitors, or even other groups within the same company. XML simplifies business-to-business communication because the only thing that any two organizations have to agree on is the XML tag set that will be used to represent data. Neither organization has to know how the other's back-end systems are organized. If my systems run OS/390 and your systems run Linux, that doesn't matter. If my databases are relational and yours are object-oriented, that doesn't matter. If my code was written in C++ and yours was written in Java, that doesn't matter. The only thing that's important is that we agree on a standard set of tags for data interchange.
Once we've agreed on a tag set, each of us can write the mapping code to transform XML documents into whatever format we need to work with our back-end systems. For example, an XML document that's received from a partner might be parsed, then converted into a transaction that drives some business process on my system. Even better, if another company joins our consortium, we don't have to write more code to interact with the systems of the new company. We simply require that company to follow the document rules we defined in our XML tag set.
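The mapping code mentioned above could look something like the following Python sketch, which reuses the address tag set from Listing 2 as the agreed format. The `AddressRecord` class and the list of required tags are assumptions chosen for illustration; a real back-end record would have whatever shape your own systems need.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class AddressRecord:
    """Hypothetical internal record used by 'my' back-end system."""
    full_name: str
    street: str
    city: str
    state: str
    zipcode: str

# The document rules both parties agreed on (assumed for this sketch).
REQUIRED_TAGS = ("name", "street", "city", "state", "zipcode")

def address_from_xml(xml_text):
    """Parse a partner's XML document and map it to the internal record."""
    root = ET.fromstring(xml_text)
    if root.tag != "address":
        raise ValueError(f"expected <address>, got <{root.tag}>")
    missing = [tag for tag in REQUIRED_TAGS if root.find(tag) is None]
    if missing:
        raise ValueError(f"document is missing required tags: {missing}")
    name_parts = [(el.text or "").strip() for el in root.find("name")]
    return AddressRecord(
        full_name=" ".join(part for part in name_parts if part),
        street=root.findtext("street").strip(),
        city=root.findtext("city").strip(),
        state=root.findtext("state").strip(),
        zipcode=root.findtext("zipcode").strip(),
    )

PARTNER_DOCUMENT = (
    "<address><name><title>Mrs. </title><first-name>Mary</first-name>"
    "<last-name>McGoon</last-name></name><street>1401 Main Street</street>"
    "<city>Anytown</city><state>NC</state><zipcode>34829</zipcode></address>"
)

print(address_from_xml(PARTNER_DOCUMENT))
```

The sender never needs to know that `AddressRecord` exists. A new partner only has to produce documents that pass the same checks.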
When writing an agent, one of the challenges is to make sense of incoming data. A good agent interprets information intelligently, then responds to it accordingly. If the data sent to an agent is structured with XML, it's much easier for the agent to understand exactly what the data means and how it relates to other pieces of data it may already know. As we illustrated in our sample HTML document, writing code to interpret the data contained in HTML tags is difficult and error prone. With XML, the structure of the data is easily determined and manipulated.
A major problem with today's Web is that search engines can't process HTML intelligently. For example, if you're searching for someone named Chip, you might get pages for chocolate chip cookies, RAM chips, poker chips, and guys named Chip. On the other hand, if you were searching for documents that contained a <first-name> tag with a value of "Chip," you would get much better results. Being able to limit searches to those XML documents that use a certain set of tags would allow you to weed out a massive amount of unrelated content.
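A toy Python comparison shows the difference between the two kinds of search. The three documents and both function names are made up for this sketch; no real search engine works on an in-memory dictionary, but the contrast between matching any text and matching a specific tag is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical document collection, keyed by file name.
DOCUMENTS = {
    "recipe.xml": "<recipe><title>Chocolate chip cookies</title></recipe>",
    "hardware.xml": "<product><description>RAM chip</description></product>",
    "person.xml": ("<address><name><first-name>Chip</first-name>"
                   "<last-name>Smith</last-name></name></address>"),
}

def full_text_search(docs, word):
    """Match the word anywhere in the document text, as a keyword search would."""
    hits = []
    for name, xml_text in docs.items():
        text = "".join(ET.fromstring(xml_text).itertext())
        if word.lower() in text.lower():
            hits.append(name)
    return hits

def tag_search(docs, tag, value):
    """Match only documents where the given tag has exactly the given value."""
    hits = []
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if any((el.text or "").strip() == value for el in root.iter(tag)):
            hits.append(name)
    return hits

print(full_text_search(DOCUMENTS, "chip"))       # all three documents
print(tag_search(DOCUMENTS, "first-name", "Chip"))  # only person.xml
```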
As an aside, being able to limit search results to documents that use a particular tag set is one of the market forces that will drive the acceptance of XML. Say that a group of automobile dealers defines a tag set for used cars, and that several popular search engines promise great results because they look only at XML documents using those tags. If you're an auto dealer, you can either join the market and support that tag set or be left out of the market completely. If your inventory is not described using the standard XML markup, would-be car buyers using an XML search engine will never find you.
XML is poised to change the Web, enabling a whole new generation of e-business applications. Just as HTML and graphical browsers sparked an exponential growth in Web use, XML's enhancements to e-business will start another period of exponential growth. Let's get started!
- Check out the XML Specification for the official word on XML.
- Find out more about XML with the Introduction to XML tutorial. It takes about 45 minutes to complete, but in 10 minutes you can learn the basic vocabulary.
MC Dug-T is developerWorks' Minister of Science, droppin' the XML, Java, and Web services 411 on the public. In his travels, he gets mad props from his peeps worldwide for the stone-cold, stoopid-fresh style sheets he leaves behind. All his mad-phat nollidge will soon be published by O'Reilly and Associates in the Strictly Non-Fiction book XSLT (ISBN 0596000537, pre-order your copy today at amazon.com) which will then start slayin' soft-sellin' suckas at tha local booksella. Discussing the book in a recent dW interview, he boasted, "I'm gonna empty mah dome into one supa-fly tome."
For relaxation, he likes to put his hands up in the air, and in his words, "wave 'em around like I just don't care." When not chillin' with his worldwide XML krew, he maxes at the crib in Raleigh with his wife, cooking teacher CT-ONE, and their six-year-old shortie, Lily the Flayva Princess. You can send him a shout-out at email@example.com.