XML document rules
If you've looked at HTML documents, you're familiar with the basic concepts of using tags to mark up the text of a document. This section discusses the differences between HTML documents and XML documents. It goes over the basic rules of XML documents, and discusses the terminology used to describe them.
One important point about XML documents: The XML specification requires a parser to reject any XML document that doesn't follow the basic rules. Most HTML parsers will accept sloppy markup, making a guess as to what the writer of the document intended. To avoid the loosely structured mess found in the average HTML document, the creators of XML decided to enforce document structure from the beginning.
(By the way, if you're not familiar with the term, a parser is a piece of code that attempts to read a document and interpret its contents.)
There are three kinds of XML documents:
- Invalid documents don't follow the syntax rules defined by the XML specification. If a developer has defined rules for what the document can contain in a DTD or schema, and the document doesn't follow those rules, that document is invalid as well. (See Defining document content for a proper introduction to DTDs and schemas for XML documents.)
- Valid documents follow both the XML syntax rules and the rules defined in their DTD or schema.
- Well-formed documents follow the XML syntax rules but don't have a DTD or schema to be checked against. (Strictly speaking, every valid document is also well-formed.)
An XML document must be contained in a single element. That single element is called the root element, and it contains all the text and any other elements in the document. In the following example, the XML document is contained in a single element, the <greeting> element. Notice that the document has a comment that's outside the root element; that's perfectly legal.
<?xml version="1.0"?>
<!-- A well-formed document -->
<greeting>
  Hello, World!
</greeting>
Here's a document that doesn't contain a single root element:
<?xml version="1.0"?>
<!-- An invalid document -->
<greeting>
  Hello, World!
</greeting>
<greeting>
  Hola, el Mundo!
</greeting>
An XML parser is required to reject this document, regardless of the information it might contain.
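As a rough illustration of that rule, the sketch below feeds both documents to the parser in Python's standard library (`xml.etree.ElementTree`). The helper name `try_parse` is made up for this example, and the whitespace inside the elements has been trimmed.

```python
import xml.etree.ElementTree as ET

ONE_ROOT = """<?xml version="1.0"?>
<!-- A well-formed document -->
<greeting>Hello, World!</greeting>"""

TWO_ROOTS = """<?xml version="1.0"?>
<!-- An invalid document -->
<greeting>Hello, World!</greeting>
<greeting>Hola, el Mundo!</greeting>"""


def try_parse(text):
    """Return the root element's name, or None if the parser rejects the text."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        return None


print(try_parse(ONE_ROOT))   # the single root element is accepted
print(try_parse(TWO_ROOTS))  # the second root element makes the parser give up
```

The parser does not try to recover by keeping only the first greeting; it reports an error for the document as a whole.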
XML elements can't overlap. Here's some markup that isn't legal:
<!-- NOT legal XML markup -->
<p>
  <b>I <i>really love</b> XML. </i>
</p>
If you begin an <i> element inside a <b> element, you have to end it there as well. If you want the text XML to appear in italics, you need to add a second <i> element to correct the markup:
<!-- legal XML markup -->
<p>
  <b>I <i>really love</i></b> <i>XML.</i>
</p>
An XML parser will accept only this markup; the HTML parsers in most Web browsers will accept both.
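To see what "accept only this markup" looks like in practice, here is a small sketch using Python's `xml.etree.ElementTree`. The helper `parse_error` is a name invented for this example; it returns the parser's own error message rather than guessing at one.

```python
import xml.etree.ElementTree as ET

OVERLAPPING = "<p><b>I <i>really love</b> XML.</i></p>"
NESTED = "<p><b>I <i>really love</i></b> <i>XML.</i></p>"


def parse_error(text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)


print(parse_error(OVERLAPPING))  # an error message that points at the stray </b>
print(parse_error(NESTED))       # None: properly nested markup parses
```

The error message includes a line and column, which is usually enough to find the tag that closes in the wrong place.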
You can't leave out any end tags. In the first example below, the markup is not legal because there are no end paragraph (</p>) tags. While this is acceptable in HTML (and, in some cases, SGML), an XML parser will reject it.
<!-- NOT legal XML markup -->
<p>Yada yada yada...
<p>Yada yada yada...
<p>...
If an element contains no content at all, it is called an empty element; the HTML break (<br>) and image (<img>) elements are two examples. In empty elements in XML documents, you can put the closing slash in the start tag. The two break elements and the two image elements below mean the same thing to an XML parser:
<!-- Two equivalent break elements -->
<br></br>
<br />
<!-- Two equivalent image elements -->
<img src="../img/c.gif"></img>
<img src="../img/c.gif" />
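One way to check that the two spellings really are equivalent is to parse each and compare what the parser hands back. This sketch uses Python's `xml.etree.ElementTree`; the `summary` helper is hypothetical and simply collects the parts of an element worth comparing.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<img src="../img/c.gif"></img>')
short_form = ET.fromstring('<img src="../img/c.gif" />')


def summary(element):
    """Collect the name, attributes, text, and child count of an element."""
    return (element.tag, dict(element.attrib), element.text, len(element))


# Both spellings produce an element with the same name and attributes,
# no text, and no children.
print(summary(long_form) == summary(short_form))
```

Once parsed, nothing records which spelling the author used, so a program that writes the document back out is free to choose either form.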
XML elements are case sensitive. In HTML, <h1> and <H1> are the same; in XML, they're not. If you try to end an <h1> element with a </H1> tag, you'll get an error. In the example below, the heading at the top is illegal, while the one at the bottom is fine.
<!-- NOT legal XML markup -->
<h1>Elements are case sensitive</H1>
<!-- legal XML markup -->
<h1>Elements are case sensitive</h1>
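The sketch below, again using Python's `xml.etree.ElementTree`, checks both headings and then shows a side effect of case sensitivity: `h1` and `H1` can coexist in one document as two unrelated element names. The helper `is_well_formed` and the `<doc>` wrapper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<h1>Elements are case sensitive</H1>"))  # False
print(is_well_formed("<h1>Elements are case sensitive</h1>"))  # True

# Names that differ only in case are different names to an XML parser.
doc = ET.fromstring("<doc><h1>lower</h1><H1>upper</H1></doc>")
print([child.tag for child in doc])  # ['h1', 'H1']
print(len(doc.findall("h1")))        # 1
```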
There are two rules for attributes in XML documents:
- Attributes must have values
- Those values must be enclosed within quotation marks
Compare the two examples below. The markup at the top is legal in HTML, but not in XML. To do the equivalent in XML, you have to give the attribute a value, and you have to enclose it in quotes.
<!-- NOT legal XML markup -->
<ol compact>
<!-- legal XML markup -->
<ol compact="yes">
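Both attribute rules can be checked with a short sketch using Python's `xml.etree.ElementTree`. Each candidate has been given an `</ol>` end tag so that the attribute is the only thing under test; the helper `parse_or_none` and the `CANDIDATES` table are names invented here.

```python
import xml.etree.ElementTree as ET

CANDIDATES = {
    "no value": "<ol compact></ol>",
    "unquoted value": "<ol compact=yes></ol>",
    "quoted value": '<ol compact="yes"></ol>',
}


def parse_or_none(text):
    """Return the parsed element, or None if the parser rejects the text."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


for label, text in CANDIDATES.items():
    element = parse_or_none(text)
    print(label, "->", "rejected" if element is None else element.attrib)
```

Only the last candidate parses, and its attribute value comes back as the plain string `yes` with the quotation marks removed.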
You can use either single or double quotes, as long as the opening and closing quotes match. If the value of the attribute contains a single or double quote, you can use the other kind of quote to surround the value (as in name="Doug's car"), or use the entities &quot; for a double quote and &apos; for a single quote. An entity is a symbol, such as &quot;, that the XML parser replaces with other text, such as ".
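A brief sketch of those quoting options, using Python's `xml.etree.ElementTree`: the `<car>` and `<quote>` element names are made up for the example, and only the attribute values matter.

```python
import xml.etree.ElementTree as ET

# Three ways to write the same attribute value.
variants = [
    '''<car name="Doug's car" />''',        # other kind of quote around the value
    """<car name='Doug&apos;s car' />""",   # entity inside single quotes
    '''<car name="Doug&apos;s car" />''',   # entity inside double quotes
]
values = [ET.fromstring(markup).get("name") for markup in variants]
print(values)

# &quot; works the same way for double quotes inside a double-quoted value.
quoted = ET.fromstring('<quote text="She said &quot;hi&quot;" />').get("text")
print(quoted)

# Opening and closing quotes must match, or the parser rejects the tag.
try:
    ET.fromstring("""<car name="Doug's car' />""")
    mismatched_ok = True
except ET.ParseError:
    mismatched_ok = False
print(mismatched_ok)
```

All three variants yield the same value; the program reading the document never sees which quoting style was used.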
Most XML documents start with an XML declaration that provides basic information about the document to the parser. An XML declaration is recommended, but not required. If there is one, it must be the first thing in the document.
The declaration can contain up to three name-value pairs (many people call them attributes, although technically they're not):
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
version is the version of XML used; this value is almost always 1.0. encoding is the character set used in this document. The ISO-8859-1 character set referenced in this declaration includes all of the characters used by most Western European languages. If no encoding is specified, the XML parser assumes that the characters are in the UTF-8 set, a Unicode encoding that supports virtually every character and ideograph from the world's languages.
standalone, which can be either yes or no, defines whether this document can be processed without reading any other files. For example, if the XML document doesn't reference any other files, you would specify standalone="yes". If the XML document references other files that describe what the document can contain (more about those files in a minute), you could specify standalone="no". Because standalone="no" is the default, you rarely see standalone in XML declarations.
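The effect of the encoding pair is easiest to see with a character outside ASCII. In this sketch (Python's `xml.etree.ElementTree`; the greeting text is an assumption chosen for its ñ), the same ISO-8859-1 bytes are parsed once with a declaration and once without. The sketch covers encoding only; this parser does not expose the standalone value.

```python
import xml.etree.ElementTree as ET

BODY = "<greeting>Hola, se\u00f1or!</greeting>"

# Bytes in ISO-8859-1 with a declaration that says so.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY).encode("iso-8859-1")
root = ET.fromstring(declared)
print(root.text)

# The same bytes without a declaration: the parser assumes UTF-8,
# and the ISO-8859-1 byte for the n-tilde is not valid UTF-8.
undeclared = BODY.encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```

The declaration describes the bytes of the file, so it has to match the encoding the file was actually saved in.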
There are a few other things you might find in an XML document:
Comments: Comments can appear almost anywhere in the document outside of other markup; they can even appear before or after the root element. A comment begins with <!-- and ends with -->. A comment can't contain a double hyphen (--) except as part of the closing -->; with that exception, a comment can contain anything. Most importantly, any markup inside a comment is ignored; if you want to remove a large section of an XML document, simply wrap that section in a comment, provided the section contains no comments of its own. (To restore the commented-out section, simply remove the comment tags.) Here's some markup that contains a comment:
<!-- Here's a PI for Cocoon: --> <?cocoon-process type="sql"?>
Processing instructions: A processing instruction is markup intended for a particular piece of code. In the example above, there's a processing instruction (sometimes called a PI) for Cocoon, an XML processing framework from the Apache Software Foundation. When Cocoon is processing an XML document, it looks for processing instructions that begin with cocoon-process, then processes the XML document accordingly. In this example, the type="sql" attribute tells Cocoon that the XML document contains a SQL statement.
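To show how a parser treats comments and processing instructions, the sketch below uses `xml.dom.minidom` from Python's standard library, which keeps both kinds of node in its tree. The `<doc>`, `<hidden>`, and `<shown>` elements are invented for the example. Note that the parser hands over the PI's `type="sql"` as an uninterpreted string; making sense of it is left to the target application (Cocoon, in the example above).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

SOURCE = (
    "<!-- Here's a PI for Cocoon: -->"
    '<?cocoon-process type="sql"?>'
    "<doc><!-- <hidden>ignored</hidden> --><shown/></doc>"
)

document = minidom.parseString(SOURCE)
comment, pi, root = document.childNodes

print(pi.target)  # cocoon-process
print(pi.data)    # type="sql"

# Markup inside a comment is never parsed as elements.
element_children = [
    node.tagName for node in root.childNodes if node.nodeType == node.ELEMENT_NODE
]
print(element_children)  # ['shown']

# A double hyphen inside a comment is an error.
try:
    minidom.parseString("<doc><!-- a -- b --></doc>")
    double_hyphen_ok = True
except ExpatError:
    double_hyphen_ok = False
print(double_hyphen_ok)  # False
```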
<!-- Here's an entity: --> <!ENTITY dw "developerWorks">
Entities: The example above defines an entity for the document. Anywhere the XML processor finds the string &dw;, it replaces the entity with the string developerWorks. The XML spec also defines five entities you can use in place of various special characters. The entities are:
- &lt; for the less-than sign
- &gt; for the greater-than sign
- &quot; for a double quote
- &apos; for a single quote (or apostrophe)
- &amp; for an ampersand.
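Here is a sketch of entity replacement with Python's `xml.etree.ElementTree`. An entity declaration has to live inside a DTD, so the example wraps the dw declaration in a DOCTYPE with an internal subset; that wrapper, the `<note>` element, and its text are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY dw "developerWorks">
]>
<note>&dw;: 1 &lt; 2 &amp; 3 &gt; 2, &quot;double&quot; and &apos;single&apos;</note>"""

root = ET.fromstring(DOCUMENT)
print(root.text)

# An entity that was never declared is an error, not an empty string.
try:
    ET.fromstring("<note>&undeclared;</note>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```

By the time the text reaches the program, every entity reference has already been replaced with the text it stands for.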
XML's power comes from its flexibility, the fact that you and I and millions of other people can define our own tags to describe our data. Remember the sample XML document for a person's name and address? That document includes the <title> element for a person's courtesy title, a perfectly reasonable choice for an element name. If you run an online bookstore, you might create a <title> element for the title of a book. If you run an online mortgage company, you might create a <title> element for the title to a piece of property. All of those are reasonable choices, but all of them create elements with the same name. How do you tell if a given <title> element refers to a person, a book, or a piece of property? With namespaces.
To use a namespace, you define a namespace prefix and map it to a particular string. Here's how you might define namespace prefixes for our three <title> elements:
<?xml version="1.0"?>
<customer_summary
  xmlns:addr="http://www.xyz.com/addresses/"
  xmlns:books="http://www.zyx.com/books/"
  xmlns:mortgage="http://www.yyz.com/title/" >
...
<addr:name><title>Mrs.</title> ... </addr:name> ...
...
<books:title>Lord of the Rings</books:title> ...
...
<mortgage:title>NC2948-388-1983</mortgage:title> ...
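In this example, the three namespace prefixes are addr, books, and mortgage. Notice that a prefix applies only to the element that carries it; child elements do not inherit the namespace of a prefixed parent. The first <title> element is written without a prefix inside <addr:name>, so it does not belong to the addr namespace even though its parent does. To put it in that namespace, write it as <addr:title>.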
One final point: The string in a namespace definition is just a string. Yes, these strings look like URLs, but the parser doesn't treat them as such. You could define xmlns:addr="mike" and that would work just as well. The only thing that's important about the namespace string is that it's unique; that's why most namespace definitions look like URLs. The XML parser does not go to http://www.zyx.com/books/ to search for a DTD or schema; it simply uses that text as a string. It's confusing, but that's how namespaces work.
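The following sketch shows how one namespace-aware parser, Python's `xml.etree.ElementTree`, reports the two prefixed <title> elements. It uses a cut-down version of the customer summary with the elided parts left out, and the `x` and `y` prefixes in the second half are invented for the example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<customer_summary
    xmlns:books="http://www.zyx.com/books/"
    xmlns:mortgage="http://www.yyz.com/title/">
  <books:title>Lord of the Rings</books:title>
  <mortgage:title>NC2948-388-1983</mortgage:title>
</customer_summary>"""

root = ET.fromstring(DOCUMENT)

# ElementTree reports each name as {namespace string}local-name,
# so the two title elements can no longer be confused.
tags = [child.tag for child in root]
print(tags)

# The prefix is only shorthand for the string, and the string need not
# look like a URL: two different prefixes mapped to "mike" give one name.
first = ET.fromstring('<x:title xmlns:x="mike">A</x:title>')
second = ET.fromstring('<y:title xmlns:y="mike">B</y:title>')
print(first.tag == second.tag)
```

Nothing in this example touches the network; the parser only compares the namespace strings character by character.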