Three-letter extensions have been used to identify file types since at least the late 1960s and are still used today. Some operating systems use four letters or two or even one instead of three, but the basic convention remains filename-period-extension. When files are moved between heterogeneous systems, the name and the extension are often the only metadata that move with them.
If you store XML documents in a file system, use a standard file extension. Doing so makes it much easier for everyone to find, recognize, and process XML files. By far, the most common extension is .xml, but numerous others are used for specific subsets of XML, as Table 1 shows.
Table 1. Common XML file extensions
| Extension | Meaning |
| .xml | Generic XML document |
| .ent | Parsed entity, document fragment |
| .dtd | Document Type Definition (DTD) |
| .rdf | Resource Description Framework (RDF) XML syntax |
| .atom | Atom syndication feed |
| .owl | Web Ontology Language |
| .xhtml | Extensible Hypertext Markup Language (XHTML) |
| .xsd | W3C XML Schema Language schema |
| .xsl | XSL Transformations |
| .fo | XSL Formatting Objects |
| .rng | RELAX NG XML syntax |
| .sch | Schematron schema |
| .svg | Scalable Vector Graphics (SVG) |
| .rss | RSS (Really Simple Syndication, Rich Site Summary, or RDF Site Summary -- depending on who defines the acronym) syndication feed |
| .plist | Apple's property list format |
Resources served by a Web server might not be in a file system. However, if the resources are XML documents, do still try to make sure that the URLs for these resources end with one of the above extensions, as appropriate to their detailed type.
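As a rough illustration, a check against the extensions in Table 1 might look like the following Python sketch. The function name `has_xml_extension` and the `XML_EXTENSIONS` set are hypothetical names chosen for this example, and the set contains only the Table 1 entries; you would extend it for other vocabularies you use.

```python
import posixpath
from urllib.parse import urlparse

# Extensions taken from Table 1; this is not an exhaustive list.
XML_EXTENSIONS = {
    ".xml", ".ent", ".dtd", ".rdf", ".atom", ".owl", ".xhtml", ".xsd",
    ".xsl", ".fo", ".rng", ".sch", ".svg", ".rss", ".plist",
}


def has_xml_extension(name_or_url: str) -> bool:
    """Return True if a file name or URL path ends with a Table 1 extension."""
    # urlparse leaves a plain file name in .path and drops any query string,
    # so the same code handles both file names and URLs.
    path = urlparse(name_or_url).path
    _, ext = posixpath.splitext(path)
    # Extensions are compared case-insensitively (an assumption of this sketch).
    return ext.lower() in XML_EXTENSIONS


print(has_xml_extension("http://example.com/feeds/news.atom?page=2"))
print(has_xml_extension("photo.jpeg"))
```

A match only says that the name follows the convention; it says nothing about whether the content is actually well-formed XML.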
When a Web server transmits a file, it doesn't just send the file name and contents. It also sends a lot of metadata about the file in the HTTP header, as shown in Listing 1:
Listing 1. Sample metadata
HTTP/1.1 200 OK
Date: Sun, 23 Jan 2005 18:21:33 GMT
Server: Apache/2.0.52 (Unix) mod_ssl/2.0.52 OpenSSL/0.9.7d
Last-Modified: Sun, 10 Oct 2004 16:17:21 GMT
ETag: "3e06d-16a05-2dbc8640"
Accept-Ranges: bytes
Content-Length: 92677
Content-Type: application/xhtml+xml
Notice the Content-Type header in the last line. The value of this header --
in this case application/xhtml+xml -- is a MIME
media type (possibly accompanied by optional information about the character set of the document).
Web browsers and other clients use this metadata to decide how to process the file -- for example,
to determine whether they can display it natively or have to pass it to a helper application. MIME types
are also used in other contexts, including e-mail, and by a few experimental operating systems, notably
BeOS. Linux and other UNIX® systems also use MIME types, but they mostly do so by mapping file
extensions to MIME types rather than by tagging files with MIME types directly. The real, practical use
of MIME types is on the Internet.
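To make the structure of the Content-Type value concrete, here is a minimal Python sketch that splits a header value into its media type and optional character set. It leans on the standard library's `email.message.Message` class, which already understands MIME header syntax; the helper name `parse_content_type` is invented for this example.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header value into (media_type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased type/subtype pair;
    # get_content_charset() returns the lowercased charset parameter, if any.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("application/xhtml+xml"))
print(parse_content_type("application/xhtml+xml; charset=UTF-8"))
```

The header in Listing 1 carries no charset parameter, so the second element of the result is `None` for it.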
The basic content type for generic XML documents is application/xml. The type text/xml is also registered, but this type has been deprecated because of some unfortunate interactions with other parts of HTTP. (When text/xml is sent without an explicit charset parameter, it indicates that the document is encoded in ASCII, even if the document's XML declaration states otherwise.) Other basic registered MIME types are:
- application/xml-dtd for DTDs
- application/xml-external-parsed-entity for document fragments
For more specific XML format types, the convention is to use the type application/foo+xml, where "foo" refers to the specific XML vocabulary -- for example, application/rdf+xml for RDF, application/xhtml+xml for XHTML, application/svg+xml for SVG, and so forth. In this way, generic XML processors are able to recognize the document as XML while still allowing processors for the specific format to recognize it as well. Table 2 lists some of the media types you might encounter.
Table 2. XML MIME media types
| Media type | Document format |
| image/svg+xml* | SVG |
| application/atom+xml* | Atom Feed Syndication Format |
| application/mathml+xml* | Mathematical Markup Language |
| application/beep+xml | Blocks Extensible Exchange Protocol |
| application/cpl+xml | Call Processing Language |
| application/soap+xml | A SOAP message |
| application/epp+xml | Extensible Provisioning Protocol |
| application/rdf+xml | RDF XML syntax |
| application/xhtml+xml | XHTML |
| application/xop+xml | XML-binary Optimized Packaging |
| application/xslt+xml* | XSLT stylesheet |
| application/xmpp+xml | Extensible Messaging and Presence Protocol |
| application/voicexml+xml* | VoiceXML |
*Registration in progress
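The +xml convention is what lets a generic processor recognize XML without knowing every vocabulary in Table 2. A minimal Python sketch of such a check follows; the function name `is_xml_media_type` is hypothetical, and the choice to accept only application/xml, text/xml, and +xml subtypes (and not application/xml-dtd, which does not label an XML document) is an assumption of this example.

```python
def is_xml_media_type(media_type: str) -> bool:
    """Return True for application/xml, text/xml, or any subtype ending in +xml."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    if essence in {"application/xml", "text/xml"}:
        return True
    _, _, subtype = essence.partition("/")
    return subtype.endswith("+xml")


print(is_xml_media_type("image/svg+xml"))
print(is_xml_media_type("text/html"))
```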
You can't just pick a new MIME media type out of the air for every new format you create. You must publish new types in a formal specification (often an IETF Request for Comments) and register them with the Internet Assigned Numbers Authority (IANA). However, you can designate experimental subtypes without registration. These subtypes must begin with x-. For example, if I needed a custom type for the television listing markup language I invented for an example in my book, the XML 1.1 Bible, I could call it application/x-tvml+xml. The application type tells processors to treat this file as non-ASCII data. The +xml at the end of the subtype tells the processors that it's XML, the x- warns them that this is an unregistered type, and the tvml tells them what kind of data it is.
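The way the pieces of application/x-tvml+xml stack up can be shown with a small Python sketch that takes a media type apart. The `MediaType` tuple and the `describe_media_type` function are names made up for this illustration, and the sketch assumes the input has no parameters attached.

```python
from typing import NamedTuple


class MediaType(NamedTuple):
    top_level: str      # for example "application"
    vocabulary: str     # for example "tvml"
    unregistered: bool  # True if the subtype starts with "x-"
    is_xml: bool        # True if the subtype ends with "+xml"


def describe_media_type(media_type: str) -> MediaType:
    """Break a media type such as application/x-tvml+xml into its parts."""
    top_level, _, subtype = media_type.strip().lower().partition("/")
    is_xml = subtype.endswith("+xml")
    if is_xml:
        subtype = subtype[: -len("+xml")]
    unregistered = subtype.startswith("x-")
    if unregistered:
        subtype = subtype[len("x-"):]
    return MediaType(top_level, subtype, unregistered, is_xml)


print(describe_media_type("application/x-tvml+xml"))
```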
The final way you can identify an XML file is to open it and look. This approach isn't the fastest, and perhaps I shouldn't even discuss it in this series because it's completely inappropriate for large collections of XML documents. However, sometimes, it's the only truly reliable way to tell whether a file or stream contains XML. While you might simply throw the file/stream into a parser and hope for the best, that's a relatively heavyweight solution. A few good heuristics based solely on the first few bytes will tell you if a file or stream is likely to be XML, and therefore worth checking further with a parser. For example, every well-formed XML document is guaranteed to begin with a less-than sign (<), optionally preceded by initial white space. In practice, the vast majority of XML documents begin in one of three ways:
- <?xml
- <!DOCTYPE
- <foo, where foo is any XML name
Character set issues make the detection a little trickier. All three of these may or may not be preceded by a Unicode byte order mark in either UTF-8, big-endian UTF-16, or little-endian UTF-16. Furthermore, any number of character sets besides Unicode can be used, including ASCII, ISO-8859-1 (Latin-1), and EBCDIC. Still, because these sets overlap a lot within the character range of the likely initial strings, you can whittle things down to just a few common byte sequences, shown here in hexadecimal:
- FE FF 00 3C 00 3F
- FF FE 3C 00 3F 00
- 3C 3F 78 6D
- EF BB BF 3C 3F
- 4C 6F A7 94
- 3C
These heuristics aren't perfect -- most notably, they do identify most malformed HTML documents
as possible XML. And you can improve on them in a few corner cases by stripping off initial
white space (tab, carriage return, linefeed, and space) before the first < (3C) or checking that
the character after the first < is ?, !, or an XML name-start character. However, in practice, any document that doesn't begin with one of the sequences above is unlikely to be XML. If you check these characters first, you can filter out a lot of the chaff and save time by parsing
only the most likely candidates.
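The byte sequences listed above, together with the white space and next-character refinements, can be sketched in Python roughly as follows. The function name `looks_like_xml` is hypothetical, and the name-start test is a deliberate approximation (ASCII letters, underscore, colon, or any non-ASCII byte) rather than the full XML production.

```python
# Signatures from the list above; the bare "<" (3C) case is handled separately.
XML_SIGNATURES = (
    bytes.fromhex("FEFF003C003F"),  # UTF-16 big-endian BOM, then "<?"
    bytes.fromhex("FFFE3C003F00"),  # UTF-16 little-endian BOM, then "<?"
    bytes.fromhex("EFBBBF3C3F"),    # UTF-8 BOM, then "<?"
    bytes.fromhex("3C3F786D"),      # "<?xm" in ASCII-compatible encodings
    bytes.fromhex("4C6FA794"),      # "<?xm" in EBCDIC
)

WHITESPACE = b"\t\r\n "  # tab, carriage return, linefeed, space


def looks_like_xml(head: bytes) -> bool:
    """Guess from the first few bytes whether the data is worth parsing as XML."""
    if head.startswith(XML_SIGNATURES):
        return True
    stripped = head.lstrip(WHITESPACE)
    if len(stripped) < 2 or not stripped.startswith(b"<"):
        return False
    following = stripped[1:2]
    if following in (b"?", b"!", b"_", b":"):
        return True
    # ASCII letter, or a non-ASCII byte that may begin a name-start character.
    return following.isalpha() or following[0] >= 0x80


print(looks_like_xml(b'<?xml version="1.0"?><root/>'))
print(looks_like_xml(b'{"not": "xml"}'))
```

In use, you would read a short prefix of the file or stream (a few dozen bytes is plenty), pass it to the function, and hand only the candidates that pass to a real parser. As with the heuristics themselves, a positive result means "possibly XML", not "well-formed XML".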
Another way to determine which files contain XML is simply to remember where you put them. However, even if this method is good enough for your own applications, you might well come across other applications that need to access the same data but don't have detailed knowledge of your personal file naming conventions. When you follow (or at least do not gratuitously deviate from) the standard conventions for file names and MIME media types, your documents are more accessible to everyone, and you noticeably enhance XML's ability to interchange data across heterogeneous systems.
- Read "RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types," which describes the basic structure of MIME types, including the type/subtype division and the use of the x- prefix for unregistered types.
- Visit the public MIME type registry, which is maintained by the IANA and lists all registered XML types and all other types.
- Check out "RFC 3023, XML Media Types," which describes the basic set of MIME media types for XML documents and lays out the system by which new types are chosen.
- Find out why text is an inappropriate media type for XML documents in Architecture of the World Wide Web, Volume One.
- Twenty years ago, Apple Computer invented a better way of identifying file types that didn't make the file name serve double duty. In essence, they stored an extra four-letter code with each file in its resource fork. Apple tried to abandon this scheme in Mac OS X, but reversed course after a massive developer revolt. Apple now supports both type codes and file name extensions. "Finder Interface," Chapter 7 of Inside Macintosh: Macintosh Toolbox Essentials also describes this scheme.
- Be, Inc. may be defunct, but you can explore the BeOS in the Haiku project -- MIME-based file system and all.
- Read the XML 1.1 Bible by Elliotte Rusty Harold. It's also available on Amazon.com.

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.