
Managing XML data: Identify XML documents

File extensions and MIME types

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.

Summary:  The name of an XML file does not have to end in .xml. In fact, an XML document doesn’t have to be in a file at all. It can be a database record, a piece of a file, a transitory stream of bytes in memory that’s never written to disk, or a combination of several different files. However, many XML documents do reside on hard disks and other fixed media. When they do, it’s useful to be able to identify them quickly. This article summarizes the common file extensions and MIME media types that are used for XML documents. Sometimes, it’s just easier to go with the flow than to invent new conventions.


Date:  29 Apr 2005
Level:  Intermediate

Three-letter extensions have been used to identify file types since at least the late 1960s and are still used today. Some operating systems use four letters or two or even one instead of three, but the basic convention remains filename-period-extension. When files are moved between heterogeneous systems, the name and the extension are often the only metadata that move with them.

If you store XML documents in a file system, use a standard file extension. Doing so makes it much easier for everyone to find, recognize, and process XML files. By far, the most common extension is .xml, but numerous others are used for specific subsets of XML, as Table 1 shows.


Table 1. Common XML file extensions
Extension   Meaning
.xml        Generic XML document
.ent        Parsed entity, document fragment
.dtd        Document Type Definition (DTD)
.rdf        Resource Description Framework (RDF) XML syntax
.atom       Atom syndication feed
.owl        Web Ontology Language
.xhtml      Extensible Hypertext Markup Language (XHTML)
.xsd        W3C XML Schema Language schema
.xsl        XSL Transformations
.fo         XSL Formatting Objects
.rng        RELAX NG XML syntax
.sch        Schematron schema
.svg        Scalable Vector Graphics (SVG)
.rss        RSS (Really Simple Syndication, Rich Site Summary, or RDF Site Summary -- depending on who defines the acronym) syndication feed
.plist      Apple’s property list format

Resources served by a Web server might not be in a file system. Even so, if the resources are XML documents, try to make sure that their URLs end with one of the extensions above, as appropriate to their specific type.
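As a rough illustration, the following Python sketch checks whether a file name or URL ends in one of the extensions from Table 1. The names XML_EXTENSIONS and has_xml_extension are invented for this example, and the check says nothing about whether the content really is XML.

```python
import posixpath
from urllib.parse import urlparse

# Extensions taken from Table 1. The set name is an assumption of this sketch.
XML_EXTENSIONS = {
    ".xml", ".ent", ".dtd", ".rdf", ".atom", ".owl", ".xhtml", ".xsd",
    ".xsl", ".fo", ".rng", ".sch", ".svg", ".rss", ".plist",
}


def has_xml_extension(name_or_url: str) -> bool:
    """Return True if the path part of a file name or URL ends in a known XML extension."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(name_or_url).path
    # splitext returns (root, extension); compare case-insensitively.
    extension = posixpath.splitext(path)[1].lower()
    return extension in XML_EXTENSIONS
```

For example, has_xml_extension("http://example.com/feeds/news.atom?page=2") is true, while has_xml_extension("index.html") is false.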


MIME media types

When a Web server transmits a file, it doesn’t just send the file name and contents. It also sends a lot of metadata about the file in the HTTP header, as shown in Listing 1:


Listing 1. Sample metadata
HTTP/1.1 200 OK
Date: Sun, 23 Jan 2005 18:21:33 GMT
Server: Apache/2.0.52 (Unix) mod_ssl/2.0.52 OpenSSL/0.9.7d
Last-Modified: Sun, 10 Oct 2004 16:17:21 GMT
ETag: "3e06d-16a05-2dbc8640"
Accept-Ranges: bytes
Content-Length: 92677
Content-Type: application/xhtml+xml

Notice the Content-Type header in the last line. The value of this header -- in this case application/xhtml+xml -- is a MIME media type (optionally followed by a charset parameter that names the document’s character set). Web browsers and other clients use this metadata to decide how to process the file -- for example, to determine whether they can display it natively or have to pass it to a helper application. MIME types are also used in other contexts, including e-mail, and by a few experimental operating systems, notably BeOS. Linux and other UNIX® systems also use MIME types, but mostly by mapping file extensions to MIME types rather than by tagging files with MIME types directly. The main practical use of MIME types is on the Internet.
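To make the header concrete, here is a minimal Python sketch that splits a Content-Type value into its media type and character set using the standard library's email.message.Message class. The function name parse_content_type is an assumption of this example, not part of any API.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header value into (media_type, charset).

    charset is None when the header carries no charset parameter.
    """
    message = Message()
    message["Content-Type"] = header_value
    # get_content_type() returns the lowercased type/subtype pair;
    # get_content_charset() returns the lowercased charset parameter, if any.
    return message.get_content_type(), message.get_content_charset()
```

Given the header from Listing 1, parse_content_type("application/xhtml+xml") yields the media type application/xhtml+xml and no character set.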

text/xsl

The most infamous example of just making up MIME types out of whole cloth is the text/xsl pseudotype that Microsoft® Internet Explorer uses to identify XSLT stylesheets. The type doesn’t exist outside Microsoft’s imagination. No such type has ever been registered with the IANA, nor is it ever likely to be registered because the XSL specifications now follow the lead of RFC 3023 and recommend application/xslt+xml as the MIME type. Unfortunately, a lot of other software and documentation have simply parroted Microsoft’s error (and make no mistake -- it is an error: text is rarely an appropriate media type for XML documents of any kind, XSL or otherwise) rather than checking the specs to find out what they really say.

The basic content type for generic XML documents is application/xml. The type text/xml is also registered, but this type has been deprecated because of some unfortunate interactions with other parts of HTTP. (When text/xml is sent without an explicit charset parameter, it indicates that the document is encoded in ASCII, even if the document’s XML declaration states otherwise.) Other basic registered MIME types are:

  • application/xml-dtd for DTDs
  • application/xml-external-parsed-entity for document fragments

For more specific XML format types, the convention is to use the type application/foo+xml, where "foo" refers to the specific XML vocabulary -- for example, application/rdf+xml for RDF, application/xhtml+xml for XHTML, and so forth. The top-level type does not have to be application; SVG, for instance, uses image/svg+xml. In this way, generic XML processors are able to recognize the document as XML while still allowing processors for the specific format to recognize it as well. Table 2 lists some of the media types you might encounter.
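A generic processor could apply this convention with a check along the following lines. This is only a sketch: the function name is_xml_media_type is hypothetical, and the test covers just the two generic types plus the +xml suffix.

```python
# The two generic XML media types named in the text.
GENERIC_XML_TYPES = {"application/xml", "text/xml"}


def is_xml_media_type(media_type: str) -> bool:
    """Return True if a media type is generic XML or follows the +xml convention."""
    # Drop any parameters (such as charset) and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in GENERIC_XML_TYPES or essence.endswith("+xml")
```

With this check, application/rdf+xml and image/svg+xml are both recognized as XML, whereas the unregistered text/xsl is not.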


Table 2. XML MIME media types
Media type                 Document format
image/svg+xml*             SVG
application/atom+xml*      Atom Feed Syndication Format
application/mathml+xml*    Mathematical Markup Language
application/beep+xml       Blocks Extensible Exchange Protocol
application/cpl+xml        Call Processing Language
application/soap+xml       A SOAP message
application/epp+xml        Extensible Provisioning Protocol
application/rdf+xml        RDF XML syntax
application/xhtml+xml      XHTML
application/xop+xml        XML-binary Optimized Packaging
application/xslt+xml*      XSLT stylesheet
application/xmpp+xml       Extensible Messaging and Presence Protocol
application/voicexml+xml*  VoiceXML

*Registration in progress

You can't just pick a new MIME media type out of the air for every new format you create. You must publish new types in a formal specification (often an IETF Request for Comments) and register them with the Internet Assigned Numbers Authority (IANA). However, you can designate experimental subtypes without registration. These subtypes must begin with x-. For example, if I needed a custom type for the television listing markup language I invented for an example in my book, the XML 1.1 Bible, I could call it application/x-tvml+xml. The application type tells processors to treat this file as non-ASCII data. The +xml at the end of the subtype tells the processors that it’s XML, the x- warns them that this is an unregistered type, and the tvml tells them what kind of data it is.
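The anatomy of such a name can be sketched in a few lines of Python. The MediaTypeParts tuple and the split_media_type function are invented for this illustration, and the code assumes a well-formed type/subtype string with no parameters.

```python
from typing import NamedTuple


class MediaTypeParts(NamedTuple):
    top_level: str      # for example "application"
    vocabulary: str     # for example "tvml"
    experimental: bool  # True if the subtype starts with "x-"
    is_xml: bool        # True if the subtype ends with "+xml"


def split_media_type(media_type: str) -> MediaTypeParts:
    """Break a media type such as application/x-tvml+xml into its parts."""
    top_level, _, subtype = media_type.strip().lower().partition("/")
    is_xml = subtype.endswith("+xml")
    if is_xml:
        subtype = subtype[: -len("+xml")]
    experimental = subtype.startswith("x-")
    if experimental:
        subtype = subtype[len("x-"):]
    return MediaTypeParts(top_level, subtype, experimental, is_xml)
```

Applied to application/x-tvml+xml, this reports the top-level type application, the vocabulary tvml, and both the experimental and XML flags set.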


Heuristics

The final way you can identify an XML file is to open it and look. This approach isn’t the fastest, and perhaps I shouldn’t even discuss it in this series because it’s completely inappropriate for large collections of XML documents. However, sometimes, it’s the only truly reliable way to tell whether a file or stream contains XML. While you might simply throw the file/stream into a parser and hope for the best, that’s a relatively heavyweight solution. A few good heuristics based solely on the first few bytes will tell you if a file or stream is likely to be XML, and therefore worth checking further with a parser. For example, every well-formed XML document is guaranteed to begin with a less-than sign (<), optionally preceded by initial white space. In practice, the vast majority of XML documents begin in one of three ways:

  • <?xml
  • <!DOCTYPE
  • <foo, where foo is any XML name

Character set issues make the detection a little trickier. Any of these three may or may not be preceded by a Unicode byte order mark in UTF-8, big-endian UTF-16, or little-endian UTF-16. Furthermore, any number of character sets besides Unicode can be used, including ASCII, ISO-8859-1 (Latin-1), and EBCDIC. Still, because these sets overlap a lot within the character range of the likely initial strings, you can whittle things down to just a few common byte sequences, shown here in hexadecimal:

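  • FE FF 00 3C 00 3F (big-endian UTF-16 byte order mark, then <?)
  • FF FE 3C 00 3F 00 (little-endian UTF-16 byte order mark, then <?)
  • 3C 3F 78 6D (<?xm in ASCII, Latin-1, or UTF-8 without a byte order mark)
  • EF BB BF 3C 3F (UTF-8 byte order mark, then <?)
  • 4C 6F A7 94 (<?xm in EBCDIC)
  • 3C (a bare < in any ASCII-compatible encoding)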

These heuristics aren’t perfect -- most notably, they do identify most malformed HTML documents as possible XML. And you can improve on them in a few corner cases by stripping off initial white space (tab, carriage return, linefeed, and space) before the first < (3C) or checking that the character after the first < is ?, !, or an XML name-start character. However, in practice, any document that doesn’t begin with one of the sequences above is unlikely to be XML. If you check these characters first, you can filter out a lot of the chaff and save time by parsing only the most likely candidates.
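The byte sequences above translate into a short prefix check. The following Python sketch is one possible reading of the heuristic, with the white-space refinement included; the names XML_PREFIXES and looks_like_xml are assumptions of this example, and a positive result only means the data is worth handing to a real parser.

```python
# Byte prefixes from the list above, written in hexadecimal.
XML_PREFIXES = (
    bytes.fromhex("FEFF003C003F"),  # big-endian UTF-16 BOM, then "<?"
    bytes.fromhex("FFFE3C003F00"),  # little-endian UTF-16 BOM, then "<?"
    bytes.fromhex("3C3F786D"),      # "<?xm" in ASCII-compatible encodings
    bytes.fromhex("EFBBBF3C3F"),    # UTF-8 BOM, then "<?"
    bytes.fromhex("4C6FA794"),      # "<?xm" in EBCDIC
    bytes.fromhex("3C"),            # a bare "<"
)

# Tab, carriage return, linefeed, and space.
XML_WHITESPACE = b"\t\r\n "


def looks_like_xml(head: bytes) -> bool:
    """Guess from the first few bytes whether a file or stream might be XML."""
    if head.startswith(XML_PREFIXES):
        return True
    # Refinement from the text: tolerate white space before the first "<".
    return head.lstrip(XML_WHITESPACE).startswith(b"<")
```

To use it on a file, read only the first few bytes in binary mode, for example open(path, "rb").read(16), and pass them to looks_like_xml. As the text warns, HTML that starts with a tag passes this test too.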


Summary

Another way to determine which files contain XML is simply to remember where you put them. However, even if this method is good enough for your own applications, you might well come across other applications that need to access the same data but don’t have detailed knowledge of your personal file naming conventions. When you follow (or at least do not gratuitously deviate from) the standard conventions for file names and MIME media types, your documents are more accessible to everyone, and you noticeably enhance XML’s ability to interchange data across heterogeneous systems.


Resources

  • Read "RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types," which describes the basic structure of MIME types, including the type/subtype division and the use of the x- prefix for unregistered types.

  • Visit the public MIME type registry, which is maintained by the IANA and lists all registered XML types and all other types.

  • Check out "RFC 3023, XML Media Types," which describes the basic set of MIME media types for XML documents and lays out the system by which new types are chosen.

  • Find out why text is an inappropriate media type for XML documents in Architecture of the World Wide Web, Volume One.

  • Twenty years ago, Apple Computer invented a better way of identifying file types that didn’t make the file name serve double duty. In essence, they stored an extra four-letter code with each file in its resource fork. Apple tried to abandon this scheme in Mac OS X, but reversed course after a massive developer revolt. Apple now supports both type codes and file name extensions. "Finder Interface," Chapter 7 of Inside Macintosh: Macintosh Toolbox Essentials also describes this scheme.

  • Be, Inc. may be defunct, but you can explore the BeOS in the Haiku project -- MIME-based file system and all.

  • Read the XML 1.1 Bible by Elliotte Rusty Harold. It's also available on Amazon.com.

  • Find out more about DB2®, the IBM software solution for information management. At its core is a powerful family of relational database management system (RDBMS) servers.

  • Find hundreds more XML resources on the developerWorks XML zone.

  • Learn how you can become an IBM Certified Developer in XML and related technologies.

