DOM, SAX, and when each is appropriate
SAX analyzes an XML stream as it goes by, much like an old ticker tape. Consider the following XML code snippet:
<?xml version="1.0"?>
<samples>
  <server>UNIX</server>
  <monitor>color</monitor>
</samples>
A SAX processor analyzing this code snippet would generate, in general, the following events:
- Start document
- Start element (samples)
- Characters (white space)
- Start element (server)
- Characters (UNIX)
- End element (server)
- Characters (white space)
- Start element (monitor)
- Characters (color)
- End element (monitor)
- Characters (white space)
- End element (samples)
- End document
The SAX API allows a developer to capture these events and act on them.
SAX processing involves the following steps:
- Create an event handler.
- Create the SAX parser.
- Assign the event handler to the parser.
- Parse the document, sending each event to the handler.
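The four steps above can be sketched with Python's standard xml.sax module. This is a minimal illustration, not the only way to structure a handler: the class name EventLogger and the wording of the recorded strings are assumptions made for this example. Note that a parser is allowed to deliver a run of text in more than one characters event, so the number of white-space events may vary between parsers.

```python
import io
import xml.sax

XML = b"""<?xml version="1.0"?>
<samples>
  <server>UNIX</server>
  <monitor>color</monitor>
</samples>"""


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX event as a readable string."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("Start document")

    def endDocument(self):
        self.events.append("End document")

    def startElement(self, name, attrs):
        self.events.append(f"Start element ({name})")

    def endElement(self, name):
        self.events.append(f"End element ({name})")

    def characters(self, content):
        # Label whitespace-only text the same way the event list above does.
        label = content if content.strip() else "white space"
        self.events.append(f"Characters ({label})")


handler = EventLogger()             # 1. Create an event handler.
parser = xml.sax.make_parser()      # 2. Create the SAX parser.
parser.setContentHandler(handler)   # 3. Assign the event handler to the parser.
parser.parse(io.BytesIO(XML))       # 4. Parse, sending each event to the handler.

for event in handler.events:
    print(event)
```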
The advantages of this kind of processing are much like the advantages of streaming media. Analysis can get started immediately, rather than waiting for all of the data to be processed. Also, because the application is simply examining the data as it goes by, it doesn't need to store it in memory. This is a huge advantage when it comes to large documents. In fact, an application doesn't even have to parse the entire document; it can stop when certain criteria have been satisfied. In general, SAX is also much faster than the alternative, the DOM.
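As a rough sketch of stopping early, a handler can raise an exception once it has what it needs, which aborts the parse before the rest of the document is examined. The names Found, FirstServerFinder, and first_server are hypothetical, and raising an exception is just one common way to stop a SAX parse in Python.

```python
import xml.sax

XML = b"<samples><server>UNIX</server><monitor>color</monitor></samples>"


class Found(Exception):
    """Raised by the handler to abort parsing once the value is known."""


class FirstServerFinder(xml.sax.ContentHandler):
    """Hypothetical handler that captures the first <server> value and stops."""

    def __init__(self):
        super().__init__()
        self.in_server = False
        self.parts = []
        self.elements_seen = 0

    def startElement(self, name, attrs):
        self.elements_seen += 1
        if name == "server":
            self.in_server = True

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self.in_server:
            self.parts.append(content)

    def endElement(self, name):
        if name == "server":
            raise Found("".join(self.parts))


def first_server(xml_bytes):
    """Return (value, elements_seen); value is None if no <server> exists."""
    handler = FirstServerFinder()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except Found as found:
        return found.args[0], handler.elements_seen
    return None, handler.elements_seen


value, seen = first_server(XML)
print(value, seen)
```

In this sketch the handler never sees the monitor element, because parsing ends as soon as the server element closes.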
On the other hand, because the application is not storing the data in any way, SAX cannot modify the data in place or move backwards in the data stream.
The DOM is the traditional way of handling XML data. With DOM, the data is loaded into memory in a tree-like structure.
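For instance, the same document used as an example in How SAX processing works would be represented as a tree of nodes. In the original figure, the rectangular boxes represent element nodes, and the ovals represent text nodes:

samples (element)
  white space (text)
  server (element)
    UNIX (text)
  white space (text)
  monitor (element)
    color (text)
  white space (text)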
DOM uses parent-child relationships. For instance, in this case samples is the root element with five children: three text nodes (the whitespace), and the two element nodes, server and monitor.
One important thing to realize is that the server and monitor nodes actually have values of null. Instead, they have text nodes (UNIX and color) as children.
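The parent-child structure can be inspected with Python's standard xml.dom.minidom module, used here purely as an illustration. In Python the DOM's null appears as None, and text nodes report the node name #text.

```python
from xml.dom import minidom

XML = """<?xml version="1.0"?>
<samples>
  <server>UNIX</server>
  <monitor>color</monitor>
</samples>"""

doc = minidom.parseString(XML)
root = doc.documentElement            # the samples element

# Five children: white space, server, white space, monitor, white space.
child_names = [node.nodeName for node in root.childNodes]
print(child_names)

server = root.getElementsByTagName("server")[0]
print(server.nodeValue)               # None: the element itself has no value
print(server.firstChild.nodeValue)    # the child text node holds "UNIX"
```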
DOM, and by extension tree-based processing, has several advantages. First, because the tree is persistent in memory, it can be modified so an application can make changes to the data and the structure. It can also work its way up and down the tree at any time, as opposed to the one-shot deal of SAX. DOM can also be much simpler to use.
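A short sketch of these advantages, again using xml.dom.minidom: the tree is changed in place, a new element is added, and the code then walks up and sideways from the new node. The printer element and its laser value are invented for this example and are not part of the sample document.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<samples><server>UNIX</server><monitor>color</monitor></samples>"
)
root = doc.documentElement

# Change existing data: the text lives in the child text node.
server = root.getElementsByTagName("server")[0]
server.firstChild.data = "Linux"

# Change the structure: add a hypothetical <printer> element.
printer = doc.createElement("printer")
printer.appendChild(doc.createTextNode("laser"))
root.appendChild(printer)

# Move around the tree at will: up to the parent, back to a sibling.
parent_name = printer.parentNode.nodeName
previous_name = printer.previousSibling.nodeName

result = root.toxml()
print(result)
```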
On the other hand, a lot of overhead is involved in building these trees in memory. It's not unusual for large files to completely overrun a system's capacity. In addition, creating a DOM tree can be a very slow process.
Whether you choose DOM or SAX is going to depend on several factors:
- Purpose of the application: If you are going to have to make changes to the data and output it as XML, then in most cases, DOM is the way to go. It's not that you can't make changes using SAX, but the process is much more complex, as you'd have to make changes to a copy of the data rather than to the data itself.
- Amount of data: For large files, SAX is a better bet.
- How the data will be used: If only a small amount of the data will actually be used, you may be better off using SAX to extract it into your application. On the other hand, if you know that you will need to refer back to large amounts of information that has already been processed, SAX is probably not the right choice.
- The need for speed: SAX implementations are normally faster than DOM implementations.
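To illustrate the point about changing data with SAX, the sketch below rewrites the document as it streams by, producing a modified copy rather than editing the original. It subclasses XMLGenerator from Python's xml.sax.saxutils; the class name ServerRewriter and the replacement value Linux are assumptions for this example, and the handler assumes server contains only text, as in the sample.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

XML = b"<samples><server>UNIX</server><monitor>color</monitor></samples>"


class ServerRewriter(XMLGenerator):
    """Hypothetical handler that copies events to output, replacing <server> text."""

    def __init__(self, out, new_value):
        super().__init__(out, encoding="utf-8")
        self._new_value = new_value
        self._in_server = False

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        self._in_server = (name == "server")

    def characters(self, content):
        # Drop the original text of <server>; copy everything else unchanged.
        if not self._in_server:
            super().characters(content)

    def endElement(self, name):
        if name == "server":
            # Emit the replacement text just before the closing tag.
            super().characters(self._new_value)
            self._in_server = False
        super().endElement(name)


out = io.StringIO()
xml.sax.parseString(XML, ServerRewriter(out, "Linux"))
copy = out.getvalue()
print(copy)
```

Even for a one-value change, the handler has to track where it is in the document, which is the extra complexity the first point above refers to.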
It's important to remember that SAX and DOM are not mutually exclusive. You can use DOM to create a stream of SAX events, and you can use SAX to create a DOM tree. In fact, many parsers that create DOM trees use SAX internally to do it.
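As a minimal sketch of using SAX to create a DOM tree, the handler below turns each event into a minidom node, keeping a stack of open elements so that new nodes attach to the right parent. The class name DomBuilder is hypothetical, and this ignores namespaces, comments, and processing instructions that a production builder would need to handle.

```python
import xml.sax
from xml.dom import minidom

XML = b"<samples><server>UNIX</server><monitor>color</monitor></samples>"


class DomBuilder(xml.sax.ContentHandler):
    """Hypothetical handler that builds a minidom tree from SAX events."""

    def __init__(self):
        super().__init__()
        self.document = minidom.Document()
        self._stack = [self.document]   # innermost open node is last

    def startElement(self, name, attrs):
        element = self.document.createElement(name)
        for key, value in attrs.items():
            element.setAttribute(key, value)
        self._stack[-1].appendChild(element)
        self._stack.append(element)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        text = self.document.createTextNode(content)
        self._stack[-1].appendChild(text)


builder = DomBuilder()
xml.sax.parseString(XML, builder)

root = builder.document.documentElement
rebuilt = root.toxml()
print(rebuilt)
```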