Output large XML documents, Part 1

A survey of existing XML output options

One of the most common problems in the XML domain is outputting large documents. While the process of reading in XML is fairly well understood, there is little in the way of best practices for writing it out. When the output is fairly small, say fewer than 1,000 records, this is not a significant problem; developers use APIs like DOM and JDOM, or write the XML as raw characters through I/O streams. However, as the datasets being output grow to hold thousands, or even tens of thousands, of members, these solutions begin to break down. This tip examines these problems, surveys the available alternatives, and lays out a plan for covering XML output in depth over the rest of the series.

Output alternatives

You have several alternatives for outputting XML. Before looking into a solution for output, it's worth detailing some of the solutions that you shouldn't use. Here are the most common ways to output XML:

  1. SAX
  2. DOM
  3. JAXP
  4. Another in-memory API, like JDOM or dom4j
  5. Raw I/O streams

I'll look at each in turn before laying out a solution that I'll examine through the next several tips in this series.

SAX

The first option, SAX, is really a non-option. I've included it in the list because most developers getting started with XML hear about SAX and how quick it is for XML processing. While SAX is traditionally considered the fastest and slimmest API for XML, it does not have the ability to output XML (or anything else, for that matter). In fact, if you examine the SAX package (org.xml.sax), you won't find a single output method. It is designed from the ground up to read XML, rather than write it.

Note: It is possible to modify incoming XML by using an XMLFilter. (I'll talk a lot more about filters later in this tip and in future tips.) However, this is still not outputting XML. It's also possible to use raw I/O streams within SAX callbacks to output XML -- but that's really just a variant of option 5 in the list above, so I'll deal with it in Raw I/O streams.
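
To make the read-only nature of SAX concrete, here is a minimal sketch of a SAX handler. The class name SaxElementCounter and the sample document are hypothetical, and the parser is obtained through JAXP's SAXParserFactory purely for convenience. The point is that the parser pushes events into callbacks; nothing in the callback interface writes XML back out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example class: counts elements as SAX reports them. */
final class SaxElementCounter extends DefaultHandler {

    private int elementCount;

    // SAX calls this once per start tag. The handler only receives
    // events; it has no way to ask SAX to emit XML.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        elementCount++;
    }

    /** Parses the given XML text and returns the number of elements seen. */
    static int countElements(String xml) {
        SaxElementCounter handler = new SaxElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.elementCount;
    }

    public static void main(String[] args) {
        // Made-up sample data: one root element and two records.
        String xml = "<records><record id=\"1\">first</record>"
                   + "<record id=\"2\">second</record></records>";
        System.out.println(countElements(xml));
    }
}
```

Because each event is handled and then discarded, memory use stays flat no matter how large the input is, which is exactly the property you'd like to have on the output side.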

DOM

The Document Object Model, DOM, is by far the most commonly used API for XML output. DOM is an in-memory model of XML, meaning that it stores each element, attribute, character fragment, and XML construct in memory. You can read an XML document or stream into a DOM tree, or build a tree from scratch. It's equally easy to write out a DOM tree, and most parser software packages offer utility classes to do just this. For example, Apache Xerces comes with several samples, including dom.Writer, which takes in a DOM Node and prints out the XML representation of that Node. Listing 1 shows a portion of that code, which handles the bulk of the printing logic.

Listing 1. Printing a DOM tree
    public void write(Node node) {

        // is there anything to do?
        if (node == null) {
            return;
        }

        short type = node.getNodeType();
        switch (type) {
            case Node.DOCUMENT_NODE: {
                Document document = (Document)node;
                if (!fCanonical) {
                    fOut.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                    fOut.flush();
                    write(document.getDoctype());
                }
                write(document.getDocumentElement());
                break;
            }

            case Node.DOCUMENT_TYPE_NODE: {
                DocumentType doctype = (DocumentType)node;
                fOut.print("<!DOCTYPE ");
                fOut.print(doctype.getName());
                String publicId = doctype.getPublicId();
                String systemId = doctype.getSystemId();
                if (publicId != null) {
                    fOut.print(" PUBLIC '");
                    fOut.print(publicId);
                    fOut.print("' '");
                    fOut.print(systemId);
                    fOut.print('\'');
                }
                else {
                    fOut.print(" SYSTEM '");
                    fOut.print(systemId);
                    fOut.print('\'');
                }
                String internalSubset = doctype.getInternalSubset();
                if (internalSubset != null) {
                    fOut.println(" [");
                    fOut.print(internalSubset);
                    fOut.print(']');
                }
                fOut.println('>');
                break;
            }

            case Node.ELEMENT_NODE: {
                fOut.print('<');
                fOut.print(node.getNodeName());
                Attr attrs[] = sortAttributes(node.getAttributes());
                for (int i = 0; i < attrs.length; i++) {
                    Attr attr = attrs[i];
                    fOut.print(' ');
                    fOut.print(attr.getNodeName());
                    fOut.print("=\"");
                    normalizeAndPrint(attr.getNodeValue());
                    fOut.print('"');
                }
                fOut.print('>');
                fOut.flush();

                Node child = node.getFirstChild();
                while (child != null) {
                    write(child);
                    child = child.getNextSibling();
                }
                break;
            }

            case Node.ENTITY_REFERENCE_NODE: {
                if (fCanonical) {
                    Node child = node.getFirstChild();
                    while (child != null) {
                        write(child);
                        child = child.getNextSibling();
                    }
                }
                else {
                    fOut.print('&');
                    fOut.print(node.getNodeName());
                    fOut.print(';');
                    fOut.flush();
                }
                break;
            }

            case Node.TEXT_NODE: {
                normalizeAndPrint(node.getNodeValue());
                fOut.flush();
                break;
            }

            case Node.PROCESSING_INSTRUCTION_NODE: {
                fOut.print("<?");
                fOut.print(node.getNodeName());
                String data = node.getNodeValue();
                if (data != null && data.length() > 0) {
                    fOut.print(' ');
                    fOut.print(data);
                }
                fOut.println("?>");
                fOut.flush();
                break;
            }
        }

        if (type == Node.ELEMENT_NODE) {
            fOut.print("</");
            fOut.print(node.getNodeName());
            fOut.print('>');
            fOut.flush();
        }
    }

I won't go into any real detail about this code, but notice that every node in the document is visited, and every one of them must already exist in memory. The printing code is not where things go wrong: if the document is too large for the memory available to the JVM, you'll get out-of-memory errors while the DOM tree is being built, well before this method is ever called. Also consider that most data requires two, three, or even more nodes: each element is one node, the text within that element is another node, and there is one additional node per attribute. So a document with 10,000 individual pieces of data could actually have to store 20,000, 30,000, or even upwards of 50,000 individual nodes to represent that data. Because memory use grows with the size of the document, DOM eventually stops being able to hold the data at all, let alone output it to a file.
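
The node arithmetic above can be checked with a small sketch. The class DomNodeCount and the record layout (one element, one id attribute, one text node per record) are assumptions made for illustration; the code measures only how many nodes the tree holds, not how much heap each node costs in a particular DOM implementation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical example class: builds a DOM tree and counts its nodes. */
final class DomNodeCount {

    /** Builds <records> with recordCount <record id="..."> children. */
    static Document buildRecords(int recordCount) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Element root = doc.createElement("records");
        doc.appendChild(root);
        for (int i = 0; i < recordCount; i++) {
            Element record = doc.createElement("record");        // node 1
            record.setAttribute("id", Integer.toString(i));      // node 2
            record.appendChild(doc.createTextNode("value " + i)); // node 3
            root.appendChild(record);
        }
        return doc;
    }

    /** Counts the node itself, its attributes, and all descendants. */
    static int countNodes(Node node) {
        int count = 1;
        NamedNodeMap attributes = node.getAttributes(); // null unless element
        if (attributes != null) {
            count += attributes.getLength(); // each attribute counted once
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = buildRecords(10000);
        // One root element plus three nodes per record.
        System.out.println(countNodes(doc.getDocumentElement()));
    }
}
```

With this layout, every record costs three nodes, so 10,000 records produce 30,001 nodes including the root element. Adding a second attribute or a nested element per record raises the multiplier accordingly.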

JAXP

The Java API for XML Processing (JAXP) is another red herring, so to speak. JAXP is not itself an API for parsing; it is merely a wrapper API for adding a convenience layer on top of SAX and DOM. Therefore, it is these underlying APIs that control the behavior of JAXP. In other words, using SAX through JAXP has the same non-write problems as SAX alone does, and using DOM through JAXP has the same memory-consumption problems that DOM alone does. JAXP doesn't provide you any real option that I haven't already discussed.
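
As a rough illustration of the wrapper relationship, the sketch below asks the JAXP factories for a SAX reader and a DOM document. The class name JaxpWrapperDemo is hypothetical. Notice that what comes back are the ordinary org.xml.sax and org.w3c.dom types, so everything said about those APIs above still applies.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical example class: shows what the JAXP factories hand back. */
final class JaxpWrapperDemo {

    /** JAXP locates a parser, but the result is a plain SAX XMLReader. */
    static XMLReader newSaxReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    /** JAXP locates a builder, but the result is a plain DOM Document. */
    static Document newDomDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        // The interfaces live in the SAX and DOM packages, not in JAXP.
        System.out.println(XMLReader.class.getName());
        System.out.println(Document.class.getName());
        // The concrete classes depend on which parser JAXP finds.
        System.out.println(newSaxReader().getClass().getName());
        System.out.println(newDomDocument().getClass().getName());
    }
}
```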

In-memory models

Several APIs in addition to DOM and SAX have become popular in recent years. Two notable examples of these are JDOM and dom4j. There are others, but all are generally similar in terms of how they operate: They use various types of in-memory models. While many are not as memory-heavy as most DOM implementations, they still hold at least some data in memory for every piece of data in the XML tree. This means that you eventually run into the same problem that you do with DOM: running out of memory. You may get more mileage out of these APIs, but ultimately your hardware is going to limit your ability to load and write data.

Raw I/O streams

The final option is to use raw I/O streams. In Java code, for example, you could use java.io.OutputStream or java.io.Writer to spit out characters that happen to form well-formed XML. For example, Listing 1 has several statements in which XML characters are written directly to an output stream. While this is a viable option in that it doesn't require all your XML to be represented in memory, it has a lot of problems of its own. First, using raw streams means that you have to be very fastidious about escaping characters like ampersands, angle brackets, apostrophes, and quotation marks. This often makes for some very ugly output, and creates a lot of room for error. Additionally, you can't work in terms of elements, sub-elements, and attributes; you have to keep track of the structure of the tree yourself, opening and closing every tag in the right order. This creates even more room for error. In other words, I/O streams are an option, but not a very good one.
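
Here is a minimal sketch of what the raw-stream approach looks like in practice. The class RawXmlWriter, its method names, and the record layout are all hypothetical. The escape method covers only the five predefined XML entities; it makes no attempt to handle characters that are illegal in XML, which is one more detail you would have to manage by hand.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical example class: hand-written XML output over a Writer. */
final class RawXmlWriter {

    /** Replaces the five characters that have predefined XML entities. */
    static String escape(String text) {
        StringBuilder escaped = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  escaped.append("&amp;");  break;
                case '<':  escaped.append("&lt;");   break;
                case '>':  escaped.append("&gt;");   break;
                case '"':  escaped.append("&quot;"); break;
                case '\'': escaped.append("&apos;"); break;
                default:   escaped.append(c);
            }
        }
        return escaped.toString();
    }

    /** Writes one record; the caller must remember to escape every value. */
    static void writeRecord(Writer out, String id, String value)
            throws IOException {
        out.write("<record id=\"");
        out.write(escape(id));
        out.write("\">");
        out.write(escape(value));
        out.write("</record>\n");
    }

    public static void main(String[] args) throws IOException {
        Writer out = new StringWriter();
        // Nothing checks that this open tag is ever closed.
        out.write("<records>\n");
        writeRecord(out, "1", "Tom & \"Jerry\" <1>");
        writeRecord(out, "2", "it's fine");
        out.write("</records>\n");
        System.out.print(out);
    }
}
```

Only one record is in flight at a time, so memory is not a concern. The cost is that every tag is a string literal: forget one escape call or one closing tag and the output is silently malformed.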

What to do?

So, while we haven't looked at a solution yet, you should be pretty clear on how poor the options are for handling large XML datasets and outputting that data without incurring a tremendous amount of work and error-checking. In this series of tips, I'll present an entirely different option, one based on some SAX extensions that allow both filtering and output. Once you make it through the next five tips, you'll be handling large XML documents with ease, all without taxing your memory resources one bit.

While you're waiting on the next tip, you should find application code of your own that uses one of the options detailed above. Think about profiling that code, and see how much of a memory hog it is. In other words, prepare to compare your old code with some new techniques that I'll lay out in the next several weeks. And until the next tip, I'll see you online!


Published: March 26, 2003