Level: Intermediate
Dennis Sosnoski (dms@sosnoski.com), President, Sosnoski Software Solutions, Inc.
01 Jan 2003
Enterprise Java expert Dennis Sosnoski checks out the speed and memory usage of several frameworks for XML data binding in Java. These include all the code generation approaches discussed in Part 1, the Castor mapped binding approach discussed in an earlier article, and a surprise new entry in the race. If you're working with XML in your Java applications you'll want to learn how these data binding approaches stack up!
Part 1 provides background on why you'd want to use data binding for XML,
along with an overview of the available Java frameworks for data binding. If
you haven't already read Part 1, you'll probably want to at least glance over it now. In this part
I'm going straight to the issue of performance without further discussion
of the whys and hows!
Performance tests
For performance tests of the data binding frameworks, I generated documents
containing mock airline flight timetable information. These are based
on the same structure I defined in the earlier article on mapped
data binding with Castor (see Resources). Here's a sample of that structure, herein referred
to as the compact format because it uses mainly attributes for data:
Listing 1. Compact document format
<?xml version="1.0"?>
<timetable>
<carrier ident="AR" rating="9">
<URL>http://www.arcticairlines.com</URL>
<name>Arctic Airlines</name>
</carrier>
<carrier ident="CA" rating="7">
<URL>http://www.combinedlines.com</URL>
<name>Combined Airlines</name>
</carrier>
<airport ident="SEA">
<location>Seattle, WA</location>
<name>Seattle-Tacoma International
Airport</name>
</airport>
<airport ident="LAX">
<location>Los Angeles, CA</location>
<name>Los Angeles International
Airport</name>
</airport>
<route from="SEA" to="LAX">
<flight carrier="AR" depart="6:23a"
arrive="8:42a" number="426"/>
<flight carrier="CA" depart="8:10a"
arrive="10:52a" number="833"/>
<flight carrier="AR" depart="9:00a"
arrive="11:36a" number="433"/>
</route>
<route from="LAX" to="SEA">
<flight carrier="CA" depart="7:45a"
arrive="10:20a" number="311"/>
<flight carrier="AR" depart="9:27a"
arrive="12:04p" number="593"/>
<flight carrier="AR" depart="12:30p"
arrive="3:07p" number="102"/>
</route>
</timetable>
Note that each airport name in Listing 1 would normally appear on a single line. To accommodate the column width, some lines are split and appear on two lines.
In addition to the compact format, I also tried a variation with more use of child
elements for data values (only staying with attributes for IDs and IDREFs).
Here's the same data presented in that format, which I refer to here as
the full format:
Listing 2. Full document format
<?xml version="1.0"?>
<timetable>
<carrier ident="AR">
<rating>9</rating>
<URL>http://www.arcticairlines.com</URL>
<name>Arctic Airlines</name>
</carrier>
<carrier ident="CA">
<rating>7</rating>
<URL>http://www.combinedlines.com</URL>
<name>Combined Airlines</name>
</carrier>
<airport ident="SEA">
<location>Seattle, WA</location>
<name>Seattle-Tacoma International Airport</name>
</airport>
<airport ident="LAX">
<location>Los Angeles, CA</location>
<name>Los Angeles International Airport</name>
</airport>
<route from="SEA" to="LAX">
<flight carrier="AR">
<number>426</number>
<depart>6:23a</depart>
<arrive>8:42a</arrive>
</flight>
<flight carrier="CA">
<number>833</number>
<depart>8:10a</depart>
<arrive>10:52a</arrive>
</flight>
<flight carrier="AR">
<number>433</number>
<depart>9:00a</depart>
<arrive>11:36a</arrive>
</flight>
</route>
<route from="LAX" to="SEA">
<flight carrier="CA">
<number>311</number>
<depart>7:45a</depart>
<arrive>10:20a</arrive>
</flight>
<flight carrier="AR">
<number>593</number>
<depart>9:27a</depart>
<arrive>12:04p</arrive>
</flight>
<flight carrier="AR">
<number>102</number>
<depart>12:30p</depart>
<arrive>3:07p</arrive>
</flight>
</route>
</timetable>
Often, the relative performance of XML frameworks differs greatly depending on the size of documents being used, so I included both large and small documents in these performance tests. The large documents
(time-comp.xml and time-full.xml) use identical data values in
the two different formats shown above. Because of this, the sizes are considerably different
(106 KB for the compact format versus 211 KB for the full format). The small
documents are in collections, each containing 34 documents ranging in size
from 1.4-3.3 KB for the compact format (ttcomp) and 2.2-5.8 KB for
the full format (ttfull).
As with the large documents, corresponding documents in the small document
collections contain the same data values. The full set of documents used in
the tests is available from the Downloads page (see Resources).
Data binding dictionary
Here's a mini-dictionary of some terms I'll use in this article:
Marshalling is the process of generating an XML representation for an
object in memory. As with Java object serialization, the representation needs to
include all dependent objects: objects referenced by our main object, objects
referenced by those objects, and so on.
Unmarshalling is the reverse process of marshalling, building an object
(and potentially a graph of linked objects) in memory from an XML representation.
Mapping is the set of rules used for explicit marshalling and
unmarshalling of objects to and from XML documents. Data binding approaches that use
code generation (based on a DTD or W3C XML Schema description of documents) normally
have implicit mappings built into the constructed objects, so in this article the term
mapping is only used for approaches that associate user-defined Java objects with
XML documents.
I would prefer to test with more document variations than just the two formats
used for these results. However, the amount of effort involved in adding more documents
for a data binding test is substantial because of the need to provide
W3C XML Schema (Schema) and Document Type Definition (DTD) descriptions
for code generation, along with mapping files and
base classes for the mapped versions. The two formats used here, with
both large and small document variations, should at least give a fairly
representative picture of how the data binding alternatives perform for typical
business documents. They probably allow the mapped binding approaches to show
better memory usage than would be typical of general documents, though, because
most of the data values in these documents can be converted to primitive types.
This results in a very compact internal representation. With documents where most
of the data values need to be kept as Strings, the memory advantage of the mapped
binding approaches would be diminished.
All test results were obtained using a 1.4GHz Athlon system with 256MB of DDR RAM,
running RedHat Linux 7.2. I used Sun's JDK 1.4.1 for Linux in all tests.
The specific versions of each data binding framework tested are as follows:
JAXB Beta 1, Castor 0.9.4.1, JBind 1.0 Beta 12/07, Quick 4.3.1, and Zeus Beta
3.5 (JiBX is a special case -- see So what's JiBX? following the test results for
details). All tests except JBind and JiBX used the Piccolo SAX2 parser, version 1.0.3.
This is the fastest SAX2 parser I'm aware of, and generally meets or beats the speed
of the XMLPull parser used for the JiBX tests (XPP3 version 1.1.2). JBind was unable
to work with the Piccolo parser, so for testing JBind I used Xerces Java 2, version
2.2.0.
To provide a performance comparison between data binding and alternative approaches, I also
ran a timing test of the same files using just the SAX2 parser, and timing
and memory tests using the dom4j document model (a performance leader among the
document models, and one that allows different SAX2 parsers to be used for parsing
input documents). For these tests, I used dom4j version 1.3.
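For readers who want to reproduce a parser-only baseline, a minimal sketch using the standard JAXP SAX API might look like the following. The class name SaxBaseline and the element-counting handler are assumptions made for illustration (the baseline code itself is not shown in this article), and the sketch uses whatever parser JAXP selects by default rather than Piccolo.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses a document held in memory and does the
// least possible work in the handler, so the time is mostly parser time.
class SaxBaseline {

    /** Parses the buffered document and returns the number of start tags seen. */
    static int countElements(byte[] document) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // no DTD validation, parsing only
            SAXParser parser = factory.newSAXParser();

            final int[] count = {0};
            parser.parse(new ByteArrayInputStream(document), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++; // trivial work so the events are consumed
                }
            });
            return count[0];
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Counting elements is only there to give the handler something to do; any equally cheap handler would serve as a base time.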
I used the same basic framework for these timing and memory usage tests as
in my earlier tests with document models (see the author's document model performance
article in Resources). This benchmark framework first reads all documents
into internal memory buffers, then times multiple passes of input and output operations on the documents.
The test results shown in Input timings and Output timings are the best times over
several passes. This should be
representative of long-term performance in a server-type environment where the
same code is executed repeatedly.
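The timing scheme described above (documents preloaded into memory, several passes, best pass reported) can be sketched roughly as follows. This is not the benchmark code used for these results. The class and method names are hypothetical, and it relies on System.nanoTime and functional interfaces, which were not available in the JDK 1.4.1 used for the tests.

```java
import java.util.function.Consumer;

// Hypothetical best-of-N timing loop over documents already held in memory.
class BestTimeBenchmark {

    /**
     * Applies the operation to every buffered document, repeats that for the
     * given number of passes, and returns the fastest pass in nanoseconds.
     */
    static long bestPassNanos(byte[][] documents, int passes, Consumer<byte[]> operation) {
        if (passes < 1) {
            throw new IllegalArgumentException("need at least one pass");
        }
        long best = Long.MAX_VALUE;
        for (int pass = 0; pass < passes; pass++) {
            long start = System.nanoTime();
            for (byte[] document : documents) {
                operation.accept(document); // e.g. unmarshal from the buffer
            }
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed); // keep the fastest pass only
        }
        return best;
    }
}
```

Taking the best pass rather than the average discards the early passes, where classloading and JIT compilation dominate, which matches the stated goal of approximating long-running server behavior.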
Input timings
Figures 1 and 2 give the timing results for reading an XML document (unmarshalling
it, in data binding terms) and
constructing an in-memory representation using the dom4j document model and
various data binding approaches. In these charts you can regard the first timing
value, for SAX2, as
a base time for parsing the documents. The document models and data binding
implementations use the parse results to build their representations in memory,
so they're never going to be faster than the parser itself. The two data binding
tests based on mappings, rather than code generation, are noted in the captions.
Figure 1. Reading large documents to memory
Figure 2. Reading small documents to memory
dom4j is able to construct its in-memory representation
of the documents in less than twice the amount of time taken by the parser alone.
The only data binding framework that beats this performance is JiBX. JAXB, Quick,
and Zeus all turn in respectable performance figures compared to dom4j, but take
nearly twice as long as JiBX overall. Castor is very slow by
comparison, both with mapped bindings and with generated code.
JBind performs a full order of magnitude slower than most of the binding frameworks
in these tests. A small part of this poor performance is due to the slower parser used
for the JBind tests (because it failed to work with the parser used for the other
tests). A larger part is probably due to JBind forcing document validation against the
Schema on input, which can add considerable overhead. Most of the poor performance is
probably attributable to the JBind framework itself, though, which uses a very indirect
approach to binding (building on top of a DOM document model, in the current
implementation).
All the tests except for JBind were run without full validation. Most of the data
binding frameworks include a certain inherent level of validation (assuring, for instance, that the
content model of elements is matched) just by their design. Most can
also use validating parsers (such as Xerces Java 2) for full checking of documents on
input, and some (including JAXB) can perform full validation of bound data in memory.
Since the main concern in these tests was performance, I disabled optional validation
wherever possible (including using both property file and unmarshaller/marshaller
settings in Castor).
Output timings
Figures 3 and 4 give the timing results for generating the XML text
serialization (marshalling it, in data binding terms) of an in-memory
representation using dom4j and various
data binding approaches. These charts use the same vertical scale as the
previous pair to simplify comparisons, but differ in that there's no
equivalent to the SAX2 parser figure.
Figure 3. Writing large documents from memory
Figure 4. Writing small documents from memory
dom4j offers better performance than any of the data binding approaches
in this area, beating JiBX by a smidgen and Zeus by not much more. The other
data binding frameworks take about twice as long, with Quick the slowest of
all (no pun intended, of course). There's not nearly as much variation here
as in the input tests, though the fact that dom4j does better than any of the
data binding frameworks suggests that they all still have room for improvement.
Memory usage
Figures 5 and 6 show the other part of the performance story, looking at
memory usage. Running out of memory can be a problem when using very large
documents (generally in the 5+ MB range) with document models. How do
the data binding approaches compare in the amount of memory used for the
document representation?
Figure 5. Large document memory usage
Figure 6. Small document memory usage
The differences here are much larger than in the time performance comparisons,
and show a very different pattern. While dom4j performed well in the
time measurements, in terms of memory usage it's much worse than any of the
data binding frameworks (except for JBind, which builds on an internal document
model equivalent to dom4j's representation). Compared to the best performers in
this area, dom4j takes more than 10 times the memory to represent the same data.
The two mapped binding approaches use the same internal structure for the bound
data, so they show identical memory usage. This gives them a tie for first place
in the memory efficiency arena, turning in a performance several times better
than the data binding approaches using generated code. This is partially
because the mapped binding uses a very compact representation for data values.
The mapped binding converts most of them to int values in these tests (a String with even one or two characters will take up 20 bytes or more in most Java Virtual Machines (JVM), versus only 4 bytes
for an int). The overhead of this conversion adds to read
and write times, but it does have other benefits beyond just the memory size
reduction. For actually working with the data, ints are far
more convenient and efficient than Strings.
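To make the String-to-int tradeoff concrete, here is a small sketch of one way a mapped binding could hold the depart and arrive values from the sample documents as ints. The encoding (minutes past midnight) and the TimeOfDayCodec class are assumptions for illustration only; the article does not describe the actual conversion used in the tests.

```java
// Hypothetical conversion between the "6:23a" style text used in the
// sample documents and a compact int (minutes past midnight).
class TimeOfDayCodec {

    /** Converts text such as "6:23a" or "12:04p" to minutes past midnight. */
    static int toMinutes(String text) {
        int colon = text.indexOf(':');
        int last = text.length() - 1;
        if (colon < 1 || last <= colon) {
            throw new IllegalArgumentException("bad time: " + text);
        }
        char suffix = text.charAt(last);
        if (suffix != 'a' && suffix != 'p') {
            throw new IllegalArgumentException("bad time: " + text);
        }
        int hour = Integer.parseInt(text.substring(0, colon)) % 12; // 12 wraps to 0
        int minute = Integer.parseInt(text.substring(colon + 1, last));
        return hour * 60 + minute + (suffix == 'p' ? 720 : 0);
    }

    /** Converts minutes past midnight back to the document text form. */
    static String toText(int minutes) {
        int hour = (minutes / 60) % 12;
        if (hour == 0) {
            hour = 12; // midnight and noon print as 12
        }
        int minute = minutes % 60;
        char suffix = minutes >= 720 ? 'p' : 'a';
        return hour + ":" + (minute < 10 ? "0" : "") + minute + suffix;
    }
}
```

With values held this way, comparing two times or computing a flight duration is plain integer arithmetic, at the cost of the conversion work on input and output.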
Besides the more extensive use of primitive values in the mapped bindings,
another reason for the greater memory efficiency of this approach is that
generated code approaches usually add control information to the actual data
present in each bound object. This control information pads the size
of the objects, reducing one of the main benefits of data binding.
The data binding frameworks using generated code consume at least several
times the memory of the mapped bindings in these tests, but (with the
exception of JBind) are still much smaller than dom4j's document model
representation. This is no surprise -- a document model such as dom4j needs to construct
objects to represent every component of the document (including the actual
data text, along with structure components such as elements and attributes),
while the data bindings only need to hold the actual data. Much of
that data is still stored as Strings with the generated code bindings, but
some values can be converted to ints and others to object references.
Zeus is the only data binding approach considered here that directly stores
all data as Strings, which contributes to giving it the largest memory usage
of the general data bindings. JBind's memory usage is still larger, by far.
This is partially due to its internal use of a document model, but the amount
of memory used by JBind is several times larger than that needed by a document
model (such as dom4j) alone. Judging from this memory usage, it looks like JBind
creates many additional objects to link between the binding facade and the actual
data in the document model.
Startup time
Figures 1 through 6 illustrate how the data binding frameworks perform
in extended test runs that are representative of server environments. I thought it
would also be interesting to see how these frameworks compare when used in
a single-execution environment, such as where an application just uses the
data binding code to read or write a configuration file. Figure 7 shows the
results.
Figure 7. Startup time
Figure 7 shows the amount of time from when the benchmark program starts executing until a round-trip operation on a single short document returns (unmarshalling to objects, then marshalling the objects back out to a document).
The difference from the previous timing figures is that here most
of the time is spent in classloading and native code generation by the JVM for the data
binding framework code. By comparing these results with the earlier timing charts,
you can see that this startup time is generally several times larger than the actual processing time for even
a fairly large document. If you're only working with a few documents per
execution of your program, this startup time is going to be a more significant
factor than the best case times shown earlier.
The size of the jar files used by the data binding framework is one major
influence on this startup time. JiBX is the smallest, with a total size of less than
60KB for the runtime and parser. JAXB, Castor, and JBind are the largest, weighing
in at roughly 1MB each. The time is also affected by the initialization required for
each framework. In the case of Castor with a mapped binding this includes
processing the mapping definition file, and for JBind it includes processing the
Schema definition for the document.
So what's JiBX?
Now that I've shown the performance results, I should probably say something about
the framework that came in at the head of the pack in almost every test. Well, the
fact is that it's a ringer -- JiBX is a data binding framework designed for
performance, so if it's meeting its design requirements it should be the
top performer in these tests.
JiBX actually originated from this series of articles. When I began looking at
the available data binding frameworks I was surprised to see that
they didn't perform all that well compared with document models such as
dom4j. This was contrary to my expectations, since the data binding approach
actually reduces the amount of document information kept in memory
-- a document model holds on to
everything, while a data binding only needs the actual data. I thought that
an approach that works with less data should generally be faster than one that works with
more.
In looking at how the existing data binding frameworks operate, I saw two aspects
that didn't look good from a performance standpoint. The first was extensive use of
reflection in many of the frameworks. Reflection is a way of accessing information
about a Java language class at runtime. It can be used to access fields and methods
in instances of a class, giving a way of dynamically hooking together classes at
runtime without the need for any source code links between the classes. Reflection
is a very powerful Java Technology feature, but suffers a performance disadvantage
when compared to calling a method or accessing a field directly in compiled code.
The second aspect I questioned was the
use of a SAX2 parser for unmarshalling documents. SAX2 is a very useful
standard for parsing XML, but its event driven approach is not well suited
to data binding and similar applications. The problem here is that the code processing
the SAX2 events needs to maintain state information for everything it processes, and
this adds both complexity and overhead.
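The difference between reflective and direct access can be seen in a few lines of code. The sketch below is illustrative only: FlightBean and AccessStyles are hypothetical names, not classes from any of the frameworks tested. Both methods return the same value, but the reflective version has to look up the field by name and go through access checks at runtime.

```java
import java.lang.reflect.Field;

// Hypothetical bound class with one private data value.
class FlightBean {
    private int number;

    FlightBean(int number) {
        this.number = number;
    }

    int getNumber() {
        return number;
    }
}

class AccessStyles {

    /** Direct access: an ordinary method call resolved at compile time. */
    static int direct(FlightBean flight) {
        return flight.getNumber();
    }

    /** Reflective access: the field is located by name at runtime. */
    static int reflective(FlightBean flight) {
        try {
            Field field = FlightBean.class.getDeclaredField("number");
            field.setAccessible(true); // bypass the private modifier
            return field.getInt(flight);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflection failed", e);
        }
    }
}
```

A framework that uses reflection would normally cache the Field object rather than look it up on every access, but each read or write still goes through the reflection layer instead of compiled code.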
I created the code that grew into JiBX to test some ways around these
problematic aspects of the other data binding frameworks, and to experiment with extending
the mapped binding approach beyond what's supported by Castor. Instead of reflection, JiBX
uses byte code enhancement to add hooks into application code at project
build time. Instead of SAX2, JiBX is based on a pull parser architecture (currently
XMLPull). Rather than generating code from a DTD or Schema, JiBX works with a binding
definition that associates user-supplied classes with XML structure.
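To show why a pull parser suits this kind of code, here is a hand-written sketch that reads one carrier element from Listing 1 into a user-supplied class. It is not JiBX code, and it uses the JDK's StAX API (javax.xml.stream) as a stand-in for XMLPull, since StAX follows the same pull model and ships with the JDK. The class names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical user-supplied class; rating is held as an int.
class CarrierInfo {
    String ident;
    int rating;
    String url;
    String name;
}

class PullCarrierReader {

    /** Reads a compact-format <carrier> element using pull parsing. */
    static CarrierInfo read(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));

            // The code asks for the next item, so the current position in the
            // document is implied by the position in this method.
            in.nextTag();
            in.require(XMLStreamConstants.START_ELEMENT, null, "carrier");
            CarrierInfo carrier = new CarrierInfo();
            carrier.ident = in.getAttributeValue(null, "ident");
            carrier.rating = Integer.parseInt(in.getAttributeValue(null, "rating"));

            in.nextTag();
            in.require(XMLStreamConstants.START_ELEMENT, null, "URL");
            carrier.url = in.getElementText();

            in.nextTag();
            in.require(XMLStreamConstants.START_ELEMENT, null, "name");
            carrier.name = in.getElementText();

            in.nextTag();
            in.require(XMLStreamConstants.END_ELEMENT, null, "carrier");
            in.close();
            return carrier;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("unmarshalling failed", e);
        }
    }
}
```

The equivalent SAX2 handler would need fields to remember which element it is inside and to accumulate character data between callbacks; here that state lives in the ordinary control flow of the method.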
These techniques are not unique to JiBX. Byte code enhancement is used
by many JDO (Java Data Objects) implementations for basically the same purpose as in JiBX (to add
access hooks to existing compiled code). The original JAXB code (since discarded) was based on a pull
parser architecture similar to XMLPull. The mapped approach to data
binding is supported (although with some limitations) by both Castor and Quick. Even
though the individual techniques aren't new, the combination of them still
makes for a very interesting alternative to the other data binding frameworks.
I'll give a full rundown on JiBX in Part 3 of this article. JiBX is still at an
early development stage. For the performance tests, I hand wrote the code that would
normally be added through byte code enhancement and ran it using the then-current version
of the JiBX runtime. As of this article going to publication, I'm still wrapping up the
enhancement code, and there are a number of other features I'd love to see added. If
you can't wait until Part 3 to find out more about JiBX, check Resources for a link to the JiBX site. You can even start contributing to the future development of
JiBX, as well as making use of JiBX in your own applications.
Conclusions
This look at data binding performance shows some interesting results, but doesn't
fundamentally change the recommendations from Part 1. Castor provides the
best current support for data binding using code generation from W3C XML
Schema definitions. Its unmarshalling performance is weak compared to other
alternatives, but it does give good memory utilization and a fairly fast startup
time. The Castor developers say that they plan to focus on performance issues
prior to their 1.0 release, so you may also see some improvement in the unmarshalling
performance by then.
JAXB still looks like a good choice for the code generation approach in the
future (the beta license only allows evaluation use). The current
reference implementation beta is both bulky in terms of jar size and somewhat
inefficient in terms of memory usage, but here again you may see better performance in the
future. As of this writing, the current version is still a beta, and even after it's released, commercial
or open source projects may improve performance over the reference implementation.
Since it will be a standard part of the J2EE platform, JAXB is definitely going to
play an important role in working with XML and Java technologies.
The performance results also confirm the use of JBind, Quick, and Zeus as most appropriate
for applications with special requirements rather than for general usage. JBind's XML Code
approach can provide a great basis for an application built around processing of an XML
document, but the performance of the current implementation is liable to be a problem. Quick
and Zeus offer code generation from DTDs, but as I mentioned in Part 1, it's generally pretty
easy to convert DTDs to Schemas. On the downside, Quick seems overly complex to use and
Zeus supports only Strings for bound data values (no primitives or
object references using ID-IDREF or an equivalent).
For mapped approaches to data binding, Castor has the advantage of a fairly stable
implementation and substantial real-world usage. Quick can be used for this type of
binding as well, but again seems complex to set up. JiBX is new and not yet in full
usage, but offers excellent performance along with a high degree of flexibility.
If you haven't read Part 1, you may want to refer back to it to learn more about
the features of these data binding frameworks. Part 1 also discusses the tradeoffs between
code generation and mapped approaches to data binding. In Part 3, I'll present the
new JiBX framework in more depth. This includes how JiBX maps Java objects to XML,
along with the byte code enhancement process JiBX uses at build time to minimize
runtime overhead. Check back for full details on this exciting approach to pumping
up framework performance!
Resources
- Part 1 of this series on data binding provides background on why you'd want to use data binding for XML, along with an overview of the available Java frameworks for data binding (developerWorks, January 2003).
- Download the full set of documents used in the tests for this article.
- If you need background on XML, try the developerWorks "Introduction to XML" tutorial (August 2002).
- Review the author's previous developerWorks articles covering performance (September 2001) and usage (February 2002) comparisons for Java XML document models.
- Read Brett McLaughlin's overview of Quick in "Converting between Java objects and XML with Quick," which shows you how to use this framework to quickly and painlessly turn your Java data into XML documents, without the class generation semantics required by other data binding frameworks (developerWorks, August 2002).
- For an introduction to the basics of object-relational data binding (similar in intent to the JDO standard, but not compatible), read "Getting started with Castor JDO," by Bruce Snyder (developerWorks, August 2002).
- Get the details on the Java Data Objects (JDO) API for persistence of Java language objects.
- Find out more about the Java Architecture for XML Binding (JAXB), the evolving standard for Java Platform data binding.
- Take a closer look at the Castor framework, which supports both mapped and generated bindings.
- Get to know JBind, a framework that focuses less on allowing Java language applications to easily work with XML, and more on building application code frameworks around XML.
- The Quick framework is based on a series of development efforts that predate both the Java Platform and XML. It provides an extremely flexible framework for working with XML on the Java Platform.
- Explore the details of Zeus, which (like Quick) generates code based on DTD descriptions of XML documents but is simpler to use -- and more limited -- than Quick.
- Learn more about the new JiBX framework for mapped bindings.
About the author
Dennis Sosnoski (dms@sosnoski.com) is the founder and lead consultant of Seattle-area Java consulting company Sosnoski Software Solutions, Inc., specialists in J2EE, XML, and Web services support. Dennis's professional software development
experience spans over 30 years, with the last several years focused on server-side Java technologies. He's a frequent speaker on XML in Java and J2EE technologies at conferences nationwide, and chairs the Seattle Java-XML SIG.