My JiBX XML data binding framework is a fast and flexible tool for translating Java objects to and from XML documents. Most frameworks for XML data binding take the approach of generating Java classes from XML schemas, with framework code to implement the binding built into the generated classes. JiBX instead uses classworking techniques to enhance compiled Java class files with added methods to implement the bindings. This approach allows JiBX to work with both existing classes and generated classes, and also gives the benefits of very fast operation with a relatively small runtime.
JiBX data binding implements more complex code generation than most other frameworks using bytecode enhancement. In the course of developing JiBX, I've had to deal with a number of challenges to make this type of complex code generation workable. In this article, I'm going to summarize some of those challenges and the solutions I found in the course of getting JiBX to the 1.0 production release. I'll start with a look at the JiBX bytecode generation architecture.
My development of JiBX has been guided by a few specific goals. The first goal was that it would support flexible binding to existing classes, rather than require the use of generated classes. The second was that it would be fast, using bytecode enhancement to add the binding code directly to the application classes (as opposed to the less invasive but slower technique of reflection). The third was that it would support not only different ways of using the same class within a single binding, but also multiple bindings to the same classes. Other goals have been added along the way to the production release, but these three initial goals have proven to be the main influences on the architecture of the framework.
JiBX is composed of two major components: the binding compiler, which handles the actual bytecode enhancement of classes; and the runtime, which is used by the generated bytecode for actual marshalling (generating XML from objects) and unmarshalling (generating objects from XML) of documents. The runtime has gradually grown over time as more options were added, but the structure of the runtime code has stayed basically the same as when I started the project. The binding compiler, on the other hand, has grown in both size and complexity, and the bytecode enhancement core has been restructured several times to add functionality and improve the quality of the code. In the 1.0 release, the binding compiler is more than four times the size of the runtime (at 228K vs. 54K) and many times the complexity. Because this column is concerned with classworking, I'm only going to discuss the binding compiler component.
I'll start off with a sample of bindings in
action. Figure 1 shows the pair of bindings I used as a JiBX example in the
last column. The two bindings define different XML formats for the same Java
classes. In the diagram, I've highlighted the differences between the two bindings
(and the corresponding differences in the two documents) using colors -- blue
for the handling of the Name class reference, green
for the properties of the Customer class providing
address information, and red for the phone property.
Figure 1. Example bindings

Figure 1 demonstrates some of the basic flexibility of the binding compiler, though only in a very limited way. Still, this pair of bindings provides a good starting point for a look at how the binding compiler does its job.
To handle marshalling and unmarshalling, JiBX adds new classes and methods to the classes included in the binding. For all classes bound to XML structures (as opposed to simple text values), JiBX creates methods to actually implement the conversion to and from XML. For top-level mapped classes in a binding (those that can be converted to and from separate documents), JiBX also adds marshalling and unmarshalling interfaces, along with the methods defined by those interfaces. Finally, for both top-level and other mapped classes JiBX generates separate support classes that provide a level of indirection, implementing interfaces to call the appropriate marshalling/unmarshalling implementation methods. This combination of methods and classes may seem convoluted, but is required to support the level of flexibility allowed by JiBX bindings.
If you compile the Java source code from Figure 1 to class files, then compile the binding definitions using the JiBX binding compiler, you'll get the set of classes and methods shown in Figure 2. In this case, there were no methods present in the original classes, so all the methods shown in the diagram were added by the JiBX binding compiler.
Figure 2. Class diagram after binding

There's no need to cover the Figure 2 information in depth here, but I'll
give a quick overview. At the highest logical level (but the middle of the
diagram), the added classes simple.JiBX_binding1Factory and simple.JiBX_binding2Factory provide runtime access to the
compiled bindings, mainly through the createMarshallingContext() and createUnmarshallingContext() methods. The JiBX runtime
provides a way (using the org.jibx.runtime.BindingDirectory class) for the user to
access the factory class for a particular binding, and once the binding factory
has been found, these methods can be used to create marshalling and unmarshalling
contexts that control the conversions between Java objects and XML
documents.
The second pair of added classes, simple.JiBX_binding1Customer_access and simple.JiBX_binding2Customer_access at the bottom of the
diagram, are the indirection
support classes I mentioned in the first paragraph of this section. These classes act as
runtime glue to associate a binding with the particular methods of a mapped
class (in this case, simple.Customer) that implement
the marshalling and unmarshalling operations. Each binding factory class
references the support classes for every class with a <mapping>
definition in that binding.
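As a rough illustration of this indirection (not the code JiBX actually generates), the sketch below models two bindings for a single mapped class. Every name in it (MarshallerSketch, CustomerSketch, Binding1CustomerAccess, BindingFactorySketch, and so on) is a hypothetical stand-in, the XML formats are invented for the example, and plain string building takes the place of the real marshalling context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the marshalling interface implemented by the
// generated support classes.
interface MarshallerSketch {
    void marshal(Object obj, StringBuilder out);
}

// Mapped class. The two binding-specific methods stand in for the
// implementation methods the binding compiler would add to the class.
class CustomerSketch {
    final String name;
    final String phone;

    CustomerSketch(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }

    void binding1Marshal(StringBuilder out) {
        out.append("<customer><name>").append(name).append("</name><phone>")
           .append(phone).append("</phone></customer>");
    }

    void binding2Marshal(StringBuilder out) {
        out.append("<customer phone=\"").append(phone).append("\"><name>")
           .append(name).append("</name></customer>");
    }
}

// Support classes: one per binding, each routing the generic interface
// call to the method that implements that particular binding.
class Binding1CustomerAccess implements MarshallerSketch {
    public void marshal(Object obj, StringBuilder out) {
        ((CustomerSketch) obj).binding1Marshal(out);
    }
}

class Binding2CustomerAccess implements MarshallerSketch {
    public void marshal(Object obj, StringBuilder out) {
        ((CustomerSketch) obj).binding2Marshal(out);
    }
}

// Binding factory: references the support class for each mapped class.
class BindingFactorySketch {
    private final Map<Class<?>, MarshallerSketch> marshallers = new HashMap<>();

    BindingFactorySketch map(Class<?> type, MarshallerSketch marshaller) {
        marshallers.put(type, marshaller);
        return this;
    }

    String marshal(Object obj) {
        MarshallerSketch m = marshallers.get(obj.getClass());
        if (m == null) {
            throw new IllegalArgumentException("No mapping for " + obj.getClass().getName());
        }
        StringBuilder out = new StringBuilder();
        m.marshal(obj, out);
        return out.toString();
    }
}
```

Because each factory holds its own support class, the same CustomerSketch instance can be written in either format without the class itself knowing which binding is in use.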
The detailed marshalling and unmarshalling implementation code gets added
directly to the bound classes. In the simple.Customer
class, these methods include a JiBX_binding1_newinstance_1_0() method (used to create a
new instance of the class), a pair of JiBX_bindingX_marshal_1_0() and JiBX_bindingX_unmarshal_1_0() methods (used for marshalling
and unmarshalling the content of an element corresponding to the class), and the
JiBX_binding2_marshalAttr_1_0() and JiBX_binding2_unmarshalAttr_1_0() methods (used for marshalling and unmarshalling attributes of
the element corresponding to the class). The Customer
class also has several methods used by the interfaces added to a top-level
mapped object. In the simple.Name class, there are just
three added methods: JiBX_binding1_newinstance_1_0(),
JiBX_binding1_marshal_1_0(), and JiBX_binding1_unmarshal_1_0().
The particular methods added to each class demonstrate one nice feature of
the binding compiler: Rather than blindly adding generic methods for every
binding to every class used in that binding, the binding compiler only creates
the methods actually needed. If a method has already been added to a class that
matches the needs for a binding, that method will be reused rather than adding a
new method. That's why each class in the Figure 2 diagram gets only one "newinstance" method, and why there's only one marshalling and one unmarshalling method for the Name class -- the handling of the data from that class is
the same in both bindings. I'll return to this point later in the article when I
discuss some of the details of the bytecode generation.
Before moving on, I'll point out that some classes cannot be handled with
direct method insertion. System classes, for instance, are not modifiable, while
user-defined interface classes can be modified, but can't include actual
implementation code. For bindings that work with interfaces or unmodifiable
classes, JiBX instead adds the necessary marshalling/unmarshalling code to a
special helper class (called the "munge" class) as static methods. This approach doesn't provide the full
flexibility possible when classes can be directly modified (for instance, you
can't use private fields of unmodifiable classes in bindings, as you can with
modifiable classes), but allows a usable level of support within the limits of
the Java language and JVM license.
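To make the distinction concrete, here is a minimal sketch of the static-helper approach applied to a system class. MungeAdapterSketch and its method are hypothetical names, the XML format is invented, and string building stands in for the real marshalling calls. The point is only that the helper receives the object as a parameter and can reach nothing beyond its public API.

```java
import java.util.Locale;

// Hypothetical helper class holding binding code for classes that cannot
// be modified. java.util.Locale is a system class, so the code lives here
// as a static method rather than as a method added to Locale itself.
class MungeAdapterSketch {
    static void marshalLocale(Locale value, StringBuilder out) {
        // Only public accessors are available from outside the class.
        out.append("<locale language=\"").append(value.getLanguage())
           .append("\" country=\"").append(value.getCountry()).append("\"/>");
    }
}
```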
To control the process of adding all the methods and classes described in the
last section, the JiBX binding compiler first creates an internal representation
of each binding in the form of a code generation tree structure. This tree
structure reflects the nesting of operations required for marshalling and
unmarshalling. Each component of the tree implements a code generation
interface, which defines methods used for different types of
code generation (for instance, genAttributeMarshal(), genContentMarshal(), genContentPresentTest(), and so on). Each of these method calls
takes the information for the method currently under construction as a
parameter. The called component appends the appropriate bytecode instructions
to the method under construction before returning from the call, allowing for
a modular approach to code generation, where each type of component serves a
particular function but the components can be combined in different ways to
match the requirements of each binding.
Figures 3 and 4 show the code generation tree structures built from the Figure 1 bindings. The upper structures of both trees are identical, but the lower levels are organized very differently, reflecting the differences in the two binding definitions.
Figure 3. Code generation model for binding1

Figure 4. Code generation model for binding2

The generic handling of code generation method calls by the components is
a sequence of three steps: Generate any needed setup
code, call the child component(s) to add their code generation, and finish by
generating any needed wrapup code. In the case of an element wrapper
component, for instance, the genContentMarshal() call
generates code to first write the element start tag, then calls the child
component to handle marshalling its content, and finishes by generating code to
write the element end tag. If the child component includes one or more
attributes, though, this sequence is complicated by the need to write the
attributes to the element start tag as a separate step.
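The setup, child, wrapup sequence can be sketched with a small composite structure. This is a simplified model under stated assumptions, not the binding compiler's code: MethodSketch, Component, ValueComponent, StructureComponent, and ElementWrapper are hypothetical names, the interface is cut down to a single genContentMarshal() call, text labels stand in for bytecode instructions, and the attribute complication is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the method under construction. The real
// binding compiler appends bytecode; this sketch appends text labels.
class MethodSketch {
    final List<String> ops = new ArrayList<>();

    void append(String op) {
        ops.add(op);
    }
}

// Hypothetical code generation interface, reduced to one call.
interface Component {
    void genContentMarshal(MethodSketch method);
}

// Leaf component: emits the code that writes one text value.
class ValueComponent implements Component {
    private final String field;

    ValueComponent(String field) {
        this.field = field;
    }

    public void genContentMarshal(MethodSketch method) {
        method.append("write text of " + field);
    }
}

// Grouping component: delegates to each child in order.
class StructureComponent implements Component {
    private final List<Component> children;

    StructureComponent(List<Component> children) {
        this.children = children;
    }

    public void genContentMarshal(MethodSketch method) {
        for (Component child : children) {
            child.genContentMarshal(method);
        }
    }
}

// Element wrapper: setup (start tag), child code, wrapup (end tag).
class ElementWrapper implements Component {
    private final String name;
    private final Component child;

    ElementWrapper(String name, Component child) {
        this.name = name;
        this.child = child;
    }

    public void genContentMarshal(MethodSketch method) {
        method.append("start tag <" + name + ">");
        child.genContentMarshal(method);
        method.append("end tag </" + name + ">");
    }
}

class CodeGenTreeDemo {
    // Builds a small tree and returns the instruction labels it generates.
    static List<String> generate() {
        Component tree = new ElementWrapper("name",
            new StructureComponent(Arrays.asList(
                new ElementWrapper("first-name", new ValueComponent("firstName")),
                new ElementWrapper("last-name", new ValueComponent("lastName")))));
        MethodSketch method = new MethodSketch();
        tree.genContentMarshal(method);
        return method.ops;
    }
}
```

Rearranging the same component types into a different tree produces a different instruction sequence, which is the modularity the model is meant to provide.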
The object binding component is what actually adds new methods to a
class included in the binding. When the genContentMarshal() method is executed, for instance, the object binding component first checks if the marshalling method has already been generated in
the class of the bound component. If not, the object binding component creates a
new marshalling method to be added to the bound component. It also generates the
code for that new method, calling the child component as part of the code
generation. Once the bound component
marshalling method has been found or generated, the object binding component
just adds the code to call that bound component method to the original method it was
passed.
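A minimal sketch of that find-or-generate step follows, assuming a simplified model in which a class is just a map from method names to bodies. EnhancedClassSketch and ObjectBindingSketch are hypothetical names, and the child component's code generation is reduced to copying a fixed list of labels.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a class being enhanced: method name -> body.
class EnhancedClassSketch {
    final String name;
    final Map<String, List<String>> methods = new LinkedHashMap<>();

    EnhancedClassSketch(String name) {
        this.name = name;
    }
}

// Hypothetical object binding component.
class ObjectBindingSketch {
    private final EnhancedClassSketch target;
    private final String methodName;
    private final List<String> childCode;

    ObjectBindingSketch(EnhancedClassSketch target, String methodName,
                        List<String> childCode) {
        this.target = target;
        this.methodName = methodName;
        this.childCode = childCode;
    }

    void genContentMarshal(List<String> callerCode) {
        if (!target.methods.containsKey(methodName)) {
            // Not generated yet: build the method body. The child
            // component's generation is reduced to a list copy here.
            target.methods.put(methodName, new ArrayList<>(childCode));
        }
        // Either way, the calling method only gets an invocation.
        callerCode.add("invoke " + target.name + "." + methodName);
    }
}
```

Calling genContentMarshal() from two different callers adds the method to the target class once and appends one invocation to each caller.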
Early versions of the binding compiler did not go through this step of constructing a code generation model, instead generating code directly from the binding. The code generation model was added to make the bytecode generation more modular and easier to maintain. Despite some problems along the way (which I'll discuss in the next section), the model approach has worked well for these purposes.
JiBX 1.0 uses the BCEL framework for bytecode manipulation. Although BCEL has worked for this purpose, it's sometimes proven more difficult to use than I'd expected when I began the project. Some of the problems I ran into are specific to BCEL, but others apply to other bytecode frameworks as well. In this section, I'll cover both types of problems and the workarounds I applied in the course of the JiBX 1.0 development.
I found several issues in the BCEL framework that made it awkward to use.
The first such issue is the separation between the APIs for reflective access
to classes (in the org.apache.bcel.classfile package)
and those for manipulating and constructing classes (in the org.apache.bcel.generic package). These parallel APIs
create the need to maintain two types of class information in many cases.
BCEL's implementation of bytecode handling also seems overly complex, with
great flexibility but a large number of objects to manipulate. These objects include an
actual org.apache.bcel.generic.ClassGen object for
a class under construction, an org.apache.bcel.generic.ConstantPoolGen object associated with the
class, an org.apache.bcel.generic.InstructionFactory object
used to create instructions for a particular class and constant pool, and an
org.apache.bcel.generic.InstructionList object for a
sequence of bytecode instructions being generated.
For the JiBX binding compiler, I found it easiest to just hide all the BCEL
details within several wrapper classes. For example, the class used to
construct a new method (org.jibx.binding.classes.MethodBuilder) handles all
the BCEL details of method construction while providing calls to append each
required type of instruction to the current list. While you lose some of the
flexibility of the BCEL InstructionList (which allows
instructions to be inserted into and deleted from a list rather than just
appended) with this approach, it works very well for the sequential bytecode generation used by the
binding compiler. Similar classes wrap BCEL information for existing classes and
methods.
Hiding the BCEL details in my wrapper also gave me convenient places to add
more functionality, as in the class and method comparison code.
This comparison code defines hashCode() and equals() methods for checking when one class or method can
be substituted for another. The binding compiler uses these comparisons both to
make sure that pre-existing binding methods it finds in classes are still
needed, and to avoid creating duplicate classes or methods.
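One way such a comparison could be structured is sketched below. It is an assumption-laden illustration rather than the JiBX implementation: MethodKey and MethodRegistry are hypothetical names, and a method is treated as substitutable for another when its signature string and raw code bytes are identical, which is a stricter and simpler test than a real comparison would likely need.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key identifying a method by signature and code bytes,
// ignoring the method name.
final class MethodKey {
    private final String signature;
    private final byte[] code;

    MethodKey(String signature, byte[] code) {
        this.signature = signature;
        this.code = code.clone();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof MethodKey)) {
            return false;
        }
        MethodKey key = (MethodKey) other;
        return signature.equals(key.signature) && Arrays.equals(code, key.code);
    }

    @Override
    public int hashCode() {
        return 31 * signature.hashCode() + Arrays.hashCode(code);
    }
}

// Hypothetical registry of the methods added to one class.
class MethodRegistry {
    private final Map<MethodKey, String> namesByContent = new HashMap<>();

    // Returns the name of the method to call: the existing one if an
    // equivalent method was already added, otherwise the proposed name.
    String addOrReuse(String proposedName, String signature, byte[] code) {
        MethodKey key = new MethodKey(signature, code);
        String existing = namesByContent.putIfAbsent(key, proposedName);
        return existing != null ? existing : proposedName;
    }

    int size() {
        return namesByContent.size();
    }
}
```

With this arrangement, a second binding that generates identical code for a class ends up calling the method the first binding already added.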
In the category of general problems with bytecode generation, I'd list one big issue: the need to make sure that the code you're generating is going to be valid according to the JVM specification. If any detail of a generated bytecode instruction sequence is wrong, the class containing that bytecode will fail verification when you attempt to load it into a JVM. This failure leads to the painfully difficult task of working backward from the class that failed verification to find the code that generated the wrong instruction.
Much of the time I've spent on JiBX development went into tracking down
invalid bytecode generation problems. Along the way, I've made several changes to
the code to make these problems easier to find. The first such change was
adding an option to run BCEL's own verification on a generated class, using the
JustIce verifier included in the org.apache.bcel.verifier package. Using JustIce validation had the advantage of
providing much more detailed information about errors than is supplied by the
JVM (including the stack state and a portion of the problem code). However, I
found that there were cases where JustIce complained about
issues that the JVM classloader ignored. To avoid spending time on this type of
problem, I added another option to try loading the modified classes directly
within the binding compiler. This second option provided a kind of smoke test
for the modified classes -- if they could be loaded successfully by the binding
compiler, users could be sure the same classes would load successfully when
needed by their application at runtime.
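A load-based smoke test of this kind can be sketched with nothing but the standard class loader API. The class names BytesClassLoader and LoadSmokeTest are hypothetical, and this is only one plausible way to structure the check, not the binding compiler's own code.

```java
// Class loader that defines a class straight from a byte array.
class BytesClassLoader extends ClassLoader {
    BytesClassLoader(ClassLoader parent) {
        super(parent);
    }

    Class<?> define(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}

class LoadSmokeTest {
    // Returns null if the class loads and links, else the error text.
    static String check(String className, byte[] bytes) {
        try {
            BytesClassLoader loader =
                new BytesClassLoader(LoadSmokeTest.class.getClassLoader());
            loader.define(className, bytes);
            // Initialization forces linking, which includes verification.
            // Note that it also runs the class's static initializers.
            Class.forName(className, true, loader);
            return null;
        } catch (LinkageError | ClassNotFoundException e) {
            // ClassFormatError and VerifyError are both LinkageErrors.
            return e.toString();
        }
    }
}
```

Passing the bytes produced by an enhancement step to check() gives a quick pass or fail answer, though the JVM's error text is typically much terser than a dedicated verifier's report.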
I also added an option to print out the code generation model tree for a
binding. The JustIce output gave a snapshot of the problem. After I disassembled
the modified bytecode of the invalid class (using BCEL's
org.apache.bcel.util.Class2HTML utility) and compared
the output with JustIce's snapshot of the problem code, I could see where
the error fit in with the larger flow. Once I understood what the code in
question was attempting to do, I could then relate the error to the code generation
model tree and isolate the model component where the generation was
going awry. Even then, I still often had to debug through the code of the
component to actually find the problem. Tracking down this type of problem
became so distasteful that I started avoiding any changes to the code generation
for fear of unknowingly breaking something (and this despite a continually
growing set of test bindings executed as part of the project build).
As Ben Franklin said, "An ounce of prevention is worth a pound of cure." As long as I was trying to backtrack from generated code to find problems, I was on the cure side of this saying. Instead, I needed to move to the prevention side. To do so, I added code in the method builder class that tracked the stack state at each point in the code generation, then threw a runtime exception if I tried to add an instruction that didn't match the stack state. In essence, I was implementing my own on-the-fly validator for the bytecode. I found the results were well worth the added effort.
Listing 1 shows an example of the error output generated from this on-the-fly validation. In this case, I introduced an error to the code generation handling a collection, adding an extra item to the stack in the body of the item unmarshalling loop. This extra item meant that the stack state at the end of the loop didn't match that at the beginning, so an error occurred when the code generation tried to append a branch back to the beginning of the loop. The error report includes the component that generated the branch instruction (the "generated by ..." line) and the two stack states (for the "to" and "from" branch locations), as well as the stack trace for the point where the error occurred.
Listing 1. Bytecode generation error reporting
java.lang.IllegalStateException: Stack size mismatch on branch
 in method simple.JiBX_MungeAdapter.JiBX_mybinding7a_unmarshal
 generated by org.jibx.binding.def.NestedCollection@1ebde03
 from stack:
  0: java.lang.Object[]
  1: java.lang.Object[]
 to stack:
  0: java.lang.Object[]
    at org.jibx.binding.classes.BranchWrapper.setTarget(BranchWrapper.java:184)
    at org.jibx.binding.classes.BranchWrapper.setTarget(BranchWrapper.java:201)
    at org.jibx.binding.def.NestedCollection.genContentUnmarshal(NestedCollection.java:172)
    at org.jibx.binding.def.ObjectBinding.genUnmarshalContentCall(ObjectBinding.java:750)
    ...
After I added the stack state tracking, bytecode problems became traceable to a particular line of code. This accountability didn't always make the fix easy, but it avoided all the earlier issues with finding the cause of a problem. Because of this greater ease of debugging, I was able to add some significant new features to the bytecode generation before the 1.0 release.
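The general idea of tracking stack state during generation can be sketched in a few dozen lines. This is a simplified model rather than the JiBX method builder: StackTrackingBuilder and its methods are hypothetical names, instructions are reduced to push, pop, and branch, and types are compared by name only, with no subtype checks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical method builder that mirrors the operand stack while
// instructions are appended, failing at the point of the bad append.
class StackTrackingBuilder {
    private final String methodName;
    private final Deque<String> stack = new ArrayDeque<>();
    private final List<String> code = new ArrayList<>();

    StackTrackingBuilder(String methodName) {
        this.methodName = methodName;
    }

    // Append an instruction that pushes one value of the given type.
    void appendPush(String type) {
        code.add("push " + type);
        stack.push(type);
    }

    // Append an instruction that pops one value of the expected type.
    void appendPop(String expectedType) {
        String actual = stack.peek();
        if (actual == null) {
            throw new IllegalStateException("Stack underflow in method " + methodName);
        }
        if (!actual.equals(expectedType)) {
            throw new IllegalStateException("Expected " + expectedType
                + " on stack but found " + actual + " in method " + methodName);
        }
        stack.pop();
        code.add("pop " + expectedType);
    }

    // Snapshot of the stack (top first) at a future branch target.
    List<String> markTarget() {
        return new ArrayList<>(stack);
    }

    // Append a branch to a marked target; both stack states must agree.
    void appendBranch(List<String> targetState) {
        List<String> current = new ArrayList<>(stack);
        if (current.size() != targetState.size()) {
            throw new IllegalStateException("Stack size mismatch on branch in method "
                + methodName + ": from " + current + " to " + targetState);
        }
        if (!current.equals(targetState)) {
            throw new IllegalStateException("Stack type mismatch on branch in method "
                + methodName + ": from " + current + " to " + targetState);
        }
        code.add("branch");
    }

    int instructionCount() {
        return code.size();
    }
}
```

Because the exception is thrown from the append call itself, its stack trace leads directly to the generator code that requested the invalid instruction, which is what makes the error traceable to a line.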
The JiBX 1.0 bytecode generation architecture makes an interesting case study for anyone looking at implementing complex bytecode transformations. Many of the lessons I learned during the development apply across different bytecode frameworks, and knowing about the pitfalls in advance can help you avoid making the same mistakes as I did along the way to the 1.0 production release of JiBX.
Next month I'm going to leave JiBX 1.0 behind and move on to the changes now in progress for 2.0. These changes include a complete rewrite of the binding compiler code generation, with an added option for using source code enhancement as an alternative to bytecode enhancement. Handling source code generation in the same framework as bytecode generation promises to create some interesting new problems. Check back next month for a look at the approach I'm taking to make this unlikely combination work together for the next generation of the JiBX framework.
Learn
- "Data binding with JAXB" (Daniel Steinberg, developerWorks, May 2003) and "Programming With XMLBeans" (Abhinav Chopra, developerWorks, September 2004): Two other frameworks for XML data binding.
- The Classworking toolkit series: Read the complete set of articles by Dennis Sosnoski.
- Java programming dynamics series: Another tour of the Java class structure, reflection, and classworking.
- The Java technology zone: Find articles about every aspect of Java programming.
Get products and technologies
- JiBX: The author's fast and flexible XML data binding framework. This Web site includes online documentation of the user API, but for the details of the internals discussed in this article, you'll need to download the distribution and run the "devdoc" Ant build target to construct the internal Javadocs in the /build/docs directory.

Dennis Sosnoski is the founder and lead consultant of Seattle-area Java technology consulting company Sosnoski Software Solutions Inc., specialists in XML and Web services training and consulting. His professional software development experience spans 30 years, with the last several years focused on server-side XML and Java technologies. He is a frequent speaker at conferences nationwide. He's also the lead developer of the open source JiBX XML Data Binding framework built around the Java classworking technology.





