Classworking toolkit: Inside JiBX code generation

Find out how JiBX implements class file enhancement for XML data binding

Dennis Sosnoski (dms@sosnoski.com), Java and XML consultant, Sosnoski Software Solutions Inc.
Dennis Sosnoski is the founder and lead consultant of Seattle-area Java technology consulting company Sosnoski Software Solutions Inc., specialists in XML and Web services training and consulting. His professional software development experience spans 30 years, with the last several years focused on server-side XML and Java technologies. He is a frequent speaker at conferences nationwide. He's also the lead developer of the open source JiBX XML Data Binding framework built around the Java classworking technology.

Summary:  The JiBX framework builds on classworking techniques for fast and flexible conversions between Java objects and XML. But generating correct and verifiable bytecode isn't always easy, and lead developer Dennis Sosnoski has gone through some painful classworking experiences along the way to the 1.0 production release. He shares his insights in this article, discussing the internal structures used for code generation and the steps he's gone through to make sure that the generated code follows JVM rules.

Date:  06 Sep 2005
Level:  Intermediate

My JiBX XML data binding framework is a fast and flexible tool for translating Java objects to and from XML documents. Most frameworks for XML data binding take the approach of generating Java classes from XML schemas, with framework code to implement the binding built into the generated classes. JiBX instead uses classworking techniques to enhance compiled Java class files with added methods to implement the bindings. This approach allows JiBX to work with both existing classes and generated classes, and also gives the benefits of very fast operation with a relatively small runtime.

JiBX data binding implements more complex code generation than most other frameworks using bytecode enhancement. In the course of developing JiBX, I've had to deal with a number of challenges to make this type of complex code generation workable. In this article, I'm going to summarize some of those challenges and the solutions I found in the course of getting JiBX to the 1.0 production release. I'll start with a look at the JiBX bytecode generation architecture.

Architecture

My development of JiBX has been guided by a few specific goals. The first goal was that it would support flexible binding to existing classes, rather than require the use of generated classes. The second was that it would be fast, using bytecode enhancement to add the binding code directly to the application classes (as opposed to the less invasive but slower technique of reflection). The third was that it would support not only different ways of using the same class within a single binding, but also multiple bindings to the same classes. Other goals have been added along the way to the production release, but these three initial goals have proven to be the main influences on the architecture of the framework.

JiBX is composed of two major components: the binding compiler, which handles the actual bytecode enhancement of classes; and the runtime, which is used by the generated bytecode for actual marshalling (generating XML from objects) and unmarshalling (generating objects from XML) of documents. The runtime has gradually grown over time as more options were added, but the structure of the runtime code has stayed basically the same as when I started the project. The binding compiler, on the other hand, has grown in both size and complexity, and the bytecode enhancement core has been restructured several times to add functionality and improve the quality of the code. In the 1.0 release, the binding compiler is more than four times the size of the runtime (at 228K vs. 54K) and many times the complexity. Because this column is concerned with classworking, I'm only going to discuss the binding compiler component.

Ask the expert: Dennis Sosnoski on JVM and bytecode issues

For comments or questions about the material covered in this article series, as well as anything else that pertains to Java bytecode, the Java binary class format, or general JVM issues, visit the JVM and Bytecode discussion forum, moderated by Dennis Sosnoski.

A binding example

I'll start off with a sample of bindings in action. Figure 1 shows the pair of bindings I used as a JiBX example in the last column. The two bindings define different XML formats for the same Java classes. In the diagram, I've highlighted the differences between the two bindings (and the corresponding differences in the two documents) using colors -- blue for the handling of the Name class reference, green for the properties of the Customer class providing address information, and red for the phone property.


Figure 1. Example bindings

Figure 1 demonstrates some of the basic flexibility of the binding compiler, though only in a very limited way. Still, this pair of bindings provides a good starting point for a look at how the binding compiler does its job.

Class file implants

To handle marshalling and unmarshalling, JiBX adds new methods to the classes included in the binding and generates new support classes alongside them. For all classes bound to XML structures (as opposed to simple text values), JiBX creates methods that implement the conversion to and from XML. For top-level mapped classes in a binding (those that can be converted to and from separate documents), JiBX also adds marshalling and unmarshalling interfaces, along with the methods defined by those interfaces. Finally, for both top-level and other mapped classes, JiBX generates separate support classes that provide a level of indirection, implementing interfaces to call the appropriate marshalling/unmarshalling implementation methods. This combination of methods and classes may seem convoluted, but it is required to support the level of flexibility allowed by JiBX bindings.

If you compile the Java source code from Figure 1 to class files, then compile the binding definitions using the JiBX binding compiler, you'll get the set of classes and methods shown in Figure 2. In this case, there were no methods present in the original classes, so all the methods shown in the diagram were added by the JiBX binding compiler.


Figure 2. Class diagram after binding

There's no need to cover the Figure 2 information in depth here, but I'll give a quick overview. At the highest logical level (but the middle of the diagram), the added classes simple.JiBX_binding1Factory and simple.JiBX_binding2Factory provide runtime access to the compiled bindings, mainly through the createMarshallingContext() and createUnmarshallingContext() methods. The JiBX runtime provides a way (using the org.jibx.runtime.BindingDirectory class) for the user to access the factory class for a particular binding, and once the binding factory has been found, these methods can be used to create marshalling and unmarshalling contexts that control the conversions between Java objects and XML documents.

The second pair of added classes, simple.JiBX_binding1Customer_access and simple.JiBX_binding2Customer_access at the bottom of the diagram, are the indirection support classes I mentioned in the first paragraph of this section. These classes act as runtime glue to associate a binding with the particular methods of a mapped class (in this case, simple.Customer) that implement the marshalling and unmarshalling operations. Each binding factory class references the support classes for every class with a <mapping> definition in that binding.

The detailed marshalling and unmarshalling implementation code gets added directly to the bound classes. In the simple.Customer class, these methods include a JiBX_binding1_newinstance_1_0() method (used to create a new instance of the class), a pair of JiBX_bindingX_marshal_1_0() and JiBX_bindingX_unmarshal_1_0() methods (used for marshalling and unmarshalling the content of an element corresponding to the class), and the JiBX_binding2_marshalAttr_1_0() and JiBX_binding2_unmarshalAttr_1_0() methods (used for marshalling and unmarshalling attributes of the element corresponding to the class). The Customer class also has several methods used by the interfaces added to a top-level mapped object. In the simple.Name class, there are just three added methods: JiBX_binding1_newinstance_1_0(), JiBX_binding1_marshal_1_0(), and JiBX_binding1_unmarshal_1_0().

The particular methods added to each class demonstrate one nice feature of the binding compiler: Rather than blindly adding generic methods for every binding to every class used in that binding, the binding compiler only creates the methods actually needed. If a method has already been added to a class that matches the needs for a binding, that method will be reused rather than adding a new method. That's why each class in the Figure 2 diagram gets only one "newinstance" method, and why there's only one marshalling and one unmarshalling method for the Name class -- the handling of the data from that class is the same in both bindings. I'll return to this point later in the article when I discuss some of the details of the bytecode generation.

Before moving on, I'll point out that some classes cannot be handled with direct method insertion. System classes, for instance, are not modifiable, while user-defined interface classes can be modified, but can't include actual implementation code. For bindings that work with interfaces or unmodifiable classes, JiBX instead adds the necessary marshalling/unmarshalling code to a special helper class (called the "munge" class) as static methods. This approach doesn't provide the full flexibility possible when classes can be directly modified (for instance, you can't use private fields of unmodifiable classes in bindings, as you can with modifiable classes), but allows a usable level of support within the limits of the Java language and JVM license.

Code generation model

To control the process of adding all the methods and classes described in the last section, the JiBX binding compiler first creates an internal representation of each binding in the form of a code generation tree structure. This tree structure reflects the nesting of operations required for marshalling and unmarshalling. Each component of the tree implements a code generation interface, which defines methods used for different types of code generation (for instance, genAttributeMarshal(), genContentMarshal(), genContentPresentTest(), and so on). Each of these method calls takes the information for the method currently under construction as a parameter. The called component appends the appropriate bytecode instructions to the method under construction before returning from the call. This allows for a modular approach to code generation, where each type of component serves a particular function but the components can be combined in different ways to match the requirements of each binding.

Figures 3 and 4 show the code generation tree structures built from the Figure 1 bindings. The upper structures of both trees are identical, but the lower levels are organized very differently, reflecting the differences in the two binding definitions.


Figure 3. Code generation model for binding1

Figure 4. Code generation model for binding2

The generic handling of code generation method calls by the components is a sequence of three steps: Generate any needed setup code, call the child component(s) to add their code generation, and finish by generating any needed wrapup code. In the case of an element wrapper component, for instance, the genContentMarshal() call generates code to first write the element start tag, then calls the child component to handle marshalling its content, and finishes by generating code to write the element end tag. If the child component includes one or more attributes, though, this sequence is complicated by the need to write the attributes to the element start tag as a separate step.
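The three-step pattern can be sketched in a few lines of plain Java. This is only an illustration, not the JiBX source: the names MethodSketch, GenComponent, ValueSketch, and ElementSketch are hypothetical, and the "instructions" are text strings rather than real bytecode. Only the genContentMarshal() name comes from the interface described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a method under construction; records pseudo-instructions as text. */
final class MethodSketch {
    final List<String> code = new ArrayList<>();

    void append(String instruction) {
        code.add(instruction);
    }
}

/** Hypothetical code generation interface, reduced to a single generation method. */
interface GenComponent {
    void genContentMarshal(MethodSketch mb);
}

/** Leaf component: appends the code that writes one property as text. */
final class ValueSketch implements GenComponent {
    private final String property;

    ValueSketch(String property) {
        this.property = property;
    }

    @Override
    public void genContentMarshal(MethodSketch mb) {
        mb.append("write text of " + property);
    }
}

/** Element wrapper: setup code, then the child's code, then wrapup code. */
final class ElementSketch implements GenComponent {
    private final String name;
    private final GenComponent child;

    ElementSketch(String name, GenComponent child) {
        this.name = name;
        this.child = child;
    }

    @Override
    public void genContentMarshal(MethodSketch mb) {
        mb.append("write start tag <" + name + ">");  // setup
        child.genContentMarshal(mb);                   // child appends its own code
        mb.append("write end tag </" + name + ">");   // wrapup
    }
}

final class GenTreeDemo {
    public static void main(String[] args) {
        GenComponent tree =
            new ElementSketch("customer", new ElementSketch("phone", new ValueSketch("phone")));
        MethodSketch mb = new MethodSketch();
        tree.genContentMarshal(mb);
        mb.code.forEach(System.out::println);
    }
}
```

Because every component receives the same method under construction, wrappers can be nested to any depth without knowing anything about their children. The sketch leaves out the attribute case mentioned above, which needs a separate generation call before the start tag is closed.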

The object binding component is what actually adds new methods to a class included in the binding. When the genContentMarshal() method is executed, for instance, the object binding component first checks if the marshalling method has already been generated in the class of the bound component. If not, the object binding component creates a new marshalling method to be added to the bound component. It also generates the code for that new method, calling the child component as part of the code generation. Once the bound component marshalling method has been found or generated, the object binding component just adds the code to call that bound component method to the original method it was passed.
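A rough sketch of this find-or-generate behavior follows. The classes BoundClassSketch and ObjectBindingSketch are hypothetical, the method bodies are lists of text pseudo-instructions, and the lookup is by method name only, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for a class being enhanced: method name -> pseudo-instructions. */
final class BoundClassSketch {
    final String name;
    final Map<String, List<String>> methods = new LinkedHashMap<>();

    BoundClassSketch(String name) {
        this.name = name;
    }
}

/** Sketch of an object binding component that adds a method once, then calls it. */
final class ObjectBindingSketch {
    private final BoundClassSketch target;
    private final String methodName;
    private final Consumer<List<String>> child;  // stands in for the child component

    ObjectBindingSketch(BoundClassSketch target, String methodName,
                        Consumer<List<String>> child) {
        this.target = target;
        this.methodName = methodName;
        this.child = child;
    }

    void genContentMarshal(List<String> caller) {
        if (!target.methods.containsKey(methodName)) {
            // Not generated yet: build the body by delegating to the child component.
            List<String> body = new ArrayList<>();
            child.accept(body);
            body.add("return");
            target.methods.put(methodName, body);
        }
        // In either case the caller only gets a call to the bound class method.
        caller.add("invoke " + target.name + "." + methodName);
    }
}

final class ObjectBindingDemo {
    public static void main(String[] args) {
        BoundClassSketch nameClass = new BoundClassSketch("simple.Name");
        ObjectBindingSketch binding = new ObjectBindingSketch(
            nameClass, "JiBX_binding1_marshal_1_0", body -> body.add("write name content"));
        List<String> firstCaller = new ArrayList<>();
        List<String> secondCaller = new ArrayList<>();
        binding.genContentMarshal(firstCaller);
        binding.genContentMarshal(secondCaller);
        System.out.println(nameClass.methods.keySet());
        System.out.println(firstCaller);
        System.out.println(secondCaller);
    }
}
```

Both callers end up with the same invoke instruction, while the bound class gains only one method.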

Early versions of the binding compiler did not go through this step of constructing a code generation model, instead generating code directly from the binding. The code generation model was added to make the bytecode generation more modular and easier to maintain. Despite some problems along the way (which I'll discuss in the next section), the model approach has worked well for these purposes.


Digging into bytecode

JiBX 1.0 uses the BCEL framework for bytecode manipulation. Although BCEL has worked for this purpose, it's sometimes proven more difficult to use than I'd expected when I began the project. Some of the problems I ran into are specific to BCEL, but others apply to other bytecode frameworks as well. In this section, I'll cover both types of problems and the workarounds I applied in the course of the JiBX 1.0 development.

BCEL quirks

I found several issues in the BCEL framework that made it awkward to use. The first such issue is the separation between the APIs for reflective access to classes (in the org.apache.bcel.classfile package) and those for manipulating and constructing classes (in the org.apache.bcel.generic package). These parallel APIs create the need to maintain two types of class information in many cases.

BCEL's implementation of bytecode handling also seems overly complex, with great flexibility but a large number of objects to manipulate. These objects include an org.apache.bcel.generic.ClassGen object for a class under construction, an org.apache.bcel.generic.ConstantPoolGen object associated with the class, an org.apache.bcel.generic.InstructionFactory object used to create instructions for a particular class and constant pool, and an org.apache.bcel.generic.InstructionList object for a sequence of bytecode instructions being generated.

Wrapping up BCEL

For the JiBX binding compiler, I found it easiest to just hide all the BCEL details within several wrapper classes. For example, the class used to construct a new method (org.jibx.binding.classes.MethodBuilder) handles all the BCEL details of method construction while providing calls to append each required type of instruction to the current list. While you lose some of the flexibility of the BCEL InstructionList (which allows instructions to be inserted into and deleted from a list rather than just appended) with this approach, it works very well for the sequential bytecode generation used by the binding compiler. Similar classes wrap BCEL information for existing classes and methods.

Hiding the BCEL details in my wrapper also gave me convenient places to add more functionality, as in the class and method comparison code. This comparison code defines hashCode() and equals() methods for checking when one class or method can be substituted for another. The binding compiler uses these comparisons both to make sure that pre-existing binding methods it finds in classes are still needed, and to avoid creating duplicate classes or methods.
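One way to picture this kind of comparison is a key object whose equals() and hashCode() ignore the method name and look only at what the method does. The sketch below is an assumption about the general technique, not the JiBX wrapper code: MethodKeySketch and MethodPoolSketch are hypothetical names, and comparing the signature plus raw code bytes is a simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical key: two methods match when signature and code bytes match, whatever the name. */
final class MethodKeySketch {
    private final String signature;
    private final byte[] code;

    MethodKeySketch(String signature, byte[] code) {
        this.signature = signature;
        this.code = code.clone();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof MethodKeySketch)) {
            return false;
        }
        MethodKeySketch that = (MethodKeySketch) other;
        return signature.equals(that.signature) && Arrays.equals(code, that.code);
    }

    @Override
    public int hashCode() {
        return 31 * signature.hashCode() + Arrays.hashCode(code);
    }
}

/** Hypothetical pool that hands back an existing method name when the content already exists. */
final class MethodPoolSketch {
    private final Map<MethodKeySketch, String> namesByContent = new HashMap<>();

    String findOrAdd(MethodKeySketch key, String proposedName) {
        String existing = namesByContent.putIfAbsent(key, proposedName);
        return existing != null ? existing : proposedName;
    }
}

final class MethodPoolDemo {
    public static void main(String[] args) {
        MethodPoolSketch pool = new MethodPoolSketch();
        byte[] body = {0x2a, (byte) 0xb1};  // aload_0, return
        System.out.println(pool.findOrAdd(
            new MethodKeySketch("()V", body), "JiBX_binding1_marshal_1_0"));
        // Same signature and code under a different proposed name: the first name is reused.
        System.out.println(pool.findOrAdd(
            new MethodKeySketch("()V", body), "JiBX_binding2_marshal_1_0"));
    }
}
```

Keeping the comparison in equals() and hashCode() means ordinary hash-based collections do the duplicate detection, so no separate search code is needed.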

Bytecode headaches

In the category of general problems with bytecode generation, I'd list one big issue: the need to make sure that the code you're generating is going to be valid according to the JVM specification. If any detail of a generated bytecode instruction sequence is wrong, the class containing that bytecode will fail verification when you attempt to load it into a JVM. This failure leads to the painfully difficult task of working backward from the class that failed verification to find the code that generated the wrong instruction.

Much of the time I've spent on JiBX development went to tracking down invalid bytecode generation problems. Along the way, I've made several changes to the code to make these problems easier to find. The first such change was adding an option to run BCEL's own verification on a generated class, using the Justice verifier included in the org.apache.bcel.verifier package. Using Justice validation had the advantage of providing much more detailed information about errors than is supplied by the JVM (including the stack state and a portion of the problem code). However, I found that there were cases where Justice complained about issues that the JVM classloader ignored. To avoid spending time on this type of problem, I added another option to try loading the modified classes directly within the binding compiler. This second option provided a kind of smoke test for the modified classes -- if they could be loaded successfully by the binding compiler, users could be sure the same classes would load successfully when needed by their application at runtime.
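A load-time smoke test of this kind can be approximated with a small class loader subclass. The sketch below uses only standard library calls and is not the binding compiler's own code. SmokeTestLoader and MinimalClassSketch are hypothetical names, and a hand-written minimal class file stands in for a class modified by the binding compiler.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical loader that tries to define and link a class from raw bytes. */
final class SmokeTestLoader extends ClassLoader {
    SmokeTestLoader(ClassLoader parent) {
        super(parent);
    }

    /** Returns null when the class loads, or a short description of the failure. */
    String tryLoad(String name, byte[] classBytes) {
        try {
            defineClass(name, classBytes, 0, classBytes.length);  // class file format checks
            Class.forName(name, true, this);  // forces linking, verification, initialization
            return null;
        } catch (LinkageError | ClassNotFoundException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }
}

/** Writes a minimal class file (no fields or methods) as a stand-in for an enhanced class. */
final class MinimalClassSketch {
    static byte[] bytes(String internalName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(0xCAFEBABE);            // magic number
            out.writeShort(0);                   // minor version
            out.writeShort(49);                  // major version (Java 5 format)
            out.writeShort(5);                   // constant pool count (entries + 1)
            out.writeByte(7);                    // #1 Class, name at #2
            out.writeShort(2);
            out.writeByte(1);                    // #2 Utf8, this class name
            out.writeUTF(internalName);
            out.writeByte(7);                    // #3 Class, name at #4
            out.writeShort(4);
            out.writeByte(1);                    // #4 Utf8, superclass name
            out.writeUTF("java/lang/Object");
            out.writeShort(0x0021);              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                   // this_class
            out.writeShort(3);                   // super_class
            out.writeShort(0);                   // interfaces count
            out.writeShort(0);                   // fields count
            out.writeShort(0);                   // methods count
            out.writeShort(0);                   // attributes count
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}

final class SmokeTestDemo {
    public static void main(String[] args) {
        SmokeTestLoader loader = new SmokeTestLoader(SmokeTestDemo.class.getClassLoader());
        System.out.println(loader.tryLoad("Empty", MinimalClassSketch.bytes("Empty")));
        System.out.println(loader.tryLoad("Bogus", new byte[] {1, 2, 3, 4}));
    }
}
```

Note that Class.forName() with initialization enabled also runs static initializers, which may or may not be acceptable inside a build tool. Each loader instance can define a given class name only once, so a fresh loader is needed for each test run.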

I also added an option to print out the code generation model tree for a binding. The Justice output gave a snapshot of the problem. After I disassembled the modified bytecode of the invalid class (using BCEL's org.apache.bcel.util.Class2HTML utility) and compared the output with Justice's snapshot of the problem code, I could see where the error fit in with the larger flow. Once I understood what the code in question was attempting to do, I could then relate the error to the code generation model tree and isolate the model component where the generation was going awry. Even then, I still often had to debug through the code of the component to actually find the problem. This type of problem became so distasteful that I started avoiding any changes to the code generation for fear of unknowingly breaking something (and this fear despite a continually growing set of test bindings executed as part of the project build).

Preventative medicine

As Ben Franklin said, "An ounce of prevention is worth a pound of cure." As long as I was trying to backtrack from generated code to find problems, I was on the cure side of this saying. Instead, I needed to move to the prevention side. To do so, I added code in the method builder class that tracked the stack state at each point in the code generation, then threw a runtime exception if I tried to add an instruction that didn't match the stack state. In essence, I was implementing my own on-the-fly validator for the bytecode. I found the results were well worth the added effort.
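The idea can be sketched as a builder that keeps a model of the operand stack and checks it before each instruction is appended. TrackingBuilderSketch and its methods are hypothetical and cover only three instructions; they are meant to show the shape of the check, not the JiBX method builder itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical append-only builder that tracks operand stack types as code is added. */
final class TrackingBuilderSketch {
    private final List<String> code = new ArrayList<>();
    private final Deque<String> stack = new ArrayDeque<>();  // first element is top of stack

    void appendLoadInt(int value) {
        code.add("ldc " + value);
        stack.push("int");
    }

    void appendLoadString(String value) {
        code.add("ldc \"" + value + "\"");
        stack.push("java.lang.String");
    }

    void appendIAdd() {
        expect("iadd", "int", "int");
        code.add("iadd");
        stack.pop();
        stack.pop();
        stack.push("int");
    }

    int stackDepth() {
        return stack.size();
    }

    /** Fails at the point of generation if the top of the tracked stack has the wrong types. */
    private void expect(String instruction, String... topTypes) {
        List<String> actual = new ArrayList<>(stack);  // iterates from top to bottom
        for (int i = 0; i < topTypes.length; i++) {
            if (i >= actual.size() || !actual.get(i).equals(topTypes[i])) {
                throw new IllegalStateException("Cannot append " + instruction
                    + " as instruction " + code.size() + ": expected "
                    + Arrays.toString(topTypes) + " on top of stack, found " + actual);
            }
        }
    }
}

final class TrackingBuilderDemo {
    public static void main(String[] args) {
        TrackingBuilderSketch mb = new TrackingBuilderSketch();
        mb.appendLoadInt(1);
        mb.appendLoadString("oops");
        try {
            mb.appendIAdd();  // rejected here, not later at class load time
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The exception is thrown from inside the call that tried to append the bad instruction, so its stack trace leads straight to the generating code.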

Listing 1 shows an example of the error output generated from this on-the-fly validation. In this case, I introduced an error to the code generation handling a collection, adding an extra item to the stack in the body of the item unmarshalling loop. This extra item meant that the stack state at the end of the loop didn't match that at the beginning, so an error occurred when the code generation tried to append a branch back to the beginning of the loop. The error report includes the component that generated the branch instruction (the "generated by ..." line) and the two stack states (for the "to" and "from" branch locations), as well as the stack trace for the point where the error occurred.


Listing 1. Bytecode generation error reporting
java.lang.IllegalStateException: Stack size mismatch on branch
 in method simple.JiBX_MungeAdapter.JiBX_mybinding7a_unmarshal
 generated by org.jibx.binding.def.NestedCollection@1ebde03
 from stack:
  0: java.lang.Object[]
  1: java.lang.Object[]
 to stack:
  0: java.lang.Object[]

    at org.jibx.binding.classes.BranchWrapper.setTarget(BranchWrapper.java:184)
    at org.jibx.binding.classes.BranchWrapper.setTarget(BranchWrapper.java:201)
    at org.jibx.binding.def.NestedCollection.genContentUnmarshal(NestedCollection.java:172)
    at org.jibx.binding.def.ObjectBinding.genUnmarshalContentCall(ObjectBinding.java:750)
    ...

After I added the stack state tracking, bytecode problems became traceable to a particular line of code. This accountability didn't always make the fix easy, but it avoided all the earlier issues with finding the cause of a problem. Because of this greater ease of debugging, I was able to add some significant new features to the bytecode generation before the 1.0 release.
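The branch check behind a report like Listing 1 can be sketched as a comparison between the stack state recorded at the branch target and the state at the point where the branch is appended. BranchCheckSketch is a hypothetical class and the message format is only loosely modeled on the listing. The real JiBX classes named in the stack trace are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder that snapshots the tracked stack at targets and checks it on branches. */
final class BranchCheckSketch {
    private final List<String> stack = new ArrayList<>();  // index 0 is the bottom of the stack

    void push(String type) {
        stack.add(type);
    }

    void pop() {
        stack.remove(stack.size() - 1);
    }

    /** Records the stack state at a branch target, such as the top of a loop. */
    List<String> markTarget() {
        return new ArrayList<>(stack);
    }

    /** Appends a branch; the current ("from") state must match the target ("to") state. */
    void appendBranchTo(List<String> targetState, String generatedBy) {
        if (!stack.equals(targetState)) {
            throw new IllegalStateException("Stack state mismatch on branch\n"
                + " generated by " + generatedBy + "\n"
                + " from stack: " + stack + "\n"
                + " to stack: " + targetState);
        }
        // A real builder would append the branch instruction here.
    }
}

final class BranchCheckDemo {
    public static void main(String[] args) {
        BranchCheckSketch mb = new BranchCheckSketch();
        mb.push("java.lang.Object[]");
        List<String> loopStart = mb.markTarget();
        mb.push("java.lang.Object[]");  // extra item left on the stack by the loop body
        try {
            mb.appendBranchTo(loopStart, "collection sketch");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The demo reproduces the situation described for Listing 1: one item on the stack at the top of the loop, two at the bottom, and a failure at the moment the backward branch is appended.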


A look ahead

The JiBX 1.0 bytecode generation architecture makes an interesting case study for anyone looking at implementing complex bytecode transformations. Many of the lessons I learned during the development apply across different bytecode frameworks, and knowing about the pitfalls in advance can help you avoid making the same mistakes as I did along the way to the 1.0 production release of JiBX.

Next month I'm going to leave JiBX 1.0 behind and move on to the changes now in progress for 2.0. These changes include a complete rewrite of the binding compiler code generation, with an added option for using source code enhancement as an alternative to bytecode enhancement. Handling source code generation in the same framework as bytecode generation promises to create some interesting new problems. Check back next month for a look at the approach I'm taking to make this unlikely combination work together for the next generation of the JiBX framework.


Resources

Get products and technologies

  • JiBX: The author's fast and flexible XML data binding framework. This Web site includes online documentation of the user API, but for the details of the internals discussed in this article, you'll need to download the distribution and run the "devdoc" Ant build target to construct the internal Javadocs in the /build/docs directory.
