Efficient Navigation on large XML Structures

XML is great for platform-independent structured data exchange. JAXB is an incredibly handy piece of Java technology, which releases the developer from the cumbersome task of manually writing (de)serialization code and providing him regular Java POJOs as a programming interface. However, when it comes to navigation in large, highly cross-linked object graphs obtained from pure JAXB, you may face serious performance challenges.

This article will give some insights on how to solve these challenges in an elegant and yet efficient way.

The Challenge

The model I use as a reference example in this article is the Vehicle Electric Container, which can be used to describe the complete electrical network of a modern car. To give you a feeling of what I am talking about when saying “large highly cross-linked object graphs”, here are some key numbers:

If you want to navigate freely on the structure, then using JAXB out of the box is not enough. I will explain how this can be achieved without losing the advantages of JAXB, without writing clumsy code and without violating SOLID principles. But, first things first.

The Model

Example Section of the Model

Example Section of the Model

The above diagram shows a small portion of the model which I will use to explain the challenges and, later, the solution. It represents the schematic of an electrical system - the left side (highlighted in blue) represents the electrical components, the right side (highlighted in green) the connectivity among them.

A required function on this model could be: Give me all ComponentNodes that are connected to ComponentPort X.

There are many possible ways to calculate the result, but the most straight-forward and intuitive way is:

  1. Navigate to the ConnectionEnds referencing the ComponentPort X.
  2. Navigate to the Connection of the found ConnectionEnds and find the other sides (ConnectionEnds).
  3. Navigate from the other ConnectionEnds to the referenced ComponentPorts and from there up the hierarchy, over the ComponentConnector to the ComponentNode.
  4. You have reached your result!

So far so good. The problem with this is, that in pure XML / JAXB most of the above navigations are simply not available. Why is this? Let’s have a look at the XML and Java JAXB representation.

The Schema

XML is not designed especially for object-oriented serialization / deserialization, it is a far more general concept. To use it for OO serialization, the OO concepts must be mapped on defined XML concepts. In our case, the transformation pattern from the UML Model to XML Schema is the following:

The resulting XML Schema definition for some classes in the above diagram can be found in the listing below (I omitted all details that are not relevant for the example):

   <xs:complexType name="ComponentPort">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

   <xs:complexType name="ComponentConnector">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
               <xs:element name="ComponentPort"
                           type="vec:ComponentPort"
                           minOccurs="0"
                           maxOccurs="unbounded"/>
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

   <xs:complexType name="ConnectionEnd">
      <xs:complexContent>
         <xs:extension base="vec:ConfigurableElement">
            <xs:sequence>
               <xs:element name="Identification" type="xs:string"/>
               ...
               <xs:element name="ConnectedComponentPort" type="xs:IDREF"/>
            </xs:sequence>
         </xs:extension>
      </xs:complexContent>
   </xs:complexType>

The Java Classes

If you feed this schema into the “JAXB XML Schema to Java Compiler” (XJC), you will receive Java classes similar to the ones in the listing below.

There are some notable details in the generated classes you need to be aware of:

  1. All classes (e.g. VecComponentPort) know nothing about incoming references. There is also no reference to the containing class (e.g. VecComponentConnector) nor is there a possibility for reverse navigation to referencing classes (e.g. VecConnectionEnd). Therefore, the desired navigation described above is blocked. There are two reasons for this:
    1. JAXB classes can also be used for serialization. Bidirectional associations would redundantly store information, allow inconsistent data structures and would add more complexity to the serializer.
    2. Due to the general purpose of XML, it is impossible for the XJC to determine incoming references for all XML use cases.
  2. The Java representation of IDREF associations (e.g. VecConnectionEnd.connectedComponentPort) are defined with the base type Object, although the UML model defines it as VecComponentPort. The reason for this is, that the syntax defined by the XML Schema is less strict than the definition of the UML model. In XML Schema IDREF(S) references are not typed and can point to any element with a valid xs:ID. Therefore, it is impossible for XJC determine the correct type.
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ComponentPort", propOrder = {
    "identification"
})
public class VecComponentPort
    extends VecConfigurableElement
    implements Serializable
{
    @XmlElement(name = "Identification", required = true)
    protected String identification;

    public String getIdentification() {
        return identification;
    }

    public void setIdentification(String value) {
        this.identification = value;
    }
}
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ComponentConnector", propOrder = {
    "identification",
    "componentPorts"
})
public class VecComponentConnector
    extends VecConfigurableElement
    implements Serializable
{

    @XmlElement(name = "Identification", required = true)
    protected String identification;

    @XmlElement(name = "ComponentPort")
    protected List<VecComponentPort> componentPorts;

    public String getIdentification() {
        return identification;
    }

    public void setIdentification(String value) {
        this.identification = value;
    }

    public List<VecComponentPort> getComponentPorts() {
        if (componentPorts == null) {
            componentPorts = new ArrayList<VecComponentPort>();
        }
        return this.componentPorts;
    }
}
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ConnectionEnd", propOrder = {
    "identification",
    "connectedComponentPort"
})
public class VecConnectionEnd
    extends VecConfigurableElement
    implements Serializable
{

    @XmlElement(name = "Identification", required = true)
    protected String identification;

    @XmlElement(name = "ConnectedComponentPort", required = true, type = Object.class)
    @XmlIDREF
    @XmlSchemaType(name = "IDREF")
    protected Object connectedComponentPort;

    public String getIdentification() {
        return identification;
    }

    public void setIdentification(String value) {
        this.identification = value;
    }

    public Object getConnectedComponentPort() {
        return connectedComponentPort;
    }

    public void setConnectedComponentPort(Object value) {
        this.connectedComponentPort = value;
    }
}

Dead End Approaches

For an implementation with pure JAXB classes, I have seen two approaches which both have their own drawbacks. In order to create awareness for the problems arising from these, I will discuss the two approaches briefly.

The Pragmatic Approach

The pragmatic approach is: “If the direct way is blocked, I will go around!”. However, as in the real world, detours are normally longer.

In this approach, just the navigation from ComponentPort to the connected ComponentPorts would require:

  1. Start at the root (ConnectionSpecification), walk through the hierarchy and iterate over all ConnectionEnds.
  2. Check if the referenced ComponentPort is the same as the one where we started.
  3. If that is the case, select the other ConnectionEnds. For this operation you will have to have remembered you current Connection as context, because backwards navigation in the hierarchy is not possible.
  4. Navigate to the connected ComponentPorts.
  5. If you now want to know the corresponding ComponentNodes (the transitive parents of the ComponentPorts), you must restart a full tree search, this time on the left side.

If this example does not speak for itself, here are some reasons why you should never do it as just described:

The Caching Approach

In this approach, you calculate all reverse navigations in advance and store the results somewhere for future use - let’s call this “somewhere” Association Cache. After deserialization of the XML file, you basically carry out a complete model traversal and place all relevant associations in a cache for reverse lookup.

This solves the performance issue, by replacing the continuous overhead for each operation by a singular overhead at startup time. However, the stability and maintainability issues remain. There still is a complex model traversal necessary, your code still requires knowledge about possible navigations and you are violating the Single Responsibility Principle.

The Single Responsibility Principle states, that a code unit should have a responsibility for a single functionality, and that responsibility should be exclusive. In conflict to this, the “navigation responsibility” in this approach is shared between the JAXB generated class (for forward directions) and the association cache for backwards directions.

Another drawback of this approach is, that the cache has to be some kind of singleton object with a scope restricted to the current model. This object must be made accessible for all locations where a navigation in backwards direction is necessary. This can lead to very clumsy code.

Solution Outline

Describing the solution in all details with all the tiny knits and knots would break the scope of this article. Therefore I will just cover the main concepts and you will have to fill out the blanks in between by yourself or contact me directly for a discussion.

A clean solution of the challenge can be defined by some basic requirements:

The way to achieve all of this, is the most obvious and simplest solution you can think of:

We just make all associations navigable bidirectionally, with almost no runtime impact!

You may know bidirectional associations in Hibernate. The approach to this solution is nearly the same. To get it up and running we have two building blocks:

  1. We need to modify the JAXB classes in such a way, that they support bidirectional navigation.
  2. We need to initialize the associations correctly.

JAXB Class Customization

To support bidirectional navigations, we want JAXB classes that look like the modified VecComponentPort in the listing below. As you can see, it defines a parentComponentConnector property for navigation to the containing context element, as well as a refConnectionEnd property for all ConnectionEnds that define references to this ComponentPort. Both are marked as @XmlTransient, as bidirectional navigations are not supported by JAXB itself and they will be ignored by the JAXB (Un)Marshaller.

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ComponentPort", propOrder = {
    "identification",
})
public class VecComponentPort
    extends VecConfigurableElement
    implements Serializable
{

    @XmlElement(name = "Identification", required = true)
    protected String identification;

    @XmlTransient
    private VecComponentConnector parentComponentConnector;

    @XmlTransient
    private Set<VecConnectionEnd> refConnectionEnd = new HashSet<VecConnectionEnd>();

    public String getIdentification() {
        return identification;
    }

    public void setIdentification(String value) {
        this.identification = value;
    }

    public VecComponentConnector getParentComponentConnector() {
        return parentComponentConnector;
    }

    public Set<VecConnectionEnd> getRefConnectionEnd() {
        return refConnectionEnd;
    }
}

But how do we achieve these modifications? As always, there are many ways. You could modify all classes manually, but this is a cumbersome task and modifying generated classes is an absolute NO GO!

Luckily, XJC provides a plugin mechanism that allows the execution of custom plugins during the generation process. An XJC plugin has access to the code model of the generated classes and can modify them. An example for a simple plugin can be found here.

The following code snippet can create a private field, with an optional initialization value and a corresponding public getter method in an existing class:

    public JFieldVar build() {
        final JFieldVar field = targetClass.field(JMod.PRIVATE, baseType, name);
        if (init != null) {
            field.init(init);
        }

        field.annotate(codeModel.ref(XmlTransient.class));

        final JMethod getter = targetClass.method(JMod.PUBLIC, baseType, getterName);
        getter.body()
                ._return(JExpr.ref(field.name()));

        if (getterJavadoc != null) {
            getter.javadoc()
                    .addAll(getterJavadoc);
        }

        return field;
    }

Additionally, XJC plugins can be parameterized through JAXB bindings. With a JAXB Binding Customization, you can provide specific parameters for certain schema elements. This works for both JAXB standard functionality and XJC plugins. So, we implemented a JAXB plugin, that takes the information on references from the binding and modifies the generated code accordingly.

A part of the resulting binding is listed below. It contains two customizations:

  1. <nav:parent/> defines, that a parent association should be created for this class, with the given type and name.
  2. <nav:backref/> defines, that a reverse navigation should be created in the target class, with the given name.

As you may notice, this binding also fixes the type-safety issue of IDREF associations.

   <jxb:bindings node="//xs:complexType[@name='ComponentPort']">
      <nav:parent name="parentComponentConnector" type="VecComponentConnector"/>
   </jxb:bindings>
   <jxb:bindings node="//xs:complexType[@name='ConnectionEnd']">
      <nav:parent name="parentConnection" type="VecConnection"/>
      <jxb:bindings node=".//xs:element[@name='ConnectedComponentPort']">
         <nav:backref name="refConnectionEnd"/>
         <jxb:property>
            <jxb:baseType name="de.foursoft.harness.somepackage.VecComponentPort"/>
         </jxb:property>
      </jxb:bindings>
   </jxb:bindings>

To improve the maintainability of this approach, the custom JAXB binding is not created manually. As mentioned in the beginning, the XML schema representation is not as concise as the UML model. Luckily, in our case, the XML schema is generated from an UML model that contains all the information we require. Therefore, we use the UML model to generate the JAXB binding via some powerful XSLT-scripts as well. Through this approach, it is guaranteed that the JAXB binding is always consistent to the XML schema and no elements are missed. As a side effect, we save the time required for writing and adapting it manually.

Model Initialization

With the first step we now have JAXB classes that fit our needs. However, we still need to initialize them correctly, as our extensions are custom and JAXB will ignore them. For the initialization of our extensions, we use the JAXB functionality of the Unmarshaller.Listener interface. An implementation of this interface will be notified about every unmarshalled object and its corresponding parent in the XML tree.

The advantage of using this mechanism is, that we do not have to implement our own model traversal and it is ensured that we are notified about every single instance, regardless of the enclosing context. Nevertheless, there are two topics to deal with:

  1. The JAXB Unmarshaller is only able to handle one single Listener. For design reasons we decided to have a specific listener instance per Class. Therefore, the Listener interface is implemented by a ModelPostProcessorManager which delegates the notifications to the different ModelPostProcessor instances.
  2. The initialization requires two phases:
    1. The first phase is controlled by JAXB and we use the Listener.afterUnmarshall() events to collect all relevant objects for later initialization. The actual initialization cannot occur at this moment, as JAXB has not yet initialized the IDREF associations at this stage.
    2. Once the unmarshalling process by JAXB is completed, we start the second phase for actual initialization. This is achieved by wrapping the Unmarshaller with the ExtendedUnmarshaller. The ExtendedUnmarshaller is responsible for the correct configuration of the Unmarshaller and it triggers the second phase by calling the ModelPostProcessorManager.doPostProcessing().
Model Initialization Draft

Model Initialization Draft

The initialization of the associations is carried out by a ReflectiveAssociationPostProcessor. This implementation uses reflection to gather the necessary information and to initialize the associations accordingly. To provide it with all required information, two annotations have been introduced that are added by the custom XJC plugin to store this information: @XmlParent and @XmlBackreference.

@XmlParent is used to mark a field in a class for initialization with the corresponding parent object. @XmlBackreference marks an IDREF association as bidirectional and defines the name of the reverse navigation property in target class.

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ConnectionEnd", propOrder = {
    "identification",
    "connectedComponentPort"
})
public class VecConnectionEnd
    extends VecConfigurableElement
    implements Serializable
{
    @XmlElement(name = "Identification", required = true)
    protected String identification;

    @XmlElement(name = "ConnectedComponentPort", required = true, type = Object.class)
    @XmlIDREF
    @XmlSchemaType(name = "IDREF")
    @XmlBackReference(destinationField = "refConnectionEnd")
    protected VecComponentPort connectedComponentPort;
    @XmlTransient
    @XmlParent
    private VecConnection parentConnection;

    // Rest of the class is omitted.
  ....
}

More Thoughts on This

I think the outlined solution speaks for itself, but keep in mind that this is just a short summary. Especially the concrete implementation of the model initialization should be carried out with extra caution, as otherwise you will have numerous opportunities to create new performance black holes.

When implemented correctly (without excessive performance optimizations), this solution adds about 10% of additional overhead to the pure JAXB unmarshalling time. For this, you receive a comfortable, intuitive navigation between JAXB objects with a constant runtime complexity.

However, that is not the end of the road! What happens when you want to navigate into the model, e.g. look-up objects by ID or search for specific criteria such as “find all objects of a specific class”? For now, I will leave this to your own creativity.

Wie sind Ihre Erfahrungen mit großen XML Strukturen? Haben Sie andere Lösungsansätze? Ich freue mich über einen regen Austausch - kontaktieren Sie mich gern.

Johannes Becker

Geschäftsbereichsleiter
E-Mail: becker@4soft.de
4Soft GmbH
Mittererstraße 3
80336 München