Hybrid Parser Architectural Pattern

Posted by
Author Jürgen Salecker
Siemens AG, CT SE 2
Otto-Hahn-Ring 6, 81730 München, Germany
Copyright Siemens AG 2006, all rights reserved; permission is granted to http://www.developerlife.com to reprint for publication purpose on the WEB page.

The Hybrid Parser architectural pattern applies to software systems which need to parse documents but are constrained by memory resources and processing power available. The pattern combines the processing advantages concerning execution speed and memory resources of event driven parsers with the programming comfort of a fully-fledged document object model, provided by an object tree parser. An event driven parser is usually by a factor of about 10 faster than an object tree parser but has the significant disadvantage that it is a cumbersome procedure to create an object model out of parsing events.

This pattern combines the advantages of both parsing techniques, the solution in a nutshell: Parsing events are collected and used to construct a parse tree which contains only node location information. Once the addressed node has been reached all following parsing events are used to construct a document object model (DOM) of just this part of the document. This avoids the necessity to hold the complete document in memory, when the focus is only on a part of the document. The extracted part is provided as a complete object model for comfortable processing with a DOM parser by an application.

Context

You need to extract a relatively small amount of information out of a larger (XML, for example) document. This extraction has to be done as efficiently and effectively as possible, both in terms of processing speed and required resources, such as memory and with a minimized implementation effort.

Example

The figure below shows a principle XML document, where the application needs to extract the two DOM sub-nodes trees identified by the two circles.

Figure 1, Principle DOM

With an object tree parser (DOM parser in XML technology) the complete document needs to be represented in memory first. Then the application is able to extract the two DOM sub-nodes by using the comfortable DOM API.

In the case an event based parser (SAX parser in XML technology) is used the application is responsible to construct an object model out of the generated parse events, which is a cumbersome procedure. Once the model is complete it is necessarily not a DOM model instead it is an application specific model, i.e. might even contain business logic.

Problem

Let’s assume an event based parser is used exclusively to extract the relevant information – the two DOM sub-nodes – shown in the figure above. In this case the processing speed will be sufficient, however the programming effort required creating an object model just out of parse events will be significant. Using an object tree parser instead, will solve the problem to create an object model, but the processing speed is usually not acceptable, at least for larger documents. It can be said that an event based parser is approximately by a factor of 10 faster than an object tree parser.

The following forces influence the solution:

  • Only a relatively small part of a document is required by the application.
  • Provide an effective parsing technique in terms of a minimized implementation effort, because it is a cumbersome procedure to exclusively use parse events to construct an object model of an interesting part of the document. Instead the architect should focus on the business logic and not waste his or her resources “i.e. implementation time” for processing parsing events to create a complete object model.
  • The main document to be parsed is too large to be present in memory at the same time.
  • Provide an elegant measure being able to forward the extracted part of the document to another application(s) as a generic object (i.e. instance of “org.w3c.dom.Node”) without containing any application specific logic.
  • The extracted object should be easily transformed into other document object models, like XOM, JDOM, Crimson DOM, DOM4J, etc. in order to gain advantages of the benefits of more specific APIs.
  • The computer resources in terms of memory and processing power available are limited, as it is typically the case in embedded systems. But similar situations might also occur in enterprise systems.

Therefore the usage of either an object tree parser or an event based parser alone does not produce satisfying results in terms of architectural elegance and resource consumption.

Solution

Combine a SAX parser with a DOM builder in order to gain advantage of the processing speed of a SAX parser and the programming comfort of a DOM parser, which provides a fully fledged document object model to the application.

The principle data flow of the hybrid parsing technique is shown in Figure 2 below:

Figure 2, Hybrid Parsing Technique

First define trigger locations (xpath collections) for the interesting parts of the document. Then use an event based parser (SAX) to locate (Travel xpath) the interesting areas within the document. Once the parser events from the SAX parser match a trigger node (Instantiate NodeDomHandler), it feeds all further parse events to a document tree builder. This document tree builder constructs (Construct DOM sub-object) a document sub tree out of the token events. The construction will go on as long the document tree builder accesses child nodes of the previous matched trigger node. The tree builder stores the extracted sub-document (Store DOM sub-object) once it has processed all child nodes of the trigger node. Finally it returns to the event based parser (SAX), and keeps on parsing.

Again the event based parser is searching for the next match (based on the defined trigger locations) which triggers the same procedure as described above. After the event based parser has finished parsing the complete document, the collected sub-documents – which are standalone document object models – are forwarded to classes for domain specific processing.

Implementation

The implementation relies on two major principles which are the event based parsing technique implemented by the class HybridHandler and the construction of DOM sub-objects out of the parsing events, implemented by the class DOMBuilder, a 3rd party software.

The XML processing framework from Apache Sun with Java 1.5.x has been used for an example implementation; the complete class diagram of the implementation of this pattern is shown in the figure below:

Figure 3, Hybrid Parser, Class Diagram

The slightly grey colored classes indicated 3rd party software from the packages:

3rd party interface

  • org.xml.sax.ContentHandler

3rd party classes:

  • com.sun.org.apache.xerces.internal.parsers.SaxParser
  • org.apache.xml.utils.DomBuilder
  • org.apache.sax.helpers.DefaultHandler

Software with similar functionality from other 3rd party packages might be used as well by taking into account slight adaptations.

As already mentioned the solution relies on two major principles the “event based parsing technique” together with “DOM sub-object creation out of SAX parsing events”.

Event Based Parsing Technique

The principle of this technique is shown in Figure 4, below. A SAX parser calls the method startElement() any time a new XML node ”
< ….>
” has been detected. The method endElement() is called once a closing XML element ”
< …>
” has been detected. The method characters() is called for anything detected between the start and the end of an XML node.

Figure 4, SAX Parsing Principle

The class HybridHandler overrides the methods (call-backs) startElement() and endElement(). The method characters() will be handled by its super class, the DefaultHandler (3rd party software).

The methods startElement() and endElement() are used exclusively to construct XML tree location information (xpath). The XML node name (including namespaces) is used to calculate the location information (xpath) of the current parse location. All other information provided by those two methods (XML attributes) is ignored, i.e. not relevant for this pattern.

If the calculated location information (xpath) matches with the xpath definition provided by the class xPath, the HybridHandler will instantiate the class NodeDomHandler and then forward the SAX parsing events to this newly created object.

DOM sub-object Creation

The class NodeDomHandler together with its super-class DOMBuilder implements the second principle of this pattern, the creation of a DOM sub-object. It uses the forwarded SAX events to construct a complete DOM sub-objects by including all details, XML attributes and name spaces. It uses the same instance of the SAX parser as the class HybridHandler.

NodeDomHandler overrides the methods startElement() and endElement() (call-backs from the SAX parser) and within those methods it calls its super class DOMBuilder for DOM sub-object creation, outlined below:

Figure 5, DOM sub-object Creation

The class NodeDomHandler calculates as well document location information (xpath) in the same way as HybridHandler, but with the exception: If the calculated location information matches with the xpath definitions provided by the class xPath it will instantiate itself recursively. An example XML document for this case is shown in the figure below:

Figure 6, Extracting XML Child Nodes

This provides a comfortable way to represent an XML tree as a tree of instantiated DOM sub-objects. In the example document model shown above (Figure 6) the DOM node three “./d/f” will be extracted as a DOM object from DOM node two. DOM nodes one and two have the same xpath “/a/b” but have different child nodes.

Once the NodeDOMHandler has reached its end condition (the same leaf level in the parse tree as it has been instantiated) it calls endNode() which adds the constructed DOM object to the user defined class via the method addNode() implemented by the class DomNodeCollection. Then it returns the SAX events back to its parent object. Its parent object could be either another instance of the same class NodeDomHandler or an instance of the class HybridHandler. Returning SAX events back to the previous initiator is initiated by the method getParser().setContentHandler() from interface HybridIF.

The class DomNodeCollection represents a class which holds the extracted DOM sub-object collection. This is the only class which might contain business specific code and is therefore not part of this pattern. It is shown in the class diagram (Figure 3) for the sake of completeness.

The class HybridHandler might be instantiated recursively in case XML documents are linked together via Xlink definition (see Figure 7 below). In those cases the HybridHandler has to be launched at the XML root document.

Figure 7, Recursive XML Document Processing

Known Uses

Please contact the author for references about known uses.

Consequences

The pattern provides the following benefits:

  • Light weight parsing of even large documents by resource constraint systems, both in terms of processing speed and memory resource required.
  • Combing the advantages of event based parsers with object tree based parsers.
  • Avoid the cumbersome procedure of having to implement a complete object model out of parse events.

The pattern provides following liabilities:

  • If the application requires more or less the complete document and enough memory is available one might consider using an object tree parser exclusively.
  • This pattern does not provide a direct binding of the XML document to the programming language in usage, because the document object model has to be queried via the API provided by the document object model (DOM).

Credits

This pattern has gone a long way to reach the point where it is now. As I decided to use XML in a development project somewhere in 2002 at Schlumberger/Norway, I searched the Internet for advice about how to parse large XML documents as efficient and effective as possible. In a forum (I forget the name of the original author) there was a post with the following four advices concerning the efficient processing of large XML documents:

  • Define trigger locations for interesting parts of an XML document.
  • Use an event based parser (SAX parser) to locate those locations.
  • Use a parse tree builder (DOM builder) to extract the defined location as a DOM sub-object.
  • Use a tree parser (DOM parser) to provide the binding to the programming language.

The core of the pattern is captured in those four statements. Also thanks to Kevlin Henney who basically invented the name of the pattern (in Q1/2003) once I explained the principle to him.

My first shepherd (EuroPlop 2005) pointed out that this pattern might be more generic as just to being applied for XML document processing. At this time I did not really believe, but at this comment got repeated by shepherd number three (Uwe Zdun, VikingPlop 9/2006), finally I was convinced. Shepherd number two (Peter Sommerlad, EuroPlop 7/2006) did provide very valuable support concerning the clarification of the essence of this pattern.

The writer’s workshop at VikingPlop 2006 composed of: James Siddle (SAP), Cecilia Haskins (NTNU), Juha Pärssinen (VTT), Met-Mari Nielsen (Runestone Game Development), Pavel Hruby (Independent), Klaus Marquardt (Dräger) did provide very useful suggestions in order to increase the quality of the paper even further. The UML diagrams became significantly clearer with the help of my colleague Silvano Cirujano-Cuesta, a UML specialist, at our Munich office. Finally my colleague Jürgen Schmitt initiated important last minute changes, concerning the correct description about XML child node processing.

Comments and Discussion

If you want to post comments on this article, please click here.

References

Acronym Descriptions References
W3C The World-Wide-Web consortium http://www.w3.org
XML Extensible Markup Language
XML is a simplified subset of SGML, it has been optimized for the WEB environment. XML retains 80% of SGML power, but it is only 20% as complex as SGML, XML documents are compatible with SGML.
http://www.w3.org/TR/2004/REC-xml-20040204/
DOM Document Object Model
A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407
XPath XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. http://www.w3.org/TR/xpath
Xlink XML Linking Language
It allows elements to be inserted into XML documents in order to create and describe links between resources. It uses XML syntax to create structures that can describe links similar to the simple unidirectional hyperlinks of today’s HTML, as well as more sophisticated links
http://www.w3.org/TR/xlink/
API Application Programming Interface
An application programming interface (API) is a source code interface that a computer system or program library provides in order to support requests for services to be made of it by a computer program.
http://en.wikipedia.org/wiki/Application_programming_interface
SAX SAX is a serial access parser API for XML. SAX provides a mechanism for reading data from an XML document. It is a popular alternative to the Document Object Model (DOM). The name is acronymic ally derived from “Simple API for XML”. http://en.wikipedia.org/wiki/Simple_API_for_XML
XOM XOM™ is a new XML object model. It is an open source (LGPL), tree-based API for processing XML with Java that strives for correctness, simplicity, and performance, in that order. http://www.xom.nu/
JDOM JDOM, we want to provide a solution for using XML from Java that is as simple as Java itself. http://www.jdom.org/
DOM4J DOM4J, is an easy to use, open source library for working with XML, XPath and XSLT on the Java platform using the Java Collections Framework and with full support for DOM, SAX and JAXP. http://www.dom4j.org/