Wednesday, 23 May 2007

XML Pipelining in Aardvark

One thing I've tried to do with Aardvark is identify where bits of code are truly reusable and factor them out so that they can be used in other projects. From this, we've built up some nice general utility classes for doing stuff with Strings and Objects (everyone else has probably done this too!), some helper classes for doing nice things with XML and a simple framework for doing cheap and efficient databinding (that is, converting Objects to and from XML). These reusable classes are collected under the uk.ac.ed.ph.aardvark.commons package hierarchy.

One such generalisation that's proved really useful is our class for doing XML pipelining (uk.ac.ed.ph.aardvark.commons.xml.XMLPipeline).

What's XML pipelining?

All of the text-based Knowledge Objects in Aardvark are ultimately stored as XML, which is great for representing the underlying structure of the content. (For example, lists, paragraphs, key points, mathematics, ...). On its own, this XML is a bit abstract so it needs to be processed to turn it into the various outputs Aardvark produces (e.g. nice web pages, digital overheads, PDF files). XML pipelining basically works like a traditional factory conveyor belt: the raw XML gets passed along the conveyor belt and gets gradually refined into the target output format. Why do it like this? Well, the factory analogy applies here too. People in a factory generally get very good at doing one thing repetitively and that works with XML pipelining too - we can create "pipeline steps" that do a single thing rather well, and then join all of the required steps together to build up something more complex. This is good for a number of reasons:
  • Breaking a complex process down into steps makes it easier to work with;
  • Individual steps are usually simple so can be verified to work correctly and do their job well;
  • Steps can be reused in related pipelines;
  • Steps are often so general that they can be refined for reuse in other projects.

How XML Pipelining works

(Warning: the rest of the post is very geeky!)

An XML pipeline normally consists of 3 components:
  1. A "source": that is, information flowing into the pipeline. In Aardvark, we assume that this is something which generates a stream of SAX events. (e.g. a SAX parser)
  2. Zero or more "handlers": these take incoming SAX events, do stuff to them, and send possibly different SAX events on to the next handler.
  3. A "serializer": This takes incoming SAX events and turns them into some kind of finished article. For example, it might create an XML document file or even use the incoming SAX events to build a Java Object or perform some kind of configuration work.
Not all components are necessary. For example, you can have a pipeline with no serializer. In this case, all of the data will "wash away" as it falls out the bottom of the pipeline. That sounds daft but can be useful if some of the handlers are building up information about the incoming data, such as hunting out hyperlinks or suchlike. An explicit source is also optional: we can simply fire SAX events directly at the first handler in the pipeline. We can also have pipelines with no handlers, which means that the data flowing out will be exactly the same as the data flowing in. Again, this sounds daft but can be a simple way of turning incoming SAX events into an XML document and is used in the Aardvark databinding classes. (The vanilla XML APIs in Java make this more awkward than it should be!)
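That last case - a source wired straight into a serializer with no handlers in between - is easy to sketch using only the standard Java APIs. This is a plain JAXP illustration rather than Aardvark code, and the class and method names here are my own invention:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class IdentityPipeline {

    /** Parses a String of XML and serializes it straight back out again. */
    public static String roundTrip(String xml) throws Exception {
        // The "serializer": an identity TransformerHandler writing to a StringWriter.
        SAXTransformerFactory stf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        // The "source": a plain SAX parser firing events straight at the serializer.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(serializer);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Even in this trivial case you can see the two different APIs (SAX and TrAX) meeting in the middle, which is exactly the awkwardness mentioned above.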

What kind of handlers can we use?

Handlers generally fall into 2 categories:
  1. A SAX filter. This is a low-level filter that simply receives SAX events, does stuff to them and fires out new SAX events. SAX filters are great if you want to make minor perturbations to a document (e.g. do something to hyperlinks, miss sections out).
  2. An XSLT transform. This lets you make really major changes to the incoming data. In Aardvark, we use these to go from the "raw" document formats to more polished output formats. XSLT is much more expensive than SAX but is often necessary and actually performs very well, especially if you reuse your stylesheets.
It's common for there to be a mixture of these two types of handler in a pipeline. Be aware that most XSLT processors will build a DOM tree from incoming SAX events so it makes sense to group XSLT handlers together and have SAX stuff before and/or after all of the XSLT.
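To make the first kind of handler concrete, here's a small self-contained SAX filter written against the standard org.xml.sax APIs (again, not Aardvark code). It "misses sections out" by swallowing a hypothetical <secret> element, its content, and everything nested inside it; the element name and the helper method are my own illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** SAX filter that silently drops &lt;secret&gt; elements and their content. */
public class SecretStrippingFilter extends XMLFilterImpl {

    private int suppressionDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (suppressionDepth > 0 || "secret".equals(qName)) {
            suppressionDepth++; // swallow this event and remember the nesting
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (suppressionDepth > 0) {
            suppressionDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (suppressionDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Convenience: run a String of XML through the filter and serialize the result. */
    public static String apply(String xml) throws Exception {
        SAXTransformerFactory stf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        SecretStrippingFilter filter = new SecretStrippingFilter();
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Notice how little code the actual filtering takes: each overridden method either forwards the event downstream or quietly drops it.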

XML pipelining in Java

It's possible and fairly easy to do XML pipelining using the existing Java APIs but it's not quite as nice as it should be. One reason for this is that setting up a pipeline often requires a mixture of the standard SAX API and Java TrAX API (used for XSLT) and, being designed by two completely different bodies, they're not at all alike: a filter handler is represented by the org.xml.sax.XMLFilter interface; an XSLT handler is represented by the javax.xml.transform.sax.TransformerHandler interface. Making the pipeline work consists of configuring each handler to ensure it passes its output on to the next handler in the pipeline and the resulting code can be a bit messy. This is where XMLPipeline comes in.
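To see just how messy that wiring gets, here's roughly what a one-filter, one-transform pipeline looks like using nothing but the standard SAX and TrAX APIs. The stylesheet, input document, and class name are invented for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLFilterImpl;

public class ManualPipeline {

    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='/doc'><html><xsl:value-of select='p'/></html></xsl:template>"
          + "</xsl:stylesheet>";

    public static String run(String inputXml) throws Exception {
        // TrAX side: compile the stylesheet into a TransformerHandler...
        SAXTransformerFactory stf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler xsltHandler =
                stf.newTransformerHandler(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        xsltHandler.setResult(new StreamResult(out));

        // SAX side: a (no-op) filter whose parent is a real parser...
        XMLFilterImpl filter = new XMLFilterImpl();
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());

        // ...and the manual plumbing that joins the two APIs together.
        filter.setContentHandler(xsltHandler);
        filter.parse(new InputSource(new StringReader(inputXml)));
        return out.toString();
    }
}
```

Every extra handler means another round of this setParent()/setContentHandler()/setResult() juggling, which is precisely the boilerplate XMLPipeline hides.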

Our uk.ac.ed.ph.aardvark.commons.xml.XMLPipeline class

The design of XMLPipeline is intentionally simple. (My first stab at this tried too hard to be clever and suffered as a result, so I learned from the mistakes made there!) It follows the 'builder' design pattern and is just a thin wrapper over all of the gubbins we usually need to do pipelining. Its main advantage is that it makes it really easy to assemble a pipeline, making the resulting code very easy to understand and less prone to errors and future changes.

To get started, create a new XMLPipeline(). You can then build the pipeline by adding a number of handlers using zero or more of the following methods:
  1. addFilterStep() lets you add a SAX filter to the pipeline. This is overloaded to accept either a "standard" org.xml.sax.helpers.XMLFilterImpl or a more general "lexical" filter (uk.ac.ed.ph.aardvark.commons.xml.XMLLexicalFilterImpl). The difference between the 2 filters is that the latter also receives information about comments, entities and DTDs.
  2. addTransformStep() lets you add an XSLT transform to the pipeline. This is overloaded to take either an implementation of javax.xml.transform.Source, which locates a stylesheet to be read in and compiled, or a javax.xml.transform.Templates, which represents a stylesheet that has already been compiled for reuse.
Calls to these methods simply ensure that each handler gets configured to pass its output to the next handler "downstream".

Once you've added a number of handlers, you can choose to terminate the pipeline as follows:
  1. addSerializer() will serialize the resulting XML into the javax.xml.transform.Result you pass to this method. This is the most common way of terminating the pipeline - passing a javax.xml.transform.stream.StreamResult allows you to save the resulting XML to a String or file, which is a common use scenario.
  2. addTerminalStep() takes a generic SAX org.xml.sax.ContentHandler or org.xml.sax.ext.LexicalHandler and makes that the receiver of the pipeline's output. This can be useful if you want to plug a pipeline into another pipeline or someone else's SAX input.
Once you've added a terminal step, the pipeline will not allow you to add any more handlers. You can also choose not to terminate the pipeline, as mentioned earlier.

Once set up, you can run the pipeline in two ways:
  • Call execute() passing either a java.io.File or org.xml.sax.InputSource. This will parse the incoming XML and pass it through the pipeline.
  • Call getStep(0) to receive the first handler in the pipeline and fire your own SAX events at it. (This is how our Object -> XML databinding works.)
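Putting all of the above together, a typical assembly might look something like this. Treat it as a sketch only: XMLPipeline is Aardvark-internal, so while the method names come from the descriptions above, the exact signatures and the filter and file names are my assumptions:

```java
// Sketch only: XMLPipeline is Aardvark-internal; signatures and names are assumed.
XMLPipeline pipeline = new XMLPipeline();
pipeline.addFilterStep(new HyperlinkRewritingFilter());    // hypothetical XMLFilterImpl subclass
pipeline.addTransformStep(new StreamSource(new File("web-output.xsl"))); // XSLT step
pipeline.addSerializer(new StreamResult(new File("page.html")));         // terminate
pipeline.execute(new File("knowledge-object.xml"));        // parse input and run the lot
```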
And that's it! It's nice and easy. The XMLPipeline class also tries to help with any runtime XSLT errors by unravelling any Exceptions that are produced; in normal pipelines they tend to get wrapped up by each step in the pipeline and get lost in stacktrace noise. For other goodies, have a look at the JavaDoc or source.
