Implementing XQuery Update

By Michael Kay on December 19, 2007 at 02:18 p.m.

I've been asked a number of times whether I planned to implement XQuery Update, and my answer has always been that I would do it when the time was right (i.e. when the spec seemed to be stable enough and when I had a suitable gap in my hectic schedule). I was hoping someone might want it badly enough to sponsor the development, but one can't wait for ever!

Anyway, I've started work on it and have made good progress in a couple of days. Parsing the syntax has required one or two tweaks to the tokenizer, which is a bit worrying - this is one of the few areas of the Saxon code that I'm frightened to touch in case something breaks; but at least there are now zillions of test cases. By and large the semantics of the spec pose no particular problems - in fact the spec comes very close to describing an implementation strategy that one can lift directly into the code.

One of the issues has been deciding what implementation of the tree model to work on. I decided to start with Saxon's "linked tree" (formerly known as the "standard tree"), which has had very little attention paid to it for the last five years or so, since the TinyTree came along. (The TinyTree is intrinsically non-updateable, because it relies heavily on static allocation of space in fixed arrays.) There are a few features of the linked tree that make update tricky, notably the use of sequence numbers allocated at tree-building time to make sorting into document order efficient; but there are ways around that. Once that's working I can think about supporting DOM, XOM, etc. if appropriate.
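To make the sequence-number problem concrete, here is a minimal Python sketch. It is not Saxon code: the `Node` class, `renumber` and `in_document_order` are hypothetical names invented for illustration, and renumbering the whole tree after an update is just one simple way around the problem, assumed here rather than taken from the implementation.

```python
class Node:
    """Hypothetical linked-tree node; not Saxon's actual class."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.seq = -1  # sequence number giving the position in document order

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def renumber(root):
    """Assign sequence numbers by a preorder walk, as a tree builder would."""
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        node.seq = counter
        counter += 1
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(node.children))


def in_document_order(nodes):
    """Sorting is cheap while the sequence numbers can be trusted."""
    return sorted(nodes, key=lambda n: n.seq)


root = Node("doc")
a = root.append(Node("a"))
c = root.append(Node("c"))
renumber(root)

# An update inserts <b/> between <a/> and <c/>; it has no valid number yet.
b = Node("b")
b.parent = root
root.children.insert(1, b)

stale = [n.name for n in in_document_order([c, b, a])]
renumber(root)  # assumed repair: renumber the whole tree after the update
fresh = [n.name for n in in_document_order([c, b, a])]
```

Before the renumbering pass the inserted node sorts into the wrong place (`stale`), and afterwards the three nodes come back in true document order (`fresh`). A full renumbering costs time proportional to the size of the tree, which is why a real implementation might prefer something cleverer.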

Simple things in XQuery Update seem to work really nicely: being able to do things like bulk renaming or deletion of nodes in a one-liner is a lot easier than the equivalent in XSLT using the modified identity template pattern (and of course infinitely easier than doing a hand-written tree traversal in XQuery 1.0). Doing slightly more complicated things, however, seems to get a bit messy, because of the "snapshot semantics" - basically the query never has read access to its own updates. But I think with experience the coding patterns will emerge to solve most of these problems.
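The snapshot semantics can be illustrated with a small Python sketch built on the standard `xml.etree.ElementTree` module. It only models the idea: the query is evaluated against the unchanged tree while the requested changes accumulate in a pending update list, which is applied in one go at the end. The tuple layout of the `pending` list and the sample document are assumptions made for this example, not anything taken from the spec or from Saxon.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<order><item/><item/><note/></order>")

# Phase 1: evaluate the "query" against the unchanged tree, only recording
# the requested updates in a pending update list.
pending = []
for item in doc.iter("item"):
    pending.append(("rename", item, "line"))
for note in doc.findall("note"):
    pending.append(("delete", note, doc))  # third field is the parent

# Still inside the "query": none of the updates are visible yet.
seen_during_query = [child.tag for child in doc]

# Phase 2: apply the whole list once evaluation has finished.
for op, target, arg in pending:
    if op == "rename":
        target.tag = arg
    elif op == "delete":
        arg.remove(target)

seen_after_apply = [child.tag for child in doc]
```

During evaluation the children still read as `item`, `item`, `note`; only after the list is applied do they become `line`, `line`. This is the source of the messiness with more complicated updates: anything that needs to read the result of an earlier change has to wait for a separate query.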

The trickiest design area is probably the interface between the query language and the external environment - what should the command line interface look like, and the API? In particular, when should the system be prepared to overwrite an input file with an output file reflecting the results of the update, and how much user control do you provide over such decisions? I'm still working on that.