More experiments with compilation

By Michael Kay on September 15, 2006 at 02:18p.m.

A couple of months ago (see http://saxonica.blogharbor.com/blog/_archives/2006/7/24/2157486.html), I reported on some experiments with compiling path expressions into Java code. The results were not enormously encouraging, but I've been taking the idea forward, on a slighly different front.

This time round, rather than compiling the low-level code for navigating a TinyTree, I'm compiling the higher level logic for evaluating path expressions, FLOWR expressions, and the like. I've done enough work to compile the whole of the XMark XQuery benchmark into Java code, with the exception of one query (Q18) that uses a user-defined function. The code generation is not particularly smart at the moment - in particular, it's always doing eager evaluation of expressions - but nevertheless, the speed of some of the queries has doubled. The bad news is that others are slower; but this is early days.

This is a bigger pilot than the previous experiment I reported, in that I've written code to compile about 50 of the expression classes generated by the Saxon parser and optimizer. That's about a fifth of the total, so there's a long way to go. The next stage is to put in some of the fundamentals like function calling and lazy evaluation, and convince myself that the performance gains make it worth doing. After that there's a long slog to get full coverage of the XSLT and XQuery languages, and to test everything. The testing won't be easy, because although there are now large XQuery and XSLT test suites around, many of the expressions they contain are compile-time constants.

Compiling to Java seems a sensible choice at the moment. Going straight to bytecode would be a lot more effort, and would introduce a high risk of bugs; and the extra benefit seems marginal. One unexpected benefit is that the generated Java gives a very clear view of the execution strategy for the query, which makes it much easier to spot opportunities for improving the optimizer.

There are some basic limitations in the approach. There's no saving in the time or memory used to build the source document. The generated Java code still calls the same iterators to navigate the axes over the source document, and the same serialization methods to produce the output. Together, those run-time routines account for a large part of the query cost, which isn't going to come down. But there's always scope for moving more code into the compiled part as opportunities show themselves. I'll set the target as a 50% improvement in execution time, but I think that's probably a stretch.