How long is a (piece of) string? - Saxon diaries

| No Comments | No TrackBacks

As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.

At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.

If we look to the specification, there are two complications we have to deal with. Firstly we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF16 codepoints. In the case of surrogate pairs, one character occupies two codepoints, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.

Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string. But @code!string-length() returns 3 - the length of the string value of the attribute node.

The other complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider:

<xsl:for-each select="0 to 9">
<xsl:variable name="f" select="string-length#0"/>
<xsl:for-each select="21 to 50">
<xsl:value-of select="$f()"/>

This will print the value "1" three hundred times. In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.

Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.

In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.

Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example), have an implementation designed to produce streamed output. That's designed to ensure that an instruction like , which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.

Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this happened was because of a fear of performance regression that might occur if the existing code for static calls was generalized, rather than introducing a parallel path.

In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup("fn:string", 0). The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0 where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.)

The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again. In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is the length of the array. And when getStringLength() is called on a string the first time, the string is left in this form, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).

There are a few other optimizations in the implementation of string-length() that are worth mentioning. Firstly, it's quite common for users to write

<xsl:if test="string-length($x) != 0">

Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)

If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)

During type-checking, the implementation of string-join#0 keeps a note of whether a context item is known to exist. This is used during byte-code generation; if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.

In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that with the exception of the core code in StringValue.getStringLength, and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes StringLength_0 and StringLength_1 each of which consists of a single one-line method. The code for generating byte-code can also be considerably simplified by achieving more reuse across different functions.

The main essence of the reorganization is that the class StringLength (or rather, its two variants) are no longer Expressions, they are now Functions. Previously a call onto string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two object, a StringLength object which is a pure function, and a SystemFunctionCall object which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function. This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.

So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.

Functions, Function Calls, Function Items - Saxon diaries

| No Comments | No TrackBacks

XSLT and XQuery, in their 3.0 incarnations, have become fully functional programming languages, where functions are first class values that can be manipulated in the same way as other values, for example they can be bound to variables and passed as arguments to other functions. So you would expect that functions play a pretty central role in the Saxon implementation. In this article I shall review how functions work within Saxon, and how I think this needs to change.

One of the things that happens when features are added to a complex piece of software is that they tend to be added around the edges, rather than in the core. You can tell when something was added by how central it is to the class hierarchy. It shouldn't be that way, but there are good reasons why it happens. And when we look at how functions work in Saxon, that's what we see. For example, if we had always had dynamic function calls, then we would probably find that static function calls (where the function to be called is known at compile time) were treated simply as a special case of dynamic calls, much as ($x+1) is treated as a special case of ($x+$y). But because they came along later, dynamic calls are actually handled very differently.

I've been doing work recently on the design of Saxon's expression tree (the data structure produced by the compilation phase for use by the execution phase of stylesheet and query evaluation). One aspect of that has been working out how to write the expression tree to persistent storage, and then load it back again for execution. In doing this, I've been struck by the sheer number of different classes that are somehow related to functions, and by the lack of coherence between them.

The first thing we notice, which is in itself a bit smelly, is that there is no class called Function. In fact, for many functions, there is no object in Saxon that represents the function itself. For example, with system functions (recall that in XSLT 1.0 and XPath 1.0 that's the only kind there were), we have two things: a data table containing general information about functions, such as their type signatures and context dependencies, and a class SystemFunctionCall which actually represents a call to the function, not the function itself. The implementation of functions such as name() or substring() is in a class called Name or Substring which is a subclass of SystemFunctionCall. This in turn is a subclass of Expression, and as such it forms a node in the expression tree.

This works fine for static function calls, but what happens to an expression such as substring#3 or true#0 or name#0? Answer: different things in each of these three cases.

For substring#3, this is a pure function: there are no context dependencies. We create something called a SystemFunctionItem, which is an Item (and therefore a Sequence), we wrap this inside a Literal, and we put the Literal on the expression tree. This is much the same as the way we handle a constant such as "London" or 39. Internally, the SystemFunctionItem contains a reference to an instance of the Substring class, which is where things get a bit peculiar, because Substring is designed to be a function call, not a function. In fact, Substring has a dual personality, it tries to act in both roles depending which methods you call. That's a hack, and like all hacks, it comes back to bite you.

For true#0 there's a further bit of hackery, because the function call true() doesn't actually generate a SystemFunctionCall, it generates a BooleanValue: it's treated as a constant, just like "London" or 39. But we have to support dynamic calls on true(), so we introduced a SystemFunctionCall called True to handle this case, even though it doesn't have the dual personality: it acts only as a function, never as a function call.

The name#0 function is different again, because it encapsulates context. It's defined to return the name of the node that was the context node at the point where the name#0 expression was evaluated. So this should remind us that name#0 is not a value, it is an expression: the function it represents has no static existence, it can only be created dynamically by evaluating the expression with a particular dynamic context. We solve this with another hack: for context-dependent function "literals" like name#0 or position#0, the compiler actually generates a call on function-lookup("name", 0), which is an expression rather than a value, and which has to be evaluated at run-time.

The SystemFunctionItem class implements an internal interface called FunctionItem, and as one might expect, a FunctionItem is an Item and therefore a Sequence: that is, it's a run-time value. Other subclasses of FunctionItem are used for calls to user defined functions, Java extension functions, or constructor functions. But they are only ever used for dynamic calls, never for static calls.

Although there is no Function class, some functions are in fact represented as objects in their own right. The most important is UserFunction, which represents a user-written function in XSLT or XQuery. Another is InlineFunction (representing an anonymous inline function), and another is ExtensionFunctionDefinition, which represents a function whose implementation is provided in Java code: this is used both for user-written extension functions, and for many (but not all) vendor supplied extension functions in the Saxon namespace. But these classes have nothing in common with each other, there is no common superclass. This has the consequence that not only is there one set of machinery for static function calls and another quite different set for dynamic calls, but that in each case, there are many different flavours depending on what kind of function you happen to be calling.

Getting this right involves a cleaner data model. As always, a clean data model leads to cleanly structured code, and the data model should be designed to accurately reflect the specification, not for the convenience of the implementation. The specification says we have two objects of interest, a function, and a function call. Quite rightly, the spec no longer uses the term "function item" as something distinct from the function: a function is an item. There's also something called the "function implementation" which we should recognize, because two functions may share an implementation. For example, the expression name#0 returns a different function each time it is evaluated: these functions differ in the data they hold (a snapshot of the dynamic context), but they share the same implementation.

We should be trying to move towards a structure where we have a hierarchy of subclasses of Function (for system functions, constructor functions, user functions, Java extension functions, etc); where the Function is an Item; where a Function may reference a FunctionImplementation to provide for sharing; and where a FunctionCall is as far as possible independent of what kind of function it is calling, but simply has two flavours for static function calls and dynamic function calls.

It will be interesting to see how close we can get to that situation.

Redesigning the Saxon expression tree - Saxon diaries

| No Comments | No TrackBacks
I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.

One of the problems is that there's 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.

One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves that part of the static context that it thinks it might need. For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another). A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context - if one operand turns out to be untyped atomic and another is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better rather than saving what each expression thinks it might need, to save the changes to the static context on the tree, so each expression can discover the whole static context by processing a set of deltas held with its ancestors in the tree.

This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization) so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.

In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, and also properties in effect of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modtifications to the tree, such as substitution one child expression for another (which is very common during type-checking and optimization) is still entirely ad-hoc, and has to be managed independently by each class of expression node.

As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.

This process is complicated by the fact that the nodes on the expression tree are highly diverse, in fact they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.

Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck and optimize. For example, the default action of optimize is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, meaning that the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as primote().

I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of A and C simultaneously (making it not a tree at all?). It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes in the tree, which might then be discarded if not all the conditions for optimization are met. So we've updated the parent pointers in the leaf nodes to this new superstructure, which we then discarded, reverting to the original. It's difficult then to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.

This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.

Sharing key indexes - Saxon diaries

| No Comments | No TrackBacks
For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some careful juggling of Java weak references to ensure that the indexes are dropped from memory as soon as either the executable stylesheet or the source document are no longer needed.

I've now spotted a flaw in this design. It wasn't reported by a user, and it didn't arise from a test case, it simply occurred to me as a theoretical possibility, and I have now written a test case that shows it actually happens. The flaw is this: if the definition of the key includes a reference to a global variable or a stylesheet parameter, then the content of the index depends on the values of global variables, and these are potentially different in different transformations using the same stylesheet.

Fixing this is not easy, and poses some interesting dilemmas, given that no-one has hit the problem in a dozen years and it's quite likely that no-one ever will. It would be nice to roll back my discovery of the problem and pretend that it doesn't exist, but that's not my style.

One solution would be to abandon the strategy of reusing indexes. It's likely that very few users have workloads that benefit from this strategy. On the other hand, those that do could see a big performance hit. Removing an powerful optimization because of a problem that doesn't apply to these workloads is not very friendly.

The right solution is to identify those key definitions that allow indexes to be shared and distinguish them from indexes that can't be shared, and then manage the two kinds of index differently. Both the identification and the management can be done, but they introduce more complexity and hence more risk of bugs, and the whole thing feels like overkill given that we don't know any users who have the problem.

Civil engineers design structures to withstand storms that are likely to happen more than once in N years. It would be nice if we could adopt a similar approach here. But software doesn't seem to be like that; it seems to follow Murphy's law, which states that if something can go wrong, it will.

So with heavy heart, I think that there's no alternative to "doing the right thing", with all the complexity it entails.

Reference: Saxon bug 1968; W3C test case key-085a/b.

Another regex rewrite - Saxon diaries

| No Comments | No TrackBacks

In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things usually bring long-term benefits. The risks relate to both functionality and performance. In both cases we were able to minimize the risks thanks to a good collection of test material, but we always knew that contact with real life problems might throw up things that the test suite didn't.

Why did we need the new engine? There were several factors. On Saxon-CE, the GWT technology didn't offer access to the JDK regex library, only to the Javascript regex engine, which has different syntax and semantics, and no support for non-BMP Unicode characters. We could have translated XPath regexes to Javascript regexes the way we were doing for the JDK, but the expansions to handle non-BMP characters are very cumbersome. (We could also have abandoned W3C regex conformance, but that's not my style.) For "server-side Saxon", I had been impatient with the existing design for some time. There are bugs I reported in the JDK regex engine 5 years ago which Sun/Oracle have shown no inclination to fix; the approach of translating one regex dialect to another is slow and clunky; and the JDK wasn't keeping pace with new Unicode versions. I felt it was time we brought the regex engine in-house.

I looked around for an open source regex engine suitable for forking, and chose the Apache Jakarta engine. Unlike some other options the code was clear and well commented, the performance seemed reasonable, and the license was compatible. I made all the changes needed to support the XPath regex syntax and semantics, and a few other changes including some optimizations, and it passed all the tests. In fact, it has proved very reliable; there have been no bug reports against its functionality.

But there have been two performance glitches.

  • One is the way it handles integer repetition bounds such as a{3,7}. The algorithm it uses for this is the same as Saxon uses in its schema processor, which is described in a paper by Thompson and Tobin [1]; it essentially translates such an expression to aaaa?a?a?a?, and then generates a finite state machine accordingly. This becomes problematic when the integer bounds start to become significantly large; the size of the finite state machine and the time taken to generate it both increase rapidly with size. (For the technically minded, I'm not sure how rapidly, but even if it's only linear, it's a significant problem.)

  • The second problem is the handling of backtracking. The Jakarta engine, like most regex engines, relies on backtracking to handle ambiguous expressions. Backtracking relies on maintaining a stack so you can go back to where you were. In Jakarta, this data was kept on the Java program stack; each checkpoint was represented by a recursive function call. In many cases each character position in the input string is a potential checkpoint, so the depth of recursion was essentially equal to the length of the input string. That puts a limit of around 1K to 2K on the size of the input, which is not really acceptable.

For the 9.5 release I mitigated this problem by introducing an optimization: in cases where repetition was unambiguous, the recursive calls were eliminated. This catches many cases like A*B where you know that you have to keep reading A's until you hit a B, and when you do hit a B you know you won't have to try again from somewhere in the middle of the sequence of A's. But unfortunately, ambiguous expressions are common (an example would be .*B), and we have to handle them.

So I did something that I very rarely do; I studied how the JDK regex engine avoided this problem, by following its activity through the debugger. I was surprised to see that the JDK engine doesn't actually build a finite state machine. Instead it uses a pure interpreter pattern: it builds an abstract syntax tree representing the grammar of the regex, and then evaluates the regex by direct interpretation of the constructs on this tree. The maximum recursive depth with this approach is represented by the depth of the abstract syntax tree (that is, it depends on the regex but not on the input string), and the data needed for backtracking is held on the heap rather than on the Java stack.

Over the last week I have been rewriting the back end of the (ex-)Jakarta regex engine to use a similar design. It's passing all the functionality tests; I still have some performance testing to do, but it's looking good. The front-end code, that is the regex parser, is largely unchanged, but it now generates an operation tree rather than an FSM. I retained many of the features of the existing design, such as the use of a variety of implementations of integer sets and predicates to support the different kind of character classes, but the operations have all been re-implemented to work interpretively. I departed from the JDK design by using an approach similar to Saxon's XPath interpreter, in that the implementation of each operation returns an iterator over all the strings matched by the relevant expression. The key method supported by each operation is

IntIterator iterateMatches(final REMatcher matcher, final int position)

which is passed an REMatcher (contain all necessary information about the match operation, including the input string and regex flags, plus support for captured groups and backreferences), and a start position for matching; it return an IntIterator representing a sequence of integers which are the possible positions where a match can end.

The class handling a repeat operation such as (a|aa)* implements this operation by repeated calls on the underlying choice operation. The first choice operation will return an iterator over the two possible end positions (1,2); the next time round there are four possible positions (1,2,2,3) (there is no attempt to eliminate duplicates), and so on.The method maintains a stack of IntIterator objects representing these successive calls; initially only the first hit from each iterator is used, but when backtracking is necessary, the stack is popped, and the iterator now at the top of the stack is called to deliver its next match.

Saxon continues to take advantage of the optimization that recognizes unambiguous repetitions. With patterns such as A*B, or A{5}A, there is no need for backtracking so the stack of iterators is not used, and only the first match from each iterator is used.

There is scope for further optimization here. Some regular expression engines such as PCRE are known to handle highly-ambiguous expressions much more efficiently than this crude backtracking approach, mainly by eliminating the duplication that arises when a backtrack attempt ends up doing the same matches at the same positions as a previous attempt. Such improvements lie in the future; for the moment, being as good as the Java engine in performance, and better in features such as Unicode support and buglessness, is enough to declare success.

The new code is currently operational on the Saxon 9.6 development branch. In due course it will be retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance release. Probably only 10% of the original Jakarta code remains, but that's probably enough that I can't get rid of the Apache license that goes with it, for those who care about these things.

[1] Using Finite-State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. Henry Thompson and Richard Tobin. XML Europe 2003

At the XML Summer School 2013, Tony Graham presented a lightning talk about life after libxslt 1.0.  I was not present for this summer school, but it was clear from the feedback of the discussions I received that there is a major gap of XSLT 2.0 support in the large developer community of C/Perl/PHP/Python/Ruby world and associated tools that rely on libxslt.

It is a known problem, which has never, to my knowledge been addressed. At Saxonica, we wanted to try and plug this gap by porting the Saxon processor from Java to C/C++, which would enable us to communicate with the languages specified above. One of our goals, if possible was to interface with libxml and libxslt. Providing such a bridge or cross-compiled version of a full fledged Java application to C/C++ is always a daunting task. In this blog post I discuss the technical steps in our quest to achieve our goals and give some details of the experiences gained along the way. I will begin by detailing the various technologies that we tried, and how we have have ended up using a commercial Java native compiler after several failed attempts with tools that either did not work, cumbersome or were just too error prone.


At the summer school there were discussions that the tool LLVM could do the job of compiling Java to native code. As claimed on the project website LLVM is a collection of modular and reusable compiler and toolchain technologies. The LLVM project seems very active with many projects using it to do various task, but I found it difficult to get anything working. In particular, I tried using the VMKit which relies on LLVM to compile some a simple 'Hello World' examples to machine code, but even that seemed cumbersome.


Secondly, I looked at the GCJ technology. GCJ is a tool that I have used before, so I was confident that it would work. However, from my past experience using this tool is that it can be error prone and contains long standing bugs, which is a result of the project being dormant for several years, it seems unlikely that bugs will be fixed. The other worrying fact is that GCJ only supports up-to JDK 1.5. Nevertheless for lack of other options, I persevered with GCJ and I had much better success given that I managed to compile Saxon-HE to native machine code  and actually got it to execute my example stylesheets. I had some problems because of classes that were not present in the GCJ implementation of JDK 1.5, such as the packages java.math and javax.xml. Therefore, I had to include my own version of these packages.

The next step was to create a shared library of Saxon-HE, so that I could interface it with C/C++. This proved to be a real battle, which in the end I succeeded. I decided to use Compiled Native Interface (CNI), which presents a convenient way to write Java native methods using C++. The alternative was JNI (Java Native Interface), which may be viewed as more portable. Both interfaces though have similar principles: you need a Java/CNI-aware C++ compiler, any recent version of G++ is capable, and then you must include the header file for each Java class it uses. These header files, if not automatically generated, can be done using gcjh. I soon gave up on using GCJ: I stumbled upon a few known bugs and because if I was having major issues with the setup and prerequisites required then surely users would have the same problems.

Excelsior JET

The Excelsior JET tool is the final technology we looked at and thankfully it is what we have ended up using in the alpha release. JET is a commercial product that provides a Java native compiler for both Linux and Windows platforms. What is good about this software tool is that it provides an easy to use Graphical interface to build native executables and shared libraries from jar file(s). It also has the feature to package up the software into an installer ready to be deployed onto its intended host machine. This was great for us! 

There is a lot I could write about JET, but it would be a repeat of the plethora of information currently available on their website and forum. However, just to mention we started with their evaluation version which offers 90-days free usage of their software before purchasing the professional edition. Another point of interest is that Excelsior offer a free-of-charge license for use in conjunction with open-source software.

We know that there will be some sections of the open-source community that dislike the dependency upon using a commercial tool, but it is not that dissimilar from the early years of Java when the Sun compiler was freely available but not open-sourced.  

Implementation notes using JET

After creating the shared library, to interface it with C/C++ I used JNI. It is possible to use JET's own Java interface to external functions called xFunction, which is recommended if starting from scratch, but having used JNI with GCJ I continued with that approach. To get started there are a few examples of invoking a library with C/C++. In essence, you need to load the library and initialize the JET run-time before you can use it, see the code below (from the file

/* Load dll. */
HANDLE loadDll(char* name)
  HANDLE hDll = LoadLibrary (name);

  if (!hDll) {
    printf ("Unable to load %s\n", name);

  printf ("%s loaded\n", name);
  return hDll;

extern "C" {jint (JNICALL * JNI_GetDefaultJavaVMInitArgs_func) (void *args);
            jint (JNICALL * JNI_CreateJavaVM_func) (JavaVM **pvm, void **penv, void *args);

/*Initialize JET run-time.*/
extern "C" void initJavaRT(HANDLE myDllHandle, JavaVM** pjvm, JNIEnv** penv)
  int result;
  JavaVMInitArgs args;

  JNI_GetDefaultJavaVMInitArgs_func =
  (jint (JNICALL *) (void *args))
  GetProcAddress (myDllHandle, "JNI_GetDefaultJavaVMInitArgs");
  JNI_CreateJavaVM_func =
  (jint (JNICALL *) (JavaVM **pvm, void **penv, void *args))
  GetProcAddress (myDllHandle, "JNI_CreateJavaVM");

  if(!JNI_GetDefaultJavaVMInitArgs_func) {
    printf ("%s doesn't contain public JNI_GetDefaultJavaVMInitArgs\n", dllname);
    exit (1);

  if(!JNI_CreateJavaVM_func) {
    printf ("%s doesn't contain public JNI_CreateJavaVM\n", dllname);
    exit (1);

  memset (&args, 0, sizeof(args));
  args.version = JNI_VERSION_1_2;
  result = JNI_GetDefaultJavaVMInitArgs_func(&args);
  if (result != JNI_OK) {
    printf ("JNI_GetDefaultJavaVMInitArgs() failed with result %d\n", result);

  /* NOTE: no JVM is actually created
  * this call to JNI_CreateJavaVM is intended for JET RT initialization
  result = JNI_CreateJavaVM_func (pjvm, (void **)penv, &args);
  if (result != JNI_OK) {
    printf ("JNI_CreateJavaVM() failed with result %d\n", result);
  printf ("JET RT initialized\n");
  fflush (stdout);

XsltProcessor::XsltProcessor(bool license) {
  /* * First of all, load required component.
  * By the time of JET initialization, all components should be loaded.
  myDllHandle = loadDll (dllname);

  * Initialize JET run-time.
  * The handle of loaded component is used to retrieve Invocation API.
  initJavaRT (myDllHandle, &jvm, &env);
  /* Look for class.*/
  cppClass = lookForClass(env, "net/sf/saxon/option/cpp/XsltProcessorForCpp");
  versionClass = lookForClass(env, "net/sf/saxon/Version");

  cpp = createObject (env, cppClass, "(Z)V", license);
  jmethodID debugMID = env->GetStaticMethodID(cppClass, "setDebugMode", "(Z)V");
    env->CallStaticVoidMethod(cppClass, debugMID, (jboolean)false);

In the constructor method of XsltProcessor we see that once we have loaded the library and initialized the JET run-time we can now make calls to the environment, which has been created to get class definitions and create instance(s) of the class in the Java world. This is before we make method calls on the object.

PHP Extension in C/C++

After successfully getting XSLT transformations to work within C/C++, the next step was to try and develop a PHP extension, which would operate like libxslt. There is a lot of material on the web and books in regards to PHP extensions and I found the following guide very useful: I literally followed it step-by-step, adding a few steps of my own when I worked out what I was doing.


As a proof of concept I wrote a test harness in PHP which makes use of the PHP extension (see: xslt30TestSuite.php in the download library). This is a test driver designed to run the public W3C XSLT test suite at The test driver in its current form requires Saxon-EE, which is not yet available in this alpha release; nevertheless, the program may serve as a useful example of how the API can be used. Note that it is written to use libXML to read the test catalog, but to use Saxon for running the tests and assessing the results.

Performance Testing

I now draw comparisons between running Saxon-HE (on Java) vs running Saxon-HE/C on C++ and on PHP on some preliminary tests. I also compare these times to libxslt (C/C++). An important aim is to get a good measure of the costs of crossing the Java/C++ boundary using JNI and also to see what the effect is with the PHP extension. 

I used Saxon-HE as the baseline. The test machine was a Intel Core i5 processor 430M laptop with 4GB memory, 2.26Ghz CPU and 3MB L3 cache, running Ubuntu 13.10 Linux. Servers Apache2 and PHP version 5.5.3-1ubuntu2. The compiler was Sun/Oracle Java 

The experiments were based on the XMark benchmark. I used query q8, which was converted into the stylesheet below. The choice of q8.xsl is because we should expect some performance bottle-necks across the implementations due to its equijoins in the query:

<result xmlns:xsl="" xsl:version="2.0">
<!-- Q8.  List the names of persons and the number of items they bought.
          (joins person, closed_auction) -->

  <xsl:for-each select="/site/people/person">
    <xsl:variable name="a" 
       select="/site/closed_auctions/closed_auction[buyer/@person = current()/@id]"/>
    <item person="{name}"><xsl:value-of select="count($a)"/></item>  

The running times of executing q8.xsl on the document xmark64.xml, which is a 64MB size document are as follows:

Saxon-HE (Java):    60.5 seconds

Saxon-HE/C (C++): 132 seconds 

Saxon-HE/C (PHP): 137 seconds

libxslt (C/C++):        213 seconds

* Update on the times reported for Saxon-HE/C as a result of optimizations in the JET compiler.
* Code used to get libxslt time taken from:

The times for Saxon-HE/C are without the cost of JET initialisation and loading the library, which accounted for only 4 seconds. So we observe that there is not a big overhead between C++ and the PHP extension. The biggest cost as expected is between Java and C++, where we see on the C++/PHP platform a slowdown of ~ x2.2. We also observe that Saxon-HE/C out performs libxslt on C/C++ by ~40% on q8.

See project page on Saxon/C.

Stripping the DOM - Saxon diaries

| No Comments | No TrackBacks
I discovered yesterday that Saxon-CE isn't applying xsl:strip-space directives to input documents.  An unfortunate bug: not that many users seem too bothered by it, but conformance is always important.

A reminder: xsl:strip-space and xsl:preserve-space are designed to remove "ignorable" whitespace from the application's view. Because XML  doesn't distinguish significant from insignificant whitespace (a bad design mistake), this can only be done under program control. The idea is to present the application (the stylesheet) with a view of the input tree in which the insignificant whitespace does not appear, and the xsl:strip-space and xsl:preserve-space directives allow the stylesheet author to say which whitespace is considered significant.

How to fix the bug? Server-side Saxon (what I am starting to call "Big Saxon") traditionally provides two mechanisms: either it wraps the supplied DOM in a wrapping layer that hides the offending whitespace text nodes, or it bulk-copies the supplied DOM to an internal tree structure in which the whitespace is not copied across. Both methods have disadvantages. The "wrapping" approach imposes a significant extra cost on navigation of the tree (the cost is higher than it might appear, because whitespace stripping also has to take account of xml:space attributes in the document, which means a lot of searching for attributes-of-ancestors). The bulk-copy approach imposes a significant start-up cost to the transformation, often larger than the cost of doing the real transformation, and the cost is particularly high when processing large input documents because it doubles the memory requirement.

I'm considering using a third approach: in-situ modification of the supplied DOM. For Big Saxon, I have tended to treat this option as unthinkable: you don't muck with the user's data in this way. The same DOM, after all, might be input to multiple stylesheets, perhaps concurrently (though DOM is not thread-safe so this is a bad idea....) and they might have different space-stripping directives. A Java application might pass the DOM to an XSLT stylesheet to do some processing, but might want to continue operating on the same DOM afterwards.

In Saxon-CE, however, these considerations hardly apply. In the vast majority of cases the DOM is built solely in order to act as XSLT input. Unlike Big Saxon, where one can assume that if users supply a DOM they are doing so by choice, with Saxon-CE there is no alternative: all input XML arrives this way.

In-situ clean-up of the input DOM has other attractions; it can be used to solve the other two big problems with DOM processing (both arising from a mismatch between the DOM model and the XDM model): adjacent text nodes, and namespaces. The same pre-scan that strips unwanted whitespace nodes can also concatenate adjacent text nodes (eliminating entity references and CDATA sections in the process); presenting a view of adjacent text nodes as a single node accounts for a great deal of complexity (and lines of code, and execution time) in navigating a DOM. Equally for namespaces. To determine the namespace part of the name of an element or attribute, Saxon can't simply call getNamespaceURI() on the underlying DOM node. That method is defined to have no effect if "the node was created using DOM level-1 interfaces". Saxon of course has no idea how the node was created, and DOM provides no way of finding out. So it does an optimistic call of getNamespaceURI() just in case, and if this returns null then it has no choice but to search the node's ancestors and attributes looking for namespace declarations - and since this is done every time you look at a node to see whether it satisfies an axis step, it's no wonder that Saxon can be ten times slower processing a DOM than when processing its native tree structure. We should be able to cut this cost dramatically by doing a pre-scan of the document and calling setNamespaceURI() on all element and attribute nodes.

The more I think about this, the more appealing the idea of doing the same thing on Big Saxon. The fact is, most people who supply a DOM as input to XSLT do so out of ignorance; they have no idea of the extra costs they are incurring, or of the fact that supplying SAX input would be ten times faster. It's important to deliver good performance not only to experts who know how to get the ultimate performance out of the system, but also to the average user who just copies JAXP code from somewhere on the web without any great thought or understanding. For that kind of user, in-situ modification of the supplied DOM might be the right answer. We can always continue to provide wrapping or bulk-copying as alternative configuration options, for people to invoke if they don't like their data being trampled upon.

Returning to Saxon-CE, the other complication is the HTML DOM. Running XPath against an HTML DOM is already sufficiently quirky (e.g. because of case-insensitive element names, namespace non-support, etc etc) that it's probably wisest to keep things as simple as possible on this front. I don't think we need to apply xsl:strip-space directives to the HTML document; there's nothing in the spec that would make this a conformance issue, and there's no usability imperative. So we'll leave well alone here.

Reducing the size of Saxon-CE - Saxon diaries

| No Comments | No TrackBacks
During the early stages of Saxon-CE development, the major part of the work was cutting Saxon down to size to run on the browser. That initial work was described in an earlier post: see Cutting Saxon down to size:

In the last couple of weeks I've resumed the effort to reduce the size of the code. This article examines the changes I've made, and the opportunities for further reduction.

Saxon-CE 1.1 is around 880Kb, and so far, I've reduced that to 780Kb. I'm measuring the size of the main Javascript download here, not the ancillary data files that may or may not be downloaded, depending on what the stylesheet actually does. Reducing the size is important because the download time causes an appreciable delay on the first visit to a site, and first impressions are important; cutting the delay from say 7 seconds to 5 seconds will result in a corresponding reduction in the number of visitors who don't bother to wait.

What has gone?

About half this reduction was achieved by rewriting two data-intensive classes to load their data from XML files on demand, rather that defining it in static initialization code. The two classes in question are Categories, which contains the mapping of Unicode characters to categories used in regular expressions such as Lu and Nd; and CaseVariants, used when the "i" flag appears in a regular expression to determine all the upper/lower case variants of any character. These XML files are downloaded only if the relevant language features are used in the stylesheet.

A couple of other classes were also performing lengthy static initializing: StandardFunctions, which contains details of all the system functions in the core library, and StandardNames, which contains the names of all XSLT elements, attributes, and built-in types. In the case of StandardFunctions the amount of initialization code has been vastly reduced by introducing a micro-syntax to describe function signatures in the form of a string literal, which can be easily parsed in a few lines of code. In the case of StandardNames, the class was setting up a mapping from integer constants to QNames; this has been removed by changing the product to use QNames throughout (or in the case of types, the object instances representing each built-in type) rather than integer constants.

Associated with this second change, the NamePool has gone. Saxon-CE was not getting any real benefit from the NamePool, because the source document is always in the form of a DOM, in which names are held as strings. So integer fingerprints were being allocated on demand, which cancels any saving from faster integer comparisons. The change to use QName objects in place of namecodes and fingerprints throughout the product was of course very extensive, but it seems to have been trouble-free. There is only one very minor loss of functionality: in cases where Saxon has to invent a prefix for use on a constructed element or attribute, it is no longer able to consult the NamePool to see what prefixes have previously been used with that URI.

The other changes are minor and tedious. They fall into a number of categories:

* Finding code that could never be executed. Although GWT automatically eliminates code that is completely unreachable, there is code that cannot be executed in practice for details that GWT cannot detect. Two examples: (a) code for matching types such as attribute(*, xs:date). Although a basic XSLT processor allows this syntax, all nodes in Saxon-CE are untyped, so this type will never match anything. It can therefore be replaced during parsing by a type that matches no instances, which cuts out a lot (well a couple of Kb) of useless code. (b) there was still a lot of code designed to report the line number on which errors occurred, both statically and dynamically. But in the browser environment, the XML parser provides no line number information, so this code was useless.

* Eliminating duplication. For example, I found that there were two completely separate mechanisms for maintaining comparison keys, one mechanism used for xsl:for-each-group and distinct-values, the other used for sorting and various other operations. Although there were a few semantics differences (for example in the handling of NaN) it proved reasonably easy to combine these two mechanisms into one.

* Cutting out minor optimizations. Do we really need a separate class to handle hash maps in which the key is an unboxed integer, rather than using the standard HashMap with a boxed integer key? Is that optimization worth 1000 lines of code? Probably not. There are many similar examples.

* Cutting out small classes. Arithmetic expressions, for example, delegated the computation to one of a large number of small Calculator classes, each dedicated to one particular operation, such as (integer plus integer). This is an elegant design, but in GWT it appears to be expensive: each of these classes was generating up to 2Kb of code. Combining the classes into a set of switch expressions in the parent class appears to give a significant reduction in code size (though I have to admit I am not entirely sure why).

In deciding whether to remove a particular piece of code, I find it useful to reverse the thinking: if this wasn't there already, would I consider it worthwhile adding it? This is analogous to zero-based budgeting: there is a presumption that something isn't needed unless it can be justified, rather than a presumption in favour of the status quo.

Where will the next 100Kb come from?

Getting down from 880Kb to 780Kb is useful, but it isn't enough. The target should be something like 500Kb. Finding big savings, however, gets more and more difficult. There are plently of small salami-slices available, but they get harder to find.

Some of the original areas in my earlier post remain unaddressed. There is still a lot of duplication between the "style" classes representing the source stylesheet, and the "instruct" classes on the expression tree. Historically, it was important to allow the source stylesheet tree to be garbage collected once compilation is finished; in Saxon-CE, that reasoning doesn't apply. Probably we should keep the instruction classes on the expression tree, but have them delegate a lot more to the original "style" elements to avoid duplication.

The expression tree is a major opportunity for simplification. There are two issues: there are too many different kinds of expression (too many subclasses), and they implement too many methods. Specifically, there are 167 subclasses of Expression, and the class defines some 48 methods. Some of these classes are admittedly very small (FirstItemExpression, for example, which implements constructs of the form EXP[1], is just 157 bytes -- I haven't worked out why these classes are so much smaller than the Calculator classes mentioned above). So compiling EXP[1] to a call on subsequence(EXP,1,1) would probably give no net saving. But despite much use of superclasses to hold shared functionality, there's still a fair bit of duplication that could be eliminated, and a large number of minor optimizations that probably have little merit.

Could we drop the LinkedTree implementation of XDM? The Tinytree is already gone from Saxon-CE. The LinkedTree is only used (a) for stylesheets, and (b) for temporary trees: source documents and result documents use the Javascript DOM. (Not the GWT DOM). We could consider using the DOM also for stylesheets and for temporary trees. But it's only 20Kb. A more interesting way forward might be some kind of hybrid of the current LinkedTree and HTML-DOM-wrapped-tree; given the way Javascript works, we could consider adding properties to the DOM to make it implement the NodeInfo interface more directly, supplementing some of its deficiencies for example (notoriously) its poor support for namespaces.

There is a lot of code whose main purpose is lazy (pipelined) evaluation. This leads to a proliferation of small specialized classes. Some of these are probably an unnecessary luxury. For the index-of() function, for example, do we really need a custom iterator that searches for the next match and returns it on demand, or would it do any harm to find all the matches up-front? In some cases several iterator implementations could be combined into one: for example the index-of() iterator could be handled by a flag in the filter iterator that returns the position of the matched item instead of the item itself.

Generally, I think that finding another 100Kb of discardable code looks feasible, though it's a lot of work.

Note: Saxon-CE today is 64800 non-comment lines of Java code. Saxon-EE for comparison is 261100 non-comment lines.

Update two weeks later:

This kind of exercise can become a bit obsessive. I'm now down just below 700Kb, and at the Java level the number of source lines is down to about 55000. It's amazing how much dead wood there is, and it's a great pleasure to get rid of it. By and large, most of the reductions have been achieved without adverse effects that anyone will notice, and in some cases it's been a real opportunity to tidy up the code and improve it. I'm very inclined to port some of the changes back to "big Saxon" (that is, the server-side code base).

The changes made fall into a number of general categories:

  • Avoiding duplication. For example, the code to do compile-time static type checking of function arguments (to implement the function conversion rules) was extremely similar to the code for dynamic conversion of supplied stylesheet parameters to their required type. A little bit of thought enabled the two to be combined. There was extensive duplication in the code for iterating different axes for different tree models. The code for xsl:next-match was about 95% the same as the code for xsl:apply-imports. There was perhaps 70% code overlap in the iterators for the union, except, and difference operators. There were half-a-dozen places in the product containing almost identical code for converting a lexical QName to an expanded QName given a namespace context. In some cases the duplication could be justified for "big Saxon", because of the imperative to minimize run-time pathlengths. But that's the wrong trade-off for Saxon-CE; the size of the code really does matter.
  • Eliminating trivial optimizations. I already mentioned the "integer hash-map" example. There are plenty of examples of static rewrite optimizations that really don't justify themselves: rewriting string-length($x) = 0 as not(string($x)) is an example. You have to look at these and ask (a) whether the saving achieved is significant, and (b) whether the rule will fire often enough to be worthwhile. It's easy to make mistakes on both counts - nearly all these optimizations were added at some time because of a real use case where they made a significant difference. But over time they accumulate and gather dust; sometimes they cease to be relevant because something else has changed elsewhere in the system.
  • Avoiding over-generalized code. GWT does a good job of identifying methods that are never called. But it can't spot methods that are capable of doing 17 things, and are only ever called upon to do one of those 17 things.
  • Rewriting the code to be more compact. A couple of examples here: the code for checking the select attribute of xsl:value-of now reads "checkAttribute("select", "e")", where "select" is the name of the attribute, and "e" indicates that its value must be an expression. The previous code was about four times this long, and the extra stuff was essentially the same for every attribute of every element. Similarly, the code for registering details of the signatures of built-in functions, and the code for enumerating the subexpressions of every kind of expression: these fragments of code occur very often, and cutting them down from 4 lines to 1 can make a big difference overall.
  • Simplifying error handling. Good diagnostics for errors are still important, but there was a lot of verbosity in this area that could go. If a date is invalid, how much detail do you need to give the user about the fact? Might it not be good enough to tell them that 2013-03-59 is an invalid date?

What's new in Saxon 9.5? - Saxon diaries

| No Comments | No TrackBacks

The following is a provisional list of changes expected in the Saxon 9.5 release, which is in the final testing phase. There could still be changes to this list, for example a feature could be dropped if it is found to be unreliable during testing.


The xsl:result-document instruction in Saxon-EE is now asynchronous. That is, the code to output the result document runs in a separate thread, in parallel with other processing. The maximum number of threads used by xsl:result-document instructions is limited by the configuration option FeatureKeys.RESULT_DOCUMENT_THREADS which defaults to the number of processors available to the Java VM; setting this to zero or one will suppress multithreading. Setting FeatureKeys.ALLOW_MULTITHREADING to false has the same effect. (This can be useful when debugging, because otherwise the output from xsl:message and fn:trace() can be very confusing).

The facility can also potentially cause problems if the code calls extension functions that have side-effects. Multi-threading can therefore be controlled, if required, using thesaxon:asynchronous attribute on the xsl:result-document instruction: use saxon:asynchronous="no" to suppress multi-threading. Asynchronous processing of xsl:result-documentis automatically suppressed if tracing (using a TraceListener) is enabled.

The collection() function is also now multi-threaded in Saxon-EE. Each document in the collection is parsed in a separate thread, and the documents are processed in the order in which parsing completes. This makes the order of the documents less predictable than in previous releases, though it was never guaranteed or documented.


New XSLT 3.0 instructions such as xsl:iterate and xsl:try no longer have synonyms in the Saxon namespace.

Extension function saxon:for-each-group() is dropped (superseded by "group by" in XQuery).

New extension functions saxon:schema() and saxon:type are available, giving access to schema information. The saxon:schema() function obtains information from all the schema components available in the query or stylesheet; the saxon:type function gives information about the type annotation of a node. In both cases, the information is returned as a function item which behaves like a map from schema component property names to values; for example the name of the type annotation of a node is given by saxon:type($node)('name').

A new extension function saxon:send-mail() is available to send email via an SMTP server.

A new extension function saxon:key-map() is available: this allows the index constructed using xsl:key to be viewed as a map, giving extra functionality such as getting all the entries in a given range, enumerating the key values, concatenating keys across multiple documents, processing the entries in sorted order, etc.

The third argument of saxon:transform(), which supplies parameters to an XSLT transformation, may now take the form of a map, allowing the parameter value to be any value whatsoever (the old mechanism restricted it to atomic values.)

The extension function saxon:index has changed to expect a function as its second argument rather than a prepared expression, and it now returns a map which can be accessed using all the standard map functions. The extension function saxon:find is now a synonym for map:get. There is no longer an option to specify a collation.

A new flag "v" has been added to saxon:deep-equal() to suppress the check that two elements have the same "variety" of type: for example if one has element-only content, the other must have element-only content. This check was not performed in previous releases; in this release it is performed by default (as required by the fn:deep-equal() specification), but may be suppressed using this option. The option is useful when comparing validated and unvalidated documents.

The proposed EXPath file module (see is implemented (in Saxon-PE and -EE). This provides a number of extension functions for reading and writing files in filestore, creating and deleting files and directories, listing the contents of directories, and so on.

The EXPath zip module (see is implemented. The implementation is derived from the original implementation by Florent Georges, but has been more closely integrated into Saxon and more thoroughly tested. This module is open source code; the extensions are integrated into the Saxon-PE and Saxon-EE distribution, and are available to Saxon-HE users in source code form, where they can be linked to the product in the same way as any other extension functions (see .

A new serialization option saxon:attribute-order is available. The value is a whitespace-separated list of attribute names. Attributes whose names are present in the list come first, in the order specified; other attributes follow, sorted first by namespace URI and then by local name.

When the -wrap option is used in XQuery, the serializer is now capable of generating an XML representation of a map, using nested elements to represent the entries in the map with their keys and values.

External Object Models

Support for JDOM2 is added.

Support for Apache Axiom is added.

A number of optimizations have been made to the support modules for other external object models, noticeably to speed up navigation of the descendant axis.

The class DocumentBuilderFactoryImpl, which constructed a DOM wrapper around a TinyTree, and which has been deprecated since 9.3, is now removed.

XSD Extensions

Parameterized validation: a new extension to XSD 1.1 is implemented to allow schema validation to be parameterized. The saxon:param element can be added to a schema to declare a parameter; the value of the parameter can be referenced in XPath expressions, for example in assertions. The parameter values can be set from the command line when running the Validatecommand, or from the s9api (Java) and Saxon.Api (.NET) interfaces when validating from an application. It is also possible to initiate parameterized validation using a new saxon:validateextension function available in XSLT and XQuery.

Internal Changes

Regular expressions: a new regular expression engine is introduced, based on the Apache Jakarta product.

The implementation of the base64 encoding and decoding routines has been rewritten in order to simplify the set of open source licenses in use

XPath 3.0 Functions and Operators

fn:serialize() function: the implementation has been changed to match the 2011 version of the W3C specification. This changes the format of the serialization parameters supplied in the second argument to the function.

The functions has-children()innermost(), and outermost() are implemented.

The function parse-xml-fragment() is implemented.

In format-date() and friends, timezone formatting has been updated to match the current spec. This may create incompatible changes. The changes apply whether or not XPath 3.0 processing is enabled.

The implementation of deep-equal() has been changed to enforce the rule that the type annotations of two element nodes must have the same variety. This corrects a long-standing non-conformance, but may cause incompatibility, especially when comparing a schema-validated document against an unvalidated document. A new flag "v" has been added to the Saxon equivalent function saxon:deep-equal() to suppress this check.

Command line

Command-line interfaces: added the option -quit:off to prevent exiting the Java VM in the event of a failure; instead, a RunTimeException is thrown. Useful when the command line interfaces are invoked from another Java application, for example ant, as it allows the calling application to recover.

When the -wrap option is used in XQuery, the serializer is now capable of generating an XML representation of a map, using nested elements to represent the entries in the map with their keys and values.

XSD 1.0 Support

There have been changes to the way the schema processor handles recovery from validation errors. By default the processor tries to continue validating to the end of the file, reporting as many validation errors as it can. All errors are reported to the error() method of the ErrorListener, which by default throws no exception; if the error() method does choose to throw an exception then validation terminates immediately. Validation can also be terminated early if an error limit is reached; the error limit can be set in the ParseOptions object that controls the validation. (The value 1 causes termination after the first error is reported).

At the end of the document, if there have been one or more validation errors, then a fatal error is thrown unless the "recover after validation errors" option (FeatureKeys.VALIDATION_WARNINGS) is set, in which case no exception occurs and processing can continue. Note that in this case the type annotations in any result document are unreliable. The setting FeatureKeys.VALIDATION_WARNINGS is ignored in the case where a document used as input to XQuery or XSLT is being validated, because in that case the incorrect type annotations would cause inconsistencies during subsequent processing. TODO: check this works.

If an element is found that cannot legitimately appear according to the content model of its parent, Saxon previously abandoned the validation of the content of that element, as well as the content of its following siblings. This has been changed so that the content of the offending element, and the content of its siblings, is now validated, using the context-determined type for the element (the "element declarations consistent" constraint ensures that if the same element name appears more than once in a content model, each must be associated with the same type).

The validation exception made available to the ErrorListener now includes a structured path indicating location of the offending node (previously the path was available only as a string). It also includes, where available, a reference to the schema type against which validation was attempted.

The error messages produced when a sequence of elements does not conform to the content model of a complex type has been improved. There is now more effort to distinguish different causes of the error: for example, too many repetitions of a repeated element, a mandatory element that has been omitted, an element that is in the wrong namespace.

Saxon now recognizes the XLink namespace and fetches the schema for this namespace locally rather than fetching it from the W3C web site (which will often time out).

XPath 3.0 Support

Revised the EQName syntax as per spec bug 15399: it is now Q{uri}local.

Affecting XPath 2.0 also, some very long-standing bugs have been fixed in the handling of schema-validated documents using xsi:nil - specifically, ensuring that the typed value of a nilled element is always an empty sequence, getting the matching of types element(N, T) and element(N, T?) right, as well as the impact of nillability on type subsumption.

XSLT 3.0 Support

The bind-group and bind-grouping-key variables on the xsl:for-each-group element are now implemented.

This involved some internal refactoring of the way variables are managed during the XSLT static analysis phase.

Implemented the composite attribute of the xsl:for-each-group element.

Changed the implementation of xsl:merge to reflect the revisions in the draft XSLT 3.0 specification (removed xsl:merge-input element; added sort-before-merge attribute).

Implemented accumulators (the new xsl:accumulator feature) for both streamed and unstreamed documents.

Implemented the xsl:stream instruction.

Implemented the error-code attribute of xsl:message.

Implemented the start-at attribute of xsl:number.

Implemented the context-item attribute of xsl:evaluate.

Implemented the xsl:assert instruction.

Implemented the xsl:map and xsl:map-entry instructions.

The restriction that xsl:import must precede other declarations in a stylesheet module has been removed.

Implemented the on-empty attribute of xsl:attribute.

Implemented the xsl:on-empty attribute of literal result elements. Tested with and without byte code generation. TODO: test with streaming, and with freestanding construction (evaluateItem).

Saxon now allows the EQName syntax Q{uri}local in places where a QName is required, for example the name attribute of xsl:variablexsl:templatexsl:function etc, the first argument of the functions key()system-property(), and function-available(), the final argument of format-number(), etc. This is useful for static names in the case where stylesheets are generated programmatically, and it is always useful for dynamic names because it avoids the dependency on the static namespace context.

It is now a static error if the same NodeTest appears in an xsl:strip-space and an xsl:preserve-space declaration at the same import precedence. This is a backwards incompatibility in the 3.0 specification.

There is now a warning message if the namespace URI of the document element of the principal input document does not match the namespace URIs used in the template rules of the stylesheet. This is designed to catch the common beginner's mistake of writing (for example) match="html" when the element to be matched is actually in a (default) namespace. The precise conditions for the warning are that the stylesheet contains one or more template rules in the initial mode that match an explicit element name, and none of these template rules matches an element in the namespace used for the top level element of the principal source document.

Implemented more of the new pattern syntax: patterns matching variables, namespace nodes, ... Patterns that match atomic values can no longer be used as part of a pattern that uses "union", "intersect", or "except" (as a result of clarification of the XSLT 3.0 specification.)

Affecting XSLT 2.0 also, a very long-standing bug has been fixed: documents read using the collection() function are now subjected to whitespace stripping as defined by xsl:strip-space declarations in the stylesheet.

Also affecting XSLT 2.0, a change has been made in the behavior of xsl:result-document when the href attribute is absent. Previously this caused the constructed result document to be serialized to the location specified by the base output URI. Now it causes the constructed document to be sent to the primary output destination selected in the calling API (which might, for example, be a DOMResult or an XdmDestination). Both interpretations appear to be allowed by the specification. Note that omitting the href attribute is done typically when you want to validate the result document against a schema, though the same effect can be achieved using xsl:document.

Whitespace stripping: the xsl:strip-space declaration now has no effect on a node if it is within the scope of an XSD 1.1 assertion: that is, whitespace text nodes are not stripped if any ancestor node has been validated against a type that contains an assertion. This is because changing the content of such a node could invalidate the assertion, thus breaking type safety.

Implemented content value templates. These allow expressions contained in curly braces to be contained in text nodes within a sequence constructor, rather like attribute value templates; the facility can be enabled by setting expand-text="yes" on any enclosing XSLT element (for example, <xsl:stylesheet>). Example: {$greeting}, {$first} {$last}, how are things?]]>. Note that the details of the facility in the specification (for example, handling of boundary whitespace) are subject to change.

The new match pattern syntax match="?{ expr }" is implemented. This matches an item T if the expression has an effective boolean value of true when evaluated with T as the context item. This construct replaces the ~ itemType syntax defined in earlier XSLT 3.0 drafts, but for the moment Saxon implements both.

Some of the features NOT implemented in XSLT 3.0 include:

  • Packages

  • xsl:context-item


There has been considerable development of the streaming capability, much of it involving removal of restrictions that were not previously documented (or known).

The xsl:stream instruction is now implemented, making the saxon:stream extension function obsolescent (though it is retained for the time being). Accumulators are also implemented, in both streaming and non-streaming mode.

Streaming: Saxon is moving towards a design where it implements the "guaranteed streamability" rules in the W3C draft precisely, unless the configuration optionALLOW_STREAMABILITY_EXTENSIONS is set, in which case Saxon may implement streaming (or near-streaming) for some additional cases.

Certain constructs using positional filters can now be evaluated in streaming mode. The filter must be on a node test that uses the child axis and selects element nodes. The forms accepted are expressions that can be expressed as x[position() op N] where N is an expression that is independent of the focus and is statically known to evaluate to a number, x is a node test using the child axis, and op is one of the operators eq, le, lt, gt, or ge. Alternative forms of this construct such as x[N], remove(x, 1), head(x), tail(x), and subsequence(x, 1, N) are also accepted.

Streaming is now possible for xsl:for-each-group using the group-adjacentgroup-starting-with, or group-ending-with attributes.

XQuery 3.0

Saxon 9.5 fully implements the XQuery 3.0 Candidate Recommendation of January 2013.

Forwards references to global variables are now allowed.

Added support for variables such as err:code and err:description within the catch block of try/catch.

Added support for function annotations.

Added support for the required-feature and prohibited-feature declarations.

XQuery 3.0 extensions can now co-exist with XQuery Update extensions in the same query. 


The XPathSelector interface now has an EffectiveBooleanValue method.

Various methods have been added to bring the interface closer to the functionality level of the Java s9api interface.

System Programming Interfaces

In the Configuration class, many of the getters and setters for individual configuration properties have been removed, relying instead on the general-purpose methodsgetConfigurationProperty() and setConfigurationProperty(). To make these easier to use in the common case of boolean properties, the new methods getBooleanProperty() andsetBooleanProperty() are introduced. Property names that are only relevant to Saxon-PE or Saxon-EE are now generally recognized only by a ProfessionalConfiguration orEnterpriseConfiguration respectively; previously some were also recognized by a Saxon-HE configuration while others were not.

The classes ValueRepresentation and Value have been replaced by the new class Sequence. In some interfaces Sequence also replaces SequenceIterator, since passing aLazySequence (which implements Sequence) has all the same performance benefits as passing a SequenceIterator, while still allowing an Item to be passed directly, without wrapping.

The change affects (simplifies) the interface for integrated extension functions, where arguments and result values are now passed as Sequence objects rather than as SequenceIteratorinstances.

In the NodeInfo interface, the two methods getTypedValue() and atomize() for obtaining the typed value of a node have been unified into a single method, atomize(), which returns the new type AtomicSequence. In the vast majority of cases the typed value is a single atomic value, and this case works efficiently because AtomicValue implements AtomicSequence.

Also in the NodeInfo interface, the method getAttributeValue(fingerprint) has been dropped; callers should use getAttributeValue(uri, local) instead.

The OutputURIResolver interface has been changed: a new method newInstance() is added. This change is made because xsl:result-document is now multi-threaded by default, and since it's likely that existing implementations of OutputURIResolver won't be thread-safe, making an interface change is better than a semantic change that will cause the code to break in difficult-to-diagnose ways. The new method typically returns a new instance of the OutputURIResolver class, so that each instance of the class only needs to remember information about one result document, and is automatically thread-safe.

In the Java classes that implement the schema component model (in package com.saxonica.schema) many of the methods that return an Iterator have been replaced by methods that return a Set or a List. This simplifies processing by enabling use of the Java for-each construct, and is more convenient when the saxon:schema extension function is used to process schema components in the form of XPath functions.

The toString() method on the Expression class has been more carefully defined and implemented in an attempt to achieve the result that (for an Expression within the scope of XPath) the result is a legal and equivalent XPath 3.0 expression with no dependencies on namespace prefixes other than (a) the binding of the prefix "xs" to the standard Schema namespace, and (b) the assumption that the XPath functions namespace is the default function namespace. In other cases QNames are expanded using EQName notation Q{uri}local. There may be a few remaining cases where the output does not yet satisfy these intentions.


Saxon has long provided the ability to have an Item that wraps an external Java or .NET object, which can be supplied in arguments to extension function calls or used in the response from an extension function calls. In the past, such values have appeared in the type hierarchy under "atomic value", that is, as a peer of types such as xs:boolean and xs:string. This has changed so they are no long atomic values, instead forming a fourth kind of item alongside nodes, atomic values, and function items.

The string value and typed value of an external object are the same; they are the xs:string value that results from calling its toString() method.

The handling of extension items that wrap a Java null has been tidied up. When a Java extension function returns null, this is mapped to an XDM empty sequence, not to an extension item that wraps null. When an empty sequence is supplied to a Java extension function that expects an Object, the value null will be passed to the Java method. Extension items are therefore no longer allowed to wrap a Java null.

With reflexive extension functions, the handling of Java arrays has been improved; the component type of the array is now considered when deciding which of several overloaded methods is the best fit.

With reflexive extension functions, if there are two methods where one expects a String and the other a CharSequence, the one expecting a CharSequence is now preferred. Previously the function call was rejected as ambiguous. Although there is no particular reason for choosing one rather than the other, it's likely that both methods will have the same effect, so choosing one arbitrarily is better than reporting an error.

With reflexive extension functions, if there are two methods of the right name and arity, the decision which to use is now postponed until after type checking has been done. This is particularly useful when arguments are supplied in the form of variable references. Previously the decision was postponed only if the early analysis showed both methods as equally preferred; now it is always postponed.

When a constructor is called the code now does what the documentation has always said: a wrapped external object is returned with no conversion. So for example Date:new() will return a wrapped java.util.Date object rather than an instance of xs:dateTime, and HashMap:new() will return a wrapped HashMap rather than an XPath map. The code also avoids doing conversions other than object unwrapping for the first argument of a call to an instance method.

When a reflexive extension function throws an exception, the exception details are now captured in an error that can be caught using the try/catch capabilities in XSLT 3.0 and XQuery 3.0. In particular, the error code is a QName whose namespace URI is " and whose local name is the class name of the exception type, for

It is now possible to use the reflexive call mechanism to call instance-level methods of the implementation classes for XDM values. For example, the following gets the current Julian instant:

<xsl:value-of select="d:toJulianInstant(current-dateTime())"

When an instance of java.util.Map is passed as a parameter to a stylesheet or query, then by default it is accessible within the stylesheet or query as an XPath 3.0 map object. However, if the parameter is declared with a required type of jt:java.util.Map, then it will instead be retained as an external object wrapping the java.util.Map, which means for example that (with the usual caveats and disclaimers) it is updateable by calling its put() method as an extension function.

The SQL Extension

Added the value #auto for the column-tag and row-tag attributes of sql:query, and made this the default; when this value is used, the element names used for rows and cells are taken from the table name used in the query, and the column names returned in the query metadata. To get the old default behaviour, specify row-tag="row" and column-tag="col".


Support for XHTML 5 serialization, as defined in the 3.0 Serialization specification, is available. (Use <xsl:output method="xhtml" html-version="5.0"/>).

Comments and processing instructions no longer prevent an immediately following start tag being indented if it would have been indented in the absence of the comment or PI.

(However, comments and PIs are not themselves indented, because it is not possible to do this safely without looking ahead to see if the comment or PI is followed by significant character content.)

I've been working on a stylesheet that converts the W3C XQuery/XPath 3.0 test suite into an XSLT 3.0 test suite. The idea is that all the XPath tests only need to be written once, allowing them to form XSLT tests and thus avoiding the duplication of effort that arises if separate test suites are written.

The stylesheet generates 19,700 output files (38Mb) from 1,863 input files (122Mb) so it's a good test in its own right. I'm therefore running it using the almost-ready Saxon 9.5 as part of the system testing before we ship the product. Here are a few points I'd like to highlight.

Firstly, some files are copied unchanged. For this purpose the extension function file:copy(), which is part of the EXPath file module, comes in very handy. This llibrary will be included as standard with Saxon 9.5 (like all extensions, it will be in the PE and EE editions only).

Secondly, I followed a development approach which I've found very useful and would like to share. There's a schema for the target document vocabulary (the XSLT 3 test suite metadata), so it makes sense to use a schema-aware stylesheet that validates the result documents as they are produced. However, I don't switch on validation until the transformation is "roughly working", that is, until it produces output that looks correct to the naked eye. The reason for that is simply that the validation error messages are a nuisance during the stage when you know the transformation is incomplete and incorrect. It's only when the transformation is running without errors and handling all the input that you need to start checking that the output is correct, and at this stage adding an xsl:import-schema declaration for the output schema, plus the attribute validation='strict" on the xsl:result-document instruction, is absolutely invaluable.

Except, I found it could be improved. The validation error messages tell you what's wrong (for example, you've output an attribute value which isn't allowed for that attribute), and they tell you where the code is in the stylesheet that caused the error (the offending xsl:attribute or xsl:copy instruction), but they weren't telling you what input was being processed at the time: where did the offending attribute value come from? So I've fixed that, or at least tried to. It's one of those cases that sounds easy but is actually quite tricky, because the code that detects the validation error is very remote from any code than knows where we are in the source document (or indeed, which of 1,863 source documents we are processing) at the time. The way I've done this is that the PipelineConfiguration - which holds all the context information used by a push pipeline, including schema valdiation - now includes a stack of items representing items being processed using xsl:apply-templates; so the validation error message can tell you what the latest node processed by a template rule is. Doesn't work unfortunately for those who use monolithic xsl:for-each structures, or for those who use XQuery - but this is specifically focussed on this kind of workload handling many inputs and many outputs, for which XQuery is unsuitable anyway. There is of course a small performance hit for maintaining this stack, but I'm confident it's very small, and I always reckon that a small amount of machine time to save a large amount of programmer time is a good trade-off.

The third thing I wanted to mention in this article is asynchronous processing of xsl:result-document. This is a new feature for Saxon-EE 9.5. By default, an xsl:result-document instruction executes in its own thread, independently of the rest of the stylesheet. Threads are allocated from a pool, whose size defaults to the number of processors available (on my Mac laptop, that's 8). If no free threads are available, the instruction executes in the current thread.

I did some experiments on the performance as the number of available threads is increased:

1 - 14.3s
4 - 8.52s
8 - 8.92s
12 - 9.81s
16 - 10.6s

so the default of 8 appears to be a reasonable choice. The saving in elapsed time by using multi-threading seems to be well worth having, though frankly, for a task like this, 14s is already fantastically good performance and any improvement is simply icing on the cake. It's becoming quite hard to find workloads where performance improvements are genuinely needed.

Incidentally, switching bytecode generation off has almost no noticeable effect on this transformation, and the same is true when optimization is switched off. That's because it's a very simple transformation that happens to be processing a large amount of data.

The xsl:result-document instruction is a good candidate for parallel processing because there's no need to combine its results with those of any other instruction, which minimizes the synchronization overheads that usually come with parallel processing. However, there is still some synchronization needed, for example because global variables are evaluated lazily, so there is the chance that two threads will encounter the same unevaluated variable. Unfortunately finding all these cases is very difficult to do methodically, and this test stylesheet found a few bugs that needed to be addressed. Realistically, it's likely there will be further bugs that we won't hit until after release. So we've been careful to make sure the feature can be disabled - that's proved very useful with bytecode generation, which is another area where it inevitably took a few months of field exposure to iron out the last bugs.