Redesigning the Saxon expression tree - Saxon diaries

I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.

One of the problems is that there's 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.

One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves that part of the static context that it thinks it might need. For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another). A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context - if one operand turns out to be untyped atomic and another is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better rather than saving what each expression thinks it might need, to save the changes to the static context on the tree, so each expression can discover the whole static context by processing a set of deltas held with its ancestors in the tree.

This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization) so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.
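
Putting these two ideas together — context deltas plus parent pointers — the lookup might be sketched as follows. The names are illustrative, not Saxon's actual API, and a single property (the static base URI) stands in for the whole static context; the URI in the demo is made up.

```java
/* Sketch: if each node held a parent pointer and only the *changes* to the
   static context, any expression could recover a context property by walking
   its ancestors. Hypothetical, simplified names. */
class ExprNode {
    ExprNode parent;        // null at the root of the tree
    String baseUriDelta;    // non-null only where the static base URI changes

    String getStaticBaseUri() {
        for (ExprNode e = this; e != null; e = e.parent) {
            if (e.baseUriDelta != null) {
                return e.baseUriDelta;
            }
        }
        return null;        // no base URI established anywhere above
    }
}

public class ContextDemo {
    public static void main(String[] args) {
        ExprNode module = new ExprNode();
        module.baseUriDelta = "http://example.com/mod.xsl";  // made-up URI
        ExprNode comparison = new ExprNode();                // e.g. a "=" expression
        comparison.parent = module;
        System.out.println(comparison.getStaticBaseUri());
        // -> http://example.com/mod.xsl
    }
}
```

The point of the scheme is that the delta is stored once per module (or per element that changes the context), rather than once per expression.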

In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, and also, in effect, properties of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modifications to the tree, such as substituting one child expression for another (which is very common during type-checking and optimization), are still entirely ad hoc, and have to be managed independently by each class of expression node.

As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.
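
In outline, the mechanism looks something like this. The names loosely follow the description above, but the details are illustrative; the real class does rather more.

```java
/* A sketch of the Operand chokepoint: all structural changes go through
   setChildExpression(), which does the housekeeping centrally.
   Illustrative, simplified names. */
class Expression {
    String location;              // stand-in for real location information
    boolean propertiesStale;      // e.g. the cached inferred type needs recomputing
}

class Operand {
    private final Expression owner;
    private Expression child;     // encapsulated: no direct reference elsewhere

    Operand(Expression owner, Expression child) {
        this.owner = owner;
        setChildExpression(child);
    }

    Expression getChildExpression() { return child; }

    void setChildExpression(Expression newChild) {
        if (newChild != child) {
            child = newChild;
            if (newChild.location == null) {
                newChild.location = owner.location;  // inherit, for error reporting
            }
            owner.propertiesStale = true;            // invalidate cached properties
        }
    }
}

public class OperandDemo {
    public static void main(String[] args) {
        Expression owner = new Expression();
        owner.location = "module.xsl line 12";
        Expression child = new Expression();       // no location of its own yet
        Operand op = new Operand(owner, child);
        System.out.println(child.location);        // -> module.xsl line 12
        System.out.println(owner.propertiesStale); // -> true
    }
}
```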

This process is complicated by the fact that the nodes on the expression tree are highly diverse; in fact, they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.

Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck() and optimize(). For example, the default action of optimize() is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, so the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as promote().
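
To make this concrete, here is a toy sketch of how a generic default optimize() might be inherited; the names and the one-slot child holder are mine, not Saxon's.

```java
import java.util.*;

/* Illustrative sketch: with children held in mutable slots, the base class
   can supply a default optimize() that splices rewritten children back in
   generically; subclasses only override when they can do better. */
abstract class Expr {
    static class Slot {             // one mutable slot per child expression
        Expr child;
        Slot(Expr c) { child = c; }
    }
    final List<Slot> children = new ArrayList<>();

    /* Default: optimize each child and splice the rewrite back in. */
    Expr optimize() {
        for (Slot s : children) {
            Expr rewritten = s.child.optimize();
            if (rewritten != s.child) {
                s.child = rewritten;    // generic replacement, no per-class code
            }
        }
        return this;
    }
}

class Literal extends Expr {            // a leaf: nothing to optimize
    final int value;
    Literal(int v) { value = v; }
}

class Plus extends Expr {               // constant-folds when both children are literals
    Plus(Expr a, Expr b) { children.add(new Slot(a)); children.add(new Slot(b)); }
    @Override Expr optimize() {
        super.optimize();               // first, the generic child rewrite
        Expr a = children.get(0).child, b = children.get(1).child;
        if (a instanceof Literal && b instanceof Literal) {
            return new Literal(((Literal) a).value + ((Literal) b).value);
        }
        return this;
    }
}

public class OptimizeDemo {
    public static void main(String[] args) {
        Expr e = new Plus(new Literal(1), new Plus(new Literal(2), new Literal(3)));
        System.out.println(((Literal) e.optimize()).value);  // -> 6
    }
}
```

Note how Plus relies on the inherited loop to rewrite its children, and only adds the folding step itself.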

I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of A and C simultaneously (making it not a tree at all)? It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes, which might then be discarded if not all the conditions for the optimization are met. By that point we have updated the parent pointers in the leaf nodes to point to this new superstructure, which we then discard, reverting to the original; it's difficult to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.

This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.

Sharing key indexes - Saxon diaries

For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some careful juggling of Java weak references to ensure that the indexes are dropped from memory as soon as either the executable stylesheet or the source document are no longer needed.

I've now spotted a flaw in this design. It wasn't reported by a user, and it didn't arise from a test case, it simply occurred to me as a theoretical possibility, and I have now written a test case that shows it actually happens. The flaw is this: if the definition of the key includes a reference to a global variable or a stylesheet parameter, then the content of the index depends on the values of global variables, and these are potentially different in different transformations using the same stylesheet.
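
As a hypothetical illustration (XSLT 2.0, unlike 1.0, allows variable references in the match and use expressions of xsl:key):

```xml
<!-- Hypothetical example: the index for this key depends on the stylesheet
     parameter $threshold, so an index built in one transformation cannot
     safely be reused by another run that supplies a different value. -->
<xsl:param name="threshold" select="100"/>
<xsl:key name="big-orders"
         match="order[@value &gt; $threshold]"
         use="@customer"/>
```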

Fixing this is not easy, and poses some interesting dilemmas, given that no-one has hit the problem in a dozen years and it's quite likely that no-one ever will. It would be nice to roll back my discovery of the problem and pretend that it doesn't exist, but that's not my style.

One solution would be to abandon the strategy of reusing indexes. It's likely that very few users have workloads that benefit from this strategy. On the other hand, those that do could see a big performance hit. Removing a powerful optimization because of a problem that doesn't apply to these workloads is not very friendly.

The right solution is to identify those key definitions that allow indexes to be shared and distinguish them from those whose indexes can't be shared, and then manage the two kinds of index differently. Both the identification and the management can be done, but they introduce more complexity and hence more risk of bugs, and the whole thing feels like overkill given that we don't know of any users who have the problem.

Civil engineers design structures to withstand storms that are likely to happen more than once in N years. It would be nice if we could adopt a similar approach here. But software doesn't seem to be like that; it seems to follow Murphy's law, which states that if something can go wrong, it will.

So with heavy heart, I think that there's no alternative to "doing the right thing", with all the complexity it entails.

Reference: Saxon bug 1968; W3C test case key-085a/b.

Another regex rewrite - Saxon diaries


In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things usually bring long-term benefits. The risks relate to both functionality and performance. In both cases we were able to minimize the risks thanks to a good collection of test material, but we always knew that contact with real life problems might throw up things that the test suite didn't.

Why did we need the new engine? There were several factors. On Saxon-CE, the GWT technology didn't offer access to the JDK regex library, only to the Javascript regex engine, which has different syntax and semantics, and no support for non-BMP Unicode characters. We could have translated XPath regexes to Javascript regexes the way we were doing for the JDK, but the expansions to handle non-BMP characters are very cumbersome. (We could also have abandoned W3C regex conformance, but that's not my style.) For "server-side Saxon", I had been impatient with the existing design for some time. There are bugs I reported in the JDK regex engine 5 years ago which Sun/Oracle have shown no inclination to fix; the approach of translating one regex dialect to another is slow and clunky; and the JDK wasn't keeping pace with new Unicode versions. I felt it was time we brought the regex engine in-house.

I looked around for an open source regex engine suitable for forking, and chose the Apache Jakarta engine. Unlike some other options the code was clear and well commented, the performance seemed reasonable, and the license was compatible. I made all the changes needed to support the XPath regex syntax and semantics, and a few other changes including some optimizations, and it passed all the tests. In fact, it has proved very reliable; there have been no bug reports against its functionality.

But there have been two performance glitches.

  • One is the way it handles integer repetition bounds such as a{3,7}. The algorithm it uses for this is the same as Saxon uses in its schema processor, which is described in a paper by Thompson and Tobin [1]; it essentially translates such an expression to aaaa?a?a?a?, and then generates a finite state machine accordingly. This becomes problematic when the integer bounds start to become significantly large; the size of the finite state machine and the time taken to generate it both increase rapidly with size. (For the technically minded, I'm not sure how rapidly, but even if it's only linear, it's a significant problem.)

  • The second problem is the handling of backtracking. The Jakarta engine, like most regex engines, relies on backtracking to handle ambiguous expressions. Backtracking relies on maintaining a stack so you can go back to where you were. In Jakarta, this data was kept on the Java program stack; each checkpoint was represented by a recursive function call. In many cases each character position in the input string is a potential checkpoint, so the depth of recursion was essentially equal to the length of the input string. That puts a limit of around 1K to 2K on the size of the input, which is not really acceptable.

For the 9.5 release I mitigated this problem by introducing an optimization: in cases where repetition was unambiguous, the recursive calls were eliminated. This catches many cases like A*B where you know that you have to keep reading A's until you hit a B, and when you do hit a B you know you won't have to try again from somewhere in the middle of the sequence of A's. But unfortunately, ambiguous expressions are common (an example would be .*B), and we have to handle them.

So I did something that I very rarely do; I studied how the JDK regex engine avoided this problem, by following its activity through the debugger. I was surprised to see that the JDK engine doesn't actually build a finite state machine. Instead it uses a pure interpreter pattern: it builds an abstract syntax tree representing the grammar of the regex, and then evaluates the regex by direct interpretation of the constructs on this tree. The maximum recursive depth with this approach is represented by the depth of the abstract syntax tree (that is, it depends on the regex but not on the input string), and the data needed for backtracking is held on the heap rather than on the Java stack.

Over the last week I have been rewriting the back end of the (ex-)Jakarta regex engine to use a similar design. It's passing all the functionality tests; I still have some performance testing to do, but it's looking good. The front-end code, that is the regex parser, is largely unchanged, but it now generates an operation tree rather than an FSM. I retained many of the features of the existing design, such as the use of a variety of implementations of integer sets and predicates to support the different kind of character classes, but the operations have all been re-implemented to work interpretively. I departed from the JDK design by using an approach similar to Saxon's XPath interpreter, in that the implementation of each operation returns an iterator over all the strings matched by the relevant expression. The key method supported by each operation is

IntIterator iterateMatches(final REMatcher matcher, final int position)

which is passed an REMatcher (containing all necessary information about the match operation, including the input string and regex flags, plus support for captured groups and backreferences) and a start position for matching; it returns an IntIterator representing a sequence of integers which are the possible positions where a match can end.

The class handling a repeat operation such as (a|aa)* implements this operation by repeated calls on the underlying choice operation. The first choice operation will return an iterator over the two possible end positions (1,2); the next time round there are four possible positions (1,2,2,3) (there is no attempt to eliminate duplicates), and so on. The method maintains a stack of IntIterator objects representing these successive calls; initially only the first hit from each iterator is used, but when backtracking is necessary, the stack is popped, and the iterator now at the top of the stack is called to deliver its next match.
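
A toy model of the scheme, using IntStream in place of Saxon's IntIterator and a bare input string in place of the REMatcher (which also carries flags, captured groups and so on); the names are mine:

```java
import java.util.*;
import java.util.stream.IntStream;

/* Toy model of the iterator-based matcher: each operation, given a start
   position, yields every position at which it could stop matching. */
interface Op {
    IntStream matches(String in, int pos);
}

class Lit implements Op {          // match a literal string
    final String s;
    Lit(String s) { this.s = s; }
    public IntStream matches(String in, int pos) {
        return in.startsWith(s, pos) ? IntStream.of(pos + s.length())
                                     : IntStream.empty();
    }
}

class Choice implements Op {       // a|b: end positions of either branch
    final Op a, b;
    Choice(Op a, Op b) { this.a = a; this.b = b; }
    public IntStream matches(String in, int pos) {
        return IntStream.concat(a.matches(in, pos), b.matches(in, pos));
    }
}

public class ToyRegex {
    public static void main(String[] args) {
        Op aOrAa = new Choice(new Lit("a"), new Lit("aa"));
        // against "aaa" starting at 0, (a|aa) can end at position 1 or 2
        int[] ends = aOrAa.matches("aaa", 0).toArray();
        System.out.println(Arrays.toString(ends)); // -> [1, 2]
    }
}
```

A repeat operation would then pump such iterators repeatedly, keeping them on a heap-allocated stack so that backtracking means popping the stack and asking the exposed iterator for its next end position.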

Saxon continues to take advantage of the optimization that recognizes unambiguous repetitions. With patterns such as A*B, or A{5}A, there is no need for backtracking so the stack of iterators is not used, and only the first match from each iterator is used.

There is scope for further optimization here. Some regular expression engines such as PCRE are known to handle highly-ambiguous expressions much more efficiently than this crude backtracking approach, mainly by eliminating the duplication that arises when a backtrack attempt ends up doing the same matches at the same positions as a previous attempt. Such improvements lie in the future; for the moment, being as good as the Java engine in performance, and better in features such as Unicode support and buglessness, is enough to declare success.

The new code is currently operational on the Saxon 9.6 development branch. In due course it will be retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance release. Probably only 10% of the original Jakarta code remains, but that's probably enough that I can't get rid of the Apache license that goes with it, for those who care about these things.

[1] Using Finite-State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. Henry Thompson and Richard Tobin. XML Europe 2003

At the XML Summer School 2013, Tony Graham presented a lightning talk about life after libxslt 1.0. I was not present at this summer school, but it was clear from the feedback I received that there is a major gap: no XSLT 2.0 support for the large developer community of the C/Perl/PHP/Python/Ruby world and the associated tools that rely on libxslt.

It is a known problem which has never, to my knowledge, been addressed. At Saxonica, we wanted to try to plug this gap by porting the Saxon processor from Java to C/C++, which would enable us to communicate with the languages listed above. One of our goals, if possible, was to interface with libxml and libxslt. Providing such a bridge, or a cross-compiled version of a full-fledged Java application, in C/C++ is always a daunting task. In this blog post I discuss the technical steps in our quest to achieve these goals and give some details of the experience gained along the way. I will begin by describing the various technologies that we tried, and how we ended up using a commercial Java native compiler after several failed attempts with tools that either did not work, were cumbersome, or were just too error prone.


At the summer school there were suggestions that the tool LLVM could do the job of compiling Java to native code. As claimed on the project website, LLVM is a collection of modular and reusable compiler and toolchain technologies. The LLVM project seems very active, with many projects using it for various tasks, but I found it difficult to get anything working. In particular, I tried using VMKit, which relies on LLVM, to compile a simple 'Hello World' example to machine code, but even that seemed cumbersome.


Secondly, I looked at the GCJ technology. GCJ is a tool that I have used before, so I was confident that it would work. However, my past experience is that it can be error prone and contains long-standing bugs; since the project has been dormant for several years, it seems unlikely that these bugs will ever be fixed. The other worrying fact is that GCJ only supports up to JDK 1.5. Nevertheless, for lack of other options, I persevered with GCJ, and had much better success: I managed to compile Saxon-HE to native machine code and actually got it to execute my example stylesheets. I had some problems because of classes that were not present in the GCJ implementation of JDK 1.5, such as those in the packages java.math and javax.xml, so I had to include my own versions of these packages.

The next step was to create a shared library of Saxon-HE, so that I could interface it with C/C++. This proved to be a real battle, but in the end I succeeded. I decided to use the Compiled Native Interface (CNI), which presents a convenient way to write Java native methods using C++. The alternative was JNI (Java Native Interface), which may be viewed as more portable. Both interfaces have similar principles: you need a Java/CNI-aware C++ compiler (any recent version of G++ is capable), and you must include the header file for each Java class you use. These header files, if not generated automatically, can be produced using gcjh. In the end, though, I gave up on GCJ: I stumbled upon a few known bugs, and if I was having major issues with the setup and prerequisites, then surely users would have the same problems.

Excelsior JET

The Excelsior JET tool is the final technology we looked at, and thankfully it is what we have ended up using in the alpha release. JET is a commercial product that provides a Java native compiler for both Linux and Windows platforms. What is good about this software tool is that it provides an easy-to-use graphical interface for building native executables and shared libraries from jar file(s). It can also package up the software into an installer ready to be deployed onto its intended host machine. This was great for us!

There is a lot I could write about JET, but it would be a repeat of the plethora of information available on their website and forum. I will just mention that we started with the evaluation version, which offers 90 days' free usage before purchasing the professional edition. Another point of interest is that Excelsior offer a free-of-charge license for use in conjunction with open-source software.

We know that there will be some sections of the open-source community that dislike the dependency on a commercial tool, but it is not that dissimilar from the early years of Java, when the Sun compiler was freely available but not open-sourced.

Implementation notes using JET

After creating the shared library, I used JNI to interface it with C/C++. It is possible to use JET's own Java interface to external functions, called xFunction, which is recommended if starting from scratch, but having used JNI with GCJ I continued with that approach. To get started there are a few examples of invoking a library from C/C++. In essence, you need to load the library and initialize the JET run-time before you can use it; see the code below.

#include &lt;windows.h&gt;   /* LoadLibrary, GetProcAddress; Linux builds use dlopen/dlsym */
#include &lt;jni.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;stdlib.h&gt;

extern char* dllname;  /* the library file name, defined elsewhere in this file */

/* Load dll. */
HANDLE loadDll(char* name)
{
  HANDLE hDll = LoadLibrary (name);

  if (!hDll) {
    printf ("Unable to load %s\n", name);
    exit (1);
  }

  printf ("%s loaded\n", name);
  return hDll;
}

extern "C" {
  jint (JNICALL * JNI_GetDefaultJavaVMInitArgs_func) (void *args);
  jint (JNICALL * JNI_CreateJavaVM_func) (JavaVM **pvm, void **penv, void *args);
}

/* Initialize the JET run-time. */
extern "C" void initJavaRT(HANDLE myDllHandle, JavaVM** pjvm, JNIEnv** penv)
{
  int result;
  JavaVMInitArgs args;

  JNI_GetDefaultJavaVMInitArgs_func =
    (jint (JNICALL *) (void *args))
    GetProcAddress (myDllHandle, "JNI_GetDefaultJavaVMInitArgs");
  JNI_CreateJavaVM_func =
    (jint (JNICALL *) (JavaVM **pvm, void **penv, void *args))
    GetProcAddress (myDllHandle, "JNI_CreateJavaVM");

  if (!JNI_GetDefaultJavaVMInitArgs_func) {
    printf ("%s doesn't contain public JNI_GetDefaultJavaVMInitArgs\n", dllname);
    exit (1);
  }

  if (!JNI_CreateJavaVM_func) {
    printf ("%s doesn't contain public JNI_CreateJavaVM\n", dllname);
    exit (1);
  }

  memset (&args, 0, sizeof(args));
  args.version = JNI_VERSION_1_2;
  result = JNI_GetDefaultJavaVMInitArgs_func(&args);
  if (result != JNI_OK) {
    printf ("JNI_GetDefaultJavaVMInitArgs() failed with result %d\n", result);
    exit (1);
  }

  /* NOTE: no JVM is actually created;
   * this call to JNI_CreateJavaVM is intended for JET RT initialization. */
  result = JNI_CreateJavaVM_func (pjvm, (void **)penv, &args);
  if (result != JNI_OK) {
    printf ("JNI_CreateJavaVM() failed with result %d\n", result);
    exit (1);
  }
  printf ("JET RT initialized\n");
  fflush (stdout);
}

XsltProcessor::XsltProcessor(bool license) {
  /* First of all, load the required component.
   * By the time of JET initialization, all components should be loaded. */
  myDllHandle = loadDll (dllname);

  /* Initialize the JET run-time.
   * The handle of the loaded component is used to retrieve the Invocation API. */
  initJavaRT (myDllHandle, &jvm, &env);

  /* Look for the classes we need (lookForClass and createObject are helper
   * functions defined elsewhere in this file). */
  cppClass = lookForClass(env, "net/sf/saxon/option/cpp/XsltProcessorForCpp");
  versionClass = lookForClass(env, "net/sf/saxon/Version");

  cpp = createObject (env, cppClass, "(Z)V", license);
  jmethodID debugMID = env->GetStaticMethodID(cppClass, "setDebugMode", "(Z)V");
  env->CallStaticVoidMethod(cppClass, debugMID, (jboolean)false);
}
In the constructor of XsltProcessor we see that, once we have loaded the library and initialized the JET run-time, we can make calls on the environment that has been created, to get class definitions and create instances of classes in the Java world, before making method calls on those objects.

PHP Extension in C/C++

After successfully getting XSLT transformations to work from C/C++, the next step was to try to develop a PHP extension which would operate like libxslt. There is a lot of material about PHP extensions, both on the web and in books; I found one guide particularly useful and literally followed it step-by-step, adding a few steps of my own once I had worked out what I was doing.


As a proof of concept I wrote a test harness in PHP which makes use of the PHP extension (see xslt30TestSuite.php in the download library). This is a test driver designed to run the public W3C XSLT test suite. The test driver in its current form requires Saxon-EE, which is not yet available in this alpha release; nevertheless, the program may serve as a useful example of how the API can be used. Note that it is written to use libXML to read the test catalog, but to use Saxon for running the tests and assessing the results.

Performance Testing

I now draw some preliminary comparisons between running Saxon-HE on Java and running Saxon-HE/C from C++ and from PHP. I also compare these times with libxslt (C/C++). An important aim is to get a good measure of the cost of crossing the Java/C++ boundary using JNI, and to see the additional effect of the PHP extension.

I used Saxon-HE as the baseline. The test machine was an Intel Core i5 430M laptop with 4GB memory, a 2.26GHz CPU and 3MB L3 cache, running Ubuntu 13.10 Linux, with Apache2 and PHP version 5.5.3-1ubuntu2. The compiler was Sun/Oracle Java.

The experiments were based on the XMark benchmark. I used query q8, which was converted into the stylesheet below. I chose q8.xsl because we should expect some performance bottlenecks across the implementations, due to the equijoins in the query:

<result xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="2.0">
<!-- Q8.  List the names of persons and the number of items they bought.
          (joins person, closed_auction) -->

  <xsl:for-each select="/site/people/person">
    <xsl:variable name="a"
       select="/site/closed_auctions/closed_auction[buyer/@person = current()/@id]"/>
    <item person="{name}"><xsl:value-of select="count($a)"/></item>
  </xsl:for-each>
</result>
The running times of executing q8.xsl on the document xmark64.xml (a 64MB document) are as follows:

Saxon-HE (Java):    60.5 seconds

Saxon-HE/C (C++): 132 seconds 

Saxon-HE/C (PHP): 137 seconds

libxslt (C/C++):        213 seconds

* Update on the times reported for Saxon-HE/C as a result of optimizations in the JET compiler.
* Code used to get libxslt time taken from:

The times for Saxon-HE/C exclude the cost of JET initialisation and library loading, which accounted for only 4 seconds. So we observe that there is not a big overhead between C++ and the PHP extension. The biggest cost, as expected, is in crossing between Java and C++, where we see on the C++/PHP platform a slowdown of ~2.2x. We also observe that Saxon-HE/C outperforms libxslt on C/C++ by ~40% on q8.

See project page on Saxon/C.

Stripping the DOM - Saxon diaries

I discovered yesterday that Saxon-CE isn't applying xsl:strip-space directives to input documents.  An unfortunate bug: not that many users seem too bothered by it, but conformance is always important.

A reminder: xsl:strip-space and xsl:preserve-space are designed to remove "ignorable" whitespace from the application's view. Because XML  doesn't distinguish significant from insignificant whitespace (a bad design mistake), this can only be done under program control. The idea is to present the application (the stylesheet) with a view of the input tree in which the insignificant whitespace does not appear, and the xsl:strip-space and xsl:preserve-space directives allow the stylesheet author to say which whitespace is considered significant.

How to fix the bug? Server-side Saxon (what I am starting to call "Big Saxon") traditionally provides two mechanisms: either it wraps the supplied DOM in a wrapping layer that hides the offending whitespace text nodes, or it bulk-copies the supplied DOM to an internal tree structure in which the whitespace is not copied across. Both methods have disadvantages. The "wrapping" approach imposes a significant extra cost on navigation of the tree (the cost is higher than it might appear, because whitespace stripping also has to take account of xml:space attributes in the document, which means a lot of searching for attributes-of-ancestors). The bulk-copy approach imposes a significant start-up cost to the transformation, often larger than the cost of doing the real transformation, and the cost is particularly high when processing large input documents because it doubles the memory requirement.

I'm considering using a third approach: in-situ modification of the supplied DOM. For Big Saxon, I have tended to treat this option as unthinkable: you don't muck with the user's data in this way. The same DOM, after all, might be input to multiple stylesheets, perhaps concurrently (though DOM is not thread-safe so this is a bad idea....) and they might have different space-stripping directives. A Java application might pass the DOM to an XSLT stylesheet to do some processing, but might want to continue operating on the same DOM afterwards.

In Saxon-CE, however, these considerations hardly apply. In the vast majority of cases the DOM is built solely in order to act as XSLT input. Unlike Big Saxon, where one can assume that if users supply a DOM they are doing so by choice, with Saxon-CE there is no alternative: all input XML arrives this way.

In-situ clean-up of the input DOM has other attractions; it can be used to solve the other two big problems with DOM processing (both arising from a mismatch between the DOM model and the XDM model): adjacent text nodes, and namespaces. The same pre-scan that strips unwanted whitespace nodes can also concatenate adjacent text nodes (eliminating entity references and CDATA sections in the process); presenting a view of adjacent text nodes as a single node accounts for a great deal of complexity (and lines of code, and execution time) in navigating a DOM. Equally for namespaces. To determine the namespace part of the name of an element or attribute, Saxon can't simply call getNamespaceURI() on the underlying DOM node. That method is defined to have no effect if "the node was created using DOM level-1 interfaces". Saxon of course has no idea how the node was created, and DOM provides no way of finding out. So it does an optimistic call of getNamespaceURI() just in case, and if this returns null then it has no choice but to search the node's ancestors and attributes looking for namespace declarations - and since this is done every time you look at a node to see whether it satisfies an axis step, it's no wonder that Saxon can be ten times slower processing a DOM than when processing its native tree structure. We should be able to cut this cost dramatically by doing a pre-scan of the document and calling setNamespaceURI() on all element and attribute nodes.
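
As a sketch, the first two parts of such a pre-scan (stripping whitespace-only text nodes and concatenating adjacent ones) can be done over a standard org.w3c.dom tree like this; the namespace fix-up is left out here, and the stripping rule is deliberately naive (no account of xsl:strip-space or xml:space):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;

/* Sketch of an in-situ DOM pre-scan: merge adjacent text nodes, then delete
   text nodes that contain only whitespace. Illustrative only. */
public class DomPreScan {
    static void strip(Node node) {
        node.normalize();                       // merges adjacent text nodes
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // save before possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);                   // recurse into elements
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>\n  <b>text</b>\n  <c/>\n</a>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        strip(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
        // -> 2 (the whitespace-only text nodes are gone)
    }
}
```

After this pass, navigation code can assume there are no adjacent text nodes, which removes a whole class of special cases from axis handling.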

The more I think about this, the more appealing the idea of doing the same thing on Big Saxon. The fact is, most people who supply a DOM as input to XSLT do so out of ignorance; they have no idea of the extra costs they are incurring, or of the fact that supplying SAX input would be ten times faster. It's important to deliver good performance not only to experts who know how to get the ultimate performance out of the system, but also to the average user who just copies JAXP code from somewhere on the web without any great thought or understanding. For that kind of user, in-situ modification of the supplied DOM might be the right answer. We can always continue to provide wrapping or bulk-copying as alternative configuration options, for people to invoke if they don't like their data being trampled upon.

Returning to Saxon-CE, the other complication is the HTML DOM. Running XPath against an HTML DOM is already sufficiently quirky (e.g. because of case-insensitive element names, namespace non-support, etc etc) that it's probably wisest to keep things as simple as possible on this front. I don't think we need to apply xsl:strip-space directives to the HTML document; there's nothing in the spec that would make this a conformance issue, and there's no usability imperative. So we'll leave well alone here.

Reducing the size of Saxon-CE - Saxon diaries

| No Comments | No TrackBacks
During the early stages of Saxon-CE development, the major part of the work was cutting Saxon down to size to run on the browser. That initial work was described in an earlier post, Cutting Saxon down to size.

In the last couple of weeks I've resumed the effort to reduce the size of the code. This article examines the changes I've made, and the opportunities for further reduction.

Saxon-CE 1.1 is around 880Kb, and so far, I've reduced that to 780Kb. I'm measuring the size of the main Javascript download here, not the ancillary data files that may or may not be downloaded, depending on what the stylesheet actually does. Reducing the size is important because the download time causes an appreciable delay on the first visit to a site, and first impressions are important; cutting the delay from say 7 seconds to 5 seconds will result in a corresponding reduction in the number of visitors who don't bother to wait.

What has gone?

About half this reduction was achieved by rewriting two data-intensive classes to load their data from XML files on demand, rather than defining it in static initialization code. The two classes in question are Categories, which contains the mapping of Unicode characters to categories used in regular expressions such as Lu and Nd; and CaseVariants, used when the "i" flag appears in a regular expression to determine all the upper/lower case variants of any character. These XML files are downloaded only if the relevant language features are used in the stylesheet.

A couple of other classes were also performing lengthy static initializing: StandardFunctions, which contains details of all the system functions in the core library, and StandardNames, which contains the names of all XSLT elements, attributes, and built-in types. In the case of StandardFunctions the amount of initialization code has been vastly reduced by introducing a micro-syntax to describe function signatures in the form of a string literal, which can be easily parsed in a few lines of code. In the case of StandardNames, the class was setting up a mapping from integer constants to QNames; this has been removed by changing the product to use QNames throughout (or in the case of types, the object instances representing each built-in type) rather than integer constants.
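To illustrate the idea (the micro-syntax shown here is invented for this example, not Saxon's actual one), a whole function library can be described as compact string literals and parsed in a few lines, replacing pages of object-construction code:

```java
import java.util.Arrays;
import java.util.List;

public class SignatureParser {

    static class Signature {
        final String name;
        final String resultType;
        final List<String> argTypes;
        Signature(String name, String resultType, List<String> argTypes) {
            this.name = name;
            this.resultType = resultType;
            this.argTypes = argTypes;
        }
    }

    // Parse an entry like "substring(xs:string, xs:integer) -> xs:string".
    // The syntax is hypothetical; the point is that one small parser
    // replaces a large volume of static initialization code.
    static Signature parse(String entry) {
        int open = entry.indexOf('(');
        int close = entry.indexOf(')');
        String name = entry.substring(0, open).trim();
        String args = entry.substring(open + 1, close).trim();
        String result = entry.substring(entry.indexOf("->") + 2).trim();
        List<String> argTypes = args.isEmpty()
                ? List.of()
                : Arrays.asList(args.split("\\s*,\\s*"));
        return new Signature(name, result, argTypes);
    }
}
```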

Associated with this second change, the NamePool has gone. Saxon-CE was not getting any real benefit from the NamePool, because the source document is always in the form of a DOM, in which names are held as strings. So integer fingerprints were being allocated on demand, which cancels any saving from faster integer comparisons. The change to use QName objects in place of namecodes and fingerprints throughout the product was of course very extensive, but it seems to have been trouble-free. There is only one very minor loss of functionality: in cases where Saxon has to invent a prefix for use on a constructed element or attribute, it is no longer able to consult the NamePool to see what prefixes have previously been used with that URI.

The other changes are minor and tedious. They fall into a number of categories:

* Finding code that could never be executed. Although GWT automatically eliminates code that is completely unreachable, there is code that cannot be executed in practice for reasons that GWT cannot detect. Two examples: (a) code for matching types such as attribute(*, xs:date). Although a basic XSLT processor allows this syntax, all nodes in Saxon-CE are untyped, so this type will never match anything. It can therefore be replaced during parsing by a type that matches no instances, which cuts out a lot (well, a couple of Kb) of useless code. (b) there was still a lot of code designed to report the line number on which errors occurred, both statically and dynamically. But in the browser environment, the XML parser provides no line number information, so this code was useless.

* Eliminating duplication. For example, I found that there were two completely separate mechanisms for maintaining comparison keys, one mechanism used for xsl:for-each-group and distinct-values, the other used for sorting and various other operations. Although there were a few semantic differences (for example in the handling of NaN) it proved reasonably easy to combine these two mechanisms into one.

* Cutting out minor optimizations. Do we really need a separate class to handle hash maps in which the key is an unboxed integer, rather than using the standard HashMap with a boxed integer key? Is that optimization worth 1000 lines of code? Probably not. There are many similar examples.

* Cutting out small classes. Arithmetic expressions, for example, delegated the computation to one of a large number of small Calculator classes, each dedicated to one particular operation, such as (integer plus integer). This is an elegant design, but in GWT it appears to be expensive: each of these classes was generating up to 2Kb of code. Combining the classes into a set of switch expressions in the parent class appears to give a significant reduction in code size (though I have to admit I am not entirely sure why).
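The shape of that change can be sketched as follows (a simplification reduced to long arithmetic; the real code has to deal with the full range of numeric type combinations, overflow, and division by zero):

```java
public class Calculator {

    enum Operator { PLUS, MINUS, TIMES, DIV, MOD }

    // One switch in the parent class replaces a family of tiny
    // per-operation classes, each of which cost up to 2Kb in GWT output.
    static long compute(Operator op, long a, long b) {
        switch (op) {
            case PLUS:  return a + b;
            case MINUS: return a - b;
            case TIMES: return a * b;
            case DIV:   return a / b;
            case MOD:   return a % b;
            default:    throw new AssertionError(op);
        }
    }
}
```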

In deciding whether to remove a particular piece of code, I find it useful to reverse the thinking: if this wasn't there already, would I consider it worthwhile adding it? This is analogous to zero-based budgeting: there is a presumption that something isn't needed unless it can be justified, rather than a presumption in favour of the status quo.

Where will the next 100Kb come from?

Getting down from 880Kb to 780Kb is useful, but it isn't enough. The target should be something like 500Kb. Finding big savings, however, gets more and more difficult. There are plenty of small salami-slices available, but they get harder to find.

Some of the original areas in my earlier post remain unaddressed. There is still a lot of duplication between the "style" classes representing the source stylesheet, and the "instruct" classes on the expression tree. Historically, it was important to allow the source stylesheet tree to be garbage collected once compilation is finished; in Saxon-CE, that reasoning doesn't apply. Probably we should keep the instruction classes on the expression tree, but have them delegate a lot more to the original "style" elements to avoid duplication.

The expression tree is a major opportunity for simplification. There are two issues: there are too many different kinds of expression (too many subclasses), and they implement too many methods. Specifically, there are 167 subclasses of Expression, and the class defines some 48 methods. Some of these classes are admittedly very small (FirstItemExpression, for example, which implements constructs of the form EXP[1], is just 157 bytes -- I haven't worked out why these classes are so much smaller than the Calculator classes mentioned above). So compiling EXP[1] to a call on subsequence(EXP,1,1) would probably give no net saving. But despite much use of superclasses to hold shared functionality, there's still a fair bit of duplication that could be eliminated, and a large number of minor optimizations that probably have little merit.

Could we drop the LinkedTree implementation of XDM? The Tinytree is already gone from Saxon-CE. The LinkedTree is only used (a) for stylesheets, and (b) for temporary trees: source documents and result documents use the Javascript DOM. (Not the GWT DOM). We could consider using the DOM also for stylesheets and for temporary trees. But it's only 20Kb. A more interesting way forward might be some kind of hybrid of the current LinkedTree and HTML-DOM-wrapped-tree; given the way Javascript works, we could consider adding properties to the DOM to make it implement the NodeInfo interface more directly, supplementing some of its deficiencies for example (notoriously) its poor support for namespaces.

There is a lot of code whose main purpose is lazy (pipelined) evaluation. This leads to a proliferation of small specialized classes. Some of these are probably an unnecessary luxury. For the index-of() function, for example, do we really need a custom iterator that searches for the next match and returns it on demand, or would it do any harm to find all the matches up-front? In some cases several iterator implementations could be combined into one: for example the index-of() iterator could be handled by a flag in the filter iterator that returns the position of the matched item instead of the item itself.
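For comparison, an eager implementation of index-of() is only a few lines. This sketch uses Java lists in place of Saxon's sequence types:

```java
import java.util.ArrayList;
import java.util.List;

public class EagerIndexOf {

    // Eager alternative to a custom on-demand iterator: find all the
    // matching positions up-front, in the style of XPath's index-of().
    static List<Integer> indexOf(List<?> sequence, Object search) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < sequence.size(); i++) {
            if (sequence.get(i).equals(search)) {
                positions.add(i + 1);            // XPath positions are 1-based
            }
        }
        return positions;
    }
}
```

The trade-off is memory: the eager version materializes the whole result, which only matters for very large sequences.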

Generally, I think that finding another 100Kb of discardable code looks feasible, though it's a lot of work.

Note: Saxon-CE today is 64800 non-comment lines of Java code. Saxon-EE for comparison is 261100 non-comment lines.

Update two weeks later:

This kind of exercise can become a bit obsessive. I'm now down just below 700Kb, and at the Java level the number of source lines is down to about 55000. It's amazing how much dead wood there is, and it's a great pleasure to get rid of it. By and large, most of the reductions have been achieved without adverse effects that anyone will notice, and in some cases it's been a real opportunity to tidy up the code and improve it. I'm very inclined to port some of the changes back to "big Saxon" (that is, the server-side code base).

The changes made fall into a number of general categories:

  • Avoiding duplication. For example, the code to do compile-time static type checking of function arguments (to implement the function conversion rules) was extremely similar to the code for dynamic conversion of supplied stylesheet parameters to their required type. A little bit of thought enabled the two to be combined. There was extensive duplication in the code for iterating different axes for different tree models. The code for xsl:next-match was about 95% the same as the code for xsl:apply-imports. There was perhaps 70% code overlap in the iterators for the union, intersect, and except operators. There were half-a-dozen places in the product containing almost identical code for converting a lexical QName to an expanded QName given a namespace context. In some cases the duplication could be justified for "big Saxon", because of the imperative to minimize run-time pathlengths. But that's the wrong trade-off for Saxon-CE; the size of the code really does matter.
  • Eliminating trivial optimizations. I already mentioned the "integer hash-map" example. There are plenty of examples of static rewrite optimizations that really don't justify themselves: rewriting string-length($x) = 0 as not(string($x)) is an example. You have to look at these and ask (a) whether the saving achieved is significant, and (b) whether the rule will fire often enough to be worthwhile. It's easy to make mistakes on both counts - nearly all these optimizations were added at some time because of a real use case where they made a significant difference. But over time they accumulate and gather dust; sometimes they cease to be relevant because something else has changed elsewhere in the system.
  • Avoiding over-generalized code. GWT does a good job of identifying methods that are never called. But it can't spot methods that are capable of doing 17 things, and are only ever called upon to do one of those 17 things.
  • Rewriting the code to be more compact. A couple of examples here: the code for checking the select attribute of xsl:value-of now reads "checkAttribute("select", "e")", where "select" is the name of the attribute, and "e" indicates that its value must be an expression. The previous code was about four times this long, and the extra stuff was essentially the same for every attribute of every element. Similarly, the code for registering details of the signatures of built-in functions, and the code for enumerating the subexpressions of every kind of expression: these fragments of code occur very often, and cutting them down from 4 lines to 1 can make a big difference overall.
  • Simplifying error handling. Good diagnostics for errors are still important, but there was a lot of verbosity in this area that could go. If a date is invalid, how much detail do you need to give the user about the fact? Might it not be good enough to tell them that 2013-03-59 is an invalid date?
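The lexical-QName case above is typical: a single shared helper like the sketch below can replace all the duplicated copies. This is a simplification for illustration, using a plain prefix-to-URI map as the namespace context and Clark notation `{uri}local` for the result:

```java
import java.util.Map;

public class QNameHelper {

    // Shared helper replacing duplicated lexical-QName expansion code.
    static String expand(String lexical, Map<String, String> namespaces) {
        int colon = lexical.indexOf(':');
        if (colon < 0) {
            return "{}" + lexical;               // unprefixed: no namespace
        }
        String prefix = lexical.substring(0, colon);
        String uri = namespaces.get(prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Undeclared prefix: " + prefix);
        }
        return "{" + uri + "}" + lexical.substring(colon + 1);
    }
}
```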

What's new in Saxon 9.5? - Saxon diaries

| No Comments | No TrackBacks

The following is a provisional list of changes expected in the Saxon 9.5 release, which is in the final testing phase. There could still be changes to this list, for example a feature could be dropped if it is found to be unreliable during testing.


The xsl:result-document instruction in Saxon-EE is now asynchronous. That is, the code to output the result document runs in a separate thread, in parallel with other processing. The maximum number of threads used by xsl:result-document instructions is limited by the configuration option FeatureKeys.RESULT_DOCUMENT_THREADS which defaults to the number of processors available to the Java VM; setting this to zero or one will suppress multithreading. Setting FeatureKeys.ALLOW_MULTITHREADING to false has the same effect. (This can be useful when debugging, because otherwise the output from xsl:message and fn:trace() can be very confusing).

The facility can also potentially cause problems if the code calls extension functions that have side-effects. Multi-threading can therefore be controlled, if required, using the saxon:asynchronous attribute on the xsl:result-document instruction: use saxon:asynchronous="no" to suppress multi-threading. Asynchronous processing of xsl:result-document is automatically suppressed if tracing (using a TraceListener) is enabled.
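The threading rules described above can be modelled roughly like this (an illustrative sketch using a standard executor pool, not Saxon's implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ResultDocumentRunner {

    // Illustrative model: a thread count of 0 or 1 suppresses
    // multithreading (tasks run inline, in order, which is debug-friendly);
    // otherwise the documents are written on a bounded pool.
    static void run(int threads, Runnable... tasks) {
        if (threads <= 1) {
            for (Runnable task : tasks) {
                task.run();                      // synchronous execution
            }
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Runnable task : tasks) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```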

The collection() function is also now multi-threaded in Saxon-EE. Each document in the collection is parsed in a separate thread, and the documents are processed in the order in which parsing completes. This makes the order of the documents less predictable than in previous releases, though it was never guaranteed or documented.


New XSLT 3.0 instructions such as xsl:iterate and xsl:try no longer have synonyms in the Saxon namespace.

Extension function saxon:for-each-group() is dropped (superseded by "group by" in XQuery).

New extension functions saxon:schema() and saxon:type are available, giving access to schema information. The saxon:schema() function obtains information from all the schema components available in the query or stylesheet; the saxon:type function gives information about the type annotation of a node. In both cases, the information is returned as a function item which behaves like a map from schema component property names to values; for example the name of the type annotation of a node is given by saxon:type($node)('name').

A new extension function saxon:send-mail() is available to send email via an SMTP server.

A new extension function saxon:key-map() is available: this allows the index constructed using xsl:key to be viewed as a map, giving extra functionality such as getting all the entries in a given range, enumerating the key values, concatenating keys across multiple documents, processing the entries in sorted order, etc.

The third argument of saxon:transform(), which supplies parameters to an XSLT transformation, may now take the form of a map, allowing the parameter value to be any value whatsoever (the old mechanism restricted it to atomic values).

The extension function saxon:index has changed to expect a function as its second argument rather than a prepared expression, and it now returns a map which can be accessed using all the standard map functions. The extension function saxon:find is now a synonym for map:get. There is no longer an option to specify a collation.

A new flag "v" has been added to saxon:deep-equal() to suppress the check that two elements have the same "variety" of type: for example if one has element-only content, the other must have element-only content. This check was not performed in previous releases; in this release it is performed by default (as required by the fn:deep-equal() specification), but may be suppressed using this option. The option is useful when comparing validated and unvalidated documents.

The proposed EXPath file module is implemented (in Saxon-PE and -EE). This provides a number of extension functions for reading and writing files in filestore, creating and deleting files and directories, listing the contents of directories, and so on.

The EXPath zip module is implemented. The implementation is derived from the original implementation by Florent Georges, but has been more closely integrated into Saxon and more thoroughly tested. This module is open source code; the extensions are integrated into the Saxon-PE and Saxon-EE distribution, and are available to Saxon-HE users in source code form, where they can be linked to the product in the same way as any other extension functions.

A new serialization option saxon:attribute-order is available. The value is a whitespace-separated list of attribute names. Attributes whose names are present in the list come first, in the order specified; other attributes follow, sorted first by namespace URI and then by local name.
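The ordering rule can be expressed as a comparator. In this sketch, attribute names are simplified to "uri|local" strings; the real serializer works on Saxon's name objects:

```java
import java.util.Comparator;
import java.util.List;

public class AttributeOrdering {

    // Sketch of the saxon:attribute-order rule: names on the list come
    // first, in list order; the rest are sorted by namespace URI, then
    // local name. Names are encoded as "uri|local" strings here.
    static Comparator<String> comparator(List<String> ordered) {
        return (a, b) -> {
            int ia = ordered.indexOf(a);
            int ib = ordered.indexOf(b);
            if (ia >= 0 && ib >= 0) return Integer.compare(ia, ib);
            if (ia >= 0) return -1;              // listed names sort first
            if (ib >= 0) return 1;
            String[] pa = a.split("\\|", 2);
            String[] pb = b.split("\\|", 2);
            int c = pa[0].compareTo(pb[0]);      // namespace URI first
            return c != 0 ? c : pa[1].compareTo(pb[1]);
        };
    }
}
```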

When the -wrap option is used in XQuery, the serializer is now capable of generating an XML representation of a map, using nested elements to represent the entries in the map with their keys and values.

External Object Models

Support for JDOM2 is added.

Support for Apache Axiom is added.

A number of optimizations have been made to the support modules for other external object models, noticeably to speed up navigation of the descendant axis.

The class DocumentBuilderFactoryImpl, which constructed a DOM wrapper around a TinyTree, and which has been deprecated since 9.3, is now removed.

XSD Extensions

Parameterized validation: a new extension to XSD 1.1 is implemented to allow schema validation to be parameterized. The saxon:param element can be added to a schema to declare a parameter; the value of the parameter can be referenced in XPath expressions, for example in assertions. The parameter values can be set from the command line when running the Validate command, or from the s9api (Java) and Saxon.Api (.NET) interfaces when validating from an application. It is also possible to initiate parameterized validation using a new saxon:validate extension function available in XSLT and XQuery.

Internal Changes

Regular expressions: a new regular expression engine is introduced, based on the Apache Jakarta product.

The implementation of the base64 encoding and decoding routines has been rewritten in order to simplify the set of open source licenses in use.

XPath 3.0 Functions and Operators

fn:serialize() function: the implementation has been changed to match the 2011 version of the W3C specification. This changes the format of the serialization parameters supplied in the second argument to the function.

The functions has-children(), innermost(), and outermost() are implemented.

The function parse-xml-fragment() is implemented.

In format-date() and friends, timezone formatting has been updated to match the current spec. This may create incompatible changes. The changes apply whether or not XPath 3.0 processing is enabled.

The implementation of deep-equal() has been changed to enforce the rule that the type annotations of two element nodes must have the same variety. This corrects a long-standing non-conformance, but may cause incompatibility, especially when comparing a schema-validated document against an unvalidated document. A new flag "v" has been added to the Saxon equivalent function saxon:deep-equal() to suppress this check.

Command line

Command-line interfaces: added the option -quit:off to prevent exiting the Java VM in the event of a failure; instead, a RuntimeException is thrown. This is useful when the command line interfaces are invoked from another Java application, for example Ant, as it allows the calling application to recover.


XSD 1.0 Support

There have been changes to the way the schema processor handles recovery from validation errors. By default the processor tries to continue validating to the end of the file, reporting as many validation errors as it can. All errors are reported to the error() method of the ErrorListener, which by default throws no exception; if the error() method does choose to throw an exception then validation terminates immediately. Validation can also be terminated early if an error limit is reached; the error limit can be set in the ParseOptions object that controls the validation. (The value 1 causes termination after the first error is reported).

At the end of the document, if there have been one or more validation errors, then a fatal error is thrown unless the "recover after validation errors" option (FeatureKeys.VALIDATION_WARNINGS) is set, in which case no exception occurs and processing can continue. Note that in this case the type annotations in any result document are unreliable. The setting FeatureKeys.VALIDATION_WARNINGS is ignored in the case where a document used as input to XQuery or XSLT is being validated, because in that case the incorrect type annotations would cause inconsistencies during subsequent processing. TODO: check this works.

If an element is found that cannot legitimately appear according to the content model of its parent, Saxon previously abandoned the validation of the content of that element, as well as the content of its following siblings. This has been changed so that the content of the offending element, and the content of its siblings, is now validated, using the context-determined type for the element (the "element declarations consistent" constraint ensures that if the same element name appears more than once in a content model, each must be associated with the same type).

The validation exception made available to the ErrorListener now includes a structured path indicating the location of the offending node (previously the path was available only as a string). It also includes, where available, a reference to the schema type against which validation was attempted.

The error messages produced when a sequence of elements does not conform to the content model of a complex type have been improved. There is now more effort to distinguish different causes of the error: for example, too many repetitions of a repeated element, a mandatory element that has been omitted, or an element that is in the wrong namespace.

Saxon now recognizes the XLink namespace and fetches the schema for this namespace locally rather than fetching it from the W3C web site (which will often time out).

XPath 3.0 Support

Revised the EQName syntax as per spec bug 15399: it is now Q{uri}local.

Affecting XPath 2.0 also, some very long-standing bugs have been fixed in the handling of schema-validated documents using xsi:nil - specifically, ensuring that the typed value of a nilled element is always an empty sequence, getting the matching of types element(N, T) and element(N, T?) right, as well as the impact of nillability on type subsumption.

XSLT 3.0 Support

The bind-group and bind-grouping-key variables on the xsl:for-each-group element are now implemented.

This involved some internal refactoring of the way variables are managed during the XSLT static analysis phase.

Implemented the composite attribute of the xsl:for-each-group element.

Changed the implementation of xsl:merge to reflect the revisions in the draft XSLT 3.0 specification (removed xsl:merge-input element; added sort-before-merge attribute).

Implemented accumulators (the new xsl:accumulator feature) for both streamed and unstreamed documents.

Implemented the xsl:stream instruction.

Implemented the error-code attribute of xsl:message.

Implemented the start-at attribute of xsl:number.

Implemented the context-item attribute of xsl:evaluate.

Implemented the xsl:assert instruction.

Implemented the xsl:map and xsl:map-entry instructions.

The restriction that xsl:import must precede other declarations in a stylesheet module has been removed.

Implemented the on-empty attribute of xsl:attribute.

Implemented the xsl:on-empty attribute of literal result elements. Tested with and without byte code generation. TODO: test with streaming, and with freestanding construction (evaluateItem).

Saxon now allows the EQName syntax Q{uri}local in places where a QName is required, for example the name attribute of xsl:variable, xsl:template, xsl:function etc, the first argument of the functions key(), system-property(), and function-available(), the final argument of format-number(), etc. This is useful for static names in the case where stylesheets are generated programmatically, and it is always useful for dynamic names because it avoids the dependency on the static namespace context.

It is now a static error if the same NodeTest appears in an xsl:strip-space and an xsl:preserve-space declaration at the same import precedence. This is a backwards incompatibility in the 3.0 specification.

There is now a warning message if the namespace URI of the document element of the principal input document does not match the namespace URIs used in the template rules of the stylesheet. This is designed to catch the common beginner's mistake of writing (for example) match="html" when the element to be matched is actually in a (default) namespace. The precise conditions for the warning are that the stylesheet contains one or more template rules in the initial mode that match an explicit element name, and none of these template rules matches an element in the namespace used for the top level element of the principal source document.

Implemented more of the new pattern syntax: patterns matching variables, namespace nodes, ... Patterns that match atomic values can no longer be used as part of a pattern that uses "union", "intersect", or "except" (as a result of clarification of the XSLT 3.0 specification.)

Affecting XSLT 2.0 also, a very long-standing bug has been fixed: documents read using the collection() function are now subjected to whitespace stripping as defined by xsl:strip-space declarations in the stylesheet.

Also affecting XSLT 2.0, a change has been made in the behavior of xsl:result-document when the href attribute is absent. Previously this caused the constructed result document to be serialized to the location specified by the base output URI. Now it causes the constructed document to be sent to the primary output destination selected in the calling API (which might, for example, be a DOMResult or an XdmDestination). Both interpretations appear to be allowed by the specification. Note that omitting the href attribute is done typically when you want to validate the result document against a schema, though the same effect can be achieved using xsl:document.

Whitespace stripping: the xsl:strip-space declaration now has no effect on a node if it is within the scope of an XSD 1.1 assertion: that is, whitespace text nodes are not stripped if any ancestor node has been validated against a type that contains an assertion. This is because changing the content of such a node could invalidate the assertion, thus breaking type safety.

Implemented content value templates. These allow expressions contained in curly braces to be contained in text nodes within a sequence constructor, rather like attribute value templates; the facility can be enabled by setting expand-text="yes" on any enclosing XSLT element (for example, <xsl:stylesheet>). Example: <p>{$greeting}, {$first} {$last}, how are things?</p>. Note that the details of the facility in the specification (for example, handling of boundary whitespace) are subject to change.

The new match pattern syntax match="?{ expr }" is implemented. This matches an item T if the expression has an effective boolean value of true when evaluated with T as the context item. This construct replaces the ~ itemType syntax defined in earlier XSLT 3.0 drafts, but for the moment Saxon implements both.

Some of the features NOT implemented in XSLT 3.0 include:

  • Packages

  • xsl:context-item


There has been considerable development of the streaming capability, much of it involving removal of restrictions that were not previously documented (or known).

The xsl:stream instruction is now implemented, making the saxon:stream extension function obsolescent (though it is retained for the time being). Accumulators are also implemented, in both streaming and non-streaming mode.

Streaming: Saxon is moving towards a design where it implements the "guaranteed streamability" rules in the W3C draft precisely, unless the configuration option ALLOW_STREAMABILITY_EXTENSIONS is set, in which case Saxon may implement streaming (or near-streaming) for some additional cases.

Certain constructs using positional filters can now be evaluated in streaming mode. The filter must be on a node test that uses the child axis and selects element nodes. The forms accepted are expressions that can be expressed as x[position() op N] where N is an expression that is independent of the focus and is statically known to evaluate to a number, x is a node test using the child axis, and op is one of the operators eq, le, lt, gt, or ge. Alternative forms of this construct such as x[N], remove(x, 1), head(x), tail(x), and subsequence(x, 1, N) are also accepted.
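To see why such forms are streamable, consider x[position() le $n]: it can be evaluated by truncating the item stream after n matches, never buffering and never reading further. A minimal sketch of such a truncating iterator (illustrative, not Saxon's internal iterator classes):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class HeadIterator<T> implements Iterator<T> {

    // Deliver at most n items from the underlying iterator, then stop,
    // so the rest of the input stream never needs to be read. This covers
    // forms like x[position() le n], head(x), and subsequence(x, 1, n).
    private final Iterator<T> base;
    private int remaining;

    HeadIterator(Iterator<T> base, int n) {
        this.base = base;
        this.remaining = n;
    }

    public boolean hasNext() {
        return remaining > 0 && base.hasNext();
    }

    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return base.next();
    }
}
```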

Streaming is now possible for xsl:for-each-group using the group-adjacent, group-starting-with, or group-ending-with attributes.

XQuery 3.0

Saxon 9.5 fully implements the XQuery 3.0 Candidate Recommendation of January 2013.

Forwards references to global variables are now allowed.

Added support for variables such as err:code and err:description within the catch block of try/catch.

Added support for function annotations.

Added support for the required-feature and prohibited-feature declarations.

XQuery 3.0 extensions can now co-exist with XQuery Update extensions in the same query. 


.NET API

The XPathSelector interface now has an EffectiveBooleanValue method.

Various methods have been added to bring the interface closer to the functionality level of the Java s9api interface.

System Programming Interfaces

In the Configuration class, many of the getters and setters for individual configuration properties have been removed, relying instead on the general-purpose methods getConfigurationProperty() and setConfigurationProperty(). To make these easier to use in the common case of boolean properties, the new methods getBooleanProperty() and setBooleanProperty() are introduced. Property names that are only relevant to Saxon-PE or Saxon-EE are now generally recognized only by a ProfessionalConfiguration or EnterpriseConfiguration respectively; previously some were also recognized by a Saxon-HE configuration while others were not.

The classes ValueRepresentation and Value have been replaced by the new class Sequence. In some interfaces Sequence also replaces SequenceIterator, since passing a LazySequence (which implements Sequence) has all the same performance benefits as passing a SequenceIterator, while still allowing an Item to be passed directly, without wrapping.
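The design idea can be illustrated with a small self-contained sketch (the names Sequence, Item, and LazySequence here are illustrative only, not Saxon's actual classes): a single Sequence abstraction that an Item can implement directly, so a singleton value needs no wrapping, while a lazy implementation keeps iterator-style on-demand delivery.

```java
import java.util.Iterator;
import java.util.List;

interface Sequence {
    Iterator<Item> iterate();
}

class Item implements Sequence {
    final String value;
    Item(String value) { this.value = value; }
    // an Item is itself a Sequence of length one: no wrapping needed
    public Iterator<Item> iterate() { return List.of(this).iterator(); }
}

class LazySequence implements Sequence {
    private final Iterable<Item> source;
    LazySequence(Iterable<Item> source) { this.source = source; }
    // items are delivered on demand, as with an iterator-based interface
    public Iterator<Item> iterate() { return source.iterator(); }
}

public class SequenceDemo {
    // a consumer only ever sees the Sequence abstraction
    static int count(Sequence seq) {
        int n = 0;
        for (Iterator<Item> it = seq.iterate(); it.hasNext(); it.next()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(new Item("a")));
        System.out.println(count(new LazySequence(List.of(new Item("a"), new Item("b")))));
    }
}
```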

The change affects (simplifies) the interface for integrated extension functions, where arguments and result values are now passed as Sequence objects rather than as SequenceIterator instances.

In the NodeInfo interface, the two methods getTypedValue() and atomize() for obtaining the typed value of a node have been unified into a single method, atomize(), which returns the new type AtomicSequence. In the vast majority of cases the typed value is a single atomic value, and this case works efficiently because AtomicValue implements AtomicSequence.

Also in the NodeInfo interface, the method getAttributeValue(fingerprint) has been dropped; callers should use getAttributeValue(uri, local) instead.

The OutputURIResolver interface has been changed: a new method newInstance() is added. This change is made because xsl:result-document is now multi-threaded by default, and since it's likely that existing implementations of OutputURIResolver won't be thread-safe, making an interface change is better than a semantic change that will cause the code to break in difficult-to-diagnose ways. The new method typically returns a new instance of the OutputURIResolver class, so that each instance of the class only needs to remember information about one result document, and is automatically thread-safe.
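The pattern behind this change can be sketched generically (hypothetical names, not Saxon's actual OutputURIResolver interface): the registered resolver acts as a prototype that hands out a fresh instance per result document, so each instance holds state for one document only and needs no locking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the one-instance-per-document pattern.
public class PerDocumentResolver {
    private String currentUri;   // state for one result document only

    // called once per xsl:result-document, possibly from different threads
    public PerDocumentResolver newInstance() {
        return new PerDocumentResolver();
    }

    public void resolve(String uri) { this.currentUri = uri; }
    public String close()           { return currentUri; }

    public static void main(String[] args) {
        PerDocumentResolver prototype = new PerDocumentResolver();
        List<String> results = new ArrayList<>();
        // each result document gets its own instance, so instances are
        // automatically thread-safe without synchronization
        for (String uri : List.of("out1.xml", "out2.xml")) {
            PerDocumentResolver r = prototype.newInstance();
            r.resolve(uri);
            results.add(r.close());
        }
        System.out.println(results);
    }
}
```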

In the Java classes that implement the schema component model (in package com.saxonica.schema) many of the methods that return an Iterator have been replaced by methods that return a Set or a List. This simplifies processing by enabling use of the Java for-each construct, and is more convenient when the saxon:schema extension function is used to process schema components in the form of XPath functions.

The toString() method on the Expression class has been more carefully defined and implemented in an attempt to achieve the result that (for an Expression within the scope of XPath) the result is a legal and equivalent XPath 3.0 expression with no dependencies on namespace prefixes other than (a) the binding of the prefix "xs" to the standard Schema namespace, and (b) the assumption that the XPath functions namespace is the default function namespace. In other cases QNames are expanded using EQName notation Q{uri}local. There may be a few remaining cases where the output does not yet satisfy these intentions.


Saxon has long provided the ability to have an Item that wraps an external Java or .NET object, which can be supplied in arguments to extension function calls or returned as the result of an extension function call. In the past, such values have appeared in the type hierarchy under "atomic value", that is, as a peer of types such as xs:boolean and xs:string. This has changed so they are no longer atomic values, instead forming a fourth kind of item alongside nodes, atomic values, and function items.

The string value and typed value of an external object are the same; they are the xs:string value that results from calling its toString() method.

The handling of extension items that wrap a Java null has been tidied up. When a Java extension function returns null, this is mapped to an XDM empty sequence, not to an extension item that wraps null. When an empty sequence is supplied to a Java extension function that expects an Object, the value null will be passed to the Java method. Extension items are therefore no longer allowed to wrap a Java null.

With reflexive extension functions, the handling of Java arrays has been improved; the component type of the array is now considered when deciding which of several overloaded methods is the best fit.

With reflexive extension functions, if there are two methods where one expects a String and the other a CharSequence, the one expecting a CharSequence is now preferred. Previously the function call was rejected as ambiguous. Although there is no particular reason for choosing one rather than the other, it's likely that both methods will have the same effect, so choosing one arbitrarily is better than reporting an error.

With reflexive extension functions, if there are two methods of the right name and arity, the decision which to use is now postponed until after type checking has been done. This is particularly useful when arguments are supplied in the form of variable references. Previously the decision was postponed only if the early analysis showed both methods as equally preferred; now it is always postponed.

When a constructor is called the code now does what the documentation has always said: a wrapped external object is returned with no conversion. So for example Date:new() will return a wrapped java.util.Date object rather than an instance of xs:dateTime, and HashMap:new() will return a wrapped HashMap rather than an XPath map. The code also avoids doing conversions other than object unwrapping for the first argument of a call to an instance method.

When a reflexive extension function throws an exception, the exception details are now captured in an error that can be caught using the try/catch capabilities in XSLT 3.0 and XQuery 3.0. In particular, the error code is a QName whose local name is the class name of the exception type.

It is now possible to use the reflexive call mechanism to call instance-level methods of the implementation classes for XDM values. For example, the following gets the current Julian instant:

<xsl:value-of select="d:toJulianInstant(current-dateTime())"/>

When an instance of java.util.Map is passed as a parameter to a stylesheet or query, then by default it is accessible within the stylesheet or query as an XPath 3.0 map object. However, if the parameter is declared with a required type of jt:java.util.Map, then it will instead be retained as an external object wrapping the java.util.Map, which means for example that (with the usual caveats and disclaimers) it is updateable by calling its put() method as an extension function.

The SQL Extension

Added the value #auto for the column-tag and row-tag attributes of sql:query, and made this the default; when this value is used, the element names used for rows and cells are taken from the table name used in the query, and the column names returned in the query metadata. To get the old default behaviour, specify row-tag="row" and column-tag="col".


Support for XHTML 5 serialization, as defined in the 3.0 Serialization specification, is available. (Use <xsl:output method="xhtml" html-version="5.0"/>).

Comments and processing instructions no longer prevent an immediately following start tag being indented if it would have been indented in the absence of the comment or PI.

(However, comments and PIs are not themselves indented, because it is not possible to do this safely without looking ahead to see if the comment or PI is followed by significant character content.)

I've been working on a stylesheet that converts the W3C XQuery/XPath 3.0 test suite into an XSLT 3.0 test suite. The idea is that all the XPath tests only need to be written once, allowing them to form XSLT tests and thus avoiding the duplication of effort that arises if separate test suites are written.

The stylesheet generates 19,700 output files (38Mb) from 1,863 input files (122Mb) so it's a good test in its own right. I'm therefore running it using the almost-ready Saxon 9.5 as part of the system testing before we ship the product. Here are a few points I'd like to highlight.

Firstly, some files are copied unchanged. For this purpose the extension function file:copy(), which is part of the EXPath file module, comes in very handy. This library will be included as standard with Saxon 9.5 (like all extensions, it will be in the PE and EE editions only).

Secondly, I followed a development approach which I've found very useful and would like to share. There's a schema for the target document vocabulary (the XSLT 3 test suite metadata), so it makes sense to use a schema-aware stylesheet that validates the result documents as they are produced. However, I don't switch on validation until the transformation is "roughly working", that is, until it produces output that looks correct to the naked eye. The reason for that is simply that the validation error messages are a nuisance during the stage when you know the transformation is incomplete and incorrect. It's only when the transformation is running without errors and handling all the input that you need to start checking that the output is correct, and at this stage adding an xsl:import-schema declaration for the output schema, plus the attribute validation="strict" on the xsl:result-document instruction, is absolutely invaluable.

Except, I found it could be improved. The validation error messages tell you what's wrong (for example, you've output an attribute value which isn't allowed for that attribute), and they tell you where the code is in the stylesheet that caused the error (the offending xsl:attribute or xsl:copy instruction), but they weren't telling you what input was being processed at the time: where did the offending attribute value come from? So I've fixed that, or at least tried to. It's one of those cases that sounds easy but is actually quite tricky, because the code that detects the validation error is very remote from any code that knows where we are in the source document (or indeed, which of 1,863 source documents we are processing) at the time. The way I've done this is that the PipelineConfiguration - which holds all the context information used by a push pipeline, including schema validation - now includes a stack of items representing items being processed using xsl:apply-templates; so the validation error message can tell you what the latest node processed by a template rule is. Doesn't work unfortunately for those who use monolithic xsl:for-each structures, or for those who use XQuery - but this is specifically focussed on this kind of workload handling many inputs and many outputs, for which XQuery is unsuitable anyway. There is of course a small performance hit for maintaining this stack, but I'm confident it's very small, and I always reckon that a small amount of machine time to save a large amount of programmer time is a good trade-off.
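The mechanism can be sketched in a few lines (a minimal illustration of the idea, not the actual PipelineConfiguration API): a stack that is pushed on entry to a template rule and popped on exit, so an error raised deep in the pipeline can report the most recently pushed source item.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ContextStackDemo {
    // stack of items currently being processed by template rules
    static final Deque<String> contextStack = new ArrayDeque<>();

    // Simulate processing one source item; a failure deep inside the
    // pipeline can now report the item on top of the stack.
    static String process(String sourceItem, boolean fail) {
        contextStack.push(sourceItem);
        try {
            return fail
                ? "validation error while processing " + contextStack.peek()
                : "ok";
        } finally {
            contextStack.pop();   // always popped, even on error paths
        }
    }

    public static void main(String[] args) {
        System.out.println(process("doc1.xml#node-17", true));
        System.out.println(process("doc2.xml#node-3", false));
    }
}
```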

The third thing I wanted to mention in this article is asynchronous processing of xsl:result-document. This is a new feature for Saxon-EE 9.5. By default, an xsl:result-document instruction executes in its own thread, independently of the rest of the stylesheet. Threads are allocated from a pool, whose size defaults to the number of processors available (on my Mac laptop, that's 8). If no free threads are available, the instruction executes in the current thread.
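Java's standard ThreadPoolExecutor can sketch this allocation policy (an illustration of the behaviour described, not Saxon's actual implementation): a bounded pool whose rejection handler runs the task on the submitting thread whenever no pool thread is free.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
    // Run 'tasks' jobs on a bounded pool; when no pool thread is free,
    // CallerRunsPolicy executes the job on the submitting thread - the
    // same fallback behaviour described above.
    static int runAll(int tasks, int poolSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        // pool size defaulting to the number of processors, as above
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println(runAll(100, processors));
    }
}
```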

I did some experiments on the performance as the number of available threads is increased:

1 thread  - 14.3s
4 threads - 8.52s
8 threads - 8.92s
12 threads - 9.81s
16 threads - 10.6s

so the default of 8 appears to be a reasonable choice. The saving in elapsed time by using multi-threading seems to be well worth having, though frankly, for a task like this, 14s is already fantastically good performance and any improvement is simply icing on the cake. It's becoming quite hard to find workloads where performance improvements are genuinely needed.

Incidentally, switching bytecode generation off has almost no noticeable effect on this transformation, and the same is true when optimization is switched off. That's because it's a very simple transformation that happens to be processing a large amount of data.

The xsl:result-document instruction is a good candidate for parallel processing because there's no need to combine its results with those of any other instruction, which minimizes the synchronization overheads that usually come with parallel processing. However, there is still some synchronization needed, for example because global variables are evaluated lazily, so there is the chance that two threads will encounter the same unevaluated variable. Unfortunately finding all these cases is very difficult to do methodically, and this test stylesheet found a few bugs that needed to be addressed. Realistically, it's likely there will be further bugs that we won't hit until after release. So we've been careful to make sure the feature can be disabled - that's proved very useful with bytecode generation, which is another area where it inevitably took a few months of field exposure to iron out the last bugs.

At Saxonica, we have for a long time used a tailor-made Java application to create and issue licenses for all the commercial products we develop. There is no real database at the back end, just a local XML file with customer details and copies of the licenses created and issued. For a one-man company this poses no real problem, but as the company has expanded over the last two years it has inevitably become a major concern.

Early last year Mike Kay presented me with the task of creating a new saxon-license application with the following requirements:

  • Accessible to all employees, preferably web-based
  • Core Java tool should remain intact 
  • Centralised store, must be XML-based
  • Secure application

From the outset we thought that, for a tool which relies so heavily on XML and XSLT at its core, these requirements would be best met by using XSLTForms and Servlex to develop it.

In this blog post I would like to share my experiences developing the saxon-license webapp using Servlex and XSLTForms. In the discussion I include how we stitched in our existing back-end core Java tool, and the challenges we faced with encoding. Specific details of the features and functions are not important here; only the engineering process is of interest.

On the client side we write XForms [1] documents, which are processed by XSLTForms [2] (created by Alain Couthures) to render in the browser. XSLTForms is an open-source client-side implementation, not a plug-in or install, that works with all major browsers.

On the server side we integrate the core Saxon-license tool into a Servlex webapp [3] as Saxon extension functions called from within XSLT. Servlex is an open-source implementation of the EXPath webapp framework [4], based on Saxon and Calabash as its XSLT, XQuery and XProc processors. Servlex provides a way to write web applications directly in XSLT. It is developed as a Java EE application requiring Servlet technology, sitting on Tomcat for binding to the HTTP server side.

Saxon License-tool functionality

The server-side Servlex works as a dispatching and routing mechanism to components (implemented as XSLT stylesheets), applying a configuration-based mapping between the request URI and the component used to process that URI. The container communicates with the components by means of an XML representation of the HTTP request, and receives in turn XML data with HTML in the response body, containing XForms content and XSLTForms references to render the page. The representation of the HTTP response is sent back to the client. Buttons on the forms, when pressed, trigger HTTP requests made through the client-side XSLTForms; these requests are handled by Servlex.

There are 5 main XSLT functions, described below, which map the URIs to generate the various XForms and tunnel the instance data between them. These functions all make calls to the core Saxon-license tool written in Java, made available as Saxon extension calls from the XSLT:

  1. fnRunMainForm: A request to serve the main form is made with the following URI pattern:

    License requests are usually made through the main Saxonica website, either for evaluation or for a paid order (See: and, respectively); these orders are received as an email, which is then copied and pasted into the main form. This data is sent in the form of an XForms instance in a web request, picked up by Servlex.

  2. fnManualEntry: Manual Entry form for manual creation of the customer details to create a license. A request is made to servlex with the following URI pattern:

  3. fnFetchRecord: Existing licenses can be retrieved and re-issued. A request is made to Servlex with the following URI pattern. We observe that the parameter after the ? is the license number to fetch:

  4. fnReport: This function generates an HTML page containing all licenses created, or just the last 20.

  5. fnEditParseRequest: Manual Entry form: the manual form with the client data populated. The order request from the main form is parsed and returned as XForms instance data, which is used to generate the form on the server. A request is made to Servlex with the following URI pattern:

Securing access to the saxon-license webapp is achieved through apache2 configuration.

Encoding problem

A long-standing problem we faced in this application was the handling of non-ASCII characters. We raised this issue with Alain and Florent, the creators of XSLTForms and Servlex respectively, to get to the bottom of it.

Basically, if the user enters data on a form, we send it back to the server in a url-encoded POST message, and it emerges from Servlex in the form of XML presented as a string; if there are non-ASCII characters, they are miscoded. In the form we set the submission method attribute to 'xml-urlencoded-post' to guarantee that the next page fully replaces the current one: XMLHttpRequest is not used in this case.

We were seeing the typical pattern that you get when the input characters are encoded as a UTF-8 byte sequence and the byte sequence is then decoded to characters by someone who believes it to be ISO 8859-1. We were not able to work out where the incorrect decoding was happening. We originally circumvented the problem by reversing the error: we converted the string back to bytes, treating each char as a byte, and then decoded the bytes as UTF-8.
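This round-trip can be demonstrated in a few lines of Java (an illustration of the reversal trick described above; the class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Reverse the mis-decoding: re-encode each char as a Latin-1 byte,
    // then decode the resulting byte sequence as UTF-8.
    static String fix(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "O'Neil\u20ac";
        // Simulate the bug: UTF-8 bytes decoded as if they were Latin-1
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);       // the familiar "â..." pattern
        System.out.println(fix(garbled));  // round-trips back to the original
    }
}
```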

A feature of XSLTForms is the profiler (enabled by pressing F1 or setting debug='yes' in the xsltforms-options processing instruction). The profiler allows inspection of the instance data. Another mechanism is to inspect the requests sent by the browser with the network profiler of a debugger.

We established that on the client side, there is an HTML Form Element that gets built, and just before the submit() method gets called on this object, the data appears to be OK. But when we look at the Tomcat log of the POST request, it's wrong. Somewhere between the form.submit() on the client and the logging of the message on the server, it's getting corrupted. We can't actually see where the encoding and decoding is happening between these two points.

To tackle this problem Florent provided a development version of Servlex which added logging of the octets as they are read from the binary stream (the logger org.expath.servlex must be set to trace, which should be the default in that version), as well as logging the raw headers as they are read by Tomcat.

With this new version of Servlex in place I entered the following data on the main form. We observe that the euro symbol at the end of my first name 'O'Neil' is a non-ASCII character which needs to be preserved:

First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address:


Agree to Terms: checked

After submitting this data to the URI pattern .../app/license-tool/editParseRequest, we see below the log data reported by Tomcat. What is interesting is the line 'DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded'. Also, at this stage the input to the receiving form has been corrupted to 'O'Neilâ&#x82', which should be 'O'Neil€':

DEBUG [2013-03-04 18:06:34,279]: Request - servlet  : parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - path     : /parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - method   : POST
DEBUG [2013-03-04 18:06:34,280]: Request - uri      : http://localhost:8080/app/license-tool/parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - authority: http://localhost:8080
DEBUG [2013-03-04 18:06:34,280]: Request - ctxt_root: /app/license-tool
DEBUG [2013-03-04 18:06:34,280]: Request - param    : postdata / <Document><Data>First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address:


Agree to Terms: checked </Data><Options><Confirmed>false</Confirmed><Create>false</Create><Send>false</Send><Generate>false</Generate><Existing/></Options></Document>
DEBUG [2013-03-04 18:06:34,281]: Request - header   : host / localhost:8080
DEBUG [2013-03-04 18:06:34,281]: Request - header   : user-agent / Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept / text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-language / en-gb,en;q=0.5
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-encoding / gzip, deflate
DEBUG [2013-03-04 18:06:34,281]: Request - header   : referer / http://localhost:8080/app/license-tool/main
DEBUG [2013-03-04 18:06:34,281]: Request - header   : connection / keep-alive
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-length / 482
DEBUG [2013-03-04 18:06:34,281]: Raw body content type: application/x-www-form-urlencoded
TRACE [2013-03-04 18:06:34,281]: TraceInputStream(org.apache.catalina.connector.CoyoteInputStream@771eeb)
TRACE [2013-03-04 18:06:34,282]: read([B@1a70476): -1

Florent made the following observations:

  1. The content-type is application/x-www-form-urlencoded, which should conform to, but seems not to: the XML seems to be passed as-is, instead of being split into individual elements and their string values. But I am not an expert on XForms so I might be wrong.

  2. Still about application/x-www-form-urlencoded and the same section, it says that the non-ASCII characters are replaced based on the octets of their UTF-8 representation, so the encoding should not be used here.  This content-type does not carry any charset parameter anyway, if I am right.

  3. Again about application/x-www-form-urlencoded: it is actually handled by Java EE as parameters, instead of simply giving the raw POST entity content. I am not sure exactly how it works with respect to the encoding.

Alain provided the following example to test the assumptions made by Florent.

<html xmlns="" xmlns:xf="" xmlns:ev="">
  <head>
    <title>Encoding Test</title>
    <xf:model>
      <xf:instance><data xmlns=""/></xf:instance>
      <xf:submission id="s01" method="xml-urlencoded-post" replace="all" action="">
        <xf:message level="modeless" ev:event="xforms-submit-error">Submit error.</xf:message>
      </xf:submission>
    </xf:model>
  </head>
  <body>
    <xf:input ref="."/>
  </body>
</html>

<html xmlns="">
  <head><title>HTTP XML POST Dump</title></head>
  <body>
    <h1>HTTP XML POST Dump</h1>
    <h2>Raw Data :</h2>
    <?php
    $body = file_get_contents("php://input");
    echo strlen($body);
    echo " bytes: <br/>";
    echo "<pre>$body</pre>";
    if (substr($body, 0, 9) == "postdata=") {
        $body = urldecode(substr($body, strpos($body, "=") + 1));
    }
    $xml = new DOMDocument();
    $xml->loadXML($body);
    $xslt = new XSLTProcessor();
    $xsl = new DOMDocument();
    $indent = "<xsl:stylesheet xmlns:xsl=\"\" version=\"1.0\"><xsl:output method=\"xml\" indent=\"yes\" encoding=\"UTF-8\"/><xsl:template match=\"@*|node()\"><xsl:copy-of select=\".\"/></xsl:template></xsl:stylesheet>";
    $xsl->loadXML($indent);
    $xslt->importStylesheet($xsl);
    $result = $xslt->transformToXml($xml);
    $result = substr($result, strpos($result, "?>") + 3);
    echo "<h2>Indented XML :</h2><pre>" . htmlspecialchars($result, ENT_QUOTES) . "</pre>";
    ?>
  </body>
</html>

When submitting '€', I get this:

Raw Data :
41 bytes:
Indented XML :

and with Firebug, I can see the following, which is correct:

[Screenshot (saxon-license img3): Firebug view of the submitted request data]

Florent states:

What should be in the content of the HTTP request is %E2%82%AC to represent the Euro symbol as URL-encoded (because that represents the 3 octets of € in UTF-8).
Because of the "automatic" handling of that Content-Type by Java EE, I am afraid the only way to know for sure what is on the wire is to actually look into it (using a packet sniffer, like Wireshark for instance).

At this stage it was important to check what packets were being sent. The following is a snippet of the report from Wireshark; the data format is correct at this point.

HTTP    1207    POST /app/license-tool/parseRequest HTTP/1.1  (application/x-www-form-urlencoded)

[truncated] postdata=%3CDocument+xmlns%3D%22%22%3E%3CData%3EFirst+Name%3A+O%27Neil%E2%82%AC%0D%0A%0D%0ALast+Name%3A+Delpratt%0D%0A%0D%0ACompany%3A+Saxonica%0D%0A%0D%0ACountry%3A+United+Kingdom%0D%0A%0D%0AEmail+Address%3A+oneil%40saxonica.c ...

Florent discovered, using Alain's test case, that it was actually Tomcat itself interpreting the %xx escapes as Latin-1! More info at:

In summary: the decoding was done using ISO 8859-1, not UTF-8 as one would expect.
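This summary can be verified with the JDK directly: URLEncoder produces the three %xx escapes for the UTF-8 octets of the Euro sign, and decoding those escapes as Latin-1 reproduces exactly the corruption seen in the logs.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlEncodeDemo {
    public static void main(String[] args) {
        // The three UTF-8 octets of the Euro sign (E2 82 AC) as %xx escapes
        String encoded = URLEncoder.encode("\u20ac", StandardCharsets.UTF_8);
        System.out.println(encoded); // %E2%82%AC

        // Decoding those escapes as Latin-1 (as Tomcat did by default)
        // yields the mojibake seen in the logs instead of the Euro sign
        String wrong = URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
        String right = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("\u20ac"));  // false
        System.out.println(right.equals("\u20ac"));  // true
    }
}
```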

To overcome the problem Florent created a new config property for Servlex, named org.expath.servlex.default.charset, the value of which can be set to "UTF-8" in Tomcat's conf/ If set, its value is used as the charset for requests without an explicit charset in the Content-Type.
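The rule that property implements can be sketched as follows (a hypothetical helper, not Servlex's actual code): honour an explicit charset parameter on the Content-Type when one is present, otherwise fall back to the configured default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Return the charset named in the Content-Type header if present,
    // otherwise the configured default (e.g. UTF-8).
    static Charset charsetOf(String contentType, Charset configuredDefault) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase().startsWith("charset=")) {
                    return Charset.forName(p.substring("charset=".length()));
                }
            }
        }
        return configuredDefault;
    }

    public static void main(String[] args) {
        // no charset parameter: the configured default applies
        System.out.println(charsetOf("application/x-www-form-urlencoded",
                                     StandardCharsets.UTF_8));
        // explicit charset parameter: honoured as given
        System.out.println(charsetOf("text/xml; charset=ISO-8859-1",
                                     StandardCharsets.UTF_8));
    }
}
```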

Thanks to Florent, Alain and Mike, the encoding problem has now been resolved. The lesson learned, above all, is that tracking down encoding problems can still be very hard work.

[1] XForms. W3C.
[2] XSLTForms. Alain Couthures.
[3] Servlex. Florent Georges. GitHub:  Google Project:
[4] EXPath Webapp.

XSLT in MicroXML? - Saxon diaries

| No Comments | No TrackBacks
Listening to Uche Ogbuji's talk at XML Prague this morning, it was great news to hear that namespaces are being thrown out of the window[1]. The complexity that namespaces add to the XML stack is vast, and the value they add is tiny. 

But it raises the question, can one represent an XSLT stylesheet as an XML document without using namespaces? Clearly there needs to be some way to distinguish XSLT instructions from literal result elements, and this seems to be one of the more justifiable uses of namespaces.

My immediate thought is, let the outermost element of an XSLT stylesheet be


where xxx is any prefix you care to choose; and then let all XSLT instructions in the stylesheet use the same prefix "xxx". For most people "xsl" would be a suitable prefix, so you would write


  <xsl.template match="/">
      <xsl.copy-of select="."/>


It takes a bit of getting used to, typing "." instead of ":", but apart from that it seems to work perfectly well. And you don't need xsl:namespace-alias any more to read or write XSLT stylesheets, which must be good: you simply use a different prefix for the output stylesheet.

Norm Walsh tweeted that XSLT was claiming ownership of all names ending in ".transform". But that's not the case. The name doesn't have global significance; it only has significance in the context of the input supplied to an XSLT processor. Names are local and contextual, that's the point. Arguably there is still more redundancy than we need - we know the document we are reading is a stylesheet, so why do we need a top-level element to confirm the fact? (That's not to say all redundancy is bad, just to make the point that it's redundant).

[1] Defenestration seems the thing to do when in Prague 

Further thoughts

It's not only the XSLT namespace and its use in XSLT instructions that needs to be addressed, of course.

Namespaces in the names of variables, templates, modes, keys, decimal formats and so on can safely be ignored. Hardly anyone uses them anyway.

Functions are a bit more awkward. XSLT has never seriously committed to the use of the "fn" namespace for system functions (it's always the default, so no prefix is ever needed), but it does require user-written functions to be in a namespace to distinguish them from system-supplied functions; and external function libraries also use namespaces in a similar way. I've always felt the right approach for resolving function names is a search list - it shouldn't be necessary to prefix each function name to identify its library, rather it should be possible to define a list of function libraries to be searched when looking for functions with a given local name. The MicroXML philosophy would suggest getting rid of URIs from function names entirely, but that's a big change; transitionally, a list of URIs to be searched, plus the XSLT 3.0 EQName facility if you need to be more explicit, would probably meet the requirement.
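The search-list idea can be sketched in a few lines (a purely illustrative helper, not any existing XSLT processor's code): the first library in the ordered list that defines the local name wins.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FunctionSearchList {
    // Look the local name up in each library in order; first match wins.
    static Function<String, String> resolve(
            String localName,
            List<Map<String, Function<String, String>>> libraries) {
        for (Map<String, Function<String, String>> lib : libraries) {
            Function<String, String> f = lib.get(localName);
            if (f != null) return f;
        }
        return null;   // not found in any library
    }

    public static void main(String[] args) {
        Map<String, Function<String, String>> userLib =
                Map.of("upper", s -> s.toUpperCase());
        Map<String, Function<String, String>> systemLib =
                Map.of("upper", s -> "system:" + s);
        // the user library is searched before the system library
        System.out.println(resolve("upper", List.of(userLib, systemLib)).apply("abc"));
    }
}
```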

Step back, however. What are we trying to achieve? If we want to eliminate namespaces entirely from the whole stack, then we need to think about a namespace-less XSD (shudder!) or a typeless XSLT (losing integers and dates would be a big step back). If our goals are less ambitious, then what are they? Being able to transform MicroXML documents doesn't require an XSLT change at all.

I think MicroXML is a first step along a road that is not well charted. While the goals are laudable, the route to a simpler cleaner world is not a straightforward one.