Another regex rewrite - Saxon diaries


In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things usually bring long-term benefits. The risks relate to both functionality and performance. In both cases we were able to minimize the risks thanks to a good collection of test material, but we always knew that contact with real life problems might throw up things that the test suite didn't.


Why did we need the new engine? There were several factors. On Saxon-CE, the GWT technology didn't offer access to the JDK regex library, only to the Javascript regex engine, which has different syntax and semantics, and no support for non-BMP Unicode characters. We could have translated XPath regexes to Javascript regexes the way we were doing for the JDK, but the expansions to handle non-BMP characters are very cumbersome. (We could also have abandoned W3C regex conformance, but that's not my style.) For "server-side Saxon", I had been impatient with the existing design for some time. There are bugs I reported in the JDK regex engine 5 years ago which Sun/Oracle have shown no inclination to fix; the approach of translating one regex dialect to another is slow and clunky; and the JDK wasn't keeping pace with new Unicode versions. I felt it was time we brought the regex engine in-house.


I looked around for an open source regex engine suitable for forking, and chose the Apache Jakarta engine. Unlike some other options the code was clear and well commented, the performance seemed reasonable, and the license was compatible. I made all the changes needed to support the XPath regex syntax and semantics, and a few other changes including some optimizations, and it passed all the tests. In fact, it has proved very reliable; there have been no bug reports against its functionality.


But there have been two performance glitches.


  • One is the way it handles integer repetition bounds such as a{3,7}. The algorithm it uses for this is the same as Saxon uses in its schema processor, which is described in a paper by Thompson and Tobin [1]; it essentially translates such an expression to aaaa?a?a?a?, and then generates a finite state machine accordingly. This becomes problematic when the integer bounds start to become significantly large; the size of the finite state machine and the time taken to generate it both increase rapidly with size. (For the technically minded, I'm not sure how rapidly, but even if it's only linear, it's a significant problem.)

  • The second problem is the handling of backtracking. The Jakarta engine, like most regex engines, relies on backtracking to handle ambiguous expressions. Backtracking relies on maintaining a stack so you can go back to where you were. In Jakarta, this data was kept on the Java program stack; each checkpoint was represented by a recursive function call. In many cases each character position in the input string is a potential checkpoint, so the depth of recursion was essentially equal to the length of the input string. That puts a limit of around 1K to 2K on the size of the input, which is not really acceptable.
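To make the first problem concrete, here is a hypothetical sketch (not Saxon's actual code) of the Thompson/Tobin-style expansion of a bounded repetition. The length of the expansion, and hence the size of the generated state machine, grows with the integer bounds:

```java
// Hypothetical illustration of the Thompson/Tobin construction described
// above: a{min,max} becomes min mandatory copies of the atom followed by
// (max - min) optional copies.
public class BoundedRepeat {
    static String expand(String atom, int min, int max) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < min; i++) sb.append(atom);           // mandatory part
        for (int i = 0; i < max - min; i++) sb.append(atom).append('?'); // optional part
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("a", 3, 7));              // aaaa?a?a?a?
        // The expansion grows linearly in max, so a{0,1000} already
        // produces a 2000-character pattern (and a correspondingly
        // large state machine):
        System.out.println(expand("a", 0, 1000).length());  // 2000
    }
}
```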


For the 9.5 release I mitigated this problem by introducing an optimization: in cases where repetition was unambiguous, the recursive calls were eliminated. This catches many cases like A*B where you know that you have to keep reading A's until you hit a B, and when you do hit a B you know you won't have to try again from somewhere in the middle of the sequence of A's. But unfortunately, ambiguous expressions are common (an example would be .*B), and we have to handle them.


So I did something that I very rarely do; I studied how the JDK regex engine avoided this problem, by following its activity through the debugger. I was surprised to see that the JDK engine doesn't actually build a finite state machine. Instead it uses a pure interpreter pattern: it builds an abstract syntax tree representing the grammar of the regex, and then evaluates the regex by direct interpretation of the constructs on this tree. The maximum recursive depth with this approach is represented by the depth of the abstract syntax tree (that is, it depends on the regex but not on the input string), and the data needed for backtracking is held on the heap rather than on the Java stack.


Over the last week I have been rewriting the back end of the (ex-)Jakarta regex engine to use a similar design. It's passing all the functionality tests; I still have some performance testing to do, but it's looking good. The front-end code, that is the regex parser, is largely unchanged, but it now generates an operation tree rather than an FSM. I retained many of the features of the existing design, such as the use of a variety of implementations of integer sets and predicates to support the different kind of character classes, but the operations have all been re-implemented to work interpretively. I departed from the JDK design by using an approach similar to Saxon's XPath interpreter, in that the implementation of each operation returns an iterator over all the strings matched by the relevant expression. The key method supported by each operation is


IntIterator iterateMatches(final REMatcher matcher, final int position)


which is passed an REMatcher (containing all necessary information about the match operation, including the input string and regex flags, plus support for captured groups and backreferences) and a start position for matching; it returns an IntIterator representing a sequence of integers which are the possible positions where a match can end.


The class handling a repeat operation such as (a|aa)* implements this operation by repeated calls on the underlying choice operation. The first choice operation will return an iterator over the two possible end positions (1,2); the next time round there are four possible positions (1,2,2,3) (there is no attempt to eliminate duplicates), and so on. The method maintains a stack of IntIterator objects representing these successive calls; initially only the first hit from each iterator is used, but when backtracking is necessary, the stack is popped, and the iterator now at the top of the stack is called to deliver its next match.
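As a toy illustration of this design (simplified, and not Saxon's actual classes: the names Op, ch, choice and seq are invented, and for brevity the iterators here are materialized eagerly where the real engine stays lazy), each operation returns an iterator over the positions at which a match of that operation can end:

```java
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// Toy sketch of the interpreter design: every operation maps a start
// position to an iterator of possible end positions.
public class TinyRegex {
    interface Op {
        PrimitiveIterator.OfInt iterateMatches(String in, int pos);
    }

    // Match one literal character.
    static Op ch(char c) {
        return (in, pos) ->
            (pos < in.length() && in.charAt(pos) == c)
                ? IntStream.of(pos + 1).iterator()
                : IntStream.empty().iterator();
    }

    // Alternation: the end positions of both branches, in order.
    static Op choice(Op a, Op b) {
        return (in, pos) ->
            IntStream.concat(drain(a.iterateMatches(in, pos)),
                             drain(b.iterateMatches(in, pos))).iterator();
    }

    // Sequence: for every end position of the first operand, try the second.
    static Op seq(Op a, Op b) {
        return (in, pos) -> {
            IntStream.Builder out = IntStream.builder();
            PrimitiveIterator.OfInt ends = a.iterateMatches(in, pos);
            while (ends.hasNext()) {
                PrimitiveIterator.OfInt ends2 = b.iterateMatches(in, ends.nextInt());
                while (ends2.hasNext()) out.add(ends2.nextInt());
            }
            return out.build().iterator();
        };
    }

    static IntStream drain(PrimitiveIterator.OfInt it) {
        IntStream.Builder b = IntStream.builder();
        while (it.hasNext()) b.add(it.nextInt());
        return b.build();
    }

    public static void main(String[] args) {
        // (a|aa) matched against "aaa" at position 0 can end at 1 or 2.
        Op aOrAa = choice(ch('a'), seq(ch('a'), ch('a')));
        PrimitiveIterator.OfInt ends = aOrAa.iterateMatches("aaa", 0);
        while (ends.hasNext()) System.out.println(ends.nextInt()); // prints 1 then 2
    }
}
```

A repeat operation would sit on top of this by stacking these iterators and popping the stack on backtracking, as described above.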


Saxon continues to take advantage of the optimization that recognizes unambiguous repetitions. With patterns such as A*B, or A{5}A, there is no need for backtracking so the stack of iterators is not used, and only the first match from each iterator is used.


There is scope for further optimization here. Some regular expression engines such as PCRE are known to handle highly-ambiguous expressions much more efficiently than this crude backtracking approach, mainly by eliminating the duplication that arises when a backtrack attempt ends up doing the same matches at the same positions as a previous attempt. Such improvements lie in the future; for the moment, being as good as the Java engine in performance, and better in features such as Unicode support and buglessness, is enough to declare success.


The new code is currently operational on the Saxon 9.6 development branch. In due course it will be retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance release. Probably only 10% of the original Jakarta code remains, but that's enough that I can't get rid of the Apache license that goes with it, for those who care about these things.


[1] Using Finite-State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. Henry Thompson and Richard Tobin. XML Europe 2003



At the XML Summer School 2013, Tony Graham presented a lightning talk about life after libxslt 1.0. I was not present at this summer school, but it was clear from the feedback I received from the discussions that there is a major gap in XSLT 2.0 support for the large developer community in the C/Perl/PHP/Python/Ruby world and the associated tools that rely on libxslt.

It is a known problem which has, to my knowledge, never been addressed. At Saxonica, we wanted to try and plug this gap by porting the Saxon processor from Java to C/C++, which would enable us to communicate with the languages specified above. One of our goals, if possible, was to interface with libxml and libxslt. Providing such a bridge or cross-compiled version of a full-fledged Java application to C/C++ is always a daunting task. In this blog post I discuss the technical steps in our quest to achieve our goals and give some details of the experiences gained along the way. I will begin by detailing the various technologies that we tried, and how we ended up using a commercial Java native compiler after several failed attempts with tools that either did not work, were cumbersome, or were just too error-prone.

LLVM

At the summer school there were discussions suggesting that LLVM could do the job of compiling Java to native code. As claimed on the project website, LLVM is a collection of modular and reusable compiler and toolchain technologies. The LLVM project seems very active, with many projects using it for various tasks, but I found it difficult to get anything working. In particular, I tried using VMKit, which relies on LLVM, to compile a simple 'Hello World' example to machine code, but even that seemed cumbersome.

GCJ

Secondly, I looked at the GCJ technology. GCJ is a tool that I have used before, so I was confident that it would work. However, my past experience is that this tool can be error-prone and contains long-standing bugs; since the project has been dormant for several years, it seems unlikely that these bugs will ever be fixed. The other worrying fact is that GCJ only supports up to JDK 1.5. Nevertheless, for lack of other options, I persevered with GCJ, and had much better success: I managed to compile Saxon-HE to native machine code and actually got it to execute my example stylesheets. I had some problems because of classes that were not present in the GCJ implementation of JDK 1.5, such as the packages java.math and javax.xml. Therefore, I had to include my own version of these packages.

The next step was to create a shared library of Saxon-HE, so that I could interface it with C/C++. This proved to be a real battle, but in the end I succeeded. I decided to use the Compiled Native Interface (CNI), which presents a convenient way to write Java native methods using C++. The alternative was JNI (Java Native Interface), which may be viewed as more portable. Both interfaces have similar principles: you need a Java/CNI-aware C++ compiler (any recent version of G++ is capable), and you must include the header file for each Java class you use. These header files, if not generated automatically, can be produced using gcjh. I soon gave up on GCJ: I stumbled upon a few known bugs, and reasoned that if I was having major issues with the setup and prerequisites, then surely users would have the same problems.

Excelsior JET

The Excelsior JET tool is the final technology we looked at, and thankfully it is what we have ended up using in the alpha release. JET is a commercial product that provides a Java native compiler for both Linux and Windows platforms. What is good about this software tool is that it provides an easy-to-use graphical interface to build native executables and shared libraries from jar file(s). It also has the ability to package up the software into an installer ready to be deployed onto its intended host machine. This was great for us!

There is a lot I could write about JET, but it would be a repeat of the plethora of information currently available on their website and forum. It is worth mentioning that we started with their evaluation version, which offers 90 days' free usage of the software before purchasing the professional edition. Another point of interest is that Excelsior offer a free-of-charge license for use in conjunction with open-source software.

We know that there will be some sections of the open-source community that will dislike the dependency on a commercial tool, but it is not that dissimilar from the early years of Java, when the Sun compiler was freely available but not open-sourced.

Implementation notes using JET

After creating the shared library, to interface it with C/C++ I used JNI. It is possible to use JET's own Java interface to external functions called xFunction, which is recommended if starting from scratch, but having used JNI with GCJ I continued with that approach. To get started there are a few examples of invoking a library with C/C++. In essence, you need to load the library and initialize the JET run-time before you can use it, see the code below (from the file xsltProcessor.cc):


/* Load dll. */
HANDLE loadDll(char* name)
{
  HANDLE hDll = LoadLibrary (name);

  if (!hDll) {
    printf ("Unable to load %s\n", name);
    exit(1);
  }

  printf ("%s loaded\n", name);
  return hDll;
}

extern "C" {jint (JNICALL * JNI_GetDefaultJavaVMInitArgs_func) (void *args);
            jint (JNICALL * JNI_CreateJavaVM_func) (JavaVM **pvm, void **penv, void *args);
}

/*Initialize JET run-time.*/
extern "C" void initJavaRT(HANDLE myDllHandle, JavaVM** pjvm, JNIEnv** penv) {
  int result;
  JavaVMInitArgs args;

  JNI_GetDefaultJavaVMInitArgs_func =
  (jint (JNICALL *) (void *args))
  GetProcAddress (myDllHandle, "JNI_GetDefaultJavaVMInitArgs");
  JNI_CreateJavaVM_func =
  (jint (JNICALL *) (JavaVM **pvm, void **penv, void *args))
  GetProcAddress (myDllHandle, "JNI_CreateJavaVM");

  if(!JNI_GetDefaultJavaVMInitArgs_func) {
    printf ("%s doesn't contain public JNI_GetDefaultJavaVMInitArgs\n", dllname);
    exit (1);
  }

  if(!JNI_CreateJavaVM_func) {
    printf ("%s doesn't contain public JNI_CreateJavaVM\n", dllname);
    exit (1);
  }

  memset (&args, 0, sizeof(args));
  args.version = JNI_VERSION_1_2;
  result = JNI_GetDefaultJavaVMInitArgs_func(&args);
  if (result != JNI_OK) {
    printf ("JNI_GetDefaultJavaVMInitArgs() failed with result %d\n", result);
    exit(1);
  }

  /* NOTE: no JVM is actually created
  * this call to JNI_CreateJavaVM is intended for JET RT initialization
  */
  result = JNI_CreateJavaVM_func (pjvm, (void **)penv, &args);
  if (result != JNI_OK) {
    printf ("JNI_CreateJavaVM() failed with result %d\n", result);
    exit(1);
  }
  printf ("JET RT initialized\n");
  fflush (stdout);
}

XsltProcessor::XsltProcessor(bool license) {
  /* * First of all, load required component.
  * By the time of JET initialization, all components should be loaded.
  */
  myDllHandle = loadDll (dllname);

  /*
  * Initialize JET run-time.
  * The handle of loaded component is used to retrieve Invocation API.
  */
  initJavaRT (myDllHandle, &jvm, &env);
  
  /* Look for class.*/
  cppClass = lookForClass(env, "net/sf/saxon/option/cpp/XsltProcessorForCpp");
  versionClass = lookForClass(env, "net/sf/saxon/Version");

  cpp = createObject (env, cppClass, "(Z)V", license);
  jmethodID debugMID = env->GetStaticMethodID(cppClass, "setDebugMode", "(Z)V");
  if(debugMID){
    env->CallStaticVoidMethod(cppClass, debugMID, (jboolean)false);
  }
  ....
}
...

In the constructor of XsltProcessor we see that, once we have loaded the library and initialized the JET run-time, we can make calls on the JNI environment to look up class definitions and create instances of those classes in the Java world, before making method calls on the resulting objects.

PHP Extension in C/C++

After successfully getting XSLT transformations to work within C/C++, the next step was to try and develop a PHP extension, which would operate like libxslt. There is a lot of material on the web and books in regards to PHP extensions and I found the following guide very useful: http://devzone.zend.com/1435/wrapping-c-classes-in-a-php-extension/. I literally followed it step-by-step, adding a few steps of my own when I worked out what I was doing.

Testing

As a proof of concept I wrote a test harness in PHP which makes use of the PHP extension (see: xslt30TestSuite.php in the download library). This is a test driver designed to run the public W3C XSLT test suite at https://dvcs.w3.org/hg/xslt30-test/. The test driver in its current form requires Saxon-EE, which is not yet available in this alpha release; nevertheless, the program may serve as a useful example of how the API can be used. Note that it is written to use libXML to read the test catalog, but to use Saxon for running the tests and assessing the results.

Performance Testing

I now draw some comparisons, based on preliminary tests, between running Saxon-HE (on Java) and running Saxon-HE/C from C++ and from PHP. I also compare these times to libxslt (C/C++). An important aim is to get a good measure of the cost of crossing the Java/C++ boundary using JNI, and also to see what the effect is with the PHP extension.

I used Saxon-HE 9.5.1.3 as the baseline. The test machine was an Intel Core i5 processor 430M laptop with 4GB memory, 2.26GHz CPU and 3MB L3 cache, running Ubuntu 13.10 Linux, with Apache 2 and PHP version 5.5.3-1ubuntu2. The Java compiler was Sun/Oracle Java 1.6.0.43.

The experiments were based on the XMark benchmark. I used query q8, which was converted into the stylesheet below. I chose q8 because we should expect some performance bottlenecks across the implementations, due to the equijoins in the query:

<result xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="2.0">
<!-- Q8.  List the names of persons and the number of items they bought.
          (joins person, closed_auction) -->

  <xsl:for-each select="/site/people/person">
    <xsl:variable name="a" 
       select="/site/closed_auctions/closed_auction[buyer/@person = current()/@id]"/>
    <item person="{name}"><xsl:value-of select="count($a)"/></item>  
  </xsl:for-each>
  
</result>  

The running times of executing q8.xsl on the document xmark64.xml, which is a 64MB document, are as follows:

Saxon-HE (Java):    60.5 seconds

Saxon-HE/C (C++): 132 seconds 

Saxon-HE/C (PHP): 137 seconds

libxslt (C/C++):        213 seconds

* Update on the times reported for Saxon-HE/C as a result of optimizations in the JET compiler.
* Code used to get libxslt time taken from: http://xmlsoft.org/XSLT/tutorial/libxslttutorial.html

The times for Saxon-HE/C exclude the cost of JET initialisation and loading the library, which accounted for only 4 seconds. So we observe that there is not a big overhead between C++ and the PHP extension. The biggest cost, as expected, is between Java and C++, where we see on the C++/PHP platform a slowdown of ~2.2x. We also observe that Saxon-HE/C outperforms libxslt on C/C++ by ~40% on q8.

See project page on Saxon/C.
 




Stripping the DOM - Saxon diaries

I discovered yesterday that Saxon-CE isn't applying xsl:strip-space directives to input documents.  An unfortunate bug: not that many users seem too bothered by it, but conformance is always important.

A reminder: xsl:strip-space and xsl:preserve-space are designed to remove "ignorable" whitespace from the application's view. Because XML  doesn't distinguish significant from insignificant whitespace (a bad design mistake), this can only be done under program control. The idea is to present the application (the stylesheet) with a view of the input tree in which the insignificant whitespace does not appear, and the xsl:strip-space and xsl:preserve-space directives allow the stylesheet author to say which whitespace is considered significant.

How to fix the bug? Server-side Saxon (what I am starting to call "Big Saxon") traditionally provides two mechanisms: either it wraps the supplied DOM in a wrapping layer that hides the offending whitespace text nodes, or it bulk-copies the supplied DOM to an internal tree structure in which the whitespace is not copied across. Both methods have disadvantages. The "wrapping" approach imposes a significant extra cost on navigation of the tree (the cost is higher than it might appear, because whitespace stripping also has to take account of xml:space attributes in the document, which means a lot of searching for attributes-of-ancestors). The bulk-copy approach imposes a significant start-up cost to the transformation, often larger than the cost of doing the real transformation, and the cost is particularly high when processing large input documents because it doubles the memory requirement.

I'm considering using a third approach: in-situ modification of the supplied DOM. For Big Saxon, I have tended to treat this option as unthinkable: you don't muck with the user's data in this way. The same DOM, after all, might be input to multiple stylesheets, perhaps concurrently (though DOM is not thread-safe so this is a bad idea....) and they might have different space-stripping directives. A Java application might pass the DOM to an XSLT stylesheet to do some processing, but might want to continue operating on the same DOM afterwards.

In Saxon-CE, however, these considerations hardly apply. In the vast majority of cases the DOM is built solely in order to act as XSLT input. Unlike Big Saxon, where one can assume that if users supply a DOM they are doing so by choice, with Saxon-CE there is no alternative: all input XML arrives this way.

In-situ clean-up of the input DOM has other attractions; it can be used to solve the other two big problems with DOM processing (both arising from a mismatch between the DOM model and the XDM model): adjacent text nodes, and namespaces. The same pre-scan that strips unwanted whitespace nodes can also concatenate adjacent text nodes (eliminating entity references and CDATA sections in the process); presenting a view of adjacent text nodes as a single node accounts for a great deal of complexity (and lines of code, and execution time) in navigating a DOM. Equally for namespaces. To determine the namespace part of the name of an element or attribute, Saxon can't simply call getNamespaceURI() on the underlying DOM node. That method is defined to have no effect if "the node was created using DOM level-1 interfaces". Saxon of course has no idea how the node was created, and DOM provides no way of finding out. So it does an optimistic call of getNamespaceURI() just in case, and if this returns null then it has no choice but to search the node's ancestors and attributes looking for namespace declarations - and since this is done every time you look at a node to see whether it satisfies an axis step, it's no wonder that Saxon can be ten times slower processing a DOM than when processing its native tree structure. We should be able to cut this cost dramatically by doing a pre-scan of the document and calling setNamespaceURI() on all element and attribute nodes.
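A minimal sketch of the kind of in-situ pre-scan being considered (hypothetical code, not what Saxon actually does): a single walk over the DOM that deletes whitespace-only text nodes and merges adjacent text nodes. A real implementation would also have to honour xsl:preserve-space directives and xml:space="preserve" attributes:

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical one-pass DOM clean-up: strip ignorable whitespace text
// nodes and merge adjacent text nodes in place.
public class DomCleaner {
    public static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getTextContent().trim().isEmpty()) {
                    node.removeChild(child);              // drop whitespace-only text
                } else if (next != null && next.getNodeType() == Node.TEXT_NODE) {
                    // merge this text node into its following sibling
                    next.setTextContent(child.getTextContent() + next.getTextContent());
                    node.removeChild(child);
                }
            } else {
                strip(child);                             // recurse into elements
            }
            child = next;
        }
    }

    // Convenience helper for the example: parse a string into a DOM.
    static Document parse(String xml) {
        try {
            return javax.xml.parsers.DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new java.io.ByteArrayInputStream(xml.getBytes()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a>  <b/>  </a>");
        strip(doc);
        // prints 1: only the <b/> element remains under <a>
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```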

The more I think about this, the more appealing the idea of doing the same thing on Big Saxon. The fact is, most people who supply a DOM as input to XSLT do so out of ignorance; they have no idea of the extra costs they are incurring, or of the fact that supplying SAX input would be ten times faster. It's important to deliver good performance not only to experts who know how to get the ultimate performance out of the system, but also to the average user who just copies JAXP code from somewhere on the web without any great thought or understanding. For that kind of user, in-situ modification of the supplied DOM might be the right answer. We can always continue to provide wrapping or bulk-copying as alternative configuration options, for people to invoke if they don't like their data being trampled upon.

Returning to Saxon-CE, the other complication is the HTML DOM. Running XPath against an HTML DOM is already sufficiently quirky (e.g. because of case-insensitive element names, namespace non-support, etc etc) that it's probably wisest to keep things as simple as possible on this front. I don't think we need to apply xsl:strip-space directives to the HTML document; there's nothing in the spec that would make this a conformance issue, and there's no usability imperative. So we'll leave well alone here.

Reducing the size of Saxon-CE - Saxon diaries

During the early stages of Saxon-CE development, the major part of the work was cutting Saxon down to size to run on the browser. That initial work was described in an earlier post: see Cutting Saxon down to size: http://dev.saxonica.com/blog/mike/2010/11/#000184

In the last couple of weeks I've resumed the effort to reduce the size of the code. This article examines the changes I've made, and the opportunities for further reduction.

Saxon-CE 1.1 is around 880Kb, and so far, I've reduced that to 780Kb. I'm measuring the size of the main Javascript download here, not the ancillary data files that may or may not be downloaded, depending on what the stylesheet actually does. Reducing the size is important because the download time causes an appreciable delay on the first visit to a site, and first impressions are important; cutting the delay from say 7 seconds to 5 seconds will result in a corresponding reduction in the number of visitors who don't bother to wait.

What has gone?

About half this reduction was achieved by rewriting two data-intensive classes to load their data from XML files on demand, rather than defining it in static initialization code. The two classes in question are Categories, which contains the mapping of Unicode characters to the categories used in regular expressions, such as Lu and Nd; and CaseVariants, used when the "i" flag appears in a regular expression to determine all the upper/lower-case variants of any character. These XML files are downloaded only if the relevant language features are used in the stylesheet.

A couple of other classes were also performing lengthy static initializing: StandardFunctions, which contains details of all the system functions in the core library, and StandardNames, which contains the names of all XSLT elements, attributes, and built-in types. In the case of StandardFunctions the amount of initialization code has been vastly reduced by introducing a micro-syntax to describe function signatures in the form of a string literal, which can be easily parsed in a few lines of code. In the case of StandardNames, the class was setting up a mapping from integer constants to QNames; this has been removed by changing the product to use QNames throughout (or in the case of types, the object instances representing each built-in type) rather than integer constants.
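As an illustration of the micro-syntax idea (the string format and class names here are invented for the example, not Saxon's actual notation), a function signature written as a one-line string can be parsed in a few lines of code, replacing much bulkier static initialization:

```java
// Hypothetical micro-syntax for function signatures, e.g.
//   "substring(xs:string, xs:double, xs:double) as xs:string"
// parsed into a small data structure in a few lines.
public class SignatureParser {
    static class Signature {
        final String name;
        final String[] argTypes;
        final String resultType;
        Signature(String name, String[] argTypes, String resultType) {
            this.name = name;
            this.argTypes = argTypes;
            this.resultType = resultType;
        }
    }

    static Signature parse(String s) {
        String name = s.substring(0, s.indexOf('(')).trim();
        String args = s.substring(s.indexOf('(') + 1, s.indexOf(')')).trim();
        String result = s.substring(s.indexOf(" as ") + 4).trim();
        String[] argTypes = args.isEmpty() ? new String[0] : args.split("\\s*,\\s*");
        return new Signature(name, argTypes, result);
    }

    public static void main(String[] args) {
        Signature sig = parse("substring(xs:string, xs:double, xs:double) as xs:string");
        System.out.println(sig.name + " takes " + sig.argTypes.length + " arguments");
    }
}
```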

Associated with this second change, the NamePool has gone. Saxon-CE was not getting any real benefit from the NamePool, because the source document is always in the form of a DOM, in which names are held as strings. So integer fingerprints were being allocated on demand, which cancels any saving from faster integer comparisons. The change to use QName objects in place of namecodes and fingerprints throughout the product was of course very extensive, but it seems to have been trouble-free. There is only one very minor loss of functionality: in cases where Saxon has to invent a prefix for use on a constructed element or attribute, it is no longer able to consult the NamePool to see what prefixes have previously been used with that URI.

The other changes are minor and tedious. They fall into a number of categories:

* Finding code that could never be executed. Although GWT automatically eliminates code that is completely unreachable, there is code that cannot be executed in practice for reasons that GWT cannot detect. Two examples: (a) code for matching types such as attribute(*, xs:date). Although a basic XSLT processor allows this syntax, all nodes in Saxon-CE are untyped, so this type will never match anything. It can therefore be replaced during parsing by a type that matches no instances, which cuts out a lot (well, a couple of Kb) of useless code. (b) There was still a lot of code designed to report the line number on which errors occurred, both statically and dynamically. But in the browser environment, the XML parser provides no line number information, so this code was useless.

* Eliminating duplication. For example, I found that there were two completely separate mechanisms for maintaining comparison keys, one mechanism used for xsl:for-each-group and distinct-values, the other used for sorting and various other operations. Although there were a few semantic differences (for example in the handling of NaN), it proved reasonably easy to combine these two mechanisms into one.

* Cutting out minor optimizations. Do we really need a separate class to handle hash maps in which the key is an unboxed integer, rather than using the standard HashMap with a boxed integer key? Is that optimization worth 1000 lines of code? Probably not. There are many similar examples.

* Cutting out small classes. Arithmetic expressions, for example, delegated the computation to one of a large number of small Calculator classes, each dedicated to one particular operation, such as (integer plus integer). This is an elegant design, but in GWT it appears to be expensive: each of these classes was generating up to 2Kb of code. Combining the classes into a set of switch expressions in the parent class appears to give a significant reduction in code size (though I have to admit I am not entirely sure why).
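The trade-off in that last category can be sketched like this (hypothetical code, not Saxon's): a single switch in the parent class replaces a family of tiny single-purpose Calculator classes, trading a little design elegance for a smaller compiled footprint:

```java
// Hypothetical consolidation: one switch instead of one class per
// arithmetic operation.
public class Arith {
    enum Op { PLUS, MINUS, TIMES }

    static long compute(Op op, long a, long b) {
        switch (op) {
            case PLUS:  return a + b;
            case MINUS: return a - b;
            case TIMES: return a * b;
        }
        throw new IllegalArgumentException("unknown operator");
    }

    public static void main(String[] args) {
        System.out.println(compute(Op.PLUS, 2, 3));   // 5
        System.out.println(compute(Op.TIMES, 4, 5));  // 20
    }
}
```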

In deciding whether to remove a particular piece of code, I find it useful to reverse the thinking: if this wasn't there already, would I consider it worthwhile adding it? This is analogous to zero-based budgeting: there is a presumption that something isn't needed unless it can be justified, rather than a presumption in favour of the status quo.

Where will the next 100Kb come from?

Getting down from 880Kb to 780Kb is useful, but it isn't enough. The target should be something like 500Kb. Finding big savings, however, gets more and more difficult. There are plenty of small salami-slices available, but they get harder to find.

Some of the original areas in my earlier post remain unaddressed. There is still a lot of duplication between the "style" classes representing the source stylesheet, and the "instruct" classes on the expression tree. Historically, it was important to allow the source stylesheet tree to be garbage collected once compilation is finished; in Saxon-CE, that reasoning doesn't apply. Probably we should keep the instruction classes on the expression tree, but have them delegate a lot more to the original "style" elements to avoid duplication.

The expression tree is a major opportunity for simplification. There are two issues: there are too many different kinds of expression (too many subclasses), and they implement too many methods. Specifically, there are 167 subclasses of Expression, and the class defines some 48 methods. Some of these classes are admittedly very small (FirstItemExpression, for example, which implements constructs of the form EXP[1], is just 157 bytes -- I haven't worked out why these classes are so much smaller than the Calculator classes mentioned above). So compiling EXP[1] to a call on subsequence(EXP,1,1) would probably give no net saving. But despite much use of superclasses to hold shared functionality, there's still a fair bit of duplication that could be eliminated, and a large number of minor optimizations that probably have little merit.

Could we drop the LinkedTree implementation of XDM? The Tinytree is already gone from Saxon-CE. The LinkedTree is only used (a) for stylesheets, and (b) for temporary trees: source documents and result documents use the Javascript DOM. (Not the GWT DOM). We could consider using the DOM also for stylesheets and for temporary trees. But it's only 20Kb. A more interesting way forward might be some kind of hybrid of the current LinkedTree and HTML-DOM-wrapped-tree; given the way Javascript works, we could consider adding properties to the DOM to make it implement the NodeInfo interface more directly, supplementing some of its deficiencies for example (notoriously) its poor support for namespaces.

There is a lot of code whose main purpose is lazy (pipelined) evaluation. This leads to a proliferation of small specialized classes. Some of these are probably an unnecessary luxury. For the index-of() function, for example, do we really need a custom iterator that searches for the next match and returns it on demand, or would it do any harm to find all the matches up-front? In some cases several iterator implementations could be combined into one: for example the index-of() iterator could be handled by a flag in the filter iterator that returns the position of the matched item instead of the item itself.
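The trade-off described here can be sketched in plain Java (these classes are invented for illustration; they are not Saxon's actual iterator classes). The lazy version finds the next matching 1-based position only when asked; the eager version materializes all positions up front in far less code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch of the index-of() trade-off discussed above:
// a custom on-demand iterator versus a simple eager scan.
public class IndexOfSketch {

    // Lazy: advances to the next matching 1-based position only on demand.
    static Iterator<Integer> lazyIndexOf(List<String> seq, String search) {
        return new Iterator<Integer>() {
            private int pos = 0;              // 0-based cursor into seq
            private Integer next = advance();

            private Integer advance() {
                while (pos < seq.size()) {
                    pos++;
                    if (seq.get(pos - 1).equals(search)) {
                        return pos;           // positions are 1-based in XPath
                    }
                }
                return null;
            }
            public boolean hasNext() { return next != null; }
            public Integer next() {
                if (next == null) throw new NoSuchElementException();
                Integer result = next;
                next = advance();
                return result;
            }
        };
    }

    // Eager: finds all matches up front - much less code, and for typical
    // index-of() calls probably no slower in practice.
    static List<Integer> eagerIndexOf(List<String> seq, String search) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < seq.size(); i++) {
            if (seq.get(i).equals(search)) result.add(i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> seq = List.of("a", "b", "a", "c", "a");
        lazyIndexOf(seq, "a").forEachRemaining(System.out::println);
        System.out.println(eagerIndexOf(seq, "a"));
    }
}
```

The eager version is the kind of simplification the paragraph argues for: it gives up pipelining, but eliminates a whole specialized class.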

Generally, I think that finding another 100Kb of discardable code looks feasible, though it's a lot of work.

Note: Saxon-CE today is 64,800 non-comment lines of Java code. Saxon-EE for comparison is 261,100 non-comment lines.

Update two weeks later:

This kind of exercise can become a bit obsessive. I'm now down just below 700Kb, and at the Java level the number of source lines is down to about 55000. It's amazing how much dead wood there is, and it's a great pleasure to get rid of it. By and large, most of the reductions have been achieved without adverse effects that anyone will notice, and in some cases it's been a real opportunity to tidy up the code and improve it. I'm very inclined to port some of the changes back to "big Saxon" (that is, the server-side code base).

The changes made fall into a number of general categories:

  • Avoiding duplication. For example, the code to do compile-time static type checking of function arguments (to implement the function conversion rules) was extremely similar to the code for dynamic conversion of supplied stylesheet parameters to their required type. A little bit of thought enabled the two to be combined. There was extensive duplication in the code for iterating different axes for different tree models. The code for xsl:next-match was about 95% the same as the code for xsl:apply-imports. There was perhaps 70% code overlap in the iterators for the union, intersect, and except operators. There were half-a-dozen places in the product containing almost identical code for converting a lexical QName to an expanded QName given a namespace context. In some cases the duplication could be justified for "big Saxon", because of the imperative to minimize run-time pathlengths. But that's the wrong trade-off for Saxon-CE; the size of the code really does matter.
  • Eliminating trivial optimizations. I already mentioned the "integer hash-map" example. There are plenty of examples of static rewrite optimizations that really don't justify themselves: rewriting string-length($x) = 0 as not(string($x)) is an example. You have to look at these and ask (a) whether the saving achieved is significant, and (b) whether the rule will fire often enough to be worthwhile. It's easy to make mistakes on both counts - nearly all these optimizations were added at some time because of a real use case where they made a significant difference. But over time they accumulate and gather dust; sometimes they cease to be relevant because something else has changed elsewhere in the system.
  • Avoiding over-generalized code. GWT does a good job of identifying methods that are never called. But it can't spot methods that are capable of doing 17 things, and are only ever called upon to do one of those 17 things.
  • Rewriting the code to be more compact. A couple of examples here: the code for checking the select attribute of xsl:value-of now reads "checkAttribute("select", "e")", where "select" is the name of the attribute, and "e" indicates that its value must be an expression. The previous code was about four times this long, and the extra stuff was essentially the same for every attribute of every element. Similarly, the code for registering details of the signatures of built-in functions, and the code for enumerating the subexpressions of every kind of expression: these fragments of code occur very often, and cutting them down from 4 lines to 1 can make a big difference overall.
  • Simplifying error handling. Good diagnostics for errors are still important, but there was a lot of verbosity in this area that could go. If a date is invalid, how much detail do you need to give the user about the fact? Might it not be good enough to tell them that 2013-03-59 is an invalid date?
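The checkAttribute idea above can be sketched as a table-driven validator (the codes and checks below are invented for illustration; Saxon's real codes and validation routines are different). A one-character code selects a shared routine, replacing near-identical per-attribute boilerplate:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the compact checkAttribute("select", "e") style:
// a short code string selects a shared validation routine.
public class AttributeChecker {

    // Code -> validation predicate. "e" = expression, "q" = QName, "b" = yes/no.
    // These checks are crude stand-ins for real parsing.
    private static final Map<String, Predicate<String>> CHECKS = Map.of(
        "e", v -> !v.isBlank(),                       // stand-in for an XPath parse
        "q", v -> v.matches("[\\w.-]+(:[\\w.-]+)?"),  // rough QName shape
        "b", v -> v.equals("yes") || v.equals("no")
    );

    static void checkAttribute(String name, String code, String value) {
        Predicate<String> check = CHECKS.get(code);
        if (check == null || !check.test(value)) {
            throw new IllegalArgumentException(
                "Invalid value '" + value + "' for attribute " + name);
        }
    }

    public static void main(String[] args) {
        checkAttribute("select", "e", "@price * 2");            // passes
        checkAttribute("disable-output-escaping", "b", "yes");  // passes
        try {
            checkAttribute("disable-output-escaping", "b", "maybe");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Each attribute declaration collapses to one line, which is where the claimed four-to-one reduction comes from.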

What's new in Saxon 9.5? - Saxon diaries


The following is a provisional list of changes expected in the Saxon 9.5 release, which is in the final testing phase. There could still be changes to this list, for example a feature could be dropped if it is found to be unreliable during testing.

Optimization

The xsl:result-document instruction in Saxon-EE is now asynchronous. That is, the code to output the result document runs in a separate thread, in parallel with other processing. The maximum number of threads used by xsl:result-document instructions is limited by the configuration option FeatureKeys.RESULT_DOCUMENT_THREADS which defaults to the number of processors available to the Java VM; setting this to zero or one will suppress multithreading. Setting FeatureKeys.ALLOW_MULTITHREADING to false has the same effect. (This can be useful when debugging, because otherwise the output from xsl:message and fn:trace() can be very confusing).

The facility can also potentially cause problems if the code calls extension functions that have side-effects. Multi-threading can therefore be controlled, if required, using the saxon:asynchronous attribute on the xsl:result-document instruction: use saxon:asynchronous="no" to suppress multi-threading. Asynchronous processing of xsl:result-document is automatically suppressed if tracing (using a TraceListener) is enabled.

The collection() function is also now multi-threaded in Saxon-EE. Each document in the collection is parsed in a separate thread, and the documents are processed in the order in which parsing completes. This makes the order of the documents less predictable than in previous releases, though it was never guaranteed or documented.
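The "processed in the order in which parsing completes" behaviour corresponds to a standard java.util.concurrent pattern, sketched below. This is illustrative only, not Saxon's actual implementation; the sleep stands in for parsing time:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the scheduling pattern described for collection():
// parse each document in its own thread, consume in completion order.
public class CompletionOrderDemo {

    static List<String> parseAll(List<String> uris) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> done = new ExecutorCompletionService<>(pool);
        for (String uri : uris) {
            done.submit(() -> {
                // Stand-in for XML parsing: longer "documents" take longer.
                Thread.sleep(uri.length() * 20L);
                return uri;
            });
        }
        List<String> order = new ArrayList<>();
        for (int i = 0; i < uris.size(); i++) {
            order.add(done.take().get());  // whichever parse finished first
        }
        pool.shutdown();
        return order;
    }

    public static void main(String[] args) throws Exception {
        // Submission order is not completion order.
        System.out.println(parseAll(List.of("long-document.xml", "a.xml", "mid.xml")));
    }
}
```

This is exactly why the document order becomes less predictable: the consumer sees results as they finish, not as they were submitted.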

Extensions

New XSLT 3.0 instructions such as xsl:iterate and xsl:try no longer have synonyms in the Saxon namespace.

Extension function saxon:for-each-group() is dropped (superseded by "group by" in XQuery).

New extension functions saxon:schema() and saxon:type are available, giving access to schema information. The saxon:schema() function obtains information from all the schema components available in the query or stylesheet; the saxon:type function gives information about the type annotation of a node. In both cases, the information is returned as a function item which behaves like a map from schema component property names to values; for example the name of the type annotation of a node is given by saxon:type($node)('name').

A new extension function saxon:send-mail() is available to send email via an SMTP server.

A new extension function saxon:key-map() is available: this allows the index constructed using xsl:key to be viewed as a map, giving extra functionality such as getting all the entries in a given range, enumerating the key values, concatenating keys across multiple documents, processing the entries in sorted order, etc.

The third argument of saxon:transform(), which supplies parameters to an XSLT transformation, may now take the form of a map, allowing the parameter value to be any value whatsoever (the old mechanism restricted it to atomic values.)

The extension function saxon:index has changed to expect a function as its second argument rather than a prepared expression, and it now returns a map which can be accessed using all the standard map functions. The extension function saxon:find is now a synonym for map:get. There is no longer an option to specify a collation.

A new flag "v" has been added to saxon:deep-equal() to suppress the check that two elements have the same "variety" of type: for example if one has element-only content, the other must have element-only content. This check was not performed in previous releases; in this release it is performed by default (as required by the fn:deep-equal() specification), but may be suppressed using this option. The option is useful when comparing validated and unvalidated documents.

The proposed EXPath file module (see http://www.expath.org/modules/file/) is implemented (in Saxon-PE and -EE). This provides a number of extension functions for reading and writing files in filestore, creating and deleting files and directories, listing the contents of directories, and so on.

The EXPath zip module (see http://expath.org/spec/zip) is implemented. The implementation is derived from the original implementation by Florent Georges, but has been more closely integrated into Saxon and more thoroughly tested. This module is open source code; the extensions are integrated into the Saxon-PE and Saxon-EE distribution, and are available to Saxon-HE users in source code form, where they can be linked to the product in the same way as any other extension functions.

A new serialization option saxon:attribute-order is available. The value is a whitespace-separated list of attribute names. Attributes whose names are present in the list come first, in the order specified; other attributes follow, sorted first by namespace URI and then by local name.
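The ordering rule just described can be expressed as a comparator. The sketch below is illustrative only (it uses local names where the real option takes attribute names, and the AttName record is invented), but the sort logic matches the description: listed names first, in list order, then the rest by namespace URI and local name:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the saxon:attribute-order sorting rule. AttName is a stand-in
// for whatever name representation the serializer actually uses.
public class AttributeOrderSketch {

    record AttName(String uri, String local) {}

    static Comparator<AttName> comparator(List<String> ordered) {
        return Comparator
            .<AttName>comparingInt(a -> {
                int i = ordered.indexOf(a.local());
                return i >= 0 ? i : Integer.MAX_VALUE;  // unlisted names go last
            })
            .thenComparing(AttName::uri)    // then sort by namespace URI...
            .thenComparing(AttName::local); // ...and finally by local name
    }

    public static void main(String[] args) {
        List<AttName> atts = new ArrayList<>(List.of(
            new AttName("", "href"),
            new AttName("", "id"),
            new AttName("http://example.com/ns", "alpha"),
            new AttName("", "beta")));
        atts.sort(comparator(List.of("id", "href")));
        atts.forEach(a -> System.out.println(a.local()));  // id, href, beta, alpha
    }
}
```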

When the -wrap option is used in XQuery, the serializer is now capable of generating an XML representation of a map, using nested elements to represent the entries in the map with their keys and values.

External Object Models

Support for JDOM2 is added.

Support for Apache Axiom is added.

A number of optimizations have been made to the support modules for other external object models, notably to speed up navigation of the descendant axis.

The class DocumentBuilderFactoryImpl, which constructed a DOM wrapper around a TinyTree, and which has been deprecated since 9.3, is now removed.

XSD Extensions

Parameterized validation: a new extension to XSD 1.1 is implemented to allow schema validation to be parameterized. The saxon:param element can be added to a schema to declare a parameter; the value of the parameter can be referenced in XPath expressions, for example in assertions. The parameter values can be set from the command line when running the Validate command, or from the s9api (Java) and Saxon.Api (.NET) interfaces when validating from an application. It is also possible to initiate parameterized validation using a new saxon:validate extension function available in XSLT and XQuery.

Internal Changes

Regular expressions: a new regular expression engine is introduced, based on the Apache Jakarta product.

The implementation of the base64 encoding and decoding routines has been rewritten in order to simplify the set of open source licenses in use.

XPath 3.0 Functions and Operators

fn:serialize() function: the implementation has been changed to match the 2011 version of the W3C specification. This changes the format of the serialization parameters supplied in the second argument to the function.

The functions has-children(), innermost(), and outermost() are implemented.

The function parse-xml-fragment() is implemented.

In format-date() and friends, timezone formatting has been updated to match the current spec. This may create incompatible changes. The changes apply whether or not XPath 3.0 processing is enabled.

The implementation of deep-equal() has been changed to enforce the rule that the type annotations of two element nodes must have the same variety. This corrects a long-standing non-conformance, but may cause incompatibility, especially when comparing a schema-validated document against an unvalidated document. A new flag "v" has been added to the Saxon equivalent function saxon:deep-equal() to suppress this check.

Command line

Command-line interfaces: added the option -quit:off to prevent exiting the Java VM in the event of a failure; instead, a RuntimeException is thrown. This is useful when the command line interfaces are invoked from another Java application, for example Apache Ant, as it allows the calling application to recover.


XSD 1.0 Support

There have been changes to the way the schema processor handles recovery from validation errors. By default the processor tries to continue validating to the end of the file, reporting as many validation errors as it can. All errors are reported to the error() method of the ErrorListener, which by default throws no exception; if the error() method does choose to throw an exception then validation terminates immediately. Validation can also be terminated early if an error limit is reached; the error limit can be set in the ParseOptions object that controls the validation. (The value 1 causes termination after the first error is reported).
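The "collect errors up to a limit, then terminate" policy can be sketched with the JDK's own SAX types. Saxon's actual machinery uses a TransformerException-based ErrorListener and ParseOptions; the class below is an invented stand-in showing just the counting logic:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch of the error-limit recovery policy described above,
// expressed against the JDK's SAX ErrorHandler interface.
public class CollectingErrorHandler implements ErrorHandler {

    private final int errorLimit;  // 0 = unlimited: keep validating to the end
    private final List<String> errors = new ArrayList<>();

    public CollectingErrorHandler(int errorLimit) {
        this.errorLimit = errorLimit;
    }

    public void warning(SAXParseException e) { /* ignore warnings */ }

    // Recoverable validation error: record it; throwing here would
    // terminate validation immediately, as the text above describes.
    public void error(SAXParseException e) throws SAXException {
        errors.add(e.getMessage());
        if (errorLimit > 0 && errors.size() >= errorLimit) {
            throw new SAXException(
                "Error limit reached after " + errors.size() + " error(s)", e);
        }
    }

    public void fatalError(SAXParseException e) throws SAXException {
        throw e;  // well-formedness errors are always fatal
    }

    public List<String> getErrors() { return errors; }

    public static void main(String[] args) {
        CollectingErrorHandler h = new CollectingErrorHandler(1);
        try {
            h.error(new SAXParseException("element 'x' not allowed here", null));
        } catch (SAXException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With limit 1 this reproduces the "terminate after the first error" behaviour; with limit 0 it reproduces the default "report as many errors as possible" behaviour.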

At the end of the document, if there have been one or more validation errors, then a fatal error is thrown unless the "recover after validation errors" option (FeatureKeys.VALIDATION_WARNINGS) is set, in which case no exception occurs and processing can continue. Note that in this case the type annotations in any result document are unreliable. The setting FeatureKeys.VALIDATION_WARNINGS is ignored in the case where a document used as input to XQuery or XSLT is being validated, because in that case the incorrect type annotations would cause inconsistencies during subsequent processing. TODO: check this works.

If an element is found that cannot legitimately appear according to the content model of its parent, Saxon previously abandoned the validation of the content of that element, as well as the content of its following siblings. This has been changed so that the content of the offending element, and the content of its siblings, is now validated, using the context-determined type for the element (the "element declarations consistent" constraint ensures that if the same element name appears more than once in a content model, each must be associated with the same type).

The validation exception made available to the ErrorListener now includes a structured path indicating location of the offending node (previously the path was available only as a string). It also includes, where available, a reference to the schema type against which validation was attempted.

The error messages produced when a sequence of elements does not conform to the content model of a complex type have been improved. There is now more effort to distinguish different causes of the error: for example, too many repetitions of a repeated element, a mandatory element that has been omitted, or an element that is in the wrong namespace.

Saxon now recognizes the XLink namespace and fetches the schema for this namespace locally rather than fetching it from the W3C web site (which will often time out).

XPath 3.0 Support

Revised the EQName syntax as per spec bug 15399: it is now Q{uri}local.

Affecting XPath 2.0 also, some very long-standing bugs have been fixed in the handling of schema-validated documents using xsi:nil - specifically, ensuring that the typed value of a nilled element is always an empty sequence, getting the matching of types element(N, T) and element(N, T?) right, as well as the impact of nillability on type subsumption.

XSLT 3.0 Support

The bind-group and bind-grouping-key variables on the xsl:for-each-group element are now implemented.

This involved some internal refactoring of the way variables are managed during the XSLT static analysis phase.

Implemented the composite attribute of the xsl:for-each-group element.

Changed the implementation of xsl:merge to reflect the revisions in the draft XSLT 3.0 specification (removed xsl:merge-input element; added sort-before-merge attribute).

Implemented accumulators (the new xsl:accumulator feature) for both streamed and unstreamed documents.

Implemented the xsl:stream instruction.

Implemented the error-code attribute of xsl:message.

Implemented the start-at attribute of xsl:number.

Implemented the context-item attribute of xsl:evaluate.

Implemented the xsl:assert instruction.

Implemented the xsl:map and xsl:map-entry instructions.

The restriction that xsl:import must precede other declarations in a stylesheet module has been removed.

Implemented the on-empty attribute of xsl:attribute.

Implemented the xsl:on-empty attribute of literal result elements. Tested with and without byte code generation. TODO: test with streaming, and with freestanding construction (evaluateItem).

Saxon now allows the EQName syntax Q{uri}local in places where a QName is required, for example the name attribute of xsl:variable, xsl:template, xsl:function etc., the first argument of the functions key(), system-property(), and function-available(), the final argument of format-number(), etc. This is useful for static names in the case where stylesheets are generated programmatically, and it is always useful for dynamic names because it avoids the dependency on the static namespace context.

It is now a static error if the same NodeTest appears in an xsl:strip-space and an xsl:preserve-space declaration at the same import precedence. This is a backwards incompatibility in the 3.0 specification.

There is now a warning message if the namespace URI of the document element of the principal input document does not match the namespace URIs used in the template rules of the stylesheet. This is designed to catch the common beginner's mistake of writing (for example) match="html" when the element to be matched is actually in a (default) namespace. The precise conditions for the warning are that the stylesheet contains one or more template rules in the initial mode that match an explicit element name, and none of these template rules matches an element in the namespace used for the top level element of the principal source document.

Implemented more of the new pattern syntax: patterns matching variables, namespace nodes, ... Patterns that match atomic values can no longer be used as part of a pattern that uses "union", "intersect", or "except" (as a result of clarification of the XSLT 3.0 specification.)

Affecting XSLT 2.0 also, a very long-standing bug has been fixed: documents read using the collection() function are now subjected to whitespace stripping as defined by xsl:strip-space declarations in the stylesheet.

Also affecting XSLT 2.0, a change has been made in the behavior of xsl:result-document when the href attribute is absent. Previously this caused the constructed result document to be serialized to the location specified by the base output URI. Now it causes the constructed document to be sent to the primary output destination selected in the calling API (which might, for example, be a DOMResult or an XdmDestination). Both interpretations appear to be allowed by the specification. Note that omitting the href attribute is done typically when you want to validate the result document against a schema, though the same effect can be achieved using xsl:document.

Whitespace stripping: the xsl:strip-space declaration now has no effect on a node if it is within the scope of an XSD 1.1 assertion: that is, whitespace text nodes are not stripped if any ancestor node has been validated against a type that contains an assertion. This is because changing the content of such a node could invalidate the assertion, thus breaking type safety.

Implemented content value templates. These allow expressions contained in curly braces to be contained in text nodes within a sequence constructor, rather like attribute value templates; the facility can be enabled by setting expand-text="yes" on any enclosing XSLT element (for example, <xsl:stylesheet>). Example: {$greeting}, {$first} {$last}, how are things? Note that the details of the facility in the specification (for example, handling of boundary whitespace) are subject to change.

The new match pattern syntax match="?{ expr }" is implemented. This matches an item T if the expression has an effective boolean value of true when evaluated with T as the context item. This construct replaces the ~ itemType syntax defined in earlier XSLT 3.0 drafts, but for the moment Saxon implements both.

Some of the features NOT implemented in XSLT 3.0 include:

  • Packages

  • xsl:context-item

Streaming

There has been considerable development of the streaming capability, much of it involving removal of restrictions that were not previously documented (or known).

The xsl:stream instruction is now implemented, making the saxon:stream extension function obsolescent (though it is retained for the time being). Accumulators are also implemented, in both streaming and non-streaming mode.

Streaming: Saxon is moving towards a design where it implements the "guaranteed streamability" rules in the W3C draft precisely, unless the configuration option ALLOW_STREAMABILITY_EXTENSIONS is set, in which case Saxon may implement streaming (or near-streaming) for some additional cases.

Certain constructs using positional filters can now be evaluated in streaming mode. The filter must be on a node test that uses the child axis and selects element nodes. The forms accepted are expressions that can be expressed as x[position() op N] where N is an expression that is independent of the focus and is statically known to evaluate to a number, x is a node test using the child axis, and op is one of the operators eq, le, lt, gt, or ge. Alternative forms of this construct such as x[N], remove(x, 1), head(x), tail(x), and subsequence(x, 1, N) are also accepted.

Streaming is now possible for xsl:for-each-group using the group-adjacent, group-starting-with, or group-ending-with attributes.

XQuery 3.0

Saxon 9.5 fully implements the XQuery 3.0 Candidate Recommendation of January 2013.

Forwards references to global variables are now allowed.

Added support for the variables $err:code and $err:description within the catch block of try/catch.

Added support for function annotations.

Added support for the required-feature and prohibited-feature declarations.

XQuery 3.0 extensions can now co-exist with XQuery Update extensions in the same query. 

.NET API

The XPathSelector interface now has an EffectiveBooleanValue method.

Various methods have been added to bring the interface closer to the functionality level of the Java s9api interface.

System Programming Interfaces

In the Configuration class, many of the getters and setters for individual configuration properties have been removed, relying instead on the general-purpose methods getConfigurationProperty() and setConfigurationProperty(). To make these easier to use in the common case of boolean properties, the new methods getBooleanProperty() and setBooleanProperty() are introduced. Property names that are only relevant to Saxon-PE or Saxon-EE are now generally recognized only by a ProfessionalConfiguration or EnterpriseConfiguration respectively; previously some were also recognized by a Saxon-HE configuration while others were not.

The classes ValueRepresentation and Value have been replaced by the new class Sequence. In some interfaces Sequence also replaces SequenceIterator, since passing a LazySequence (which implements Sequence) has all the same performance benefits as passing a SequenceIterator, while still allowing an Item to be passed directly, without wrapping.

The change affects (simplifies) the interface for integrated extension functions, where arguments and result values are now passed as Sequence objects rather than as SequenceIterator instances.

In the NodeInfo interface, the two methods getTypedValue() and atomize() for obtaining the typed value of a node have been unified into a single method, atomize(), which returns the new type AtomicSequence. In the vast majority of cases the typed value is a single atomic value, and this case works efficiently because AtomicValue implements AtomicSequence.

Also in the NodeInfo interface, the method getAttributeValue(fingerprint) has been dropped; callers should use getAttributeValue(uri, local) instead.

The OutputURIResolver interface has been changed: a new method newInstance() is added. This change is made because xsl:result-document is now multi-threaded by default, and since it's likely that existing implementations of OutputURIResolver won't be thread-safe, making an interface change is better than a semantic change that will cause the code to break in difficult-to-diagnose ways. The new method typically returns a new instance of the OutputURIResolver class, so that each instance of the class only needs to remember information about one result document, and is automatically thread-safe.
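The one-instance-per-document design can be sketched in plain Java. The interface and classes below are simplified stand-ins, not Saxon's actual OutputURIResolver API, but they show why newInstance() makes per-document state thread-safe without requiring implementations to synchronize:

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the per-instance resolver pattern described above: the shared
// prototype creates a fresh instance per result document, so mutable state
// is confined to one thread.
public class PerInstanceResolverDemo {

    interface Resolver {
        Resolver newInstance();             // called once per result document
        StringWriter resolve(String href);  // returns the output destination
        String close();                     // finishes this one document
    }

    static class StringResolver implements Resolver {
        private String href;           // per-document state: safe, because
        private StringWriter writer;   // each thread gets its own instance

        public Resolver newInstance() { return new StringResolver(); }
        public StringWriter resolve(String href) {
            this.href = href;
            this.writer = new StringWriter();
            return writer;
        }
        public String close() { return href + " -> " + writer; }
    }

    public static void main(String[] args) throws Exception {
        Resolver prototype = new StringResolver();
        List<Thread> threads = new ArrayList<>();
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        for (String href : List.of("a.xml", "b.xml", "c.xml")) {
            Thread t = new Thread(() -> {
                Resolver r = prototype.newInstance();  // one instance per document
                r.resolve(href).write("<doc/>");
                results.add(r.close());
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        results.forEach(System.out::println);
    }
}
```

Had all three threads shared the prototype's href and writer fields, the outputs would interleave unpredictably; the factory method sidesteps that without any locking.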

In the Java classes that implement the schema component model (in package com.saxonica.schema) many of the methods that return an Iterator have been replaced by methods that return a Set or a List. This simplifies processing by enabling use of the Java for-each construct, and is more convenient when the saxon:schema extension function is used to process schema components in the form of XPath functions.

The toString() method on the Expression class has been more carefully defined and implemented in an attempt to achieve the result that (for an Expression within the scope of XPath) the result is a legal and equivalent XPath 3.0 expression with no dependencies on namespace prefixes other than (a) the binding of the prefix "xs" to the standard Schema namespace, and (b) the assumption that the XPath functions namespace is the default function namespace. In other cases QNames are expanded using EQName notation Q{uri}local. There may be a few remaining cases where the output does not yet satisfy these intentions.

Extensibility

Saxon has long provided the ability to have an Item that wraps an external Java or .NET object, which can be supplied in arguments to extension function calls or used in the result returned by an extension function call. In the past, such values have appeared in the type hierarchy under "atomic value", that is, as a peer of types such as xs:boolean and xs:string. This has changed so they are no longer atomic values; instead they form a fourth kind of item alongside nodes, atomic values, and function items.

The string value and typed value of an external object are the same; they are the xs:string value that results from calling its toString() method.

The handling of extension items that wrap a Java null has been tidied up. When a Java extension function returns null, this is mapped to an XDM empty sequence, not to an extension item that wraps null. When an empty sequence is supplied to a Java extension function that expects an Object, the value null will be passed to the Java method. Extension items are therefore no longer allowed to wrap a Java null.
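The two halves of this convention can be written down directly. The types below are simplified stand-ins for Saxon's Sequence machinery (an XDM sequence is modelled as a plain List):

```java
import java.util.List;

// Sketch of the null-handling convention described above: Java null maps to
// an empty XDM sequence on the way out, and an empty sequence maps back to
// Java null when the method parameter is typed as Object.
public class NullConventionSketch {

    // Java result -> XDM sequence (never a wrapped null)
    static List<Object> toXdm(Object javaResult) {
        return javaResult == null ? List.of() : List.of(javaResult);
    }

    // XDM sequence -> Java argument for a parameter of type Object
    static Object toJava(List<Object> sequence) {
        return sequence.isEmpty() ? null : sequence.get(0);
    }

    public static void main(String[] args) {
        System.out.println(toXdm(null));        // [] - the empty sequence
        System.out.println(toJava(List.of()));  // null
        System.out.println(toXdm("wrapped"));   // [wrapped]
    }
}
```

The round trip is lossless in both directions, which is why wrapped nulls can be banned outright.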

With reflexive extension functions, the handling of Java arrays has been improved; the component type of the array is now considered when deciding which of several overloaded methods is the best fit.

With reflexive extension functions, if there are two methods where one expects a String and the other a CharSequence, the one expecting a CharSequence is now preferred. Previously the function call was rejected as ambiguous. Although there is no particular reason for choosing one rather than the other, it's likely that both methods will have the same effect, so choosing one arbitrarily is better than reporting an error.
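The tie-breaking rule can be illustrated with a few lines of reflection. This is illustrative only; Saxon's real overload resolution considers many more factors than this invented choose() helper does:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Sketch of the preference described above: given candidate overloads of the
// same name and arity, prefer the CharSequence parameter over String.
public class OverloadPreferenceSketch {

    public static String f(String s)       { return "String"; }
    public static String f(CharSequence s) { return "CharSequence"; }

    static Method choose(Class<?> cls, String name) {
        List<Method> candidates = Arrays.stream(cls.getMethods())
            .filter(m -> m.getName().equals(name) && m.getParameterCount() == 1)
            .toList();
        // A call that matches both almost certainly means the same thing
        // either way, so pick the CharSequence overload rather than failing.
        return candidates.stream()
            .filter(m -> m.getParameterTypes()[0] == CharSequence.class)
            .findFirst()
            .orElse(candidates.get(0));
    }

    public static void main(String[] args) throws Exception {
        Method m = choose(OverloadPreferenceSketch.class, "f");
        System.out.println(m.invoke(null, "hello"));
    }
}
```

Choosing arbitrarily but deterministically here is strictly better than the previous behaviour of rejecting the call as ambiguous.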

With reflexive extension functions, if there are two methods of the right name and arity, the decision which to use is now postponed until after type checking has been done. This is particularly useful when arguments are supplied in the form of variable references. Previously the decision was postponed only if the early analysis showed both methods as equally preferred; now it is always postponed.

When a constructor is called the code now does what the documentation has always said: a wrapped external object is returned with no conversion. So for example Date:new() will return a wrapped java.util.Date object rather than an instance of xs:dateTime, and HashMap:new() will return a wrapped HashMap rather than an XPath map. The code also avoids doing conversions other than object unwrapping for the first argument of a call to an instance method.

When a reflexive extension function throws an exception, the exception details are now captured in an error that can be caught using the try/catch capabilities in XSLT 3.0 and XQuery 3.0. In particular, the error code is a QName whose namespace URI is "http://saxon.sf.net/java-type" and whose local name is the class name of the exception type, for example java.net.URISyntaxException.

It is now possible to use the reflexive call mechanism to call instance-level methods of the implementation classes for XDM values. For example, the following gets the current Julian instant:

<xsl:value-of select="d:toJulianInstant(current-dateTime())"
         xmlns:d="java:net.sf.saxon.value.DateTimeValue"/>

When an instance of java.util.Map is passed as a parameter to a stylesheet or query, then by default it is accessible within the stylesheet or query as an XPath 3.0 map object. However, if the parameter is declared with a required type of jt:java.util.Map, then it will instead be retained as an external object wrapping the java.util.Map, which means for example that (with the usual caveats and disclaimers) it is updateable by calling its put() method as an extension function.

The SQL Extension

Added the value #auto for the column-tag and row-tag attributes of sql:query, and made this the default; when this value is used, the element names used for rows and cells are taken from the table name used in the query, and the column names returned in the query metadata. To get the old default behaviour, specify row-tag="row" and column-tag="col".

Serialization

Support for XHTML 5 serialization, as defined in the 3.0 Serialization specification, is available. (Use <xsl:output method="xhtml" html-version="5.0"/>).

Comments and processing instructions no longer prevent an immediately following start tag being indented if it would have been indented in the absence of the comment or PI.

(However, comments and PIs are not themselves indented, because it is not possible to do this safely without looking ahead to see if the comment or PI is followed by significant character content.)

I've been working on a stylesheet that converts the W3C XQuery/XPath 3.0 test suite into an XSLT 3.0 test suite. The idea is that all the XPath tests only need to be written once, allowing them to form XSLT tests and thus avoiding the duplication of effort that arises if separate test suites are written.

The stylesheet generates 19,700 output files (38Mb) from 1,863 input files (122Mb) so it's a good test in its own right. I'm therefore running it using the almost-ready Saxon 9.5 as part of the system testing before we ship the product. Here are a few points I'd like to highlight.

Firstly, some files are copied unchanged. For this purpose the extension function file:copy(), which is part of the EXPath file module, comes in very handy. This library will be included as standard with Saxon 9.5 (like all extensions, it will be in the PE and EE editions only).

Secondly, I followed a development approach which I've found very useful and would like to share. There's a schema for the target document vocabulary (the XSLT 3 test suite metadata), so it makes sense to use a schema-aware stylesheet that validates the result documents as they are produced. However, I don't switch on validation until the transformation is "roughly working", that is, until it produces output that looks correct to the naked eye. The reason for that is simply that the validation error messages are a nuisance during the stage when you know the transformation is incomplete and incorrect. It's only when the transformation is running without errors and handling all the input that you need to start checking that the output is correct, and at this stage adding an xsl:import-schema declaration for the output schema, plus the attribute validation="strict" on the xsl:result-document instruction, is absolutely invaluable.

Except, I found it could be improved. The validation error messages tell you what's wrong (for example, you've output an attribute value which isn't allowed for that attribute), and they tell you where the code is in the stylesheet that caused the error (the offending xsl:attribute or xsl:copy instruction), but they weren't telling you what input was being processed at the time: where did the offending attribute value come from? So I've fixed that, or at least tried to. It's one of those cases that sounds easy but is actually quite tricky, because the code that detects the validation error is very remote from any code that knows where we are in the source document (or indeed, which of 1,863 source documents we are processing) at the time. The way I've done this is that the PipelineConfiguration - which holds all the context information used by a push pipeline, including schema validation - now includes a stack of items representing items being processed using xsl:apply-templates; so the validation error message can tell you what the latest node processed by a template rule is. Unfortunately this doesn't work for those who use monolithic xsl:for-each structures, or for those who use XQuery - but this is specifically focussed on this kind of workload handling many inputs and many outputs, for which XQuery is unsuitable anyway. There is of course a small performance hit for maintaining this stack, but I'm confident it's very small, and I always reckon that a small amount of machine time to save a large amount of programmer time is a good trade-off.

The third thing I wanted to mention in this article is asynchronous processing of xsl:result-document. This is a new feature for Saxon-EE 9.5. By default, an xsl:result-document instruction executes in its own thread, independently of the rest of the stylesheet. Threads are allocated from a pool, whose size defaults to the number of processors available (on my Mac laptop, that's 8). If no free threads are available, the instruction executes in the current thread.
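The dispatch rule just described - run the instruction in a pool thread if one is free, otherwise fall back to the current thread - can be sketched in miniature. This is a Python illustration of the rule, not Saxon's actual Java implementation; the class name ResultDocumentRunner is my own invention:

```python
import os
import threading

class ResultDocumentRunner:
    """Sketch (hypothetical, not Saxon's code) of the dispatch rule above:
    run a task in a pool thread if a slot is free, otherwise execute it
    synchronously in the calling thread."""

    def __init__(self, max_threads=None):
        # Pool size defaults to the number of available processors
        self._slots = threading.Semaphore(max_threads or os.cpu_count())

    def run(self, task):
        if self._slots.acquire(blocking=False):
            def wrapper():
                try:
                    task()
                finally:
                    self._slots.release()  # free the slot for the next task
            t = threading.Thread(target=wrapper)
            t.start()
            return t
        task()  # pool exhausted: fall back to the current thread
        return None
```

The semaphore makes the "is a thread free?" check non-blocking, which is what allows the graceful fallback instead of queueing.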

I did some experiments on the performance as the number of available threads is increased:

Threads - Elapsed time
 1 - 14.3s
 4 - 8.52s
 8 - 8.92s
12 - 9.81s
16 - 10.6s

so the default of 8 appears to be a reasonable choice. The saving in elapsed time by using multi-threading seems to be well worth having, though frankly, for a task like this, 14s is already fantastically good performance and any improvement is simply icing on the cake. It's becoming quite hard to find workloads where performance improvements are genuinely needed.

Incidentally, switching bytecode generation off has almost no noticeable effect on this transformation, and the same is true when optimization is switched off. That's because it's a very simple transformation that happens to be processing a large amount of data.

The xsl:result-document instruction is a good candidate for parallel processing because there's no need to combine its results with those of any other instruction, which minimizes the synchronization overheads that usually come with parallel processing. However, there is still some synchronization needed, for example because global variables are evaluated lazily, so there is the chance that two threads will encounter the same unevaluated variable. Unfortunately finding all these cases is very difficult to do methodically, and this test stylesheet found a few bugs that needed to be addressed. Realistically, it's likely there will be further bugs that we won't hit until after release. So we've been careful to make sure the feature can be disabled - that's proved very useful with bytecode generation, which is another area where it inevitably took a few months of field exposure to iron out the last bugs.
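The lazy-global hazard can be seen in miniature below: a minimal thread-safe lazy cell (a Python sketch under my own naming, not Saxon's implementation) that guarantees the variable is computed only once even if two result-document threads hit it simultaneously:

```python
import threading

class LazyGlobal:
    """Sketch (hypothetical, not Saxon's code) of a thread-safe lazily
    evaluated global variable: the computation runs at most once even
    when several threads request the value concurrently."""
    _UNSET = object()

    def __init__(self, compute):
        self._compute = compute
        self._value = self._UNSET
        self._lock = threading.Lock()

    def get(self):
        if self._value is self._UNSET:            # fast path, no lock taken
            with self._lock:
                if self._value is self._UNSET:    # re-check under the lock
                    self._value = self._compute()
        return self._value
```

The subtle part, as the post says, is not writing this pattern once but finding every place in a large codebase where shared state was previously touched by only one thread.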

At Saxonica, we have for a long time used a tailor-made Java application to create and issue licenses for all the commercial products we develop. There is no real database at the back-end, just a local XML file with customer details and copies of the licenses created and issued. For a one-man company this posed no real problem, but inevitably, as the company has expanded over the last two years, it has become a major concern.

Early last year Mike Kay set me the task of creating a new saxon-license application with the following requirements:

  • Accessible to all employees, preferably web-based
  • Core Java tool should remain intact
  • Centralised store, must be XML-based
  • Secure application

From the outset we thought that for such a tool, which relies heavily on XML and XSLT at its core, these requirements would be best met by using XSLTForms and Servlex to develop it.

In this blog post I would like to share my experiences of developing the saxon-license webapp using Servlex and XSLTForms, including how we stitched in our existing back-end core Java tool and the challenges we faced with encoding. Specific details of the features and functions are not important here; only the engineering process is of interest.

On the client-side we write XForms [1] documents, which are processed by XSLTForms [2] (created by Alain Couthures) to render in the browser. XSLTForms is an open source client-side implementation, not a plug-in or install, that works with all major browsers.

On the server-side we integrate the core Saxon-license tool in a Servlex webapp [3] as Saxon extension functions called from within XSLT. Servlex is an open-source implementation of the EXPath webapp framework [4] based on Saxon and Calabash as its XSLT, XQuery and XProc processors. Servlex provides a way to write web applications directly in XSLT. It is developed as a Java EE application requiring Servlet technology, sitting on Tomcat for binding to the HTTP server side.

Saxon License-tool functionality

The server-side Servlex works as a dispatching and routing mechanism to components (implemented as XSLT stylesheets), applying a configuration-based mapping between the request URI and the component used to process that URI. The container communicates with the components by means of an XML representation of the HTTP request, and receives in turn XML data with HTML in the response body, containing XForms content and XSLTForms references to render the page. This representation of the HTTP response is sent back to the client. There are buttons on the forms which, when pressed, trigger HTTP requests through the client-side XSLTForms; these requests are handled by Servlex.

There are 5 main XSLT functions described below, which map the URIs, generate the various XForms, and tunnel the instance data between the forms. These functions all make calls to the core Saxon-license tool written in Java, made available as Saxon extension calls from the XSLT:


  1. fnRunMainForm: A request to serve the main form is made with the following URI pattern:

    http://192.168.0.2:8080/app/license-tool/main

    License requests are usually made through the main Saxonica website, either for evaluation or for a paid order (see http://www.saxonica.com/download/download.xml and http://www.saxonica.com/purchase/purchase.xml, respectively). These orders are received as an email, which is then copied and pasted into the main form. This data is sent as an XForms instance in a web request, picked up by Servlex.

  2. fnManualEntry: Manual Entry form, for manual creation of the customer details to create a license. A request is made to Servlex with the following URI pattern:
    http://192.168.0.2:8080/app/license-tool/manualEntry

  3. fnFetchRecord: Existing licenses can be retrieved and re-issued. A request is made to Servlex with the following URI pattern; the parameter after the '?' is the license number to fetch:
    http://192.168.0.2:8080/app/license-tool/fetchRecord?Select=X002110

  4. fnReport: This function generates an HTML page listing all licenses created, or just the most recent 20.
    http://192.168.0.2:8080/app/license-tool/report

  5. fnEditParseRequest: The Manual Entry form with the client data populated. The order request from the main form is parsed and returned as XForms instance data, which is used to generate the form on the server. A request is made to Servlex with the following URI pattern:
    http://192.168.0.2:8080/app/license-tool/editParseRequest

     
Securing access to the saxon-license webapp is achieved through apache2 configuration.
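The configuration-based mapping between request URIs and components can be illustrated with a toy dispatcher. This is purely illustrative Python, not Servlex code (Servlex's real configuration is its own XML format); the route table simply reuses the five function names listed above:

```python
# Toy illustration (not Servlex) of configuration-based routing:
# map a request URI path to the component that processes it.
ROUTES = {
    "/main": "fnRunMainForm",
    "/manualEntry": "fnManualEntry",
    "/fetchRecord": "fnFetchRecord",
    "/report": "fnReport",
    "/editParseRequest": "fnEditParseRequest",
}

def dispatch(path: str) -> str:
    base = path.split("?", 1)[0]  # parameters after '?' go to the component
    component = ROUTES.get(base)
    if component is None:
        raise KeyError("no component mapped for " + base)
    return component
```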

Encoding problem

A long-standing problem we faced in this application was the handling of non-ASCII characters. We raised the issue with Alain and Florent, the creators of XSLTForms and Servlex respectively, to get to the bottom of it.

Basically, if the user enters data on a form, we send it back to the server in a url-encoded POST message, and it emerges from Servlex in the form of XML presented as a string; if there are non-ASCII characters, they are miscoded. In the form we set the submission method attribute to 'xml-urlencoded-post' to guarantee that the next page will fully replace the current one: XMLHttpRequest is not used in this case.

We were seeing the typical pattern that you get when the input characters are encoded as a UTF-8 byte sequence and the byte sequence is then decoded to characters by someone who believes it to be 8859-1. We were not able to work out where the incorrect decoding was happening. We originally circumvented the problem by reversing the error: we converted the string back to bytes treating each char as a byte, and then decoded the bytes as UTF-8.
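The circumvention described above - reversing the faulty decode - is easy to demonstrate. The helper below is a hypothetical Python sketch of the idea, not the code we actually shipped:

```python
def repair_mojibake(s: str) -> str:
    """Reverse an accidental UTF-8-as-Latin-1 decode: turn each character
    back into the byte it was mis-decoded from, then decode as UTF-8."""
    return s.encode("iso-8859-1").decode("utf-8")

original = "O'Neil€"
# Simulate the fault: UTF-8 octets decoded by someone who believes
# they are ISO 8859-1 -- this produces the telltale "â..." pattern.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(repair_mojibake(garbled) == original)  # True
```

It works because ISO 8859-1 maps every byte to a character, so no information is lost in the bad decode and the damage is exactly reversible.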

A feature of XSLTForms is the profiler (enabled by pressing F1 or setting debug='yes' in the xsltforms-options process instruction). The profiler allows the inspection of the instance data. Another mechanism is to inspect the requests sent by the browser with the network profiler of a debugger.

We established that on the client side, there is an HTML Form Element that gets built, and just before the submit() method gets called on this object, the data appears to be OK. But when we look at the Tomcat log of the POST request, it's wrong. Somewhere between the form.submit() on the client and the logging of the message on the server, it's getting corrupted. We can't actually see where the encoding and decoding is happening between these two points.

To tackle this problem Florent provided a development version of Servlex which added logging of the octets as they are read from the binary stream (the logger org.expath.servlex must be set to trace, which should be the default in that version), in addition to logging the raw headers as they are read by Tomcat.

With this new version of Servlex in place I entered the following data on the main form. Note that the euro symbol at the end of the first name 'O'Neil€' is a non-ASCII character which needs to be preserved:


First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address: oneil@saxonica.com

Phone:

Agree to Terms: checked

After submitting this data to the URI pattern .../app/license-tool/editParseRequest, we see below the log data reported by Tomcat. What is interesting is the line 'DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded'. Also, at this stage the input to the receiving form has been corrupted to 'O'Neilâ&#x82' when it should be 'O'Neil€':

DEBUG [2013-03-04 18:06:34,279]: Request - servlet  : parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - path     : /parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - method   : POST
DEBUG [2013-03-04 18:06:34,280]: Request - uri      : http://localhost:8080/app/license-tool/parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - authority: http://localhost:8080
DEBUG [2013-03-04 18:06:34,280]: Request - ctxt_root: /app/license-tool
DEBUG [2013-03-04 18:06:34,280]: Request - param    : postdata / <Document><Data>First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address: oneil@saxonica.com

Phone:

Agree to Terms: checked </Data><Options><Confirmed>false</Confirmed><Create>false</Create><Send>false</Send><Generate>false</Generate><Existing/></Options></Document>
DEBUG [2013-03-04 18:06:34,281]: Request - header   : host / localhost:8080
DEBUG [2013-03-04 18:06:34,281]: Request - header   : user-agent / Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept / text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-language / en-gb,en;q=0.5
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-encoding / gzip, deflate
DEBUG [2013-03-04 18:06:34,281]: Request - header   : referer / http://localhost:8080/app/license-tool/main
DEBUG [2013-03-04 18:06:34,281]: Request - header   : connection / keep-alive
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-length / 482
DEBUG [2013-03-04 18:06:34,281]: Raw body content type: application/x-www-form-urlencoded
TRACE [2013-03-04 18:06:34,281]: TraceInputStream(org.apache.catalina.connector.CoyoteInputStream@771eeb)
TRACE [2013-03-04 18:06:34,282]: read([B@1a70476): -1

Florent made the following observations:

  1. The content-type is application/x-www-form-urlencoded, which should conform to http://www.w3.org/TR/xforms/#serialize-urlencode, but seems not to: the XML seems to be passed as is, instead of being split into individual elements and their string values. But I am not an expert on XForms so I might be wrong.

  2. Still about application/x-www-form-urlencoded and the same section, it says that the non-ASCII characters are replaced based on the octets of their UTF-8 representation, so the encoding should not be used here.  This content-type does not carry any charset parameter anyway, if I am right.

  3. Again about application/x-www-form-urlencoded, it is actually handled by Java EE as parameters, instead of simply giving the raw POST entity content.  I am not sure exactly how it works WRT the encoding.

Alain provided the following example to test the assumptions made by Florent.

Encoding.xhtml:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
    <head>
        <title>Encoding Test</title>
        <xf:model>
            <xf:instance>
                <data xmlns=""/>
            </xf:instance>
            <xf:submission id="s01" method="xml-urlencoded-post" replace="all" action="http://www.agencexml.com/xsltforms/dump.php">
                <xf:message level="modeless" ev:event="xforms-submit-error">Submit error.</xf:message>
            </xf:submission>
        </xf:model>
    </head>
    <body>
        <xf:input ref=".">
            <xf:label>Input:</xf:label>
        </xf:input>
        <xf:submit>
            <xf:label>Save</xf:label>
        </xf:submit>
    </body>
</html>

dump.php:
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>HTTP XML POST Dump</title>
    </head>
    <body>
        <h1>HTTP XML POST Dump</h1>
        <h2>Raw Data :</h2>
        <?php
        $body = file_get_contents("php://input");
        echo strlen($body);
        echo " bytes: <br/>";
        echo "<pre>$body</pre>";
        if(substr($body,0,9) == "postdata=") {
            $body = urldecode(substr($body,strpos($body,"=")+1));
        }
        $xml = new DOMDocument();
        $xml->loadXML($body);
        $xslt = new XSLTProcessor();
        $xsl = new DOMDocument();
        $indent = "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\"><xsl:output method=\"xml\" indent=\"yes\" encoding=\"UTF-8\"/><xsl:template match=\"@*|node()\"><xsl:copy-of select=\".\"/></xsl:template></xsl:stylesheet>";
        $xsl->loadXML($indent);
        $xslt->importStylesheet($xsl);
        $result = $xslt->transformToXml($xml);
        $result = substr($result, strpos($result,"?>")+3);
        echo "<h2>Indented XML :</h2><pre>".htmlspecialchars($result, ENT_QUOTES)."</pre>";
        ?>
    </body>
</html>

When submitting '€', I get this:

HTTP XML POST Dump
Raw Data :
41 bytes:
postdata=%3Cdata%3E%E2%82%AC%3C%2Fdata%3E
Indented XML :
<data>€</data>

and with Firebug, I can see following, which is correct:

[Screenshot: Firebug view of the submitted request, showing the correctly encoded data]
Florent states:

What should be in the content of the HTTP request is %E2%82%AC to represent the Euro symbol as URL- encoded (because that represents the 3 octets of € in UTF-8).
Because of the "automatic" handling of that Content-Type by Java EE, I am afraid the only way to know for sure what is on the wire is to actually look into it (using a packet sniffer, like Wireshark for instance).
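Florent's expectation is easy to verify independently with Python's standard library (this is just a cross-check of the encoding arithmetic, not part of the toolchain described here):

```python
from urllib.parse import quote, unquote

# The Euro sign URL-encodes to the three octets of its UTF-8 form.
assert quote("€") == "%E2%82%AC"

# Decoding those same octets as UTF-8 gives the Euro sign back;
# decoding them as Latin-1 reproduces exactly the corruption we saw.
print(unquote("%E2%82%AC", encoding="utf-8"))        # €
print(unquote("%E2%82%AC", encoding="iso-8859-1"))   # â, U+0082, ¬
```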

At this stage it was important to check what packets are being sent. The following is a snippet of the reports from Wireshark, with the data format correct at this point.


HTTP    1207    POST /app/license-tool/parseRequest HTTP/1.1  (application/x-www-form-urlencoded)

[truncated] postdata=%3CDocument+xmlns%3D%22%22%3E%3CData%3EFirst+Name%3A+O%27Neil%E2%82%AC%0D%0A%0D%0ALast+Name%3A+Delpratt%0D%0A%0D%0ACompany%3A+Saxonica%0D%0A%0D%0ACountry%3A+United+Kingdom%0D%0A%0D%0AEmail+Address%3A+oneil%40saxonica.c ...

Florent discovered, using Alain's test case, that it was actually Tomcat itself interpreting the %xx encoding as Latin-1! More info at:

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

In summary, the decoding was done using 8859-1, not UTF-8 as one would expect.

To overcome the problem Florent created a new config property for Servlex, named org.expath.servlex.default.charset, the value of which can be set to "UTF-8" in Tomcat's conf/catalina.properties. If set, its value is used as the charset for requests without an explicit charset in the Content-Type.

Thanks to Florent, Alain and Mike, the encoding problem has now been resolved. The lesson learned from all this is that tracking down encoding problems can still be very hard work.

References
[1] XForms. W3C. http://www.w3.org/MarkUp/Forms/
[2] XSLTForms. Alain Couthures. http://www.agencexml.com/xsltforms
[3] Servlex. Florent George. Gihub: https://github.com/fgeorges/servlex  Google Project: http://code.google.com/p/servlex/
[4] EXPath Webapp. http://expath.org/wiki/Webapp


XSLT in MicroXML? - Saxon diaries

| No Comments | No TrackBacks
Listening to Uche Ogbuji's talk at XML Prague this morning, it was great news to hear that namespaces are being thrown out of the window[1]. The complexity that namespaces add to the XML stack is vast, and the value they add is tiny. 

But it raises the question, can one represent an XSLT stylesheet as an XML document without using namespaces? Clearly there needs to be some way to distinguish XSLT instructions from literal result elements, and this seems to be one of the more justifiable uses of namespaces.

My immediate thought is, let the outermost element of an XSLT stylesheet be

<xxx.transform>

where xxx is any prefix you care to choose; and then let all XSLT instructions in the stylesheet use the same prefix "xxx". For most people "xsl" would be a suitable prefix, so you would write

<xsl.transform>

  <xsl.template match="/">
    <out>
      <xsl.copy-of select="."/>
    </out>
 </xsl.template>

</xsl.transform>

It takes a bit of getting used to, typing "." instead of ":", but apart from that it seems to work perfectly well. And you don't need xsl:namespace-alias any more to read or write XSLT stylesheets, which must be good: you simply use a different prefix for the output stylesheet.

Norm Walsh tweeted that XSLT was claiming ownership of all names ending in ".transform". But that's not the case. The name doesn't have global significance; it only has significance in the context of the input supplied to an XSLT processor. Names are local and contextual, that's the point. Arguably there is still more redundancy than we need - we know the document we are reading is a stylesheet, so why do we need a top-level element to confirm the fact? (That's not to say all redundancy is bad, just to make the point that it's redundant).

[1] Defenestration seems the thing to do when in Prague 

Further thoughts

It's not only the XSLT namespace and its use in XSLT instructions that needs to be addressed, of course.

Namespaces in the names of variables, templates, modes, keys, decimal formats and so on can safely be ignored. Hardly anyone uses them anyway.

Functions are a bit more awkward. XSLT has never seriously committed to the use of the "fn" namespace for system functions (it's always the default, so no prefix is ever needed), but it does require user-written functions to be in a namespace to distinguish them from system-supplied functions; and external function libraries also use namespaces in a similar way. I've always felt the right approach for resolving function names is a search list - it shouldn't be necessary to prefix each function name to identify its library; rather, it should be possible to define a list of function libraries to be searched when looking for functions with a given local name. The MicroXML philosophy would suggest getting rid of URIs from function names entirely, but that's a big change; transitionally, a list of URIs to be searched, plus the XSLT 3.0 EQName facility if you need to be more explicit, would probably meet the requirement.

Step back, however. What are we trying to achieve? If we want to eliminate namespaces entirely from the whole stack, then we need to think about a namespace-less XSD (shudder!) or a typeless XSLT (losing integers and dates would be a big step back). If our goals are less ambitious, then what are they? Being able to transform MicroXML documents doesn't require an XSLT change at all.

I think MicroXML is a first step along a road that is not well charted. While the goals are laudable, the route to a simpler cleaner world is not a straightforward one.

Keys and maps - Saxon diaries

| No Comments | No TrackBacks
I've felt for a while there ought to be some kind of connection between keys (in the xsl:key sense) and maps.

Keys are great, but their limitations are frustrating. There's no way of enumerating the entries in a key, there's no way of using them for grouping or sorting. If you're going to the effort of building an index, then one feels it ought to be possible to get a higher return on this investment.

I've been trying out a fairly simple way of achieving this.

Firstly, to the xsl:key declaration we add the option saxon:range-key="yes". The main effect of this is that we build the resulting index as a TreeMap rather than a HashMap, which means that the keys are sorted.

Then, we add a single extension function: saxon:key-map('keyname', doc, min, max). The effect of this function is to return a map. The keys in the map are the values of the "use" expression in the key declaration; the map contains all entries for this key relevant to the document supplied in the doc argument; and min and max are the minimum and maximum (inclusive) values of the key - either can be supplied as an empty sequence to indicate "no lower/upper limit". Perhaps for convenience we could also have a version of the function in which only the first two arguments are supplied.

So, for example, given a list of transactions with a date attribute, you would declare a key like this:

<xsl:key name="transactions-by-date" match="transaction" use="@date" saxon:range-key="yes"/>

and then the call

saxon:key-map('transactions-by-date', doc('transactions.xml'), '2012-12-01', '2012-12-31')

would return a map containing all the transactions for December. With the result bound to variable $map, map:keys() returns the dates of these transactions, and $map($x) returns the transactions for a specific date. It's a shame there isn't a nicer way to iterate over the contents of a map (we're still working on that one), but by presenting the result as a map we get access to all the map functionality without having to invent a lot of new functions just for this use case.
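The semantics can be mimicked in miniature outside XSLT. Here is a Python sketch of my own (with a sorted key list and bisect standing in for the TreeMap), purely to illustrate the range lookup, not an implementation of the Saxon feature:

```python
import bisect

def key_map(index, lo=None, hi=None):
    """Sketch of the saxon:key-map semantics: return the sub-map of
    entries whose keys lie between lo and hi inclusive; an omitted
    bound means no lower/upper limit."""
    keys = sorted(index)  # a TreeMap keeps its keys permanently sorted
    start = 0 if lo is None else bisect.bisect_left(keys, lo)
    end = len(keys) if hi is None else bisect.bisect_right(keys, hi)
    return {k: index[k] for k in keys[start:end]}

transactions = {
    "2012-11-30": ["t1"],
    "2012-12-05": ["t2", "t3"],
    "2012-12-31": ["t4"],
    "2013-01-02": ["t5"],
}
december = key_map(transactions, "2012-12-01", "2012-12-31")
print(sorted(december))  # ['2012-12-05', '2012-12-31']
```

Because ISO 8601 dates sort lexicographically, the string comparison on keys is also a chronological one.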

We're guaranteeing that for this particular kind of map, keys() will return the keys in sorted order. (Perhaps that's a facility that should also be available for other kinds of map? I don't think this design precludes that.)

Let me know what you think.
I would like to report on some Saxon performance measurements for a word ladder solution implemented in XSLT.

Firstly, some background information on the Word ladder problem. From Wikipedia, the free encyclopedia:


A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly one letter.
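To make the puzzle concrete, here is a minimal breadth-first-search solution - a Python sketch of the problem itself, unrelated to either of the XSLT implementations discussed below:

```python
from collections import deque

def word_ladder(start, goal, words):
    """Find a shortest chain from start to goal in which adjacent
    words differ by exactly one letter (Hamming distance 1)."""
    def neighbours(w):
        return [c for c in words
                if len(c) == len(w)
                and sum(a != b for a, b in zip(w, c)) == 1]

    queue = deque([[start]])
    seen = {start}
    while queue:
        chain = queue.popleft()
        if chain[-1] == goal:
            return chain           # BFS guarantees a shortest chain
        for nxt in neighbours(chain[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(chain + [nxt])
    return None                    # no ladder exists

ladder = word_ladder("cold", "warm",
                     {"cold", "cord", "card", "ward", "warm", "word", "wart"})
print(ladder)  # e.g. ['cold', 'cord', 'word', 'ward', 'warm']
```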

XSLT interest in this problem was first sparked (to the best of my knowledge) by Dimitre Novatchev through the Mulberry mailing list; he provides in his blog a 20-step guide to creating a stylesheet that solves the word ladder problem (FindChainOfWordsHamming.xsl). Following the post on the list there has been some interest; another solution to this problem was given by Wolfgang Laun (please see the thread, file: FindChainOfWordsHamming2.xsl).

Experimental Evaluation

Our interest resides in the Saxon performance only. I was curious about, and surprised by, the results reported by Dimitre. The question I had was why Dimitre's stylesheet was much slower than Wolfgang's in Saxon yet faster in another XSLT processor: there must be some optimization step we were not making. I was motivated to understand where the bottlenecks were and how we could improve the performance in Saxon.

Wolfgang wrote: "The XSLT program is three times faster on one XSLT implementation than on another one is strange, 'very' strange".

Mike Kay addressed Wolfgang's comment by writing in the thread: "No, it's extremely common. In fact, very much larger factors than this are possible. Sometimes Saxon-EE runs 1000 times faster than Saxon-HE. This effect is normal with declarative languages where powerful optimizations are deployed - SQL users will be very familiar with the effect."

The table below shows the execution times of the stylesheets in Saxon 9.XX (for some recent X). Times were reported by Dimitre.

Transformation   Times (secs)
Dimitre          39
Wolfgang         25

We observe that Wolfgang's transformation is 1.56 times faster. Please note that Wolfgang's stylesheet lists all solutions (i.e. ladders), whereas Dimitre's finds only one.

Saxon represents a stylesheet as a compiled abstract syntax tree (AST) which is processed in an interpreted manner. Since the release of Saxon 9.4 we have included the bytecode generation feature, which allows us at the compilation phase to generate directly the bytecode representation of the entire AST, or of sub-trees of it where performance benefits can be achieved. We make use of properties we know at compile time (see full paper).

Analysis of Dimitre's Stylesheet

Step one was to see how well Saxon does with the bytecode feature switched on. This proved inconclusive because we discovered a bug in the generated bytecode. It was a useful exercise already: we managed to fix the bug (see bug issue #1653). The problem was that in the function processQueue the tail-recursive call was not being properly generated into bytecode.

The table below shows running times of the stylesheets under Saxon 9.4.0.6. We observe that Wolfgang's stylesheet was 2.07 and 3.22 times faster in Saxon interpreted and bytecode modes, respectively.

Transformation   Interpreted - Times (secs)   With bytecode generation - Times (secs)
Dimitre          7.95                         7.78
Wolfgang         3.83                         2.41

Analyzing Dimitre's stylesheet with the Saxon tracing profiler (i.e. option -TP) proved useful. See the HTML output produced by Saxon below. We observe that there is a big hit on the my:processNode function, with most of the time spent there.

 

Analysis of Stylesheet Execution Time

Total time: 9498.871 milliseconds

Time spent in each template or function:

The table below is ordered by the total net time spent in the template or function. Gross time means the time including called templates and functions; net time means time excluding time spent in called templates and functions.

file line instruction count avg time (gross) total time (gross) avg time (net) total time (net)
"*rdsHamming.xsl" 79 function my:processNode 2053 4.12 8470.67 3.729 7655.792
"*rdsHamming.xsl" 21 function my:chainOfWords 1 9491.1 9491.12 993.34 993.34
"*rdsHamming.xsl" 131 function f:eq 3993 0.06 230.02 0.058 230.26
"*rdsHamming.xsl" 131 function my:HammingDistance 3993 0.20 807.38 0.049 194.77
"*func-apply.xsl" 21 function f:apply 15972 0.01 290.01 0.011 175.00
"*-Operators.xsl" 244 template f:eq 15972 0.01 115.01 0.004 68.23
"*-Operators.xsl" 248 function f:eq 15972 0.003 46.77 0.003 46.77
"*nc-zipWith.xsl" 21 function f:zipWith 19965 0.002 33.11 0.002 33.11
"*nc-zipWith.xsl" 9 function f:zipWith 19965 0.003 57.67 0.001 24.56
"*func-apply.xsl" 16 function f:apply 15972 0.019 309.52 0.001 19.52
"*rdsHamming.xsl" 70 function my:processQueue 2053 0.009 18.35 0.009 18.35
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.003 10.52 0.003 10.52
"*rdsHamming.xsl" 120 function my:HammingDistance 3993 0.204 814.48 0.002 7.09
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.001 4.88 0.001 4.88
"*rdsHamming.xsl" 73 function my:processNode 2053 4.128 8475.2 0.002 4.57
"*rdsHamming.xsl" 54 function my:processQueue 2053 0.011 22.20 0.002 3.85
"*rdsHamming.xsl" 17 template /* 1 9491.87 9491.9 0.756 0.76
"*rdsHamming.xsl" 40 function my:chainOfWords 1 0.344 0.34 0.344 0.34
"*rdsHamming.xsl" 117 function my:enumerate 10 0.166 1.65 0.029 0.29
"*rdsHamming.xsl" 111 function my:enumerate 10 0.176 1.76 0.010 0.10

In addition to the Saxon tracing profile I ran the Java hprof profiling tool, which showed that most time was spent in comparing strings. See the Java profile results below. It was now obvious that the GeneralComparison expression was the culprit. Specifically, we narrowed it down to the instruction <xsl:for-each select="$vNeighbors[not(. = $pExcluded)]">. In the interpreted code we were doing some unnecessary runtime type checking when we know statically at compile time that we are comparing string values. More specifically, we know at compile time that $vNeighbors is a sequence of untyped atomic values and $pExcluded is a sequence of strings. We were unnecessarily checking at runtime that an untyped atomic value and a string literal were comparable, and we were doing an unnecessary conversion from untyped atomic to string.

CPU SAMPLES BEGIN (total = 1213) Thu Nov 29 14:42:47 2012
rank   self  accum   count trace method
   1 24.24% 24.24%     294 300547 java.lang.Integer.hashCode
   2 19.13% 43.36%     232 300581 net.sf.saxon.expr.GeneralComparison.compare
   3  7.75% 51.11%      94 300613 java.util.HashMap.getEntry
   4  2.14% 53.26%      26 300570 java.util.LinkedHashMap$Entry.recordAccess
   5  2.06% 55.32%      25 300234 java.lang.ClassLoader.defineClass1
   6  2.06% 57.38%      25 300616 com.saxonica.expr.ee.GeneralComparisonEE.effectiveBooleanValue
   7  1.98% 59.36%      24 300603 java.util.LinkedHashMap$Entry.recordAccess
   8  1.98% 61.34%      24 300609 net.sf.saxon.type.Converter.convert
....
See full hprof results: java.hprof-DN.txt
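The effect of the fix can be illustrated in miniature. The two Python functions below are my own illustration (not Saxon's code) of the difference between re-checking and re-converting on every comparison versus converting once, up front, on the strength of static type knowledge:

```python
def filter_naive(neighbours, excluded):
    """Per-pair work: every comparison re-converts and re-checks both
    operands, which is roughly what the interpreted GeneralComparison
    was doing at runtime."""
    return [n for n in neighbours
            if not any(str(n) == str(e) for e in excluded)]

def filter_static(neighbours, excluded):
    """With static knowledge that both sides are strings, convert each
    value once and reuse the saved values for every comparison."""
    excluded_set = {str(e) for e in excluded}   # one conversion per value
    return [n for n in neighbours if str(n) not in excluded_set]
```

Both functions return the same result; the second simply does constant work per neighbour instead of re-scanning and re-converting the excluded list each time.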

Improvements in Bytecode generation

In the bytecode we discovered we were missing opportunities to capitalise on static properties known at compile time. For example, during atomization we were doing an instanceof test to see whether each item was a node when we already knew from static analysis that this was the case. We were also able to avoid unnecessary string conversions and instanceof checks, and we found we could avoid repeated conversions by saving string values for reuse where appropriate.

With the code improvements discussed above applied in Saxon-EE 9.5 (pre-release), the table below shows the running times of the stylesheets written by Dimitre and Wolfgang. We observe that in the interpreted code Wolfgang's XSLT is 2.13 times faster than Dimitre's (similar to Dimitre's results above). With the bytecode generation feature switched on, Dimitre's stylesheet has improved dramatically in performance and is now 1.12 times faster than Wolfgang's.

Transformation   Interpreted - Times (secs)   With bytecode generation - Times (secs)
Dimitre          7.373                        1.938
Wolfgang         3.450                        2.17

We have not done any similar analysis on Wolfgang's stylesheet; we will now attempt to do this.

To be continued....