Saxon performance measures of the Word Ladders problem in XSLT

By O’Neil Delpratt on December 06, 2012 at 04:23p.m.

I would like to report on some Saxon performance measure on a Word ladder solution implemented in XSLT.

Firstly, some background information on the Word ladder problem. From Wikipedia, the free encyclopedia:

A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly by one letter.

XSLT interest in this problem was first started (to the best of my knowledge) by Dimitre Novatchev through the mulberry mailing list, who provides a 20 step guide to create a stylesheet in his blog to solve the Word ladder problem (FindChainOfWordsHamming.xsl). Following the post on the list, there has been some interest; another solution to this problem was given by Wolfgang Laun (please see thread, file: FindChainOfWordsHamming2.xsl).

Experimental Evaluation

Our interest resides in the Saxon performances only. I was curious and surprised by the results reported by Dimitre. The question I had is why Dimitre's stylesheet was much slower than Wolfgang's stylesheet in Saxon and faster in another XSLT processor: there must be some optimization step we were not making. I was motivated to understand were the bottle necks were and how we could improve the performance in Saxon.

Wolfgang wrote: "The XSLT program is three times faster on one XSLT implementation than on another one is strange, 'very' strange".

Mike Kay addressed Wolfgang's comment by writing in the thread: "No, it's extremely common. In fact, very much larger factors than this are possible. Sometimes Saxon-EE runs 1000 times faster than Saxon-HE. This effect is normal with declarative languages where powerful optimizations are deployed - SQL users will be very familiar with the effect."

The table below shows the execution times of the stylesheets in Saxon 9.XX (for some recent X). Time were reported by Dimitre.

Transformation Times (secs)
Dimitre 39
Wolfgang 25

We observe that Wolfgang's transformation is 1.56 times faster. Please note that with Wolfgang's stylesheet his results lists all solutions (i.e. ladders), whereas Dimitre only finds one.

Saxon represents a stylesheet as a compiled abstract syntax tree (AST) which is processed in a interpreted manner. Since the release of Saxon 9.4 we have included the bytecode generation feature, which allows us at the compilation phase to generate directly the byte code representation of the entire AST or sub-trees of it where performance benefits can be achieved. We make use of properties we know at compile time (See full paper).

Analysis of Dimitre's Stylesheet

Step one was to see how well Saxon does with the bytecode feature switched on. This proved inconclusive because we discovered a bug in the bytecode generated. A useful exercise already, we managed to fix the bug (see bug issue: #1653). The problem was in the function processQueue the tail recursive call was not being properly generated into bytecode.

The Table below shows running times of the stylesheets under Saxon 9.4.0.6. We observe that Wolfgang's stylesheet was 2.07 and 3.22 faster in Saxon Intepreted and bytecode, respectively.

Transformation Interpreted - Times (secs) With bytecode generation - Times (secs)
Dimitre 7.95 7.78
Wolfgang 3.83 2.41

Analyzing Dimitre's stylesheet with the Saxon tracing profile (i.e. option -TP) proved useful. See the html output produced by Saxon below. We observe that there is a big hit on the processNode method, with the most time spent in this function.

Analysis of Stylesheet Execution Time

Total time: 9498.871 milliseconds

Time spent in each template or function:

The table below is ordered by the total net time spent in the template or function. Gross time means the time including called templates and functions; net time means time excluding time spent in called templates and functions.

file line instruction count avg time (gross) total time (gross) avg time (net) total time (net)
"*rdsHamming.xsl" 79 function my:processNode 2053 4.12 8470.67 3.729 7655.792
"*rdsHamming.xsl" 21 function my:chainOfWords 1 9491.1 9491.12 993.34 993.34
"*rdsHamming.xsl" 131 function f:eq 3993 0.06 230.02 0.058 230.26
"*rdsHamming.xsl" 131 function my:HammingDistance 3993 0.20 807.38 0.049 194.77
"*func-apply.xsl" 21 function f:apply 15972 0.01 290.01 0.011 175.00
"*-Operators.xsl" 244 template f:eq 15972 0.01 115.01 0.004 68.23
"*-Operators.xsl" 248 function f:eq 15972 0.003 46.77 0.003 46.77
"*nc-zipWith.xsl" 21 function f:zipWith 19965 0.002 33.11 0.002 33.11
"*nc-zipWith.xsl" 9 function f:zipWith 19965 0.003 57.67 0.001 24.56
"*func-apply.xsl" 16 function f:apply 15972 0.019 309.52 0.001 19.52
"*rdsHamming.xsl" 70 function my:processQueue 2053 0.009 18.35 0.009 18.35
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.003 10.52 0.003 10.52
"*rdsHamming.xsl" 120 function my:HammingDistance 3993 0.204 814.48 0.002 7.09
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.001 4.88 0.001 4.88
"*rdsHamming.xsl" 73 function my:processNode 2053 4.128 8475.2 0.002 4.57
"*rdsHamming.xsl" 54 function my:processQueue 2053 0.011 22.20 0.002 3.85
"*rdsHamming.xsl" 17 template /* 1 9491.87 9491.9 0.756 0.76
"*rdsHamming.xsl" 40 function my:chainOfWords 1 0.344 0.34 0.344 0.34
"*rdsHamming.xsl" 117 function my:enumerate 10 0.166 1.65 0.029 0.29
"*rdsHamming.xsl" 111 function my:enumerate 10 0.176 1.76 0.010 0.10

In addition to the Saxon tracing profile I ran the Java hrof profiling tool, which showed up that most time was spent in comparing strings. See the Java profile results below. It was now obvious that the GeneralComparison expression was in question. Specifically we narrowed it down to the instruction: <xsl:for-each select="$vNeighbors[not(. = $pExcluded)]">. For the interpreted code we were doing some unnecessary runtime type checking when we know statically at compile time that we are comparing string values. More Specifically, we know at compile time that $vNeighbors is a sequence of untyped atomic values and $pExcluded is a sequence of strings. We were unnecessarily checking at runtime that untyped atomic and string literal were comparable and we were doing an unnecessary conversion from an untyped atomic to string. 

CPU SAMPLES BEGIN (total = 1213) Thu Nov 29 14:42:47 2012
rank   self  accum   count trace method
   1 24.24% 24.24%     294 300547 java.lang.Integer.hashCode
   2 19.13% 43.36%     232 300581 net.sf.saxon.expr.GeneralComparison.compare
   3  7.75% 51.11%      94 300613 java.util.HashMap.getEntry
   4  2.14% 53.26%      26 300570 java.util.LinkedHashMap$Entry.recordAccess
   5  2.06% 55.32%      25 300234 java.lang.ClassLoader.defineClass1
   6  2.06% 57.38%      25 300616 com.saxonica.expr.ee.GeneralComparisonEE.effectiveBooleanValue
   7  1.98% 59.36%      24 300603 java.util.LinkedHashMap$Entry.recordAccess
   8  1.98% 61.34%      24 300609 net.sf.saxon.type.Converter.convert
....

See full hprof results: java.hprof-DN.txt

Improvements in Bytecode generation

In the bytecode we discovered we were missing out on opportunities to capitalise on static properties we know at compile time. For example during atomization we were doing an instanceof test to see whether each item was a node when we already know from static analysis that this was the case. We were also able to avoid unnecessary conversions of the strings, checking of instanceof and we found we could avoid repeated conversions by saving of string values for reuse when appropriate.

With the code improvements discussed above we were able to apply them in Saxon-EE 9.5 (pre-release). The table below shows these running times on the stylesheet written by Dimitre and Wolfgang. We observe that in the interpreted code that Wolfgang's XSL is 2.13 times faster than Dimitre (This is similar to Dimitre results above). With the bytecode generation feature switched on: Dimitre's stylesheet has dramatically improved in performance and is now 1.19 times faster than Wolfgang's XSL.

Transformation Interpreted - Times (secs) With bytecode generation - Times (secs)
Dimitre 7.373 1.938
Wolfgang 3.450 2.17

We have not done any similar analysis on Wolfgang's stylesheet, we will now attempt to do this.

To be continued....