Lazy Evaluation - Saxon diaries


We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using <xsl:for-each> with the saxon:threads attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the lazy evaluation code in Saxon to try and understand all the permutations of how it works. A blog posting is a good way to try and capture that understanding before I forget it all again. But I won't go into the extra complexities of parallel execution just yet: I'll come back to that at the end.


Lazy evaluation applies when a variable binding, for example "let $v := //x[@y=3]" isn't evaluated immediately when the variable declaration is encountered, but only when the variable is actually referenced. This is possible in functional languages because evaluating an expression has no side-effects, so it doesn't matter when (or how often) it is done. In some functional languages such as Scheme, lazy evaluation happens only if you explicitly request it. In others, such as Haskell, lazy evaluation is mandated by the language specification (which means that a variable can hold an infinite sequence, so long as you don't try to process its entire value). In XSLT and XQuery, lazy evaluation is entirely at the discretion of the compiler, and in this post I shall try to summarize how Saxon makes use of this freedom.


Internally, when a local variable is evaluated lazily, Saxon does not put the variable's value in the relevant slot on the stack; instead it puts there a data structure that contains all the information needed to evaluate the variable: that is, the expression itself, and any part of the evaluation context on which it depends. In Saxon this data structure is called a Closure. The terminology isn't quite right, because it's not quite the same thing as the closure of an inline function, but the concepts are closely related: in some languages, lazy evaluation is implemented by storing, as the value of the variable, not the variable's actual value, but a function which delivers that value when invoked; the data needed by this function to achieve that task is correctly called a closure. (If higher-order functions had been available in Saxon a few years earlier, we might well have implemented lazy evaluation this way.)
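In Java terms, that function-based approach amounts to a memoizing thunk. The following sketch is illustrative only (the class names are invented, not Saxon's):

```java
import java.util.function.Supplier;

// A memoized thunk: evaluation is deferred until the first reference,
// and the result is then cached for subsequent references.
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> thunk;    // captures the expression and its context
    private T value;
    private boolean done = false;

    Lazy(Supplier<T> thunk) { this.thunk = thunk; }

    @Override
    public T get() {
        if (!done) {
            value = thunk.get();  // evaluate the expression exactly once
            thunk = null;         // let the captured context be garbage-collected
            done = true;
        }
        return value;
    }
}

public class LazyDemo {
    public static int demo() {
        int[] evaluations = {0};
        Lazy<Integer> v = new Lazy<>(() -> { evaluations[0]++; return 42; });
        v.get();                  // first reference triggers evaluation
        v.get();                  // later references reuse the cached value
        return evaluations[0];    // the expression was evaluated only once
    }

    public static void main(String[] args) {
        System.out.println(demo());   // 1
    }
}
```

This corresponds to deferred (but not incremental) evaluation: the whole value is computed on first reference.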


We can distinguish two levels of lazy evaluation. We might use the term "deferred evaluation" to indicate that a variable is not evaluated until it is first referenced, and "incremental evaluation" to indicate that when it is referenced, it is only evaluated to the extent necessary. For example, if the first reference is the function call head($v), only the first item in the sequence $v will be evaluated; remaining items will only be evaluated if a subsequent reference to the variable requires them.


Lazy evaluation can apply to global variables, local variables, parameters of templates and functions, and return values from templates and functions. Saxon handles each case slightly differently.


We should mention some static optimizations which are not directly related to lazy evaluation, but are often confused with it. First, a variable that is never referenced is eliminated at compile-time, so its initializing expression is never evaluated at all. Secondly, a variable that is only referenced once, and where the reference is not in any kind of loop, is inlined: that is, the variable reference is replaced by the expression used to initialize the variable, and the variable itself is then eliminated. So when someone writes "let $x := /a/b/c return $x[d=3]", Saxon turns this into the expression "(/a/b/c)[d=3]". (Achieving this of course requires careful attention to the static and dynamic context, but we won't go into the details here.)


Another static optimization that interacts with variable evaluation is loop-lifting. If an expression within a looping construct (for example the content of xsl:for-each, or of a predicate, or the right-hand-side of the "/" operator) will have the same value for every iteration of the loop, then a new local variable bound to this expression is created outside the loop, and the original expression is replaced by a reference to the variable. In this situation we need to take care that the expression is not evaluated unless the loop is executed at least once (both to avoid wasted evaluation cost, and to give the right behaviour in the event that evaluating the expression fails with a dynamic error.) So lazy evaluation of such a variable becomes mandatory.


The combined effect of these static optimizations, together with lazy evaluation, is that the order of evaluation of expressions can be quite unintuitive. To enable users to understand what is going on when debugging, it is therefore normal for some of these rewrites to be suppressed if debugging or tracing are enabled.


For global variables, Saxon uses deferred evaluation but not incremental evaluation. A global variable is not evaluated until it is first referenced, but at that point it is completely evaluated, and the sequence representing its value is held in memory in its entirety.


For local variables, evaluation is generally both deferred and incremental. However, the rules are quite complex.


  • If the static type shows that the value will be a singleton, then it will be evaluated eagerly. [It's not at all clear that this rule makes sense. Certainly, incremental evaluation makes no sense for singletons. But deferred evaluation could still be very useful, for example if the evaluation is expensive and the variable is only referenced within a branch of a conditional, so the value is not always needed.]

  • Eager evaluation is used when the binding expression is very simple: in particular when it is a literal or a reference to another variable.

  • Eager evaluation is used for binding expressions that depend on position() or last(), to avoid the complexities of saving these values in the Closure.

  • There are some optimizations which take precedence over lazy evaluation. For example, if there are variable references using predicates, such as $v[@x=3], then the variable will not only be evaluated eagerly, but will also be indexed on the value of the attribute @x. Another example: if a variable is initialized to an expression such as ($v, x) - that is, a sequence that appends an item to another variable - then we use a "shared append expression", a data structure that allows a sequence to be constructed by appending to an existing sequence without copying it; this is a common pattern in algorithms using head-tail recursion.

  • Lazy evaluation (and inlining) need special care if the variable is declared outside a try/catch block, but is referenced within it. In such a case a dynamic error that occurs while evaluating the initialization expression must not be caught by the try/catch; it is logically outside its scope. (Writing this has made me realise that this is not yet implemented in Saxon; I have written a test case and it currently fails.)
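The "shared append expression" mentioned above can be sketched as a list view that shares its base sequence rather than copying it (the names here are invented for illustration, not Saxon's actual classes):

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;

// Sketch: ($v, x) represented as a view over $v plus one extra item,
// with no copying, so repeated appending stays linear rather than quadratic.
class AppendedSequence<T> extends AbstractList<T> {
    private final List<T> base;   // shared, never copied
    private final T extra;

    AppendedSequence(List<T> base, T extra) {
        this.base = base;
        this.extra = extra;
    }

    public T get(int i) { return i < base.size() ? base.get(i) : extra; }
    public int size() { return base.size() + 1; }
}

public class SharedAppendDemo {
    public static List<Integer> demo() {
        List<Integer> v = Arrays.asList(1, 2);
        return new AppendedSequence<>(v, 3);   // behaves as the sequence (1, 2, 3)
    }

    public static void main(String[] args) {
        System.out.println(demo());   // [1, 2, 3]
    }
}
```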


If none of these special circumstances apply, lazy evaluation is chosen. There is one more choice to be made: between a Closure and a MemoClosure. The common case is a MemoClosure, and in this case, as the variable is incrementally evaluated, the value is saved for use when evaluating subsequent variable references. A (non-memo) closure is used when it is known that the value will only be needed once. Because most such cases have been handled by variable inlining, the main case where a non-memo closure is used is for the return value of a function. Functions, like variables, are lazily evaluated, so that the value returned to the caller is not actually a sequence in memory, but a closure containing all the information needed to materialize the sequence. (Like most rules in this story, there is an important exception: tail-call optimization, where the last thing a function does is to call itself, takes precedence over lazy evaluation).


So let's look more closely at the MemoClosure. A MemoClosure is a data structure that holds the following information:


  • The Expression itself (a pointer to a node in the expression tree). The Expression object also holds any information from the static context that is needed during evaluation, for example namespace bindings.

  • A copy of the dynamic context at the point where the variable is bound. This includes the context item, and values of any local variables referenced by the expression.

  • The current evaluation state: one of UNREAD (no access to the variable has yet been made), MAYBE_MORE (some items in the value of the variable are available, but there may be more to come), ALL_READ (the value of the variable is fully available), BUSY (the variable is being evaluated), or EMPTY (special case of ALL_READ in which the value is known to be an empty sequence).

  • An InputIterator: an iterator over the results of the expression, relevant when evaluation has started but has not finished.

  • A reservoir: a list containing the items delivered by the InputIterator so far.


Many variable references, for example count($v), or index-of($v, 'z') result in the variable being evaluated in full. If this is the first reference to the variable, that is if the state is UNREAD, the logic is essentially


inputIterator = expression.iterate(savedContext);
for (Item item : inputIterator) {
    reservoir.add(item);
}
state = ALL_READ;
return new SequenceExtent(reservoir);


(However, Saxon doesn't optimize this case, and it occurs to me on writing this that it could.)


Other variable references, such as head($v), or $v[1], or subsequence($v, 1, 5), require only partial evaluation of the expression. In such cases Saxon creates and returns a ProgressiveIterator, and the requesting expression reads as many items from the ProgressiveIterator as it needs. Requests to get items from the ProgressiveIterator fetch items from the reservoir to the extent they are available; on exhaustion of the reservoir, they then attempt to fetch items from the InputIterator until either enough items are available, or the InputIterator is exhausted. Items delivered from the InputIterator are copied to the reservoir as they are found.
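A much-simplified sketch of this reservoir-plus-InputIterator pattern (single-threaded, and with invented names rather than Saxon's actual classes):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified memo closure: items pulled from the underlying InputIterator
// are copied into a reservoir, so later readers can replay them without
// re-evaluating the expression.
class MemoSequence<T> {
    private final Iterator<T> input;          // the InputIterator
    private final List<T> reservoir = new ArrayList<>();

    MemoSequence(Iterator<T> input) { this.input = input; }

    // A ProgressiveIterator: serves items from the reservoir first, then
    // advances the shared InputIterator, memoizing each item as it goes.
    Iterator<T> iterate() {
        return new Iterator<T>() {
            private int pos = 0;
            public boolean hasNext() {
                return pos < reservoir.size() || input.hasNext();
            }
            public T next() {
                if (pos == reservoir.size()) {
                    reservoir.add(input.next());   // fetch and memoize one more item
                }
                return reservoir.get(pos++);
            }
        };
    }
}

public class ProgressiveDemo {
    public static int countPulls() {
        int[] pulls = {0};
        Iterator<Integer> source = new Iterator<Integer>() {
            int n = 0;
            public boolean hasNext() { return n < 100; }
            public Integer next() { pulls[0]++; return n++; }
        };
        MemoSequence<Integer> v = new MemoSequence<>(source);
        v.iterate().next();   // like head($v): pulls exactly one item
        v.iterate().next();   // a second reference replays it from the reservoir
        return pulls[0];
    }

    public static void main(String[] args) {
        System.out.println(countPulls());   // 1
    }
}
```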


So far so good. This has all been in place for years, and works well. We have no evidence that it is in any way optimal, but it has been carefully tweaked over the years to deal with particular cases where it was performing badly. What has changed recently is that local variables can be referenced from multiple threads. There are two particular cases where this happens today: when xsl:result-document is used in Saxon-EE, it executes by default asynchronously in a new thread; and when the extension attribute saxon:threads is used on xsl:for-each, the items selected by the xsl:for-each are processed in parallel rather than sequentially.


The effect of this is that the MemoClosure object needs to be thread-safe: multiple requests to access the variable can come simultaneously from different threads. To achieve this a number of methods are synchronized. One of these is the next() method of the ProgressiveIterator: if two threads reference the variable at the same time, each gets its own ProgressiveIterator, and the next() method on one of these iterators is forced to wait until the other has finished.


This works, but it is risky. Brian Goetz in his excellent book Java Concurrency in Practice recommends that a method should not be synchronized unless (a) its execution time is short, and (b) as the author of the method, you know exactly what code will execute while it is active. In this case neither condition is satisfied. The next() method of the ProgressiveIterator calls the next() method of the InputIterator, and this may perform expensive computation, for example retrieving and parsing a document using the doc() function. Further, we have no way of analyzing exactly what code is executed: in the worst case, it may include user-written code (for example, an extension function or a URIResolver). The mechanism can't deadlock with itself (because there cannot be a cycle of variable references), but it is practically impossible to prove that it can't deadlock with other subsystems that use synchronization, and in the face of maliciously written user code, it's probably safe to assume that deadlock can occur. We haven't seen deadlock happen in practice, but it's unsatisfactory that we can't prove its impossibility.


So what should we do about it?


I think the answer is, add yet another exception to the list of cases where lazy evaluation is used: specifically, don't use it for a variable that can be referenced from a different thread. I'm pretty sure it's possible to detect such cases statically, and they won't be very common. In such cases, use eager evaluation instead.


We must be careful not to do this in the case of a loop-lifted variable, where the correct error semantics depend on lazy evaluation. So another tweak to the rules is, don't loop-lift code out of a multithreaded execution block.


This investigation also suggests a few other refinements we might make.


  • It seems worth optimizing for the case where the entire value of a variable is needed, since this case is so common. The problem is, it's not easy to detect this case: a calling expression such as count($v) will ask for an iterator over the variable value, without giving any indication that it intends to read the iterator to completion.

  • We need to reassess the rule that singleton local variables are evaluated eagerly.

  • We currently avoid using lazy evaluation for expressions with certain dependencies on the dynamic context (for example, position() and last()). But in the course of implementing higher-order functions, we have acquired the capability to hold such values in a saved copy of the dynamic context.

  • We could look at a complete redesign that takes advantage of higher-order functions and their closures. This might be much simpler than the current design; but it would discard the benefits of years of fine-tuning of the current design.

  • I'm not convinced that it makes sense for a MemoClosure to defer creation of the InputIterator until the first request for the variable value. It would be a lot simpler to call inputIterator = expression.iterate(context) at the point of variable declaration; in most cases the implementation will defer evaluation to the extent that this makes sense, and this approach saves the cost of the elaborate code to save the necessary parts of the dynamic context. It's worth trying the other approach and making some performance measurements.



A redesign of the NamePool - Saxon diaries

As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.

Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.

One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scaleable as parallelism increases.

Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access, it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.

Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? The answer is that we can't completely, but we can get close enough.

The logic is like this:

private final ConcurrentHashMap<StructuredQName, Integer> qNameToInteger =
        new ConcurrentHashMap<StructuredQName, Integer>(1000);
private final ConcurrentHashMap<Integer, StructuredQName> integerToQName =
        new ConcurrentHashMap<Integer, StructuredQName>(1000);
private final AtomicInteger unique = new AtomicInteger();

// Allocate a fingerprint to a QName
Integer existing = qNameToInteger.get(qName);
if (existing != null) {
    return existing;
}
Integer next = unique.getAndIncrement();
existing = qNameToInteger.putIfAbsent(qName, next);
if (existing == null) {
    integerToQName.put(next, qName);
    return next;
} else {
    return existing;
}
Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.

There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.

The external interface to the NamePool doesn't change at all with this redesign. We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.

My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles, each of which performed an addition (from a pool of 10,000 distinct names, so 90% of the additions were already present) followed by 20 retrievals.

I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).

But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision. By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).

So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.

Another look in the NamePool - Saxon diaries


I've been looking again at the implementation of some of the parallel processing features in Saxon (see my XMLPrague 2015 paper) and at how best to make use of the facilities in the Java platform to support them. In the course of this I've been studying Brian Goetz's excellent book Java Concurrency in Practice, which although dated, is still probably the best text available on the subject; and the fact that it only takes you up to Java 6 is an advantage in our case, because we still want Saxon to run on Java 6.


Reading the book has made me think again about the venerable design of Saxon's NamePool, which is the oldest thing in the product where multithreading is relevant. The NamePool is basically a shared data structure holding QNames.


The design of the NamePool hasn't changed much over the years. On the whole it works well, but there are a number of limitations:


  • Updating the NamePool requires an exclusive lock, and on occasions there has been heavy contention on that lock, which reduces the effectiveness of running more threads.

  • We read the NamePool without acquiring a lock. All the textbooks say that's bad practice because of the risk of subtle bugs. We've been using this design for over a dozen years without a single one of these subtle bugs coming to the surface, but that doesn't mean they aren't there. It's very hard to prove that the design is thread-safe, and it might only take an unusual JVM, or an unusual hardware architecture, or a massively parallel application, or pathological data (such as a local name that appears in hundreds of namespaces) for a bug to suddenly appear: which would be a serious embarrassment.

  • The fact that we read the NamePool with no lock means that the data structure itself is very conservative. We use a fixed number of hash buckets (1024), and a chain of names within each bucket. We only ever append to the end of such a chain. If the vocabulary is large, the chains can become long, and searching then takes time proportional to the length of the chains. Any attempt to change the number of buckets on the fly is out of the question so long as we have non-locking readers. So performance degrades with large vocabularies.

  • We've got a problem coming up with XSLT 3.0 packages. We want packages to be independently compiled, and we want them to be distributed. That means we can't bind names to fingerprints during package construction, because the package will have to run with different namepools at run-time. We can probably solve this problem by doing the binding of names at package "load" or "link" time; but it's a change of approach and that invites a rethink about how the NamePool works.


Although the NamePool design hasn't changed much over the years, we've changed the way it is used: which essentially means, we use it less. Some while ago we stopped using the NamePool for things such as variable names and function names: it is now used only for element names, attribute names, and type names. Around Saxon 9.4 we changed the Receiver interface (Saxon's ubiquitous interface for passing push-mode events down a pipeline) so that element and attribute names were represented by a NodeName object instead of an integer fingerprint. The NodeName can hold a name either in string form or as an integer code, or both, so this change meant that we didn't have to allocate a NamePool fingerprint just to pass events down this pipeline, which in turn meant that we didn't have to allocate fingerprints to constructed elements and attributes that were going to be immediately serialized. We also stopped using the NamePool to allocate codes to (prefix, uri) pairs representing namespace bindings.


These changes have been pretty effective and it's a while since we have seen a workload suffer from NamePool contention. However, we want to increase the level of parallelism that Saxon can support, and the NamePool remains a potential pinch point.


There are a number of things we can do that would make a difference. We could for example use a Java ReadWriteLock to allow either a single writer or multiple readers; this would allow us to introduce operations such as reconfiguring the hash table as the size of the vocabulary increases, without increasing contention because of the high frequency of read access.
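As a sketch of that idea, a name table guarded by a ReentrantReadWriteLock might look like this (illustrative only, not Saxon code; with the write lock held exclusively, the table could also be safely resized):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: many concurrent readers, but writers get exclusive access,
// so mutations (including rehashing) are safe.
class LockedNameTable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Integer> names = new HashMap<>();
    private int next = 0;

    Integer lookup(String name) {
        lock.readLock().lock();        // shared: readers don't block each other
        try { return names.get(name); }
        finally { lock.readLock().unlock(); }
    }

    int allocate(String name) {
        lock.writeLock().lock();       // exclusive: only one writer at a time
        try { return names.computeIfAbsent(name, k -> next++); }
        finally { lock.writeLock().unlock(); }
    }
}

public class RwLockDemo {
    public static int demo() {
        LockedNameTable t = new LockedNameTable();
        int a = t.allocate("price");
        int b = t.allocate("price");   // second allocation returns the same code
        return (a == b && t.lookup("quantity") == null) ? a : -1;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // 0
    }
}
```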


But let's first try and remind ourselves what the NamePool is actually for. It is there, first and foremost, to allow fast run-time testing of whether a particular node satisfies a NameTest. Because we use the same NamePool when constructing a source tree and when compiling a stylesheet or query, the node on the tree and the NameTest in the compiled code both know the integer fingerprint of the name, and testing the node against the NameTest therefore reduces to a fast integer comparison. This is undoubtedly far faster than doing a string comparison, especially one involving long URIs.


If that was the only thing we used the NamePool for, then we would only need a single method, allocateFingerprint(namespace, localName). What are all the other methods there for?


Well, firstly, the NamePool knows the mapping of fingerprints to names, so we have read-only get methods to get the fingerprint corresponding to a name, or the name corresponding to a fingerprint. These are a convenience, but it seems that they are not essential. The client code that calls the NamePool to allocate a fingerprint to a name could retain the mapping somewhere else, so that there is no need to go back to a shared NamePool to rediscover what is already known.


The most obvious case where this happens is with the TinyTree. The TinyTree holds the names of elements and attributes as fingerprints, not as strings, so operations like the XPath local-name() and namespace-uri() functions get the fingerprint from the TinyTree and then call on the NamePool to translate this back to a string. We could avoid this by keeping a map from integers to strings within the TinyTree itself. This could potentially have other benefits: we could make fewer calls on the NamePool to allocate fingerprints during tree construction; and retargeting a TinyTree to work with a different NamePool would be easier.


Secondly, there's a lot of code in the NamePool to manage prefixes. This isn't needed for the core function of matching a node against a NameTest, since that operation ignores namespace prefixes. The detail here is that when we call NamePool.allocate(), we actually supply prefix, uri, and local-name, and we get back a 32-bit nameCode which uniquely represents this triple; the bottom 20 bits uniquely represent the local-name/uri pair, and it is these 20 bits (called the fingerprint) that are used in QName comparisons. The purpose of this exercise has nothing to do with making name comparisons faster; rather it is mainly concerned with saving space in the TinyTree. By packing the prefix information into the same integer as the local-name and URI, we save a few useful bits. But there are other ways of doing this without involving the NamePool; we could use the same few bits to index into a table of prefixes that is local to the TinyTree itself. There are of course a few complications; one of the benefits of the NamePool knowing about prefixes is that it can provide a service of suggesting a prefix to use with a given URI when the system is required to invent one: users like it when the prefix that emerges is one that has previously been associated with that URI by a human being. But there are probably less expensive ways of achieving this.


Let's suppose that we reduced the functionality of the NamePool to a single method, allocate(QName) → int. How would we then implement it to minimize contention? A simple and safe implementation might be



HashMap<QName, Integer> map = new HashMap<QName, Integer>();
int next = 0;

public synchronized int allocate(QName q) {
    Integer n = map.get(q);
    if (n == null) {
        int m = ++next;
        map.put(q, m);
        return m;
    } else {
        return n;
    }
}



This still serializes all allocate operations, whether or not a new fingerprint is allocated. We can almost certainly do better by taking advantage of Java's concurrent collection classes, though it's not immediately obvious what the best way of doing it is. But in any case, if we can achieve this then we've reduced the NamePool to something much simpler than it is today, so optimization becomes a lot easier. It's worth noting that the above implementation still gives us the possibility to discover the fingerprint for a known QName, but not to (efficiently) get the QName for a known fingerprint.
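One candidate, sketched here with a plain string key standing in for a QName, is computeIfAbsent on a ConcurrentHashMap: reads never block, and writers contend only when they race on the same key. Unlike the putIfAbsent pattern shown earlier, this version never wastes an integer from the sequence:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: contention-free reads; writers contend only on the same hash bin.
public class ConcurrentNamePool {
    private final ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    public int allocate(String qName) {
        // computeIfAbsent is atomic per key: two threads racing on the same
        // QName are guaranteed to get the same fingerprint.
        return map.computeIfAbsent(qName, k -> next.getAndIncrement());
    }

    public static int demo() {
        ConcurrentNamePool pool = new ConcurrentNamePool();
        int a = pool.allocate("invoice");
        int b = pool.allocate("customer");
        int c = pool.allocate("invoice");   // already present: same fingerprint
        return (a == c && a != b) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // 0
    }
}
```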


To get here, we need to start doing two things:


(a) get prefixes out of the NamePool, and handle them some other way.


(b) stop using the NamePool to discover the name associated with a known fingerprint.


After that, redesign becomes relatively straightforward.

How long is a (piece of) string? - Saxon diaries



As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.

At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.

If we look at the specification, there are two complications we have to deal with. Firstly, we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF-16 code units. In the case of surrogate pairs, one character occupies two code units, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.
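In Java terms the distinction is between String.length(), which counts UTF-16 code units, and codePointCount(), which counts Unicode characters:

```java
public class StringLengthDemo {
    // XPath string-length() counts Unicode codepoints, not UTF-16 code units.
    public static int xpathStringLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF (U+1D11E) is a surrogate pair in UTF-16.
        String s = "G\uD834\uDD1E";
        System.out.println(s.length());            // 3: UTF-16 code units
        System.out.println(xpathStringLength(s));  // 2: Unicode characters
    }
}
```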

Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string). But @code!string-length() returns 3 - the length of the string value of the attribute node.

The other complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider:

<xsl:for-each select="0 to 9">
  <xsl:variable name="f" select="string-length#0"/>
  <xsl:for-each select="21 to 50">
    <xsl:value-of select="$f()"/>
  </xsl:for-each>
</xsl:for-each>

This will print the value "1" three hundred times. In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.

Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like Function.call() which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.

In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.

Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example) have an implementation designed to produce streamed output. That's designed to ensure that an instruction like <xsl:value-of>, which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.

Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this duplication arose was a fear of performance regressions if the existing code for static calls were generalized; introducing a parallel path seemed safer.

In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup("fn:string-length", 0). The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0 where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.)

The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again. In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is the length of the array. And when getStringLength() is called on a string the first time, the string is left in this form, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).
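
The distinction between UTF-16 code units and codepoints, and the idea of choosing a fixed cell width for a UnicodeString-style array, can be sketched in plain Java (this is illustrative only, using JDK methods rather than Saxon's actual classes):

```java
public class CodepointLength {
    // XPath string-length counts codepoints; Java's String.length()
    // counts UTF-16 code units, so surrogate pairs must be detected.
    static int stringLength(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    // Sketch of the fixed-width idea behind Saxon's UnicodeString classes:
    // pick the smallest cell size that can hold every codepoint.
    static int bitsPerCodepoint(String s) {
        int max = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            max = Math.max(max, cp);
            i += Character.charCount(cp);   // skip both halves of a pair
        }
        return max < 256 ? 8 : max < 65536 ? 16 : 32;
    }

    public static void main(String[] args) {
        String music = "G-clef: \uD834\uDD1E";       // ends with U+1D11E
        System.out.println(music.length());          // 10 code units
        System.out.println(stringLength(music));     // 9 codepoints
        System.out.println(bitsPerCodepoint(music)); // 32
    }
}
```

Once the value is held as a fixed-width array, the string length is simply the array length, with no walk required.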

There are a few other optimizations in the implementation of string-length() that are worth mentioning. Firstly, it's quite common for users to write

<xsl:if test="string-length($x) != 0">

Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)
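
The rewrite is safe because a string has zero codepoints if and only if it has zero code units; the two length measures only diverge for non-empty strings. A small check in plain Java (not Saxon code):

```java
public class EmptyTest {
    static int stringLength(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    // The rewrite is valid because a CharSequence has zero codepoints
    // if and only if it has zero UTF-16 code units.
    static boolean nonEmpty(CharSequence s) {
        return s.length() != 0;   // O(1), no surrogate walk needed
    }

    public static void main(String[] args) {
        String s = "\uD834\uDD1E";                 // one codepoint, two code units
        System.out.println(stringLength(s) != 0);  // true
        System.out.println(nonEmpty(s));           // true: same answer, cheaper
    }
}
```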

If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)

During type-checking, the implementation of string-length#0 keeps a note of whether the context item is known to exist. This is used during byte-code generation; if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.

In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of the logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that with the exception of the core code in StringValue.getStringLength, and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes StringLength_0 and StringLength_1, each of which consists of a single one-line method. The code for generating bytecode can also be considerably simplified by achieving more reuse across different functions.

The essence of the reorganization is that the class StringLength (or rather, its two variants) is no longer an Expression; it is now a Function. Previously a call on string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two objects: a StringLength object, which is a pure function, and a SystemFunctionCall object, which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function. This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.
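
The shape of this split can be sketched as follows (hypothetical names and drastically simplified signatures; Saxon's real classes are much richer than this):

```java
public class RefactorSketch {
    // A function is a pure value: it knows how to compute, nothing else.
    interface XdmFunction {
        Object call(Object[] args);
    }

    // One implementation per function, with no expression-tree role.
    static class StringLength1 implements XdmFunction {
        public Object call(Object[] args) {
            String s = (String) args[0];
            return s.codePointCount(0, s.length());
        }
    }

    // One generic call expression, shared by all system functions.
    static class SystemFunctionCall {
        final XdmFunction target;
        final Object[] args;   // simplified: already-evaluated arguments
        SystemFunctionCall(XdmFunction target, Object... args) {
            this.target = target;
            this.args = args;
        }
        Object evaluate() { return target.call(args); }
    }

    public static void main(String[] a) {
        System.out.println(
            new SystemFunctionCall(new StringLength1(), "hello").evaluate());
    }
}
```

The point of the design is that SystemFunctionCall never needs to know which function it is calling, so the per-function code shrinks to the call() body alone.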

So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.

Functions, Function Calls, Function Items - Saxon diaries

| No Comments | No TrackBacks

XSLT and XQuery, in their 3.0 incarnations, have become fully functional programming languages, where functions are first class values that can be manipulated in the same way as other values, for example they can be bound to variables and passed as arguments to other functions. So you would expect that functions play a pretty central role in the Saxon implementation. In this article I shall review how functions work within Saxon, and how I think this needs to change.


One of the things that happens when features are added to a complex piece of software is that they tend to be added around the edges, rather than in the core. You can tell when something was added by how central it is to the class hierarchy. It shouldn't be that way, but there are good reasons why it happens. And when we look at how functions work in Saxon, that's what we see. For example, if we had always had dynamic function calls, then we would probably find that static function calls (where the function to be called is known at compile time) were treated simply as a special case of dynamic calls, much as ($x+1) is treated as a special case of ($x+$y). But because they came along later, dynamic calls are actually handled very differently.


I've been doing work recently on the design of Saxon's expression tree (the data structure produced by the compilation phase for use by the execution phase of stylesheet and query evaluation). One aspect of that has been working out how to write the expression tree to persistent storage, and then load it back again for execution. In doing this, I've been struck by the sheer number of different classes that are somehow related to functions, and by the lack of coherence between them.


The first thing we notice, which is in itself a bit smelly, is that there is no class called Function. In fact, for many functions, there is no object in Saxon that represents the function itself. For example, with system functions (recall that in XSLT 1.0 and XPath 1.0 that's the only kind there were), we have two things: a data table containing general information about functions, such as their type signatures and context dependencies, and a class SystemFunctionCall which actually represents a call to the function, not the function itself. The implementation of functions such as name() or substring() is in a class called Name or Substring which is a subclass of SystemFunctionCall. This in turn is a subclass of Expression, and as such it forms a node in the expression tree.


This works fine for static function calls, but what happens to an expression such as substring#3 or true#0 or name#0? Answer: different things in each of these three cases.


For substring#3, this is a pure function: there are no context dependencies. We create something called a SystemFunctionItem, which is an Item (and therefore a Sequence), we wrap this inside a Literal, and we put the Literal on the expression tree. This is much the same as the way we handle a constant such as "London" or 39. Internally, the SystemFunctionItem contains a reference to an instance of the Substring class, which is where things get a bit peculiar, because Substring is designed to be a function call, not a function. In fact, Substring has a dual personality, it tries to act in both roles depending which methods you call. That's a hack, and like all hacks, it comes back to bite you.

For true#0 there's a further bit of hackery, because the function call true() doesn't actually generate a SystemFunctionCall, it generates a BooleanValue: it's treated as a constant, just like "London" or 39. But we have to support dynamic calls on true(), so we introduced a SystemFunctionCall called True to handle this case, even though it doesn't have the dual personality: it acts only as a function, never as a function call.


The name#0 function is different again, because it encapsulates context. It's defined to return the name of the node that was the context node at the point where the name#0 expression was evaluated. So this should remind us that name#0 is not a value, it is an expression: the function it represents has no static existence, it can only be created dynamically by evaluating the expression with a particular dynamic context. We solve this with another hack: for context-dependent function "literals" like name#0 or position#0, the compiler actually generates a call on function-lookup("name", 0), which is an expression rather than a value, and which has to be evaluated at run-time.


The SystemFunctionItem class implements an internal interface called FunctionItem, and as one might expect, a FunctionItem is an Item and therefore a Sequence: that is, it's a run-time value. Other subclasses of FunctionItem are used for calls to user defined functions, Java extension functions, or constructor functions. But they are only ever used for dynamic calls, never for static calls.


Although there is no Function class, some functions are in fact represented as objects in their own right. The most important is UserFunction, which represents a user-written function in XSLT or XQuery. Another is InlineFunction (representing an anonymous inline function), and another is ExtensionFunctionDefinition, which represents a function whose implementation is provided in Java code: this is used both for user-written extension functions, and for many (but not all) vendor supplied extension functions in the Saxon namespace. But these classes have nothing in common with each other; there is no common superclass. This has the consequence that not only is there one set of machinery for static function calls and another quite different set for dynamic calls, but that in each case, there are many different flavours depending on what kind of function you happen to be calling.


Getting this right involves a cleaner data model. As always, a clean data model leads to cleanly structured code, and the data model should be designed to accurately reflect the specification, not for the convenience of the implementation. The specification says we have two objects of interest, a function, and a function call. Quite rightly, the spec no longer uses the term "function item" as something distinct from the function: a function is an item. There's also something called the "function implementation" which we should recognize, because two functions may share an implementation. For example, the expression name#0 returns a different function each time it is evaluated: these functions differ in the data they hold (a snapshot of the dynamic context), but they share the same implementation.
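
The function/implementation distinction can be sketched like this (a hypothetical Java analogy, not Saxon code): the implementation is a single shared object, while each evaluation of name#0 produces a fresh function item that snapshots its bit of dynamic context:

```java
import java.util.function.Supplier;

public class SharedImplementation {
    // One shared implementation: pure logic, held once.
    interface FunctionImplementation {
        String invoke(String contextNodeName);
    }

    static final FunctionImplementation NAME_0 = contextName -> contextName;

    // Each evaluation of name#0 yields a new function item that captures
    // the context node, but all of them share the implementation above.
    static Supplier<String> evaluateNameHash0(String contextNodeName) {
        return () -> NAME_0.invoke(contextNodeName);
    }

    public static void main(String[] args) {
        Supplier<String> f1 = evaluateNameHash0("employee");
        Supplier<String> f2 = evaluateNameHash0("dept");
        System.out.println(f1.get()); // employee
        System.out.println(f2.get()); // dept
    }
}
```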


We should be trying to move towards a structure where we have a hierarchy of subclasses of Function (for system functions, constructor functions, user functions, Java extension functions, etc); where the Function is an Item; where a Function may reference a FunctionImplementation to provide for sharing; and where a FunctionCall is as far as possible independent of what kind of function it is calling, but simply has two flavours for static function calls and dynamic function calls.


It will be interesting to see how close we can get to that situation.


Redesigning the Saxon expression tree - Saxon diaries

| No Comments | No TrackBacks
I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.

One of the problems is that there are 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.

One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves the part of the static context that it thinks it might need. For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another.) A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context - if one operand turns out to be untyped atomic and the other is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better, rather than saving what each expression thinks it might need, to record changes to the static context on the tree itself, so that each expression can discover the whole static context by processing a set of deltas held on its ancestors in the tree.
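
The delta idea can be sketched like this (hypothetical classes, for illustration only): instead of each expression holding its own copy of a static context property, it walks its ancestors and takes the nearest override:

```java
public class ContextDeltas {
    // A tree node records only the properties it changes; everything
    // else is inherited by walking up through the ancestors.
    static class TreeNode {
        final TreeNode parent;
        final String baseUriOverride;   // null means: inherit
        TreeNode(TreeNode parent, String baseUriOverride) {
            this.parent = parent;
            this.baseUriOverride = baseUriOverride;
        }
        String staticBaseUri() {
            for (TreeNode n = this; n != null; n = n.parent) {
                if (n.baseUriOverride != null) return n.baseUriOverride;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        TreeNode module = new TreeNode(null, "http://example.com/module");
        TreeNode expr = new TreeNode(new TreeNode(module, null), null);
        // No per-expression copy: the module-level value is found by search.
        System.out.println(expr.staticBaseUri());
    }
}
```

The trade-off is a lookup cost at the point of use in exchange for not duplicating context data on every node.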

This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization) so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.

In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, and also properties in effect of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modifications to the tree, such as substituting one child expression for another (which is very common during type-checking and optimization), are still entirely ad-hoc, and have to be managed independently by each class of expression node.

As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.
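
A minimal sketch of the Operand mechanism (simplified hypothetical classes; the real Operand carries far more, such as streamability properties and location information): the reference to the child is encapsulated so that every structural change funnels through one method, which can do the housekeeping:

```java
public class OperandSketch {
    static class Expression {
        boolean cachedTypeInvalid = false;   // stand-in for cached properties
    }

    static class Operand {
        private final Expression owner;
        private Expression child;            // encapsulated: no direct access
        Operand(Expression owner, Expression child) {
            this.owner = owner;
            this.child = child;
        }
        Expression getChildExpression() { return child; }
        // All structural changes to the tree go through here, so the
        // housekeeping (invalidating cached properties on the owner,
        // copying location info, etc.) happens in exactly one place.
        void setChildExpression(Expression newChild) {
            if (newChild != child) {
                child = newChild;
                owner.cachedTypeInvalid = true;
            }
        }
    }

    public static void main(String[] args) {
        Expression parent = new Expression();
        Operand op = new Operand(parent, new Expression());
        op.setChildExpression(new Expression());
        System.out.println(parent.cachedTypeInvalid); // true
    }
}
```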

This process is complicated by the fact that the nodes on the expression tree are highly diverse, in fact they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.

Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck and optimize. For example, the default action of optimize is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, meaning that the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as promote().

I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of A and C simultaneously (making it not a tree at all)? It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes, which may then be discarded if not all the conditions for the optimization are met. By that point we have updated the parent pointers in the leaf nodes to point to this new superstructure, which is then thrown away, reverting to the original. It's difficult to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.

This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.

Sharing key indexes - Saxon diaries

| No Comments | No TrackBacks
For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some careful juggling of Java weak references to ensure that the indexes are dropped from memory as soon as either the executable stylesheet or the source document are no longer needed.

I've now spotted a flaw in this design. It wasn't reported by a user, and it didn't arise from a test case, it simply occurred to me as a theoretical possibility, and I have now written a test case that shows it actually happens. The flaw is this: if the definition of the key includes a reference to a global variable or a stylesheet parameter, then the content of the index depends on the values of global variables, and these are potentially different in different transformations using the same stylesheet.

Fixing this is not easy, and poses some interesting dilemmas, given that no-one has hit the problem in a dozen years and it's quite likely that no-one ever will. It would be nice to roll back my discovery of the problem and pretend that it doesn't exist, but that's not my style.

One solution would be to abandon the strategy of reusing indexes. It's likely that very few users have workloads that benefit from this strategy. On the other hand, those that do could see a big performance hit. Removing a powerful optimization because of a problem that doesn't affect these workloads is not very friendly.

The right solution is to identify those key definitions that allow indexes to be shared and distinguish them from indexes that can't be shared, and then manage the two kinds of index differently. Both the identification and the management can be done, but they introduce more complexity and hence more risk of bugs, and the whole thing feels like overkill given that we don't know any users who have the problem.
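
The shape of that "right solution" might look something like this (purely illustrative; none of these names or signatures are Saxon's): key definitions found to be free of global-variable references go into a cache shared across transformations, while context-dependent ones are cached only within a single transformation:

```java
import java.util.HashMap;
import java.util.Map;

public class IndexSharing {
    // Shared across transformations: safe only for key definitions whose
    // indexes cannot depend on global variables or stylesheet parameters.
    static final Map<String, Map<String, String>> sharedCache = new HashMap<>();

    static Map<String, String> getIndex(
            String docId, String keyName, boolean dependsOnGlobals,
            Map<String, Map<String, String>> perRunCache) {
        // Choose the cache according to the key definition's analysis.
        Map<String, Map<String, String>> cache =
            dependsOnGlobals ? perRunCache : sharedCache;
        return cache.computeIfAbsent(docId + "#" + keyName,
                                     k -> new HashMap<>());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> run1 = new HashMap<>();
        Map<String, Map<String, String>> run2 = new HashMap<>();
        // Context-independent key: both runs reuse one index object.
        boolean shared = getIndex("doc1", "k1", false, run1)
                      == getIndex("doc1", "k1", false, run2);
        // Key referencing a global variable: each run builds its own.
        boolean separate = getIndex("doc1", "k2", true, run1)
                        != getIndex("doc1", "k2", true, run2);
        System.out.println(shared + " " + separate); // true true
    }
}
```

The complexity the post worries about lies in the "dependsOnGlobals" analysis and in managing the lifetimes of the two caches, not in the dispatch itself.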

Civil engineers design structures to withstand storms that are likely to happen no more than once in N years. It would be nice if we could adopt a similar approach here. But software doesn't seem to be like that; it seems to follow Murphy's law, which states that if something can go wrong, it will.

So with heavy heart, I think that there's no alternative to "doing the right thing", with all the complexity it entails.

Reference: Saxon bug 1968; W3C test case key-085a/b.

Another regex rewrite - Saxon diaries

| No Comments | No TrackBacks

In Saxon 9.5, a new regular expression engine was introduced. This was obviously a risky thing to do, but I'm not averse to such risks, and although there is often some short-term inconvenience to users, I believe such things usually bring long-term benefits. The risks relate to both functionality and performance. In both cases we were able to minimize the risks thanks to a good collection of test material, but we always knew that contact with real life problems might throw up things that the test suite didn't.


Why did we need the new engine? There were several factors. On Saxon-CE, the GWT technology didn't offer access to the JDK regex library, only to the Javascript regex engine, which has different syntax and semantics, and no support for non-BMP Unicode characters. We could have translated XPath regexes to Javascript regexes the way we were doing for the JDK, but the expansions to handle non-BMP characters are very cumbersome. (We could also have abandoned W3C regex conformance, but that's not my style.) For "server-side Saxon", I had been impatient with the existing design for some time. There are bugs I reported in the JDK regex engine 5 years ago which Sun/Oracle have shown no inclination to fix; the approach of translating one regex dialect to another is slow and clunky; and the JDK wasn't keeping pace with new Unicode versions. I felt it was time we brought the regex engine in-house.


I looked around for an open source regex engine suitable for forking, and chose the Apache Jakarta engine. Unlike some other options the code was clear and well commented, the performance seemed reasonable, and the license was compatible. I made all the changes needed to support the XPath regex syntax and semantics, and a few other changes including some optimizations, and it passed all the tests. In fact, it has proved very reliable; there have been no bug reports against its functionality.


But there have been two performance glitches.


  • One is the way it handles integer repetition bounds such as a{3,7}. The algorithm it uses for this is the same as Saxon uses in its schema processor, which is described in a paper by Thompson and Tobin [1]; it essentially translates such an expression to aaaa?a?a?a?, and then generates a finite state machine accordingly. This becomes problematic when the integer bounds start to become significantly large; the size of the finite state machine and the time taken to generate it both increase rapidly with size. (For the technically minded, I'm not sure how rapidly, but even if it's only linear, it's a significant problem.)

  • The second problem is the handling of backtracking. The Jakarta engine, like most regex engines, relies on backtracking to handle ambiguous expressions. Backtracking relies on maintaining a stack so you can go back to where you were. In Jakarta, this data was kept on the Java program stack; each checkpoint was represented by a recursive function call. In many cases each character position in the input string is a potential checkpoint, so the depth of recursion was essentially equal to the length of the input string. That puts a limit of around 1K to 2K on the size of the input, which is not really acceptable.
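
The expansion in the first bullet is easy to make concrete. A sketch of the Thompson/Tobin-style rewrite (illustrative only, not the Jakarta or Saxon code): a{min,max} becomes min mandatory copies of the atom followed by (max - min) optional ones, so the machine grows with the literal bounds:

```java
public class BoundedRepetition {
    // Rewrite atom{min,max} as min copies of atom followed by
    // (max - min) optional copies; the finite state machine built from
    // the result grows with the numeric bounds.
    static String expand(String atom, int min, int max) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < min; i++) sb.append(atom);
        for (int i = min; i < max; i++) sb.append(atom).append('?');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("a", 3, 7));          // aaaa?a?a?a?
        System.out.println(expand("a", 3, 7000).length()); // grows with the bounds
    }
}
```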


For the 9.5 release I mitigated this problem by introducing an optimization: in cases where repetition was unambiguous, the recursive calls were eliminated. This catches many cases like A*B where you know that you have to keep reading A's until you hit a B, and when you do hit a B you know you won't have to try again from somewhere in the middle of the sequence of A's. But unfortunately, ambiguous expressions are common (an example would be .*B), and we have to handle them.


So I did something that I very rarely do; I studied how the JDK regex engine avoided this problem, by following its activity through the debugger. I was surprised to see that the JDK engine doesn't actually build a finite state machine. Instead it uses a pure interpreter pattern: it builds an abstract syntax tree representing the grammar of the regex, and then evaluates the regex by direct interpretation of the constructs on this tree. The maximum recursive depth with this approach is represented by the depth of the abstract syntax tree (that is, it depends on the regex but not on the input string), and the data needed for backtracking is held on the heap rather than on the Java stack.


Over the last week I have been rewriting the back end of the (ex-)Jakarta regex engine to use a similar design. It's passing all the functionality tests; I still have some performance testing to do, but it's looking good. The front-end code, that is the regex parser, is largely unchanged, but it now generates an operation tree rather than an FSM. I retained many of the features of the existing design, such as the use of a variety of implementations of integer sets and predicates to support the different kind of character classes, but the operations have all been re-implemented to work interpretively. I departed from the JDK design by using an approach similar to Saxon's XPath interpreter, in that the implementation of each operation returns an iterator over all the strings matched by the relevant expression. The key method supported by each operation is


IntIterator iterateMatches(final REMatcher matcher, final int position)


which is passed an REMatcher (containing all necessary information about the match operation, including the input string and regex flags, plus support for captured groups and backreferences), and a start position for matching; it returns an IntIterator representing a sequence of integers which are the possible positions where a match can end.


The class handling a repeat operation such as (a|aa)* implements this operation by repeated calls on the underlying choice operation. The first choice operation will return an iterator over the two possible end positions (1,2); the next time round there are four possible positions (1,2,2,3) (there is no attempt to eliminate duplicates), and so on. The method maintains a stack of IntIterator objects representing these successive calls; initially only the first hit from each iterator is used, but when backtracking is necessary, the stack is popped, and the iterator now at the top of the stack is called to deliver its next match.
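
The "iterator of possible end positions" idea can be sketched in a few lines of plain Java (using lists instead of Saxon's IntIterator, and hard-coding the (a|aa) choice rather than building an operation tree):

```java
import java.util.ArrayList;
import java.util.List;

public class ChoiceStarDemo {
    // One repetition of the choice (a|aa) starting at pos: return every
    // position where that repetition could end.
    static List<Integer> choiceEnds(String input, int pos) {
        List<Integer> ends = new ArrayList<>();
        if (pos < input.length() && input.charAt(pos) == 'a') {
            ends.add(pos + 1);                           // matched "a"
            if (pos + 1 < input.length() && input.charAt(pos + 1) == 'a') {
                ends.add(pos + 2);                       // matched "aa"
            }
        }
        return ends;
    }

    // All end positions of (a|aa)* from position 0, duplicates included:
    // repeated calls on the choice operation, one round per repetition.
    static List<Integer> starEnds(String input) {
        List<Integer> all = new ArrayList<>();
        List<Integer> frontier = new ArrayList<>();
        frontier.add(0);
        all.add(0);                                      // zero repetitions
        while (!frontier.isEmpty()) {
            List<Integer> next = new ArrayList<>();
            for (int pos : frontier) next.addAll(choiceEnds(input, pos));
            all.addAll(next);
            frontier = next;
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(starEnds("aaa")); // [0, 1, 2, 2, 3, 3, 3]
        // Positions 2 and 3 are reachable in more than one way, so they
        // appear more than once: this is the ambiguity that backtracking
        // (and, in better engines, duplicate elimination) has to handle.
    }
}
```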


Saxon continues to take advantage of the optimization that recognizes unambiguous repetitions. With patterns such as A*B, or A{5}A, there is no need for backtracking so the stack of iterators is not used, and only the first match from each iterator is used.


There is scope for further optimization here. Some regular expression engines such as PCRE are known to handle highly-ambiguous expressions much more efficiently than this crude backtracking approach, mainly by eliminating the duplication that arises when a backtrack attempt ends up doing the same matches at the same positions as a previous attempt. Such improvements lie in the future; for the moment, being as good as the Java engine in performance, and better in features such as Unicode support and buglessness, is enough to declare success.


The new code is currently operational on the Saxon 9.6 development branch. In due course it will be retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance release. Probably only 10% of the original Jakarta code remains, but that's probably enough that I can't get rid of the Apache license that goes with it, for those who care about these things.


[1] Using Finite-State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. Henry Thompson and Richard Tobin. XML Europe 2003



At the XML Summer School 2013, Tony Graham presented a lightning talk about life after libxslt 1.0. I was not present at this summer school, but it was clear from the feedback I received that there is a major gap: XSLT 2.0 support is missing for the large developer community in the C/Perl/PHP/Python/Ruby world, and for the associated tools that rely on libxslt.

It is a known problem, which has never, to my knowledge, been addressed. At Saxonica, we wanted to try and plug this gap by porting the Saxon processor from Java to C/C++, which would enable us to communicate with the languages specified above. One of our goals, if possible, was to interface with libxml and libxslt. Providing such a bridge or cross-compiled version of a fully-fledged Java application in C/C++ is always a daunting task. In this blog post I discuss the technical steps in our quest to achieve our goals and give some details of the experiences gained along the way. I will begin by detailing the various technologies that we tried, and how we ended up using a commercial Java native compiler after several failed attempts with tools that either did not work, were cumbersome, or were just too error-prone.

LLVM

At the summer school there were discussions suggesting that LLVM could do the job of compiling Java to native code. As claimed on the project website, LLVM is a collection of modular and reusable compiler and toolchain technologies. The LLVM project seems very active, with many projects using it for various tasks, but I found it difficult to get anything working. In particular, I tried using VMKit, which relies on LLVM, to compile a simple 'Hello World' example to machine code, but even that seemed cumbersome.

GCJ

Secondly, I looked at the GCJ technology. GCJ is a tool that I have used before, so I was confident that it would work. However, my past experience is that it can be error-prone and contains long-standing bugs; and since the project has been dormant for several years, it seems unlikely that these bugs will ever be fixed. The other worrying fact is that GCJ only supports up to JDK 1.5. Nevertheless, for lack of other options, I persevered with GCJ, and with rather more success: I managed to compile Saxon-HE to native machine code and actually got it to execute my example stylesheets. I had some problems because of classes that were not present in the GCJ implementation of JDK 1.5, such as the packages java.math and javax.xml, so I had to include my own versions of these packages.

The next step was to create a shared library of Saxon-HE, so that I could interface it with C/C++. This proved to be a real battle, but in the end I succeeded. I decided to use the Compiled Native Interface (CNI), which presents a convenient way to write Java native methods in C++. The alternative was JNI (Java Native Interface), which may be viewed as more portable. Both interfaces rest on similar principles: you need a CNI-aware C++ compiler (any recent version of G++ is capable), and your C++ code must include the header file for each Java class it uses. These header files, if not generated automatically, can be produced using gcjh. I eventually gave up on GCJ: I stumbled upon a few known bugs, and I reasoned that if I was having major issues with the setup and prerequisites, then users would surely have the same problems.

Excelsior JET

The Excelsior JET tool is the final technology we looked at, and thankfully it is what we have ended up using in the alpha release. JET is a commercial product that provides a Java native compiler for both Linux and Windows platforms. What is good about this software tool is that it provides an easy-to-use graphical interface for building native executables and shared libraries from jar file(s). It can also package up the software into an installer ready to be deployed onto its intended host machine. This was great for us!

There is a lot I could write about JET, but it would be a repeat of the plethora of information already available on their website and forum. It is worth mentioning, though, that we started with their evaluation version, which offers 90 days' free usage before purchasing the professional edition. Another point of interest is that Excelsior offer a free-of-charge license for use in conjunction with open-source software.

We know that there will be some sections of the open-source community that dislike the dependency on a commercial tool, but it is not that dissimilar from the early years of Java, when the Sun compiler was freely available but not open-sourced.

Implementation notes using JET

After creating the shared library, I used JNI to interface it with C/C++. It would have been possible to use JET's own Java interface to external functions, called xFunction, which is recommended if starting from scratch, but having used JNI with GCJ I continued with that approach. To get started there are a few examples available of invoking a library from C/C++. In essence, you need to load the library and initialize the JET run-time before you can use it; see the code below (from the file xsltProcessor.cc):


/* Load dll. */
HANDLE loadDll(char* name)
{
  HANDLE hDll = LoadLibrary (name);

  if (!hDll) {
    printf ("Unable to load %s\n", name);
    exit(1);
  }

  printf ("%s loaded\n", name);
  return hDll;
}

extern "C" {jint (JNICALL * JNI_GetDefaultJavaVMInitArgs_func) (void *args);
            jint (JNICALL * JNI_CreateJavaVM_func) (JavaVM **pvm, void **penv, void *args);
}

/* Initialize JET run-time. */
extern "C" void initJavaRT(HANDLE myDllHandle, JavaVM** pjvm, JNIEnv** penv)
{
  int result;
  JavaVMInitArgs args;

  JNI_GetDefaultJavaVMInitArgs_func =
  (jint (JNICALL *) (void *args))
  GetProcAddress (myDllHandle, "JNI_GetDefaultJavaVMInitArgs");
  JNI_CreateJavaVM_func =
  (jint (JNICALL *) (JavaVM **pvm, void **penv, void *args))
  GetProcAddress (myDllHandle, "JNI_CreateJavaVM");

  if(!JNI_GetDefaultJavaVMInitArgs_func) {
    printf ("%s doesn't contain public JNI_GetDefaultJavaVMInitArgs\n", dllname);
    exit (1);
  }

  if(!JNI_CreateJavaVM_func) {
    printf ("%s doesn't contain public JNI_CreateJavaVM\n", dllname);
    exit (1);
  }

  memset (&args, 0, sizeof(args));
  args.version = JNI_VERSION_1_2;
  result = JNI_GetDefaultJavaVMInitArgs_func(&args);
  if (result != JNI_OK) {
    printf ("JNI_GetDefaultJavaVMInitArgs() failed with result %d\n", result);
    exit(1);
  }

  /* NOTE: no JVM is actually created
  * this call to JNI_CreateJavaVM is intended for JET RT initialization
  */
  result = JNI_CreateJavaVM_func (pjvm, (void **)penv, &args);
  if (result != JNI_OK) {
    printf ("JNI_CreateJavaVM() failed with result %d\n", result);
    exit(1);
  }
  printf ("JET RT initialized\n");
  fflush (stdout);
}

XsltProcessor::XsltProcessor(bool license) {
  /* First of all, load the required component.
   * By the time of JET initialization, all components should be loaded.
   */
  myDllHandle = loadDll (dllname);

  /*
  * Initialize JET run-time.
  * The handle of loaded component is used to retrieve Invocation API.
  */
  initJavaRT (myDllHandle, &jvm, &env);
  
  /* Look for class.*/
  cppClass = lookForClass(env, "net/sf/saxon/option/cpp/XsltProcessorForCpp");
  versionClass = lookForClass(env, "net/sf/saxon/Version");

  cpp = createObject (env, cppClass, "(Z)V", license);
  jmethodID debugMID = env->GetStaticMethodID(cppClass, "setDebugMode", "(Z)V");
  if(debugMID){
    env->CallStaticVoidMethod(cppClass, debugMID, (jboolean)false);
  }
  ....
}
...

In the constructor of XsltProcessor we see that, once we have loaded the library and initialized the JET run-time, we can make calls on the JNI environment to look up class definitions and create instances of those classes in the Java world, before going on to make method calls on the objects.

PHP Extension in C/C++

After successfully getting XSLT transformations to work within C/C++, the next step was to try and develop a PHP extension, which would operate like libxslt. There is a lot of material on the web and in books about writing PHP extensions, and I found the following guide very useful: http://devzone.zend.com/1435/wrapping-c-classes-in-a-php-extension/. I literally followed it step-by-step, adding a few steps of my own once I had worked out what I was doing.

Testing

As a proof of concept I wrote a test harness in PHP which makes use of the PHP extension (see: xslt30TestSuite.php in the download library). This is a test driver designed to run the public W3C XSLT test suite at https://dvcs.w3.org/hg/xslt30-test/. The test driver in its current form requires Saxon-EE, which is not yet available in this alpha release; nevertheless, the program may serve as a useful example of how the API can be used. Note that it is written to use libXML to read the test catalog, but to use Saxon for running the tests and assessing the results.

Performance Testing

I now draw some preliminary comparisons between Saxon-HE running on Java and Saxon-HE/C running from C++ and from PHP. I also compare these times against libxslt (C/C++). An important aim is to get a good measure of the cost of crossing the Java/C++ boundary using JNI, and also to see what the effect is with the PHP extension.

I used Saxon-HE 9.5.1.3 as the baseline. The test machine was an Intel Core i5 430M laptop with 4GB memory, a 2.26GHz CPU and 3MB L3 cache, running Ubuntu 13.10 Linux, with Apache 2 and PHP version 5.5.3-1ubuntu2. The Java compiler was Sun/Oracle Java 1.6.0.43.

The experiments were based on the XMark benchmark. I used query q8, converted into the stylesheet below. I chose q8 because the equijoin in the query should expose performance bottlenecks across all the implementations:

<result xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="2.0">
<!-- Q8.  List the names of persons and the number of items they bought.
          (joins person, closed_auction) -->

  <xsl:for-each select="/site/people/person">
    <xsl:variable name="a" 
       select="/site/closed_auctions/closed_auction[buyer/@person = current()/@id]"/>
    <item person="{name}"><xsl:value-of select="count($a)"/></item>  
  </xsl:for-each>
  
</result>  
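For the Java baseline, timings of this kind can be taken with a plain JAXP invocation. The sketch below is not the actual benchmark harness: it substitutes a trivial inline stylesheet and document so that it is self-contained, but the timing pattern is the same (compile the stylesheet once outside the timed region, then time only the transform call):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TimeTransform {
    public static void main(String[] args) throws Exception {
        // Trivial stand-ins for q8.xsl and xmark64.xml, to keep the sketch self-contained
        String xslt =
            "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
          + "<xsl:template match='/'>"
          + "<count><xsl:value-of select='count(//item)'/></count>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        String xml = "<items><item/><item/><item/></items>";

        // Compile the stylesheet once, outside the timed region
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));

        // Time only the transformation itself
        long start = System.nanoTime();
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(out.toString());
        System.out.println("elapsed: " + elapsedMs + "ms");
    }
}
```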

The running times for executing q8.xsl on the document xmark64.xml, a 64MB document, are as follows:

Saxon-HE (Java):    60.5 seconds

Saxon-HE/C (C++): 132 seconds 

Saxon-HE/C (PHP): 137 seconds

libxslt (C/C++):        213 seconds

* Update on the times reported for Saxon-HE/C as a result of optimizations in the JET compiler.
* Code used to get libxslt time taken from: http://xmlsoft.org/XSLT/tutorial/libxslttutorial.html

The times for Saxon-HE/C exclude the cost of JET initialisation and loading the library, which accounted for only 4 seconds. We observe that there is not a big overhead between C++ and the PHP extension. The biggest cost, as expected, is crossing between Java and C++, where we see on the C++/PHP platform a slowdown of roughly 2.2x. We also observe that Saxon-HE/C outperforms libxslt by around 40% on q8.

See project page on Saxon/C.
 




Stripping the DOM - Saxon diaries

| No Comments | No TrackBacks
I discovered yesterday that Saxon-CE isn't applying xsl:strip-space directives to input documents.  An unfortunate bug: not that many users seem too bothered by it, but conformance is always important.

A reminder: xsl:strip-space and xsl:preserve-space are designed to remove "ignorable" whitespace from the application's view. Because XML  doesn't distinguish significant from insignificant whitespace (a bad design mistake), this can only be done under program control. The idea is to present the application (the stylesheet) with a view of the input tree in which the insignificant whitespace does not appear, and the xsl:strip-space and xsl:preserve-space directives allow the stylesheet author to say which whitespace is considered significant.

How to fix the bug? Server-side Saxon (what I am starting to call "Big Saxon") traditionally provides two mechanisms: either it wraps the supplied DOM in a wrapping layer that hides the offending whitespace text nodes, or it bulk-copies the supplied DOM to an internal tree structure in which the whitespace is not copied across. Both methods have disadvantages. The "wrapping" approach imposes a significant extra cost on navigation of the tree (the cost is higher than it might appear, because whitespace stripping also has to take account of xml:space attributes in the document, which means a lot of searching for attributes-of-ancestors). The bulk-copy approach imposes a significant start-up cost to the transformation, often larger than the cost of doing the real transformation, and the cost is particularly high when processing large input documents because it doubles the memory requirement.

I'm considering using a third approach: in-situ modification of the supplied DOM. For Big Saxon, I have tended to treat this option as unthinkable: you don't muck with the user's data in this way. The same DOM, after all, might be input to multiple stylesheets, perhaps concurrently (though DOM is not thread-safe so this is a bad idea....) and they might have different space-stripping directives. A Java application might pass the DOM to an XSLT stylesheet to do some processing, but might want to continue operating on the same DOM afterwards.
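To make the third approach concrete, here is a minimal Java sketch of in-situ whitespace stripping on a standard DOM. It simply removes every whitespace-only text node; real xsl:strip-space handling would also have to consult the element names listed in the directives and any xml:space attributes, all of which is omitted here:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class StripSpace {
    // Recursively delete whitespace-only text nodes, modifying the DOM in place
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();  // save before possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);                    // descend into elements
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>\n  <b>text</b>\n  <c/>\n</a>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        strip(doc.getDocumentElement());
        // The three whitespace-only text nodes are gone; only <b> and <c> remain
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```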

In Saxon-CE, however, these considerations hardly apply. In the vast majority of cases the DOM is built solely in order to act as XSLT input. Unlike Big Saxon, where one can assume that if users supply a DOM they are doing so by choice, with Saxon-CE there is no alternative: all input XML arrives this way.

In-situ clean-up of the input DOM has other attractions; it can be used to solve the other two big problems with DOM processing (both arising from a mismatch between the DOM model and the XDM model): adjacent text nodes, and namespaces. The same pre-scan that strips unwanted whitespace nodes can also concatenate adjacent text nodes (eliminating entity references and CDATA sections in the process); presenting a view of adjacent text nodes as a single node accounts for a great deal of complexity (and lines of code, and execution time) in navigating a DOM. Equally for namespaces. To determine the namespace part of the name of an element or attribute, Saxon can't simply call getNamespaceURI() on the underlying DOM node. That method is defined to have no effect if "the node was created using DOM level-1 interfaces". Saxon of course has no idea how the node was created, and DOM provides no way of finding out. So it does an optimistic call of getNamespaceURI() just in case, and if this returns null then it has no choice but to search the node's ancestors and attributes looking for namespace declarations - and since this is done every time you look at a node to see whether it satisfies an axis step, it's no wonder that Saxon can be ten times slower processing a DOM than when processing its native tree structure. We should be able to cut this cost dramatically by doing a pre-scan of the document and calling setNamespaceURI() on all element and attribute nodes.

The more I think about this, the more appealing the idea of doing the same thing on Big Saxon. The fact is, most people who supply a DOM as input to XSLT do so out of ignorance; they have no idea of the extra costs they are incurring, or of the fact that supplying SAX input would be ten times faster. It's important to deliver good performance not only to experts who know how to get the ultimate performance out of the system, but also to the average user who just copies JAXP code from somewhere on the web without any great thought or understanding. For that kind of user, in-situ modification of the supplied DOM might be the right answer. We can always continue to provide wrapping or bulk-copying as alternative configuration options, for people to invoke if they don't like their data being trampled upon.

Returning to Saxon-CE, the other complication is the HTML DOM. Running XPath against an HTML DOM is already sufficiently quirky (e.g. because of case-insensitive element names, namespace non-support, etc etc) that it's probably wisest to keep things as simple as possible on this front. I don't think we need to apply xsl:strip-space directives to the HTML document; there's nothing in the spec that would make this a conformance issue, and there's no usability imperative. So we'll leave well alone here.