Navigating XML trees using Java Streams - Saxon diaries


For the next major Saxon release I am planning an extension to the s9api interface to exploit the facilities of Java 8 streams to allow powerful navigation of XDM trees: the idea is that navigation should be as easy as using XPath, but without the need to drop out of Java into a different programming language. To give a flavour, here is how you might select the elements within a document that have @class='hidden':

doc.select(descendant(isElement())
   .where(attribute("class").eq("hidden")))

We'll see how that works in due course.

Why do we need it?

The combination of Java and XML is as powerful and ubiquitous today as it has been for nearly twenty years. Java has moved on considerably (notably, as far as this article is concerned, with the Java 8 Streams API), and the world of XML processing has also made great strides (we now have XSLT 3.0, XPath 3.1, and XQuery 3.1), but for some reason the two have not moved together. The bulk of Java programmers manipulating XML, if we can judge from the questions they ask on forums such as StackOverflow, are still using DOM interfaces, perhaps with a bit of XPath 1.0 thrown in.

DOM shows its age. It was originally designed for HTML, with XML added as an afterthought, and XML namespaces thrown in as a subsequent bolt-on. Its data model predates the XML Infoset and the (XPath-2.0-defined) XDM model. It was designed as a cross-language API and so the designers deliberately eschewed the usual Java conventions and interfaces in areas such as the handling of collections and iterators, not to mention exceptions. It does everything its own way. As a navigational API it carries a lot of baggage because the underlying tree is assumed to be mutable. Many programmers only discover far too late that it's not even thread-safe (even when you confine yourself to retrieval-only operations).

There are better APIs than DOM available (for example JDOM2 and XOM) but they're all ten years old and haven't caught up with the times. There's nothing in the Java world that compares with Linq for C# users, or ElementTree in Python.

The alternative of calling out from Java to execute XPath or XQuery expressions has its own disadvantages. Any crossing of boundaries from one programming language to another involves data conversions and a loss of type safety. Embedding a sublanguage in the form of character strings within a host language (as with SQL and regular expressions) means that the host language compiler can't do any static syntax checking or type checking of the expressions in the sublanguage. Unless users go to some effort to avoid it, it's easy to find that the cost of compiling XPath expressions is incurred on each execution, rather than being incurred once and amortized. And the API for passing context from the host language to the sublanguage can be very messy. It doesn't have to be quite as messy as the JAXP interface used for invoking XPath from Java, but it still has to involve a fair bit of complexity.

Of course, there's the alternative of not using Java (or other general-purpose programming languages) at all: you can write the whole application in XSLT or XQuery. Given the capability that XSLT 3.0 and XQuery 3.1 have acquired, that's a real possibility far more often than most users realise. But it remains true that if only 10% of your application is concerned with processing XML input, and the rest is doing something more interesting, then writing the whole application in XQuery would probably be a poor choice.

Other programming languages have developed better APIs. JavaScript has jQuery, C# programmers have LINQ, Scala programmers have something very similar, and PHP users have SimpleXML. These APIs all have the characteristic that they are much more deeply integrated into the host language, and in particular they exploit the host language primitives for manipulation of sequences through functional programming constructs, with a reasonable level of type safety given that the actual structure of the XML document is not statically known.

That leads to the question of data binding interfaces: broadly, APIs that exploit static knowledge of the schema of the source document. Such APIs have their place, but I'm not going to consider them any further in this article. In my experience they can work well if the XML schema is very simple and very stable. If the schema is complex or changing, data binding can be a disaster.

The Java 8 Streams API

This is not the place for an extended tutorial on the new Streams API introduced in Java 8. If you haven't come across it, I suggest you find a good tutorial on the web and read it before you go any further.

But just to summarise:

  • Java 8 introduces a new interface, Stream<X>, representing a linear sequence of items of type X
  • Like iterators, streams are designed to be used once. Unlike iterators, they are manipulated using functional operations, most notably maps and filters, rather than being processed one item at a time. This makes for less error-prone programming, and allows parallel execution.
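
If you just want a flavour of the style before reading further, here is a typical pipeline in plain Java 8 (nothing Saxon-specific; the list people is invented for the example):

import java.util.List;
import java.util.stream.Collectors;

// people is assumed to be some List<String>
List<String> shortNames = people.stream()
    .filter(name -> name.length() <= 8)     // keep only the short names
    .map(String::toUpperCase)               // transform each surviving item
    .collect(Collectors.toList());          // terminal operation: gather into a List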

The functional nature of the Java 8 Streams API means it has much in common with the processing model of XPath. The basic thrust of the API design presented in this article is therefore to reproduce the primitives of the XPath processing model, re-expressing them in terms of the constructs provided by the Java 8 Streams API.

If the design appears to borrow concepts from other APIs such as LINQ and Scala and SimpleXML, that's not actually because I have a deep familiarity with those APIs: in fact, I have never used them in anger, and I haven't attempted to copy anything across literally. Rather, any similarity is because the functional concepts of XPath processing map so cleanly to this approach.

The Basics of the Saxon s9api API

The Saxon product primarily exists to enable XSLT, XQuery, XPath, and XML Schema processing. Some years ago I decided that the standard APIs (JAXP and XQJ) for invoking such functionality were becoming unfit for purpose. They had grown haphazardly over the years, the various APIs didn't work well together, and they weren't being updated to exploit the newer versions of the W3C specifications. Some appalling design mistakes had been unleashed on the world, and the strict backwards compatibility policy of the JDK meant these could never be corrected.

So I decided some years ago to introduce a proprietary alternative called s9api into the Saxon product (retaining JAXP support alongside), and it has been a considerable success, in that it has withstood the test of time rather well. The changes to XSLT transformation in 3.0 were sufficiently radical that I forked the XsltTransformer interface to create a 3.0 version, but apart from that it has been largely possible to add new features incrementally. That's partly because of a slightly less obsessive attitude to backwards compatibility: if I decide that something was a bad mistake, I'm prepared to change it.

Although s9api is primarily about invoking XSLT, XQuery, and XPath processing, it does include classes that represent objects in the XDM data model, and I will introduce these briefly because the new navigation API relies on these objects as its foundation. The table below lists the main classes.

Class Description
XdmValue Every value in the XDM model is a sequence of items. The XdmValue class is therefore the top of the class hierarchy. Because it's a sequence, it implements Iterable<XdmItem>, so you can use a Java foreach loop to process the items sequentially. In the latest version I have used Java generics to add a type parameter, so XdmValue<XdmNode> is a sequence of nodes, and XdmValue<XdmAtomicValue> is a sequence of atomic values. As well as an iterator() method, it has an itemAt() method to get the Nth item, and a size() method to count the items. Internally an XdmValue might exist as an actual sequence in memory, or as a "promise": sufficient data to enable the items to be materialized when they are needed.
XdmItem This class represents an Item in the XDM model. As such it is both a component of an XdmValue, and also an XdmValue (of length one) in its own right. It's an abstract class, because every item is actually something more specific (a node, an atomic value, a function). Some of the methods inherited from XdmValue become trivial (for example size() always returns 1).
XdmNode This is a subclass of XdmItem used to represent nodes. Unlike many models of XML, we don't subclass this for different kinds of node: that's mainly because XDM has deliberately aimed at uniformity, with the same accessors available for all node kinds. Many of the methods on XdmNode, such as getNodeName(), getStringValue(), getTypedValue(), and getNodeKind(), are directly equivalent to accessors defined in the W3C XDM specification. But in addition, XdmNode has a method axisIterator to navigate the tree using any of the XPath axes, the result being returned as an iterator over the selected nodes.
XdmAtomicValue Another subclass of XdmItem, this is used to represent atomic values in the XDM model. As with XdmNode, we don't define further subclasses for different atomic types. There are convenience methods to convert XdmAtomicValue instances to and from equivalent (or near-equivalent) Java classes such as String, Double, BigInteger, and Date.
XdmFunctionItem From XPath 3.0, functions are first-class values alongside nodes and atomic values. These are represented in s9api as instances of XdmFunctionItem. Two specific subclasses of function, with their own behaviours, are represented using the subclasses XdmMap and XdmArray. I won't be saying much about these in this article, because I'm primarily concerned with navigating XML trees.


The new API: Steps and Predicates

The basic concept behind the new extensions to the s9api API is navigation using steps and predicates. I'll introduce these concepts briefly in this section, and then go on to give a more detailed exposition.

The class XdmValue<T> acquires a new method:

XdmStream select(Step step)

The Step here is a function that takes an item of class T as its input, and returns a stream of items. If we consider a very simple Step, namely child(), this takes a node as input and returns a stream of nodes as its result. We can apply this step to an XdmValue consisting entirely of nodes, and it returns the concatenation of the streams of nodes obtained by applying the step to each node in the input value. This operation is equivalent to the "!" operator in XPath 3.0, or to the flatMap() method in many functional programming languages. It's not quite the same as the familiar "/" operator in XPath, because it doesn't eliminate duplicates or sort the result into document order. But for most purposes it does the same job.

There's a class net.sf.saxon.s9api.streams.Steps containing static methods which provide commonly-used steps such as child(). In my examples, I'll assume that the Java application has import static net.sf.saxon.s9api.streams.Steps.*; in its header, so it can use these fields and methods without further qualification.

One of the steps defined by this class is net.sf.saxon.s9api.streams.Steps.child(): this step is a function which, given a node, returns its children. There are other similar steps for the other XPath axes. So you can find the children of a node N by writing N.select(child()).

Any two steps S and T can be combined into a single composite step by writing S.then(T): for example Step grandchildren = child().then(child()) gives you a step which can be used in the expression N.select(grandchildren) to select all the grandchildren.

The class Step implements the standard Java interface Function, so it can be used more generally in any Java context where a Function is required.

Predicate<T> is a standard Java 8 interface: it defines a function that can be applied to an object of type T to return true or false. The class net.sf.saxon.s9api.streams.Predicates defines some standard predicates that are useful when processing XML. For example isElement() gives you a predicate that can be applied to any XdmItem to determine if it is an element node.

Given a Step A and a Predicate P, the expression A.where(P) returns a new Step that filters the results of A to include only those items that satisfy the predicate P. So, for example, child().where(isElement()) is a step that selects the element children of a node, so that N.select(child().where(isElement())) selects the element children of N. This is sufficiently common that we provide a shorthand: it can also be written N.select(child(isElement())).

The predicate hasLocalName("foo") matches nodes having a local name of "foo": so N.select(child().where(hasLocalName("foo"))) selects the relevant children. Again this is so common that we provide a shorthand: N.select(child("foo")). There is also a two-argument version child(ns, "foo") which selects children with a given namespace URI and local name.

Another useful predicate is exists(step) which tests whether the result of applying a given step returns at least one item. So, for example N.select(child().where(exists(attribute("id")))) returns those children of N that have an attribute named "id".

The result of the select() method is always a stream of items, so you can use methods from the Java Stream class such as filter() and flatMap() to process the result. Here are some of the standard things you can do with a stream of items in Java:

  • You can get the results as an array: N.select(child()).toArray()
  • Or as a list: N.select(child()).collect(Collectors.toList())
  • You can apply a function to each item in the stream: N.select(child()).forEach(System.err::println)
  • You can get the first item in the stream: N.select(child()).findFirst().get()

However, Saxon methods such as select() always return a subclass of Stream called XdmStream, and this offers additional methods. For example:

  • You can get the results as an XdmValue: N.select(child()).asXdmValue()
  • A more convenient way to get the results as a Java List: N.select(child()).asList()
  • If you know that the stream contains a single node (or nothing), you can get this using the methods asNode() or asOptionalNode()
  • Similarly, if you know that the stream contains a single atomic value (or nothing), you can get this using the methods asAtomic() or asOptionalAtomic()
  • You can get the last item in the stream: N.select(child("para")).last()
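
Putting these pieces together, the example from the start of this article might be completed as follows. This is a sketch rather than a recipe: it assumes doc holds an XdmNode for a document you have already built, and it relies only on the steps, predicates, and stream methods described above.

import java.util.List;
import net.sf.saxon.s9api.XdmNode;
import static net.sf.saxon.s9api.streams.Predicates.*;
import static net.sf.saxon.s9api.streams.Steps.*;

// All elements in the document whose class attribute equals "hidden", as a Java List
List<XdmNode> hidden = doc
    .select(descendant(isElement()).where(eq(attribute("class"), "hidden")))
    .asList();
hidden.forEach(System.out::println);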

More about Steps

The actual definition of the Step class is:

public abstract class Step<T extends XdmItem> implements Function<XdmItem, Stream<? extends T>>

What that means is that it's a function that takes any XdmItem as input, and delivers a stream of items of type T as its result (where T is XdmItem or some subclass of it). (I experimented with also parameterizing the class on the type of items accepted, but that didn't work out well.)

Because the types are defined, Java can make type inferences: for example it knows that N.select(child()) will return nodes (because child() is a step that returns nodes).

As a user of this API, you can define your own kinds of Step if you want to: but most of the time you will be able to do everything you need with the standard Steps available from the class net.sf.saxon.s9api.streams.Steps. The standard steps include:

  • The axis steps ancestor(), ancestorOrSelf(), attribute(), child(), descendant(), descendantOrSelf(), following(), followingSibling(), namespace(), parent(), preceding(), precedingSibling(), self().
  • For each axis, three filtered versions: for example child("foo") filters the axis to select elements by local name (ignoring the namespace if any); child(ns, local) filters the axis to select elements by namespace URI and local name, and child(predicate) filters the axis using an arbitrary predicate: this is a shorthand for child().where(predicate).
  • A composite step can be constructed using the method step1.then(step2). This applies step2 to every item in the result of step1, retaining the order of results and flattening them into a single stream.
  • A filtered step can be constructed using the method step1.where(predicate1). This selects those items in the result of step1 for which predicate1 returns true.
  • A path with several steps can be constructed using a call such as path(child(isElement()), attribute("id")). This returns a step whose effect is to return the id attributes of all the children of the target node.
  • If the steps are sufficiently simple, a path can also be written by means of a simple micro-syntax similar to XPath abbreviated steps. The previous example could also be written path("*", "@id"). Again, this returns a step that can be used like any other step. (In my own applications, I have found myself using this approach very extensively.)
  • The step atomize() extracts the typed values of nodes in the input, following the rules in the XPath specification. The result is a stream of atomic values
  • The step toString() likewise extracts the string values, while toNumber() has the same effect as the XPath number() function

Last but not least, xpath(path) returns a Step that evaluates an XPath expression. For example, doc.select(xpath("//foo")) has the same effect as doc.select(descendant("foo")). A second argument to the xpath() method may be used to supply a static context for the evaluation. Note that compilation of the XPath expression occurs while the step is being created, not while it is being evaluated; so if you bind the result of xpath("//foo") to a variable, then the expression can be evaluated repeatedly without recompilation.
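
For example (a sketch, assuming the static imports shown earlier; inputDocuments is just some Java collection of XdmNode instances you happen to have):

// Compile the XPath once, then reuse the resulting Step as often as you like
Step<? extends XdmItem> findFoo = xpath("//foo");   // compiled here, not on each evaluation
for (XdmNode doc : inputDocuments) {
    long fooCount = doc.select(findFoo).count();    // no recompilation per document
    System.out.println(doc.getBaseURI() + ": " + fooCount);
}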

More about Predicates

Predicate<T> is a standard Java 8 functional interface: it represents a function that takes an object as input and returns a boolean. You can use any predicates you like with this API, but the class net.sf.saxon.s9api.streams.Predicates provides some implementations of Predicate that are particularly useful when navigating XML documents. These include the following:

  • isElement(), isAttribute(), isText(), isComment(), isDocument(), isProcessingInstruction(), isNamespace() test that the item is a node of a particular kind
  • hasName("ns", "local"), hasLocalName("n"), and hasNamespaceUri("ns") make tests against the name of the node
  • hasType(t) tests the type of the item: for example hasType(ItemType.DATE) tests for atomic values of type xs:date
  • exists(step) tests whether the result of applying the given step is a sequence containing at least one item; conversely empty(step) tests whether the result of the step is empty. For example, exists(child()) is true for a node that has children.
  • some(step, predicate) tests whether at least one item selected by the step satisfies the given predicate. For example, some(child(), isElement()) tests whether the item is a node with at least one element child. Similarly every(step, predicate) tests whether the predicate is true for every item selected by the step.
  • eq(string) tests whether the string value of the item is equal to the given string; while eq(double) does a numeric comparison. A two-argument version eq(step, string) is shorthand for some(step, eq(string)). For example, descendant(eq(attribute("id"), "ABC")) finds all descendant elements having an "id" attribute equal to "ABC".
  • Java provides standard methods for combining predicates using and, or, and not. For example isElement().and(eq("foo")) is a predicate that tests whether an item is an element with string-value "foo".
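
For example, the following sketch (the element and attribute names are invented) combines two of the predicates described above to select child elements named option whose enabled attribute is "true" and which also carry an id attribute:

N.select(child("option")
        .where(eq(attribute("enabled"), "true")
                .and(exists(attribute("id")))))
 .forEach(System.out::println);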

The XdmStream class

The fact that all this machinery is built on Java 8 streams and functions is something that many users can safely ignore; they are essential foundations, but they are hidden below the surface. At the same time, a user who understands that steps and predicates are Java Functions, and that the result of the select() method is a Java Stream, can take advantage of this knowledge.

One of the key ideas that made this possible was the idea of subclassing Stream with XdmStream. This idea was shamelessly stolen from the open-source StreamEx library by Tagir Valeev (though no StreamEx code is actually used). Subclassing Stream enables additional methods to be provided to handle the results of the stream, avoiding the need for clumsy calls on the generic collect() method. Another motivating factor here is to allow for early exit (short-circuit evaluation) when a result can be delivered without reading the whole stream. Saxon handles this by registering onClose() handlers with the stream pipeline, so that when the consumer of the stream calls the XdmStream.close() method, the underlying supplier of data to the stream is notified that no more data is needed.

Examples

This section provides some examples extracted from an actual program that uses s9api interfaces and does a mixture of Java navigation and XPath and XQuery processing to extract data from an input document.

First, some very simple examples. Constructs like this are not uncommon:

XdmNode testInput = (XdmNode) xpath.evaluateSingle("test", testCase);

This can be replaced with the much simpler and more efficient:

XdmNode testInput = testCase.selectFirst(child("test"));

Similarly, the slightly more complex expression:

XdmNode principalPackage = (XdmNode) xpath.evaluateSingle("package[@role='principal']", testInput);

becomes:

XdmNode principalPackage = testInput.selectFirst(child("package").where(eq(attribute("role"), "principal")));

A more complex example from the same application is this one:

boolean definesError = xpath.evaluate("result//error[starts-with(@code, 'XTSE')]", testCase).size() > 0;

Note here how the processing is split between XPath code and Java code. This is also using an XPath function for which we haven't provided a built-in predicate in s9api. But that's no problem, because we can invoke Java methods as predicates. So this becomes:

boolean definesError = testCase.selectFirst(child("result"), descendant("error")
        .where(some(attribute("code"), (XdmNode n) -> n.getStringValue().startsWith("XTSE")))) != null;

Capturing Accumulators - Saxon diaries

A recent post on StackOverflow made me realise that streaming accumulators in XSLT 3.0 are much harder to use than they need to be.

A reminder about what accumulators do. The idea is that as you stream your way through a large document, you can have a number of tasks running in the background (called accumulators) which observe the document as it goes past, and accumulate information which is then available to the "main" line of processing in the foreground. For example, you might have an accumulator that simply keeps a note of the most recent section heading in a document; that's useful because the foreground processing can't simply navigate around the document to find the current section heading when it finds that it's needed.

Accumulator rules can fire either on start tags or end tags or both, or they can be associated with text nodes or attributes. But there's a severe limitation: a streaming accumulator must be motionless: that's XSLT 3.0 streaming jargon to say that it can only see what's on the parser's stack at the time the accumulator triggers. This affects both the pattern that controls when the accumulator is triggered, and the action that it can take when the rule fires.

For example, you can't fire a rule with the pattern match="section[title='introduction']" because navigation to child elements (title) is not allowed in a motionless pattern. Similarly, if the rule fires on  match="section", then you can't access the title in the rule action (select="title") because the action too must be motionless. In some cases a workaround is to have an accumulator that matches the text nodes (match="section/title/text()[.='introduction']") but that doesn't work if section titles can have mixed content.

It turns out there's a simple fix, which I call a capturing accumulator rule. A capturing accumulator rule is indicated by the extension attribute <xsl:accumulator-rule saxon:capture="yes" phase="end">, which will always be a rule that fires on an end-element tag. For a capturing rule, the background process listens to all the parser events that occur between the start tag and the end tag, and uses these to build a snapshot copy of the node. A snapshot copy is like the result of the fn:snapshot function - it's a deep copy of the matched node, with ancestor elements and their attributes tagged on for good measure. This snapshot copy is then available to the action part of the rule processing the end tag. The match patterns that trigger the accumulator rule still need to be motionless, but the action part now has access to a complete copy of the element (plus its ancestor elements and their attributes).

Here's an example. Suppose you've got a large document like the XSLT specification, and you want to produce a sorted glossary at the end, and you want to do it all in streamed mode. Scattered throughout the document are term definitions like this:

<termdef id="dt-stylesheet" term="stylesheet">A  <term>stylesheet</term> consists of one or more packages: specifically, one
    <termref def="dt-top-level-package">top-level package</termref> and zero or
   more <termref def="dt-library-package">library packages</termref>.</termdef>



Now we can write an accumulator which simply accumulates these term definitions as they are encountered:

<xsl:accumulator name="terms" streamable="yes">
   
<xsl:accumulator-rule match="termdef" phase="end" select="($value, .)" saxon:capture="yes"/>
</xsl:accumulator>


(the select expression here takes the existing value of the accumulator, $value, and appends the snapshot of the current termdef element, which is available as the context item ".")

And now, at the end of the processing, we can output the glossary like this:

<xsl:template match="/" mode="streamable-mode">
    <html> 
        <!-- main foreground processing goes here -->
        <xsl:apply-templates mode="#current"/>
        <!-- now output the glossary -->
        <div id="glossary" class="glossary">
            <xsl:apply-templates select="accumulator-after('terms')" mode="glossary">
                <xsl:sort select="@term" lang="en"/>
            </xsl:apply-templates>
        </div>
    </html>
</xsl:template>



The value of the accumulator is a list of snapshots of termdef elements, and because these are snapshots, the processing at this point does not need to be streamable (snapshots are ordinary trees held in memory).

The amount of memory needed to accomplish this is whatever is needed to hold the glossary entries. This follows the design principle behind XSLT 3.0 streaming, which was not to do just those things that required zero working memory, but to enable the programmer to do things that weren't purely streamable, while having control over the amount of memory needed.

I think it's hard to find an easy way to tackle this particular problem without the new feature of capturing accumulator rules, so I hope it will prove a useful extension.

I've implemented this for Saxon 9.9. Interestingly, it only took about 25 lines of code: half a dozen to enable the new extension attribute, half a dozen to allow it to be exported to SEF files and re-imported, two or three to change the streamability analysis, and a few more to invoke the existing streaming implementation of the snapshot function from the accumulator watch code. Testing and documenting the feature was a lot more work than implementing it.

Here's a complete stylesheet that fleshes out the creation of a (skeletal) glossary:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:package
  name="http://www.w3.org/xslt30-test/accumulator/capture-203"
  package-version="1.0"
  declared-modes="no"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:f="http://accum001/"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="#all" version="3.0">

  <!-- Stylesheet to produce a glossary using capturing accumulators -->

  <!-- The source document is a W3C specification in xmlspec format, containing
       term definitions in the form <termdef term="banana">A soft <termref def="fruit"/></termdef> -->

  <!-- This test case shows the essential principles of how to render such a document
       in streaming mode, with an alphabetical glossary of defined terms at the end -->

  <xsl:param name="streamable" static="yes" select="'yes'"/>

  <xsl:accumulator name="glossary" as="element(termdef)*" initial-value="()" streamable="yes">
    <xsl:accumulator-rule match="termdef" phase="end" saxon:capture="yes" select="($value, .)"/>
  </xsl:accumulator>

  <xsl:mode streamable="yes" on-no-match="shallow-skip" use-accumulators="glossary"/>

  <xsl:template name="main">
    <xsl:source-document href="xslt.xml" streamable="yes" use-accumulators="glossary">
      <xsl:apply-templates select="."/>
    </xsl:source-document>
  </xsl:template>

  <xsl:template match="/">
    <out>
      <!-- First render the body of the document -->
      <xsl:apply-templates/>
      <!-- Now generate the glossary -->
      <table>
        <tbody>
          <xsl:apply-templates select="accumulator-after('glossary')" mode="glossary">
            <xsl:sort select="@term" lang="en"/>
          </xsl:apply-templates>
        </tbody>
      </table>
    </out>
  </xsl:template>

  <xsl:template match="div1|inform-div1">
    <div id="{@id}">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Main document processing: just output the headings -->

  <xsl:template match="div1/head | inform-div1/head">
    <xsl:attribute name="title" select="."/>
  </xsl:template>

  <!-- Glossary processing -->

  <xsl:mode name="glossary" streamable="no"/>

  <xsl:template match="termdef" mode="glossary">
    <tr>
      <td>
        <xsl:value-of select="@term"/>
      </td>
      <td>
        <xsl:value-of select="."/>
      </td>
    </tr>
  </xsl:template>

</xsl:package>

Diagnostics on Type Errors - Saxon diaries

Providing good diagnostics for programming errors has always been a high priority in Saxon, second only to conformance with the W3C specifications. One important area of diagnostics is reporting on type errors: that is, cases where a particular context requires a value of a given type, and the supplied value is the wrong type. A classic example would be providing a string as the first argument to format-date(), which requires an xs:date to be supplied.

Of course, the more programmers follow the discipline of declaring the expected types of function parameters and variables, the more helpful the compiler can be in diagnosing programming errors caused by supplying the wrong type of value.

Type errors can be detected statically or dynamically. Saxon uses "optimistic type checking". 

At compile time, if a value of type R is required in a particular context, and the expression appearing in that context is E, then the compiler attempts to infer the static type of expression E: call this S. Sometimes this is straightforward: for example, if E is a call on the node-name() function, then it knows that S is xs:QName. In other cases the compiler has to be smarter: for example, it knows that the static type of a call on remove() is the same as the static type of the first argument, with an adjustment to the occurrence indicator.

Optimistic type checking reports an error at compile time only if there is nothing in common between the required type R and the inferred static type of E: that is, if there is no overlap between the set of instances of the two types. That would mean that a run-time failure is inevitable (assuming the code actually gets executed), and the W3C specifications allow early reporting of such an error.

There's another interesting case where the types overlap only to the extent that both allow an empty sequence: for example if the required type is (xs:string*) and the supplied type is (xs:integer*). That's almost certainly an error, but W3C doesn't allow an error to be reported here because there is a faint chance that execution could succeed. So Saxon reports this as a warning. With maps and arrays, incidentally, there are analogous situations where the only overlap is an empty map or array, but Saxon isn't yet handling that case specially.

If the types aren't completely disjoint, there are two other possibilities: the required type R might subsume the supplied type S, meaning that no run-time type checking is needed because the call will always succeed. The other possibility is that the types overlap: evaluating the supplied expression E might or might not produce a value that matches the required type R. In this case Saxon generates code to perform run-time type checking. (This is one reason why declaring the types of parameters and variables is such good practice: the code runs faster because there is no unnecessary run-time checking.)

Until recently, the error message for a type error took the form:

Required item type of CCC is RRR; supplied value has item type SSS

For example:

Required item type of first argument to format-date() is xs:date; supplied value has item type xs:string

which works pretty well in most cases. However, I'm finding that as I write more complex code involving maps and arrays, it's no longer good enough. The problem is that as the types become more complex, simply giving the required and actual types isn't enough to make it clear why they are incompatible. You end up with messages like this one:

Required item type of first argument of local:x() is map(xs:integer, xs:date); supplied value has item type map(xs:anyAtomicType, xs:date).

where an expert user can probably work out that the problem is that the supplied map contains an entry whose key is not an integer; but it doesn't exactly point clearly to the source of the problem.

The problem comes to a head particularly when tuple types are used (see here). If the required type is a tuple type, reporting the supplied type as a map type is particularly unhelpful.

I'm therefore changing the approach: instead of reporting on the supplied type of the value (or the inferred type of the expression, in the case of static errors), I'm reporting an explanation of why it doesn't match. Here's the new version of the message:

The required item type of the first argument of local:x() is map(xs:integer, xs:date); the supplied value map{xs:date("2018-03-16Z"):5, "x":3} does not match. The map contains a key (xs:date("2018-03-16Z")) of type xs:date that is not an instance of the required type xs:integer.

So firstly, I'm outputting the actual value, or an abbreviated form of it, rather than just its type (that only works, of course, for run-time errors). And secondly, I'm highlighting how the type-checker worked out that the value doesn't match the required type: it's saying explicitly which rule was broken.

(Another minor change you can see here is that I'm making more effort to write complete English sentences.)

This doesn't just benefit the new map and array types, you can also see the effect with node types. For example, if the required type is document-node(element(foo)), you might see the message:

The required item type of the first argument of local:x() is document-node(element(Q{}foo)); the supplied value doc() does not match. The supplied document node has an element child (<bar>) that does not satisfy the element test. The node has the wrong name.

Another change I'm making is to distribute type-checking into a sequence constructor. At present, if a function is defined to return (say) a list of element nodes, and the function body contains a sequence of a dozen instructions, one of which returns a text node, you get a message saying that the type of the function result is wrong, but it doesn't pinpoint exactly why. By distributing the type checking (applying the principle that if the function must return element nodes, then each of the instructions must return element nodes) we can (a) identify the instruction in error much more precisely, and (b) avoid the run-time cost of checking the results of those instructions that we know statically are OK.

Interestingly, all these changes were stimulated by my own recent experience in writing a complex stylesheet. I described the plans for this here and the coding has now been completed (I'll report on the outcome later). It's a classic case of dogfood: if you use your own products in anger, you find ways of improving them that you wouldn't have thought of otherwise, and that users wouldn't have suggested because they don't know what's possible.

Could one write an XSD 1.1 Schema Processor in XSLT? - Saxon diaries

Many computing platforms are not well-served by up-to-date XML technology, and in consequence Saxonica has been slowly increasing its coverage of the major platforms: extending from Java to .NET, C++, PHP, and JavaScript, using a variety of technical approaches. This makes it desirable to implement as much as possible using portable languages, and if we want to minimize our dependence on third-party technologies (IKVMC, for example, is now effectively unsupported) we should be writing in our own languages, notably XSLT.


This note therefore asks the question, could one write an XSD Schema 1.1 processor in XSLT?


In fact a schema processor has two parts: compile time (compiling schema documents into the schema component model, or SCM) and run-time (validating an instance document using the SCM).


The first part, compiling, seems to pose no intrinsic difficulty. Some of the rules and constraints that need to be enforced are fairly convoluted, but the only really tricky part is compiling grammars into finite-state-machines, and checking grammars (or the resulting finite-state-machine) for conformance with rules such as the Unique Particle Attribution constraint. But since we already have a tool (written in Java) for compiling schemas into an XML-based SCM file, and since it wouldn't really inconvenience users too much for this tool to be invoked via an HTTP interface, the priority for a portable implementation is really the run-time part of the processor rather than the compile-time part. (Note that this means ignoring xsi:schemaLocation, since that effectively causes the run-time validator to invoke the schema compiler.)


There are two ways one could envisage implementing the run-time part in XSLT: either with a universal stylesheet that takes the SCM and the instance document as inputs, or by generating a custom XSLT stylesheet from the SCM, rather as is done with Schematron. For the moment I'll keep an open mind which of these two approaches is preferable.


Ideally, the XSLT stylesheet would use streaming so the instance document being validated does not need to fit in memory. We'll bear this requirement in mind as we look at the detail.


The XSLT code, of course, cannot rely on any services from a schema processor, so it cannot be schema-aware.


Let's look at the main jobs the validator has to do.


Validating strings against simple types


Validating against a primitive type can be done simply using the XPath castable operator.


Validating against a simple type derived by restriction involves checking the various facets. For the most part, the logic of each facet is easily expressed in XPath. There are a few exceptions:


  • Patterns (regular expressions). The XPath regular expression syntax is a superset of the XSD syntax. To evaluate XSD regular expressions, we either need some kind of extension to the XPath matches() function, or we need to translate XSD regular expressions into XPath regular expressions. This translation is probably not too difficult. It mainly involves rejecting some disallowed constructs (such as back-references, non-capturing groups, and reluctant quantifiers), and escaping "^" and "$" with a backslash.

  • Length facets for hexBinary and base64Binary. A base64Binary value can be cast to hexBinary, and the length of the value in octets can be computed by converting to string and dividing the string length by 2.


Validating against a list type can be achieved by tokenizing, and testing each token against the item type.


Validating against a union type can be achieved by validating against each member type (and also validating against any constraining facets defined at the level of the union itself).
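
To make this concrete, here is a sketch of what such checks reduce to in XPath; the variable $value and the chosen types are purely illustrative:

<!-- Primitive type: is the string a valid xs:decimal? -->
<xsl:if test="not($value castable as xs:decimal)">
  <!-- report the invalidity -->
</xsl:if>

<!-- List type with item type xs:integer: every whitespace-separated token must be valid -->
<xsl:if test="not(every $t in tokenize($value) satisfies $t castable as xs:integer)">
  <!-- report the invalidity -->
</xsl:if>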


Validating elements against complex types


The only difficult case here is complex content. It should be possible to achieve this by iterating over the child nodes using xsl:iterate, keeping the current state (in the FSM) as the value of the iteration parameter. On completion the element is valid if the state is a final state. As each element is processed, it needs to be checked against the state of its parent element's FSM, and in addition a new validator is established for validating its children. This is all streamable.
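
In outline, the iteration might look like the sketch below. Here f:transition() (returning the next state, or an empty sequence if the child is not permitted in the current state), f:report-invalid(), $initial-state, and $final-states are all hypothetical pieces derived from the compiled finite-state machine in the SCM:

<xsl:iterate select="*">
  <xsl:param name="state" select="$initial-state"/>
  <xsl:on-completion>
    <xsl:if test="not($state = $final-states)">
      <xsl:sequence select="f:report-invalid('content of the element is incomplete')"/>
    </xsl:if>
  </xsl:on-completion>
  <xsl:variable name="next" select="f:transition($state, node-name(.))"/>
  <xsl:if test="empty($next)">
    <xsl:sequence select="f:report-invalid('element ' || name() || ' not allowed here')"/>
  </xsl:if>
  <xsl:next-iteration>
    <xsl:with-param name="state" select="$next"/>
  </xsl:next-iteration>
</xsl:iterate>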


Assertions and Conditional Type Assignment


Evaluating XPath expressions can be achieved using xsl:evaluate. The main difficulty is setting up the node-tree to which xsl:evaluate is applied. This needs to be a copy of the original source subtree, to ensure that the assertion cannot stray outside the relevant subtree. Making this copy consumes the source subtree, which makes streaming tricky: however, the ordinary complex type validation can also happen on the copy, so I think streaming is possible.
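
A sketch of the idea, where $assertion is assumed to hold the test expression of an xs:assert as a string:

<!-- Copy the element being validated, so the assertion cannot stray outside the subtree -->
<xsl:variable name="copy" as="element()">
  <xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="ok" as="xs:boolean">
  <xsl:evaluate xpath="$assertion" context-item="$copy" as="xs:boolean"/>
</xsl:variable>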


Identity constraints (unique, key, keyref)


This is where streaming really gets quite tricky - especially given the complexity of the specification for those rare keyref cases where the key is defined on a different element from the corresponding keyref.


The obvious XSLT mechanism here is accumulators. But accumulator rules are triggered by patterns, and defining the patterns that correspond to the elements involved in a key definition is tricky. For example if sections nest recursively, a uniqueness constraint might say that for every section, its child section elements must have unique @section-number attributes. A corresponding accumulator would have to maintain a stack of sections, with a map of section numbers at each level of the stack, and the accumulator rule for a section would need to check the section number of that section at the current level, and start a new level.


A further complication is that there may be multiple (global and/or local) element declarations with the same name, with different unique / key / keyref constraints. Deciding which of these apply by means of XSLT pattern matching is certainly difficult and may be impossible.


The multiple xs:field elements within a constraint do not have to match components of the key in document order, but a streamed implementation would still be possible using the map constructor, which allows multiple downward selections - provided that the xs:field selector expressions are themselves streamable, which I think is probably always the case.


The problem of streamability could possibly be solved with some kind of dynamic pipelining. The "main" validation process, when it encounters a start tag, is able to establish which element declaration it belongs to, and could in principle spawn another transformation (processing the same input stream) for each key / unique constraint defined in that element declaration: a kind of dynamic xsl:fork.


I think as a first cut it would probably be wise not to attempt streaming in the case of a schema that uses unique / key / keyref constraints. More specifically, if any element has such constraints, it can be deep-copied, and validation can then switch to the in-memory subtree rather than the original stream. After all, we have no immediate plans to implement streaming other than in the Java product, and that will inevitably make an XSLT-based schema processor on other platforms unstreamed anyway.


Outcome of validation


There are two main scenarios we should support: validity checking, and type annotation. With validity checking we want to report many invalidities in a single validation episode, and the main output is the validation report. With type annotation, the main output is a validated version of the instance document, and a single invalidity can cause the process to terminate with a dynamic error.


It is not possible for a non-schema-aware stylesheet to add type annotations to the result tree without some kind of extensions. The XSLT language only allows type annotations to be created as the result of schema validation. So we will need an extension for this purpose: perhaps a saxon:type-annotation="QName" attribute on instructions such as xsl:element, xsl:copy, xsl:attribute.


For reporting validation errors, it's important to report the location of the invalidity. This also requires extensions, such as saxon:line-number().


Conclusion


I don't think there are any serious obstacles to writing a validation engine in XSLT. Making it streamable is harder, especially for identity constraints. A couple of extensions are needed: the ability to add type annotations to the result tree, and the ability to get line numbers of nodes in the source.


I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.

Transforming JSON - Saxon diaries


In my conference paper at XML Prague in 2016 I examined a couple of use cases for transforming JSON structures using XSLT 3.0. The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to convert the JSON to XML, transform the XML, and then convert it back to JSON.

Unfortunately this study came too late to get any new features into XSLT 3.0. However, I've been taking another look at the use cases to see whether we could design language extensions to handle them, and this is looking quite encouraging.

Use case 1: bulk update

We start with the JSON document

[ { 
  "id": 3, "name": "A blue mouse", "price": 25.50, 
  "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0}, 
  "warehouseLocation": {"latitude": 54.4, "longitude": -32.7 }}, 
  { 
  "id": 2, "name": "An ice sculpture", "price": 12.50, 
  "tags": ["cold", "ice"], 
  "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5 }, 
  "warehouseLocation": {"latitude": -78.75, "longitude": 20.4 }
} ]

and the requirement: for all products having the tag "ice", increase the price by 10%, leaving all other data unchanged. I've prototyped a new XSLT instruction that allows this to be done as follows:

<saxon:deep-update
   root="json-doc('input.json')
   select=" ?*[?tags?* = 'ice']"
   action="map:put(., 'price', ?price * 1.1)"/>

How does this work?

First the instruction evaluates the root expression, which in this case returns the map/array representation of the input JSON document. With this root item as context item, it then evaluates the select expression to obtain a sequence of contained maps or arrays to be updated: these can appear at any depth under the root item. With each of these selected maps or arrays as the context item, it then evaluates the action expression, and uses the returned value as a replacement for the selected map or array. This update then percolates back up to the root item, and the result of the instruction is a map or array that is the same as the original except for the replacement of the selected items.

The magic here is in the way that the update is percolated back up to the root. Because maps and arrays are immutable and have no persistent identity, the only way to do this is to keep track of the maps and arrays selected en-route from the root item to the items selected for modification as we do the downward selection, and then modify these maps and arrays in reverse order on the way back up. Moreover we need to keep track of the cases where multiple updates are made to the same containing map or array. All this magic, however, is largely hidden from the user. The only thing the user needs to be aware of is that the select expression is constrained to use a limited set of constructs when making downward selections.

The select expression select="?*[?tags?* = 'ice']" perhaps needs a little bit of explanation. The root of the JSON tree is an array of maps, and the initial ?* turns this into a sequence of maps. We then want to filter this sequence of maps to include only those where the value of the "tags" field is an array containing the string "ice" as one of its members. The easiest way to test this predicate is to convert the value from an array of strings to a sequence of strings (so ?tags?*) and then use the XPath existential "=" operator to compare with the string "ice".

The action expression map:put(., 'price', ?price * 1.1) takes as input the selected map, and replaces it with a map in which the price entry is replaced with a new entry having the key "price" and the associated value computed as the old price multiplied by 1.1.

Use case 2: Hierarchic Inversion

The second use case in the XML Prague 2016 paper was a hierarchic inversion (aka grouping) problem. Specifically: we'll look at a structural transformation changing a JSON structure with information about the students enrolled for each course to its inverse, a structure with information about the courses for which each student is enrolled.

Here is the input dataset:

[{ "faculty": "humanities", 
   "courses": [ 
    { "course": "English", 
      "students": [ 
       { "first": "Mary", "last": "Smith", "email": "mary_smith@gmail.com"}, 
       { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com"}
      ]
    },
    { "course": "History", 
      "students": [ 
        { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com" }, 
        { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"} 
      ] 
    } ] 
 }, { 
  "faculty": "science", 
  "courses": [ 
  { "course": "Physics", 
    "students": [ 
     { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}, 
     { "first": "Amisha", "last": "Patel", "email": "amisha_patel@gmail.com"}]
  }, 
 { "course": "Chemistry", 
    "students": [ 
     { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"}, 
     { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}
   ]
 } ]
}]

The goal is to produce a list of students, sorted by last name then first name, each containing a list of courses taken by that student, like this:

[
  { "email": "anil_singh@gmail.com", 
    "courses": ["Physics", "Chemistry" ]},
  { "email": "john_taylor@gmail.com", 
    "courses": ["History", "Chemistry" ]},
  .... 
]

The classic way of handling this is in two phases: first reduce the hierarchic input to a flat sequence in which all the required information is contained at one level, and then apply grouping to this flat sequence.

To achieve the flattening we introduce another new XSLT instruction:

<saxon:tabulate-maps
    root="json-doc('input.json')"
    select="?* ! map:find(., 'students)?*"/>

Again the root expression delivers a representation of the JSON document as an array of maps. The select expression first selects these maps ("?*"), then for each one it calls map:find() to get an array of maps each representing a student. The result of the instruction is a sequence of maps corresponding to these student maps in the input, where each output map contains not only the fields present in the input (first, last, email), but also fields inherited from parents and ancestors (faculty, course). For good measure it also contains a field _keys containing an array of keys representing the path from root to leaf, but we don't actually use that in this example.
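
For example, the first student in the input gives rise to a map along these lines (shown here in XPath map-constructor syntax; the content of the _keys array is elided):

map {
  "faculty" : "humanities",
  "course"  : "English",
  "first"   : "Mary",
  "last"    : "Smith",
  "email"   : "mary_smith@gmail.com",
  "_keys"   : [ (: the keys on the path from root to this leaf :) ]
}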

Once we have this flat structure, we can construct a new hierarchy using XSLT grouping:

<xsl:for-each-group select="$students" group-by="?email">
   <xsl:map>
     <xsl:map-entry key="'email'" select="?email"/>
     <xsl:map-entry key="'first'" select="?first"/>
     <xsl:map-entry key="'last'" select="?last"/>
     <xsl:map-entry key="'courses'">
        <saxon:array>
           <xsl:for-each select="current-group()">
               <saxon:array-member select="?course"/>
           </xsl:for-each>
       </saxon:array>
    </xsl:map-entry>
  </xsl:map>
</xsl:for-each-group>

This can then be serialized using the JSON output method to produce the required output.

Note: the saxon:array and saxon:array-member instructions already exist in Saxon 9.8. They fill an obvious gap in the XSLT 3.0 facilities for handling arrays - a gap that exists largely because the XSL WG was unwilling to create a dependency on XPath 3.1.

Use Case 3: conversion to HTML

This use case isn't in the XML Prague paper, but is included here for completeness.

The aim here is to construct an HTML page containing the information from a JSON document, without significant structural alteration. This is a classic use case for the recursive application of template rules, so the aim is to make it easy to traverse the JSON structure using templates with appropriate match patterns.

Unfortunately, although the XSLT 3.0 facilities allow patterns that match maps and arrays, they are cumbersome to use. Firstly, the syntax is awkward:

match=".[. instance of map(...)]"

We can solve this with a Saxon extension allowing the syntax

match="map()"

Secondly, the type of a map isn't enough to distinguish one map from another. To identify a map representing a student, for example, we aren't really interested in knowing that it is a map(xs:string, item()*). What we need to know is that it has fields (email, first, last). Fortunately another Saxon extension comes to our aid: tuple types, described here: http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html With tuple types we can change the match pattern to

match="tuple(email, first, last)"

Even better, we can use type aliases:

<saxon:type-alias name="student" as="tuple(email, first, last)"/>
<xsl:template match="~student">...</xsl:template>

With this extension we can now render this input JSON into HTML using the stylesheet:

<?xml version="1.0" encoding="utf-8"?> 

<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:saxon="http://saxon.sf.net/"
   exclude-result-prefixes="#all"
   expand-text="yes">

  <saxon:type-alias name="faculty" type="tuple(faculty, courses)"/>
  <saxon:type-alias name="course" type="tuple(course, students)"/>
  <saxon:type-alias name="student" type="tuple(first, last, email)"/>

  <xsl:template match="~faculty">
    <h1>{?faculty} Faculty</h1>
    <xsl:apply-templates select="?courses?*"/>
  </xsl:template>

  <xsl:template match="~course">
    <h2>{?course} Course</h2>
    <p>List of students:</p>
    <table>
      <thead>
        <tr>
          <th>Name</th>
          <th>Email</th>
        </tr>
      </thead>
      <tbody>
        <xsl:apply-templates select="?students?*">
          <xsl:sort select="?last"/>
          <xsl:sort select="?first"/>
        </xsl:apply-templates>
      </tbody>
    </table>
  </xsl:template>

  <xsl:template match="~student">
    <tr>
      <td>{?first} {?last}</td>
      <td>{?email}</td>
    </tr>
  </xsl:template>

  <xsl:template name="xsl:initial-template">
    <xsl:apply-templates select="json-doc('courses.json')"/>
  </xsl:template>

</xsl:stylesheet>

Conclusions

With only the facilities of the published XSLT 3.0 recommendation, the easiest way to transform JSON is often to convert it first to XML node trees, and then use the traditional XSLT techniques to transform the XML, before converting it back to JSON.

With a few judiciously chosen extensions to the language, however, a wide range of JSON transformations can be achieved natively.

Bugs: How well are we doing? - Saxon diaries

We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and the pace is relentless. But are we getting it right, or are we getting it badly wrong?

There are frequently-quoted but poorly-sourced numbers you can find on the internet suggesting a norm of 10-25 bugs per thousand lines of code. Saxon is 300,000 lines of (non-comment) code, so that would suggest we can expect a release to have 3000 to 7500 bugs in it. On one measure, that suggests we're doing a lot better than the norm. Or it could also mean that most of the bugs haven't been found yet.

I'm very sceptical of such numbers. I remember a mature product in ICL that was maintained by a sole part-time worker, handling half a dozen bugs a month. When she went on maternity leave, the flow of bugs magically stopped. No-one else could answer the questions, so users stopped sending them in. The same happens with Oracle and Microsoft. I submitted a Java bug once, and got a response 6 years later saying it was being closed with no action. When that happens, you stop sending in bug reports. So in many ways, a high number of bug reports doesn't mean you have a buggy product, it means you have a responsive process for responding to them. I would hate the number of bug reports we get to drop because people don't think there's any point in submitting them.

And of course the definition of what is a bug is completely slippery. Very few of the bug reports we get are completely without merit, in the sense that the product is doing exactly what it says on the tin; at the same time, rather few are incontrovertible bugs either. If diagnostics are unhelpful, is that a bug?

The only important test really is whether our users are satisfied with the reliability of the product. We don't really get enough feedback on that at a high level. Perhaps we should make more effort to find out; but I so intensely hate completing customer satisfaction questionnaires myself that I'm very reluctant to inflict it on our users. Given that open source users outnumber commercial users by probably ten-to-one, and that the satisfaction of our open source users is just as important to us as the satisfaction of our commercial customers (because it's satisfied open source users who do all the sales work for us); and given that we don't actually have any way of "reaching out" to our open source users (how I hate the marketing jargon); and given that we really wouldn't know what to do differently if we discovered that 60% of our users were "satisfied or very satisfied": I don't really see very much value in the exercise. But I guess putting a survey form on the web site wouldn't be difficult, and some people might interpret it as a signal that we actually care.

With 9.7 there was a bit of a shift in policy towards fixing bugs pro-actively (more marketing speak). In particular, we've been in a phase where the XSLT and XQuery specs were becoming very stable but more test cases were becoming available all the time (many of them, I might add, contributed by Saxonica - often in reaction to queries from our users). So we've continuously been applying new tests to the existing release, which is probably a first. Where a test showed that we were handling edge cases incorrectly, and indeed when the spec was changed in little ways under our feet, we've raised bugs and fixes to keep the conformance level as high as possible (while also maintaining compatibility). So we've shifted the boundary a little between feature changes (which traditionally only come in the next release), and bug fixes, which come in a maintenance release. That shift also helps to explain why the gap between releases is becoming longer - though the biggest factor holding us back, I think, is the ever-increasing amount of testing that we do before a release.

Fixing bugs pro-actively (that is, before any user has hit the bug) has the potential to improve user satisfaction if it means that they never do hit the bug. It's worth remembering, too, that for every user who reports a bug there may be a dozen users who hit it and don't report it. One reason we monitor StackOverflow is that a lot of users feel more confident about reporting a problem there, rather than reporting it directly to us. Users know that their knowledge is limited and they don't want to make fools of themselves, and you need a high level of confidence to tell your software vendor that you think the product is wrong.

On the other hand, destabilisation is a risk. A fix in one place will often expose a bug somewhere else, or re-awaken an old bug that had been laid to rest. As a release becomes more mature, we try to balance the benefits of fixing problems with the risk of de-stabilisation.

So, what about testing? Can we say that because we've fixed 450 bugs, we didn't run enough tests in the first place?

Yes, in a sense that's true, but how many more tests would we have had to write in order to catch them? We probably run about a million test cases (say, 100K tests in an average of ten product configurations each), and these days the last couple of months before a major release are devoted exclusively to testing. (I know that means we don't do enough continuous testing. But sorry, it doesn't work for me. If we're doing something radical to the internals of the product then things are going to break in the process, and my style is to get the new design working while it's still fresh in my head, then pick up the broken pieces later. If everything had to work in every nightly build, we would never get the radical things done. That's a personal take, and of course what works with a 3-4 person team doesn't necessarily work with a larger project. We're probably pretty unusual in developing a 300Kloc software package with 3-4 people, so lots of our experience might not extrapolate.)

We've had a significant number of bug reports this time on performance regression. (This is of course another area where it's arguable whether it's a bug or not. Sometimes we will change the design in a way that we know benefits some workloads at the expense of others.) Probably most of these are extreme scenarios, for example compilation time for stylesheets where a single template declares 500 local variables. Should we have run tests to prevent that? Well, perhaps we should have more extreme cases in our test suite: the vast majority of our test cases are trivially small. But the problem is, there will always be users who do things that we would never have imagined. Like the user running an XSD 1.1 schema validation in which tens of thousands of assertions are expected to "fail", because they've written it in such a way that assertion failures aren't really errors; they are just a source of statistics for reporting on the data.

The bugs we hate most (and therefore should do most to prevent) are bugs in bytecode generation, streaming, and multi-threading. The reason we hate them is that they can be a pig to debug, especially when the user-written application is large and complex.

  • For bytecode generation I think we've actually got pretty good test coverage, because we not only run every test in the QT3 and XSLT3 test suites with bytecode generation enabled, we also artificially complicate the tests to stop queries like 2+5 being evaluated by the compiler before bytecode generation kicks in. We've also got an internal recovery mechanism, so that if we detect we've generated bad code, we fall back to interpreted mode and the user never notices (the problem with that, of course, being that we never find out).
  • Streaming is tricky because the code is so convoluted (writing everything as inverted event-based code can be mind-blowing) and because the effects of getting it wrong often give very little clue as to the cause. But at least the failure is "in your face" for the user, who will therefore report the problem, and it's likely to be reproducible. Another difficulty with streaming is that because not all code is streamable, tests for streaming needed to be written from scratch.
  • Multi-threading bugs are horrible because they occur unpredictably. If there's a low probability of the problem happening then it can require a great deal of detective work to isolate the circumstances, and this often falls on the user rather than on ourselves. Fortunately we only get a couple of these a year, but they are a nightmare when they come. In 9.7 we changed our Java baseline to Java 6 and were therefore able to replace much of the hand-built multithreading code in Saxon with standard Java libraries, which I think has helped reliability a lot. But there are essentially no tools or techniques to protect you from making simple thread-safety blunders, like setting a property in a shared object without synchronization. Could we do more testing to prevent these bugs? I'm not optimistic, because the bugs we get are so few, and so particular to a specific workload, that searching the haystack just in case it contains a needle is unlikely to be effective.
Summary: Having the product perceived as reliable by our users is more important to us than the actual bug count. Fixing bugs quickly before they affect more users is probably the best way of achieving that. If the bug count is high because we're raising bugs ourselves as a result of our own testing, then that's no bad thing. It hasn't yet got to the level where we can't cope with the volumes, or where we have to filter things through staff who are only employed to do support. If we can do things better, let us know.


Guaranteed Streamability - Saxon diaries

| No Comments | No TrackBacks
The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable.

If the code is guaranteed streamable then every processor (if it claims to support streaming at all) must use streaming to evaluate the stylesheet; if it is not guaranteed streamable then the processor can choose whether to use streaming or not.

The tricky bit is that there's a requirement in the spec that if the code isn't guaranteed streamable, then a streaming processor (on request) has to detect this and report it. The status section of the spec says that this requirement is "at risk", meaning it might be removed if it proves too difficult to implement. There are people on the working group who believe passionately that this requirement is really important for interoperability; there are others (including me) who fully understand why users would like to have this, but have been arguing that it is extremely difficult to deliver.

In this article I'm going to try to explain why it's so difficult to achieve this requirement, and to explore possibilities for overcoming these difficulties.

Streamability analysis can't be performed until various other stages of static analysis are complete. It generally requires that names have been resolved (for example, names of modes and names of streamable functions). It also relies on rudimentary type analysis (determining the static type of constructs). For Saxon, this means that streamability analysis is done after parsing, name fixup, type analysis, and rewrite optimization.

When Saxon performs these various stages of analysis, it modifies the expression tree as it goes: not just to record the information obtained from the analysis, but to make use of the information at execution time. It goes without saying that in modifying the expression tree, it's not permitted to replace a streamable construct with a non-streamable one, and that isn't too hard to achieve (though these things are relative...). But the requirement to report departures from guaranteed streamability imposes a second requirement, which is proving much harder. If we are to report any deviations from guaranteed streamability, then up to the point where we do the streamability analysis, we must never replace a non-streamable construct with a streamable one.

There are various points at which we currently replace a non-streamable construct with a streamable one.

  • Very early in the process, the expression tree that is output by the parsing phase uses the same data structure to represent equivalent constructs in the source. For example, the expression tree produced by <xsl:if test="$a=2"><xsl:sequence select="3"/></xsl:if> will be identical to the expression tree produced by <xsl:sequence select="if ($a=2) then 3 else ()"/>. But streamability analysis makes a distinction between these two constructs. It's not a big distinction (in fact, the only thing it affects is exactly where you are allowed to call the accumulator-after() function) but it's big enough to count.
  • At any stage in the process, if we spot a constant expression then we're likely to replace it with its value. For example if we see the expression $v+3, and $v is a global variable whose value is 5, we will replace the expression with the literal 8. This won't usually affect streamability one way or the other. However, there are a few cases where it does. The most obvious is where we work out that an expression is void (meaning it always returns an empty sequence). For example, according to the spec, the expression (author[0], author[1]) is not streamable because it makes two downward selections. But Saxon spots that author[0] is void and rewrites the expression as (author[1]), which is streamable. Void expressions often imply some kind of user error, so we often output a warning when this happens, but just because we think the user has written nonsense doesn't absolve us from the conformance requirement to report on guaranteed streamability. Void expressions are particularly likely to be found with schema-aware analysis.
  • Inlining of calls to user-defined functions will often make a non-streamable expression streamable.
  • Many other rewrites performed by the optimizer have a similar effect, for example replacing (X|Y) by *[self::X|self::Y].
My first attempt to meet the requirement is therefore (a) to add information to the expression tree where it's needed to maintain a distinction that affects streamability, and (b) to try to avoid those rewrites that turn non-streamable expressions into streamable ones. As a first cut, skipping the optimization phase completely seems an easy way to achieve (b). But it turns out not to be sufficient, firstly because some rewrites are done during the type-checking phase, and secondly because without an optimization pass we actually find that some expressions that should be streamable are not. The most common case for this is sorting into document order. Given the expression A/B, Saxon actually builds an expression in the form sort(A!B), relying on the sort operation to sort nodes into document order and eliminate duplicates. This relies on the subsequent optimization phase to eliminate the sort() operation when it can. If we skip the optimization phase, we are left with an unstreamable expression.

The other issue is that the streamability rules rely on type inferencing rules that are much simpler than the rules Saxon uses. It's only in rare cases that this will make a difference, of course: in fact, it requires considerable ingenuity to come up with such cases. The most obvious case where types make a difference to streamability is with a construct like <xsl:value-of select="$v"/>: this is motionless if $v is a text or attribute node, but consuming if it is a document or element node. If a global variable with private visibility is initialized with select="@price", but has no "as" attribute, Saxon will infer a type of attribute(price) for the variable, but the rules in the spec will infer a type of item()*. So to get the same streamability answer as the spec gives, we need to downgrade the static type inferencing in Saxon.
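To make that concrete, here is a minimal sketch of the kind of construct in question (the mode and the match pattern are invented for the illustration; the point is the type inferred for $v):

<!-- Saxon infers the type attribute(price) for $v, so the xsl:value-of below
     is motionless; the simpler rules in the spec infer item()*, so the same
     construct has to be classified as consuming. -->
<xsl:mode name="s" streamable="yes"/>

<xsl:variable name="v" select="@price" visibility="private"/>

<xsl:template match="item" mode="s">
  <xsl:value-of select="$v"/>
</xsl:template>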

So I think the changes needed to replicate exactly the streamability rules of the XSLT 3.0 spec are fairly disruptive; moreover, implementing the changes by searching for all the cases that need to change is going to be very difficult to get right (and is very difficult to test unless there is another trustworthy implementation of the rules to test against).

This brings us to Plan B. Plan B is to meet the requirement by writing a completely free-standing tool for streamability analysis that's completely separate from the current static analysis code. One way to do this would be to build on the tool written by John Lumley and demonstrated at Balisage a couple of years ago. Unfortunately that's incomplete and out of date, so it would be a significant effort to finish it. Meeting the requirement in the spec is different here from doing something useful for users: what the spec demands is a yes/no answer as to whether the code is streamable; what users want to know is why, and what they need to change to make the code streamable. The challenge is to do this without users having to understand the difficult abstractions in the spec (posture, sweep, and the rest). John's tool produces an annotated expression tree revealing all the properties: that's great for a user who understands the methodology but probably rather bewildering to the typical end user. Doing the minimum for conformance - a tool that just says yes or no without saying why - would be a lot easier to produce, but it means putting a lot of work into getting a "tick in the box" with a piece of software that no-one will ever use. Conformance has always been a very high priority for Saxonica, but I can't see anyone being happy with this particular solution.

So, assuming the WG maintains its insistence on having this feature (and it seems to me likely that it will), what should we do about it?

One option is simply to declare a non-conformance. Once upon a time, standards conformance was very important to Saxon's reputation in the market, but I doubt that this particular non-conformance would affect our sales.

Another option is to declare conformance, do our best to achieve it using the current analysis technology, and simply log bugs if anyone reports use cases where we get the answer wrong. That seems sloppy and dishonest, and could leave us with a continuing stream of bugs to be fixed or ignored.

Another option is the "minimal Plan B" analyser - a separate tool for streamability analysis, that simply reports a yes/no answer (without explanation). It would be significant piece of work to create this and test it, and it's unclear that anyone would use it, but it's probably the cheapest way of getting the conformance tick-in-the-box.

A final option is to go for a "fully featured" but free-standing streamability analysis tool, one which aims to not only answer the conformance question about guaranteed streamability, but also to provide genuinely useful feedback and advice helping users to create streamable stylesheets. Of course ideally such a tool would be integrated into an IDE rather than being free-standing. I've always argued that there's only a need for one such tool: it's not something that every XSLT 3.0 processor needs to provide. Doing this well would be a large project and involves different skills from those we currently have available.

In the short term, I think the only honest and affordable approach would be the first option: declare a non-conformance. Unfortunately that could threaten the viability of the spec, because we can only get a spec to Recommendation status if all features have been shown to be implementable.

No easy answers.

LATER

I've been thinking about a Plan C which might fly...

The idea here is to try and do the streamability analysis using the current expression tree structure and the current streamability logic, but applying the streamability rules to an expression tree that faithfully represents the stylesheet as parsed, with no modifications from type checking or optimization.

To do this, we need to:

* Define a configuration flag --strictStreamability which invokes the following logic.

* Fix places where the initial expression tree loses information that's needed for streamability analysis. The two that come to mind are (a) losing the information that something is an instruction rather than an expression (e.g. we lose the distinction between xsl:map-entry and a singleton map expression) - this distinction is needed to assess calls on accumulator-after(); (b) turning path expressions A/B into docSort(A!B). There may be other cases that we will discover along the road (or fail to discover, since we may not have a complete set of test cases...)

* Write a new type checker that attaches type information to this tree according to the rules in the XSLT 3.0 spec. This will be much simpler than the existing type checker, partly because the rules are much simpler, but more particularly because the only thing it will do is to assign static types: it will never report any type errors, and it will never inject any code to do run-time type checking or conversion.

* Immediately after this type-checking phase, run the existing streamability rules against the expression tree. As far as I'm aware, the streamability rules in Saxon are equivalent to the W3C rules (at any rate, most of the original differences have now been eliminated).

There are then two options. We could stop here: if the user sets the --strictStreamability flag, they get the report on streamability, but they don't get an executable that can actually be run. The alternative would be, if the streamability analysis succeeds, attempt to convert the expression tree into a form that we can actually use, by running the existing simplify / typecheck / optimize phases. The distinctions introduced to the expression tree by the changes described above would be eliminated by the simplify() phase, and we would then proceed along the current lines, probably including a rerun of the streamability analysis against the optimised expression tree (because the posture+sweep annotations are occasionally needed at run-time).

I will do some further exploration to see whether this all looks feasible. It will be very hard to prove that we've got it 100% right. But in a sense that doesn't matter: so long as the design is sound and we're passing known tests, we can report honestly that to the best of our knowledge the requirement is satisfied, which is not the case with the current approach.

Tuple types, and type aliases - Saxon diaries

| No Comments | No TrackBacks
I've been experimenting with some promising Saxon extensions.

Maps and arrays greatly increase the flexibility and power of the XPath / XSLT / XQuery type system. But one drawback is that the type declarations can be very cumbersome, and very uninformative.

Suppose you want to write a library to handle arithmetic on complex numbers. How are you going to represent a complex number? There are several possibilities: as a sequence of two doubles (xs:double*); as an array of two doubles (array(xs:double)), or as a map, for example map{"r": 0.0e0, "i": 0.0e0} (which has type map(xs:string, xs:double)).

Note that whichever of these choices you make, (a) your choice is exposed to the user of your library by the way you declare the type in your function signatures, (b) the type allows many values that aren't legitimate representations of complex numbers, and (c) there's nothing in the type declaration that tells the reader of your code that this has anything to do with complex numbers.

I think we can tackle these problems with two fairly simple extensions to the language.

First, we can define type aliases. For XSLT, I have implemented an extension that allows you to declare (as a top-level element anywhere in the stylesheet):

<saxon:type-alias name="complex" 
                  type="map(xs:string, xs:double)"/>
and then you can use this type alias (prefixed by a tilde) anywhere an item type is allowed, for example

<xsl:variable name="i" as="~complex" 
              select="cx:complex(0.0, 1.0)"/>
Secondly, we can define tuple types. So we can instead define our complex numbers as:

<saxon:type-alias name="complex" 
                  type="tuple(r: xs:double, i: xs:double)"/>
We're not actually introducing tuples here as a fundamental new type with their own set of functions and operators. Rather, a tuple declaration defines constraints on a map. It lists the keys that must be present in the map, and the type of the value to be associated with each key. The keys here are the strings "r" and "i", and in both cases the value must be an xs:double. The keys are always NCNames, which plays well with the map lookup notation M?K; if $c is a complex number, then the real and imaginary parts can be referenced as $c?r and $c?i respectively.

For this kind of data structure, tuple types provide a much more precise constraint over the contents of the map than the current map type does. It also provides much better static type checking: an expression such as $c?i can be statically checked (a) to ensure that "i" is actually a defined field in the tuple declaration, and (b) that the expression is used in a context where an xs:double value is expected.
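As a small illustration of how this reads in practice (the cx:add function name and its namespace binding are invented for the example), an addition function in the complex-number library can then be declared entirely in terms of the alias:

<!-- Addition of two complex numbers: both arguments and the result are declared
     with the ~complex alias, that is, maps with xs:double fields "r" and "i".
     Assumes the cx prefix is bound to the library's namespace. -->
<xsl:function name="cx:add" as="~complex">
  <xsl:param name="x" as="~complex"/>
  <xsl:param name="y" as="~complex"/>
  <xsl:sequence select="map{'r': $x?r + $y?r, 'i': $x?i + $y?i}"/>
</xsl:function>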

I've been a little wary in the past of putting syntax extensions into Saxon; conformance to standards has always been a primary goal. But the standards process seems to be running out of steam, and I'm beginning to feel that it's time to push a few innovative ideas out in the product to keep things moving forward. For those who would prefer to stick entirely to stuff defined by W3C, rest assured that these features will only be available if you explicitly enable extensions.

Improving Compile-Time Performance - Saxon diaries

| No Comments | No TrackBacks
For years we've been putting more and more effort into optimizing queries and stylesheets so that they would execute as fast as possible. For many workloads, in particular high throughput server-side transformations, that's a good strategy. But over the last year or two we've become aware that for some other workloads, it's the wrong thing to do.

For example, if you're running a DocBook or DITA transformation from the command line, and the source document is only a couple of KB in size, then the time taken to compile the stylesheet greatly exceeds the actual transformation time. It might take 5 seconds to compile the stylesheet, and 50 milliseconds to execute it. (Both DocBook and DITA stylesheets are vast.) For many users, that's not an untypical scenario.

If we look at the XMark benchmarks, specifically a query such as Q9, which is a fairly complex three-way join, the query executes against a 10Mb source document in just 9ms. But to achieve that, we spend 185ms compiling and optimizing the query. We also spend 380ms parsing the source document. So in an ad-hoc processing workflow, where you're compiling the query, loading a source document, and then running a query, the actual query execution cost is about 2% of the total. But it's that 2% that we've been measuring, and trying to reduce.

We haven't entirely neglected the other parts of the process. For example, one of the most under-used features of the product is document projection, which enables you, during parsing, to filter out the parts of the document that the query isn't interested in. For query Q9 that cuts down the size of the source document by 65%, and reduces the execution time of the query to below 8ms. Unfortunately, although the memory saving is very useful, it actually increases the parsing time to 540ms. Some cases are even more dramatic: with Q2, the size of the source document is reduced by 97%; but parsing is still slowed down by the extra work of deciding which parts of the document to retain, and since the query only takes 2ms to execute anyway, there's no benefit other than the memory saving.

For the DocBook and DITA scenarios (unlike XMark) it's the stylesheet compilation time that hurts, rather than the source document parsing time. For a typical DocBook transformation of a small document, I'm seeing a stylesheet compile time of around 3 seconds, source document parsing time of around 0.9ms, and transformation time also around 0.9ms. Clearly, compile time here is far more important than anything else.

The traditional answer to this has always been to compile the stylesheet once and then use it repeatedly. That works if you're running hundreds of transformations using the same stylesheet, but there are many workflows where this is impractical.

Saxon 9.7 makes a big step forward by allowing the compiled form of a stylesheet to be saved to disk. This work was done as part of the implementation of XSLT 3.0 packages, but it doesn't depend on packages in any way and works just as well with 1.0 and 2.0 stylesheets. If we export the docbook stylesheets as a compiled package, and then run from this version rather than from source, the time taken for loading the compiled stylesheet is around 550ms rather than the original 3 seconds. That's a very useful saving especially if you're processing lots of source documents using a pipeline written say using a shell script or Ant build where the tools constrain you to run one transformation at a time. (To ensure that exported stylesheet packages work with tools such as Ant, we've implemented it so that in any API where a source XSLT stylesheet is accepted, we also accept an exported stylesheet package).

But the best performance improvements are those where you don't have to do anything different to get the benefits (cynically, only about 2% of users will ever read the release notes.) So we've got a couple of further projects in the pipeline.

The first is simply raw performance tuning of the optimizer. There's vast potential for this once we turn our minds to it. What we have today has grown organically, and the focus has always been on getting the last ounce of run-time performance regardless of how long it takes to achieve it. One approach is to optimize a bit less thoroughly: we've done a bit of that recently in response to a user bug report showing pathological compilation times on an extremely large (20Mb) automatically generated stylesheet. But a better approach is to think harder about the data structures and algorithms we are using.

Over the last few days I've been looking at how we do loop-lifting: that is, identifying subexpressions that can be moved out of a loop because each evaluation will deliver the same result. The current approach is that the optimizer does a recursive walk of the expression tree, and at each node in the tree, the implementation of that particular kind of expression looks around to see what opportunities there are for local optimization. Many of the looping constructs (xsl:for-each, xsl:iterate, for expressions, filter expressions, path expressions) at this point initiate a search of the subtree for expressions that can be lifted out of the loop. This means that with nested loops (a) we're examining the same subtrees once for each level of loop nesting, and (b) we're hoisting the relevant expressions up the tree one loop at a time, rather than moving them straight to where they belong. This is not only a performance problem; the code is incredibly complex, it's hard to debug, and it's hard to be sure that it's doing as effective a job as it should (for example, I only found during this exercise that we aren't loop-lifting subexpressions out of xsl:for-each-group.)

In 9.7, as reported in previous blog posts, we made some improvements to the data structures used for the expression tree, but so far we've been making rather little use of this. One improvement was to add parent pointers, which enables optimizations to work bottom-up rather than top-down. Another improvement was a generic structure for holding the links from a parent node to its children, using an Operand object that (a) holds properties of the relationship (e.g. it tells you when the child expression is evaluated with a different focus from the parent), and (b) is updatable, so a child expression can replace itself by some different expression without needing the parent expression to get involved. These two improvements have enabled a complete overhaul of the way we do loop-lifting. Without knowing anything about the semantics of different kinds of expressions, we can now do a two-phase process: first we do a scan over the expression tree for a function or template to identify, for each node in the tree, what its "innermost scoping node" is: for example an expression such as "$i + @x" is scoped both by the declaration of $i and by the instruction (e.g. xsl:for-each) that sets the focus, and the innermost scoping expression is the inner one of these two. Then, in a second pass, we hoist every expression that's not at the same looping level as its innermost scoping expression to be evaluated (lazily) outside that loop. The whole process is dramatically simpler and faster than what we were doing before, and at least as effective - possibly in some cases more so.
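As a tiny illustration of what the analysis is looking for (the element and attribute names here are invented):

<!-- The starts-with() test depends only on $customer, which is declared in the
     outer loop, so its innermost scoping node is the outer xsl:for-each; the
     two-phase analysis can therefore hoist it out of the inner loop and
     evaluate it (lazily) once per order rather than once per line. -->
<xsl:for-each select="order">
  <xsl:variable name="customer" select="@customer"/>
  <xsl:for-each select="line">
    <xsl:value-of select="@price * @qty * (if (starts-with($customer, 'VIP')) then 0.9 else 1.0)"/>
  </xsl:for-each>
</xsl:for-each>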

The other project we're just starting on is to look at just-in-time compilation. The thing about stylesheets like DocBook is that they contain zillions of template rules for processing elements which typically don't appear in your average source document. So why waste time compiling template rules that are never used? All we really need to do is make a note of the match patterns, build the data structures we use to identify which rule is the best match for a node, and then do the work of compiling that rule the first time it is used. Indeed, the optimization and byte-code generation work can be deferred until we know that the rule is going to be used often enough to make it worthwhile. We're starting this project (as one should start all performance projects) by collecting instrumentation, so we can work out exactly how much time we are spending in each phase of compilation; that will tell us how much we should be doing eagerly and how much we should defer. There's a trade-off with usability here: do users want to be told about errors found while type-checking parts of the stylesheet that aren't actually exercised by a particular run?

Plenty of ideas to keep us busy for a while to come.

Introducing Saxon-JS - Saxon diaries

| No Comments | No TrackBacks

At XML Prague yesterday we got a spontaneous round of applause when we showed the animated Knight's tour application, reimplemented to use XSLT 3.0 maps and arrays, running in the browser using a new product called Saxon-JS.


So, people will be asking, what exactly is Saxon-JS?


Saxon-EE 9.7 introduces a new option -export which allows you to export a compiled stylesheet, in XML format, to a file: rather like producing a .so file from a C compiler, or a JAR file from a Java compiler. The compiled stylesheet isn't executable code, it's a decorated abstract syntax tree containing, in effect, the optimized stylesheet execution plan. There are two immediate benefits: loading a compiled stylesheet is much faster than loading the original source code, so if you are executing the same stylesheet repeatedly the cost of compilation is amortized; and in addition, it enables you to distribute XSLT code to your users with a degree of intellectual property protection analogous to that obtained from compiled code in other languages. (As with Java, it's not strong encryption - it wouldn't be too hard to write a fairly decent decompiler - but it's strong enough that most people won't attempt it.)


Saxon-JS is an interpreter, written in pure Javascript, that takes these compiled stylesheet files and executes them in a Javascript environment - typically in the browser, or on Node.js. Most of our development and testing is actually being done using Nashorn, a Javascript engine bundled with Java 8, but that's not a serious target environment for Saxon-JS because if you've got Nashorn then you've got Java, and if you've got Java then you don't need Saxon-JS.


Saxon-JS can also be seen as a rewrite of Saxon-CE. Saxon-CE was our first attempt at doing XSLT 2.0 in the browser. It was developed by producing a cut-down version of the Java product, and then cross-compiling this to Javascript using Google's GWT cross-compiler. The main drawbacks of Saxon-CE, at a technical level, were the size of the download (800Kb or so), and the dependency on GWT, which made testing and debugging extremely difficult - for example, there was no way of testing our code outside a browser environment, which made running of automated test scripts very time-consuming and labour-intensive. There were also commercial factors: Saxon-CE was based on a fork of the Saxon 9.3 Java code base, and re-basing to a later Saxon version would have involved a great deal of work; and there was no revenue stream to fund this work, since we found a strong expectation in the market that this kind of product should be free. As a result we effectively allowed the product to become dormant.


We'll have to see whether Saxon-JS can overcome these difficulties, but we think it has a better chance. Because it depends on Saxon-EE for the front-end (that is, there's a cost to developers but the run-time will be free), we're hoping that there'll be a revenue stream to finance support and ongoing development; and although the JS code is not just a fork but a complete rewrite of the run-time code, the fact that it shares the same compiler front end means that it should be easier to keep in sync.


Development has been incredibly rapid - we only started coding at the beginning of January, and we already have about 80% of the XSLT 2.0 tests running - partly because Javascript is a powerful language, but mainly because there's little new design involved. We know how an XSLT engine works, we only have to decide which refinements to leave out. We've also done client-side XSLT before, so we can take the language extensions of Saxon-CE (how to invoke templates in response to mouse events, for example), the design of its Javascript APIs, and also some of its internal design (like the way event bubbling works), and reimplement these for Saxon-JS.


One of the areas where we have to make design trade-offs is deciding how much standards conformance, performance, and error diagnostics to sacrifice in the interests of keeping the code small. There are some areas where achieving 100% conformance with the W3C specs will be extremely difficult, at least until JS6 is available everywhere: an example is support for Unicode in regular expressions. For performance, memory usage (and therefore expression pipelining) is important, but getting the last ounce of processor efficiency less so. An important factor (which we never got quite right for Saxon-CE) is asynchronous access to the server for the doc() and document() functions - I have ideas on how to do this, but it ain't easy.


It will be a few weeks before the code is robust enough for an alpha release, but we hope to get this out as soon as possible. There will probably then be a fairly extended period of testing and polishing - experience suggests that when the code is 90% working, you're less than half way there.


I haven't yet decided on the licensing model. Javascript by its nature has no technical protection, but that doesn't mean we have to give it an open source license (which would allow anyone to make changes, or to take parts of the code for reuse in other projects).


All feedback is welcome: especially on opportunities for exploiting the technology in ways that we might not have thought of.