Many computing platforms are not well-served by up-to-date XML technology, and in consequence Saxonica has been slowly increasing its coverage of the major platforms: extending from Java to .NET, C++, PHP, and Javascript, using a variety of technical approaches. This makes it desirable to implement as much as possible using portable languages, and if we want to minimize our dependence on third-party technologies (IKVMC, for example, is now effectively unsupported) we should be writing in our own languages, notably XSLT.


This note therefore asks the question, could one write an XSD Schema 1.1 processor in XSLT?


In fact a schema processor has two parts: a compile-time part (compiling schema documents into the schema component model, or SCM) and a run-time part (validating an instance document using the SCM).


The first part, compiling, seems to pose no intrinsic difficulty. Some of the rules and constraints that need to be enforced are fairly convoluted, but the only really tricky part is compiling grammars into finite-state-machines, and checking grammars (or the resulting finite-state-machine) for conformance with rules such as the Unique Particle Attribution constraint. But since we already have a tool (written in Java) for compiling schemas into an XML-based SCM file, and since it wouldn't really inconvenience users too much for this tool to be invoked via an HTTP interface, the priority for a portable implementation is really the run-time part of the processor rather than the compile-time part. (Note that this means ignoring xsi:schemaLocation, since that effectively causes the run-time validator to invoke the schema compiler.)


There are two ways one could envisage implementing the run-time part in XSLT: either with a universal stylesheet that takes the SCM and the instance document as inputs, or by generating a custom XSLT stylesheet from the SCM, rather as is done with Schematron. For the moment I'll keep an open mind which of these two approaches is preferable.


Ideally, the XSLT stylesheet would use streaming so the instance document being validated does not need to fit in memory. We'll bear this requirement in mind as we look at the detail.


The XSLT code, of course, cannot rely on any services from a schema processor, so it cannot be schema-aware.


Let's look at the main jobs the validator has to do.


Validating strings against simple types


Validating against a primitive type can be done simply using the XPath castable operator.
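
For example, a check against xs:decimal might look like this (a minimal sketch; $value is an assumed variable holding the string being validated):

<xsl:if test="not($value castable as xs:decimal)">
  <xsl:message select="'Not a valid xs:decimal: ' || $value"/>
</xsl:if>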


Validating against a simple type derived by restriction involves checking the various facets. For the most part, the logic of each facet is easily expressed in XPath. There are a few exceptions:


  • Patterns (regular expressions). The XPath regular expression syntax is a superset of the XSD syntax. To evaluate XSD regular expressions, we either need some kind of extension to the XPath matches() function, or we need to translate XSD regular expressions into XPath regular expressions. This translation is probably not too difficult. It mainly involves rejecting some disallowed constructs (such as back-references, non-capturing groups, and reluctant quantifiers), and escaping "^" and "$" with a backslash.

  • Length facets for hexBinary and base64Binary. A base64Binary value can be cast to hexBinary, and the length of the value in octets can be computed by converting the hexBinary value to a string and dividing the string length by 2, as sketched below.
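
A sketch of that length check in XPath, assuming $value holds the lexical form and $max-length the facet value:

(string-length(string(xs:hexBinary(xs:base64Binary($value)))) idiv 2) le $max-length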


Validating against a list type can be achieved by tokenizing the value and testing each token against the item type.
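
For instance, a list type whose item type is xs:integer could be checked like this (again with $value as an assumed variable):

every $token in tokenize(normalize-space($value), ' ')
  satisfies $token castable as xs:integer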


Validating against a union type can be achieved by validating against each member type in turn until one succeeds (and also validating against any constraining facets defined at the level of the union itself).


Validating elements against complex types


The only difficult case here is complex content. It should be possible to achieve this by iterating over the child nodes using xsl:iterate, keeping the current state (in the FSM) as the value of the iteration parameter. On completion the element is valid if the state is a final state. As each element is processed, it needs to be checked against the state of its parent element's FSM, and in addition a new validator is established for validating its children. This is all streamable.
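
A rough sketch of the shape of this, assuming (purely for illustration) that the relevant finite-state machine from the SCM is available as a map $fsm with a "transitions" entry (a map from state to a map from element name to next state) and a "final-states" entry (a sequence of states); the "validate" mode is likewise invented:

<xsl:iterate select="*">
  <xsl:param name="state" select="0"/>
  <xsl:on-completion>
    <!-- the element is valid only if we end in a final state -->
    <xsl:if test="not($state = $fsm?final-states)">
      <xsl:message select="'Content model not satisfied: content ended too soon'"/>
    </xsl:if>
  </xsl:on-completion>
  <!-- look up the transition for this child in the parent's FSM -->
  <xsl:variable name="next" select="$fsm?transitions($state)(node-name(.))"/>
  <xsl:if test="empty($next)">
    <xsl:message select="'Element ' || name() || ' not allowed here'"/>
  </xsl:if>
  <!-- validate the child against its own type -->
  <xsl:apply-templates select="." mode="validate"/>
  <xsl:next-iteration>
    <xsl:with-param name="state" select="($next, -1)[1]"/>
  </xsl:next-iteration>
</xsl:iterate>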


Assertions and Conditional Type Assignment


Evaluating XPath expressions can be achieved using xsl:evaluate. The main difficulty is setting up the node-tree to which xsl:evaluate is applied. This needs to be a copy of the original source subtree, to ensure that the assertion cannot stray outside the relevant subtree. Making this copy consumes the source subtree, which makes streaming tricky: however, the ordinary complex type validation can also happen on the copy, so I think streaming is possible.
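
A sketch of the assertion check, assuming $assertion holds the test expression (as a string) taken from the SCM:

<!-- snapshot the element being validated, so the assertion cannot see outside it -->
<xsl:variable name="subtree" as="element()">
  <xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="outcome" as="item()*">
  <xsl:evaluate xpath="$assertion" context-item="$subtree"/>
</xsl:variable>
<xsl:if test="not($outcome)">
  <xsl:message select="'Assertion failed: ' || $assertion"/>
</xsl:if>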


Identity constraints (unique, key, keyref)


This is where streaming really gets quite tricky - especially given the complexity of the specification for those rare keyref cases where the key is defined on a different element from the corresponding keyref.


The obvious XSLT mechanism here is accumulators. But accumulator rules are triggered by patterns, and defining the patterns that correspond to the elements involved in a key definition is tricky. For example if sections nest recursively, a uniqueness constraint might say that for every section, its child section elements must have unique @section-number attributes. A corresponding accumulator would have to maintain a stack of sections, with a map of section numbers at each level of the stack, and the accumulator rule for a section would need to check the section number of that section at the current level, and start a new level.
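
As a rough, untested sketch of that idea (using the element and attribute names from the example, with the map prefix bound to the standard map-functions namespace): the accumulator value is a stack of maps, the head being the set of section numbers already seen at the current level.

<xsl:accumulator name="section-numbers" as="map(*)*" initial-value="map{}">
  <xsl:accumulator-rule match="section" phase="start" select="
    let $level := head($value),
        $num   := string(@section-number)
    return
      if (map:contains($level, $num))
      then error((), 'Duplicate section-number ' || $num)
      else (map{}, map:put($level, $num, true()), tail($value))"/>
  <!-- on leaving a section, pop its level off the stack -->
  <xsl:accumulator-rule match="section" phase="end" select="tail($value)"/>
</xsl:accumulator>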


A further complication is that there may be multiple (global and/or local) element declarations with the same name, with different unique / key / keyref constraints. Deciding which of these apply by means of XSLT pattern matching is certainly difficult and may be impossible.


The multiple xs:field elements within a constraint do not have to match components of the key in document order, but a streamed implementation would still be possible using the map constructor, which allows multiple downward selections - provided that the xs:field selector expressions are themselves streamable, which I think is probably always the case.


The problem of streamability could possibly be solved with some kind of dynamic pipelining. The "main" validation process, when it encounters a start tag, is able to establish which element declaration it belongs to, and could in principle spawn another transformation (processing the same input stream) for each key / unique constraint defined in that element declaration: a kind of dynamic xsl:fork.


I think as a first cut it would probably be wise not to attempt streaming in the case of a schema that uses unique / key / keyref constraints. More specifically, if any element has such constraints, it can be deep-copied, and validation can then switch to the in-memory subtree rather than the original stream. After all, we have no immediate plans to implement streaming other than in the Java product, and that will inevitably make an XSLT-based schema processor on other platforms unstreamed anyway.


Outcome of validation


There are two main scenarios we should support: validity checking, and type annotation. With validity checking we want to report many invalidities in a single validation episode, and the main output is the validation report. With type annotation, the main output is a validated version of the instance document, and a single invalidity can cause the process to terminate with a dynamic error.


It is not possible for a non-schema-aware stylesheet to add type annotations to the result tree without some kind of extensions. The XSLT language only allows type annotations to be created as the result of schema validation. So we will need an extension for this purpose: perhaps a saxon:type-annotation="QName" attribute on instructions such as xsl:element, xsl:copy, xsl:attribute.
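
If such an extension existed, its use might look something like this (entirely hypothetical syntax, just to illustrate the idea):

<!-- hypothetical: copy the element and annotate it with the type the validator assigned -->
<xsl:copy saxon:type-annotation="xs:decimal">
  <xsl:apply-templates select="@*, node()"/>
</xsl:copy>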


For reporting validation errors, it's important to report the location of the invalidity. This also requires extensions, such as saxon:line-number().


Conclusion


I don't think there are any serious obstacles to writing a validation engine in XSLT. Making it streamable is harder, especially for identity constraints. A couple of extensions are needed: the ability to add type annotations to the result tree, and the ability to get line numbers of nodes in the source.


I still have an open mind about whether a universal stylesheet should be used, or a generated stylesheet for a particular schema.

Transforming JSON - Saxon diaries


In my conference paper at XML Prague in 2016 I examined a couple of use cases for transforming JSON structures using XSLT 3.0. The overall conclusion was not particularly encouraging: the easiest way to achieve the desired results was to convert the JSON to XML, transform the XML, and then convert it back to JSON.

Unfortunately this study came too late to get any new features into XSLT 3.0. However, I've been taking another look at the use cases to see whether we could design language extensions to handle them, and this is looking quite encouraging.

Use case 1: bulk update

We start with the JSON document

[ { 
  "id": 3, "name": "A blue mouse", "price": 25.50, 
  "dimensions": {"length": 3.1, "width": 1.0, "height": 1.0}, 
  "warehouseLocation": {"latitude": 54.4, "longitude": -32.7 }}, 
  { 
  "id": 2, "name": "An ice sculpture", "price": 12.50, 
  "tags": ["cold", "ice"], 
  "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5 }, 
  "warehouseLocation": {"latitude": -78.75, "longitude": 20.4 }
} ]

and the requirement: for all products having the tag "ice", increase the price by 10%, leaving all other data unchanged. I've prototyped a new XSLT instruction that allows this to be done as follows:

<saxon:deep-update
   root="json-doc('input.json')"
   select=" ?*[?tags?* = 'ice']"
   action="map:put(., 'price', ?price * 1.1)"/>

How does this work?

First the instruction evaluates the root expression, which in this case returns the map/array representation of the input JSON document. With this root item as context item, it then evaluates the select expression to obtain a sequence of contained maps or arrays to be updated: these can appear at any depth under the root item. With each of these selected maps or arrays as the context item, it then evaluates the action expression, and uses the returned value as a replacement for the selected map or array. This update then percolates back up to the root item, and the result of the instruction is a map or array that is the same as the original except for the replacement of the selected items.

The magic here is in the way that the update is percolated back up to the root. Because maps and arrays are immutable and have no persistent identity, the only way to do this is to keep track of the maps and arrays selected en-route from the root item to the items selected for modification as we do the downward selection, and then modify these maps and arrays in reverse order on the way back up. Moreover we need to keep track of the cases where multiple updates are made to the same containing map or array. All this magic, however, is largely hidden from the user. The only thing the user needs to be aware of is that the select expression is constrained to use a limited set of constructs when making downward selections.

The select expression select="?*[?tags?* = 'ice']" perhaps needs a little bit of explanation. The root of the JSON tree is an array of maps, and the initial ?* turns this into a sequence of maps. We then want to filter this sequence of maps to include only those where the value of the "tags" field is an array containing the string "ice" as one of its members. The easiest way to test this predicate is to convert the value from an array of strings to a sequence of strings (so ?tags?*) and then use the XPath existential "=" operator to compare with the string "ice".

The action expression map:put(., 'price', ?price * 1.1) takes as input the selected map, and replaces it with a map in which the price entry is replaced with a new entry having the key "price" and the associated value computed as the old price multiplied by 1.1.

Use case 2: Hierarchic Inversion

The second use case in the XML Prague 2016 paper was a hierarchic inversion (aka grouping) problem. Specifically: we'll look at a structural transformation changing a JSON structure with information about the students enrolled for each course to its inverse, a structure with information about the courses for which each student is enrolled.

Here is the input dataset:

[{ "faculty": "humanities", 
   "courses": [ 
    { "course": "English", 
      "students": [ 
       { "first": "Mary", "last": "Smith", "email": "mary_smith@gmail.com"}, 
       { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com"}
      ]
    },
    { "course": "History", 
      "students": [ 
        { "first": "Ann", "last": "Jones", "email": "ann_jones@gmail.com" }, 
        { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"} 
      ] 
    } ] 
 }, { 
  "faculty": "science", 
  "courses": [ 
  { "course": "Physics", 
    "students": [ 
     { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}, 
     { "first": "Amisha", "last": "Patel", "email": "amisha_patel@gmail.com"}]
  }, 
 { "course": "Chemistry", 
    "students": [ 
     { "first": "John", "last": "Taylor", "email": "john_taylor@gmail.com"}, 
     { "first": "Anil", "last": "Singh", "email": "anil_singh@gmail.com"}
   ]
 } ]
}]

The goal is to produce a list of students, sorted by last name then first name, each containing a list of courses taken by that student, like this:

[
  { "email": "anil_singh@gmail.com", 
    "courses": ["Physics", "Chemistry" ]},
  { "email": "john_taylor@gmail.com", 
    "courses": ["History", "Chemistry" ]},
  .... 
]

The classic way of handling this is in two phases: first reduce the hierarchic input to a flat sequence in which all the required information is contained at one level, and then apply grouping to this flat sequence.

To achieve the flattening we introduce another new XSLT instruction:

<saxon:tabulate-maps
    root="json-doc('input.json')"
    select="?* ! map:find(., 'students')?*"/>

Again the root expression delivers a representation of the JSON document as an array of maps. The select expression first selects these maps ("?*"), then for each one it calls map:find() to get an array of maps each representing a student. The result of the instruction is a sequence of maps corresponding to these student maps in the input, where each output map contains not only the fields present in the input (first, last, email), but also fields inherited from parents and ancestors (faculty, course). For good measure it also contains a field _keys containing an array of keys representing the path from root to leaf, but we don't actually use that in this example.
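
Based on that description, the first map in the flattened sequence would look something like this (shown as an XPath map constructor, with the _keys entry omitted):

map {
  "faculty": "humanities",
  "course": "English",
  "first": "Mary",
  "last": "Smith",
  "email": "mary_smith@gmail.com"
}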

Once we have this flat structure, we can construct a new hierarchy using XSLT grouping:

<xsl:for-each-group select="$students" group-by="?email">
   <xsl:sort select="?last"/>
   <xsl:sort select="?first"/>
   <xsl:map>
     <xsl:map-entry key="'email'" select="?email"/>
     <xsl:map-entry key="'first'" select="?first"/>
     <xsl:map-entry key="'last'" select="?last"/>
     <xsl:map-entry key="'courses'">
        <saxon:array>
           <xsl:for-each select="current-group()">
               <saxon:array-member select="?course"/>
           </xsl:for-each>
       </saxon:array>
    </xsl:map-entry>
  </xsl:map>
</xsl:for-each-group>

This can then be serialized using the JSON output method to produce the required output.

Note: the saxon:array and saxon:array-member instructions already exist in Saxon 9.8. They fill an obvious gap in the XSLT 3.0 facilities for handling arrays - a gap that exists largely because the XSL WG was unwilling to create a dependency on XPath 3.1.

Use Case 3: conversion to HTML

This use case isn't in the XML Prague paper, but is included here for completeness.

The aim here is to construct an HTML page containing the information from a JSON document, without significant structural alteration. This is a classic use case for the recursive application of template rules, so the aim is to make it easy to traverse the JSON structure using templates with appropriate match patterns.

Unfortunately, although the XSLT 3.0 facilities allow patterns that match maps and arrays, they are cumbersome to use. Firstly, the syntax is awkward:

match=".[. instance of map(...)]"

We can solve this with a Saxon extension allowing the syntax

match="map()"

Secondly, the type of a map isn't enough to distinguish one map from another. To identify a map representing a student, for example, we aren't really interested in knowing that it is a map(xs:string, item()*). What we need to know is that it has fields (email, first, last). Fortunately another Saxon extension comes to our aid: tuple types, described here: http://dev.saxonica.com/blog/mike/2016/09/tuple-types-and-type-aliases.html. With tuple types we can change the match pattern to

match="tuple(email, first, last)"

Even better, we can use type aliases:

<saxon:type-alias name="student" type="tuple(email, first, last)"/>
<xsl:template match="~student">...</xsl:template>

With this extension we can now render this input JSON into HTML using the stylesheet:

<?xml version="1.0" encoding="utf-8"?> 

<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:saxon="http://saxon.sf.net/"
   exclude-result-prefixes="#all"
   expand-text="yes">

  <saxon:type-alias name="faculty" type="tuple(faculty, courses)"/>
  <saxon:type-alias name="course" type="tuple(course, students)"/>
  <saxon:type-alias name="student" type="tuple(first, last, email)"/>

  <xsl:template match="~faculty">
    <h1>{?faculty} Faculty</h1>
    <xsl:apply-templates select="?courses?*"/>
  </xsl:template>

  <xsl:template match="~course">
    <h2>{?course} Course</h2>
    <p>List of students:</p>
    <table>
      <thead>
        <tr>
          <th>Name</th>
          <th>Email</th>
        </tr>
      </thead>
      <tbody>
        <xsl:apply-templates select="?students?*">
          <xsl:sort select="?last"/>
          <xsl:sort select="?first"/>
        </xsl:apply-templates>
      </tbody>
    </table>
  </xsl:template>

  <xsl:template match="~student">
    <tr>
      <td>{?first} {?last}</td>
      <td>{?email}</td>
    </tr>
  </xsl:template>

  <xsl:template name="xsl:initial-template">
    <xsl:apply-templates select="json-doc('courses.json')"/>
  </xsl:template>

</xsl:stylesheet>

Conclusions

With only the facilities of the published XSLT 3.0 recommendation, the easiest way to transform JSON is often to convert it first to XML node trees, and then use the traditional XSLT techniques to transform the XML, before converting it back to JSON.

With a few judiciously chosen extensions to the language, however, a wide range of JSON transformations can be achieved natively.

Bugs: How well are we doing? - Saxon diaries

We're about to ship another Saxon 9.7 maintenance release, with another 50 or so bug clearances. The total number of patches we've issued since 9.7 was released in November 2015 has now reached almost 450. The number seems frightening and the pace is relentless. But are we getting it right, or are we getting it badly wrong?

There are frequently-quoted but poorly-sourced numbers you can find on the internet suggesting a norm of 10-25 bugs per thousand lines of code. Saxon is 300,000 lines of (non-comment) code, so that would suggest we can expect a release to have 3000 to 7500 bugs in it. On one measure, that suggests we're doing a lot better than the norm. Or it could also mean that most of the bugs haven't been found yet.

I'm very sceptical of such numbers. I remember a mature product in ICL that was maintained by a sole part-time worker, handling half a dozen bugs a month. When she went on maternity leave, the flow of bugs magically stopped. No-one else could answer the questions, so users stopped sending them in. The same happens with Oracle and Microsoft. I submitted a Java bug once, and got a response 6 years later saying it was being closed with no action. When that happens, you stop sending in bug reports. So in many ways, a high number of bug reports doesn't mean you have a buggy product, it means you have a responsive process for responding to them. I would hate the number of bug reports we get to drop because people don't think there's any point in submitting them.

And of course the definition of what is a bug is completely slippery. Very few of the bug reports we get are completely without merit, in the sense that the product is doing exactly what it says on the tin; at the same time, rather few are incontrovertible bugs either. If diagnostics are unhelpful, is that a bug?

The only important test really is whether our users are satisfied with the reliability of the product. We don't really get enough feedback on that at a high level. Perhaps we should make more effort to find out; but I so intensely hate completing customer satisfaction questionnaires myself that I'm very reluctant to inflict them on our users. Given that open source users outnumber commercial users by probably ten-to-one, and that the satisfaction of our open source users is just as important to us as the satisfaction of our commercial customers (because it's satisfied open source users who do all the sales work for us); and given that we don't actually have any way of "reaching out" to our open source users (how I hate the marketing jargon); and given that we really wouldn't know what to do differently if we discovered that 60% of our users were "satisfied or very satisfied": I don't really see very much value in the exercise. But I guess putting a survey form on the web site wouldn't be difficult, and some people might interpret it as a signal that we actually care.

With 9.7 there was a bit of a shift in policy towards fixing bugs pro-actively (more marketing speak). In particular, we've been in a phase where the XSLT and XQuery specs were becoming very stable but more test cases were becoming available all the time (many of them, I might add, contributed by Saxonica - often in reaction to queries from our users). So we've continuously been applying new tests to the existing release, which is probably a first. Where a test showed that we were handling edge cases incorrectly, and indeed when the spec was changed in little ways under our feet, we've raised bugs and fixes to keep the conformance level as high as possible (while also maintaining compatibility). So we've shifted the boundary a little between feature changes (which traditionally only come in the next release), and bug fixes, which come in a maintenance release. That shift also helps to explain why the gap between releases is becoming longer - though the biggest factor holding us back, I think, is the ever-increasing amount of testing that we do before a release.

Fixing bugs pro-actively (that is before any user has hit the bug) has the potential to improve user satisfaction if it means that they never do hit the bug. I think it's always as well to remember also that for every user who reports a bug there may be a dozen users who hit it and don't report it. One reason we monitor StackOverflow is that a lot of users feel more confident about reporting a problem there, rather than reporting it directly to us. Users know that their knowledge is limited and they don't want to make fools of themselves, and you need a high level of confidence to tell your software vendor that you think the product is wrong. 

On the other hand, destabilisation is a risk. A fix in one place will often expose a bug somewhere else, or re-awaken an old bug that had been laid to rest. As a release becomes more mature, we try to balance the benefits of fixing problems with the risk of de-stabilisation.

So, what about testing? Can we say that because we've fixed 450 bugs, we didn't run enough tests in the first place?

Yes, in a sense that's true, but how many more tests would we have had to write in order to catch them? We probably run about a million test cases (say, 100K tests in an average of ten product configurations each) and these days the last couple of months before a major release are devoted exclusively to testing. (I know that means we don't do enough continuous testing. But sorry, it doesn't work for me. If we're doing something radical to the internals of the product then things are going to break in the process, and my style is to get the new design working while it's still fresh in my head, then pick up the broken pieces later. If everything had to work in every nightly build, we would never get the radical things done. That's a personal take, and of course what works with a 3-4 person team doesn't necessarily work with a larger project. We're probably pretty unusual in developing a 300Kloc software package with 3-4 people, so lots of our experience might not extrapolate.)

We've had a significant number of bug reports this time on performance regression. (This is of course another area where it's arguable whether it's a bug or not. Sometimes we will change the design in a way that we know benefits some workloads at the expense of others.) Probably most of these are extreme scenarios, for example compilation time for stylesheets where a single template declares 500 local variables. Should we have run tests to prevent that? Well, perhaps we should have more extreme cases in our test suite: the vast majority of our test cases are trivially small. But the problem is, there will always be users who do things that we would never have imagined. Like the user running an XSD 1.1 schema validation in which tens of thousands of assertions are expected to "fail", because they've written it in such a way that assertion failures aren't really errors, they are just a source of statistics for reporting on the data.

The bugs we hate most (and therefore should do most to prevent) are bugs in bytecode generation, streaming, and multi-threading. The reason we hate them is that they can be a pig to debug, especially when the user-written application is large and complex.

  • For bytecode generation I think we've actually got pretty good test coverage, because we not only run every test in the QT3 and XSLT3 test suites with bytecode generation enabled, we also artificially complicate the tests to stop queries like 2+5 being evaluated by the compiler before bytecode generation kicks in. We've also got an internal recovery mechanism so if we detect that we've generated bad code, we fall back to interpreted mode and the user never notices (problem with that is of course that we never find out).
  • Streaming is tricky because the code is so convoluted (writing everything as inverted event-based code can be mind-blowing) and because the effects of getting it wrong often give very little clue as to the cause. But at least the failure is "in your face" for the user, who will therefore report the problem, and it's likely to be reproducible. Another difficulty with streaming is that because not all code is streamable, tests for streaming needed to be written from scratch.
  • Multi-threading bugs are horrible because they occur unpredictably. If there's a low probability of the problem happening then it can require a great deal of detective work to isolate the circumstances, and this often falls on the user rather than on ourselves. Fortunately we only get a couple of these a year, but they are a nightmare when they come. In 9.7 we changed our Java baseline to Java 6 and were able therefore to replace much of the hand-built multithreading code in Saxon with standard Java libraries, which I think has helped reliability a lot. But there are essentially no tools or techniques to protect you from making simple thread-safety blunders, like setting a property in a shared object without synchronization. Could we do more testing to prevent these bugs? I'm not optimistic, because the bugs we get are so few, and so particular to a specific workload, that searching the haystack just in case it contains a needle is unlikely to be effective.
Summary: Having the product perceived as reliable by our users is more important to us than the actual bug count. Fixing bugs quickly before they affect more users is probably the best way of achieving that. If the bug count is high because we're raising bugs ourselves as a result of our own testing, then that's no bad thing. It hasn't yet got to the level where we can't cope with the volumes, or where we have to filter things through staff who are only employed to do support. If we can do things better, let us know.


Guaranteed Streamability - Saxon diaries

The XSLT 3.0 specification in its current form provides a set of rules (that can be evaluated statically, purely by inspecting the stylesheet) for determining whether the code is (or is not) guaranteed streamable.

If the code is guaranteed streamable then every processor (if it claims to support streaming at all) must use streaming to evaluate the stylesheet; if it is not guaranteed streamable then the processor can choose whether to use streaming or not.

The tricky bit is that there's a requirement in the spec that if the code isn't guaranteed streamable, then a streaming processor (on request) has to detect this and report it. The status section of the spec says that this requirement is "at risk", meaning it might be removed if it proves too difficult to implement. There are people on the working group who believe passionately that this requirement is really important for interoperability; there are others (including me) who fully understand why users would like to have this, but have been arguing that it is extremely difficult to deliver.

In this article I'm going to try to explain why it's so difficult to achieve this requirement, and to explore possibilities for overcoming these difficulties.

Streamability analysis can't be performed until various other stages of static analysis are complete. It generally requires that names have been resolved (for example, names of modes and names of streamable functions). It also relies on rudimentary type analysis (determining the static type of constructs). For Saxon, this means that streamability analysis is done after parsing, name fixup, type analysis, and rewrite optimization.

When Saxon performs these various stages of analysis, it modifies the expression tree as it goes: not just to record the information obtained from the analysis, but to make use of the information at execution time. It goes without saying that in modifying the expression tree, it's not permitted to replace a streamable construct with a non-streamable one, and that isn't too hard to achieve (though these things are relative...). But the requirement to report departures from guaranteed streamability imposes a second requirement, which is proving much harder. If we are to report any deviations from guaranteed streamability, then up to the point where we do the streamability analysis, we must never replace a non-streamable construct with a streamable one.

There are various points at which we currently replace a non-streamable construct with a streamable one.

  • Very early in the process, the expression tree that is output by the parsing phase uses the same data structure on the expression tree to represent equivalent constructs in the source. For example, the expression tree produced by <xsl:if test="$a=2"><xsl:sequence select="3"/></xsl:if> will be identical to the expression tree produced by <xsl:sequence select="if ($a=2) then 3 else ()"/>. But streamability analysis makes a distinction between these two constructs. It's not a big distinction (in fact, the only thing it affects is exactly where you are allowed to call the accumulator-after() function) but it's big enough to count.
  • At any stage in the process, if we spot a constant expression then we're likely to replace it with its value. For example if we see the expression $v+3, and $v is a global variable whose value is 5, we will replace the expression with the literal 8. This won't usually affect streamability one way or the other. However, there are a few cases where it does. The most obvious is where we work out that an expression is void (meaning it always returns an empty sequence). For example, according to the spec, the expression (author[0], author[1]) is not streamable because it makes two downward selections. But Saxon spots that author[0] is void and rewrites the expression as (author[1]), which is streamable. Void expressions often imply some kind of user error, so we often output a warning when this happens, but just because we think the user has written nonsense doesn't absolve us from the conformance requirement to report on guaranteed streamability. Void expressions are particularly likely to be found with schema-aware analysis.
  • Inlining of calls to user-defined functions will often make a non-streamable expression streamable.
  • Many other rewrites performed by the optimizer have a similar effect, for example replacing (X|Y) by *[self::X|self::Y].
My first attempt to meet the requirement is therefore (a) to add information to the expression tree where it's needed to maintain a distinction that affects streamability, and (b) to try to avoid those rewrites that turn non-streamable expressions into streamable ones. As a first cut, skipping the optimization phase completely seems an easy way to achieve (b). But it turns out it's not sufficient, firstly because some rewrites are done during the type-checking phase, and secondly because it turns out that without an optimization pass, we actually end up finding that some expressions that should be streamable are not. The most common case for this is sorting into document order. Given the expression A/B, Saxon actually builds an expression in the form sort(A!B) relying on the sort operation to sort nodes into document order and eliminate duplicates. This relies on the subsequent optimization phase to eliminate the sort() operation when it can. If we skip the optimization phase, we are left with an unstreamable expression.

The other issue is that the streamability rules rely on type inferencing rules that are much simpler than the rules Saxon uses. It's only in rare cases that this will make a difference, of course: in fact, it requires considerable ingenuity to come up with such cases. The most obvious case where types make a difference to streamability is with a construct like <xsl:value-of select="$v"/>: this is motionless if $v is a text or attribute node, but consuming if it is a document or element node. If a global variable with private visibility is initialized with select="@price", but has no "as" attribute, Saxon will infer a type of attribute(price) for the variable, but the rules in the spec will infer a type of item()*. So to get the same streamability answer as the spec gives, we need to downgrade the static type inferencing in Saxon.

So I think the changes needed to replicate exactly the streamability rules of the XSLT 3.0 spec are fairly disruptive; moreover, implementing the changes by searching for all the cases that need to change is going to be very difficult to get right (and is very difficult to test unless there is another trustworthy implementation of the rules to test against).

This brings us to Plan B. Plan B is to meet the requirement by writing a completely free-standing tool for streamability analysis that's completely separate from the current static analysis code. One way to do this would be to build on the tool written by John Lumley and demonstrated at Balisage a couple of years ago. Unfortunately that's incomplete and out of date, so it would be a significant effort to finish it. Meeting the requirement in the spec is different here from doing something useful for users: what the spec demands is a yes/no answer as to whether the code is streamable; what users want to know is why, and what they need to change to make the code streamable. The challenge is to do this without users having to understand the difficult abstractions in the spec (posture, sweep, and the rest). John's tool produces an annotated expression tree revealing all the properties: that's great for a user who understands the methodology but probably rather bewildering to the typical end user. Doing the minimum for conformance, a tool that just says yes or no without saying why, involves a lot of work to get a "tick in the box" with a piece of software that no-one will ever use, but would be a lot easier to produce. Conformance has always been a very high priority for Saxonica, but I can't see anyone being happy with this particular solution.

So, assuming the WG maintains its insistence on having this feature (and it seems to me likely that it will), what should we do about it?

One option is simply to declare a non-conformance. Once upon a time, standards conformance was very important to Saxon's reputation in the market, but I doubt that this particular non-conformance would affect our sales.

Another option is to declare conformance, do our best to achieve it using the current analysis technology, and simply log bugs if anyone reports use cases where we get the answer wrong. That seems sloppy and dishonest, and could leave us with a continuing stream of bugs to be fixed or ignored.

Another option is the "minimal Plan B" analyser - a separate tool for streamability analysis that simply reports a yes/no answer (without explanation). It would be a significant piece of work to create this and test it, and it's unclear that anyone would use it, but it's probably the cheapest way of getting the conformance tick-in-the-box.

A final option is to go for a "fully featured" but free-standing streamability analysis tool, one which aims to not only answer the conformance question about guaranteed streamability, but also to provide genuinely useful feedback and advice helping users to create streamable stylesheets. Of course ideally such a tool would be integrated into an IDE rather than being free-standing. I've always argued that there's only a need for one such tool: it's not something that every XSLT 3.0 processor needs to provide. Doing this well would be a large project and involves different skills from those we currently have available.

In the short term, I think the only honest and affordable approach would be the first option: declare a non-conformance. Unfortunately that could threaten the viability of the spec, because we can only get a spec to Recommendation status if all features have been shown to be implementable.

No easy answers.

LATER

I've been thinking about a Plan C which might fly...

The idea here is to try and do the streamability analysis using the current expression tree structure and the current streamability logic, but applying the streamability rules to an expression tree that faithfully represents the stylesheet as parsed, with no modifications from type checking or optimization.

To do this, we need to:

* Define a configuration flag --strictStreamability which invokes the following logic.

* Fix places where the initial expression tree loses information that's needed for streamability analysis. The two that come to mind are (a) losing the information that something is an instruction rather than an expression (e.g. we lose the distinction between xsl:map-entry and a singleton map expression) - this distinction is needed to assess calls on accumulator-after(); (b) turning path expressions A/B into docSort(A!B). There may be other cases that we will discover along the road (or fail to discover, since we may not have a complete set of test cases...)

* Write a new type checker that attaches type information to this tree according to the rules in the XSLT 3.0 spec. This will be much simpler than the existing type checker, partly because the rules are much simpler, but more particularly because the only thing it will do is to assign static types: it will never report any type errors, and it will never inject any code to do run-time type checking or conversion.

* Immediately after this type-checking phase, run the existing streamability rules against the expression tree. As far as I'm aware, the streamability rules in Saxon are equivalent to the W3C rules (at any rate, most of the original differences have now been eliminated).

There are then two options. We could stop here: if the user sets the --strictStreamability flag, they get the report on streamability, but they don't get an executable that can actually be run. The alternative would be, if the streamability analysis succeeds, attempt to convert the expression tree into a form that we can actually use, by running the existing simplify / typecheck / optimize phases. The distinctions introduced to the expression tree by the changes described above would be eliminated by the simplify() phase, and we would then proceed along the current lines, probably including a rerun of the streamability analysis against the optimised expression tree (because the posture+sweep annotations are occasionally needed at run-time).

I will do some further exploration to see whether this all looks feasible. It will be very hard to prove that we've got it 100% right. But in a sense that doesn't matter: so long as the design is sound and we're passing known tests, we can report honestly that to the best of our knowledge the requirement is satisfied, which is not the case with the current approach.

Tuple types, and type aliases - Saxon diaries

I've been experimenting with some promising Saxon extensions.

Maps and arrays greatly increase the flexibility and power of the XPath / XSLT / XQuery type system. But one drawback is that the type declarations can be very cumbersome, and very uninformative.

Suppose you want to write a library to handle arithmetic on complex numbers. How are you going to represent a complex number? There are several possibilities: as a sequence of two doubles (xs:double*); as an array of two doubles (array(xs:double)); or as a map, for example map{"r": 0.0e0, "i": 0.0e0} (which has type map(xs:string, xs:double)).

Note that whichever of these choices you make, (a) your choice is exposed to the user of your library by the way you declare the type in your function signatures, (b) the type allows many values that aren't legitimate representations of complex numbers, and (c) there's nothing in the type declaration that tells the reader of your code that this has anything to do with complex numbers.

I think we can tackle these problems with two fairly simple extensions to the language.

First, we can define type aliases. For XSLT, I have implemented an extension that allows you to declare (as a top-level element anywhere in the stylesheet):

<saxon:type-alias name="complex" 
                  type="map(xs:string, xs:double)"/>
and then you can use this type alias (prefixed by a tilde) anywhere an item type is allowed, for example

<xsl:variable name="i" as="~complex" 
              select="cx:complex(0.0, 1.0)"/>
Secondly, we can define tuple types. So we can instead define our complex numbers as:

<saxon:type-alias name="complex" 
                  type="tuple(r: xs:double, i: xs:double)"/>
We're not actually introducing tuples here as a fundamental new type with their own set of functions and operators. Rather, a tuple declaration defines constraints on a map. It lists the keys that must be present in the map, and the type of the value to be associated with each key. The keys here are the strings "r" and "i", and in both cases the value must be an xs:double. The keys are always NCNames, which plays well with the map lookup notation M?K; if $c is a complex number, then the real and imaginary parts can be referenced as $c?r and $c?i respectively.
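
For instance, an addition function in the complex-arithmetic library might be declared like this (assuming a suitable cx namespace; the function name is illustrative):

<xsl:function name="cx:add" as="~complex">
  <xsl:param name="x" as="~complex"/>
  <xsl:param name="y" as="~complex"/>
  <xsl:sequence select="map{'r': $x?r + $y?r, 'i': $x?i + $y?i}"/>
</xsl:function>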

For this kind of data structure, tuple types provide a much more precise constraint over the contents of the map than the current map type does. It also provides much better static type checking: an expression such as $c?i can be statically checked (a) to ensure that "i" is actually a defined field in the tuple declaration, and (b) that the expression is used in a context where an xs:double value is expected.

I've been a little wary in the past of putting syntax extensions into Saxon; conformance to standards has always been a primary goal. But the standards process seems to be running out of steam, and I'm beginning to feel that it's time to push a few innovative ideas out in product to keep things moving forward. For those who would prefer to stick entirely to stuff defined by W3C, rest assured that these features will only be available if you explicitly enable extensions.

Improving Compile-Time Performance - Saxon diaries

For years we've been putting more and more effort into optimizing queries and stylesheets so that they would execute as fast as possible. For many workloads, in particular high throughput server-side transformations, that's a good strategy. But over the last year or two we've become aware that for some other workloads, it's the wrong thing to do.

For example, if you're running a DocBook or DITA transformation from the command line, and the source document is only a couple of KB in size, then the time taken to compile the stylesheet greatly exceeds the actual transformation time. It might take 5 seconds to compile the stylesheet, and 50 milliseconds to execute it. (Both DocBook and DITA stylesheets are vast.) For many users, that's not an untypical scenario.

If we look at the XMark benchmarks, specifically a query such as Q9, which is a fairly complex three-way join, the query executes against a 10Mb source document in just 9ms. But to achieve that, we spend 185ms compiling and optimizing the query. We also spend 380ms parsing the source document. So in an ad-hoc processing workflow, where you're compiling the query, loading a source document, and then running a query, the actual query execution cost is about 2% of the total. But it's that 2% that we've been measuring, and trying to reduce.

We haven't entirely neglected the other parts of the process. For example, one of the most under-used features of the product is document projection, which enables you during parsing, to filter out the parts of the document that the query isn't interested in. For query Q9 that cuts down the size of the source document by 65%, and reduces the execution time of the query to below 8ms. Unfortunately, although the memory saving is very useful, it actually increases the parsing time to 540ms. Some cases are even more dramatic: with Q2, the size of the source document is reduced by 97%; but parsing is still slowed down by the extra work of deciding which parts of the document to retain, and since the query only takes 2ms to execute anyway, there's no benefit other than the memory saving.

For the DocBook and DITA scenarios (unlike XMark) it's the stylesheet compilation time that hurts, rather than the source document parsing time. For a typical DocBook transformation of a small document, I'm seeing a stylesheet compile time of around 3 seconds, source document parsing time of around 0.9ms, and transformation time also around 0.9ms. Clearly, compile time here is far more important than anything else.

The traditional answer to this has always been to compile the stylesheet once and then use it repeatedly. That works if you're running hundreds of transformations using the same stylesheet, but there are many workflows where this is impractical.

Saxon 9.7 makes a big step forward by allowing the compiled form of a stylesheet to be saved to disk. This work was done as part of the implementation of XSLT 3.0 packages, but it doesn't depend on packages in any way and works just as well with 1.0 and 2.0 stylesheets. If we export the docbook stylesheets as a compiled package, and then run from this version rather than from source, the time taken for loading the compiled stylesheet is around 550ms rather than the original 3 seconds. That's a very useful saving especially if you're processing lots of source documents using a pipeline written say using a shell script or Ant build where the tools constrain you to run one transformation at a time. (To ensure that exported stylesheet packages work with tools such as Ant, we've implemented it so that in any API where a source XSLT stylesheet is accepted, we also accept an exported stylesheet package).
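
To give a concrete (if simplified) picture, running from an exported package on the command line looks something like this, with the exported file passed wherever a stylesheet would normally be accepted (file names here are illustrative):

java com.saxonica.Transform -xsl:docbook.sef -s:article.xml -o:article.html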

But the best performance improvements are those where you don't have to do anything different to get the benefits (cynically, only about 2% of users will ever read the release notes.) So we've got a couple of further projects in the pipeline.

The first is simply raw performance tuning of the optimizer. There's vast potential for this once we turn our minds to it. What we have today has grown organically, and the focus has always been on getting the last ounce of run-time performance regardless how long it takes to achieve it. One approach is to optimize a bit less thoroughly: we've done a bit of that recently in response to a user bug report showing pathological compilation times on an extremely large (20Mb) automatically generated stylesheet. But a better approach is to think harder about the data structures and algorithms we are using.

Over the last few days I've been looking at how we do loop-lifting: that is, identifying subexpressions that can be moved out of a loop because each evaluation will deliver the same result. The current approach is that the optimizer does a recursive walk of the expression tree, and at each node in the tree, the implementation of that particular kind of expression looks around to see what opportunities there are for local optimization. Many of the looping constructs (xsl:for-each, xsl:iterate, for expressions, filter expressions, path expressions) at this point initiate a search of the subtree for expressions that can be lifted out of the loop. This means that with nested loops (a) we're examining the same subtrees once for each level of loop nesting, and (b) we're hoisting the relevant expressions up the tree one loop at a time, rather than moving them straight to where they belong. This is not only a performance problem; the code is incredibly complex, it's hard to debug, and it's hard to be sure that it's doing as effective a job as it should (for example, I only found during this exercise that we aren't loop-lifting subexpressions out of xsl:for-each-group.)
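
To make the idea concrete, here is a contrived example (the names are invented): the contains() test in the predicate depends on neither loop's context item, so it can be lifted out and evaluated just once, before either loop starts.

<xsl:for-each select="//order">
  <xsl:for-each select="line[contains($config?flags, 'discount')]">
    <!-- the predicate above is loop-invariant with respect to both loops -->
    <xsl:copy-of select="."/>
  </xsl:for-each>
</xsl:for-each>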

In 9.7, as reported in previous blog posts, we made some improvements to the data structures used for the expression tree, but so far we've been making rather little use of this. One improvement was to add parent pointers, which enables optimizations to work bottom-up rather than top-down. Another improvement was a generic structure for holding the links from a parent node to its children, using an Operand object that (a) holds properties of the relationship (e.g. it tells you when the child expression is evaluated with a different focus from the parent), and (b) is updatable, so a child expression can replace itself by some different expression without needing the parent expression to get involved. These two improvements have enabled a complete overhaul of the way we do loop-lifting. Without knowing anything about the semantics of different kinds of expressions, we can now do a two-phase process: first we do a scan over the expression tree for a function or template to identify, for each node in the tree, what its "innermost scoping node" is: for example an expression such as "$i + @x" is scoped both by the declaration of $i and by the instruction (e.g. xsl:for-each) that sets the focus, and the innermost scoping expression is the inner one of these two. Then, in a second pass, we hoist every expression that's not at the same looping level as its innermost scoping expression to be evaluated (lazily) outside that loop. The whole process is dramatically simpler and faster than what we were doing before, and at least as effective - possibly in some cases more so.

The other project we're just starting on is to look at just-in-time compilation. The thing about stylesheets like DocBook is that they contain zillions of template rules for processing elements which typically don't appear in your average source document. So why waste time compiling template rules that are never used? All we really need to do is make a note of the match patterns, build the data structures we use to identify which rule is the best match for a node, and then do the work of compiling that rule the first time it is used. Indeed, the optimization and byte-code generation work can be deferred until we know that the rule is going to be used often enough to make it worthwhile. We're starting this project (as one should start all performance projects) by collecting instrumentation, so we can work out exactly how much time we are spending in each phase of compilation; that will tell us how much we should be doing eagerly and how much we should defer. There's a trade-off with usability here: do users want to be told about errors found while type-checking parts of the stylesheet that aren't actually exercised by a particular run?

Plenty of ideas to keep us busy for a while to come.

Introducing Saxon-JS - Saxon diaries


At XML Prague yesterday we got a spontaneous round of applause when we showed the animated Knight's tour application, reimplemented to use XSLT 3.0 maps and arrays, running in the browser using a new product called Saxon-JS.


So, people will be asking, what exactly is Saxon-JS?


Saxon-EE 9.7 introduces a new option -export which allows you to export a compiled stylesheet, in XML format, to a file: rather like producing a .so file from a C compiler, or a JAR file from a Java compiler. The compiled stylesheet isn't executable code, it's a decorated abstract syntax tree containing, in effect, the optimized stylesheet execution plan. There are two immediate benefits: loading a compiled stylesheet is much faster than loading the original source code, so if you are executing the same stylesheet repeatedly the cost of compilation is amortized; and in addition, it enables you to distribute XSLT code to your users with a degree of intellectual property protection analogous to that obtained from compiled code in other languages. (As with Java, it's not strong encryption - it wouldn't be too hard to write a fairly decent decompiler - but it's strong enough that most people won't attempt it.)


Saxon-JS is an interpreter, written in pure Javascript, that takes these compiled stylesheet files and executes them in a Javascript environment - typically in the browser, or on Node.js. Most of our development and testing is actually being done using Nashorn, a Javascript engine bundled with Java 8, but that's not a serious target environment for Saxon-JS because if you've got Nashorn then you've got Java, and if you've got Java then you don't need Saxon-JS.


Saxon-JS can also be seen as a rewrite of Saxon-CE. Saxon-CE was our first attempt at doing XSLT 2.0 in the browser. It was developed by producing a cut-down version of the Java product, and then cross-compiling this to Javascript using Google's GWT cross-compiler. The main drawbacks of Saxon-CE, at a technical level, were the size of the download (800Kb or so), and the dependency on GWT which made testing and debugging extremely difficult - for example, there was no way of testing our code outside a browser environment, which made running of automated test scripts very time-consuming and labour-intensive. There were also commercial factors: Saxon-CE was based on a fork of the Saxon 9.3 Java code base and re-basing to a later Saxon version would have involved a great deal of work; and there was no revenue stream to fund this work, since we found a strong expectation in the market that this kind of product should be free. As a result we effectively allowed the product to become dormant.


We'll have to see whether Saxon-JS can overcome these difficulties, but we think it has a better chance. Because it depends on Saxon-EE for the front-end (that is, there's a cost to developers but the run-time will be free), we're hoping that there'll be a revenue stream to finance support and ongoing development; and although the JS code is not just a fork but a complete rewrite of the run-time code, the fact that it shares the same compiler front end means that it should be easier to keep in sync.


Development has been incredibly rapid - we only started coding at the beginning of January, and we already have about 80% of the XSLT 2.0 tests running - partly because Javascript is a powerful language, but mainly because there's little new design involved. We know how an XSLT engine works; we only have to decide which refinements to leave out. We've also done client-side XSLT before, so we can take the language extensions of Saxon-CE (how to invoke templates in response to mouse events, for example), the design of its Javascript APIs, and also some of its internal design (like the way event bubbling works), and reimplement these for Saxon-JS.


One of the areas where we have to make design trade-offs is deciding how much standards conformance, performance, and error diagnostics to sacrifice in the interests of keeping the code small. There are some areas where achieving 100% conformance with the W3C specs will be extremely difficult, at least until JS6 is available everywhere: an example is support for Unicode in regular expressions. For performance, memory usage (and therefore expression pipelining) is important, but getting the last ounce of processor efficiency less so. An important factor (which we never got quite right for Saxon-CE) is asynchronous access to the server for the doc() and document() functions - I have ideas on how to do this, but it ain't easy.


It will be a few weeks before the code is robust enough for an alpha release, but we hope to get this out as soon as possible. There will probably then be a fairly extended period of testing and polishing - experience suggests that when the code is 90% working, you're less than half way there.


I haven't yet decided on the licensing model. Javascript by its nature has no technical protection, but that doesn't mean we have to give it an open source license (which would allow anyone to make changes, or to take parts of the code for reuse in other projects).


All feedback is welcome: especially on opportunities for exploiting the technology in ways that we might not have thought of.

Parent pointers in the Saxon expression tree - Saxon diaries

A while ago (http://dev.saxonica.com/blog/mike/2014/11/redesigning-the-saxon-expression-tree.html) I wrote about my plans for the Saxon expression tree. This note is an update.

We've made a number of changes to the expression tree for 9.7.

  • Every node in the tree (every expression) now references a Location object, providing location information for diagnostics (line number, column number, etc). Previously the expression node implemented the SourceLocator interface, which meant it provided this information directly. The benefit is that we can now have different kinds of Location object. In XQuery we will typically hold the line and column and module URI. In XSLT, for a subexpression within an XPath expression, we can now hold both the offset within the XPath expression, and the path to the containing node within the XSLT stylesheet. Hopefully debuggers and editing tools such as oXygen and Stylus Studio will be able to take advantage of the improved location information to lead users straight to the error location in the editor. Where an expression has the same location information as its parent or sibling expressions, the Location object is shared.

Another reason for changing the way we hold location information is connected with the move to separately-compiled packages in XSLT 3.0. This means that the system we previously used, of globally-unique integer "location identifiers" which are translated into real location information by reference to a central "location provider" service, is no longer viable. 

  • Every node in the tree now points to a RetainedStaticContext object which holds that part of the static context which can vary from one expression to another, and which can be required at run-time. Previously we only attempted to retain the parts of the static context that each kind of expression actually used. The parts of the static context that this covers include the static base URI, in-scope namespaces, the default collation, and the XPath 1.0 compatibility flag. Retaining the whole static context might seem extravagant. But in fact, it very rarely changes, so a child expression will nearly always point to the same RetainedStaticContext object as its parent and sibling expressions.

  • Every node in the tree now points to its parent node. This choice has proved tricky. It gives many advantages: it means that the code for every expression can easily find details of the containing package, the configuration options, and a host of details about the query or stylesheet as a whole. The fact that we have a parent node eliminates the need for the "container" object (typically the containing function or template) which we held in previous releases. It also reduces the need to pass additional information to methods on the Expression class, for example methods to determine the item type and cardinality of the expression. There is a significant downside to holding this information, which is the need to keep it consistent. Some of the tree rewrite operations performed by the optimizer are complex enough without having to worry about keeping all the parent pointers correct. And it turns out to be quite difficult to enforce consistency through the normal "private data, public methods" encapsulation techniques: those work when you have to keep the data in a single object consistent, but they aren't much use for maintaining mutual consistency between two different objects. In any case it seems to be unavoidable that to achieve the kind of tree rewrites we want to perform, the tree has to be temporarily inconsistent at various stages.
 
Using parent pointers means that you can't share subtrees. It means that when you perform operations like inlining a function, you can't just reference the subtree that formed the body of the function, you have to copy it. This might seem a great nuisance. But actually, this is not a new constraint. It never was safe to share subtrees, because the optimiser would happily make changes to a subtree without knowing that there were other interested parties. The bugs this caused have been an irritation for years. The introduction of parent pointers makes the constraint more explicit, and makes it possible to perform integrity checking on the tree to discover when we have inadvertently violated the constraints.

During development we've had diagnostic code switched on that checks the integrity of the tree and outputs warnings if problems are found. We've gradually been examining these and eliminating them. The problems can be very hard to diagnose, because the detection of a problem in the data may indicate an error that occurred in a much earlier phase of processing. We've developed some diagnostic tools for tracing the changes made to a particular part of the tree and correlating these with the problems detected later. Most of the problems, as one might expect, are connected with optimization rewrites. A particular class of problem occurs with rewrites that are started but then not completed (because problems are found), or with "temporary" rewrites that are designed to create an equivalent expression suitable for analysis (say for streamability analysis or for schema-aware static type-checking) but which are not actually intended to affect the run-time interpreted tree. The discipline in all such cases is to copy the part of the tree you want to work on, rather than making changes in-situ.

For some non-local rewrites, such as loop-lifting optimizations, the best strategy seems to be to ignore the parent pointers until the rewrite is finished, and then restore them during a top-down tree-walk.

The fact that we now have parent pointers makes context-dependent optimizations much easier. Checking, for example, whether  a variable reference occurs within a loop (a "higher-order expression" as the XSLT 3.0 spec calls it) is now much easier: it can be done by searching upwards from the variable reference rather than retaining context information in an expression visitor as you walk downwards. Similarly, if there is a need to replace one expression by another (a variable reference by a literal constant, say), the fact that the variable reference knows its own parent makes the substitution much easier.
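
As a rough illustration of the upward search, here is a sketch using a deliberately simplified node interface; the names are invented for this post and are not Saxon's real Expression API.

// Deliberately simplified: just enough structure to show the upward walk.
interface ExprNode {
    ExprNode getParent();
    boolean isLoopingConstruct();   // e.g. a for-expression, a predicate, the rhs of "/"
}

final class LoopCheck {
    // Is the reference potentially evaluated more than once per evaluation of its binding?
    // Walk up from the reference until we reach the binding expression.
    static boolean isHigherOrder(ExprNode reference, ExprNode binding) {
        for (ExprNode e = reference.getParent(); e != null && e != binding; e = e.getParent()) {
            if (e.isLoopingConstruct()) {
                return true;
            }
        }
        return false;
    }
}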

So although the journey has had a few bumps, I'm reasonably confident that we will see long-term benefits.




Lazy Evaluation - Saxon diaries


We've seen some VM dumps recently that showed evidence of contention problems when multiple threads (created, for example, using <xsl:for-each> with the saxon:threads attribute) were attempting lazy evaluation of the same local variable. So I've been looking at the lazy evaluation code in Saxon to try and understand all the permutations of how it works. A blog posting is a good way to try and capture that understanding before I forget it all again. But I won't go into the extra complexities of parallel execution just yet: I'll come back to that at the end.


Lazy evaluation applies when a variable binding, for example "let $v := //x[@y=3]" isn't evaluated immediately when the variable declaration is encountered, but only when the variable is actually referenced. This is possible in functional languages because evaluating an expression has no side-effects, so it doesn't matter when (or how often) it is done. In some functional languages such as Scheme, lazy evaluation happens only if you explicitly request it. In others, such as Haskell, lazy evaluation is mandated by the language specification (which means that a variable can hold an infinite sequence, so long as you don't try to process its entire value). In XSLT and XQuery, lazy evaluation is entirely at the discretion of the compiler, and in this post I shall try to summarize how Saxon makes use of this freedom.


Internally, when a local variable is evaluated lazily, Saxon does not put the variable's value in the relevant slot on the stack; instead it puts a data structure there that contains all the information needed to evaluate the variable: that is, the expression itself, and any part of the evaluation context on which it depends. In Saxon this data structure is called a Closure. The terminology isn't quite right, because it's not quite the same thing as the closure of an inline function, but the concepts are closely related: in some languages, lazy evaluation is implemented by storing, as the value of the variable, not the variable's actual value, but a function which delivers that value when invoked, and the data needed by this function to achieve that task is correctly called a closure. (If higher-order functions had been available in Saxon a few years earlier, we might well have implemented lazy evaluation this way.)
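
The function-plus-saved-data idea can be shown in a few lines of Java. This is a generic sketch of the concept, not Saxon's Closure class:

import java.util.function.Supplier;

// Generic sketch: store a function instead of a value; evaluate on first use
// and remember the result. Thread-safety is deliberately ignored here.
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> expression;   // the deferred computation plus its captured context
    private T value;
    private boolean evaluated = false;

    Lazy(Supplier<T> expression) {
        this.expression = expression;
    }

    public T get() {
        if (!evaluated) {
            value = expression.get();
            evaluated = true;
            expression = null;        // let the captured context be garbage-collected
        }
        return value;
    }
}

Saxon's Closure plays essentially this role, except that it can also deliver the value incrementally, item by item, as described below.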


We can distinguish two levels of lazy evaluation. We might use the term "deferred evaluation" to indicate that a variable is not evaluated until it is first referenced, and "incremental evaluation" to indicate that when it is referenced, it is only evaluated to the extent necessary. For example, if the first reference is the function call head($v), only the first item in the sequence $v will be evaluated; remaining items will only be evaluated if a subsequent reference to the variable requires them.


Lazy evaluation can apply to global variables, local variables, parameters of templates and functions, and return values from templates and functions. Saxon handles each case slightly differently.


We should mention some static optimizations which are not directly related to lazy evaluation, but are often confused with it. First, a variable that is never referenced is eliminated at compile-time, so its initializing expression is never evaluated at all. Secondly, a variable that is only referenced once, and where the reference is not in any kind of loop, is inlined: that is, the variable reference is replaced by the expression used to initialize the variable, and the variable itself is then eliminated. So when someone writes "let $x := /a/b/c return $x[d=3]", Saxon turns this into the expression "(/a/b/c)[d=3]". (Achieving this of course requires careful attention to the static and dynamic context, but we won't go into the details here.)


Another static optimization that interacts with variable evaluation is loop-lifting. If an expression within a looping construct (for example the content of xsl:for-each, or of a predicate, or the right-hand-side of the "/" operator) will have the same value for every iteration of the loop, then a new local variable bound to this expression is created outside the loop, and the original expression is replaced by a reference to the variable. In this situation we need to take care that the expression is not evaluated unless the loop is executed at least once (both to avoid wasted evaluation cost, and to give the right behaviour in the event that evaluating the expression fails with a dynamic error.) So lazy evaluation of such a variable becomes mandatory.


The combined effect of these static optimizations, together with lazy evaluation, is that the order of evaluation of expressions can be quite unintuitive. To enable users to understand what is going on when debugging, it is therefore normal for some of these rewrites to be suppressed if debugging or tracing are enabled.


For global variables, Saxon uses deferred evaluation but not incremental evaluation. A global variable is not evaluated until it is first referenced, but at that point it is completely evaluated, and the sequence representing its value is held in memory in its entirety.


For local variables, evaluation is generally both deferred and incremental. However, the rules are quite complex.


  • If the static type shows that the value will be a singleton, then it will be evaluated eagerly. [It's not at all clear that this rule makes sense. Certainly, incremental evaluation makes no sense for singletons. But deferred evaluation could still be very useful, for example if the evaluation is expensive and the variable is only referenced within a branch of a conditional, so the value is not always needed.]

  • Eager evaluation is used when the binding expression is very simple: in particular when it is a literal or a reference to another variable.

  • Eager evaluation is used for binding expressions that depend on position() or last(), to avoid the complexities of saving these values in the Closure.

  • There are some optimizations which take precedence over lazy evaluation. For example, if there are variable references using predicates, such as $v[@x=3], then the variable will not only be evaluated eagerly, but will also be indexed on the value of the attribute @x. Another example: if a variable is initialized to an expression such as ($v, x) - that is, a sequence that appends an item to another variable - then we use a "shared append expression", a data structure that allows a sequence to be constructed by appending to an existing sequence without copying it; this pattern is common in algorithms that use head-tail recursion (a sketch of the idea follows this list).

  • Lazy evaluation (and inlining) need special care if the variable is declared outside a try/catch block, but is referenced within it. In such a case a dynamic error that occurs while evaluating the initialization expression must not be caught by the try/catch; it is logically outside its scope. (Writing this has made me realise that this is not yet implemented in Saxon; I have written a test case and it currently fails.)
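
By way of illustration, here is a minimal sketch of the shared-append idea. AppendedSequence and its methods are invented names; Saxon's real implementation differs in detail.

import java.util.ArrayList;
import java.util.List;

// Appending creates a small new node that points at the existing sequence,
// so nothing is copied until the whole sequence is finally needed.
final class AppendedSequence<T> {
    private final AppendedSequence<T> base;   // null only for the empty sequence
    private final T last;

    private AppendedSequence(AppendedSequence<T> base, T last) {
        this.base = base;
        this.last = last;
    }

    static <T> AppendedSequence<T> empty() {
        return new AppendedSequence<>(null, null);
    }

    AppendedSequence<T> append(T item) {
        return new AppendedSequence<>(this, item);
    }

    // Materialize only when the whole sequence is required.
    List<T> toList() {
        if (base == null) {
            return new ArrayList<>();
        }
        List<T> out = base.toList();
        out.add(last);
        return out;
    }
}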


If none of these special circumstances apply, lazy evaluation is chosen. There is one more choice to be made: between a Closure and a MemoClosure. The common case is a MemoClosure, and in this case, as the variable is incrementally evaluated, the value is saved for use when evaluating subsequent variable references. A (non-memo) closure is used when it is known that the value will only be needed once. Because most such cases have been handled by variable inlining, the main case where a non-memo closure is used is for the return value of a function. Functions, like variables, are lazily evaluated, so that the value returned to the caller is not actually a sequence in memory, but a closure containing all the information needed to materialize the sequence. (Like most rules in this story, there is an important exception: tail-call optimization, where the last thing a function does is to call itself, takes precedence over lazy evaluation).


So let's look more closely at the MemoClosure. A MemoClosure is a data structure that holds the following information:


  • The Expression itself (a pointer to a node in the expression tree). The Expression object also holds any information from the static context that is needed during evaluation, for example namespace bindings.

  • A copy of the dynamic context at the point where the variable is bound. This includes the context item, and values of any local variables referenced by the expression.

  • The current evaluation state: one of UNREAD (no access to the variable has yet been made), MAYBE_MORE (some items in the value of the variable are available, but there may be more to come), ALL_READ (the value of the variable is fully available), BUSY (the variable is being evaluated), or EMPTY (special case of ALL_READ in which the value is known to be an empty sequence).

  • An InputIterator: an iterator over the results of the expression, relevant when evaluation has started but has not finished

  • A reservoir: a list containing the items delivered by the InputIterator so far.
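
A bare-bones skeleton of such a structure, with simplified field types, just to fix the vocabulary (MemoClosureSketch is an invented name, not Saxon's actual class; the real class also knows how to save and restore the relevant parts of the dynamic context):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Skeleton only: fields corresponding to the items in the list above.
final class MemoClosureSketch<T> {
    enum State { UNREAD, MAYBE_MORE, ALL_READ, BUSY, EMPTY }

    Object expression;                            // the node in the expression tree
    Object savedContext;                          // copy of the dynamic context at the binding
    State state = State.UNREAD;                   // current evaluation state
    Iterator<T> inputIterator;                    // created when evaluation starts
    final List<T> reservoir = new ArrayList<>();  // items delivered so far
}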


Many variable references, for example count($v) or index-of($v, 'z'), result in the variable being evaluated in full. If this is the first reference to the variable, that is, if the state is UNREAD, the logic is essentially


// (in outline; a Saxon SequenceIterator returns null when it is exhausted)
inputIterator = expression.iterate(savedContext);
Item item;
while ((item = inputIterator.next()) != null) {
    reservoir.add(item);
}
state = ALL_READ;
return new SequenceExtent(reservoir);


(However, Saxon doesn't optimize this case, and it occurs to me on writing this that it could.)


Other variable references, such as head($v), or $v[1], or subsequence($v, 1, 5), require only partial evaluation of the expression. In such cases Saxon creates and returns a ProgressiveIterator, and the requesting expression reads as many items from the ProgressiveIterator as it needs. Requests to get items from the ProgressiveIterator fetch items from the reservoir to the extent they are available; on exhaustion of the reservoir, they then attempt to fetch items from the InputIterator until either enough items are available, or the InputIterator is exhausted. Items delivered from the InputIterator are copied to the reservoir as they are found.
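
In outline, the delivery logic looks something like the following sketch. ProgressiveReader and its fields are invented names; the real ProgressiveIterator also manages the state transitions and the thread-safety issues discussed below.

import java.util.Iterator;
import java.util.List;

// Each reader gets its own cursor, but all readers share the reservoir and the input.
final class ProgressiveReader<T> {
    private final List<T> reservoir;      // shared: items already materialized
    private final Iterator<T> input;      // shared: the underlying evaluation
    private int position = 0;             // private to this reader

    ProgressiveReader(List<T> reservoir, Iterator<T> input) {
        this.reservoir = reservoir;
        this.input = input;
    }

    // Returns the next item, or null when the underlying sequence is exhausted.
    T next() {
        if (position < reservoir.size()) {
            return reservoir.get(position++);   // already in the reservoir
        }
        if (input.hasNext()) {
            T item = input.next();
            reservoir.add(item);                // make it available to later readers
            position++;
            return item;
        }
        return null;
    }
}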


So far so good. This has all been in place for years, and works well. We have no evidence that it is in any way optimal, but it has been carefully tweaked over the years to deal with particular cases where it was performing badly. What has changed recently is that local variables can be referenced from multiple threads. There are two particular cases where this happens today: when xsl:result-document is used in Saxon-EE, it executes by default asynchronously in a new thread; and when the extension attribute saxon:threads is used on xsl:for-each, the items selected by the xsl:for-each are processed in parallel rather than sequentially.


The effect of this is that the MemoClosure object needs to be thread-safe: multiple requests to access the variable can come simultaneously from different threads. To achieve this a number of methods are synchronized. One of these is the next() method of the ProgressiveIterator: if two threads reference the variable at the same time, each gets its own ProgressiveIterator, and the next() method on one of these iterators is forced to wait until the other has finished.


This works, but it is risky. Brian Goetz in his excellent book Java Concurrency in Practice recommends that a method should not be synchronized unless (a) its execution time is short, and (b) as the author of the method, you know exactly what code will execute while it is active. In this case neither condition is satisfied. The next() method of ProgressiveIterator calls the next() method of the InputIterator, and this may perform expensive computation, for example retrieving and parsing a document using the doc() function. Further, we have no way of analyzing exactly what code is executed: in the worst case, it may include user-written code (for example, an extension function or a URIResolver). The mechanism can't deadlock with itself (because there cannot be a cycle of variable references) but it is practically impossible to prove that it can't deadlock with other subsystems that use synchronization, and in the face of maliciously-written user code, it's probably safe to assume that deadlock can occur. We haven't seen deadlock happen in practice, but it's unsatisfactory that we can't prove its impossibility.


So what should we do about it?


I think the answer is, add yet another exception to the list of cases where lazy evaluation is used: specifically, don't use it for a variable that can be referenced from a different thread. I'm pretty sure it's possible to detect such cases statically, and they won't be very common. In such cases, use eager evaluation instead.


We must be careful not to do this in the case of a loop-lifted variable, where the correct error semantics depend on lazy evaluation. So another tweak to the rules is, don't loop-lift code out of a multithreaded execution block.


This investigation also suggests a few other refinements we might make.


  • It seems worth optimizing for the case where the entire value of a variable is needed, since this case is so common. The problem is, it's not easy to detect this case: a calling expression such as count($v) will ask for an iterator over the variable value, without giving any indication that it intends to read the iterator to completion.

  • We need to reassess the rule that singleton local variables are evaluated eagerly.

  • We currently avoid using lazy evaluation for expressions with certain dependencies on the dynamic context (for example, position() and last()). But in the course of implementing higher-order functions, we have acquired the capability to hold such values in a saved copy of the dynamic context, so it may now be possible to relax this restriction.

  • We could look at a complete redesign that takes advantage of higher-order functions and their closures. This might be much simpler than the current design; but it would discard the benefits of years of fine-tuning of the current design.

  • I'm not convinced that it makes sense for a MemoClosure to defer creation of the InputIterator until the first request for the variable value. It would be a lot simpler to call inputIterator = Expression.iterate(context) at the point of variable declaration; in most cases the implementation will defer evaluation to the extent that this makes sense, and this approach saves the cost of the elaborate code to save the necessary parts of the dynamic context. It's worth trying the other approach and making some performance measurements.



A redesign of the NamePool - Saxon diaries

As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.

Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.

One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scaleable as parallelism increases.

Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access, it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.

Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? The answer is that we can't completely, but we can get close enough.

The logic is like this:

private final ConcurrentHashMap<StructuredQName, Integer> qNameToInteger = new ConcurrentHashMap<StructuredQName, Integer>(1000);
private final ConcurrentHashMap<Integer, StructuredQName> integerToQName = new ConcurrentHashMap<Integer, StructuredQName>(1000);
private final AtomicInteger unique = new AtomicInteger();

// Allocate fingerprint to QName
Integer existing = qNameToInteger.get(qName);
if (existing != null) {
    // Fast path: this name has been seen before
    return existing;
}
Integer next = unique.getAndIncrement();
existing = qNameToInteger.putIfAbsent(qName, next);
if (existing == null) {
    // We won the race: record the reverse mapping as well
    integerToQName.put(next, qName);
    return next;
} else {
    // Another thread got there first: use its fingerprint (our 'next' is simply wasted)
    return existing;
}
Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.
There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.
The external interface to the NamePool doesn't change at all by this redesign. We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.
My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles each of which performed an addition (from a pool of 10,000 distinct names so 90% of the additions were already present) followed by 20 retrievals.
I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).
But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision. By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).
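
The change amounts to hashing on the cheap, discriminating part of the key only, roughly like this (QNameKey is an invented stand-in, not Saxon's StructuredQName):

// Hash on the local part only; equals() still compares both fields, so names that
// share a local part but differ in URI simply collide in the hash table.
final class QNameKey {
    final String uri;
    final String localPart;

    QNameKey(String uri, String localPart) {
        this.uri = uri;
        this.localPart = localPart;
    }

    @Override
    public int hashCode() {
        return localPart.hashCode();   // cheap: ignores the (often long) URI
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof QNameKey
                && ((QNameKey) o).localPart.equals(localPart)
                && ((QNameKey) o).uri.equals(uri);
    }
}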
So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.