Streaming Mode on large documents

By Michael Kay on July 04, 2006 at 02:18p.m.

Coincidentally, the day after I got the XStream benchmark running using the Saxon-SA streaming mode for large documents (see previous post), a Saxon-SA customer sent me a problem: they needed to transform a 450 MB file and wanted to know whether this facility would crack it. The answer, sadly, was no: the optimization only works if the transformation can be broken up into a sequence of independent transformations on subtrees of the document. In this particular use case the subtrees to be transformed carried context with them (for example, they were grouped hierarchically), so the method doesn't work.

As always, though, some new customer requirements proved a useful stimulus to improving the product. I haven't cracked this use case yet, but I have made some useful improvements.

Firstly, you can now specify a filter on the expression that defines the subtrees to be transformed, for example:

<xsl:copy-of select="doc('huge.xml')//item[@price gt 50.00]"/>

The only restrictions on the filter are that it mustn't be positional, and it can't look outside the subtree being copied.
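
To make the two restrictions concrete, here is a rough sketch of one filter that should qualify and two that should not. The element and attribute names (category, region) and the template names are invented for illustration, and the comparison `@price gt 50.00` assumes that price is typed as a number, for example by schema validation. The sketch does not show how streaming is switched on, which depends on the Saxon-SA release.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Should qualify: the filter tests an attribute and a child element,
       both of which lie inside the item subtree being copied. -->
  <xsl:template name="within-subtree">
    <selected>
      <xsl:copy-of select="doc('huge.xml')//item[@price gt 50.00 and category = 'books']"/>
    </selected>
  </xsl:template>

  <!-- Should not qualify: the filter is positional. -->
  <xsl:template name="positional">
    <selected>
      <xsl:copy-of select="doc('huge.xml')//item[position() le 10]"/>
    </selected>
  </xsl:template>

  <!-- Should not qualify: the filter reads an attribute of the parent,
       which is outside the subtree being copied. -->
  <xsl:template name="outside-subtree">
    <selected>
      <xsl:copy-of select="doc('huge.xml')//item[../@region = 'EU']"/>
    </selected>
  </xsl:template>

</xsl:stylesheet>
```

All three templates are valid XSLT 2.0 and give the same results with or without the optimization; the restrictions only decide whether the source document can be processed one subtree at a time.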

Secondly, union expressions now work (this is a bug fix).

Finally, the construct is now decoupled from the push-pull multithreading implementation. If all you want to do is write selected parts of the large document to the serializer, or to a temporary tree held in a variable, then there is no push-pull conflict, and the whole thing can operate in push mode, filtering and all. On the other hand, if your stylesheet iterates over the sequence returned by the xsl:copy-of, then the multithreaded implementation is still used. This slows things down by about a factor of two, but that's often worth it for the memory savings.
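
The three situations described above might look something like the following. This is only a sketch: the template names, the wrapper elements, and the variable name are made up, `@price gt 50.00` again assumes a numerically typed price, and whether a given Saxon-SA release actually recognizes each pattern for streaming is an assumption rather than something the stylesheet itself can guarantee.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Case 1: selected subtrees go straight to the serializer.
       No push-pull conflict, so this can run entirely in push mode. -->
  <xsl:template name="to-output">
    <expensive-items>
      <xsl:copy-of select="doc('huge.xml')//item[@price gt 50.00]"/>
    </expensive-items>
  </xsl:template>

  <!-- Case 2: selected subtrees are written to a temporary tree held
       in a variable. Building the tree is also a push operation. -->
  <xsl:template name="to-variable">
    <xsl:variable name="expensive">
      <xsl:copy-of select="doc('huge.xml')//item[@price gt 50.00]"/>
    </xsl:variable>
    <count><xsl:value-of select="count($expensive/item)"/></count>
  </xsl:template>

  <!-- Case 3: the stylesheet iterates over the sequence returned by
       xsl:copy-of. The consumer pulls items while the parser pushes
       events, which is where the multithreaded implementation comes in. -->
  <xsl:template name="iterate">
    <xsl:variable name="expensive" as="element(item)*">
      <xsl:copy-of select="doc('huge.xml')//item[@price gt 50.00]"/>
    </xsl:variable>
    <report>
      <xsl:for-each select="$expensive">
        <line price="{@price}"/>
      </xsl:for-each>
    </report>
  </xsl:template>

</xsl:stylesheet>
```

In the first two cases only the filtered items are ever materialized, so memory use depends on how selective the filter is rather than on the size of huge.xml.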