All nodes untyped

By Michael Kay on May 14, 2009 at 02:18p.m.

Users sometimes imagine that just by running an application under Saxon-SA instead of Saxon-B, they will automatically get a performance boost. Sadly, this isn't the case. Sometimes Saxon-SA's more powerful optimizer will give a dramatic benefit, sometimes it will give none at all. In fact, if you move a workload to Saxon-SA without change, you can sometimes see a performance regression. The reason is that Saxon-B can assume all nodes are untyped, whereas Saxon-SA can't make this assumption.

To see why it makes a difference, consider the expression item[@code='A1234']. If the compiler knows that the element and attribute will both be untyped (that is, not schema-validated), then it knows that it can simply generate code to compare the string-value of the attribute with the literal 'A1234'. Equally, if there is a schema that declares the attribute to be a string, the same simple code can be generated. But if you run Saxon-SA without declaring any schema type information, the compiler doesn't know what it will find in the @code attribute. It might, for example, be typed as a list of strings, in which case it needs to compare each of those strings individually. Or it might be typed as a list of integers or dates, in which case it needs to be prepared to raise a type error at run-time.
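The difference between the two code paths can be sketched in plain Java. This is only an illustrative model, not Saxon's actual generated code: the class and method names (AttributeComparisonSketch, compareUntyped, compareTyped) are invented for this example, the typed value of the attribute is modelled as a simple Java list, and an IllegalStateException stands in for the XPath type error.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two strategies for evaluating @code = 'A1234'.
// None of these names come from Saxon; they exist only for illustration.
class AttributeComparisonSketch {

    // Fast path: safe only when the attribute is known to be untyped
    // (or declared as a single string), so one string comparison suffices.
    static boolean compareUntyped(String stringValue, String literal) {
        return stringValue.equals(literal);
    }

    // General path: the typed value may be a list of items of any atomic type.
    // Each item must be inspected, and a non-string item is a type error.
    static boolean compareTyped(List<?> typedValue, String literal) {
        for (Object item : typedValue) {
            if (!(item instanceof String)) {
                // Stand-in for the run-time type error described above.
                throw new IllegalStateException(
                    "Type error: cannot compare "
                    + item.getClass().getSimpleName() + " with a string");
            }
            if (item.equals(literal)) {
                return true; // the comparison is true if any item matches
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(compareUntyped("A1234", "A1234"));
        System.out.println(compareTyped(Arrays.asList("B1", "A1234"), "A1234"));
        try {
            compareTyped(Arrays.asList(1234, 5678), "A1234");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the general path carries a loop and a type check on every evaluation, which is the overhead the compiler can drop once it knows the nodes are untyped.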

In an ideal world, the answer to this would be: use a schema, and declare the types of all your variables and function parameters and results, so that the compiler has as much information as possible. If you do this, Saxon-SA will almost invariably outperform Saxon-B on common tasks like evaluating predicates. The problem is, the world isn't ideal, and people want instant gratification. They want to just switch to the commercial version of the software and see a performance improvement with no effort on their part. Which is not really that unreasonable an expectation.

In Saxon 9.1, the Configuration object has a setting setAllNodesUntyped(true), which, when called, causes the compiler to generate code in the same way as Saxon-B, that is, on the assumption that at run-time there will be no schema validation and all nodes will be untyped. When you run Saxon-SA from the command line, this is called automatically if you don't select the -sa option. But that's not a very happy solution, because if you don't select the -sa option then you don't get SA's souped-up optimizer either. If you're calling from Java, however, you can create a SchemaAwareConfiguration to get the benefits of the SA optimizer, and then call setAllNodesUntyped(true) to say that it should generate code to handle untyped input documents.

In 9.2 I'm trying to change this so the default options are the ones that give the best performance. There's a factory method Configuration.newConfiguration() which gives you the best Configuration available (if you install Saxon-EE, the new name for Saxon-SA, it will give you an EnterpriseConfiguration, which is SchemaAwareConfiguration under a new name). The option to say that all nodes are untyped is no longer at the Configuration level, but at the level of an individual XSLT or XQuery compilation, and it is now the default. If your query or stylesheet imports a schema, the setting automatically changes; the only time you really need to be aware of the switch is for the unusual case where you don't import a schema, but still want to handle schema-validated input.

On the whole this is working well, but there are still a few glitches to be ironed out. One of them is that the minimum conformance level for XQuery permits the use of "construction mode preserve", which causes element nodes to be annotated as xs:anyType rather than xs:untyped. Saxon is currently disallowing this option if the "all nodes untyped" switch is set, but it doesn't seem right to reject a conformant query under default configuration settings. Since in the absence of validation there is almost no operational difference between xs:anyType and xs:untyped, I probably need to find a way of allowing xs:anyType nodes to appear even when the all-nodes-untyped option is set.