At Saxonica, we have for a long time now used a tailor-made java application to create and issue licenses for all commercial products we develop. There is no real database at the back-end, but just a local XML file with customer details and copies of the licenses created and issued. For a one man company this poses no real problem, but inevitably as the company has expanded over the last two years this has been a major concern.

Early last year Mike Kay presented me with the task to create a new saxon-license application with the following requirements:

  • Accessible  to all employees, preferable web-based
  • Core Java tool should remain intact 
  • Centralised store, must be XML-based
  • Secure application

From the outset we thought that for such a tool which heavily relies on XML and XSLT at its core, that  requirements would be best met using XSLTForms and Servlex to develop the tool.

In this blog post I would like to share my own experiences in the development of the saxon-license webapp using Servlex and XSLTForms. In the discussion I include how we stitched on our existing back-end core Java tool and challenges faced with encoding. Specific details of the features and functions are not that important here, only the engineering process is of interest.

On the client-side we write XForms [1] documents, which are manipulated by XSLTForms [2] (created by Alain Couthures)  to render in the browsers. XSLTForms is an open source client-side implementation, not a plug-in or install, that works with all major browsers.

On the server-side we integrate the core Saxon-license tool in a Servlex webapp [3] as Saxon extension functions called from within XSL. The Servlex is an open-source implementation of the EXPath webapp framework [4] based on Saxon and Calabash as its XSLT, XQuery and XProc processors.  Servlex provides a way to write web applications directly in XSLT. It is developed as a Java EE application requiring Servlet technology, sitting on tomcat for binding to HTTP server-side.

Saxon License-tool functionality

The server-side Servlex works as a dispatching and routing mechanism to components (implementation as XSLT stylesheets), applying a configuration-based mapping between the request URI and the component used to process that URI. The container communicates with the components by means of an XML representation of the HTTP request, and receives in turn XML data with HTML at the request body with XForms content and XSLTForms references to render the page.  The representation of the HTTP response is sent back to the client. There are buttons on the forms, which if pressed trigger the action HTTP PUT request; made through the client-side XSLTForms. These requests are handled by Servlex.

There are 7=5 main XSLT functions described below, which map the URIs to generate the various XForms to tunnel the instance data between the XForms. These functions all make calls to the core Saxon-license tool written in Java, made available as a Saxon extensions calls from the XSLT:


  1. fnRunMainForm: A request to serve the main form is made with the following URI pattern:

    http://192.168.0.2:8080/app/license-tool/main

    License requests are usually made through the main saxonica website either for evaluation or paid order (See: http://www.saxonica.com/download/download.xml and http://www.saxonica.com/purchase/purchase.xml, respectively), these orders are receives as an email, which are then copied and pasted on the main form.  This data is sent in the form of a XForms instance in a web request, picked up by servlex.

  2. fnManualEntry: Manual Entry form for manual creation of the customer details to create a license. A request is made to servlex with the following URI pattern:http://192.168.0.2:8080/app/license-tool/manualEntry

  3. fnFetchRecord: Existing licenses created we can retrieve and re-issue. A request is made to Servlex with the following URI pattern. We observe the parameter after the ? Is the license number to fetch:
    http://192.168.0.2:8080/app/license-tool/fetchRecord?Select=X002110

  4. fnReport: This function generates an HTML page containing all license created or such the last 20.
    http://192.168.0.2:8080/app/license-tool/report

  5. fnEditParseRequest: Manual Entry form: The manual form with the client data populated. The order request from the main form is parsed and returned as a Xforms instance data which is used to generate the form on the server. A request is made to Servlex with the following URI pattern:
    http://192.168.0.2:8080/app/license-tool/editParseRequest

     
Securing access to the saxon-license webapp is achieved through apache2 configuration.

Encoding problem

A long-standing problem we faced in this application was the handling of non-ASCII characters. We raised this issue with Alain and Florent the creators of XSLTForms and Servlex, respectively, to get to the bottom of this problem.

Basically, if the user enters data on a form, we're sending it back to the server in a url-encoded POST message, and it's emerging from Servlex in the form of XML presented as a string, and if there are non-ASCII characters then they are miscoded. In the form we set the submission method attribute to 'xml-urlencoded-post' to guarantee that the next page will fully replace the current one: XMLHttpRequest is not used in this case.

We were seeing the typical pattern that you get when the input characters are encoded as a UTF-8 byte sequence and the byte sequence is then decoded to characters by someone who believes it to be 8859-1. We were not able to work out where the incorrect decoding was happening. We originally circumvented the problem by reversing the error: we converted the string back to bytes treating each char as a byte, and then decoded the bytes as UTF-8.

A feature of XSLTForms is the profiler (enabled by pressing F1 or setting debug='yes' in the xsltforms-options process instruction). The profiler allows the inspection of the instance data. Another mechanism is to inspect the requests sent by the browser with the network profiler of a debugger.

We established that on the client side, there is an HTML Form Element that gets built, and just before the submit() method gets called on this object, the data appears to be OK. But when we look at the Tomcat log of the POST request, it's wrong. Somewhere between the form.submit() on the client and the logging of the message on the server, it's getting corrupted. We can't actually see where the encoding and decoding is happening between these two points.

To tackle this problem Florent provided a development version of Servlex, which added logging of the octets as they are read from the binary stream (the logger org.expath.servlex must be set to trace, which should be the default in that version).  In addition to logging the raw headers, as they are read by Tomcat.

With this new version of Servlex in place I inputted the following data on the main form. We observe the euro symbol at the end of my first name 'O'Neil' is a non-ASCII character which needs to be preserved:


First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address: oneil@saxonica.com

Phone:

Agree to Terms: checked

After submitting this data to the URI pattern: .../app/license-tool/editParseRequest  we see below the the log data reported by tomcat. What is interesting is the line 'DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded'.  Also at this stage the input to the receiving form has been corrupted to 'O'Neilâ&#x82' which should be 'O'Neil€' :

DEBUG [2013-03-04 18:06:34,279]: Request - servlet  : parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - path     : /parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - method   : POST
DEBUG [2013-03-04 18:06:34,280]: Request - uri      : http://localhost:8080/app/license-tool/parseRequest
DEBUG [2013-03-04 18:06:34,280]: Request - authority: http://localhost:8080
DEBUG [2013-03-04 18:06:34,280]: Request - ctxt_root: /app/license-tool
DEBUG [2013-03-04 18:06:34,280]: Request - param    : postdata / <Document><Data>First Name: O'Neil€

Last Name: Delpratt

Company: Saxonica

Country: United Kingdom

Email Address: oneil@saxonica.com

Phone:

Agree to Terms: checked </Data><Options><Confirmed>false</Confirmed><Create>false</Create><Send>false</Send><Generate>false</Generate><Existing/></Options></Document>
DEBUG [2013-03-04 18:06:34,281]: Request - header   : host / localhost:8080
DEBUG [2013-03-04 18:06:34,281]: Request - header   : user-agent / Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept / text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-language / en-gb,en;q=0.5
DEBUG [2013-03-04 18:06:34,281]: Request - header   : accept-encoding / gzip, deflate
DEBUG [2013-03-04 18:06:34,281]: Request - header   : referer / http://localhost:8080/app/license-tool/main
DEBUG [2013-03-04 18:06:34,281]: Request - header   : connection / keep-alive
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-type / application/x-www-form-urlencoded
DEBUG [2013-03-04 18:06:34,281]: Request - header   : content-length / 482
DEBUG [2013-03-04 18:06:34,281]: Raw body content type: application/x-www-form-urlencoded
TRACE [2013-03-04 18:06:34,281]: TraceInputStream(org.apache.catalina.connector.CoyoteInputStream@771eeb)
TRACE [2013-03-04 18:06:34,282]: read([B@1a70476): -1

Florent made the following observations:

  1. The content-type is application/x-www-form-urlencoded, which should conform to http://www.w3.org/TR/xforms/#serialize-urlencode, but seems not to: the XML seems to be passed as is, instead of been split into individual elements and their string values. But I am not an expert on XForms so I might be wrong.

  2. Still about application/x-www-form-urlencoded and the same section, it says that the non-ASCII characters are replaced based on the octets of their UTF-8 representation, so the encoding should not be used here.  This content-type does not carry any charset parameter anyway, if I am right.

  3. Again about application/x-www-form-urlencoded, it is actually handled by Java EE as parameters, instead of simply giving the raw POST entity content.  I am not sure exactly how it works WRT the encoding.

Alain provided the following example to test the assumptions made by Florent.

Encoding.xhtml:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
    <head>
        <title>Encoding Test</title>
        <xf:model>
            <xf:instance>
                <data xmlns=""/>
            </xf:instance>
            <xf:submission id="s01" method="xml-urlencoded-post" replace="all" action="http://www.agencexml.com/xsltforms/dump.php">
                <xf:message level="modeless" ev:event="xforms-submit-error">Submit error.</xf:message>
            </xf:submission>
        </xf:model>
    </head>
    <body>
        <xf:input ref=".">
            <xf:label>Input:</xf:label>
        </xf:input>
        <xf:submit>
            <xf:label>Save</xf:label>
        </xf:submit>
    </body>
</html>

dump.php:
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>HTTP XML POST Dump</title>
    </head>
    <body>
        <h1>HTTP XML POST Dump</h1>
        <h2>Raw Data :</h2>
        <?php
        $body = file_get_contents("php://input");
        echo strlen($body);
        echo " bytes: <br/>";
        echo "<pre>$body</pre>";
        if(substr($body,0,9) == "postdata=") {
            $body = urldecode(substr($body,strpos($body,"=")+1));
        }
        $xml = new DOMDocument();
        $xml->loadXML($body);
        $xslt = new XSLTProcessor();
        $xsl = new DOMDocument();
        $indent = "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\"><xsl:output method=\"xml\" indent=\"yes\" encoding=\"UTF-8\"/><xsl:template match=\"@*|node()\"><xsl:copy-of select=\".\"/></xsl:template></xsl:stylesheet>";
        $xsl->loadXML($indent);
        $xslt->importStylesheet($xsl);
        $result = $xslt->transformToXml($xml);
        $result = substr($result, strpos($result,"?>")+3);
        echo "<h2>Indented XML :</h2><pre>".htmlspecialchars($result, ENT_QUOTES)."</pre>";
        ?>
    </body>
</html>

When submitting '€', I get this:

HTTP XML POST Dump
Raw Data :
41 bytes:
postdata=%3Cdata%3E%E2%82%AC%3C%2Fdata%3E
Indented XML :
<data>€</data>

and with Firebug, I can see following, which is correct:

saxon-license img3




Florent states:

What should be in the content of the HTTP request is %E2%82%AC to represent the Euro symbol as URL- encoded (because that represents the 3 octets of € in UTF-8).
Because of the "automatic" handling of that Content-Type by Java EE, I am afraid the only way to know for sure what is on the wire is to actually look into it (using a packet sniffer, like Wireshark for instance).

At this stage it was important to check what packets are being sent. Used Wireshark. The following is a snippet of the reports from Wireshark, with the data format correct at this point.


HTTP    1207    POST /app/license-tool/parseRequest HTTP/1.1  (application/x-www-form-urlencoded)

[truncated] postdata=%3CDocument+xmlns%3D%22%22%3E%3CData%3EFirst+Name%3A+O%27Neil%E2%82%AC%0D%0A%0D%0ALast+Name%3A+Delpratt%0D%0A%0D%0ACompany%3A+Saxonica%0D%0A%0D%0ACountry%3A+United+Kingdom%0D%0A%0D%0AEmail+Address%3A+oneil%40saxonica.c ...

Florent discovered using Alain's test case that it was actually Tomcat itself interpreting the %xx encoding as Latin-1!  More infos at:

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

In summary, the message is the decoding was done using 8859-1 not UTF-8 as one would expect.

To overcome the problem Florent created a new config property for Servlex, which is named org.expath.servlex.default.charset, the value of which can be set to "UTF-8" in Tomcat's conf/catalina.properties. If set, it's value is used as the charset for requests without an explicit charset in Content-Type.


Thanks to Florent, Alain and Mike the encoding problem has now been resolved. The lesson learnt in all, is that tracking down encoding problems can still be very hard work.

References
[1] XForms. W3C. http://www.w3.org/MarkUp/Forms/
[2] XSLTForms. Alain Couthures. http://www.agencexml.com/xsltforms
[3] Servlex. Florent George. Gihub: https://github.com/fgeorges/servlex  Google Project: http://code.google.com/p/servlex/
[4] EXPath Webapp. http://expath.org/wiki/Webapp


XSLT in MicroXML? - Saxon diaries

| No Comments | No TrackBacks
Listening to Uche Ogbuji's talk at XML Prague this morning, it was great news to hear that namespaces are being thrown out of the window[1]. The complexity that namespaces add to the XML stack is vast, and the value they add is tiny. 

But it raises the question, can one represent an XSLT stylesheet as an XML document without using namespaces? Clearly there needs to be some way to distinguish XSLT instructions from literal result elements, and this seems to be one of the more justifiable uses of namespaces.

My immediate thought is, let the outermost element of an XSLT stylesheet be

<xxx.transform>

where xxx is any prefix you care to choose; and then let all XSLT instructions in the stylesheet use the same prefix "xxx". For most people "xsl" would be a suitable prefix, so you would write

<xsl.transform>

  <xsl.template match="/">
    <out>
      <xsl.copy-of select="."/>
    </out>
 </xsl.template>

</xsl.transform>

it takes a bit of getting used to typing "." instead of ":", but apart from that, it seems to work perfectly well. And you don't need xsl:namespace-alias any more to read or write XSLT stylesheets, which must be good: you simply use a different prefix for the output stylesheet.

Norm Walsh tweeted that XSLT was claiming ownership of all names ending in ".transform". But that's not the case. The name doesn't have global significance; it only has significance in the context of the input supplied to an XSLT processor. Names are local and contextual, that's the point. Arguably there is still more redundancy than we need - we know the document we are reading is a stylesheet, so why do we need a top-level element to confirm the fact? (That's not to say all redundancy is bad, just to make the point that it's redundant).

[1] Defenestration seems the thing to do when in Prague 

Further thoughts

It's not only the XSLT namespace and its use in XSLT instructions that needs to be addressed, of course.

Namespaces in the names of variables, templates, modes, keys, decimal formats and so on can safely be ignored. Hardly anyone uses them anyway.

Functions are a bit more awkward. XSLT has never seriously committed to the use of the "fn" namespace for system functions (it's always the default, so no prefix is ever needed), but it does require user-written functions to be in a namespace to distinguish them from system-supplied functions; and external function libraries also use namespaces in a similar way. I've always felt the right approach for resolving function names is a search list - it shouldn't be necessary to prefix each function name to identify its library, rather it should be possible to define a list of function libraries to be searched when looking for functions with a given local name. The MicroXML philosophy would suggest getting rid of URIs from function names entirely, but that's a big change; transitionally, a list of URIs to be searched plus the XSLT 3.0 EQName facility if you need to be more explicitly would probably meet the requirement.

Step back, however. What are we trying to achieve? If we want to eliminate namespaces entirely from the whole stack, then we need to think about a namespace-less XSD (shudder!) or a typeless XSLT (losing integers and dates would be a big step back). If our goals are less ambitious, then what are they? Being able to transform MicroXML documents doesn't require an XSLT change at all.

I think MicroXML is a first step along a road that is not well charted. While the goals are laudable, the route to a simpler cleaner world is not a straightforward one.

Keys and maps - Saxon diaries

| No Comments | No TrackBacks
I've felt for a while there ought to be some kind of connection between keys (in the xsl:key sense) and maps.

Keys are great, but their limitations are frustrating. There's no way of enumerating the entries in a key, there's no way of using them for grouping or sorting. If you're going to the effort of building an index, then one feels it ought to be possible to get a higher return on this investment.

I've been trying out a fairly simple way of achieving this.

Firstly, to the xsl:key declaration we add the option saxon:range-key="yes". The main effect of this is that we build the resulting index as a TreeMap rather than a HashMap, which means that the keys are sorted.

Then, we add a single extension function: saxon:key-map('keyname', doc, min, max). The effect of this function is to return a map. The keys in the map are the values of the "use" expression in the key declaration; the map contains all entries for this key relevant to the document supplied in the doc argument; and min and max are the minimum and maximum (inclusive) values of the key - either can be supplied as an empty sequence to indicate "no lower/upper limit". Perhaps for convenience we could also have a version of the function in which only the first two arguments are supplied.

So, for example, given a list of transactions with a date attribute, you would declare a key like this:

<xsl:key name="transactions-by-date" match="transaction" use="@date" saxon:range-key="yes"/>

and then the call

saxon:range-key('transactions-by-date', doc('transactions.xml'), '2012-12-01', '2012-12-31')

would return a map containing all the transactions for December. With the result bound to variable $map, map:keys() returns the dates of these transactions, and $map($x) returns the transactions for a specific date. It's a shame there isn't a nicer way to iterate over the contents of a map (we're still working on that one), but by presenting the result as a map we get access to all the map functionality without having to invent a lot of new functions just for this use case.

We're guaranteeing that for this particular kind of map, keys() will return the keys in sorted order. (Perhaps that's a facility that should also be available for other kinds of map? I don't think this design precludes that.)

Let me know what you think.
I would like to report on some Saxon performance measure on a Word ladder solution implemented in XSLT.

Firstly, some background information on the Word ladder problem. From Wikipedia, the free encyclopedia:


A word ladder (also known as a doublets, word-links, or Word golf) is a word game invented by Lewis Carroll. A word ladder puzzle begins with two given words, and to solve the puzzle one must find the shortest chain of other words to link the two given words, in which chain every two adjacent words (that is, words in successive steps) differ by exactly by one letter.

XSLT interest in this problem was first started (to the best of my knowledge) by Dimitre Novatchev through the mulberry mailing list, who provides a 20 step guide to create a stylesheet in his blog to solve the Word ladder problem (FindChainOfWordsHamming.xsl). Following the post on the list, there has been some interest; another solution to this problem was given by Wolfgang Laun (please see thread, file: FindChainOfWordsHamming2.xsl).

Experimental Evaluation

Our interest resides in the Saxon performances only. I was curious and surprised by the results reported by Dimitre. The question I had is why Dimitre's stylesheet was much slower than Wolfgang's stylesheet in Saxon and faster in another XSLT processor: there must be some optimization step we were not making. I was motivated to understand were the bottle necks were and how we could improve the performance in Saxon.

Wolfgang wrote: "The XSLT program is three times faster on one XSLT implementation than on another one is strange, 'very' strange".

Mike Kay addressed Wolfgang's comment by writing in the thread: "No, it's extremely common. In fact, very much larger factors than this are possible. Sometimes Saxon-EE runs 1000 times faster than Saxon-HE. This effect is normal with declarative languages where powerful optimizations are deployed - SQL users will be very familiar with the effect."

The table below shows the execution times of the stylesheets in Saxon 9.XX (for some recent X). Time were reported by Dimitre.

TransformationTimes (secs)
Dimitre39
Wolfgang25

We observe that Wolfgang's transformation is 1.56 times faster. Please note that with Wolfgang's stylesheet his results lists all solutions (i.e. ladders), whereas Dimitre only finds one.

Saxon represents a stylesheet as a compiled abstract syntax tree (AST) which is processed in a interpreted manner. Since the release of Saxon 9.4 we have included the bytecode generation feature, which allows us at the compilation phase to generate directly the byte code representation of the entire AST or sub-trees of it where performance benefits can be achieved. We make use of properties we know at compile time (See full paper).

Analysis of Dimitre's Stylesheet

Step one was to see how well Saxon does with the bytecode feature switched on. This proved inconclusive because we discovered a bug in the bytecode generated. A useful exercise already, we managed to fix the bug (see bug issue: #1653). The problem was in the function processQueue the tail recursive call was not being properly generated into bytecode.

The Table below shows running times of the stylesheets under Saxon 9.4.0.6. We observe that Wolfgang's stylesheet was 2.07 and 3.22 faster in Saxon Intepreted and bytecode, respectively.

TransformationInterpreted - Times (secs)With bytecode generation - Times (secs)
Dimitre7.957.78
Wolfgang3.832.41

Analyzing Dimitre's stylesheet with the Saxon tracing profile (i.e. option -TP) proved useful. See the html output produced by Saxon below. We observe that there is a big hit on the processNode method, with the most time spent in this function.

 

Analysis of Stylesheet Execution Time

Total time: 9498.871 milliseconds

Time spent in each template or function:

The table below is ordered by the total net time spent in the template or function. Gross time means the time including called templates and functions; net time means time excluding time spent in called templates and functions.

file line instruction count avg time (gross) total time (gross) avg time (net) total time (net)
"*rdsHamming.xsl" 79 function my:processNode 2053 4.12 8470.67 3.729 7655.792
"*rdsHamming.xsl" 21 function my:chainOfWords 1 9491.1 9491.12 993.34 993.34
"*rdsHamming.xsl" 131 function f:eq 3993 0.06 230.02 0.058 230.26
"*rdsHamming.xsl" 131 function my:HammingDistance 3993 0.20 807.38 0.049 194.77
"*func-apply.xsl" 21 function f:apply 15972 0.01 290.01 0.011 175.00
"*-Operators.xsl" 244 template f:eq 15972 0.01 115.01 0.004 68.23
"*-Operators.xsl" 248 function f:eq 15972 0.003 46.77 0.003 46.77
"*nc-zipWith.xsl" 21 function f:zipWith 19965 0.002 33.11 0.002 33.11
"*nc-zipWith.xsl" 9 function f:zipWith 19965 0.003 57.67 0.001 24.56
"*func-apply.xsl" 16 function f:apply 15972 0.019 309.52 0.001 19.52
"*rdsHamming.xsl" 70 function my:processQueue 2053 0.009 18.35 0.009 18.35
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.003 10.52 0.003 10.52
"*rdsHamming.xsl" 120 function my:HammingDistance 3993 0.204 814.48 0.002 7.09
"*hFunctions.xsl" 498 function f:string-to-codepoints 3993 0.001 4.88 0.001 4.88
"*rdsHamming.xsl" 73 function my:processNode 2053 4.128 8475.2 0.002 4.57
"*rdsHamming.xsl" 54 function my:processQueue 2053 0.011 22.20 0.002 3.85
"*rdsHamming.xsl" 17 template /* 1 9491.87 9491.9 0.756 0.76
"*rdsHamming.xsl" 40 function my:chainOfWords 1 0.344 0.34 0.344 0.34
"*rdsHamming.xsl" 117 function my:enumerate 10 0.166 1.65 0.029 0.29
"*rdsHamming.xsl" 111 function my:enumerate 10 0.176 1.76 0.010 0.10

In addition to the Saxon tracing profile I ran the Java hrof profiling tool, which showed up that most time was spent in comparing strings. See the Java profile results below. It was now obvious that the GeneralComparison expression was in question. Specifically we narrowed it down to the instruction: <xsl:for-each select="$vNeighbors[not(. = $pExcluded)]">. For the interpreted code we were doing some unnecessary runtime type checking when we know statically at compile time that we are comparing string values. More Specifically, we know at compile time that $vNeighbors is a sequence of untyped atomic values and $pExcluded is a sequence of strings. We were unnecessarily checking at runtime that untyped atomic and string literal were comparable and we were doing an unnecessary conversion from an untyped atomic to string. 

CPU SAMPLES BEGIN (total = 1213) Thu Nov 29 14:42:47 2012
rank   self  accum   count trace method
   1 24.24% 24.24%     294 300547 java.lang.Integer.hashCode
   2 19.13% 43.36%     232 300581 net.sf.saxon.expr.GeneralComparison.compare
   3  7.75% 51.11%      94 300613 java.util.HashMap.getEntry
   4  2.14% 53.26%      26 300570 java.util.LinkedHashMap$Entry.recordAccess
   5  2.06% 55.32%      25 300234 java.lang.ClassLoader.defineClass1
   6  2.06% 57.38%      25 300616 com.saxonica.expr.ee.GeneralComparisonEE.effectiveBooleanValue
   7  1.98% 59.36%      24 300603 java.util.LinkedHashMap$Entry.recordAccess
   8  1.98% 61.34%      24 300609 net.sf.saxon.type.Converter.convert
....
See full hprof results: java.hprof-DN.txt

Improvements in Bytecode generation

In the bytecode we discovered we were missing out on opportunities to capitalise on static properties we know at compile time. For example during atomization we were doing an instanceof test to see whether each item was a node when we already know from static analysis that this was the case. We were also able to avoid unnecessary conversions of the strings, checking of instanceof and we found we could avoid repeated conversions by saving of string values for reuse when appropriate.

With the code improvements discussed above we were able to apply them in Saxon-EE 9.5 (pre-release). The table below shows these running times on the stylesheet written by Dimitre and Wolfgang. We observe that in the interpreted code that Wolfgang's XSL is 2.13 times faster than Dimitre (This is similar to Dimitre results above). With the bytecode generation feature switched on: Dimitre's stylesheet has dramatically improved in performance and is now 1.19 times faster than Wolfgang's XSL.

TransformationInterpreted - Times (secs)With bytecode generation - Times (secs)
Dimitre7.3731.938
Wolfgang3.4502.17

We have not done any similar analysis on Wolfgang's stylesheet, we will now attempt to do this.

To be continued....

Comparing DOM and other object models - Saxon diaries

| No Comments | No TrackBacks
Saxon has for a long time supported a number of implementations of the XDM tree model:

  • Native Saxon implementations: the TinyTree, the LinkedTree, and a variant of the TinyTree called TinyTreeCondensed
  • External object models: DOM, JDOM, DOM4J, XOM
  • Newly added in the last few weeks: JDOM2, AXIOM
What we haven't done is to do any serious comparisons of performance, except for occasional bouts of intensive tuning of one or another of the models while it was under development.

The addition of two more external models suggested it was time we did some measurements. We adapted our test driver for the XMark benchmark suite to work with any of the above models, and ran the queries to compare the results. The results quoted below are for the nominal 1Mb source document (actual size 1182530 bytes).

DOM of course is an interface rather than an implementation; the DOM used here was the Apache Xerces DOM.

First, memory usage: the table below gives the "expansion factor", that is, the amount of memory used divided by the number of bytes in the source XML document:
ModelExpansion factor
TinyTree3.79
TinyTreeCondensed3.12
LinkedTree5.86
DOM6.17
JDOM6.88
JDOM26.57
DOM4J6.31
XOM4.68
AXIOM7.79
No great surprises there.

As for query performance, our first measurements showed some alarmingly bad performance on many of the external models for Q7. This query is dominated by traversal of the descendant axis. On some of the models this was taking almost 100 times as long as for the TinyTree. We know that the TinyTree is particularly good at traversing the descendant axis, but this ratio was worse than we expected. We found that the recursive algorithm used by some of the models was particularly inefficient, and replaced it with the non-recursive algorithm used by the LinkedTree, which significantly improved the figures. After this tuning exercise, the following table shows for each model, the average elapsed time compared with the TinyTree figure, together with the minimum and maximum values of this ratio.


ModelAverageBestWorst
TinyTree 1 1 1
TinyTreeCondensed 1.03 0.97 1.23
LinkedTree 2.29 1.2 6.35
DOM 8.05 1.98 23.05
JDOM 12.17 2.45 71.07
JDOM2 6.51 2.08 28.47
DOM4J 17.3 2.11 98.38
XOM 3.8 1.99 6.09
Axiom 11.15 2.61 60.18

This clearly shows that of the external object models, XOM is clearly out-performing the others. But even XOM is nearly four times slower than the TinyTree on average.

Finally, we also measured build time: the time taken to build the tree. The figures below are in ms:


ModelBuild time
TinyTree 44
TinyTreeCondensed 139
LinkedTree 205
DOM 184
JDOM 177
JDOM2 183
DOM4J 187
XOM 283
Axiom 675

Nothing dramatic here, except to note that Axiom appears to be very slow to build the tree.

The results here are of course not necessarily representative of other workloads. Notoriously, the XMark data does not use namespaces, and that could seriously distort the results. The XMark queries are also heavily optimized; for example, they probably do less sorting-into-document-order than your average workload, and that's an operation that can be very expensive with some object models. Nevertheless, any data is better than none, and publishing this may encourage some people to take another look at XOM.
Unfortunately, most of the people who use DOM with Saxon do so out of ignorance, and they are the people who are least likely t read this article.

Footnote

A more detailed look at the XMark performance data showed that in many cases the worst-performing code in a number of the external model wrappers (XOM being a notable exception) was when navigating the descendant axis. We've changed some of the wrappers, notably DOM and DOM4J, to use the same algorithm as XOM was using for the descendant axis, and the result is a notable improvement. However, one concern about this algorithm is that it has the potential for very poor performance in trees with a high fan-out, for example, where one parent node has tens of thousands of immediate children. In the case of DOM the problem is compounded by the fact that DOM is an interface, not an implementation, so different implementations of DOM might exhibit very different performance characteristics: the right approach for one implementation is not necessarily the right approach for others.

FtanML - A new markup language - Saxon diaries

| No Comments | No TrackBacks

FtanML


FtanML is a notation for data and documents designed to combine the simplicity of JSON with the expressive power of XML. Key aims are ease of reading and writing by human beings, and ease of processing by software in conventional programming languages.


It is named after Ftan, the village in the Swiss Alps where it was created, during a two-week summer school course in August 2012. The tutors on the course were Michael Kay and Stephanie Haupt, and the students were Max Altgelt, Julien Bergner, Lukas Graf, Dominik Helm, Axel Kroschk, Uwe von Lüpke, My-Tien Nguyen, Sebastian Meßmer, Suhanyaa Nitkunanantharajah, Jan Landelin Pawellek, and Martin Schmitt.


The FtanML Grammar


value ::= array | element | string | number | "true" | "false" | "null"

string ::= (like a JSON string)

number ::= (like a JSON number)

array ::= "[" (value ("," value)* )? "]"


element ::= "<" element-name? attribute* ("|" content)? ">"


element-name ::= string | name


attribute ::= (string | name) "=" value


content ::= (content-char | escaped-char | element ) *


content-char ::= any character except \, <, >, and control characters


escaped-char ::= as in JSON strings with the addition of "\<", "\>" (to escape "<" and ">")


name ::= [\p{L}\p{N}:_$]+


The FtanML Data Model


A Value is either an Array, an Element, a string, a number, a boolean or null.


An Array is a sequence of Values.


An Element is a set of attributes. An attribute is a name-value pair, where the name is any string and the value is any Value. (An element is thus equivalent to a JSON Object.)


Two particular attributes are treated specially in the surface syntax, but not in the data model. The attribute named "name" represents what in XML is called the "element type name", and the syntax <name="hr"> may therefore be abbreviated to <hr>. The attribute named "content" represents what in XML is modelled as the children of the element, and its value is an array consisting of a sequence of elements and strings (with adjacent strings and zero-length strings not permitted). The content of an element is normally written in the form <p|Here is <b|bold> text> but in data model terms this is equivalent to <p content=["Here is ", <b content=["bold"]>, " text"]>.


Examples


A JSON-like array:


[1, 2, "abc", [1, 2]]


A JSON-like map:


<x=1 y=2 label="box" "corner coordinates"=[[0,1], [0,3], [1,3], [1,1]]>


Note: the attribute names do not require quotes unless they contain special characters not allowed in a name. The presence or absence of quotes is not exposed in the data model. The quotes around attribute values are required if the value is a string, but not otherwise.


An XML-like element:


<para|Here is some <b|bold> text>


The equivalent of the XML "start tag" is the material before the vertical bar; to the right of the vertical bar is the (optional) content; the end tag is reduced to a right angle bracket.


An XML-like element with attributes, and an empty element:


<para id="p123"|Line 1<br>Line 2>


A element named "list" containing three unnamed elements:


<list|<|red><|green><|blue>>


An element having a class attribute but no name:


<p|This is <class="u"|underlined>>


An element with no name, attributes, or content:


<>


Postscript


By the end of the course we implemented a parser for FtanML (using JavaCC), which generated Java objects corresponding to the object model. We also implemented serialization of the object model to FtanML, JSON, and XML, and wrote some parser tests and test applications for the object model.


Analyzing what we did, I think there are two things FtanML does particularly well:


(a) it adds mixed content to JSON at the lexical level, without adding any complexity to the data model. So programming against JSON is just as easy as it was before, but "document-like" information is no longer excluded


(b) It provides a document-oriented markup language that is both richer and simpler than XML. Richer, because it generalizes what attributes can contain (not just strings, but numbers, booleans, arrays, and elements), and simpler because (i) there is less syntactic baggage (<b|bold text> instead of <b>bold text</b>; simpler escape mechanism), and (ii) it cuts out a lot of the rarely used but complicating features such as comments, processing instructions, CDATA sections, entity references, XML declarations, etc.)


Of course, the fact that we have a better mousetrap will not itself cause the world to beat at our door; but I hope the ideas will prove influential.


There are some things we didn't tackle or decide; here is a list of what one might want to do next.


Parent pointers in the data model


Should an element (=map, = JSON object) have a pointer to its parent? In the XML world we expect this, in the JSON world we don't. There are advantages and disadvantages both ways. I think this remains an open question.


Whitespace


In element content, is whitespace significant? I think it's important that it should be possible to include both significant and insignificant whitespace, and that the two should be clearly distinguished. One suggestion is that any sequence of whitespace characters that follows a backslash should be considered insignificant. Another is the reverse: whitespace is normalized/collapsed except for any whitespace character that is escaped with "\".


Namespaces


I think there are three possible approaches to namespaces. (i) do nothing, as in JSON. (ii) do namespaces the XML way. (iii) provide a minimal namespaces facility.


Since FtanML generally adopts the approach of doing what is right without concern for compatibility, my preference is (iii). I think a simple namespace scheme might work as follows:


The "absolute name" of an element is in the form ":com.saxonica.project.para", that is a name rather like a Java class name using inverse DNS names by convention to achieve uniqueness. The element name can always be written literally in this form. Alternatively, it can be written as a short name "para" in which case it is implicitly in the same namespace as its parent element. The name that appears in the data model is always the absolute name; the short name is merely an authoring convenience. Attribute names are in no namespace unless they are written in full as absolute names.


Path/query language


We clearly need an XPath-like language to address into FtanML documents, and the data model is sufficiently different from XDM that XPath itself won't really do the job. For example, the arrays in the data model are subtly different from XDM sequences (a singleton item is not the same as an array containing that item) and this has considerable implications.


Schema language


Given time and effort, I would propose a schema language for FtanML that is rule-based and that incorporates both predicate-based validation, type assignment, and grammar rules. That is, the ability to say "for an element that matches this pattern, the following rules must be satisfied", where the rules include the ability to specify a named type, to add restrictions such as min and max values and regex patterns, or if the type is "element", to specify the required content of the element as a grammar.


Finally


I don't know if FtanML will go any further. But developing it was a great experience and a good way of spending a couple of weeks with a group of very talented students. I think it was a good learning experience for them; and perhaps, just perhaps, it will sow some ideas in the community that will influence the future of markup languages.







Saxon-CE is 1.0 today - Saxon diaries

| No Comments | No TrackBacks
Today we are finally releasing Saxon-CE 1.0: XSLT 2.0 on the browser. It's been a long haul (18 months of prototyping, alpha releases and beta releases) but I think it has been well worth it. This product, more than anything else that we've done in Saxonica over the years, has the potential to change the rules of the game.

Why do I think it's so significant? Well, let's look at what it changes.

Firstly, it's XSLT 2.0. XSLT 2.0 is twice the language of XSLT 1.0. It's not just a few extra features (grouping, regular expressions, multiple output files, user-defined functions). Many stylesheets require only half the code when written in XSLT 2.0, and many stylesheets that couldn't be written at all in XSLT 1.0 can easily be written using 2.0. So it's an enormous step forward in terms of development and maintenance productivity, and it greatly increases the range of applications that can be tacked using the language.

Next, it's not just XSLT 2.0. It's XSLT with extensions to handle user interaction. XSLT 1.0 came out in the days of Web 1.0, when displaying content prettily was the summit of most people's ambitions. Today people expect the web to be interactive, and Saxon-CE meets that need. Without writing any low-level Javascript, you can write declarative applications that respond to user interaction. XSLT used to handle half the application - the output but not the input; now it handles the whole lot. OK, there's a large community out there that has become comfortable with writing Javascript. But it's messy, it's hard to debug, it requires a great deal of specialist skill, and we think that there's a better and easier way of doing it.

And it's cross-browser. There are five major browsers today, and it runs on all of them, doing a very good job of masking their differences. That means you're no longer constrained to move forward at the pace of the slowest browser vendor. It's no secret that some of the browser vendors have lost some of their early enthusiasm for XML. It's been a chicken-and-egg problem; users weren't rushing into client-side XSLT without universal browser support, and browser vendors weren't rushing into providing that support without mass-market adoption. Instead the browser vendors have been competing to produce the world's fastest Javascript engine, and we've been able to capitalise on that, because internally, Saxon-CE uses that engine.

It's also a solid piece of engineering. Saxonica hired Phil Fearon as lead developer to take the product through from its initial prototype to the stage it has reached now, and he has done a great job. He's made full use of his previous experience developing client-side XML editing tools. Under the surface, the product is now rich in diagnostic and tracing features that integrate solidly with the console and logging facilities in the different browser platforms, providing a solid working environment for the professional web developer; it's also got the APIs to enable integration into IDEs, which is something we hope we'll see a lot of over the next year or so. Phil has reworked the event handling and the exploitation of the Javascript concurrency model to ensure that applications are as responsive as possible. We've also had great feedback from the user community during beta testing, and we've implemented many ideas that emerged from this feedback: to take one simple example, the ability to launch Saxon from the xml-stylesheet processing instruction via an XSLT 1.0 bootstrap stylesheet came from a chance remark during the XML Prague conference, but it will make a big difference to anyone trying to migrate an existing XSLT 1.0 application. 

So, it's got a great deal to offer. But, some might ask, isn't it ten years too late?

There are two aspects to that question: an engineering aspect, and a fashion aspect. To succeed in this world, you need to get the engineering right, but you also need to ride the fashion wave. They are not unrelated, but they're not 100% correlated either.

There's no doubt that XML is less fashionable than it was. That's probably a good thing: people are now using XML where it is the right tool for the job, whereas ten years ago a lot of people were using XML for jobs that it was never designed for. Make no mistake, there is an awful lot of XML around, and that isn't going to change. What we hope will change is that more of that XML finds its way onto the browser now that there are decent tools for processing it there. The arguments for shipping XML to the client and processing it locally haven't changed; the reason it didn't happen in the past wasn't that it was a bad idea, it was because the technology wasn't viable to make it happen.

A lot of the reason that JSON is now preferred for many simple data interchange tasks is that when you are using conventional programming languages like Javascript (and Java, and C#, and Python, etc), JSON is much easier than XML to manipulate. Programming is easier when you use a programming language that matches the data model. So if you're using Javascript, JSON is easier than XML; conversely, if you're using XML, you want a programming language (XSLT) that is designed for the job. There is absolutely no doubt that processing XML in Javascript is a pig, so we think that exactly the same pressures that led to people saying "use JSON rather than XML if you can" will lead people to say "use XSLT rather than Javascript if you can". Both are ways of solving the famous "impedance mismatch" - the reams of code that you need to write when your programming language doesn't match your data model. If your data is simple enough to represent in JSON, then fine, use it. But a lot of "big data" isn't, so XML is here to stay.

That's the engineering argument, but can we make it fashionable? Who knows. I think that will depend on the user community: it takes a few inspired users to do something cool with it, and others will join the bandwaggon. It's not the technology itself that inspires the masses, it's what gets built on top of it. We've no shortage of ideas as to what could be done (I would love to write a MusicXML application...) but we can't do it all ourselves, and we wouldn't want to.

So, Saxon-CE 1.0 is here. XSLT 2.0 has finally arrived on the browser, in a very solid implementation. Over to you.
Saxon for a while has had the ability to export the schema component model as an XML document, making it accessible to XSLT and XQuery applications. Providing the information at this level is much easier to process because it represents the "compiled" schema rather than the raw schema documents; so there's no sensitivity to all the stylistic variations in the way users choose to write schemas. Nevertheless, it's not a particularly easy model to work with, partly because chasing all the references from one component to another isn't much fun.

I've been experimenting today with a new interface that represents the schema components directly as a functional model.

So saxon:schema() returns the "Schema" schema component as a function, and saxon:schema()('element declarations') gets all the element declarations, as a list of functions. If you want the element declaration with a particular name, that's

let $bookDecl := saxon:schema()('element declarations')[.('name') = 'BOOK']

which is likely to be sufficiently common that I propose to offer a short-cut.

The properties of the element are then available, for example

$bookDecl('nillable')

tests whether the declaration is nillable, and

$bookDecl('abstract')

tests whether it is abstract.

Many properties have values that are themselves schema components, also represented as functions. So to discover whether an element has element-only content you can test:

if ($bookDecl('type definition')('variety') = 'element-only') ...

All the schema components and property records are available as described in the XSD 1.1 specification, with a very few exceptions (hopefully temporary):

(a) Saxon currently does not provide access to annotations (they are discarded at schema processing time)

(b) One or two other properties are not fully implemented, for example, {context} and {fundamental facets}

(c) The types available directly from the Schema do not include built-in types

(d) Saxon represents multiple patterns and assertions of a simple type as multiple facet objects rather than a single facet object.

(e) There's one property that's hard to represent directly in the XDM data model: the value of the enumeration facet is a sequence of sequences. So I'm currently exposing the enumeration values as string values rather than typed values.

Every component has an additional property "class" which tells you what kind of schema component it is. This is particularly necessary for the components representing facets, where the class is the only way of telling, for example, whether you have a minInclusive or a maxExclusive facet.

That's a lot of capability but I think there are one or two other things needed:

(i) getting top-level global components by name. I think the best way to do this is a function saxon:element-declaration() that takes the QName of the required component and returns the component directly; there will be one of these functions for each kind of component.

(ii) a few other 'convenience' properties that short-circuit complex navigation of the component model. For example, starting with an element declaration, getting all the element declarations of its permitted children, without having to trawl your way through the complex type and its particles and terms and their substitution groups. I guess only experience of real applications will show what convenience functions are really needed.

(iii) linking to this model from validated instances (a subset of the PSVI - initially perhaps just the ability to get from a node to the schema component representing its type annotation) and perhaps also from validation errors: if you use try/catch to catch a validation error, it would be nice if the error object gives access to the relevant schema component.

Note that I haven't mentioned maps. The functions that implement schema components might be maps, or they might not. That's an implementation detail. One of the nice things about the XSLT 3.0 model for maps (already implemented in Saxon 9.4) is that maps are functions, so  if an interface like this one is defined in terms of functions, it can use maps behind the scenes in the implementation, or not, as it pleases. In fact my current implementation isn't using maps, because that makes it easier to do a lazy implementation in which there is no underlying data structure other than the Java classes that are already there to implement the component model.

One nice property of this API is that I can't see many obstacles to standardizing it so it's available in all XSLT and XQuery implementations. The main work, of defining the schema component model, has already been done in the XSD specs. 

 

Enhanced by Zemanta

Taking Saxon-CE Forward - Saxon diaries

| No Comments | No TrackBacks

The work we are doing on Saxon-CE (XSLT 2.0 on the browser) received a very encouraging reception at XML Prague. Our half-day tutorial/workshop was very well attended, and there was a lot of useful feedback, and new ideas which we are hoping to incorporate. Discussions in the corridors of the conference also demonstrated a lot of interest. If you're not familiar with the current status, it's now on beta release, and we've been doing a lot of work that can be summed up as taking it from the prototype we showed last year to an industrial-strength product with stable interfaces, a range of development and debugging and instrumentation tools, responsive performance, and full cross-browser compatibility.

There are a few technical things we need to do before the product is finished. Providing a Javascript API is one; another is to provide control over management of the browser URL bar and history, so that users can save bookmarks and use the back button in the way that they expect; and we need to enable asynchronous fetching of XML documents from the server to create AJAX-style applications (since the J stands for Javascript, perhaps we should call it AXAX).

We're also starting to plan the commercial side of shipping a product. That includes pricing, the logistics of licensing and selling, and some basic marketing. We're not going to give the product away: there are too many good software products that have stalled because the value delivered to users wasn't flowing back to the developers to invest in the product. At the same time, giving software away free, within limits, is by far the best way of establishing a presence in the market.

Let me share my current thinking on pricing. I'm interested in feedback, so let me know what you think: these are just current ideas, so it might turn out quite different.

Firstly, the software will be licensed by domain (the name of the web domain on which it is deployed). We've already got that mechanism in place in the beta. We're hoping it's simple, clear, and hassle-free. We're not charging per developer, because that would be complicated and because the cost isn't directly related to the value; and we're not charging per end-user or per access, because that would be madness. Perhaps for people who want to deploy on multiple domains - the IBMs of this world - we will offer some kind of discount for domains after the first.

Secondly, we'll classify users into two groups: small and large. (Not "commercial" and "non-commercial" - that leaves far too many questions of definition). A small user is an individual or company or organization with revenue less than $x per annum (let's say $1m for the sake of argument), and a big user is anyone else. The entity whose revenue matters is the legal owner of the domain for which the license is held, which is nice because we can look that up; in many cases we can also check the revenue from public information, so there's not too much room for cheating.

Finally, we'll give you a choice of getting a one-year license or a permanent license. A one-year license will switch off at the end of the year. We hope that's a painful enough prospect that people who are taking the software seriously will choose to pay for a permanent license, while leaving a low-cost option for those who aren't quite sure yet how much commitment they want to make.

So that gives four prices, looking something like this (in GBP):

Price
12-month Permanent
Small organization
free £500
Large organization
£500 £2500

How do we tell if we have got this right? The prices look high by some standards, low by others. I think the test whether the price is in the right ballpark is that it is (a) less than the value users are getting from the software, and (b) greater than the money we are spending on developing it. On those criteria, it looks good to me, but I welcome your views!

Enhanced by Zemanta

Maps - Saxon diaries

| No Comments | No TrackBacks
I optimistically assumed that the W3C XQuery/XSLT specification for maps would be published long before Saxon 9.4 was released. I was wrong - the XQuery WG didn't rubber-stamp the XSLT WG's spec as expected, but insisted on reviewing it from first principles, starting a long drawn out study of requirements and use-cases which is still ongoing.

So Saxon 9.4 has maps, but no documentation for them. To remedy this, here is the spec. It happens to be a snapshot of a draft of the XSLT 3.0 specification, but of course everything can change before a draft is published. And as always when Saxon implements features in a draft specification, if the spec changes then Saxon will change with it, regardless of compatibility. Use at your own risk.

20.1 Maps

A map is an additional kind of item.

[DefinitionA map comprises a collation and a set of entries. Each entry comprises a key which is an arbitrary atomic value, and an arbitrary sequence called the associated value. Within a map, no two entries have the same key, when compared using the eq operator under the map's collation. It is not necessary that all the keys should be mutually comparable (for example, they can include a mixture of integers and strings). Key values will never be of type xs:untypedAtomic, and they will never be the xs:float or xs:double value NaN.]

The function call map:get($map, $key) can be used to retrieve the value associated with a given key.

A map can also be viewed as a function from keys to associated values. To achieve this, a map is also a function item. The function corresponding to the map has the signaturefunction($key as xs:anyAtomicValue) as item()*. Calling the function has the same effect as calling the get function: the expression $map($key) returns the same result asget($map, $key). For example, if $books-by-isbn is a map whose keys are ISBNs and whose assocated values are book elements, then the expression $books-by-isbn("0470192747") returns the book element with the given ISBN. The fact that a map is a function item allows it to be passed as an argument to higher-order functions that expect a function item as one of their arguments.

Like all other values, maps are immutable. For example, the map:remove function creates a new map by removing an entry from an existing map, but the existing map is not changed by the operation.

Like sequences, maps have no identity. It is meaningful to compare the contents of two maps, but there is no way of asking whether they are "the same map": two maps with the same content are indistinguishable.

20.1.1 The Type of a Map

The syntax of ItemTypeXP30 as defined in XPath is extended as follows:

MapType
[17]   ItemType   ::=   ... | MapType 
[18]   MapType   ::=   'map' '(' ( '*' | (AtomicOrUnionTypeXP30 ',' SequenceTypeXP30) ')'

The ItemType map(K, V) matches an item M if (a) M is a map, and (b) every entry in M has a key that matches K and an associated value that matches V. For example,map(xs:integer, element(employee)) matches a map if all the keys in the map are integers, and all the associated values are employee elements. Note that a map (like a sequence) carries no intrinsic type information separate from the types of its entries, and the type of existing entries in a map does not constrain the type of new entries that can be added to the map.

The ItemType map(*) is equivalent to map(xs:anyAtomicType, item()*), and matches any map regardless of its contents.

Because a map is a function, the type map(K, V) is derived from function(K) as V, and instances of map(K, V) can be used wherever the required type is function(K) as V.

20.1.2 Functions that operate on Maps

The functions defined in this section use a conventional namespace prefix map, which is assumed to be bound to the namespace URI http://www.w3.org/2005/xpath-functions/map.

There is no operation to atomize a map or convert it to a string.

The number of entries in the map may be obtained as count(map:keys($map)).

20.1.2.1 map:new
Summary

Creates a new map: either an empty map, or a map that combines entries from a number of existing maps.

Signatures

new() as map(*)

 

new($maps as map(*)*) as map(*)

 

new($maps as map(*)*,
$collation as xs:string) as map(*)

Rules

The function map:new constructs and returns a new map.

The zero-argument form of the function returns an empty map whose collation is the default collation in the static context. It is equivalent to calling the one-argument form of the function with an empty sequence as the value of the first argument.

The one-argument form of the function returns a map that is formed by combining the contents of the maps supplied in the $input argument. It is equivalent to calling the two-argument form of the function with the default collation from the static context as the second argument.

The two-argument form of the function returns a map that is formed by combining the contents of the maps supplied in the $input argument. The collation of the new map is the value of the $collation argument. The supplied maps are combined as follows:

  1. There is one entry in the new map for each distinct key value present in the union of the input maps, where keys are considered distinct according to the rules of thefn:distinct-values function with $collation as the collation.

  2. The associated value for each such key is taken from the last map in the input sequence $input that contains an entry with this key. If this map contains more than one entry with this key (which can happen if its collation is different from that of the new map) then it is implementation-dependent which of them is selected.

There is no requirement that the supplied input maps should have the same or compatible types. The type of a map (for example map(xs:integer, xs:string)) is descriptive of the entries it currently contains, but is not a constraint on how the map may be combined with other maps.

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:new() returns map{}(Returns an empty map, whose collation is the default collation from the static context).

The expression map:new(()) returns map{}(Returns an empty map, whose collation is the default collation from the static context).

The expression map:new((map:entry(0, "no"), map:entry(1, "yes"))) returns map{0:="no", 1:="yes"}(Returns a map with two entries; the collation of the map is the default collation from the static context).

The expression map:new((map:entry(0, "no"), map:entry(1, "yes"))) returns map{0:="no", 1:="yes"}(Returns a map with two entries; the collation of the map is the default collation from the static context).

The expression map:new(($week, map{7:="Unbekannt"})) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag", 7:="Unbekannt"}(The value of the existing map is unchanged; a new map is created containing all the entries from $week, supplemented with a new entry.).

The expression map:new(($week, map{6:="Sonnabend"})) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Sonnabend"}(The value of the existing map is unchanged; a new map is created containing all the entries from $week, with one entry replaced by a new entry. Both input maps contain an entry with the key value 6; the one used in the result is the one that comes last in the input sequence.).

The expression map:new((map{"A":=1}, map{"a":=2}), "http://collation.example.com/caseblind") returns map{"a":=2}(Assuming that the keys of the two entries are equal under the rules of the chosen collation, only one of the entries can appear in the result; the one that is chosen is the one from the last map in the input sequence. If both entries were in the same map, it would be implementation-dependent which was chosen.).

20.1.2.2 map:collation
Summary

Returns the URI of the supplied map's collation

Signature

collation($input as map(*)) as xs:string

Rules

The function map:collation returns the collation URI of the map supplied as $input.

Examples

The expression map:collation(map:new((), "http://collation.example.com/caseblind")) returns "http://collation.example.com/caseblind".

20.1.2.3 map:keys
Summary

Returns a sequence containing all the key values present in a map

Signature

keys($input as map(*)) as xs:anyAtomicType*

Rules

The function map:keys takes any map as its $input argument and returns the keys that are present in the map as a sequence of atomic values, in implementation-dependent order.

Examples

The expression map:keys(map{1:="yes", 2:="no"}) returns some permutation of (1,2)(The result is in implementation-dependent order.).

20.1.2.4 map:contains 
Summary

Tests whether a supplied map contains an entry for a given key

Signature

contains($map as map(*),
$key as xs:anyAtomicType) as xs:boolean

Rules

The function map:contains returns true if the map supplied as $map contains an entry with a key equal to the supplied value of $key; otherwise it returns false. The equality comparison uses the map's collation; no error occurs if the map contains keys that are not comparable with the supplied $key.

If the supplied key is xs:untypedAtomic, it is converted to xs:string. If the supplied key is the xs:float or xs:double value NaN, the function returns false.

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:contains($week, 2) returns true().

The expression map:contains($week, 9) returns false().

The expression map:contains(map{}, "xyz") returns false().

The expression map:contains(map{"xyz":=23}, "xyz") returns true().

The expression map:contains(map{"abc":=23, "xyz":=()}, "xyz") returns true().

20.1.2.5 map:get
Summary

Returns the value associated with a supplied key in a given map.

Signature

get($map as map(*),
$key as xs:anyAtomicType) as item()*

Rules

The function map:get attempts to find an entry within the map supplied as $input that has a key equal to the supplied value of $key. If there is such an entry, it returns the associated value; otherwise it returns an empty sequence. The equality comparison uses the map's collation; no error occurs if the map contains keys that are not comparable with the supplied $key.

If the supplied key is xs:untypedAtomic, it is converted to xs:string. If the supplied key is the xs:float or xs:double value NaN, the function returns an empty sequence.

Notes

A return value of () from map:get could indicate that the key is present in the map with an associated value of (), or it could indicate that the key is not present in the map. The two cases can be distinguished by calling map:contains.

Invoking the map as a function item has the same effect as calling get: that is, when $map is a map, the expression $map($K) is equivalent to get($map, $K). Similarly, the expression get(get(get($map, 'employee'), 'name'), 'first') can be written as $map('employee')('name')('first').

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:get($week, 4) returns "Donnerstag".

The expression map:get($week, 9) returns ()(When the key is not present, the function returns an empty sequence.).

The expression map:get(map:entry(7,()), 7) returns ()(An empty sequence as the result can also signify that the key is present and the associated value is an empty sequence.).

20.1.2.6 map:entry
Summary

Creates a map that contains a single entry (a key-value pair).

Signature

entry($key as xs:anyAtomicType,
$value as item()*) as map(*)

Rules

The function map:entry returns a new map which normally contains a single entry. The collation of the new map is the default collation from the static context. The key of the entry in the new map is $key, and its associated value is $value.

If the supplied key is the xs:float or xs:double value NaN, the supplied $map is empty (that is, it contains no entries).

If the supplied key is xs:untypedAtomic, it is converted to xs:string.

Notes

The function map:entry is intended primarily for use in conjunction with the function map:new. For example, a map containing seven entries may be constructed like this:

map:new((
   map:entry("Su", "Sunday"),
   map:entry("Mo", "Monday"),
   map:entry("Tu", "Tuesday"),
   map:entry("We", "Wednesday"),
   map:entry("Th", "Thursday"),
   map:entry("Fr", "Friday"),
   map:entry("Sa", "Saturday")
   ))

Unlike the map{...} expression, this technique can be used to construct a map with a variable number of entries, for example:

map:new(for $b in //book return map:entry($b/isbn, $b))
Examples

The expression map:entry("M", "Monday") returns map{"M":="Monday"}.

20.1.2.7 map:remove
Summary

Constructs a new map by removing an entry from an existing map

Signature

remove($map as map(*),
$key as xs:anyAtomicType) as map(*)

Rules

The function map:remove returns a new map. The collation of the new map is the same as the collation of the map supplied as $map. The entries in the new map correspond to the entries of $map, excluding any entry whose key is equal to $key.

No failure occurs if the input map contains no entry with the supplied key; the input map is returned unchanged

Examples

let $week := map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}

The expression map:remove($week, 4) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 5:="Freitag", 6:="Samstag"}.

The expression map:remove($week, 23) returns map{0:="Sonntag", 1:="Montag", 2:="Dienstag", 3:="Mittwoch", 4:="Donnerstag", 5:="Freitag", 6:="Samstag"}.

20.1.2.8 fn:deep-equal
Summary

This function is extended to handle maps.

Signatures

deep-equal($parameter1 as item()*,
$parameter2 as item()*) as xs:boolean

 

deep-equal($parameter1 as item()*,
$parameter2 as item()*,
$collation as xs:string) as xs:boolean

Rules

The $collation argument identifies a collation which is used at all levels of recursion when strings are compared (but not when names are compared), according to the rules in [FO:choosing-a-collation]FO.

If the two sequences are both empty, the function returns true.

If the two sequences are of different lengths, the function returns false.

If the two sequences are of the same length, the function returns true if and only if every item in the sequence $parameter1 is deep-equal to the item at the same position in the sequence $parameter2. The rules for deciding whether two items are deep-equal follow.

Call the two items $i1 and $i2 respectively.

If $i1 and $i2 are both atomic values, they are deep-equal if and only if ($i1 eq $i2) is true, or if both values are NaN. If the eq operator is not defined for $i1 and $i2, the function returns false.

If one of the pair $i1 or $i2 is an atomic value and the other is not, or if one is a node and the other is not, the function returns false.

If $i1 and $i2 are both maps, the result is true if and only if all the following conditions apply:

  1. Both maps have the same number of entries.

  2. Both maps have the same collation.

  3. For every entry in the first map, there is an entry in the second map that:

    1. has the same key (compared using the eq operator under the maps' collation), and

    2. has the same associated value (compared using the fn:deep-equal2 function, under the collation supplied in the original call to fn:deep-equal2).

If $i1 and $i2 are both nodes, they are compared as described below:

  1. If the two nodes are of different kinds, the result is false.

  2. If the two nodes are both document nodes then they are deep-equal if and only if the sequence $i1/(*|text()) is deep-equal to the sequence $i2/(*|text()).

  3. If the two nodes are both element nodes then they are deep-equal if and only if all of the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The two nodes are both annotated as having simple content or both nodes are annotated as having complex content.

    3. The two nodes have the same number of attributes, and for every attribute $a1 in $i1/@* there exists an attribute $a2 in $i2/@* such that $a1 and $a2 are deep-equal.

    4. One of the following conditions holds:

      • Both element nodes have a type annotation that is simple content, and the typed value of $i1 is deep-equal to the typed value of $i2.

      • Both element nodes have a type annotation that is complex content with elementOnly content, and each child element of $i1 is deep-equal to the corresponding child element of $i2.

      • Both element nodes have a type annotation that is complex content with mixed content, and the sequence $i1/(*|text()) is deep-equal to the sequence$i2/(*|text()).

      • Both element nodes have a type annotation that is complex content with empty content.

  4. If the two nodes are both attribute nodes then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The typed value of $i1 is deep-equal to the typed value of $i2.

  5. If the two nodes are both processing instruction nodes, then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes have the same name, that is (node-name($i1) eq node-name($i2)).

    2. The string value of $i1 is equal to the string value of $i2.

  6. If the two nodes are both namespace nodes, then they are deep-equal if and only if both the following conditions are satisfied:

    1. The two nodes either have the same name or are both nameless, that is fn:deep-equal(node-name($i1), node-name($i2)).

    2. The string value of $i1 is equal to the string value of $i2 when compared using the Unicode codepoint collation.

  7. If the two nodes are both text nodes or comment nodes, then they are deep-equal if and only if their string-values are equal.

Error Conditions

An error is raised [tba] if either input sequence contains a function item that is not a map.

Notes

Two nodes are not required to have the same type annotation, and they are not required to have the same in-scope namespaces. They may also differ in their parent, their base URI, and the values returned by the is-id and is-idrefs accessors (see Section 5.5 is-id Accessor DM30 and Section 5.6 is-idrefs Accessor DM30). The order of children is significant, but the order of attributes is insignificant.

The contents of comments and processing instructions are significant only if these nodes appear directly as items in the two sequences being compared. The content of a comment or processing instruction that appears as a descendant of an item in one of the sequences being compared does not affect the result. However, the presence of a comment or processing instruction, if it causes a text node to be split into two text nodes, may affect the result.

The result of fn:deep-equal(1, current-dateTime()) is false; it does not raise an error.

Examples

The expression fn:deep-equal(map{}, map{}) returns true().

The expression fn:deep-equal(map{"a":=1, "b":=2}, map{"b":=2, "a":=1.0}) returns true().

The expression fn:deep-equal(map{"a":=xs:double('NaN')}, map{"a":=xs:float('NaN')}) returns true().

let $at :=

<attendees>
  <name last='Parker' first='Peter'/>
  <name last='Barker' first='Bob'/>
  <name first='Peter' last='Parker'/>
</attendees>

The expression fn:deep-equal($at, $at/*) returns false().

The expression fn:deep-equal($at/name[1], $at/name[2]) returns false().

The expression fn:deep-equal($at/name[1], $at/name[3]) returns true().

The expression fn:deep-equal($at/name[1], 'Peter Parker') returns false().

20.1.3 Map Expressions

A new kind of expression is added to the syntax of XPath.

The syntax of PrimaryExprXP30 is extended to permit MapExpr as an additional alternative.

MapExpr := "map" "{" (KeyExpr ":=" ValueExpr ("," KeyExpr ":=" ValueExpr )*)? "}"

KeyExpr := ExprSingle

ValueExpr := ExprSingle

Note:

Two variations on this syntax are under consideration: removing the leading keyword "map", and using the token ":" in place of ":=". This would bring the syntax closer to Javascript and JSON notation. However, special lexical rules would be needed to disambiguate this use of ":" from other uses. Feedback is invited.

The value of the expression is a map whose entries correspond to the key-value pairs obtained by evaluating the successive KeyExpr and ValueExpr expressions.

Each KeyExpr expression is evaluated and atomized; a dynamic error occurs if the result is not a single atomic value. If the key value is of type xs:untypedAtomic it is converted toxs:string. The associated value is the result of evaluating the corresponding ValueExpr. The collation of the new map is the default collation from the static context. If the key value is NaN then the key/value pair is not added to the map. If two or more keys are equal under the collation of the map then the last occurrence is added to the map and the others are ignored.

For example, the following expression constructs a map with seven entries:

map {
  "Su" := "Sunday",
  "Mo" := "Monday",
  "Tu" := "Tuesday",
  "We" := "Wednesday",
  "Th" := "Thursday",
  "Fr" := "Friday",
  "Sa" := "Saturday
}

Note:

Unlike the map:new function, the number of entries in a map that is constructed using a map expression is known statically, except where duplicate keys or NaN values cause some entries to be ignored.

20.1.4 Examples using maps

This section gives some examples of where maps can be useful.

Example: Using maps with xsl:iterate

This example uses maps in conjunction with the xsl:iterate instruction to find the highest-earning employee in each department, in a single streaming pass of an input document containing employee records.

<xsl:stream href="employees.xml">
  <xsl:iterate select="*/employee">
    <xsl:param name="highest-earners" 
               as="map(xs:string, element(employee))" 
               select="map:new()"/>
    <xsl:variable name="this" select="copy-of(.)" as="element(employee)"/> 
    <xsl:next-iteration>
      <xsl:with-param name="highest-earners"
          select="let $existing := $highest-earners($this/department)
                  return if ($existing/salary gt $this/salary)
                         then $highest-earners
                         else map:new($highest-earners, map:entry($this/department, $this))"/>
    </xsl:next-iteration>
    <xsl:on-completion>
      <xsl:for-each select="map:keys($highest-earners)">
        <department name="{.}">
          <xsl:copy-of select="$highest-earners(.)"/>
        </department>
      </xsl:for-each>
    </xsl:on-completion>
  </xsl:iterate>
</xsl:stream>

 

Example: Using Maps to implement Complex Numbers

A complex number might be represented as a map with two entries, the keys being the xs:boolean value true for the real part, and the xs:boolean value false for the imaginary part. A library for manipulation of complex numbers might include functions such as the following:

<xsl:function name="i:complex" as="map(xs:boolean, xs:double)">
  <xsl:param name="real" as="xs:double"/>
  <xsl:param name="imaginary" as="xs:double"/>
  <xsl:sequence select="map{ true() := $real, false() := $imaginary }"/>
</xsl:function>

<xsl:function name="i:real" as="xs:double">
  <xsl:param name="complex" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="$complex(true())"/>
</xsl:function>

<xsl:function name="i:imaginary" as="xs:double">
  <xsl:param name="complex" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="$complex(false())"/>
</xsl:function>

<xsl:function name="i:add" as="map(xs:boolean, xs:double)">
  <xsl:param name="arg1" as="map(xs:boolean, xs:double)"/>
  <xsl:param name="arg2" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="i:complex(i:real($arg1)+i:real($arg2), i:imaginary($arg1)+i:imaginary($arg2)"/>
</xsl:function>

<xsl:function name="i:multiply" as="map(xs:boolean, xs:double)">
  <xsl:param name="arg1" as="map(xs:boolean, xs:double)"/>
  <xsl:param name="arg2" as="map(xs:boolean, xs:double)"/>
  <xsl:sequence select="i:complex(
      i:real($arg1)*i:real($arg2) - i:imaginary($arg1)*i:imaginary($arg2),
      i:real($arg1)*i:imaginary($arg2) + i:imaginary($arg1)*i:real($arg2))"/>
</xsl:function>

Note:

This example demonstrates how useful it would be to allow user-defined type aliases, so that callers of this function library could write code that treats the value simply as acomplex-number, not as a map. A proposal to introduce such type aliases is under consideration.

 

Example: Using a Map as an Index

Given a set of book elements, it is possible to construct an index in the form of a map allowing the books to be retrieved by ISBN number.

Assume the book elements have the form:

<book>
  <isbn>0470192747</isbn>
  <author>Michael H. Kay</author>
  <publisher>Wiley</publisher>
  <title>XSLT 2.0 and XPath 2.0 Programmer's Reference</title>
</book>

An index may be constructed as follows:

<xsl:variable name="isbn-index" as="map(xs:string, element(book))"
    select="map:new(for $b in //book return map{$b/isbn := $b})"/>

This index may then be used to retrieve the book for a given ISBN using either of the expressions map:get($isbn-index, "0470192747") or $isbn-index("0470192747").

In this simple form, this replicates the functionality available using xsl:key and the key function. However, it also provides capabilities not directly available using the key function: for example, the index can include book elements in multiple source documents. It also allows processing of all the books using a construct such as <xsl:for-each select="map:keys($isbn-index)">

 

Example: A Map containing named Functions

As in Javascript, a map whose keys are strings and whose associated values are function items can be used in a similar way to a class in object-oriented programming languages.

Suppose an application needs to handle customer order information that may arrive in three different formats, with different hierarchic arrangement:

  1. Flat structure:

    <customer id="c123">...</customer>
    <product id="p789">...</product>
    <order customer="c123" product="p789">...</order>
    
  2. Orders within customer elements:

    <customer id="c123">
       <order product="p789">...</order>
    </customer>
    <product id="p789">...</product>
    
  3. Orders within product elements:

    <customer id="c123">...</customer>
    <product id="p789">
      <order customer id="c123">...</order>
    </product>
    

An application can isolate itself from these differences by defining a set of functions to navigate the relationships between customers, orders, and products: orders-for-customerorders-for-productcustomer-for-orderproduct-for-order. These functions can be implemented in different ways for the three different input formats. For example, with the first format the implementation might be:

<xsl:variable name="flat-input-functions" as="map(xs:string, function(*))*"
  select="map {
            'orders-for-customer' := 
                 function($c as element(customer)) as element(order)* 
                    {$c/../order[@customer=$c/@id]},
            'orders-for-product' := 
                 function($p as element(product)) as element(order)* 
                    {$p/../order[@product=$p/@id]},
            'customer-for-order' := 
                 function($o as element(order)) as element(customer) 
                    {$o/../customer[@id=$o/@customer]},
            'product-for-order' := 
                 function($o as element(order)) as element(product) 
                    {$o/../product[@id=$o/@product]} }                    
         "/>

Having established which input format is in use, the application can bind the appropriate implementation of these functions to a variable such as $input-navigator, and can then process the input using XPath expressions such as the following, which selects all products for which there is no order: //product[empty($input-navigator("orders-for-product")(.))]

Enhanced by Zemanta