XML versus JSON: A Comparison using the XMark Benchmark

By Michael Kay on May 06, 2022 at 11:00a.m.

For many years we have used the XMark benchmark to check for performance regression between Saxon releases, and to evaluate the impact of internal changes on query performance.

XMark originated with the MonetDB project and is described at https://projects.cwi.nl/xmark/. It consists of a scalable XML data file (produced using a data generator) and a set of 20 XQuery queries to be run against that data file. We have run the data generator to produce files with nominal sizes of 100Kb, 1Mb, 4Mb, 10Mb, and 100Mb; we use the original queries as published, except for one or two small changes to correct errors in the original publication.

Recently we have converted these data files to JSON, and have produced equivalent XQuery 3.1 queries to deliver the same results as the originals. The queries still produce XML rather than JSON output, so that we can compare the results. The exception is a few cases where large chunks of the original XML are copied to the output, which we cannot reproduce exactly because we don't have the original XML available. The results also differ because JSON maps don't retain order.

In this article I will report first on the data conversion; then on the query conversion; and finally on performance results.

Converting the Data

I didn't attempt to use any off-the-shelf XML-to-JSON conversion tools. My instinct is that they wouldn't have done a very good job, and I would have needed an XSLT transformation to refine the output anyway, so I decided to do the whole job using XSLT 3.0.

The conversion stylesheet is not particularly interesting; in fact, it's rather tedious. A few points are worth mentioning:

The nominal 10Mb file is actually 11,875,066 bytes in its XML form, and 10,464,266 bytes when converted to JSON, a reduction of about 12%. Some of this difference (perhaps 200Kb) is due to unnecessary whitespace in the XML; the rest is the overhead of element end tags.

Parsing the XML and building a Saxon TinyTree took 353ms; parsing the JSON and building a structure of XDM maps and arrays took 636ms. I haven't attempted to assess the memory usage of the two data structures, but the maps and arrays are almost certainly larger. This is despite the fact that for maps derived directly from JSON parsing, we use a specialized representation of maps that optimizes for the fact that all keys are instances of xs:string, and therefore don't need to retain a type annotation.
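To make the two data structures concrete, here is a minimal sketch that parses a small inline sample in each format and checks what comes back. The sample content is invented for illustration and is far simpler than the XMark data; it is not taken from the benchmark files.

```xquery
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Invented sample data, much smaller than the XMark files :)
let $xml  := parse-xml('<site><people><person id="person0"><name>Ann</name></person></people></site>')
let $json := parse-json('{"people": [{"id": "person0", "name": "Ann"}]}')
return (
  $xml instance of document-node(),      (: XML parses to a node tree :)
  $json instance of map(*),              (: a JSON object becomes an XDM map :)
  $json?people instance of array(*),     (: a JSON array becomes an XDM array :)
  (: every key of a map produced by JSON parsing is an xs:string :)
  every $k in map:keys($json) satisfies $k instance of xs:string
)
```

All four tests should evaluate to true.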

Converting the Queries

The queries were converted by hand. Generally we tried to change the query so it continued to produce the same (XML) output as the original, for ease of comparing results; but for queries whose output copies sizeable chunks of the input XML, we abandoned this principle, instead replicating the intent of the query as far as we could.

In most cases the conversion is very straightforward. For example, this is Q3:


(: Q3. Return the IDs of all open auctions whose current
     increase is at least twice as high as the initial increase. :)

for    $b in /site/open_auctions/open_auction
where  $b/bidder[1]/increase * 2 <= $b/bidder[last()]/increase
return <increase first="{$b/bidder[1]/increase}"
                 last="{$b/bidder[last()]/increase}"/>

Which turns into:


(: Q3. Return the IDs of all open auctions whose current
     increase is at least twice as high as the initial increase. :)

for    $b in ?open_auctions?*
where  $b?bidders?*[1]?increase *2 <= $b?bidders?*[last()]?increase
return <increase first="{$b?bidders?*[1]?increase}"
                 last="{$b?bidders?*[last()]?increase}"/>
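The JSON form of Q3 can be tried on a small inline sample, as in the sketch below. The shape of the sample (an open_auctions array whose members each have a bidders array) is an assumption inferred from the lookups in the query, and the values are invented. Because there is no context item here, the sample is bound to a hypothetical variable $site.

```xquery
xquery version "3.1";

(: Invented sample: two auctions, only the first satisfies the condition :)
let $site := parse-json('
  {"open_auctions": [
     {"bidders": [{"increase": 1.5}, {"increase": 3.0}]},
     {"bidders": [{"increase": 6.0}, {"increase": 7.5}]}
  ]}')
for    $b in $site?open_auctions?*
where  $b?bidders?*[1]?increase * 2 <= $b?bidders?*[last()]?increase
return <increase first="{$b?bidders?*[1]?increase}"
                 last="{$b?bidders?*[last()]?increase}"/>
```

With this sample, only the first auction passes the where clause (1.5 * 2 <= 3.0), so a single increase element is returned.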
                   
        

Some observations:

Some specific difficulties that were encountered in converting other queries:

Query Performance

For most of the queries, the JSON query was a little slower than the XML version. Queries in this category include:

Query   XML timing (ms)   JSON timing (ms)   Ratio (JSON/XML)
q1      0.2649            0.6845             258%
q2      0.4861            0.6588             136%
q5      0.2711            0.3190             118%
q8      1.9359            2.3572             122%
q10     11.3329           14.3428            127%
q11     93.5360           144.1105           154%
q16     0.4183            0.8489             203%
q17     0.5964            0.8887             149%
q20     1.2380            2.2084             178%

How do we account for these numbers? My theory (based on gut feeling) is that the XML queries are faster because of the use of integer fingerprints for name matching in the TinyTree. Look at q1, for example, which in the original is:

Q1: for $b in /site/people/person[@id="person0"] return $b/name

(The XMark queries were written by someone who felt that everything ought to be written as a FLWOR expression. It can of course be simplified to a simple XPath. I'm surprised they didn't use a where clause...)

The child and attribute axis steps here (child::people, child::person, attribute::id, etc.) are implemented in the TinyTree by a sequential search of node entries, testing each one for an integer namecode. By contrast the JSON equivalent is:

Q1: for $b in ?people?*[?id="person0"] return $b?name

and this involves string-based lookups in a hash table. Because the fan-out is fairly small, the sequential search wins.
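As noted above, the FLWOR wrapper in Q1 is not needed; the same applies to the JSON form. The sketch below shows both queries reduced to simple path and lookup expressions, run against small invented samples bound to hypothetical variables $xml and $json.

```xquery
xquery version "3.1";

(: Invented sample data with the same shape as the parts of XMark used by Q1 :)
let $xml  := parse-xml('<site><people>
                          <person id="person0"><name>Ann</name></person>
                          <person id="person1"><name>Bob</name></person>
                        </people></site>')
let $json := parse-json('{"people": [{"id": "person0", "name": "Ann"},
                                     {"id": "person1", "name": "Bob"}]}')
return (
  (: XML: child and attribute axis steps, matched by name :)
  $xml/site/people/person[@id = "person0"]/name/string(),
  (: JSON: lookups by string key, with the same filter predicate :)
  $json?people?*[?id = "person0"]?name
)
```

Both expressions select the name of the person whose id is "person0".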

To test this theory, I ran the XML queries using DOM rather than TinyTree as the tree model. Navigation in the DOM uses string matching on element and attribute names. With the exception of q1, the DOM queries are dramatically slower than with the TinyTree: q1: 0.2947, q2: 9.1684, q5: 5.1841, q8: 49.4798, q10: 116.8379, q11: 402.2151, q16: 6.5635, q17: 44.1887, q20: 179.2854.

In the next group of queries, the JSON query is slightly faster:

Query   XML timing (ms)   JSON timing (ms)   Ratio (JSON/XML)
q3      1.3507            1.2656             94%
q6      0.2870            0.0316             11%
q9      3.2959            2.2320             68%
q12     32.3911           29.2320            90%
q18     0.3134            0.2865             91%
q19     4.9937            4.6699             93%

Query q6 is clearly an outlier. This query counts descendants: the original XML formulation is:

Q6: for $b in /site/regions/* return count ($b//item)

As it happens, item elements cannot appear at arbitrary depth: they are always direct children of the region elements, so the return clause could equally have been written count($b/item). In writing the JSON query I took advantage of this knowledge, and wrote the query as:

Q6: map:for-each(?regions, function($k, $v){a:size($v)})

This runs faster firstly because of this simplification, and secondly because the size of an array can be determined in constant time, whereas counting the number of children of an element requires actually scanning them.
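For readers who want to try the JSON form of Q6, here is a self-contained sketch. It assumes that the prefix a is bound to the standard array functions namespace (the query as shown does not include its prolog) and that each region maps directly to an array of items; the sample data is invented.

```xquery
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";
declare namespace a   = "http://www.w3.org/2005/xpath-functions/array";

(: Invented sample: each region name maps to an array of items :)
let $site := parse-json('
  {"regions": {
     "africa": [{"id": "item0"}, {"id": "item1"}],
     "asia":   [{"id": "item2"}]
  }}')
(: Label each count with its region, since map entries have no defined order :)
return map:for-each($site?regions,
                    function($k, $v) { $k || ": " || a:size($v) })
```

This should return the strings "africa: 2" and "asia: 1", in an order that depends on the implementation.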

For the other queries where there is a small speed-up, the cause is less obvious, but it's usually possible to hazard a guess. Some of them, for example, involve arithmetic and numeric comparisons, and the JSON queries in such cases avoid the overhead of converting strings to numbers on the fly (instead, the conversion is done during JSON parsing). We know from profiling that these conversions, especially if they occur in a filter predicate, can dominate query execution time.
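The difference in typing can be seen in a small sketch. With no schema, the typed value of an XML element is xs:untypedAtomic and has to be converted to a number each time it is used in arithmetic, whereas a JSON number is already an xs:double once parsing is complete. The sample values are invented.

```xquery
xquery version "3.1";

let $xml  := parse-xml('<bidder><increase>4.50</increase></bidder>')
let $json := parse-json('{"increase": 4.50}')
return (
  (: the XML value is untyped until it is used :)
  data($xml/bidder/increase) instance of xs:untypedAtomic,
  (: the JSON value was converted once, during parsing :)
  $json?increase instance of xs:double,
  (: arithmetic converts the untyped value to xs:double on the fly :)
  $xml/bidder/increase * 2,
  $json?increase * 2
)
```

Both instance-of tests should be true, and both products evaluate to 9.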

For the final group of queries, the JSON performance is far worse:

Query   XML timing (ms)   JSON timing (ms)   Ratio (JSON/XML)
q7      1.0953            87.4869            7987%
q13     0.3635            15.1646            4171%
q14     12.4252           138.0764           1111%

These three queries all involve access to the description of an item, which in the XML representation is a mixed-content field (text with inline markup). As remarked earlier, this has been represented in JSON by expanding the node tree to a structure of arrays and singleton maps. As a result, a query like this one:

Q14: for $i in /site//item where contains ($i/description,"gold") return ($i/name, $i/description)

becomes thoroughly contorted (and inefficient) in the JSON representation: it is necessary to write a recursive function that assembles the description (sans markup) as a string before executing the contains() function. Even then, the JSON query doesn't faithfully reproduce the original, because it outputs the description as a string, losing the internal markup.
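A recursive function of the kind described might look like the sketch below. It is not the function used in the benchmark queries: the name local:text is hypothetical, and the sample assumes that mixed content is held as an array whose members are either strings or singleton maps from an element name to a nested array.

```xquery
xquery version "3.1";

(: Hypothetical helper: concatenate all text found in a nested
   structure of arrays and maps, discarding the markup names :)
declare function local:text($value as item()*) as xs:string {
  string-join(
    for $v in $value
    return
      typeswitch ($v)
        case array(*) return local:text($v?*)   (: recurse into members :)
        case map(*)   return local:text($v?*)   (: recurse into values :)
        default       return string($v)         (: atomic value: keep text :)
  )
};

(: Invented description with inline markup, in the assumed JSON shape :)
let $description := parse-json('
  ["a ", {"bold": ["gold"]}, " ring with ",
   {"emph": ["fine ", {"keyword": ["engraving"]}]}]')
return (
  local:text($description),
  contains(local:text($description), "gold")
)
```

For this sample the function returns "a gold ring with fine engraving", and the contains() test is true. The whole description has to be walked and concatenated before the test can be applied, which is consistent with the slowdown reported above.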

Conclusions

First, if you've got mixed content (text with inline markup) then you probably don't want to be using JSON. If you must use JSON, use XML or HTML within character strings in cases where inline markup is needed.

Secondly, for structured data it's a fairly even match; the differences aren't large enough to be critical for most applications. In Saxon, XML does slightly better on balance. This assumes, however, that for the XML case you are using an efficient model like the Saxon TinyTree, rather than a general-purpose DOM.

We found a few cases where the expressive power of XQuery 3.1 for querying JSON structures has gaps and omissions. Notably, searching for descendants in the tree is difficult; operations based on ordering of items within arrays are also tricky.
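As an illustration of the descendant-search gap, the sketch below uses the standard map:find function, which is probably the nearest built-in equivalent to a descendant step. It finds the values for a given key at any depth, but it returns only the values, with no way to navigate back to the maps that contained them, so it does not replace an expression like /site//item in general. The sample data is invented.

```xquery
xquery version "3.1";
declare namespace map   = "http://www.w3.org/2005/xpath-functions/map";
declare namespace array = "http://www.w3.org/2005/xpath-functions/array";

(: Invented sample with "name" keys at different depths :)
let $site := parse-json('
  {"regions": {
     "africa": [{"id": "item0", "name": "ring"}],
     "asia":   [{"id": "item1", "name": "vase"}]
   },
   "people": [{"id": "person0", "name": "Ann"}]
  }')
return (
  (: all "name" values anywhere in the structure :)
  array:size(map:find($site, "name")),
  (: restricted to the regions subtree :)
  array:size(map:find($site?regions, "name"))
)
```

With this sample the two counts are 3 and 2.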