Saxon-CS says Hello World

By Michael Kay on March 22, 2021 at 10:34a.m.

The Saxon product on .NET has been living on borrowed time for a while. It's built by converting the bytecode of the Java product to the equivalent .NET intermediate language, using the open-source IKVM converter produced by Jeroen Frijters. After many years of devoted service, Jeroen decided a few years ago to give up further development and maintenance of IKVM. That didn't immediately matter, because the product worked perfectly well. But then in 2019 Microsoft announced that future .NET developments would be based on .NET Core, and IKVM has never supported .NET Core, so we clearly had a problem.

There's a team attempting to produce a fork of IKVM that supports the new .NET, but we've never felt we could put all our eggs in that basket. In any case, we also have performance problems with IKVM that we've never managed to resolve: some applications run 5 times slower than Java, and despite a lot of investigation, we've never worked out why.

So we decided to try a new approach, namely Java-to-C# source code conversion. After a lot of work, we've now achieved successful compilation and execution of a subset of the code, and this morning, for the first time, Saxon-CS successfully ran the minimal "Hello World" query.

We're a long way from having a product we can release, but we can now have confidence that this approach is going to be viable.

How does the conversion work? We looked at some available tools, notably the product from Tangible Solutions, and this gave us many insights into what could be readily converted, and where the remaining difficulties lay; it also convinced us that we'd be better off writing our own converter.

The basic workflow is:

  1. Using the open source JavaParser library, parse the Java code, generate an XML abstract syntax tree for each module, and annotate the syntax tree with type information where needed.
  2. Using XSLT code, do a cross-module analysis to determine which methods override each other, which have covariant return types, etc: information needed when generating the C# code.
  3. Perform an XSLT transformation on each module to generate C# code.
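As a rough illustration of steps 2 and 3, here is a minimal sketch of turning an annotated XML syntax tree into C# text with a stylesheet. Everything specific in it is an assumption made for the example: the element and attribute names (method, return, fieldRef, overrides) are invented and are not the vocabulary the real converter uses, and the sketch runs on the XSLT 1.0 processor built into the JDK rather than on Saxon. The overrides attribute stands in for the kind of fact that the cross-module analysis would add, since C# needs an explicit override keyword where Java does not.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class AstToCSharpSketch {

    // Hypothetical syntax tree for one Java method. The names are invented
    // for this sketch; 'overrides' plays the role of an annotation added by
    // a cross-module analysis.
    static final String AST =
        "<method name='size' returns='int' overrides='true'>"
      + "<return><fieldRef name='count'/></return>"
      + "</method>";

    // A tiny stylesheet that serializes the tree as C# source text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='method'>"
      + "<xsl:text>public </xsl:text>"
      + "<xsl:if test=\"@overrides='true'\"><xsl:text>override </xsl:text></xsl:if>"
      + "<xsl:value-of select='@returns'/>"
      + "<xsl:text> </xsl:text>"
      + "<xsl:value-of select='@name'/>"
      + "<xsl:text>() { </xsl:text>"
      + "<xsl:apply-templates/>"
      + "<xsl:text> }</xsl:text>"
      + "</xsl:template>"
      + "<xsl:template match='return'>"
      + "<xsl:text>return </xsl:text><xsl:apply-templates/><xsl:text>;</xsl:text>"
      + "</xsl:template>"
      + "<xsl:template match='fieldRef'><xsl:value-of select='@name'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML tree and returns the generated text.
    static String convert(String astXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(astXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert(AST));
    }
}
```

The point of the sketch is only the shape of the pipeline: facts that need whole-program knowledge are recorded on the tree first, so that the per-module transformation can stay a simple, local rewrite.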

We can't convert everything automatically, so there's a range of strategies we use to deal with the remaining issues:

The areas that have caused most trouble in conversion are:

One area where we could have had trouble, but avoided it, is in the use of the Java CharSequence interface. I wrote about this issue last year in "String, CharSequence, IKVM, and .NET". As described in that article, we decided to eliminate our dependence on CharSequence. For a great many internal uses of strings in Saxon, we now use a new interface, UnicodeString, which as the name implies is much more Unicode-friendly than Java's String and CharSequence. It also reduces memory usage, especially in the TinyTree. There is a small overhead in the places where we have to convert strings to or from UnicodeStrings, which we can't hide entirely: it represents about 5% on the bottom line. But it does make all this code much easier to port between Java and C#.
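To make "Unicode-friendly" a little more concrete, the following sketch shows the underlying issue with String and CharSequence: their length and indexing count UTF-16 code units, not Unicode characters. It does not show the UnicodeString interface itself, whose methods are not described here; the class and method names below are made up for the illustration.

```java
final class CodePointSketch {

    // Length as CharSequence reports it: the number of UTF-16 code units.
    static int codeUnits(CharSequence s) {
        return s.length();
    }

    // Length in Unicode code points, which is what XPath functions such as
    // string-length() are defined in terms of.
    static int codePoints(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
    }
}
```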

What about dependencies? So far we've just been tackling the Saxon-HE code base, and that has very few dependencies that have caused any difficulty. Most of the uses of standard Java library classes (maps, lists, input and output streams, and the like) are handled by the converter, simply translating calls into the nearest C# equivalent. In some cases, such as java.util.Properties, we've written an emulation of the Java interface (or the parts of it that we actually use). In other cases we've redirected calls to helper methods. For example, we don't always have enough type information to know whether Java's List.remove() should be translated to List.Remove() or List.RemoveAt(), so instead we generate a call on a static helper method, which makes the decision at runtime based on the type of the supplied argument.
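The List.remove() case can be sketched as follows. Java overloads remove() to mean either "remove the element at this position" or "remove the first element equal to this value", and picks the overload from the static type of the argument. The helper below is a hypothetical stand-in, written in Java purely to show the idea of deferring that choice to runtime; the real helper lives on the C# side and its name and signature are not given here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListRemoveSketch {

    // Hypothetical helper: chooses between positional and by-value removal
    // from the runtime type of the supplied argument.
    static void remove(List<?> list, Object arg) {
        if (arg instanceof Integer) {
            // Positional removal, corresponding to C#'s List.RemoveAt(index).
            list.remove(((Integer) arg).intValue());
        } else {
            // Removal by value, corresponding to C#'s List.Remove(item).
            list.remove(arg);
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("a", "b", "c"));
        remove(names, 0);     // removes the element at position 0
        remove(names, "c");   // removes the element equal to "c"
        System.out.println(names);  // [b]
    }
}
```

Note that a rule this simple cannot tell the two cases apart for a list whose elements are themselves integers, so it is only an approximation of whatever the real helper does.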

The only external dependency we've picked up so far is for handling big decimal numbers. We're currently evaluating the BigDecimal library from Singulink, which appears to offer all the required functionality, though its philosophy is sufficiently different from the Java BigDecimal to make conversion non-trivial.

One thing I should stress is that we haven't written a general purpose Java to C# converter. Our converter is designed to handle the Saxon codebase, and nothing else. Some of the conversion rules are specific to particular Saxon classes, and as a general principle, we only convert the subset of the language and of the class library that we actually need. Some of the conversion rules assume that the code is written to the coding conventions that we use in Saxon, but which might not be followed in other projects.

So, Hello World to Saxon-CS. There's still a lot of work to do, but we've reached a significant milestone.