How we have automated converting C# projects to C++

Customers value Aspose products, which allow manipulating protocols and files of popular formats. Most of them were initially developed for .NET. At the same time, business applications that work with these file formats run in many different environments. This article describes how we set up releases of Aspose.BarCode, Aspose.Email, Aspose.Font, Aspose.Page, Aspose.PDF, Aspose.PUB, Aspose.Slides, Aspose.Tasks, Aspose.TeX, and Aspose.Words for C++ by building a framework for code translation from C#. Preserving the functionality of the .NET versions of these products was technically challenging.

Background

The success of the C# to C++ code translator builds on the experience the CodePorting team gained while setting up automated C# to Java code translation. That framework transformed C# classes into Java ones while replacing system library calls with proper substitutes.

Several approaches were considered for the framework. Developing pure Java versions from scratch would have required too many resources. One option was marshaling calls from Java code to the .NET environment, but this would have limited the set of platforms we could support in the future: back then, .NET was available on Windows only. Call marshaling is convenient when calls are infrequent and carry widely used data types. However, it becomes overwhelming when working with many objects and custom data types.

Instead, we wondered how to fully translate existing code to a new platform. This was a pressing issue because code migration had to be done monthly and for all products, producing a synchronized flow of releases with matching features.

The solution was split into two parts:

  • Translator — an application that transforms C# syntax into Java syntax, replacing .NET types and methods with proper substitutes from the target language libraries.
  • Library — a component that emulates the parts of the .NET library that could not be mapped to Java properly. To simplify the task, available third-party components could be used.

The following arguments confirmed that the plan was technically viable:

  1. C# and Java follow similar design principles, at least when it comes to type structure and the memory management model.
  2. We had to translate libraries only, so moving GUIs to a different platform was not required.
  3. The translated libraries mostly contained business logic and low-level file operations, with the most complex dependencies being System.Net and System.Drawing.
  4. From the very beginning, the libraries were developed to work on a wide range of .NET versions (including Framework, Standard, and even Xamarin). Therefore, minor platform differences could be ignored.

We won't go into further details of the C# to Java translator, as this would require dedicated articles. To summarize, converting C# products to Java became the company's regular practice thanks to the code translator. The translator grew from a simple rule-driven text transformer into a complex code generator that works with an AST representation of the source code.

The success of the C# to Java translator helped us enter the Java market, and it prompted the idea of releasing for C++ using the same scenario.

Requirements

To release C++ versions of our products, we needed a framework that would allow us to translate C# code to C++, compile it, test it, and ship it to the customer. The code was a set of libraries, each up to a few million lines of code. The Library component of the code translator had to cover the following:

  1. Emulate the .NET environment for the translated code.
  2. Adapt the translated code for C++: type structure, memory management, etc.
  3. Move from a ‘translated C#’ code style to a C++ style, so that developers unfamiliar with .NET paradigms can use the code easily.

Many readers are likely to ask why we didn't consider using existing solutions, such as the Mono project. There were several reasons not to:

  1. This would not cover the second and third requirements.
  2. Mono is implemented in C# and depends on its runtime.
  3. Adapting third-party code to our needs (API, type system, memory management model, optimization, etc.) would take an amount of time comparable to creating our own solution.
  4. Our products do not require a full .NET implementation. If we had one, it would be hard to tell which methods and classes we need and which ones we do not. We would spend much time fixing features we never use.

Theoretically, we could use our translator to convert an existing solution to C++. However, this would require having a fully functional translator at the very beginning, because it is impossible to debug any translated code without a system library. Besides, optimization would matter even more than for the translated products' code, because system library calls tend to become bottlenecks.

Let's come back to our requirements for the code translator. Because .NET types cannot be mapped to STL ones, we decided to use custom Library types as substitutes. The Library was developed as a set of adapters that expose third-party libraries' features through a .NET-like API (the same approach as in the Java translator).
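To make the adapter idea more concrete, here is a minimal sketch of a type that offers a .NET-like API on top of an existing C++ facility. The class name String and its methods are hypothetical and chosen only to resemble System.String; the sketch wraps std::u16string for brevity and is not the actual Library type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical adapter: a small .NET-like string API on top of std::u16string.
// This illustrates the approach only and is not the real Library class.
class String {
public:
    String() = default;
    explicit String(std::u16string value) : data_(std::move(value)) {}

    // Resembles System.String.Length.
    int get_Length() const { return static_cast<int>(data_.size()); }

    // Resembles System.String.IndexOf(string): returns -1 when nothing is found.
    int IndexOf(const String& value) const {
        const std::size_t pos = data_.find(value.data_);
        return pos == std::u16string::npos ? -1 : static_cast<int>(pos);
    }

    // Resembles System.String.Substring(startIndex, length).
    // .NET would throw on a bad range; this sketch only asserts.
    String Substring(int startIndex, int length) const {
        assert(startIndex >= 0 && length >= 0 && startIndex + length <= get_Length());
        return String(data_.substr(static_cast<std::size_t>(startIndex),
                                   static_cast<std::size_t>(length)));
    }

    // Resembles System.String.Equals(string) with ordinal comparison.
    bool Equals(const String& other) const { return data_ == other.data_; }

private:
    std::u16string data_;  // the wrapped implementation
};
```

With such a type, translated code like s.Substring(4, 4) keeps the shape of the C# original, while the actual work is delegated to the wrapped implementation.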

As we were translating libraries with an existing API, an important requirement for the translated code was that it should run inside any customer's application. Therefore, we couldn't use garbage collection for the translated code, as it would affect the whole application. Instead, our memory management model had to be clear to C++ developers. Smart pointers were chosen as a compromise. We will describe how we changed the memory model in a separate article.
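The following sketch shows what the smart pointer compromise means in practice: an object is destroyed as soon as its last owner goes away, with no collector involved. It uses std::shared_ptr as a stand-in for the Library's own pointer types, and the Document class and function names are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical class standing in for a translated C# reference type.
class Document {
public:
    Document() { ++liveCount(); }
    ~Document() { --liveCount(); }
    Document(const Document&) = delete;
    Document& operator=(const Document&) = delete;

    // Number of Document objects currently alive (for demonstration only).
    static int& liveCount() {
        static int count = 0;
        return count;
    }
};

// Assumed C# original: Document Create() { return new Document(); }
std::shared_ptr<Document> CreateDocument() {
    return std::make_shared<Document>();
}

// Returns the number of live objects after the last owner has gone away.
int LiveCountAfterRelease() {
    {
        std::shared_ptr<Document> first = CreateDocument();
        std::shared_ptr<Document> second = first;  // two owners, one object
        assert(Document::liveCount() == 1);
    }  // both owners leave scope here, so the object is destroyed immediately
    return Document::liveCount();
}
```

Because destruction is tied to ownership rather than to a collector, the host application's own memory management is left untouched.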

CodePorting has a strong test coverage culture, and the ability to apply the tests written for C# code to C++ products would simplify troubleshooting significantly. The code translator had to be able to translate the tests too.

Initially, manually fixing the translated Java code sped up development and product releases. However, in the long run, this significantly raised the cost of preparing each version for release, as every translation error had to be fixed each time it appeared. This could be managed by patching the resulting Java code with the difference between the translator's outputs for two consecutive C# code revisions instead of converting it from scratch each time. Nevertheless, for C++ it was decided to prioritize fixing the framework over fixing the resulting code, so that each translation error is fixed only once.

Development

The design and development of the C# to C++ code translator was performed solely by CodePorting. It required extensive research and testing of multiple approaches that differed in memory model and other aspects. In the end, two solutions were chosen. One of them is currently used for the C++ releases of the Aspose.BarCode, Aspose.Email, Aspose.PDF, Aspose.Slides, Aspose.Tasks, and Aspose.Words products.

Technologies

Now it's time to explain the technologies we use in the code translator. The translator is a console application written in C#, which makes it easy to embed into scripts performing typical sequences like ‘translate-compile-test’. There is also a GUI component allowing you to do the same by clicking on the buttons.

Syntax analysis is performed by the NRefactory library in the older generation of the translator and by Roslyn in the new one.

The translator makes several passes over the AST to collect information and generate the output C++ code. No AST representation is created for the C++ code; instead, we handle the output code in plain text form.

In many cases, extra information is required to fine-tune the translator. This information is passed via options and attributes. Options apply to the whole project. Typically, they specify the class export macro name or the C# conditional symbols used when parsing the code. Attributes apply to types and other entities and provide specific information about them, e.g. which class members require ‘const’ or ‘mutable’ qualifiers in the translated code, or which entities should be excluded from translation.
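As an illustration of why such hints are needed, consider a read-only C# property that caches its result. C# has no ‘const’ methods, so the translator cannot tell on its own that the getter is logically constant. The sketch below shows one possible shape of the C++ output once the getter is marked as ‘const’ and the cache fields as ‘mutable’. The Person class and its members are hypothetical, and the attribute syntax on the C# side is not shown because the article does not specify it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical translated class; names are illustrative only.
class Person {
public:
    Person(std::string first, std::string last)
        : first_(std::move(first)), last_(std::move(last)) {
        assert(!first_.empty() && !last_.empty());
    }

    // Marked 'const': callers holding a const Person can still read the name.
    const std::string& get_FullName() const {
        if (!cached_) {
            fullName_ = first_ + " " + last_;  // allowed because the field is 'mutable'
            cached_ = true;
            ++computeCount_;
        }
        return fullName_;
    }

    // How many times the full name was actually computed.
    int get_ComputeCount() const { return computeCount_; }

private:
    std::string first_;
    std::string last_;
    mutable std::string fullName_;   // cache, modified inside a const method
    mutable bool cached_ = false;
    mutable int computeCount_ = 0;
};
```

Without the qualifiers, the getter could not be called through a const reference, which would make the translated API awkward for C++ users.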

C# classes and structures are converted into C++ classes, and their members and source code are converted into the closest C++ analogs. Generic types and methods are mapped to C++ templates. C# references are translated into smart pointers (shared or weak). The reference classes are defined in the Library. Other internal details of the code translator will be described in a separate article.
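A rough sketch of this mapping is shown below for a generic tree node. The C# input in the comment is assumed, and std::shared_ptr and std::weak_ptr stand in for the reference classes that the Library defines itself, so the real translator output will look different.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Assumed C# input:
//   class Node<T> {
//       public T Value;
//       public Node<T> Parent;           // back-reference
//       public List<Node<T>> Children;   // owned objects
//   }
template <typename T>
class Node : public std::enable_shared_from_this<Node<T>> {
public:
    explicit Node(T value) : Value(std::move(value)) {}

    T Value;
    std::weak_ptr<Node<T>> Parent;                   // weak: does not keep the parent alive
    std::vector<std::shared_ptr<Node<T>>> Children;  // shared: keeps children alive

    // The node itself must already be owned by a shared pointer.
    std::shared_ptr<Node<T>> AddChild(T value) {
        auto child = std::make_shared<Node<T>>(std::move(value));
        child->Parent = this->shared_from_this();
        assert(!child->Parent.expired());
        Children.push_back(child);
        return child;
    }
};

// Builds a parent with one child, releases the parent, and reports whether
// the child's back-reference has expired.
inline bool ParentExpiresWhenReleased() {
    auto root = std::make_shared<Node<int>>(1);
    auto child = root->AddChild(2);
    root.reset();  // the last shared owner of the parent is gone
    return child->Parent.expired();
}
```

If Parent were a shared pointer as well, parent and child would keep each other alive forever. This is why each reference field has to be classified as either shared or weak.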

So, the project translated from C# to C++ depends on our Library instead of .NET libraries:

[Figure: C# to C++]

To build the code translator Library and the translated projects, we use CMake. Currently, we support the Visual Studio 2017 and 2019 compilers on Windows, and GCC and Clang on Linux.

As already mentioned, most of our .NET implementations are thin adapters over third-party libraries, including:

  • Skia — graphics support.
  • Botan — encryption functions.
  • ICU — strings, codepages, and cultures support.
  • Libxml2 — XML operations.
  • PCRE2 — regular expressions support.
  • zlib — compression functions.
  • Boost — different purposes.
  • A few other libraries.

Both the Translator and Library are covered with many tests. Library tests use the GoogleTest framework. Translator tests are mostly written in NUnit/xUnit and are split into several categories, which ensure that:

  • The translator's output matches the expected result on specific input data.
  • Translated programs' output matches the expected result.
  • NUnit/xUnit tests from the input projects are translated into GoogleTest ones and pass.
  • Translated projects' API works fine in C++.
  • Translator options and attributes work as expected.

We use GitLab as a version control system. For CI, we use Jenkins. Translated products are available as NuGet packages and downloadable archives.

Issues

While working on this project, we faced a lot of different problems. Some of them were expected, and others were uncovered on the way:

  1. Type system differences between .NET and C++.
    C++ has no equivalent of the Object type, and most library classes don't have RTTI. This makes it impossible to map .NET types to STL ones.
  2. Translation algorithms are complicated.
    Many non-trivial nuances have to be handled in the translated code. For example, C# defines the order in which a method's arguments are evaluated, while C++ leaves this order unspecified, so code that relies on it may behave differently from one compiler to another.
  3. Troubleshooting is hard.
    Debugging translated code requires specific skills. Nuances like the one described above can critically affect a program's behavior, producing hard-to-explain errors. On the other hand, they can easily turn into hidden bugs and go unnoticed for a long time.
  4. Memory management systems differ.
    C++ doesn't have garbage collection. Due to that, more resources are required to make the translated code behave like the original one.
  5. Discipline is required for C# developers.
    C# developers have to get used to the limitations imposed by the code translation process. The reasons for these limitations are:
    • The language version must be supported by the translator's syntax analyzer.
    • Code constructs not supported by the translator are forbidden (e.g. ‘yield’).
    • Code style is limited by the structure of the translated code (e.g. each reference field must unambiguously be either a weak reference or a shared reference, while for arbitrary C# code this is not necessarily the case).
    • The C++ language imposes its own restrictions (e.g. in C# static variables aren't deleted before all foreground threads finish, while in C++ this is not the case).
  6. A large amount of work.
    The subset of the .NET library used by our products is large, and it takes much time to implement all the classes and methods.
  7. Special requirements for developers.
    The need to go deep into complicated platform internals and to work with two or more programming languages limits the number of available candidates. On the other hand, developers interested in compiler theory or other exotic disciplines find their place in the project easily.
  8. Fragility of the system.
    Although we have thousands of tests and millions of lines of code to test the translator, we sometimes face situations where changes made to fix the compilation of one project break it for another. For example, this may happen with rare syntax constructs and project-specific code styles.
  9. High entry barriers.
    Most tasks in the code translator project require deep analysis. Because of the large number of subsystems and scenarios, each new task takes a long time spent getting familiar with new aspects of the project.
  10. Intellectual property protection issues.
    While there are many ready-made solutions for obfuscating C# code effectively, in C++ much information is preserved in class headers. Moreover, some definitions can't be removed from public headers without consequences. Mapping generic classes and methods to templates creates another vulnerability, as it reveals the algorithms.
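The argument evaluation issue from item 2 can be illustrated with a short sketch. In C#, a call such as Combine(c.Next(), c.Next()) always evaluates the arguments from left to right. The C++ standard does not fix this order, so a direct transliteration may give different results on different compilers. One possible way to preserve the C# behavior, shown here purely as an assumption about what generated code could do, is to hoist the arguments into temporaries. Counter, Combine, and CombineInCSharpOrder are hypothetical names.

```cpp
#include <cassert>

// Hypothetical class with a side-effecting method.
class Counter {
public:
    int Next() { return ++value_; }

private:
    int value_ = 0;
};

inline int Combine(int first, int second) { return first * 10 + second; }

// A direct transliteration would be:
//     return Combine(c.Next(), c.Next());
// In C++ the two Next() calls may run in either order, so on a fresh counter
// the result could be 12 or 21.
//
// Order-preserving form: each argument is evaluated in its own statement,
// which reproduces the left-to-right order guaranteed by C#.
inline int CombineInCSharpOrder(Counter& c) {
    const int first = c.Next();
    const int second = c.Next();
    assert(first < second);  // holds because the statements run in sequence
    return Combine(first, second);
}
```

The hoisted form is more verbose, but its result no longer depends on the compiler.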

Despite all of that, the code translator project is very interesting from a technical point of view, and its academic complexity forces us to learn something new all the time.

Conclusion

While working on the code translator project, we have succeeded in implementing a system that solves an interesting academic task of code translation. We have established monthly releases of Aspose libraries for a language they were not originally designed for.

We plan to publish more articles about the code translator. The next one will explain the conversion process in detail, including how specific C# constructs are mapped onto C++ ones. Another one will cover the memory management model.

We will try our best to answer your questions. If readers are interested in other aspects of code translator development, we may consider writing more articles on them.