Facebook researchers say they’ve developed what they call a neural transcompiler, a system that converts code from one high-level programming language, such as C++, Java, or Python, into another. It’s unsupervised, meaning it looks for previously undetected patterns in data sets without labels and with a minimal amount of human supervision, and it reportedly outperforms rule-based baselines by a “significant” margin.
Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and it’s often costly. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. Transcompilers could help in theory, since they eliminate the need to rewrite code from scratch, but they’re difficult to build in practice because languages differ in syntax and rely on distinctive platform APIs, standard-library functions, and variable types.
Facebook’s system — TransCoder, which can translate between C++, Java, and Python — tackles the challenge with an unsupervised learning approach. TransCoder is first initialized with cross-lingual language model pretraining, which maps pieces of code expressing the same instructions to identical representations regardless of programming language. (Input streams of source code sequences are randomly masked out, and TransCoder is tasked with predicting the masked-out portions based on context.) A process called denoising auto-encoding trains the system to generate valid sequences even when fed with noisy input data, and back-translation allows TransCoder to generate parallel data that can be used for training.
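To make the two corruption steps described above more concrete, here is a minimal Python sketch. It is not TransCoder's code: the `<MASK>` placeholder, the probabilities, and the choice of token dropping plus local shuffling as the "noise" are all assumptions made for illustration.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token, not TransCoder's real vocabulary


def mask_tokens(tokens, mask_prob, rng):
    """Masked-LM style corruption: hide random tokens and remember the originals."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # a model would be trained to predict this from context
        else:
            corrupted.append(tok)
    return corrupted, targets


def add_noise(tokens, drop_prob, max_shift, rng):
    """Denoising-style corruption (assumed form): drop tokens, then shuffle locally."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    # Sorting by position plus a bounded jitter moves tokens only a short distance.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda pair: pair[0])]


rng = random.Random(0)
tokens = "for ( int i = 0 ; i < n ; i ++ )".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=rng)
noisy = add_noise(tokens, drop_prob=0.1, max_shift=3, rng=rng)
```

In this sketch, a pretraining objective would ask a model to recover `targets` from `masked`, while a denoising objective would ask it to regenerate the clean `tokens` from `noisy`.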
The cross-lingual nature of TransCoder arises from the number of common tokens — anchor points — existing across programming languages, which come from common keywords like “for,” “while,” “if,” and “try” and also digits, mathematical operators, and English strings that appear in the source code. Back-translation serves to improve the system’s translation quality by coupling a source-to-target model with a “backward” target-to-source model trained in parallel. The target-to-source model is used to translate target sequences into the source language, producing noisy source sequences, while the source-to-target model helps to reconstruct the target sequences from the noisy sources until the two models converge.
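The two ideas in this paragraph can be illustrated with a toy Python sketch. Everything here is a simplification: tokens are split on whitespace, and `toy_python_to_java` is a hypothetical stand-in for the backward model, not a real translator.

```python
def shared_tokens(snippet_a, snippet_b):
    """Tokens present in both snippets (whitespace tokenization, for illustration)."""
    return set(snippet_a.split()) & set(snippet_b.split())


def back_translation_pairs(target_functions, target_to_source):
    """Build (noisy source, target) training pairs from target-language code only.

    `target_to_source` stands in for the backward model; any callable works here.
    """
    return [(target_to_source(fn), fn) for fn in target_functions]


java_snippet = "if ( x > 0 ) return x + 1 ;"
python_snippet = "if x > 0 : return x + 1"
anchors = shared_tokens(java_snippet, python_snippet)


def toy_python_to_java(fn):
    # Hypothetical, deliberately poor "model": it only rewrites one keyword,
    # so its output is a noisy approximation of Java.
    return fn.replace("def", "static int")


pairs = back_translation_pairs(["def f ( x ) : return x + 1"], toy_python_to_java)
```

Here `anchors` contains the keywords, operators, digits, and identifiers the two snippets have in common. Each entry in `pairs` couples a noisy source with its clean target, which is the kind of synthetic parallel data a source-to-target model could then be trained to reconstruct.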
The Facebook researchers trained TransCoder on a public GitHub corpus containing over 2.8 million open source repositories, targeting translation at the function level. (In programming, functions are blocks of reusable code that are used to perform a single, related action.) After pretraining TransCoder on all source code available, the denoising auto-encoding and back-translation components were trained on functions only, alternating between the components with batches of around 6,000 tokens.
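As a rough illustration of training on token-sized batches while alternating objectives, the sketch below groups tokenized functions under a token budget and assigns batches to the two objectives in turn. The greedy grouping strategy and the function names are assumptions; the article only states that batches held around 6,000 tokens.

```python
from itertools import cycle


def batch_by_token_budget(functions, max_tokens=6000):
    """Greedily group tokenized functions so each batch stays within a token budget.

    A single function longer than the budget still gets a batch of its own.
    """
    batches, current, count = [], [], 0
    for fn in functions:
        if current and count + len(fn) > max_tokens:
            batches.append(current)
            current, count = [], 0
        current.append(fn)
        count += len(fn)
    if current:
        batches.append(current)
    return batches


def alternate_objectives(batches, objectives=("denoising", "back_translation")):
    """Pair each batch with an objective, switching objectives on every batch."""
    return list(zip(cycle(objectives), batches))


# Five toy "functions" of 2,500 tokens each.
functions = [["tok"] * 2500 for _ in range(5)]
batches = batch_by_token_budget(functions)
schedule = alternate_objectives(batches)
```

With these toy inputs the functions fall into three batches of two, two, and one function, and the schedule alternates between the two objectives from batch to batch.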
To evaluate TransCoder’s performance, the researchers extracted 852 parallel functions in C++, Java, and Python from GeeksforGeeks, an online platform that gathers coding problems and presents solutions in several programming languages. Using these, they developed a new metric — computational accuracy — that tests whether hypothesis functions generate the same outputs as a reference when given the same inputs.
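The idea behind computational accuracy can be sketched in a few lines of Python. This is an illustrative version, not the researchers' evaluation harness: the function names and test inputs are invented, and a hypothesis that raises an exception is simply counted as a failure.

```python
def same_behavior(reference, hypothesis, inputs):
    """True if the hypothesis returns the reference's output for every input."""
    for args in inputs:
        try:
            if hypothesis(*args) != reference(*args):
                return False
        except Exception:
            return False  # a crashing translation counts as incorrect
    return True


def computational_accuracy(cases):
    """Fraction of (reference, hypothesis, inputs) cases that behave identically."""
    passed = sum(same_behavior(ref, hyp, inputs) for ref, hyp, inputs in cases)
    return passed / len(cases)


def ref_max(a, b):
    return a if a > b else b


def hyp_max(a, b):
    return max(a, b)  # different text, same behavior


def ref_abs(x):
    return abs(x)


def hyp_abs_buggy(x):
    return x  # wrong for negative inputs


cases = [
    (ref_max, hyp_max, [(1, 2), (5, 3), (0, 0)]),
    (ref_abs, hyp_abs_buggy, [(3,), (-3,)]),
]
accuracy = computational_accuracy(cases)
```

In this example `hyp_max` passes even though it is not textually identical to `ref_max`, while `hyp_abs_buggy` fails on the negative input, so `accuracy` is 0.5. That contrast is the reason a behavior-based metric can diverge from exact-match scores.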
Facebook notes that while the best-performing version of TransCoder didn’t generate many functions strictly identical to the references, its translations had high computational accuracy. They attribute this to the incorporation of beam search, a method that maintains a set of partially decoded sequences, extends each of them token by token, and scores the results so the best sequences bubble to the top. The reported computational accuracy for each language pair:
- When translating from C++ to Java, 74.8% of TransCoder’s generations returned the expected outputs.
- When translating from C++ to Python, 67.2% of TransCoder’s generations returned the expected outputs.
- When translating from Java to C++, 91.6% of TransCoder’s generations returned the expected outputs.
- When translating from Python to Java, 56.1% of TransCoder’s generations returned the expected outputs.
- When translating from Python to C++, 57.8% of TransCoder’s generations returned the expected outputs.
- When translating from Java to Python, 68.7% of TransCoder’s generations returned the expected outputs.
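For readers unfamiliar with beam search, the following Python sketch shows the general technique on a toy model. It is a generic illustration rather than TransCoder's decoder: `toy_model` and its probabilities are invented, and every probability is assumed to be greater than zero.

```python
import math


def beam_search(next_token_probs, beam_size, max_len, eos="<eos>"):
    """Keep the `beam_size` best partial sequences by cumulative log-probability."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # Sequences that emit the end token stop growing.
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Hypothetical next-token distribution where the greedy choice is a trap."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix == ("a",):
        return {"x": 0.55, "y": 0.45}
    if prefix == ("b",):
        return {"z": 0.9, "x": 0.1}
    return {"<eos>": 1.0}


greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0]
beam_best = beam_search(toy_model, beam_size=2, max_len=5)[0]
```

With a beam of one, the search commits to `a` and ends with a sequence of probability 0.33. With a beam of two it keeps `b` alive long enough to find `b z`, which has probability 0.36. Keeping several candidates is what lets a decoder recover from an early choice that looked best locally.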
According to the researchers, TransCoder demonstrated an understanding of the syntax specific to each language as well as the languages’ data structures and their methods during experiments, and it correctly aligned libraries across programming languages while adapting to small modifications (like when a variable in the input was renamed). And while it wasn’t perfect (TransCoder failed to account for certain variable types during generation, for example), it outperformed frameworks that rely on rewrite rules built manually using expert knowledge.
“TransCoder can easily be generalized to any programming language, does not require any expert knowledge, and outperforms commercial solutions by a large margin,” the coauthors wrote. “Our results suggest that a lot of mistakes made by the model could easily be fixed by adding simple constraints to the decoder to ensure that the generated functions are syntactically correct, or by using dedicated architectures.”
Facebook isn’t the only organization developing code-generating AI systems. During Microsoft’s Build conference earlier this year, OpenAI demoed a model trained on GitHub repositories that uses English-language comments to generate entire functions. And two years ago, researchers at Rice University created a system — Bayou — that’s able to write its own software programs by associating “intents” behind publicly available code.
“[Programs like these are] really just trying to eliminate the minutiae of creating software,” principal scientist and director at Intel Labs Justin Gottschlich told VentureBeat in a recent interview. “[They] could help accelerate productivity … [by taking care of] bugging. [And they could] increase the number of jobs [in tech] because people who don’t have a programming background will be able to take their creative intuition and capture that via machine by these intentionality interfaces.”