Tree Variational Autoencoder for Code

  • Vadim Liventsev*
  • Sander de Bruin
  • Aki Härmä
  • Milan Petković

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › Peer-review

Abstract

Autoencoder models of source code are an emerging alternative to autoregressive large language models, with important benefits for genetic improvement of software. We hypothesize that sequence-based encoder-decoder architectures are suboptimal for source code because they ignore the grammatical structure that can be derived with an Abstract Syntax Tree parser. We propose a structured Variational Auto-Encoder based on TreeLSTM that operates directly on the AST. We train it alongside a baseline sequence VAE on a dataset of competitive programming submissions. We find that the structured model performs better in most tests, with some notable exceptions. These findings suggest that structured autoencoder models could enable more effective generation and manipulation of source code for tasks such as automated bug fixing and generative programming.
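The "grammatical structure derived with an Abstract Syntax Tree parser" that the abstract refers to can be illustrated with Python's built-in `ast` module (the snippet and its source string are illustrative; the paper's dataset and parser are not specified here):

```python
import ast

# Parse a small snippet into its Abstract Syntax Tree. A tree-structured
# encoder consumes this nested node structure directly, rather than a
# flat token sequence.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# ast.dump renders the grammatical structure explicitly:
# Module -> FunctionDef -> Return -> BinOp(Add) over two Name nodes.
print(ast.dump(tree))
```

The point of the hypothesis is visible in the dump: operator precedence, scoping, and statement nesting are explicit in the tree, whereas a sequence model must rediscover them from tokens.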
Original language: Undefined/Unknown
Pages (from-to): 30262-30273
Number of pages: 12
Journal: IEEE Access
Volume: 13
DOIs:
Publication status: Published - 2025

Keywords

  • Decoding
  • Source coding
  • Vocabulary
  • Predictive models
  • Codes
  • Representation learning
  • Vectors
  • Autoencoders
  • Automatic programming
  • Genetic programming
  • Logic gates
  • Long short-term memory
  • Recurrent neural networks
