Applying Genetic Algorithms to Automatic Code Formatting

Our traditional JetBrains Hackathon was held this summer at our St. Petersburg and Munich locations. Earlier we shared the first impressions and quick overview of the winning projects in our company blog. Today we’d like to tell you about one of the projects that came in third in more detail. Why does it deserve special attention? Because you can experience it in your everyday work in CLion!

The project team tackled a problem that appears to be essential for C++ language, which is already in its 30s. Dozens of standard code styles exist for C++, and hundreds of dialects are adopted in various big companies or development areas. Even if we take only one aspect of the code style guidelines and regulate the layout of the language lexemes through various spaces and new lines, the family is still huge.

The project team consisted of Alexei Utkin and Roman Shein, two JetBrains developers.

Alexey Utkin works on CLion (our cross-platform IDE for C and C++) taking care of the automatic formatter. This means he manages more than 200 interconnected settings for C and C++. One of Alexey’s dreams was to find a good solution to formalize the personal preferences of a developer or development team in a way that the IDE could truly understand.

Alexey has always been interested in Artificial Intelligence:
“The main challenge for our generation is to make computers think as a human being, meaning being able to make conclusions from assumptions, and make predictions based on input data,” he says.

This is how the team came to setting a task for themselves for this Hackathon: to teach the computer to “understand” code styles, i.e. identify the rules that were used for a particular code piece formatting. The restrictions, however, are pretty tight as there is a limited set of interdependent settings (code style parameters) that the computer can influence, and the time for solving the problem is limited as well.

The solution Alexey finally suggested was taken from the nature itself: a Genetic algorithm. In short, a genetic algorithm is a search heuristic that mimics the process of natural selection. This is the algorithm our own nature uses to optimize the generation by a set of given parameters. These parameters mutate and change for every individual in the generation, and the fitness of every individual is evaluated. The goal of the whole evaluation process is to reach a satisfactory fitness level.

In the particular case of code formatting the team searches for a set of parameters that leads to the given formatting:

  • A genotype (the set of properties describing a candidate solution) is the set of code style parameters.
  • A fitness function (the actual metric to measure how far the candidate is from the real solution) is a specifically calculated distance between the original code base and the same code base after reformating using the genotype set.

After several successive generations, the “evolution” process determines the fittest solution.

Alexey and Roman worked out a few optimizations on the way as well:

  • The first step of the search procedure filtered out all the options that do not influence the current code piece. (This reminds us of the newest CLion 1.2 feature called Adjust settings via a quick fix, which helps you change code style options applicable to the selected piece of code only.)
  • If one can choose the set of most important settings and find the appropriate values for them first, then the other can be found as a local minimum around this important set. This hypothesis was also used to optimize the algorithm.

Eventually, Alexey’s and Roman’s efforts were rewarded handsomely: about 40 generations later, 1-2 minutes of calculations, and voila! The team then implemented two practical things:

  • Roman implemented an interactive “code style capture” which can take any code file as input and provide a set of settings for the formatted code as an output. Later these settings can be used for formatting other files, since they are are saved into the platform code style scheme. The tool is internal for now but we consider making it available publicly later.
  • Alexey extracted and verified a range of code styles for C++ including Google, GNU, Qt and Stroustrup, and braces rules by Allman, Whitesmiths and K&R. You can use these in CLion and AppCode since their releases in August. Just go to the Editor | Code Style | C/C++ settings and select ‘Set from predefined style’ there:

That’s about it. As we said earlier, this team’s work is already implemented in CLion and you can use experience it for yourself if you use our IDE. Get your free 30-day trial.

Please let us know in the comments below if this story was interesting to you so we can bring you more in the future. Thanks!

image description