JetBrains Research Digest 2023: Volume 2

In Volume 2 of the JetBrains Research Digest, we present a new selection of the latest developments from the JetBrains Research team.

Opening of the AI for Software Engineering research partnership

As announced in early October, JetBrains and Delft University of Technology have joined forces to establish the AI for Software Engineering (AI4SE) research partnership. Through this collaboration we will work on a variety of topics, from code generation and optimizing large language models to developing tools that aid in programming education and software development. Moreover, the lab is now open for PhD applications in all five of its core topics.

Apart from this, we are happy to report the following achievements from the Machine Learning Methods in Software Engineering Lab (ML4SE), which works on improving modern software engineering tools and discovering new ways to develop and maintain code.

Detecting Code Quality Issues in Pre-Written Templates of Programming Tasks in Online Courses

On July 10, the team presented their work at the International Conference on Innovation and Technology in Computer Science Education (ITiCSE’23) in Turku, Finland. This was the ML4SE Lab’s first appearance at this prestigious conference.

Anastasiia Birillo, Yaroslav Golubev, Maria Tigina, and Timofey Bryksin from JetBrains Research, together with Elizaveta Artser from Constructor University, Hieke Keuning from Utrecht University, and Nikolay Vyahhi from Hyperskill, worked together on code quality issue detection for programming task templates in online courses.

In this research, they tackled a somewhat non-obvious problem in the area of massive open online courses: detecting code quality issues that are not introduced in students’ code but rather already existed in the templates, the pre-written code of the tasks. This work is not only academically novel but also practically useful for Hyperskill and JetBrains Academy, where the methods the authors developed were used to detect and fix such code quality issues.

To detect the issues, the team developed an approach based on searching for problems not fixed by the majority of students. After manually validating the approach, the authors conducted a pilot study on material from the JetBrains Academy platform. They discovered that 14.7% of the Java tasks had at least one issue in their templates, which the platform team was subsequently able to fix, highlighting the importance of such a code-quality check.
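To illustrate the core idea (this is a minimal sketch, not the authors’ actual implementation; the function name and data shapes are hypothetical), an issue detected in the task template that also survives in most students’ accepted solutions is likely inherited from the template rather than introduced by the students:

```python
from collections import Counter

def template_issue_candidates(template_issues, student_submissions, threshold=0.5):
    """Flag template code quality issues that most students leave unfixed.

    template_issues: set of issue IDs detected in the task template.
    student_submissions: list of sets, each holding the issue IDs
        detected in one student's accepted solution.
    An issue present in the template that survives in more than
    `threshold` of the solutions likely originates from the template.
    """
    counts = Counter()
    for issues in student_submissions:
        for issue in template_issues & issues:
            counts[issue] += 1
    n = len(student_submissions)
    return {issue for issue, c in counts.items() if c / n > threshold}
```

The intuition is that students fix issues in their own code but rarely touch pre-written template code, so an issue that persists across the majority of solutions points back to the template.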

The Effect of Perceptual Load on Performance Within IDE in People With ADHD Symptoms

On July 23, the team presented these results at the 17th International Conference on Augmented Cognition (AC’23), hosted alongside the 25th International Conference on Human-Computer Interaction (HCI’23) in Copenhagen, Denmark.

Vseslav Kasatskii from Neapolis University Pafos, together with Agnia Sergeyuk, Anastasiia Serova, Sergey Titov, and Timofey Bryksin from JetBrains Research, investigated how the perceptual load of an IDE may affect the programming performance of people with attentional difficulties.

The team asked 36 developers to complete the Barkley Deficits in Executive Functioning Scale, which indicates the presence and severity of ADHD symptoms. The participants then solved mentally active programming tasks (coding) and monotonous ones (debugging) in PyCharm under high and low perceptual load modes. The development environment was augmented with a plugin the team developed to track efficiency metrics, i.e., time, speed, and user activity.

The authors found that perceptual load did affect programmers’ efficiency. In particular, the low perceptual load mode was generally beneficial in terms of programming speed. They also discovered that the effect of perceptual load differed between those with and without ADHD symptoms, and that it was specific: depending on the efficiency measure and the ADHD symptoms involved, one or another level of perceptual load was beneficial. These results show that visual representation is an influential aspect of IDE accessibility: a well-designed environment can help boost user productivity, and developing just-in-time workstation adjustments may support the efficiency of neurodiverse workers.

Overcoming the Mental Set Effect in Programming Problem Solving

On August 21, the team participated in the Psychology of Programming Interest Group workshop (PPIG’23) in Lund, Sweden, a long-established event for both computer scientists and psychologists from around the world.

Agnia Sergeyuk, Sergey Titov, Yaroslav Golubev, and Timofey Bryksin from JetBrains Research presented their work on overcoming the mental set effect in programming problem-solving.

The study adopted a cognitive psychology perspective to investigate recurring mistakes in code resulting from the mental set (Einstellung) effect. The Einstellung effect is the tendency to approach problem-solving with a preconceived mindset, often overlooking better solutions that may be available. The study aimed to test the Einstellung effect and two mechanisms for overcoming it in the field of programming. The first intervention was changing the color scheme of the code editor to one less frequently used. The second intervention was a combination of instructing participants to “forget the previous solutions and tasks” and changing the color scheme.

The results of the experiment suggest that the tested techniques were insufficient to support overcoming the mental set, which can be attributed to the specificity of the programming domain. This study contributed to the existing literature by providing insights into creativity support during software development and, most importantly, offering a framework for experimental research in this field that will be further studied and applied by the ML4SE team.

38th IEEE/ACM International Conference on Automated Software Engineering

On September 1, the team presented two papers at the third largest conference in the area of software engineering, the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE’23) in Luxembourg, marking the lab’s fifth year in a row presenting at this conference.

Out of the BLEU: How Should We Assess Quality of the Code Generation Models?

Mikhail Evtikhiev, Egor Bogomolov, and Timofey Bryksin from JetBrains Research, together with Yaroslav Sokolov from the Machine Learning in Code Completion team at JetBrains, published their article on code generation assessment in the Journal of Systems and Software and then presented it as a journal-first paper at the International Conference on Automated Software Engineering.

In recent years, researchers have created and introduced a significant number of code generation models. Since manually evaluating every new model version is unfeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate human judgment. These metrics originate from the machine translation domain, and it is unclear whether they are valid for code generation tasks and how well they agree with human evaluation in such cases. Other metrics, such as CodeBLEU and RUBY, were developed specifically to estimate code similarity, taking the properties of source code into account; however, there are hardly any studies on their agreement with human evaluation. Despite all this, minimal differences in metric scores have been used in recent papers to claim the superiority of some code generation models over others.

In their paper, the team presented a study on the validity of six metrics – BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY – for the evaluation of code generation models. The authors conducted a study on two different code generation datasets and used human annotators to assess the quality of all models run on them. The results indicated that for the CoNaLa dataset of Python one-liners, none of the metrics could correctly emulate human judgment on which model is better with greater than 95% certainty if the difference in model scores was less than five points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least two points proved enough to claim the superiority of one model over another. These findings suggested that the ChrF metric is a better fit for evaluating code generation models than the commonly used BLEU and CodeBLEU. Nevertheless, finding a metric for code generation that closely agrees with human judgment requires further research.
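For intuition about what a character-level metric like ChrF measures, here is a toy character n-gram F-score in Python. This is only a simplified sketch of the idea, not the actual chrF implementation, which differs in details such as whitespace handling and its standard parameterization:

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Toy character n-gram F-score, a simplified take on the chrF metric.

    Averages character n-gram precision and recall over n = 1..max_n,
    then combines them into an F-score that weights recall by beta.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it works on characters rather than tokens, such a metric is less sensitive to tokenization choices, which may partly explain why ChrF fared comparatively well on code in the study.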

This work could prove to be seminal, as it encompasses a large-scale and crucial manual validation study that can be used to judge how robustly other works present their results.

From Commit Message Generation to History-Aware Commit Message Completion

Aleksandra Eliseeva, Egor Bogomolov, Yaroslav Golubev, Danny Dig, and Timofey Bryksin from JetBrains Research, together with Yaroslav Sokolov from the Machine Learning in Code Completion team at JetBrains, presented their work on commit message generation and completion as part of the main technical track of the International Conference on Automated Software Engineering.

In this paper, they explored two proposals related to the personalization of commit message generation (CMG) approaches: shifting the focus from generation to completion and using previous commit message history as additional context. As part of this study, the authors highlighted the limitations of common data filtering steps in existing datasets and built a novel dataset of 10.7 million commits across 20 programming languages. The authors conducted experiments with three state-of-the-art models from CMG research and ChatGPT. The results suggested that the completion task is easier than the generation task for existing approaches. Additionally, the experiments showed that extending the input with commit message history was very beneficial for generation but questionable for completion. Finally, the authors observed that, overall, in a simple zero-shot setting, ChatGPT showed subpar quality compared to fine-tuned models from CMG research, but it held some potential for more detailed commit messages.
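The difference between the two task formulations, and the role of history as extra context, can be sketched as follows. The input format, special tokens, and function below are purely illustrative assumptions, not the paper’s actual encoding:

```python
def build_input(diff, history=(), prefix=""):
    """Assemble model input for commit message generation or completion.

    diff:    the staged code changes to describe.
    history: previous commit messages from the same repository,
             used as additional context (most recent last).
    prefix:  the part of the message the developer has already typed;
             empty for pure generation, non-empty for completion.
    """
    parts = [f"<commit_msg> {msg}" for msg in history]
    parts.append(f"<diff> {diff}")
    if prefix:
        parts.append(f"<prefix> {prefix}")  # the model continues from here
    return "\n".join(parts)
```

In this framing, generation asks the model to produce the whole message from the diff (and optionally the history), while completion additionally conditions on the typed prefix, which is what makes the task easier for existing approaches.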

This work is a crucial step forward for the field, as it shares a novel, diverse dataset that will be a valuable resource for the community.

That’s it for now. Stay tuned for future updates from our research teams in the next digest! 

If you have any questions or comments, please contact us at