
Curiosity-Driven Researchers: Between Industry and Academia, Part 2

This is the second part of the interview with Timofey Bryksin, Head of the Research Lab in Machine Learning Methods in Software Engineering at JetBrains, in which he shares his thoughts on the future of IDEs, LLM tools, and the changing role of the developer. Part one can be found here.

Timofey Bryksin, Head of the Research Lab in Machine Learning Methods in Software Engineering at JetBrains

The latest publication by your lab is focused on what natural language text editors could learn from IDEs. Can you give me an example?

This article is more of a call to the industry than a report on results. We’ve got a lot of experience with text editing, so we know the ins and outs of working with text. The cognitive load involved in software development is significant: programming is not just about typing, it’s about organizing thoughts. IDEs are pretty good at reducing this cognitive load through features like syntax highlighting and formatting conventions that let you format someone else’s code uniformly. IDEs understand the underlying structure of a text, offering insights into connections that might otherwise be missed. Our notion is that this accumulated knowledge could be beneficial beyond code, in other types of text editors. In natural language, you could identify and highlight structures or generate summaries on the fly. The “how” is still an open question, but we believe it’s worth contemplating further. Studies of the cognitive load involved in natural language tasks have indeed been making progress, and strategies similar to the ones IDEs use for code, such as syntax highlighting, naming conventions, and context-based features, could be useful there too.

We’ve published a few minor pieces in this area and believe this direction is worth further exploration. Especially if you keep large language models in mind, it’s clear that we now need to rethink developer tools and how to work with them. For example, GitHub Copilot X demonstrated a chat where you could ask the IDE to find errors in code. It’s possible that a voice interface will be added to this in the future.

Could the IDEs of the future be voice-controlled?

Two years ago, I had surgery and couldn’t use my arm for the entire summer. Typing with just one hand got tiring pretty fast, so I installed a few tools that could turn my speech into text. This allowed me to chat in Slack and type things using just my voice. These tools more or less got the job done and were pretty handy. 

But as soon as my arm healed, I went right back to typing with both hands and stopped using voice assistants altogether. It’s just that typing and punctuating felt quicker and more familiar. Remember when smart speakers and voice assistants first came out? They were supposed to engage in conversations with you and become your ultimate assistants. Yet, nowadays, these devices mostly just tell you about the weather and set alarms. However, when it comes to IDEs, voice input has potential, as certain commands can be given to an assistant vocally. “Find errors here,” “refactor this,” “extract this into a separate function” – these commands closely resemble natural language. People these days are generally used to sending voice messages. Some even prefer it. Perhaps the upcoming generation of developers will have their own way of doing things, and a few generations down the line, using a keyboard might appear outdated.

What else is known about the IDE of the future?

I believe we’re in for a complete reevaluation of how we work, taking into account these new assistants and language models. If the role of a developer starts shifting from mere coding to more intricate high-level tasks like design, requirement outlining, component development, and system structuring, it’s going to prompt us to revamp the tools we rely on.

How will large language models affect the software development industry?

Generalization from typical experience works well for handling routine tasks, but it doesn’t generate new knowledge. Therefore, LLMs aren’t likely to affect tasks that genuinely require intellectual effort. They’re more likely to be used for routine mechanical tasks, which will (and should) be automated. This means that writing landing pages for pet shops will surely be delegated to GPT models. However, tasks like creating intelligent systems and developing new algorithms still require innovative thinking, something models are currently incapable of. Another big practical question is how to build a development workflow around LLM tools. At JetBrains, we’re actively brainstorming about the developer tools of the future and creating prototypes, and we’ll soon begin sharing our progress.

What tools do you personally use?

We don’t rely on any extraordinary solutions, as most of our projects usually culminate in prototypes, and the usual version control and debugging systems suffice for their development. 

However, we do need specialized tools to work with code and different text fragments, to collect and analyze data, and to create datasets. That’s why we develop our own libraries. For example, the problem of building a program dependency graph for Python code still hasn’t been solved, and there is no fully functional library for it yet. This is mainly due to the complexity of implementing this kind of static analysis for dynamically typed languages. So, we’re working on it ourselves.
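To give a feel for why this is hard, here is a toy sketch – in no way the lab’s actual library, and the snippet it analyzes is made up – showing that even rough data-flow edges extracted with Python’s built-in ast module run into facts that simply cannot be resolved before runtime:

```python
# A toy sketch (not the lab's tooling): pull rough "variable read" edges out of
# one function with the ast module, to show why a full program dependency
# graph is hard for a dynamically typed language.
import ast

SRC = """
def total(prices, handler):
    s = 0
    for p in prices:
        s += p            # data dependency: s depends on s and p
    handler(s)            # which function is called? unknown until runtime
    return s
"""

def rough_data_deps(source: str):
    tree = ast.parse(source)
    deps = []  # (line number, variable read on that line)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            deps.append((node.lineno, node.id))
    return deps

print(rough_data_deps(SRC))
# What this cannot tell us statically:
#  * the element type of `prices`, so `s += p` may add numbers, concatenate, or fail;
#  * the target of `handler(s)` – any callable can be passed in at runtime;
# which is exactly why a complete dependency graph for Python remains an open problem.
```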

Apart from data collection, we’ve been working for many years on an open-source library called KInference. It addresses the challenge of integrating neural models, which are usually trained in Python, with IDEs – specifically, executing these models in a JVM environment (Java, Kotlin, Scala – it doesn’t matter). There is a whole range of libraries out there that serve this purpose, but none fit our requirements for speed, memory, and stability in a cross-platform environment. So we’ve developed our own fast and lightweight library that performs ONNX model inference in pure Kotlin. It’s implemented quite efficiently, and our developers delve deeply into optimizing memory and CPU cache usage to ensure fast computations. This allows us to bring complex neural models to any JVM-based language, which means we can create plugins for our IDEs and other cool stuff. For instance, some models in the Grazie platform are built on this library. We’re actively working on enhancing this tool and expanding its support for other neural operators.
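As a rough illustration of the hand-off described here – the model, file name, and tensor shapes below are invented for the example – the Python side of such a workflow typically ends with an ONNX export, which is the artifact a JVM-side ONNX runtime like KInference can then execute:

```python
# A minimal, hypothetical sketch: train a small model in Python, export it to
# ONNX, and hand the .onnx file over to a JVM-side runtime for in-IDE inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

example_input = torch.randn(1, 128)
torch.onnx.export(
    model,
    example_input,
    "completion_ranker.onnx",  # hypothetical file name
    input_names=["features"],
    output_names=["scores"],
    dynamic_axes={"features": {0: "batch"}, "scores": {0: "batch"}},
)
# The exported graph consists of standard ONNX operators, which is what makes it
# possible for a pure-Kotlin runtime to execute it without shipping a Python interpreter.
```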

What’s the largest model you’ve trained?

The largest one we’ve trained for research purposes had around 200–300 million parameters. However, we are also focusing on models that can run locally on users’ computers and work with their local data. This approach eliminates privacy concerns, GDPR in particular, since we don’t send any data externally, which grants us a good amount of freedom. Whenever there’s a need to send data elsewhere, privacy becomes a significant concern.

What libraries do you use to optimize the performance of models in your product?

It’s less about libraries and more about approaches. During training, we try to apply all the techniques that the community has come up with: distillation, size reduction, and more. After that, a significant portion of the models run on KInference, which executes the models quite efficiently. This is very helpful for both speeding up and compressing models.
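For reference, here is what the distillation step usually looks like in the community’s standard formulation – a generic sketch, not a specific JetBrains recipe:

```python
# Generic knowledge distillation loss: the small "student" learns to match the
# softened output distribution of a larger "teacher", plus the usual hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```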

Why does speed still matter in 2023? Is the issue of limited computational power still relevant?

As long as we’re not talking about super-sized models, it’s not a big issue on servers anymore. However, when you’re using a laptop resting on your lap and heating up the room, it becomes important. Developer tools can be pretty resource-hungry, especially when working on large codebases. Take, for example, more advanced IDE features like refactorings – they usually require indexing your entire project. For a model integrated into an IDE, the response time should ideally be in the range of tens of milliseconds, and its memory usage should be minimal – just a few megabytes. Of course, you could introduce heavier models, but you must have compelling reasons for doing so, for example, if the goal is to implement a feature that really “wows” users. Either way, you always have to be thoughtful about resource consumption.

How do you choose between speed, accuracy, and completeness?

When it comes to in-IDE AI assistants (for example, a tool that finds bugs in an active project and suggests fixes right in the code editor), we usually lean towards accuracy. A model with 99.9% accuracy that takes two seconds to run is often preferred over a model that operates instantly but has 95% accuracy. This is because, in 1 out of 20 cases, the latter might provide incorrect suggestions, leading to frustration. We want to make sure that the assistance doesn’t annoy developers to the point where they’d rather take the risk of not using it. Avoiding false positives is crucial, because otherwise the developer has to waste time addressing them and figuring out that they are indeed false positives, and this is bound to bring about negative emotions.

The trade-off between speed and accuracy can sometimes be resolved through engineering hacks or precomputation. Furthermore, the issue doesn’t always have to be solved in real time: When written code is analyzed, it’s acceptable for the developer to get suggestions slightly later. The suggestions don’t have to show up immediately.

Of course, there are cases where it’s critically important to detect every potential problem. In these cases, it’s not as concerning if developers spend some extra time dealing with false positives. This often happens, for instance, when searching for security vulnerabilities. However, such tasks are not very common in our projects.

What else needs to be considered when developing a new feature?

It’s important to consider how the assistant integrates into the developer’s workflow. I don’t think the community focuses enough on these concerns. People often come up with new approaches and tools, but it’s important to understand how these tools fit into a typical developer’s work. You hear it every day: “I tried this feature, it showed me something weird three times, so I turned it off.”

What approach seems right in this case?

The majority of tools require the user to take quite a bit of initiative. They have to remember that the tool is available in the first place (which many developers tend to forget), consciously make an effort to run it, and then look through the results it produces. It seems to me that tools ought to be more proactive. But there’s a problem: If the tool constantly suggests things to a user, it disrupts their workflow. The question, then, is how to ensure that the assistant recognizes when to intervene. The suggestion should come while the context is still fresh in the developer’s mind, but it also shouldn’t be a distraction that interferes with their ongoing tasks. These are important questions that we at JetBrains are exploring.

Are you happy with the programming languages you use for data science?

There’s no universal definition of the perfect language; it’s all about personal preferences and the tasks you’re working on. People tend to work with languages that have appropriate tooling. That’s what happened with Python: It gained popularity due to its robust tooling. What adds to its appeal is its simplicity. You don’t need to spend a long time learning it the way you do with C++, and you don’t have to understand how memory works. Of course, understanding those things is vital, but Python lets you get started with a relatively easy learning curve. We use Python for model training, but we could’ve opted for a different path and gone with Kotlin.

So, having one language that suits everyone and is universally adopted seems unlikely – unless neural networks somehow invent their own language and start writing code in it. But that’s a whole different story.

You’ve been observing how people write code for quite some time. How has software development changed over the years?

The major trend is the evolution of tooling that automates various tasks.

In enterprise development, there was a preference for complex solutions, where the system infrastructure dealt with a lot of intricate logic. Such systems were difficult not only to develop but also to understand. Around the mid-2000s, with the rise of Agile approaches, there was a shift in mindset: People began moving away from these monolithic constructs that tried to do everything at once. Developers started leaning towards embedding the logic within the code they write, rather than solely relying on tools and libraries. 

At the same time, there has been an increase in the number of tools that can address specific smaller tasks but not the entire problem. People are now trying to combine these tools together. Some are used for development, others for deployment, testing, profiling, versioning, and more. 

Another notable shift is the recognition of the importance of human involvement in the process. Thankfully, we’ve moved away from the era of colossal and bureaucratic processes that relied on powerful tools and operated with minimal human interaction. It’s now clear that systems are created by humans, and tools need to be user-friendly, prioritizing people’s convenience, even if it comes at the expense of predictability.

Can you give some tips for improving data quality?

Data quality is a major concern that people are only just starting to recognize. The principle of “garbage in – garbage out” is well known. However, good benchmarks and reliable datasets are quite scarce, particularly in our relatively young field. Authors of research papers often have to create their own datasets. But there have been instances where other researchers examined these datasets after the paper was published and discovered that they weren’t of high quality. As a reviewer for several top conferences and journals in our field, I often notice that this aspect doesn’t get enough attention. Many papers get rejected precisely because the researchers missed the mark during the data collection phase, and thus whatever they did afterwards is irrelevant.

How can self-monitoring help with data quality?

First, it’s essential to check for both explicit and implicit duplicates. When a model encounters loads of identical code, it tends to think that’s how it’s supposed to be. We often use a clone detection tool for this, moving beyond simply filtering out GitHub forks. 
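For illustration only – the lab relies on a dedicated clone detection tool, but the underlying idea of catching implicit duplicates can be sketched in a few lines:

```python
# A simplified near-duplicate check: normalize the code, shingle it into token
# n-grams, and compare snippets by Jaccard similarity. Real clone detection is
# far more sophisticated; this only illustrates the principle.
import re

def shingles(code: str, n: int = 5) -> set:
    tokens = re.findall(r"\w+|\S", code.lower())   # crude tokenizer
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def is_near_duplicate(code_a: str, code_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(code_a), shingles(code_b)) >= threshold
```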

Second, for many tasks, the timeline is a very important part of the dataset. It matters a lot, as data can sometimes sneak from the training set into the test set, even if the model isn’t supposed to see it beforehand. Therefore, it’s crucial to pay close attention to the chronological order, making sure that the training data precedes the test set. 
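A minimal sketch of such a time-aware split (the field name is hypothetical):

```python
# Everything the model trains on must precede everything it is evaluated on.
from datetime import datetime

def chronological_split(samples, cutoff: datetime):
    train = [s for s in samples if s["committed_at"] < cutoff]
    test = [s for s in samples if s["committed_at"] >= cutoff]
    return train, test

# Compare this with a random split, where a fix committed in 2023 can land in
# the training set while the bug it fixes sits in the test set.
```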

Third, you need to view things from a broad perspective, carefully examine any biases specific to your task, and look for anything that could influence how well the model transfers to real-world scenarios. For example, in a recent project, we compiled a dataset for autocompletion in commit messages. There are datasets available for this purpose, but one of the most widely used only retains the first hundred characters of each message and excludes messages that don’t start with a verb. Consequently, all messages in this dataset begin with a verb and follow a highly typical structure. As a result, any other type of message comes as a surprise to a model trained on it.
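To make the bias concrete, here is a schematic reconstruction of that kind of preprocessing – not the actual dataset’s pipeline, just the shape of the filter it describes:

```python
# Keep only messages that start with a known verb, truncated to 100 characters.
# The verb list here is a tiny stand-in; the point is what the filter throws away.
COMMON_VERBS = {"add", "fix", "update", "remove", "refactor", "improve"}

def keep(message: str) -> bool:
    stripped = message.strip()
    first_word = stripped.split(maxsplit=1)[0].lower() if stripped else ""
    return first_word in COMMON_VERBS

def preprocess(messages):
    return [m[:100] for m in messages if keep(m)]

# A message like "WIP: rework the parser, see #123" is silently dropped, so a
# model trained on the result never learns that such messages exist.
```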

Here’s another case: code from GitHub. While it’s certainly valuable for research purposes, there’s a bias toward open-source projects. Who’s to say that the way code is written inside a large corporation mirrors what’s on GitHub? Corporations have their own methods, data structures, and libraries. Also, GitHub currently hosts a lot of repositories that are not very representative of how people write code in our industry: homework assignments, tutorials, unmaintained projects, and so on. As a result, there often appears to be a lot of code, but after careful review, it may not be as abundant as it seems.

____

On that note, we’ll conclude this installment of our series of interviews with Timofey Bryksin. Stay tuned for the conclusion of the series, where we’ll talk about the differences between academic and industrial research and what makes an interesting research outcome. And if you haven’t checked it out already, don’t miss our first interview, which covers what the ML4SE Lab does, how projects are selected and evaluated, and important things to consider when hiring researchers. Thank you, Timofey, for sharing your insights.