Company

Visit jetbrains.com

Backstage

Code Completion, Episode 2: Why Machine Learning?

Roman Poborchiy

Why not order by what’s logical? I don’t understand.
From a bug report

In episode 1, we learned about the principal components of the code completion system and discussed its usage patterns and quality requirements. Today, let’s look into what reasons we have to employ machine learning aside from just following the hype.

It’s a tough decision to make to replace “code that works” with a machine-generated binary that uses extra memory, may slow down the performance, and is difficult to interpret.

Previously, we used a set of heuristics. They worked well unless they conflicted with each other. Let’s review the heuristics we had and some of the conflicts between them to illustrate the difficulties we faced with deterministic sorting.

Prefix match

This is the most significant heuristic. If the user has typed some letters, we need to honor that and prefer the suggestions containing those letters. Even a rule as simple as this may appear controversial.

The popup with suggestions supports CamelCase and lowerCamelCase. You can only type the uppercase letters to get the full name proposed. This looks cool until you have to choose between the exact match and the CamelCase matches. Consider the following code snippet:

static class Benefit { String getName(){return "Name";} } public static void main(String[] args) { Benefit benefit = new Benefit(); System.out.println(benefit.na| ⇐ caret }

Benefit.getName() seems to be the only match, right? Wrong!

Here, the match at the beginning of the name appears to have higher priority than the match in the middle. Some people even argued that we should ignore the is/get/set prefixes and consider “na” the beginning match for getName(), but we won’t go that far. Let’s review two other arguments why notifyAll() shouldn’t be on top:

It is a void method; inserting it will make the code invalid.
It is a method of the Object class, defined far away from the location we’re currently editing.

Correctness

We already know that correctness is not critical. It is a common pattern to write something that won’t compile and refactor later. We have a different case here, since calling notifyAll() as an argument in println() is hard to justify with future refactoring. Unfortunately, the heuristic is not intelligent enough to tell incorrect from blatantly incorrect. Let’s look into the second argument.

Logical proximity

There is, indeed, a relevant sorting factor. The IDE creates a tree-like structure based on the project code and uses it for all the operations, including code completion. The proximity factor treats the names differently depending on where in this tree-like structure they are defined:

In the same method (local variables)
In the same class
In the same file
In the same project
The core libraries of your programming language

All else being equal, the closer the item definition is to the current position, the better. However, there are always cases where it works differently. Imagine you are writing an editor for source code like we do. You start typing “Color” – which of the two would you prefer?

We have red-black trees in the editor, but by “Color,” we usually mean Color from java.awt package. What can help us prioritize it in this selection? There are several things that come to mind:

We can use the frequency of the symbol in the current project or file.
We can build a transitive type closure, taking all locally used types and inspecting their member types, and so enriching the stats from the above.
We can use the frequency of the standard library symbol in all of the available open-source repositories using the same programming language.
We can remember the recent selections of the user in similar situations.

Statistics are good, but it’s tricky to pick threshold values at which the external library symbol starts beating the local one. The user’s previous selection, on the other hand, has more quirks associated with it.

Previous selections

First of all, if we know that you sometimes select one option and sometimes the other, we will not change the order, at least not immediately. We want to be consistent, after all. That consistency poses a usability issue our customers have reported several times. It goes like this.

The user creates a construction for the first time that later becomes typical in their code. If we offer the wrong selection at that point and the user accepts it by mistake without using the “undo” command immediately, then they are stuck with this selection for some time.

Users are also different in their habits, making the correct selection of previously used variables more difficult. Look at the following example:

Course course1 = new Course(); Course course2 = new Course(); course1.setName("Test1"); cou| ⇐ caret

How should we proceed? Should we offer course1 or course2 at the top? Some people at that point continue setting various attributes of course1, while others will most likely set the name of the second course:

course2.setName(“Test2”);

This example may seem artificial, but in real projects, that’s the kind of decision the completion system has to make all the time. Let’s look at other factors that are involved in these decisions.

Length

What’s wrong with the following example?

Side note: golang developers say those “Impl” names make their eyes bleed. In a golang repository, they use this pattern to detect a developer suffering from a severe case of Java infection. They want to cure every such case before it contaminates the whole repository with abstract factories.

Even in the Java world, where it is typical to spawn “Impl” classes, “Impl2” looks a bit creepy. But let’s be honest: who hasn’t had the temptation to write “Impl2” in a Java program?

Let’s get to the completion part. You never (well, almost never) want to manipulate the implementation classes directly in your business logic. The name of an abstract class or an interface parent is the most likely selection. The completion system needs to put this name at the top of the list. The heuristic’s name is “lift shorter.” If one valid name is the prefix of the other, the shorter one takes precedence:

Similar logic applies to other types of tokens, including variable names and method calls:

It isn’t as straightforward as with class names, but it generally seems to work. Now, let’s add a twist to the story:

@Deprecated public String suggestVariableName() { … }

Is there nobody who has ever tried to call a deprecated method? Still, we probably don’t need suggestVariableName() on top, except for, you know, when it’s the only logical suggestion. You already know the definition of “logical.”

There’s more to the shorter names, though. In the cases above, the matching prefixes all relate to the same domain, but the match can also be purely incidental. If you happen to name a variable nullableSomething, you (probably) don’t want null to be the top suggestion every time you use it.

To avoid null being the top selection, we need to treat different types of symbols differently, like keywords, constants, variables, methods, classes, and whatever else your programming language has. Do you see where this is going?

Machine learning

Eventually, we found ourselves writing endless if...else-statements, making the code more complex and hard to maintain. With each change, it took more time to tweak the logic to fix a specific problem without breaking the whole house of cards. Worse, we often felt tempted to add more switches to the (already cluttered) completion settings panel.

Once we understood that we implemented very complex decision-making logic that was hard to maintain and interpret, it became apparent that we had to delegate this job. Taking into account all these many factors, it sounds like it would be a perfect task for machine learning. For complicated systems, it can produce better decisions than any human can.

To enable machine learning, we start by building the training data set. We will talk about that in the next episode. Stay tuned!

Big Data World, Part 5: CAP Theorem Planned Downtime June 12, 2021: JetBrains Account and JetBrains Stores

Discover more

In the previous article devoted to full-line code completion, we looked into the vocabulary that the neural net of our full line completion plugin uses for Python. However, just having 16384 tokens like self., or, s.append(, return value, and others described in the article is not enough to generate…

This is a simplified image of a computer display. It also includes the following text: Looking at Python through the eyes of a neural net.

The JetBrains full line code completion plugin for Python is now available as a public beta. We would like to talk about some of the technologies and algorithms used to create the plugin and share statistics about Python programming that we’ve collected in the process. What is “full line code co…

A header picture with a robot and text saying "Code completion, episode 4: model training."

The previous articles from the series covered the following topics: In the first episode, we discussed general code completion scenarios.The second episode was devoted to the difficulties of heuristics-based implementation and explaining the necessity of machine learning.In the third episode, we …

As we discovered in the second installment of this series, a modern code completion system needs machine learning to rank the suggestions most effectively. Machine learning has one thing in common with human learning: it requires data to extract knowledge. There are many aspects to that process. …

Company

Code Completion, Episode 2: Why Machine Learning?

Prefix match

Correctness

Logical proximity

Previous selections

Length

Machine learning

Discover more

BeamSearch in code generation

Looking at Python through the eyes of a neural net

Code Completion, Episode 4: Model Training

Code Completion, Episode 3: Where Is the Dataset?

Company

Code Completion, Episode 2: Why Machine Learning?

Prefix match

Correctness

Logical proximity

Previous selections

Length

Machine learning

Subscribe to JetBrains Blog updates

Discover more

BeamSearch in code generation

Looking at Python through the eyes of a neural net

Code Completion, Episode 4: Model Training

Code Completion, Episode 3: Where Is the Dataset?