Code Completion, Episode 2: Why Machine Learning?
Why not order by what’s logical? I don’t understand.
From a bug report
In episode 1, we learned about the principal components of the code completion system and discussed its usage patterns and quality requirements. Today, let’s look into what reasons we have to employ machine learning aside from just following the hype.
It’s a tough decision to make to replace “code that works” with a machine-generated binary that uses extra memory, may slow down the performance, and is difficult to interpret.
Previously, we used a set of heuristics. They worked well unless they conflicted with each other. Let’s review the heuristics we had and some of the conflicts between them to illustrate the difficulties we faced with deterministic sorting.
Prefix match
This is the most significant heuristic. If the user has typed some letters, we need to honor that and prefer the suggestions containing those letters. Even a rule as simple as this may appear controversial.
The popup with suggestions supports CamelCase and lowerCamelCase. You can only type the uppercase letters to get the full name proposed. This looks cool until you have to choose between the exact match and the CamelCase matches. Consider the following code snippet:
static class Benefit {
String getName(){return "Name";}
}
public static void main(String[] args) {
Benefit benefit = new Benefit();
System.out.println(benefit.na| ⇐ caret
}
Benefit.getName()
seems to be the only match, right? Wrong!
Here, the match at the beginning of the name appears to have higher priority than the match in the middle. Some people even argued that we should ignore the is/get/set prefixes and consider “na” the beginning match for getName()
, but we won’t go that far. Let’s review two other arguments why notifyAll()
shouldn’t be on top:
- It is a void method; inserting it will make the code invalid.
- It is a method of the
Object
class, defined far away from the location we’re currently editing.
Correctness
We already know that correctness is not critical. It is a common pattern to write something that won’t compile and refactor later. We have a different case here, since calling notifyAll()
as an argument in println()
is hard to justify with future refactoring. Unfortunately, the heuristic is not intelligent enough to tell incorrect from blatantly incorrect. Let’s look into the second argument.
Logical proximity
There is, indeed, a relevant sorting factor. The IDE creates a tree-like structure based on the project code and uses it for all the operations, including code completion. The proximity factor treats the names differently depending on where in this tree-like structure they are defined:
- In the same method (local variables)
- In the same class
- In the same file
- In the same project
- The core libraries of your programming language
All else being equal, the closer the item definition is to the current position, the better. However, there are always cases where it works differently. Imagine you are writing an editor for source code like we do. You start typing “Color
” – which of the two would you prefer?
We have red-black trees in the editor, but by “Color
,” we usually mean Color
from java.awt
package. What can help us prioritize it in this selection? There are several things that come to mind:
- We can use the frequency of the symbol in the current project or file.
- We can build a transitive type closure, taking all locally used types and inspecting their member types, and so enriching the stats from the above.
- We can use the frequency of the standard library symbol in all of the available open-source repositories using the same programming language.
- We can remember the recent selections of the user in similar situations.
Statistics are good, but it’s tricky to pick threshold values at which the external library symbol starts beating the local one. The user’s previous selection, on the other hand, has more quirks associated with it.
Previous selections
First of all, if we know that you sometimes select one option and sometimes the other, we will not change the order, at least not immediately. We want to be consistent, after all. That consistency poses a usability issue our customers have reported several times. It goes like this.
The user creates a construction for the first time that later becomes typical in their code. If we offer the wrong selection at that point and the user accepts it by mistake without using the “undo” command immediately, then they are stuck with this selection for some time.
Users are also different in their habits, making the correct selection of previously used variables more difficult. Look at the following example:
Course course1 = new Course();
Course course2 = new Course();
course1.setName("Test1");
cou| ⇐ caret
How should we proceed? Should we offer course1
or course2
at the top? Some people at that point continue setting various attributes of course1
, while others will most likely set the name of the second course:
course2.setName(“Test2”);
This example may seem artificial, but in real projects, that’s the kind of decision the completion system has to make all the time. Let’s look at other factors that are involved in these decisions.
Length
What’s wrong with the following example?
Side note: golang developers say those “Impl” names make their eyes bleed. In a golang repository, they use this pattern to detect a developer suffering from a severe case of Java infection. They want to cure every such case before it contaminates the whole repository with abstract factories.
Even in the Java world, where it is typical to spawn “Impl” classes, “Impl2” looks a bit creepy. But let’s be honest: who hasn’t had the temptation to write “Impl2” in a Java program?
Let’s get to the completion part. You never (well, almost never) want to manipulate the implementation classes directly in your business logic. The name of an abstract class or an interface parent is the most likely selection. The completion system needs to put this name at the top of the list. The heuristic’s name is “lift shorter.” If one valid name is the prefix of the other, the shorter one takes precedence:
Similar logic applies to other types of tokens, including variable names and method calls:
It isn’t as straightforward as with class names, but it generally seems to work. Now, let’s add a twist to the story:
@Deprecated
public String suggestVariableName() { … }
Is there nobody who has ever tried to call a deprecated method? Still, we probably don’t need suggestVariableName()
on top, except for, you know, when it’s the only logical suggestion. You already know the definition of “logical.”
There’s more to the shorter names, though. In the cases above, the matching prefixes all relate to the same domain, but the match can also be purely incidental. If you happen to name a variable nullableSomething
, you (probably) don’t want null
to be the top suggestion every time you use it.
To avoid null
being the top selection, we need to treat different types of symbols differently, like keywords, constants, variables, methods, classes, and whatever else your programming language has. Do you see where this is going?
Machine learning
Eventually, we found ourselves writing endless if...else
-statements, making the code more complex and hard to maintain. With each change, it took more time to tweak the logic to fix a specific problem without breaking the whole house of cards. Worse, we often felt tempted to add more switches to the (already cluttered) completion settings panel.
Once we understood that we implemented very complex decision-making logic that was hard to maintain and interpret, it became apparent that we had to delegate this job. Taking into account all these many factors, it sounds like it would be a perfect task for machine learning. For complicated systems, it can produce better decisions than any human can.
To enable machine learning, we start by building the training data set. We will talk about that in the next episode. Stay tuned!