Code Completion, Episode 3: Where Is the Dataset?

As we discovered in the second installment of this series, a modern code completion system needs machine learning to rank the suggestions most effectively.

Machine learning has one thing in common with human learning: it requires data to extract knowledge. There are many aspects to that process. We use so-called supervised learning for code completion, and that involves feeding the algorithm a set of example problems for which the correct action is known. The algorithm then finds common patterns and dependencies and learns to take the proper actions even in situations it hasn’t already seen.

Taken together, these examples form the dataset. The quality and volume of the data are critical for the success of machine learning.

Today we will discuss:

  • How we build our training dataset
  • How we protect your code
  • How these two topics are related

Naïve idea: open-source dataset

What should an example problem in our training set look like? We immediately think of the following.

There must be a code snippet that is missing part of a variable usage, a method call, a field access, or something similar, and the completion system must predict the ending for that word.

There are thousands of open-source projects with millions of lines of code in just about any conceivable programming language. Their repositories contain a vast array of data. At first glance, there should be enough data to cut as many required code snippets as we need.

A naïve data set creation algorithm may look like this:

while (needMoreData()) {
    repository = randomRepository();
    file = randomFile(repository);
    position = randomPosition(file);
    token = getToken(file, position);

    // Braces, spaces, punctuation and such are useless
    if token.isUseless()
    preparedFile = file.deleteCharacters(position, endOf(token));
    generateDataPoint(preparedFile, position, token);

In this case, our algorithm removes the tail of the token itself, but other options are available. For example, we could delete everything between the selected position and the end of the method.

Why the naïve algorithm is insufficient

While it is straightforward, the approach above has a problem. It provides us with a synthetic data set that differs from real-world usage. Let’s look at some of the most noteworthy differences.

The development process is more than the commit history

Developers commit their code to the repository only when the code is in some sense “complete.” The following attributes of “complete” code come to mind:

  • It compiles.
  • It passes unit tests.
  • It has clear formatting.
  • It doesn’t have any duplicate blocks.
  • It doesn’t contain any debug prints. In terms of lines of code, this is the most frequent edit.

Most non-complete code states are absent in the commit history, and we don’t have access to them.

Because of the differences between complete code and incomplete code, it is hard to mimic the distribution of invocation points. The proportions of method calls, variables, and field accesses in our data set will almost inevitably be different from what appears in the IDEs of our users.

User behavior features

However, the biggest issue with the synthetic data set is the absence of features related to user behavior. Sometimes it takes less than a second for a user to select an item from the popup, while in other cases they may study the popup for half a minute before deciding.

Is this information valuable for the ranking model? It surely is! And with the synthetic data set, we don’t get any user behavior features.

Therefore, we need the actual usage data in order to cover all the relevant scenarios.

The actual usage data

In a perfect world, we would like to take code snapshots at every completion invocation. If we had the exact state of your project and knew what (if anything) you selected from the popup, imagine what a powerful ML-based completion system we could create.

And imagine how unhappy your Chief Information Security Officer would be.

At this point, the issue of data protection comes up, usually in the form of the following question:

— Do you look at my code?

No, we don’t. We never log what you write in the editor. In fact, we go the extra mile to avoid logging your code accidentally.

All the fields in our logs are either enumerable or numeric. They contain neither binary blobs, free-form text, hashes of your code, libraries you use, nor anything else of that sort. The server parses everything we send and discards any entry that includes unrecognized information. Our logging policy is that everything is forbidden unless it is explicitly permitted, and many levels of reviews are required in order to change the list of what is allowed.

We do everything we can to avoid leaking your code to our servers. However, we still need data to train our models. This is a manifestation of an eternal software development problem: the conflict between ease of use and security.

Let’s see how we solve this conflict in the particular case of code completion logging.

Completion-related log data

Let’s say your primary language is Python, and you are trying to implement the statistical criterion named bootstrap. The function might start as follows:

def bootstrap(datapoints, samples = 10000, confidence = 0.95):
  stats = []
  for i in range(samples):
      subsample = random.c| ⇐ caret here 

The next logical call is random.choices(), as you need to create samples of the same size as the original array, with repetitions. That’s exactly what random.choices() does. What can we conclude from looking at this situation?

  • (enum) The popup is invoked automatically (as opposed to the user pressing Ctrl + Space).
  • (numeric) The indentation level is 2. This feature is critical in Python but makes sense for all languages.
  • (numeric) The prefix length is 1 character (but we don’t write “c” anywhere!).
  • (enum) The parent type for the caret location in the syntax tree is a cycle.
  • (enum) The invocation happened in the middle of a line.
  • (enum) The invocation happened after a dot.

This is the log data that we collect for each completion invocation before the user even has a chance to interact with it. Within 150 milliseconds (hopefully), the popup appears. It probably looks something like this:

A screenshot with a code completion popup. The popup contains seven lines with the functions from python module named random. The items are named choice, choices, _ceil, _cos, _accumulate, _acos, and _bisect.

It takes you about a second to figure out that the desired item is in the second place. You scroll down and select it. Here’s what we will remember about this action:

  • (numeric) The delay before the user acted was 1100 milliseconds.
  • (enum) The user action was to select an item from the list (success).
  • (numeric) The selected position was 1 (we count from 0, just like programmers do).

There are also things that we want to remember about each item in the popup. Let’s review the data that we send using the target random.choices() as an example:

  • (numeric) The length of the suggestion is 7 characters.
  • (numeric) We matched 1 character of the prefix.
  • (enum) The match location is the beginning of the suggestion. For _bisect, we would log a middle match.
  • (enum) The suggestion is a token from a standard library.

Of course, we write other features, as well. These are just examples of how we store a description of the code instead of the code itself.

Surprisingly important data

The concept of feature strength is essential in machine learning. Using some data yields a huge boost in quality, while the effect of other data may be barely noticeable.

Here are a couple of things we didn’t expect to be as strong as they are. They were actually rather entertaining discoveries.


  • (numeric) The position of the suggestion according to the heuristic ranking.

Earlier, we tried to eliminate heuristic ranking, but doing so disrupts the nice grouping of the suggestions by their type, like in the situation below:

A screenshot of a code completion popup for python language. The screenshot demonstrates how inconvenient it is when the suggestions of different types are mixed.

In this Python example, everything got mixed up, making the suggestions appear inaccurate.

Our users prefer for the list to be more consistent. Therefore, we only re-rank the top 5 suggestions if they have a high score according to our models. Everything below remains grouped as before, so it is easier to find the desired token if it scores low. We use the ML-based ranking only to sort the items within the groups.

Therefore, the old deterministic ranking algorithm appears to be a very useful factor for our ML model.

Don’t repeat yourself

Good developers follow the “don’t repeat yourself” principle and diligently refactor their code to extract the repeating blocks.

Unfortunately, either the above statement is incorrect, or not all the developers are good. Otherwise, it is impossible to explain why the features about the repeating code are the strongest ones in our feature set:

  • (enum) The suggested token is already present in the file 
  • (enum) The suggested token is already present in the file, and the surrounding code context is similar

We determine the similarity of the surrounding context by matching the n-grams –  the sequences of tokens (variables, method calls, operators, and so forth). The more similar the surrounding code is, the more likely it is that the same suggestion will be selected. The more times a developer has used the token in the file, the more likely they are to use it again.

Given the strength of these features, we are led to conclude that, sadly, developers tend to write similar blocks of code again and again, regardless of the “don’t repeat yourself” principle. Well, let’s at least take comfort in the thought that this makes code completion easier to implement.

Can we achieve good results with this data set?

Training a model on such logs is challenging. In some cases, when we suspect the data is incorrect, we can’t investigate without looking into the source code. For example, we may notice a user in our logs who seldom selects any suggestions from the completion window. And even when the user does so, the selections are far from the top, way below the average selection positions. If we could see the code that this user wrote and the suggestions they have seen, we would probably quickly find out if the data from that user was correct. Without access to the code, finding the root cause becomes extremely difficult and is never quick.

Nevertheless, we’ve managed to build a working production system.

Stay with us for the fourth installment of this series to learn how we train our models and what quality levels we can achieve.

image description