Curiosity-Driven Researchers: Between Industry and Academia, Part 3
Welcome to the third part of our interview with Timofey Bryksin, Head of the Research Lab in Machine Learning Methods in Software Engineering at JetBrains. This time, we’ll talk about the frontiers in fundamental and applied research, industry competition, and the latest AI trends. Part two can be found here.
What’s the difference between a lab like yours and a research group focused on pure science?
The difference is quite substantial, and it’s a matter of what we consider the outcome of our work. We look at things from a practical standpoint and ask whether our work can be applied in the real world. In academic research, the main goal is often to generate new knowledge and contribute to the advancement of a field. Any improvement on existing knowledge is considered a good result worthy of publication, as long as it’s well-documented and can be reproduced. Sometimes, new knowledge is indeed obtained, but it lacks practical applicability. For instance, at one software development conference (where people usually tackle some pretty practical issues), there was an entire session dedicated to automatic fixes, and all of the talks followed a similar script: “We took this massive neural network, trained it on this specific dataset, crunched some numbers, and our metrics got better!” Is this new knowledge? Definitely, because it improved a solution to a specific problem. But what’s the novelty here? They used an existing model, trained it on already known data, and applied it to a familiar problem.
That’s why when we get a new model, we want to make sure it actually helps people. This means building tools that go along with it, conducting research with real users, and making sure the problem is actually solved. In pure academic settings, people don’t commonly take these extra steps. But we make an active effort to team up with those who do.
Fundamental research doesn’t always yield practical results in the “here and now”, but at some point in the future, it can be incredibly useful. Do you engage in that kind of research, even occasionally?
Despite our primary focus on practical results, ideas often bounce around from one project to another, and you never know when and where they might come in handy. Unpredictability is a characteristic of science.
We’ve had research projects that unexpectedly led to something we hadn’t initially planned. For example, once we started a project out of sheer curiosity, and it ended up becoming a useful feature in our products.
We were curious about how often GitHub users copied Java code without respecting the licenses. So, we downloaded a lot of repositories from GitHub, identified duplicates, and compared their timestamps to observe the relationship between licenses in early and later versions. In the end, we figured out how open-source licenses work and developed useful analytics regarding their compatibility. This whole adventure ended up inspiring a tool for IDEs. It helped users understand dependencies in their projects, how different code chunks fit together, and what kind of license they needed if their project used various libraries. Along the way, we also refined our clone detection algorithm. But trying to predict all of this from the start would’ve been challenging – these goals came into being during the course of our research.
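For readers curious what such an analysis can look like in practice, here is a heavily simplified sketch of the general idea: hash normalized code fragments to find duplicates across repositories, use commit timestamps to guess the copy direction, and flag license pairs that look incompatible. The hashing scheme, the data layout, and the tiny compatibility table are illustrative assumptions, not the lab’s actual clone-detection pipeline.

```python
# A simplified sketch of the analysis described above. Everything here
# (the hashing, the license table, the data layout) is illustrative.
import hashlib
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Fragment:
    repo: str
    license: str       # e.g. "GPL-3.0", "MIT"
    timestamp: int     # Unix time of the commit that introduced the code
    code: str

def normalized_hash(code: str) -> str:
    # Strip whitespace differences so trivially reformatted copies still match.
    canonical = " ".join(code.split())
    return hashlib.sha256(canonical.encode()).hexdigest()

# A tiny, purely illustrative compatibility table: copying FROM the key license
# INTO any of the listed licenses is assumed to be fine.
COMPATIBLE_WITH = {
    "MIT": {"MIT", "Apache-2.0", "GPL-3.0"},
    "GPL-3.0": {"GPL-3.0"},
}

def find_suspicious_copies(fragments: list[Fragment]):
    by_hash = defaultdict(list)
    for frag in fragments:
        by_hash[normalized_hash(frag.code)].append(frag)

    for clones in by_hash.values():
        if len(clones) < 2:
            continue
        clones.sort(key=lambda f: f.timestamp)  # earliest occurrence first
        origin = clones[0]
        for later in clones[1:]:
            allowed = COMPATIBLE_WITH.get(origin.license, set())
            if later.license not in allowed:
                yield origin, later  # likely copied under an incompatible license
```

A real pipeline would rely on proper token- or AST-based clone detection and a complete license-compatibility matrix, but the overall flow – detect clones, order them in time, check license pairs – stays the same.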
Is such a result publication-worthy?
In the scientific community, there’s a running joke that any result is worthy of publication.
I believe there are no strict criteria for this. A result is worthy of publication if, alongside contributing to the field, it contains something interesting.
What does “interesting” mean in this context?
For me, interest here is closely tied to novelty. However, different people find different aspects of things intriguing. Some enjoy digging into a subject and discovering patterns – they take pleasure in uncovering something new. Others prefer to create new knowledge, while still others are inclined to catalog what already exists. A well-structured approach can also be a form of novelty, and that’s what makes it interesting.
How does this apply to software engineering?
Even in this field, opinions on what makes something interesting can vary. Is it when you write a new program or experiment with a new method or algorithm? In my view, these are all good enough reasons to write a paper.
In the narrower realm of machine learning for software engineering, the criteria for novelty become more restrictive: you need to introduce a new model, collect a dataset, or formulate a new problem. There’s also a category of publications that aim to legitimize the tools created by academic researchers, and I think that’s a very valid objective.
So let me approach this from a different angle: we believe there’s no point in writing articles about things that don’t spark discussion, don’t contribute new knowledge, don’t give you a sense of pride, and that you wouldn’t want to showcase. It can’t just be something “new” – there has to be more to it.
Do you have KPIs for the number of publications and citations?
Our goal is not specifically to publish a lot of papers and increase citation rates. In fact, it’s relatively easy to achieve that — you just need to create a bunch of tools and datasets that prove useful to a broad range of people. Our most-cited article happens to be about one of our tools. On the flip side, our article about spotting patterns in code changes had a pretty big real-world impact, but only got four citations. So, what really matters to us is how well-known our work is within the larger community. We’re keen on sharing what’s happening within our company with academic circles.
Why does a company seek recognition in scientific circles?
We demonstrate that JetBrains is actively involved in scientific pursuits and is contributing to the industry. We openly share our research findings and actively seek to collaborate with scientific groups from around the world, particularly those who not only publish papers but also work on putting their results into practice.
Where are the current frontiers in both fundamental and applied fields?
A lot of the most exciting developments are happening in the realm of NLP, or natural language processing, where investments are pouring in by the billions. Recent breakthroughs like OpenAI’s ChatGPT, Meta’s LLaMA, and Google’s Bard all emerged from this field. It spills over into related domains like code processing, the analysis of biological data, and graph models. People are actively exploring dialogue systems, models capable of speech synthesis, and diffusion models for synthesizing visual information.
But here’s the thing: The concept of a “frontier” is rather malleable. We may never really know what commercial companies are chasing at any given time. Unlike academic research groups, companies can choose to keep their work under wraps until it suits them, sometimes not revealing the full ins and outs of how their services work.
Even though OpenAI claims to be transparent?
It’s quite interesting, actually. Google developed the Transformer architecture and made it public, and OpenAI built on it to create ChatGPT and GPT-4. Since then, OpenAI has gradually transitioned from a non-profit organization to a more commercial entity and has become less open with information.
Commercial companies often prefer to keep their developments private, which is completely normal. For instance, Meta has developed its own programming language and all the tools around it, including an IDE. Each major player tends to have its own version control system, and this became the standard a while ago.
But in the research community, things are done differently. OpenAI set an interesting precedent by releasing a paper on GPT-4 without disclosing any specifics about the model or the training data. The paper literally says, “We trained the model on data,” and then goes on for 70 pages about how the model can be safely used in various scenarios. Maybe it’s due to competition, maybe they have ethical concerns – most probably both. But OpenAI’s decision has an unintended consequence: since the model isn’t shared openly, many others will try to replicate the achievement from scratch, consuming massive amounts of electricity and emitting huge volumes of carbon dioxide – the same resources will be spent all over again. Those who care about the environment view this quite negatively. In this regard, Meta’s approach is much more research-friendly: they released their LLaMA models for the community to build upon, and we’re already seeing a number of interesting results created on top of them.
Is this lack of transparency a new requirement to be able to compete in the industry?
Competition will take place on various fronts. For starters, big companies that invest huge amounts of money will achieve results, and the open community will strive to replicate these experiments and make further progress.
Additionally, there will be competition to increase the model sizes and come up with new ways to build model ensembles to improve the end result.
On top of all that, there’s a shift towards smarter training with better datasets. Meta AI, for instance, is training much smaller models that perform on par with ones that have hundreds of billions of parameters. The idea is that better data leads to better training, making it unnecessary to significantly increase the model size. It seems to me that working in this direction is more sensible than simply ramping up the number of parameters. After all, there are more systematic ways to extract information from data – such as ontologies and knowledge graphs – that can make a model more capable.
At the same time, everyone is eagerly anticipating a leap in computing power with the advent of quantum computers, which will make the volume of computations largely irrelevant. If we learn how to use and scale quantum computers properly, the conversation will take a whole different turn.
But for now, it really comes down to computational power, doesn’t it?
Sure. Firstly, nobody knows how current results will scale to the entire human population. I’m pretty sure that if everyone started using GPT-4 simultaneously, it would simply crash. Efficient model inference is as much its own area of research as model size, training time, and training quality. Secondly, even though we’re dealing with mathematics, where in theory you can achieve results with just a pencil and paper, practical validation requires training exceptionally large models. So, there are two options: either you have a proper data center or you don’t. In the latter case, you can rent one, but it’s costly, so you’re still constrained. And even then, the computational capacity at your disposal will be nowhere near what Google and Microsoft have.
Is there room for a loner or a small research team to still make a breakthrough?
I’d like to believe there’s always potential to come up with something truly game-changing. For example, it would be amazing if someone could figure out how to distill a model as massive as GPT-3 or GPT-4 down by several orders of magnitude without compromising the quality too much. Achieving this doesn’t necessarily require a big team. In fact, the ability of smaller groups to experiment and pivot quickly can be incredibly valuable.
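To make the idea concrete, here is a minimal sketch of classic knowledge distillation – training a small “student” model to match the softened predictions of a large “teacher” alongside the usual supervised objective. It’s a generic illustration rather than any particular lab’s recipe; the temperature, loss weighting, and model handles are all placeholder assumptions.

```python
# A minimal sketch of knowledge distillation: the student learns both from
# ground-truth labels and from the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on the labels with a KL term that pushes the
    student's softened predictions toward the teacher's."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 as is customary.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# Hypothetical usage inside a training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# student_logits = student(batch)
# loss = distillation_loss(student_logits, teacher_logits, batch_labels)
# loss.backward()
```

The open question the answer alludes to is whether anything this simple can shrink a GPT-scale model by orders of magnitude without losing too much quality.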
Just a reminder, the Midjourney team still consists of just eleven people. By the way, I would recommend that small research groups take a closer look at other media, such as voice, video, and images. One reason there’s far less competition in those areas is that, for now, the focus and budgets of big tech companies are predominantly locked onto text – that is, NLP. And it’s understandable why text gets this attention: where there’s text, there are searches, and searching means advertising, assisting people in various ways, and so on. However, we’re already seeing some exciting advancements in fields like image generation, computer vision, and video generation.
Why did ChatGPT, Midjourney, and Stable Diffusion all take off almost simultaneously?
Setting aside any conspiracy theories, it’s probably more about a significant accumulation of research efforts that paved the way for these results. The field of diffusion models was getting a lot of attention, and the sheer quantity of work eventually evolved into high-quality outcomes aided by some fascinating mathematical developments.
Neural models took off once big tech companies had collected large datasets and gained access to affordable yet powerful hardware for processing that data. I’d say it’s mostly a mix of various factors, both in the academic world and in the hardware scene.
Are there any areas that have been unfairly overlooked?
Well, that would be the case if humanity were all headed in one single direction, but the thing is, we’re going in different directions at once. Even within OpenAI and Google, there’s an effort to explore entirely different paths. For example, they’re delving into diffusion models, which work quite differently from GPT but still have the potential to yield interesting results. However, what counts as success can be a bit fluid in this context. If we’re talking about big tech companies and their accomplishments, Google, for example, introduced some remarkable models with unique architectures that, in my opinion, are more interesting than the GPT models. Yet they didn’t get as much immediate attention. This leads to another question: how are these results presented to the public? Take GitHub Copilot, for example. It came out about a year and a half ago and was marketed brilliantly. ChatGPT, too, created quite a buzz, even though it was based on a research paper that had already been published. There had been tools capable of generating code before, so why didn’t they become as popular? It’s hard to say how they compared in quality. My suggestion would be not to underestimate the role of marketing in showcasing these results.
Do potential implementations determine research priorities?
On the one hand, it’s pretty clear that companies are eager to replicate any major success they see in-house. On the other hand, they’re also on the lookout for unique and unexpected applications of existing innovations. Currently, it’s quite evident that most big tech companies are exploring similar areas. Microsoft, for instance, used ChatGPT to revive Bing, and a month later, Google followed with Bard. Of course, each company has its own unique focus. Microsoft is much more active in our field: they’ve acquired GitHub and are actively developing Copilot and the ecosystem around it. Google is developing Project IDX, a web-based IDE with full support for their LLMs. And they have many more internal tools where they apply machine learning.
So, if you look at a company’s product lineup, it kind of gives away where they’re planning to put their innovation efforts. Of course, there will be entirely new products, but as practical experience has shown, not all of them will be equally useful. Keep an eye on products in AR, VR, or the metaverse – there’s no guarantee that ML-based features will sell as well there.
Isn’t there a sense of a growing bubble in AI?
It kind of feels that way. It’s evident that the rise of ChatGPT, especially with the APIs offered for building custom apps, has sparked a substantial trend, and we’ll be seeing countless startups popping up in this space. There was even this recent article that listed 126 unicorn AI startups – companies valued at over $1 billion each. Can a single project be worth a billion? Absolutely. But can every single one of them? I have my doubts.
AI faces another challenge: whether it can actually generate new knowledge. OpenAI often boasts that the GPT-4 model can outperform 90% of the people taking the bar exam. Does this mean it’s a great model? Probably, yes. But it also highlights issues in the testing procedure. As far as we know, most tests that people take are focused on how well they’ve memorized information. OpenAI claims that their model can solve competitive programming problems, but that’s only true if you feed it problems from before 2021. Anything newer, and it fails completely. Of course, some tasks really do call for responses based on solid knowledge. For example, when GPT-4 was given a step-by-step guide on calculating taxes, it broke the process down nicely, explaining what to pay and how. However, it’s pretty clear that it essentially read the instructions and spat them out, one piece at a time. So, the model aces these tests mainly because it was trained on them in the first place. This goes against a fundamental rule: you shouldn’t test on the same data you trained on. But what if your training dataset is the entire Internet? To us, it might seem like the AI is creating new information, simply because we humans can’t memorize the entire Internet, so its recollection looks like freshly generated knowledge. But what if it just has a very good memory?
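As a small illustration of that rule, here is a sketch of one common way to reduce such contamination when evaluating a model on benchmark problems: keep only problems published after the model’s training-data cutoff. The cutoff date, the problem names, and the record format are made up for the example.

```python
# A minimal sketch of filtering an evaluation set by date so a model is not
# tested on problems it may have seen (and memorized) during training.
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed training-data cutoff for the model

problems = [
    {"id": "two-sum",         "published": date(2019, 5, 2)},
    {"id": "fresh-contest-q", "published": date(2023, 1, 15)},
]

# Only problems the model could not have memorized are a fair test of reasoning.
clean_eval_set = [p for p in problems if p["published"] > TRAINING_CUTOFF]
print([p["id"] for p in clean_eval_set])  # -> ['fresh-contest-q']
```

It doesn’t settle the deeper question of memory versus new knowledge, but it at least keeps the evaluation honest about what the model could have seen before.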