JetBrains AI

How We Use AlphaEvolve to Make Complex IDE Algorithms Faster

Denis Shiryaev

AlphaEvolve is a Google DeepMind algorithm-discovery system that uses Gemini to generate, test, and refine possible algorithm improvements. Its job is not to answer questions; it searches for faster ways to solve complex algorithmic problems. We tried it on a narrow but important part of IntelliJ-based IDEs: indexing, the background work that makes navigation, search, completion, refactorings, inspections, and other code insight available after a project opens.

Because so many features wait on it, indexing speed is a simple metric to say out loud and a hard metric to improve. It depends on the language, the framework, the shape of the project, background IDE work, and the storage layer underneath the indexes. Small changes can disappear in noise. Some wins are real in a microbenchmark and invisible in a full IDE run.

We already invest a lot of engineering time here, and that manual performance work continues. The experiment described in this post was not a replacement for engineering judgement, profiling, code review, or product validation. It was a test of an additional search method: could Google DeepMind’s AlphaEvolve help us find useful optimization candidates in code that had already been worked on for years?

Result snapshot

We first tested the generated candidates on a synthetic benchmark, then validated the most promising ones in a full IDE environment.

Integration test, in seconds, lower is better: Kotlin Spring Petclinic on modified IntelliJ IDEA 2026.2 nightly builds. Baseline 17.4 ± 0.5s. Solution 1 measured 16.6 ± 0.2s in our run table.

15-20%: synthetic performance score improvement seen in most AlphaEvolve sessions with 50+ iterations.

17.4s: full IDE baseline for Kotlin Spring Petclinic, with ±0.5s variability.

16.6s: best measured candidate, reported as ±0.2s.

2 / 5: generated candidates that showed a statistically significant integration-test improvement.

Interactive measurement dashboard

Use the tabs to move between the end-to-end result, individual runs, and the experiment funnel. For time and score charts, lower is better.

In its AlphaEvolve preview blog post, Google DeepMind describes the system as a Gemini-powered coding agent that designs algorithms by combining LLM-generated code with automated evaluators. For this experiment, that evaluator was our performance and correctness setup.

The target: a B-tree in the indexing stack

We chose the B-tree at the foundation of our index implementation. The starting point was not a naive prototype. It was a deeply optimized piece of infrastructure where manual exploration had become expensive. Even a plausible change takes time to write, review, and validate, and a wrong change can be fast for the wrong reason.

The engineering description was deliberately plain: the original algorithm was essentially a classic B-tree, and the proposed candidates were mostly improved B-tree variants with optimizations around edge cases. That is the kind of problem AlphaEvolve is well suited for. There is code to change. There is a clear score. There are tests that reject broken ideas.
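For readers who want a concrete picture of "a classic B-tree", here is a minimal in-memory sketch in Python. It is a textbook version for illustration only, not the IntelliJ Platform implementation. The class names, the minimum degree `t`, and the choice to ignore duplicate keys are all assumptions made for this example.

```python
from bisect import bisect_left


class Node:
    """One B-tree node: sorted keys plus child pointers (empty for leaves)."""

    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


class BTree:
    """Textbook B-tree of minimum degree t (a node holds at most 2t - 1 keys)."""

    def __init__(self, t=2):
        self.t = t
        self.root = Node()

    def contains(self, key):
        node = self.root
        while True:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True
            if node.leaf:
                return False
            node = node.children[i]

    def insert(self, key):
        if self.contains(key):
            return  # assumption for this sketch: duplicates are ignored
        if len(self.root.keys) == 2 * self.t - 1:
            # Full root: add a level on top, then split the old root.
            new_root = Node(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        """Split the full child at index i and move its median key up."""
        t = self.t
        full = parent.children[i]
        right = Node(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def _insert_nonfull(self, node, key):
        """Descend to a leaf, splitting any full child before entering it."""
        while not node.leaf:
            i = bisect_left(node.keys, key)
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        node.keys.insert(bisect_left(node.keys, key), key)

    def in_order(self, node=None):
        """Return all keys in sorted order."""
        node = node or self.root
        if node.leaf:
            return list(node.keys)
        out = []
        for i, child in enumerate(node.children):
            out += self.in_order(child)
            if i < len(node.keys):
                out.append(node.keys[i])
        return out
```

Even in this small version, the split and descent paths contain the kind of edge cases (full root, full child, key landing right of a promoted median) where a variant can be tuned and where tests are needed to reject broken ideas.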

The loop: generate, score, validate

AlphaEvolve optimizing an instance of the "Tammes problem".

We gave AlphaEvolve an internal performance test suite for the storage layer. The suite is synthetic. It does not use real customer projects. It writes and reads synthetic data so that candidate changes can be tested quickly and repeatedly.

The score was based on the sum of median results across our mid-sized benchmarks. Unit tests acted as the correctness check. With that setup, most AlphaEvolve sessions with more than 50 iterations produced a 15-20% improvement in the synthetic performance score.
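The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our evaluator: the function names, the dictionary layout, and the idea of returning `None` for a candidate that fails unit tests are hypothetical, and the sample values are made up rather than measured.

```python
from statistics import median


def performance_score(timings_by_benchmark):
    """Sum of per-benchmark median timings; lower is better."""
    return sum(median(runs) for runs in timings_by_benchmark.values())


def evaluate(candidate_timings, unit_tests_pass):
    """Score a candidate, or return None if it fails the correctness gate."""
    if not unit_tests_pass:
        return None
    return performance_score(candidate_timings)


def improvement_percent(baseline_score, candidate_score):
    """Positive values mean the candidate scored lower (better) than baseline."""
    return 100.0 * (baseline_score - candidate_score) / baseline_score


# Illustrative values only, not measurements from the experiment.
baseline = {"write": [10.0, 12.0, 11.0], "read": [5.0, 5.0, 6.0]}
candidate = {"write": [9.0, 9.5, 10.0], "read": [4.0, 4.1, 4.5]}

baseline_score = evaluate(baseline, unit_tests_pass=True)
candidate_score = evaluate(candidate, unit_tests_pass=True)
print(improvement_percent(baseline_score, candidate_score))
```

Using the median per benchmark keeps a single slow run from dominating the score, and the correctness gate means a candidate that is fast because it is wrong never receives a score at all.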

That was encouraging, but it was not enough. Synthetic benchmarks are useful because they are controlled. Users do not run controlled benchmarks. They run full IDEs, with background processes, language services, and project-specific behavior running at the same time. So we took the best generated candidates into integration tests.

For the full IDE step, the team used Kotlin Spring Petclinic and modified IntelliJ IDEA 2026.2 nightly builds. The reported baseline for total end-to-end indexing time was 17.4 ± 0.5 seconds. Out of five generated candidates, two showed statistically significant improvements, with reproducible results below 16.8 seconds.

Claim boundaries

Most 50+ iteration sessions improved the synthetic performance score by 15-20%. This is the strongest claim about the autonomous optimization loop because the benchmark was the optimization target.

What changed in the numbers

Our end-to-end run table contains two measured candidates. Solution 1 produced a mean result of 16.6 seconds, reported as ±0.2 seconds. Against the 17.4-second baseline, that is about 0.8 seconds faster, or roughly a 4.6% reduction in this integration scenario.

Solution 2 is useful for the story too, although not because it won the full IDE test. It measured 17.5 ± 0.4 seconds, which is effectively the baseline in this scenario. Both candidates improved the fast synthetic benchmark, but only one of these two showed a user-visible end-to-end improvement in the integration measurements.
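The arithmetic behind these two comparisons can be reproduced with a short Python sketch. It uses only the reported means and ± values. The interval-overlap check is a rough heuristic chosen for this example, not the statistical test the team used, and this post does not say what the ± values represent, so treat the function names and the method as assumptions.

```python
def relative_change_percent(baseline, candidate):
    """Negative values mean the candidate is faster than the baseline."""
    return 100.0 * (candidate - baseline) / baseline


def intervals_overlap(mean_a, spread_a, mean_b, spread_b):
    """True when [mean - spread, mean + spread] of a and b intersect."""
    low_a, high_a = mean_a - spread_a, mean_a + spread_a
    low_b, high_b = mean_b - spread_b, mean_b + spread_b
    return low_a <= high_b and low_b <= high_a


# Reported values in seconds: (mean, reported variability).
BASELINE = (17.4, 0.5)
SOLUTION_1 = (16.6, 0.2)
SOLUTION_2 = (17.5, 0.4)

for name, (mean, spread) in [("Solution 1", SOLUTION_1), ("Solution 2", SOLUTION_2)]:
    change = relative_change_percent(BASELINE[0], mean)
    overlaps = intervals_overlap(mean, spread, *BASELINE)
    print(f"{name}: {change:+.1f}% vs baseline, ranges overlap: {overlaps}")
```

Solution 1 comes out roughly 4.6% faster with a range that does not touch the baseline range, while the Solution 2 range sits inside the baseline range, which is why it reads as effectively unchanged.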

That distinction matters. A performance workflow that only celebrates synthetic wins will eventually ship misleading claims. A workflow that pairs autonomous search with full IDE validation has a better chance of finding changes users can feel.

"AlphaEvolve can change how we approach complex performance work. It turns optimizations that were once too time-consuming to explore into candidates we can test routinely. Engineers still own the benchmark, review, and release decision. The search space is what gets smaller."
Dmitrii Batkovich, Director of Engineering for IntelliJ Platform

What we measure next

The next step is product validation. The team plans to check whether improvements show up in megaAPDEX, the APDEX-style internal KPI JetBrains uses for indexing satisfaction, and in other user-facing indexing signals. That is the right bar. A faster internal benchmark is useful. A faster full IDE test is better. A better user experience is the result that matters.
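For context on what "APDEX-style" means, the sketch below computes the standard Apdex score in Python: samples at or below a threshold T count as satisfied, samples up to 4T count as tolerating at half weight, and anything slower counts as frustrated. This post does not define how megaAPDEX is calculated, so this is the generic formula only. The function name, the threshold, and the sample values are hypothetical.

```python
def apdex(samples_seconds, threshold_seconds):
    """Standard Apdex: (satisfied + tolerating / 2) / total, in [0, 1]."""
    if not samples_seconds:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for s in samples_seconds if s <= threshold_seconds)
    tolerating = sum(
        1 for s in samples_seconds
        if threshold_seconds < s <= 4 * threshold_seconds
    )
    return (satisfied + tolerating / 2) / len(samples_seconds)


# Made-up durations and threshold, for illustration only.
# Two satisfied, one tolerating, one frustrated: (2 + 0.5) / 4 = 0.625
print(apdex([1.0, 2.0, 5.0, 20.0], threshold_seconds=4.0))
```

A metric of this shape rewards moving sessions across the threshold, so a modest average speedup may or may not shift the score, which is why it has to be measured separately from benchmark time.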

For us, the important lesson is not that AlphaEvolve magically made indexing fast. It did something more practical. It helped generate and rank low-level optimization ideas in a space where manual exploration is slow. JetBrains engineers supplied the problem, the tests, the measurement discipline, and the judgement. AlphaEvolve expanded the search.

Acknowledgements

This project was a collaboration between the JetBrains team, including Denis Shiryaev and Dmitrii Batkovich, and the AI for Science and account teams at Google Cloud, including Anant Nawalgaria, Skander Hannachi, Kartik San, Laurynas Tamulevičius, Nicolas Stroppa, and Artemiy Yashin.
