
We Gave Agents IDE-Native Search Tools. They Got Faster and Cheaper.

We ran the same coding tasks with and without prebundled tooling, across multiple models and languages. Here’s what changed.

Eval-driven development

IDE-native search reduced latency, cost, and budget overruns.

The comparison below uses paired task-level deltas. Aggregate medians and totals are shown for orientation. Budget overruns are tasks that exceeded the USD 0.50 per-task cap.

Median latency reduced 8.33% (83.11 s → 79.03 s)
P95 latency reduced 16.44% (268.71 s → 213.17 s)
Total cost reduced 5.60% (USD 44.17 → USD 41.67)
Budget overruns reduced 33.28% (6.67% → 4.44% of tasks)

Why We Built This

When coding agents search code, they default to shell tools. grep and find work, but they’re blind to project structure, symbol boundaries, and language semantics. The agent burns tokens sifting through noisy output and making follow-up calls to narrow things down.

So we tried something obvious: what if the agent could use the IDE’s own search instead?

We built a prebundled skill that pairs a search prompt with a unified MCP tool. One tool, four modes: file search, text search, regex, and symbol lookup. A universal router dispatches calls to the right backend.
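To make that shape concrete, here is a minimal sketch of a single tool with a mode-based router. Every name in it (SearchMode, SearchRequest, unified_search, the backend stubs) is illustrative rather than the actual implementation; the real tool runs behind an MCP server and delegates to the IDE's indices.

```python
# Minimal sketch of a unified search tool with a mode-based router.
# All names here are illustrative, not the actual JetBrains implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SearchMode(str, Enum):
    FILE = "file"      # find files by name or glob
    TEXT = "text"      # plain-text search across the project
    REGEX = "regex"    # regular-expression search
    SYMBOL = "symbol"  # classes, functions, fields via the IDE's symbol index


@dataclass
class SearchRequest:
    mode: SearchMode
    query: str
    max_results: int = 50


def _search_files(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's file index (omitted in this sketch)


def _search_text(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's full-text index (omitted)


def _search_regex(req: SearchRequest) -> list[str]:
    return []  # delegate to regex search over indexed files (omitted)


def _search_symbols(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's AST/symbol index (omitted)


# The "universal router": one entry point, dispatched by mode to the
# backend that owns that kind of search.
_HANDLERS: dict[SearchMode, Callable[[SearchRequest], list[str]]] = {
    SearchMode.FILE: _search_files,
    SearchMode.TEXT: _search_text,
    SearchMode.REGEX: _search_regex,
    SearchMode.SYMBOL: _search_symbols,
}


def unified_search(req: SearchRequest) -> list[str]:
    """The single tool the agent sees: one schema, four modes."""
    return _HANDLERS[req.mode](req)
```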

MCP Tools

Functions the agent calls via an MCP server during task execution. IDE-native tools can tap into indices, ASTs, and project models that shell tools cannot see.

Skills

Packaged agent behaviors: a prompt plus orchestration logic. A skill can work on its own, use tools, or ship bundled with the tools it needs.

Nothing ships by default until the eval says it should. We tested four different configurations of this tooling before picking one.

Methodology

The eval pipeline spins up an MCP server alongside the IDE so the agent has access to the configured tools and skills. We run identical coding tasks with and without tooling, then compare with paired delta analysis.

We track four things: quality, latency, cost, and budget discipline. Quality asks whether all tests passed. Latency tracks median and P95 task time. Cost converts token consumption into dollars. Budget discipline tracks how often a single task exceeds the USD 0.50 budget cap.
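As a rough illustration, the four metrics can be computed from per-task records along these lines. The field names and token prices below are assumptions made for the sketch; only the USD 0.50 cap comes from the setup described here.

```python
# Sketch of the four per-run metrics, computed from per-task records.
# Field names and the pricing table are illustrative assumptions;
# only the USD 0.50 budget cap is taken from the eval setup.
from dataclasses import dataclass
from statistics import median, quantiles

BUDGET_CAP_USD = 0.50
PRICE_PER_1K = {"input": 0.005, "output": 0.015}  # hypothetical rates


@dataclass
class TaskRun:
    all_tests_passed: bool
    duration_s: float
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        return ((self.input_tokens / 1000) * PRICE_PER_1K["input"]
                + (self.output_tokens / 1000) * PRICE_PER_1K["output"])


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    durations = [r.duration_s for r in runs]
    costs = [r.cost_usd for r in runs]
    return {
        "quality": sum(r.all_tests_passed for r in runs) / len(runs),
        "median_latency_s": median(durations),
        "p95_latency_s": quantiles(durations, n=100)[94],  # 95th percentile
        "total_cost_usd": sum(costs),
        "budget_overrun_rate": sum(c > BUDGET_CAP_USD for c in costs) / len(runs),
    }
```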

We report improvement deltas only when they pass our significance threshold: p < 0.05, paired test with 95% confidence intervals. Metrics without a significant change are either omitted from the charts or called out explicitly. We tried four configuration variants, selected the one with the best latency and cost tradeoff, then re-ran it on different models and languages to check that the results held.
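The significance check runs on per-task deltas. The description above does not pin down the exact paired test, so the sketch below assumes a Wilcoxon signed-rank test plus a bootstrap confidence interval on the median relative delta; the real pipeline may use a different paired test.

```python
# Sketch of the paired-delta significance check. The choice of Wilcoxon
# signed-rank plus a bootstrap CI is an assumption for illustration only;
# the post specifies p < 0.05 with a paired test and 95% CIs.
import numpy as np
from scipy.stats import wilcoxon


def paired_delta_report(baseline: np.ndarray, tooling: np.ndarray,
                        n_boot: int = 10_000, seed: int = 0) -> dict:
    """baseline[i] and tooling[i] are the same task run without and with tooling."""
    rel_delta = (tooling - baseline) / baseline       # per-task relative change
    p_value = wilcoxon(tooling, baseline).pvalue      # paired, nonparametric

    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(rel_delta), size=(n_boot, len(rel_delta)))
    boot_medians = np.median(rel_delta[idx], axis=1)  # bootstrap over tasks
    lo, hi = np.percentile(boot_medians, [2.5, 97.5])

    return {
        "median_relative_delta": float(np.median(rel_delta)),
        "ci_95": (float(lo), float(hi)),
        "p_value": float(p_value),
        "significant": bool(p_value < 0.05),
    }
```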

Eval frame

Same tasks, same grading, one controlled difference.

Quality: All-tests-passed rate, checked before performance claims.
Latency: Median and P95 task duration, compared with paired deltas.
Cost: Token use converted to dollars across the task set.
Budget discipline: Share of tasks exceeding the USD 0.50 single-task cap.

Results

The selected configuration was a prebundled search skill plus a unified IDE-native tool and universal router. Compared with the no-tooling baseline, it reduced latency and cost without producing a statistically significant quality change.

Baseline vs. tooling

Absolute metrics moved in the right direction.

Budget overruns: −33.28%
P95 latency: −16.44%
Median latency: −8.33%
Total cost: −5.60%

No statistically significant change in quality. All shown deltas passed the significance threshold.

Trace snapshots

The difference is visible in the agent’s path through the project.

These are shortened traces from cases that improved in both time and cost. The baseline spends more steps discovering context; the prebundled setup gets to the relevant files faster.

Service comments and replies
Prompt: Update service and controller layers for comments and replies.

Before (no prebundled IDE search):
agent> list files -> search x2 -> list files x2
agent> jar inspect x5 -> javap -> jar inspect -> javap x5
agent> curl download -> decompile -> search -> find files x2
agent> read 9 files -> edit file x8 -> respond
Time: 472s

After (prebundled skill and unified search):
agent> read SKILL.md -> search x3 -> read 5 files
agent> read FeatureController.java -> read 4 files
agent> edit file x2 -> respond
Time: 127s
Jackson key deserializer
Prompt: Preserve detailed error messages from a custom key deserializer.

Before (broad code walk):
agent> list files -> search x2 -> read README.md
agent> search x5 -> read DeserializationContext.java
agent> search x4 -> read StdDeserializer.java
agent> search -> read DeserializerCache.java
agent> read MapEntryDeserializer.java -> read JsonMappingException.java
agent> edit file -> respond
Time: 150s

After (targeted search):
agent> read SKILL.md -> search x3
agent> read MapDeserializer.java
agent> read StdKeyDeserializer.java
agent> read DeserializationContext.java
agent> edit file -> respond
Time: 34s

Configuration Explorer

We tested four tool configurations before choosing the final shape. Lower latency and lower total cost are better, so the lower-left corner of the plot is the target.

Configuration search

The selected option had the best latency while preserving cost reduction.

Scatter plot: median latency (78 s to 84 s) against total cost (USD 39.50 to USD 45.00) for five configurations: Baseline, 4 Search Tools, Unified Search Tool, 4 Tools + Router, and Unified Tool + Router.

Cross-Model Validation

We re-ran the experiment with GPT 5.4 on Java and Kotlin codebases. The pattern holds: latency and cost both drop. Kotlin saw the biggest cost improvement, with total cost falling 13.48%.

Cross-model check

The effect held beyond the original run.

Codex 5.2

Median latency: −8.33%
Total cost: −5.60%
P95 latency: −16.44%

GPT 5.4, Java

Median latency: −3.75%
Total cost: −4.07%
P95 latency: −13.00%

GPT 5.4, Kotlin

Median latency: −6.92%
Total cost: −13.48%
P95 latency: not significant

Metrics marked as not significant did not pass the significance threshold for that model and language.

How Models Adopt Tooling

Codex sends 91% of its search calls through the new IDE-native tool. Claude is a different story: Opus uses it for about half its searches, and Haiku only 28%, preferring grep and find instead.

This makes sense. Claude already has strong built-in code search, so it leans on what it knows. Codex doesn’t, so it grabs the better tool when one is available. The takeaway: prebundled tooling fills gaps. Where the model already has good search, it adds less. Where search is weak, it makes a real difference.
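For illustration, adoption can be tallied from agent traces as the share of search-type calls that go through each tool. The trace format and tool names in the sketch below are assumptions; the eval pipeline's actual schema may differ.

```python
# Sketch of tallying tool adoption from agent traces: count every search-type
# call and bucket it by the tool that handled it. Trace format and tool names
# are illustrative, not the actual eval pipeline's schema.
from collections import Counter

SEARCH_TOOLS = ("ide_search", "grep", "find")


def adoption_shares(traces: list[list[str]]) -> dict[str, float]:
    """Each trace is the ordered list of tool names an agent called for one task."""
    counts = Counter(call for trace in traces
                     for call in trace if call in SEARCH_TOOLS)
    total = sum(counts.values()) or 1
    return {tool: round(100 * counts[tool] / total, 1) for tool in SEARCH_TOOLS}


# Example: a model that routes most of its searches through the IDE-native tool.
example = [["ide_search", "read_file", "ide_search", "edit_file"],
           ["grep", "ide_search", "respond"]]
print(adoption_shares(example))  # {'ide_search': 75.0, 'grep': 25.0, 'find': 0.0}
```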

Tool adoption

Models do not use new tools at the same rate.

Codex: IDE Search 91%, grep 8%, find 1%
Claude Opus: IDE Search 53%, grep 28%, find 19%
Claude Haiku: IDE Search 28%, grep 33%, find 39%

What’s Next

The eval pipeline works. Now we’re using it.

We’re running the same experiment on smaller models next. Our hunch is that they’ll benefit even more, since they have less built-in search capability to fall back on.

The current results are strongest on Java and Kotlin. We’re expanding to Python, .NET, and TypeScript with bigger sample sizes.

Meanwhile, the winning configuration is being prepared for the integrated IntelliJ IDEA MCP Server, so agent sessions can use IDE-native tooling when the server is enabled.

The next step is to turn this feature on by default in upcoming AI Assistant plugin updates.

Want to try it before the default rollout?

  1. Set these registry keys to true: llm.chat.agent.codex.mcp.idea, llm.chat.agent.skills.settings.enabled, and llm.agents.contrib.bundled.skills.sync.enabled.
  2. In AI Assistant, choose Codex for the best results.
  3. Ask the agent to find something across the current project.

Measure first, ship second, keep measuring after. That's the whole approach.