
We Gave Agents IDE-Native Search Tools. They Got Faster and Cheaper.

We ran the same coding tasks with and without prebundled tooling, across multiple models and languages. Here’s what changed.

Eval-driven development

IDE-native search reduced latency, cost, and budget overruns.

The comparison below uses paired task-level deltas. Aggregate medians and totals are shown for orientation. Budget overruns are tasks that exceeded the USD 0.50 per-task cap.

Median latency reduced 8.33% (83.11 s → 79.03 s)
P95 latency reduced 16.44% (268.71 s → 213.17 s)
Total cost reduced 5.60% (USD 44.17 → USD 41.67)
Budget overruns reduced 33.28% (6.67% → 4.44% of tasks)

Why We Built This

When coding agents search code, they default to shell tools. grep and find work, but they’re blind to project structure, symbol boundaries, and language semantics. The agent burns tokens sifting through noisy output and making follow-up calls to narrow things down.

So we tried something obvious: what if the agent could use the IDE’s own search instead?

We built a prebundled skill that pairs a search prompt with a unified MCP tool. One tool, four modes: file search, text search, regex, and symbol lookup. A universal router dispatches calls to the right backend.
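To make that shape concrete, here is a minimal sketch of a single tool with a mode-based router. Every name in it (SearchMode, SearchRequest, unified_search, the backend stubs) is illustrative rather than the actual implementation; the real tool runs behind an MCP server and delegates to the IDE's indices.

```python
# Minimal sketch of a unified search tool with a mode-based router.
# All names here are illustrative, not the actual JetBrains implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SearchMode(str, Enum):
    FILE = "file"      # find files by name or glob
    TEXT = "text"      # plain-text search across the project
    REGEX = "regex"    # regular-expression search
    SYMBOL = "symbol"  # classes, functions, fields via the IDE's symbol index


@dataclass
class SearchRequest:
    mode: SearchMode
    query: str
    max_results: int = 50


def _search_files(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's file index (omitted in this sketch)


def _search_text(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's full-text index (omitted)


def _search_regex(req: SearchRequest) -> list[str]:
    return []  # delegate to regex search over indexed files (omitted)


def _search_symbols(req: SearchRequest) -> list[str]:
    return []  # delegate to the IDE's AST/symbol index (omitted)


# The "universal router": one entry point, dispatched by mode to the
# backend that owns that kind of search.
_HANDLERS: dict[SearchMode, Callable[[SearchRequest], list[str]]] = {
    SearchMode.FILE: _search_files,
    SearchMode.TEXT: _search_text,
    SearchMode.REGEX: _search_regex,
    SearchMode.SYMBOL: _search_symbols,
}


def unified_search(req: SearchRequest) -> list[str]:
    """The single tool the agent sees: one schema, four modes."""
    return _HANDLERS[req.mode](req)
```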

MCP Tools

Functions the agent calls via an MCP server during task execution. IDE-native tools can tap into indices, ASTs, and project models that shell tools cannot see.

Skills

Packaged agent behaviors: a prompt plus orchestration logic. A skill can work on its own, use tools, or ship bundled with the tools it needs.

Nothing ships by default until the eval says it should. We tested four different configurations of this tooling before picking one.

Methodology

The eval pipeline spins up an MCP server alongside the IDE so the agent has access to the configured tools and skills. We run identical coding tasks with and without tooling, then compare with paired delta analysis.

We track four things: quality, latency, cost, and budget discipline. Quality asks whether all tests passed. Latency tracks median and P95 task time. Cost converts token consumption into dollars. Budget discipline tracks how often a single task exceeds the USD 0.50 budget cap.
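As a rough illustration, the four metrics can be computed from per-task records along these lines. The field names and token prices below are assumptions made for the sketch; only the USD 0.50 cap comes from the setup described here.

```python
# Sketch of the four per-run metrics, computed from per-task records.
# Field names and the pricing table are illustrative assumptions;
# only the USD 0.50 budget cap is taken from the eval setup.
from dataclasses import dataclass
from statistics import median, quantiles

BUDGET_CAP_USD = 0.50
PRICE_PER_1K = {"input": 0.005, "output": 0.015}  # hypothetical rates


@dataclass
class TaskRun:
    all_tests_passed: bool
    duration_s: float
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        return ((self.input_tokens / 1000) * PRICE_PER_1K["input"]
                + (self.output_tokens / 1000) * PRICE_PER_1K["output"])


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    durations = [r.duration_s for r in runs]
    costs = [r.cost_usd for r in runs]
    return {
        "quality": sum(r.all_tests_passed for r in runs) / len(runs),
        "median_latency_s": median(durations),
        "p95_latency_s": quantiles(durations, n=100)[94],  # 95th percentile
        "total_cost_usd": sum(costs),
        "budget_overrun_rate": sum(c > BUDGET_CAP_USD for c in costs) / len(runs),
    }
```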

We report improvement deltas only when they pass our significance threshold: p < 0.05, paired test with 95% confidence intervals. Metrics without a significant change are either omitted from the charts or called out explicitly. We tried four configuration variants, selected the one with the best latency and cost tradeoff, then re-ran it on different models and languages to check that the results held.
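The significance check runs on per-task deltas. The description above does not pin down the exact paired test, so the sketch below assumes a Wilcoxon signed-rank test plus a bootstrap confidence interval on the median relative delta; the real pipeline may use a different paired test.

```python
# Sketch of the paired-delta significance check. The choice of Wilcoxon
# signed-rank plus a bootstrap CI is an assumption for illustration only;
# the post specifies p < 0.05 with a paired test and 95% CIs.
import numpy as np
from scipy.stats import wilcoxon


def paired_delta_report(baseline: np.ndarray, tooling: np.ndarray,
                        n_boot: int = 10_000, seed: int = 0) -> dict:
    """baseline[i] and tooling[i] are the same task run without and with tooling."""
    rel_delta = (tooling - baseline) / baseline       # per-task relative change
    p_value = wilcoxon(tooling, baseline).pvalue      # paired, nonparametric

    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(rel_delta), size=(n_boot, len(rel_delta)))
    boot_medians = np.median(rel_delta[idx], axis=1)  # bootstrap over tasks
    lo, hi = np.percentile(boot_medians, [2.5, 97.5])

    return {
        "median_relative_delta": float(np.median(rel_delta)),
        "ci_95": (float(lo), float(hi)),
        "p_value": float(p_value),
        "significant": bool(p_value < 0.05),
    }
```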

Eval frame

Same tasks, same grading, one controlled difference.

Quality: All-tests-passed rate, checked before performance claims.
Latency: Median and P95 task duration, compared with paired deltas.
Cost: Token use converted to dollars across the task set.
Budget discipline: Share of tasks exceeding the USD 0.50 single-task cap.

Results

The selected configuration was a prebundled search skill plus a unified IDE-native tool and universal router. Compared with the no-tooling baseline, it reduced latency and cost without producing a statistically significant quality change.

Baseline vs. tooling

Absolute metrics moved in the right direction.

Budget overruns: −33.28%
P95 latency: −16.44%
Median latency: −8.33%
Total cost: −5.60%

No statistically significant change in quality. All shown deltas passed the significance threshold.

Trace snapshots

The difference is visible in the agent’s path through the project.

These are shortened traces from cases that improved in both time and cost. The baseline spends more steps discovering context; the prebundled setup gets to the relevant files faster.

Service comments and replies
Prompt: Update service and controller layers for comments and replies.

Before (no prebundled IDE search):
agent> list files -> search x2 -> list files x2
agent> jar inspect x5 -> javap -> jar inspect -> javap x5
agent> curl download -> decompile -> search -> find files x2
agent> read 9 files -> edit file x8 -> respond
Time: 472s

After (prebundled skill and unified search):
agent> read SKILL.md -> search x3 -> read 5 files
agent> read FeatureController.java -> read 4 files
agent> edit file x2 -> respond
Time: 127s
Jackson key deserializer
Prompt: Preserve detailed error messages from a custom key deserializer.

Before (broad code walk):
agent> list files -> search x2 -> read README.md
agent> search x5 -> read DeserializationContext.java
agent> search x4 -> read StdDeserializer.java
agent> search -> read DeserializerCache.java
agent> read MapEntryDeserializer.java -> read JsonMappingException.java
agent> edit file -> respond
Time: 150s

After (targeted search):
agent> read SKILL.md -> search x3
agent> read MapDeserializer.java
agent> read StdKeyDeserializer.java
agent> read DeserializationContext.java
agent> edit file -> respond
Time: 34s

Configuration Explorer

We tested four tool configurations before choosing the final shape. Lower latency and lower total cost are better, so the lower-left corner of the plot is the target.

Configuration search

The selected option had the best latency while preserving cost reduction.

Scatter plot: median latency (78 s to 84 s) against total cost (USD 39.50 to USD 45.00) for five configurations: Baseline, 4 Search Tools, Unified Search Tool, 4 Tools + Router, and Unified Tool + Router.

Cross-Model Validation

We re-ran the experiment with GPT 5.4 on Java and Kotlin codebases. The pattern holds: latency and cost both drop. Kotlin saw the biggest cost improvement, with total cost falling 13.48%.

Cross-model check

The effect held beyond the original run.

Codex 5.2

Median latency: −8.33%
Total cost: −5.60%
P95 latency: −16.44%

GPT 5.4, Java

Median latency: −3.75%
Total cost: −4.07%
P95 latency: −13.00%

GPT 5.4, Kotlin

Median latency: −6.92%
Total cost: −13.48%
P95 latency: not significant

Metrics marked as not significant did not pass the significance threshold for that model and language.

How Models Adopt Tooling

Codex sends 91% of its search calls through the new IDE-native tool. Claude is a different story: Opus uses it for about half its searches, and Haiku only 28%, preferring grep and find instead.

This makes sense. Claude already has strong built-in code search, so it leans on what it knows. Codex doesn’t, so it grabs the better tool when one is available. The takeaway: prebundled tooling fills gaps. Where the model already has good search, it adds less. Where search is weak, it makes a real difference.
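For illustration, adoption can be tallied from agent traces as the share of search-type calls that go through each tool. The trace format and tool names in the sketch below are assumptions; the eval pipeline's actual schema may differ.

```python
# Sketch of tallying tool adoption from agent traces: count every search-type
# call and bucket it by the tool that handled it. Trace format and tool names
# are illustrative, not the actual eval pipeline's schema.
from collections import Counter

SEARCH_TOOLS = ("ide_search", "grep", "find")


def adoption_shares(traces: list[list[str]]) -> dict[str, float]:
    """Each trace is the ordered list of tool names an agent called for one task."""
    counts = Counter(call for trace in traces
                     for call in trace if call in SEARCH_TOOLS)
    total = sum(counts.values()) or 1
    return {tool: round(100 * counts[tool] / total, 1) for tool in SEARCH_TOOLS}


# Example: a model that routes most of its searches through the IDE-native tool.
example = [["ide_search", "read_file", "ide_search", "edit_file"],
           ["grep", "ide_search", "respond"]]
print(adoption_shares(example))  # {'ide_search': 75.0, 'grep': 25.0, 'find': 0.0}
```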

Tool adoption

Models do not use new tools at the same rate.

Codex: IDE Search 91%, grep 8%, find 1%
Claude Opus: IDE Search 53%, grep 28%, find 19%
Claude Haiku: IDE Search 28%, grep 33%, find 39%

What’s Next

The eval pipeline works. Now we’re using it.

We’re running the same experiment on smaller models next. Our hunch is that they’ll benefit even more, since they have less built-in search capability to fall back on.

The current results are strongest on Java and Kotlin. We’re expanding to Python, .NET, and TypeScript with bigger sample sizes.

Meanwhile, the winning configuration is being prepared for the integrated IntelliJ IDEA MCP Server, so agent sessions can use IDE-native tooling when the server is enabled.

The next step is to turn this feature on by default in upcoming AI Assistant plugin updates.

Want to try it before the default rollout?

  1. Set these registry keys to true: llm.chat.agent.codex.mcp.idea, llm.chat.agent.skills.settings.enabled, and llm.agents.contrib.bundled.skills.sync.enabled.
  2. In AI Assistant, choose Codex for the best results.
  3. Ask the agent to find something across the current project.

Measure first, ship second, keep measuring after. That's the whole approach.