Agentic Code Retrieval Benchmark · V1

Find the files before writing the patch.

A curated benchmark for testing whether retrievers can locate the repository context a coding agent needs before it edits code.

Start here

The accompanying blog post explains why agent retrieval is not ordinary code search.

The benchmark is motivated by a simple failure mode: a coding agent can generate a plausible patch and still fail because it never opened the right files.

  • Why patch-generation benchmarks hide retrieval failures.
  • How V1 defines `code2test`, `comment2context`, and `trace2code`.
  • Why RepoMap-style structure helps on failure-trace retrieval.

225 manually curated samples · 3 agentic retrieval tasks · 4 baseline families reported
What V1 measures

Three ways agents lose context.

Each sample asks for files the agent must read, not merely files whose text looks similar to the query.

code2test

Given implementation intent, retrieve the tests that should validate the change.

comment2context

Given a review comment and its local file, retrieve the extra context needed to act correctly.

trace2code

Given a reproduced failure trace, retrieve audited root-cause source files.
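
As a concrete illustration, a `trace2code` sample pairs a failure trace with the files an auditor marked as root cause. The record below is a hypothetical sketch: the field names (`task`, `query`, `gold_files`) are assumptions for illustration, and the actual schema is defined by the release files.

# Hypothetical sample shape; field names are illustrative, not the V1 schema.
sample = {
    "task": "trace2code",
    "query": "Traceback (most recent call last): ...",  # reproduced failure trace
    "gold_files": [                                     # audited root-cause files
        "src/engine/scheduler.py",
    ],
}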

V1 leaderboard

Embeddings lead overall, but structure matters.

The full V1 release reports lexical, vectorless RepoMap, and embedding baselines on the same 225 curated samples. Overall MRR is the primary ranking metric.

Model                        MRR      R@10     R@20
Qwen3-Embedding-4B           0.2455   0.4033   0.5828
aider-style-repomap          0.2227   0.4705   0.6299
jina-code-embeddings-0.5b    0.1883   0.3133   0.4492
lexical                      0.1450   0.3267   0.4874

Candidate set: all_files. Full task-level results are included in the release report.
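
For reference, a minimal sketch of how these metrics are typically computed: standard MRR and Recall@k over a per-query ranked file list, not the benchmark's own evaluation code. Function names are illustrative.

def mrr(ranked_files, gold_files):
    # Reciprocal rank of the highest-ranked gold file; 0.0 if none appears.
    for rank, path in enumerate(ranked_files, start=1):
        if path in gold_files:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_files, gold_files, k):
    # Fraction of gold files that appear in the top-k retrieved files.
    hits = sum(1 for path in gold_files if path in ranked_files[:k])
    return hits / len(gold_files)

# Overall MRR averages the per-query reciprocal ranks across all samples.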

Reproduce

One command gets the release bundle.

The downloader fetches V1 from Hugging Face, verifies the checksum, and extracts benchmark, corpus, eval, and report files.

pip install "git+https://github.com/eyuansu62/agent-retrieval-bench.git"
arb download-benchmark --version v1 --local-dir data --force
arb validate data/benchmark/v1/*.jsonl
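
Once downloaded, the benchmark files are plain JSONL and can be inspected with standard tooling. A minimal sketch, assuming the data/benchmark/v1 layout produced by the commands above; the record keys are whatever the release defines, not guaranteed here.

import json
from pathlib import Path

for path in sorted(Path("data/benchmark/v1").glob("*.jsonl")):
    with path.open() as f:
        first = json.loads(f.readline())
    # Print each task file's name and the keys of its first record.
    print(path.name, sorted(first))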