This post explains why agent retrieval is not ordinary code search.
The benchmark is motivated by a simple failure mode: a coding agent can generate a plausible patch and still fail because it never opened the right files.
- Why patch-generation benchmarks hide retrieval failures.
- How V1 defines `code2test`, `comment2context`, and `trace2code`.
- Why RepoMap-style structure helps on failure-trace retrieval.
Three ways agents lose context.
Each sample asks for files the agent must read, not merely files whose text looks similar to the query.
- `code2test`: given implementation intent, retrieve the tests that should validate the change.
- `comment2context`: given a review comment and its local file, retrieve the extra context needed to act correctly.
- `trace2code`: given a reproduced failure trace, retrieve audited root-cause source files.
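As a concrete illustration, here is a minimal sketch of how one sample could be represented in code. The field names (`task`, `query`, `repo`, `gold_files`) are assumptions for illustration only, not the released V1 schema.

```python
# Hypothetical sample layout; field names are illustrative assumptions,
# not the released benchmark schema.
from dataclasses import dataclass


@dataclass
class RetrievalSample:
    task: str               # "code2test", "comment2context", or "trace2code"
    query: str              # implementation intent, review comment, or failure trace
    repo: str               # repository the candidate files come from
    gold_files: list[str]   # files the agent must actually read

sample = RetrievalSample(
    task="trace2code",
    query="Traceback (most recent call last): ...",
    repo="example/repo",
    gold_files=["src/example/parser.py"],
)
```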
Embeddings lead overall, but structure matters.
The full V1 report covers lexical, vectorless RepoMap, and embedding baselines on the same 225 curated samples. Overall MRR is the primary ranking metric.
Candidate set: `all_files`. Full task-level results are included in the release report.
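For reference, MRR is the mean over samples of the reciprocal rank of the first gold file in a baseline's ranked candidate list. A minimal sketch of the computation, assuming one ranked list of file paths and one gold set per sample (this is not the release's eval code):

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Mean over samples of 1 / rank of the first gold file (0 if none is retrieved)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        rr = 0.0
        for rank, path in enumerate(ranked, start=1):
            if path in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# First gold file at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank(
    [["a.py", "b.py"], ["c.py"]],
    [{"b.py"}, {"c.py"}],
))
```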
One command gets the release bundle.
The downloader fetches V1 from Hugging Face, verifies the checksum, and extracts benchmark, corpus, eval, and report files.
```bash
pip install "git+https://github.com/eyuansu62/agent-retrieval-bench.git"
arb download-benchmark --version v1 --local-dir data --force
arb validate data/benchmark/v1/*.jsonl
```
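Once downloaded, the benchmark JSONL files can be read with the standard library. A small sketch, assuming the `data/benchmark/v1` layout produced by the commands above; it makes no assumptions about field names beyond the files being JSON Lines:

```python
import glob
import json

# Read every V1 benchmark split; the path mirrors the --local-dir used above.
samples = []
for path in glob.glob("data/benchmark/v1/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                samples.append(json.loads(line))

print(f"loaded {len(samples)} samples")
```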