Agentic Code Retrieval Benchmark · V1

Find the files before writing the patch.

A curated benchmark for testing whether retrievers can locate the repository context a coding agent needs before it edits code.

Start here

The accompanying blog post explains why agent retrieval is not ordinary code search.

The benchmark is motivated by a simple failure mode: a coding agent can generate a plausible patch and still fail because it never opened the right files.

  • Why patch-generation benchmarks hide retrieval failures.
  • How V1 defines `code2test`, `comment2context`, and `trace2code`.
  • Why RepoMap-style structure helps on failure-trace retrieval.

225 manually curated samples · 3 agentic retrieval tasks · 4 baseline families reported
What V1 measures

Three ways agents lose context.

Each sample asks for files the agent must read, not merely files whose text looks similar to the query.

code2test

Given implementation intent, retrieve the tests that should validate the change.

comment2context

Given a review comment and its local file, retrieve the extra context needed to act correctly.

trace2code

Given a reproduced failure trace, retrieve audited root-cause source files.
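
As a concrete illustration, a `trace2code` sample pairs a failure trace with the files an auditor marked as root cause. The record below is a hypothetical sketch: the field names (`task`, `query`, `gold_files`) are assumptions for illustration, and the actual schema is defined by the release files.

# Hypothetical sample shape; field names are illustrative, not the V1 schema.
sample = {
    "task": "trace2code",
    "query": "Traceback (most recent call last): ...",  # reproduced failure trace
    "gold_files": [                                     # audited root-cause files
        "src/engine/scheduler.py",
    ],
}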

V1 leaderboard

Embeddings lead overall, but structure matters.

The full V1 release reports lexical, vectorless RepoMap, and embedding baselines on the same 225 curated samples. Overall MRR is the primary ranking metric.

Model                        MRR      R@10     R@20
Qwen3-Embedding-4B           0.2455   0.4033   0.5828
aider-style-repomap          0.2227   0.4705   0.6299
jina-code-embeddings-0.5b    0.1883   0.3133   0.4492
lexical                      0.1450   0.3267   0.4874

Candidate set: all_files. Full task-level results are included in the release report.
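
For reference, a minimal sketch of how these metrics are typically computed: standard MRR and Recall@k over a per-query ranked file list, not the benchmark's own evaluation code. Function names are illustrative.

def mrr(ranked_files, gold_files):
    # Reciprocal rank of the highest-ranked gold file; 0.0 if none appears.
    for rank, path in enumerate(ranked_files, start=1):
        if path in gold_files:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_files, gold_files, k):
    # Fraction of gold files that appear in the top-k retrieved files.
    hits = sum(1 for path in gold_files if path in ranked_files[:k])
    return hits / len(gold_files)

# Overall MRR averages the per-query reciprocal ranks across all samples.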

Reproduce

One command gets the release bundle.

The downloader fetches V1 from Hugging Face, verifies the checksum, and extracts benchmark, corpus, eval, and report files.

pip install "git+https://github.com/eyuansu62/agent-retrieval-bench.git"
arb download-benchmark --version v1 --local-dir data --force
arb validate data/benchmark/v1/*.jsonl
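
Once downloaded, the benchmark files are plain JSONL and can be inspected with standard tooling. A minimal sketch, assuming the data/benchmark/v1 layout produced by the commands above; the record keys are whatever the release defines, not guaranteed here.

import json
from pathlib import Path

for path in sorted(Path("data/benchmark/v1").glob("*.jsonl")):
    with path.open() as f:
        first = json.loads(f.readline())
    # Print each task file's name and the keys of its first record.
    print(path.name, sorted(first))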