Existing research benchmarks and LLM agents have achieved great success in general web search, yet they fall short when entering the rigorous domain of scientific literature. AutoResearchBench reframes academic research as two complementary paradigms — Deep Research and Wide Research — pushing the limits of AI agents in real-world scientific scenarios.
Sorted by Deep Research Accuracy (descending). All results are obtained under a unified evaluation framework. Submissions are welcome.
| # | Model / System | Type | DR Acc (%) | DR Time (s) | DR Tokens | DR Turns | DR Calls | WR IoU (%) | WR Time (s) | WR Tokens | WR Turns | WR Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|

DR = Deep Research (Accuracy), WR = Wide Research (IoU).
We conducted ablation studies to better understand the impact of different factors on model performance.
We compare each model under Think and NoThink modes using the same ReAct agent with the DeepXiv search tool. We report Accuracy (%) for Deep Research and IoU (%) for Wide Research. Bold indicates the better result within each model–task pair.
| Model | Mode | DR Acc (%) | DR Time (s) | DR Tokens | DR Turns | DR Calls | WR IoU (%) | WR Time (s) | WR Tokens | WR Turns | WR Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-3-flash | Think | 1.83 | 433.6 | 15,861 | 17.70 | 16.50 | 2.53 | 225.4 | 12,157 | 3.30 | 2.30 |
| Gemini-3-flash | NoThink | **2.75** | 236.9 | 14,052 | 15.90 | 14.90 | **6.11** | 64.7 | 15,632 | 4.69 | 3.37 |
| Qwen3-max | Think | 2.33 | 170.9 | 7,797 | 6.36 | 4.90 | 4.18 | 217.6 | 13,438 | 12.23 | 2.85 |
| Qwen3-max | NoThink | **3.24** | 166.0 | 8,771 | 6.10 | 5.10 | **6.89** | 181.9 | 13,302 | 4.20 | 2.80 |
| Deepseek-V3.2 | Think | **5.67** | 583.7 | 18,937 | 27.90 | 24.00 | 4.28 | 511.5 | 23,893 | 8.35 | 5.80 |
| Deepseek-V3.2 | NoThink | 4.21 | 405.7 | 21,575 | 28.80 | 25.80 | **7.70** | 560.5 | 25,038 | 6.25 | 5.00 |
NoThink mode generally achieves better Wide Research IoU with lower latency. For Deep Research, results are mixed — Deepseek-V3.2 benefits from thinking while other models do not, suggesting that extended reasoning may help highly targeted retrieval but not broad exhaustive collection.
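The ablation setup above can be sketched as a minimal ReAct loop. Everything here is illustrative, not the benchmark's actual harness: `llm` and `search` are caller-supplied callables, and the Think/NoThink toggle is modeled as a system-prompt switch, which is an assumption about the mechanism.

```python
def react_search(question, llm, search, max_turns=8, think=True):
    """Minimal ReAct loop: at each turn the model either issues a
    search call or emits a final answer.

    `llm(messages)` returns {"action": "search" | "finish", "arg": str};
    `search(query)` returns an observation string. Both are stand-ins
    for the real model endpoint and the DeepXiv search tool.
    """
    system = ("Reason step by step, then act." if think
              else "Act directly without extended reasoning.")
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        step = llm(messages)
        if step["action"] == "finish":
            return step["arg"]
        observation = search(step["arg"])
        messages.append({"role": "assistant", "content": step["arg"]})
        messages.append({"role": "user",
                         "content": f"Observation: {observation}"})
    return None  # turn budget exhausted without a final answer
```

Under this framing, the Time/Tokens/Turns/Calls columns simply count the loop iterations and tool invocations consumed before `finish`.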
Test-time scaling with DeepXiv Search: best@k for Deep Research and Max_IoU@k for Wide Research (k = 1, 2, 4, 8, 16), measuring the best-of-k performance upper bound across multiple independent runs.
Deep Research — best@k (Accuracy)
| Model | Tool | K=1 | K=2 | K=4 | K=8 | K=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Deep Research | 0.031 | 0.031 | 0.057 | 0.068 | 0.083 |
| Gemini-3.1-pro | Deep Research | 0.078 | 0.126 | 0.162 | 0.189 | 0.212 |
| Kimi-K2.5 | Deep Research | 0.060 | 0.110 | 0.152 | 0.194 | 0.242 |
Wide Research — Max_IoU@k
| Model | Tool | K=1 | K=2 | K=4 | K=8 | K=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Wide Research | 0.0501 | 0.0673 | 0.0823 | 0.0952 | 0.1074 |
| Gemini-3.1-pro | Wide Research | 0.0616 | 0.0801 | 0.0970 | 0.1124 | 0.1271 |
| Seed-2.0-pro | Wide Research | 0.0476 | 0.0631 | 0.0784 | 0.0942 | 0.1118 |
Both Deep Research accuracy and Wide Research IoU improve consistently with more samples, confirming test-time scaling potential on AutoResearchBench.
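The best@k numbers above can be estimated as the expected maximum score over a random size-k subset of n independent runs. A minimal sketch by exhaustive subset enumeration (an assumption about the estimator; the function name is ours, and enumeration is only practical for small n such as the 16 runs here):

```python
from itertools import combinations


def best_at_k(scores, k):
    """Expected max over all size-k subsets of per-run scores.

    Unbiased for k <= len(scores). For a single run (k = 1) this is
    just the mean; for k = len(scores) it is the overall maximum.
    """
    subsets = list(combinations(scores, k))
    return sum(max(s) for s in subsets) / len(subsets)
```

The same estimator applies to Max_IoU@k by feeding per-run IoU values instead of accuracies.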
Existing search benchmarks and LLM agents have achieved impressive results in general-domain web search, yet they consistently underperform when applied to rigorous scientific literature. True academic search goes far beyond keyword matching — it demands understanding abstract methodological paradigms, reasoning through citation chains, and verifying precise experimental values and nuanced details deep within papers.
AutoResearchBench is a benchmark specifically designed to bridge the gap between existing evaluations and real-world scientific research scenarios. It contains 1,000 carefully curated questions spanning multiple CS research areas across 195 fine-grained research topics. All answers are rigorously verified against real academic literature, drawn from a corpus of over 50,000 papers.
AutoResearchBench reformulates academic research into two complementary core paradigms, separately challenging an agent's deep reasoning ability and broad research coverage. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout.
Deep Research: given a research method or experimental setting, the agent must precisely locate and confirm specific numerical results, experimental details, or method comparisons within a vast body of literature. This challenges the agent's ability to reason through citation chains and perform precise comparisons.

Wide Research: given a research domain and attribute dimensions, the agent must systematically collect all qualifying papers and organize them into a structured table. This challenges the agent's completeness and fidelity at scale.
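The Wide Research IoU score reported throughout is a plain set ratio between the collected papers and the gold answer set. A minimal sketch (the function name and the use of normalized arXiv IDs as exact-match keys are our assumptions):

```python
def wide_research_iou(predicted, gold):
    """Intersection-over-union between predicted and gold paper sets.

    `predicted` and `gold` are iterables of canonical paper
    identifiers (e.g. normalized arXiv IDs). Duplicates are ignored
    since both sides are treated as sets.
    """
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: vacuously perfect match
    return len(p & g) / len(p | g)
```

Note that IoU penalizes both missing papers (hurting the union's denominator via a small intersection) and spurious extras (inflating the union), so it rewards exactly the completeness-plus-fidelity behavior Wide Research targets.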
AutoResearchBench is designed around six core principles to ensure authenticity, reproducibility, and scientific rigor in evaluation.
All questions are grounded in real CS research literature, requiring agents to perform exact verification against paper content rather than relying on parametric model knowledge.
Deep Research tests precise localization and reasoning; Wide Research tests systematic coverage and organization. Together they provide a complete picture of academic search capability.
All answers are deterministic facts sourced from publicly available literature, providing a clean, uncontaminated evaluation environment with consistent and reproducible automated scoring.
The best current models achieve well below 20% accuracy on Deep Research and below 0.15 IoU on Wide Research, leaving ample headroom and preventing ceiling effects.
Spans NLP, Computer Vision, Machine Learning, AI Safety, Theory, and more — 8+ research categories and 195 fine-grained topics ensure generalizability and representative coverage.
Every question undergoes human annotation and cross-validation. Deep Research clues are extracted from exact paper values; Wide Research answer sets are confirmed through multiple verification rounds.
AutoResearchBench contains 1,000 questions spanning 195 research topics across 8+ major categories, with answers verified against over 10,000 academic papers.
Question count distribution by category for DeepResearch (n=602) and WideResearch (n=400).
Normalized percentage comparison of category distributions between DeepResearch and WideResearch.
The AutoResearchBench dataset and evaluation code are fully open. Community use and contributions are welcome.
If AutoResearchBench is helpful to your research, please cite our paper.
@misc{xiong2026autoresearchbenchbenchmarkingaiagents,
title={AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery},
author={Lei Xiong and Kun Luo and Ziyi Xia and Wenbo Zhang and Jin-Ge Yao and Zheng Liu
and Jingying Shao and Jianlyu Chen and Hongjin Qian and Xi Yang and Qian Yu and Hao Li
and Chen Yue and Xiaan Du and Yuyang Wang and Yesheng Liu and Haiyu Xu and Zhicheng Dou},
year={2026},
eprint={2604.25256},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.25256},
}