Existing research benchmarks and LLM agents have achieved great success in general web search, yet they fall short when entering the rigorous domain of scientific literature. AutoResearchBench reframes academic research as two complementary paradigms — Deep Research and Wide Research — pushing the limits of AI agents in real-world scientific scenarios.
Sorted by Deep Research Accuracy (descending). All results are obtained under a unified evaluation framework. Submissions are welcome.
| # | Model / System | Type | DR Acc (%) | DR Time (s) | DR Tokens | DR Turns | DR Calls | WR IoU (%) | WR Time (s) | WR Tokens | WR Turns | WR Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|

DR = Deep Research (Accuracy), WR = Wide Research (IoU).
We conducted ablation studies to better understand the impact of different factors on model performance.
We compare each model under Think and NoThink modes using the same ReAct agent with the DeepXiv search tool. We report Accuracy (%) for Deep Research and IoU (%) for Wide Research. Bold indicates the better result within each model–task pair.
| Model | Mode | DR Acc (%) | DR Time (s) | DR Tokens | DR Turns | DR Calls | WR IoU (%) | WR Time (s) | WR Tokens | WR Turns | WR Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-3-flash | Think | 1.83 | 433.6 | 15,861 | 17.70 | 16.50 | 2.53 | 225.4 | 12,157 | 3.30 | 2.30 |
| Gemini-3-flash | NoThink | **2.75** | 236.9 | 14,052 | 15.90 | 14.90 | **6.11** | 64.7 | 15,632 | 4.69 | 3.37 |
| Qwen3-max | Think | 2.33 | 170.9 | 7,797 | 6.36 | 4.90 | 4.18 | 217.6 | 13,438 | 12.23 | 2.85 |
| Qwen3-max | NoThink | **3.24** | 166.0 | 8,771 | 6.10 | 5.10 | **6.89** | 181.9 | 13,302 | 4.20 | 2.80 |
| Deepseek-V3.2 | Think | **5.67** | 583.7 | 18,937 | 27.90 | 24.00 | 4.28 | 511.5 | 23,893 | 8.35 | 5.80 |
| Deepseek-V3.2 | NoThink | 4.21 | 405.7 | 21,575 | 28.80 | 25.80 | **7.70** | 560.5 | 25,038 | 6.25 | 5.00 |
NoThink mode generally achieves better Wide Research IoU with lower latency. For Deep Research, results are mixed — Deepseek-V3.2 benefits from thinking while other models do not, suggesting that extended reasoning may help highly targeted retrieval but not broad exhaustive collection.
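The ablation setup above can be sketched as a minimal ReAct loop. Everything here is illustrative, not the benchmark's actual harness: `llm` and `search` are caller-supplied callables, and the Think/NoThink toggle is modeled as a system-prompt switch, which is an assumption about the mechanism.

```python
def react_search(question, llm, search, max_turns=8, think=True):
    """Minimal ReAct loop: at each turn the model either issues a
    search call or emits a final answer.

    `llm(messages)` returns {"action": "search" | "finish", "arg": str};
    `search(query)` returns an observation string. Both are stand-ins
    for the real model endpoint and the DeepXiv search tool.
    """
    system = ("Reason step by step, then act." if think
              else "Act directly without extended reasoning.")
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        step = llm(messages)
        if step["action"] == "finish":
            return step["arg"]
        observation = search(step["arg"])
        messages.append({"role": "assistant", "content": step["arg"]})
        messages.append({"role": "user",
                         "content": f"Observation: {observation}"})
    return None  # turn budget exhausted without a final answer
```

Under this framing, the Time/Tokens/Turns/Calls columns simply count the loop iterations and tool invocations consumed before `finish`.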
Test-time scaling with DeepXiv Search: best@k for Deep Research and Max_IoU@k for Wide Research (k = 1, 2, 4, 8, 16), measuring the best-of-k performance upper bound across multiple independent runs.
Deep Research — best@k (Accuracy)
| Model | Tool | K=1 | K=2 | K=4 | K=8 | K=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Deep Research | 0.031 | 0.031 | 0.057 | 0.068 | 0.083 |
| Gemini-3.1-pro | Deep Research | 0.078 | 0.126 | 0.162 | 0.189 | 0.212 |
| Kimi-K2.5 | Deep Research | 0.060 | 0.110 | 0.152 | 0.194 | 0.242 |
Wide Research — Max_IoU@k
| Model | Tool | K=1 | K=2 | K=4 | K=8 | K=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Wide Research | 0.0501 | 0.0673 | 0.0823 | 0.0952 | 0.1074 |
| Gemini-3.1-pro | Wide Research | 0.0616 | 0.0801 | 0.0970 | 0.1124 | 0.1271 |
| Seed-2.0-pro | Wide Research | 0.0476 | 0.0631 | 0.0784 | 0.0942 | 0.1118 |
Both Deep Research accuracy and Wide Research IoU improve consistently with more samples, confirming test-time scaling potential on AutoResearchBench.
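The best@k numbers above can be estimated as the expected maximum score over a random size-k subset of n independent runs. A minimal sketch by exhaustive subset enumeration (an assumption about the estimator; the function name is ours, and enumeration is only practical for small n such as the 16 runs here):

```python
from itertools import combinations


def best_at_k(scores, k):
    """Expected max over all size-k subsets of per-run scores.

    Unbiased for k <= len(scores). For a single run (k = 1) this is
    just the mean; for k = len(scores) it is the overall maximum.
    """
    subsets = list(combinations(scores, k))
    return sum(max(s) for s in subsets) / len(subsets)
```

The same estimator applies to Max_IoU@k by feeding per-run IoU values instead of accuracies.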
Existing search benchmarks and LLM agents have achieved impressive results in general-domain web search, yet they consistently underperform when applied to rigorous scientific literature. True academic search goes far beyond keyword matching — it demands understanding abstract methodological paradigms, reasoning through citation chains, and verifying precise experimental values and nuanced details deep within papers.
AutoResearchBench is a benchmark specifically designed to bridge the gap between existing evaluations and real-world scientific research scenarios. It contains 1,000 carefully curated questions spanning multiple CS research areas across 195 fine-grained research topics. All answers are rigorously verified against real academic literature, drawn from a corpus of over 50,000 papers.
AutoResearchBench reformulates academic research into two complementary core paradigms, separately challenging an agent's deep reasoning ability and broad research coverage. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout.
Deep Research: given a research method or experimental setting, the agent must precisely locate and confirm specific numerical results, experimental details, or method comparisons within a vast body of literature. This challenges the agent's ability to reason through citation chains and perform precise comparisons.

Wide Research: given a research domain and attribute dimensions, the agent must systematically collect all qualifying papers and organize them into a structured table. This challenges the agent's completeness and fidelity at scale.
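The Wide Research IoU score reported throughout is a plain set ratio between the collected papers and the gold answer set. A minimal sketch (the function name and the use of normalized arXiv IDs as exact-match keys are our assumptions):

```python
def wide_research_iou(predicted, gold):
    """Intersection-over-union between predicted and gold paper sets.

    `predicted` and `gold` are iterables of canonical paper
    identifiers (e.g. normalized arXiv IDs). Duplicates are ignored
    since both sides are treated as sets.
    """
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty: vacuously perfect match
    return len(p & g) / len(p | g)
```

Note that IoU penalizes both missing papers (hurting the union's denominator via a small intersection) and spurious extras (inflating the union), so it rewards exactly the completeness-plus-fidelity behavior Wide Research targets.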
AutoResearchBench is designed around six core principles to ensure authenticity, reproducibility, and scientific rigor in evaluation.
All questions are grounded in real CS research literature, requiring agents to perform exact verification against paper content rather than relying on parametric model knowledge.
Deep Research tests precise localization and reasoning; Wide Research tests systematic coverage and organization. Together they provide a complete picture of academic search capability.
All answers are deterministic facts sourced from publicly available literature, providing a clean, uncontaminated evaluation environment with consistent and reproducible automated scoring.
The best current models achieve well below 20% accuracy on Deep Research and below 0.15 IoU on Wide Research, leaving ample headroom and preventing ceiling effects.
Spans NLP, Computer Vision, Machine Learning, AI Safety, Theory, and more — 8+ research categories and 195 fine-grained topics ensure generalizability and representative coverage.
Every question undergoes human annotation and cross-validation. Deep Research clues are extracted from exact paper values; Wide Research answer sets are confirmed through multiple verification rounds.
AutoResearchBench contains 1,000 questions spanning 195 research topics across 8+ major categories, with answers verified against over 10,000 academic papers.
Question count distribution by category for DeepResearch (n=602) and WideResearch (n=400).
Normalized percentage comparison of category distributions between DeepResearch and WideResearch.
The AutoResearchBench dataset and evaluation code are fully open. Community use and contributions are welcome.
If AutoResearchBench is helpful to your research, please cite our paper.
@misc{xiong2026autoresearchbenchbenchmarkingaiagents,
title={AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery},
author={Lei Xiong and Kun Luo and Ziyi Xia and Wenbo Zhang and Jin-Ge Yao and Zheng Liu
and Jingying Shao and Jianlyu Chen and Hongjin Qian and Xi Yang and Qian Yu and Hao Li
and Chen Yue and Xiaan Du and Yuyang Wang and Yesheng Liu and Haiyu Xu and Zhicheng Dou},
year={2026},
eprint={2604.25256},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.25256},
}