🔬 High-Difficulty Academic Research Benchmark

AutoResearchBench
Deep Verification & Wide Exploration for Academic Research

Existing research benchmarks and LLM agents have achieved great success in general web search, yet they fall short in the rigorous domain of scientific literature. AutoResearchBench reframes academic research as two complementary paradigms — Deep Research and Wide Research — pushing the limits of AI agents in real-world scientific scenarios.

AutoResearchBench Team
1000
Questions
195
Research Topics
2
Task Paradigms
50,000+
Papers Covered

Model Leaderboard

Sorted by Deep Research Accuracy (descending). All results are obtained under a unified evaluation framework. Submissions are welcome.

For each model / system, the leaderboard reports its type together with Deep Research metrics (Acc (%), Time (s), Tokens, Turns, Calls) and Wide Research metrics (IoU (%), Time (s), Tokens, Turns, Calls).

Additional Experiments

We conducted ablation studies to better understand the impact of different factors on model performance.

We compare each model under Think and NoThink modes using the same ReAct agent with the DeepXiv search tool. We report Accuracy (%) for Deep Research and IoU (%) for Wide Research. Bold indicates the better result within each model–task pair.

| Model | Mode | Deep Acc (%) | Deep Time (s) | Deep Tokens | Deep Turns | Deep Calls | Wide IoU (%) | Wide Time (s) | Wide Tokens | Wide Turns | Wide Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-3-flash | Think | 1.83 | 433.6 | 15,861 | 17.70 | 16.50 | 2.53 | 225.4 | 12,157 | 3.30 | 2.30 |
| Gemini-3-flash | NoThink | **2.75** | 236.9 | 14,052 | 15.90 | 14.90 | **6.11** | 64.7 | 15,632 | 4.69 | 3.37 |
| Qwen3-max | Think | 2.33 | 170.9 | 7,797 | 6.36 | 4.90 | 4.18 | 217.6 | 13,438 | 12.23 | 2.85 |
| Qwen3-max | NoThink | **3.24** | 166.0 | 8,771 | 6.10 | 5.10 | **6.89** | 181.9 | 13,302 | 4.20 | 2.80 |
| Deepseek-V3.2 | Think | **5.67** | 583.7 | 18,937 | 27.90 | 24.00 | 4.28 | 511.5 | 23,893 | 8.35 | 5.80 |
| Deepseek-V3.2 | NoThink | 4.21 | 405.7 | 21,575 | 28.80 | 25.80 | **7.70** | 560.5 | 25,038 | 6.25 | 5.00 |

NoThink mode generally achieves better Wide Research IoU with lower latency. For Deep Research, results are mixed — Deepseek-V3.2 benefits from thinking while other models do not, suggesting that extended reasoning may help highly targeted retrieval but not broad exhaustive collection.

Test-time scaling with DeepXiv Search: best@k and Max_IoU@k (k = 1, 2, 4, 8, 16), measuring the best-of-k performance upper bound across multiple independent runs for Deep Research and Wide Research respectively.

Deep Research — best@k (Accuracy)

| Model | Tool | k=1 | k=2 | k=4 | k=8 | k=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Deep Research | 0.031 | 0.031 | 0.057 | 0.068 | 0.083 |
| Gemini-3.1-pro | Deep Research | 0.078 | 0.126 | 0.162 | 0.189 | 0.212 |
| Kimi-K2.5 | Deep Research | 0.060 | 0.110 | 0.152 | 0.194 | 0.242 |

Wide Research — Max_IoU@k

| Model | Tool | k=1 | k=2 | k=4 | k=8 | k=16 |
|---|---|---|---|---|---|---|
| Gemini-3-flash | Wide Research | 0.0501 | 0.0673 | 0.0823 | 0.0952 | 0.1074 |
| Gemini-3.1-pro | Wide Research | 0.0616 | 0.0801 | 0.0970 | 0.1124 | 0.1271 |
| Seed-2.0-pro | Wide Research | 0.0476 | 0.0631 | 0.0784 | 0.0942 | 0.1118 |

Both Deep Research accuracy and Wide Research IoU improve consistently with more samples, confirming test-time scaling potential on AutoResearchBench.
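
For reference, best@k (and Max_IoU@k) can be read as the average, over questions, of the best single-run score among k independent runs. The sketch below is our own minimal illustration of that estimate, not the benchmark's official scoring script; the function and variable names are placeholders.

```python
# Minimal sketch (not the official scorer): estimate best@k / Max_IoU@k as the
# mean over questions of the best score among the first k independent runs.
# `run_scores[q]` holds per-run scores for question q: 0/1 correctness for
# Deep Research, or an IoU value in [0, 1] for Wide Research.

from statistics import mean

def best_at_k(run_scores: dict[str, list[float]], k: int) -> float:
    """Average over questions of the best score among the first k runs."""
    return mean(max(scores[:k]) for scores in run_scores.values())

# Toy example with two questions and four independent runs each:
run_scores = {
    "q1": [0.0, 1.0, 0.0, 0.0],  # correct on the 2nd run only
    "q2": [0.0, 0.0, 0.0, 0.0],  # never correct
}
print(best_at_k(run_scores, k=1))  # 0.0
print(best_at_k(run_scores, k=4))  # 0.5
```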

Benchmark Overview

Existing search benchmarks and LLM agents have achieved impressive results in general-domain web search, yet they consistently underperform when applied to rigorous scientific literature. True academic search goes far beyond keyword matching — it demands understanding abstract methodological paradigms, reasoning through citation chains, and verifying precise experimental values and nuanced details deep within papers.

AutoResearchBench is a benchmark specifically designed to bridge the gap between existing evaluations and real-world scientific research scenarios. AutoResearchBench contains 1,000 carefully curated questions spanning multiple CS research areas across 195 fine-grained research topics. All answers must be rigorously verified against real academic literature, covering over 50,000 papers.

AutoResearchBench reformulates academic research into two complementary core paradigms, separately challenging an agent's deep reasoning ability and broad research coverage. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout.

🔍
Deep Research — Precise Verification

Given a research method or experimental setting, the agent must precisely locate and confirm specific numerical results, experimental details, or method comparisons within a vast body of literature. This challenges the agent's ability to reason through citation chains and perform precise comparisons.

600 Questions · Accuracy Score · Point Verification
🌐
Wide Research — Exhaustive Collection

Given a research domain and attribute dimensions, the agent must systematically collect all qualifying papers and organize them into a structured table. This challenges the agent's completeness and fidelity at scale.

400 Questions · IoU Score · Set Coverage
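
For Wide Research, IoU compares the set of papers the agent collects with the annotated reference set. The snippet below is a minimal sketch of that set-level metric, assuming papers are keyed by arXiv ID; the official scorer may additionally normalize identifiers or check the fidelity of the attribute table, which this sketch omits.

```python
# Minimal sketch of the set-level IoU used for Wide Research scoring,
# assuming both sides are sets of arXiv IDs (the IDs below are placeholders).

def paper_iou(predicted: set[str], gold: set[str]) -> float:
    """Intersection-over-Union between collected papers and the reference set."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# One paper in common out of three distinct papers overall -> IoU = 1/3.
print(paper_iou({"2604.25256", "2401.00001"}, {"2604.25256", "2312.99999"}))
```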

Key Features

AutoResearchBench is designed around six core principles to ensure authenticity, reproducibility, and scientific rigor in evaluation.

📚
Authentic Scientific Scenarios

All questions are grounded in real CS research literature, requiring agents to perform exact verification against paper content rather than relying on parametric model knowledge.

🎯
Dual-Paradigm Evaluation

Deep Research tests precise localization and reasoning; Wide Research tests systematic coverage and organization. Together they provide a complete picture of academic search capability.

⚖️
Objectively Verifiable Answers

All answers are deterministic facts sourced from publicly available literature, providing a clean, uncontaminated evaluation environment with consistent and reproducible automated scoring.

🏆
Genuinely Challenging

The best current models achieve well below 20% accuracy on Deep Research and below 0.15 IoU on Wide Research, fully testing model upper bounds and preventing ceiling effects.

🌍
Broad Multi-Domain Coverage

Spans NLP, Computer Vision, Machine Learning, AI Safety, Theory, and more — 8+ research categories and 195 fine-grained topics ensure generalizability and representative coverage.

🔬
Rigorous Annotation Pipeline

Every question undergoes human annotation and cross-validation. Deep Research clues are extracted from exact paper values; Wide Research answer sets are confirmed through multiple verification rounds.

Dataset Statistics

AutoResearchBench contains 1,000 questions spanning 195 research topics across 8+ major categories, with answers verified against over 50,000 academic papers.

1000
Total Questions
195
Research Topics
50,000+
Papers Covered
2
Task Paradigms
DEEP RESEARCH
Questions 600
Metric Accuracy
Categories 8
Annotation Human + Auto Verification
WIDE RESEARCH
Questions 400
Metric IoU (Intersection over Union)
Topics 195
Avg. Papers / Question ~50
Category Distribution

Question count distribution by category for DeepResearch (n=602) and WideResearch (n=400).

Category Comparison

Normalized percentage comparison of category distributions between DeepResearch and WideResearch.

Dataset & Code

The AutoResearchBench dataset and evaluation code are fully open. Community use and contributions are welcome.

📦 Dataset Info

  • 📝 DeepResearch test set (600 questions)
  • 📝 WideResearch test set (400 questions)
  • 🏷️ Each entry: question, answer, arxiv_ids (see the loading sketch below)
  • 🔗 Answers linked to paper title / arXiv ID
  • 📊 Format: JSONL, UTF-8 encoding
  • 📋 License: CC BY 4.0
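
As a quick-start illustration, the sketch below loads one of the JSONL test files and reads the fields listed above. The file name is a placeholder; substitute the actual released file.

```python
# Minimal sketch for reading a test split; each line is a JSON object with
# the fields described above: question, answer, arxiv_ids.

import json

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

examples = load_jsonl("deep_research_test.jsonl")  # placeholder file name
for ex in examples[:3]:
    print(ex["question"])
    print(ex["answer"], ex["arxiv_ids"])
```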

💻 Evaluation Code

  • 🛠️ ReAct Agent inference scripts
  • 🔧 DeepSearch / WideSearch scorers
  • 📡 Tool API wrappers for arXiv search and semantic search (see the sketch below)
  • 🔄 Multi-process parallel inference support
  • 📈 Result statistics and visualization scripts
  • 📖 Detailed documentation and usage examples
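
To give a flavor of the tool layer, here is a minimal arXiv search wrapper that queries the public arXiv export API via requests and feedparser. It sketches the idea only; it is not the repository's own wrapper, and the semantic-search tool is not reproduced here.

```python
# Minimal sketch of an arXiv search tool (not the repository's wrapper):
# query the public arXiv export API and return title / link / abstract.

import feedparser  # pip install feedparser
import requests

def arxiv_search(query: str, max_results: int = 5) -> list[dict]:
    resp = requests.get(
        "http://export.arxiv.org/api/query",
        params={"search_query": f"all:{query}", "start": 0, "max_results": max_results},
        timeout=30,
    )
    feed = feedparser.parse(resp.text)
    return [{"title": e.title, "url": e.id, "abstract": e.summary} for e in feed.entries]

for hit in arxiv_search("retrieval-augmented generation", max_results=3):
    print(hit["title"], "->", hit["url"])
```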

📤 How to Submit Model Results

  1. Clone the evaluation repository from GitHub and configure your model API per the documentation
  2. Run inference on the test set using the provided scripts
  3. Compute Accuracy / IoU and other metrics using the scoring scripts
  4. Submit your result file via a Pull Request to the leaderboard repository
  5. We will review and update the leaderboard within 3 business days

Citation

If AutoResearchBench is helpful to your research, please cite our paper.

@misc{xiong2026autoresearchbenchbenchmarkingaiagents,
  title={AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery}, 
  author={Lei Xiong and Kun Luo and Ziyi Xia and Wenbo Zhang and Jin-Ge Yao and Zheng Liu 
    and Jingying Shao and Jianlyu Chen and Hongjin Qian and Xi Yang and Qian Yu and Hao Li 
    and Chen Yue and Xiaan Du and Yuyang Wang and Yesheng Liu and Haiyu Xu and Zhicheng Dou},
  year={2026},
  eprint={2604.25256},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.25256}, 
}