Welcome to the DeepPlanning Paper Collection
This site is a companion to the DeepPlanning paper — a benchmark for practical long-horizon agent planning with verifiable constraints.
What is DeepPlanning?
DeepPlanning evaluates how well AI agents handle complex, real-world planning tasks that require global constrained optimization. It features two core task types:
- Multi-day travel itineraries — agents must plan trips satisfying budget, time, preference, and logistical constraints simultaneously.
- Multi-product shopping tasks — agents must research and select products meeting multiple interacting requirements.
These tasks go beyond simple instruction-following. They require agents to search for information, reason over constraints, and produce plans that are globally optimal — not just locally plausible.
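To make the "globally optimal, not just locally plausible" distinction concrete, here is a minimal sketch of a constraint check over a whole itinerary. It is not DeepPlanning's actual verifier: the `Activity` fields, the `check_plan` function, and the three constraints it checks (total budget, trip length, no overlapping time slots) are all hypothetical names and rules chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    # Hypothetical schema for one itinerary item (not the benchmark's format).
    name: str
    day: int      # 1-based day of the trip
    start: float  # start time in hours on a 24h clock
    end: float    # end time in hours on a 24h clock
    cost: float


def check_plan(activities, budget, num_days):
    """Return a list of violated constraints; an empty list means the plan passes."""
    violations = []

    # Global constraint: the budget applies to the whole plan, not to each item.
    total = sum(a.cost for a in activities)
    if total > budget:
        violations.append(f"budget exceeded: {total} > {budget}")

    # Per-item constraints: each activity must fall inside the trip and be well-formed.
    for a in activities:
        if not 1 <= a.day <= num_days:
            violations.append(f"{a.name}: day {a.day} is outside the trip")
        if a.start >= a.end:
            violations.append(f"{a.name}: start time is not before end time")

    # Cross-item constraint: activities on the same day must not overlap.
    by_day = {}
    for a in activities:
        by_day.setdefault(a.day, []).append(a)
    for day, items in by_day.items():
        items.sort(key=lambda a: a.start)
        for prev, nxt in zip(items, items[1:]):
            if nxt.start < prev.end:
                violations.append(f"day {day}: {prev.name} overlaps {nxt.name}")

    return violations


# Each item looks fine on its own, but together they break the budget.
plan = [
    Activity("museum", day=1, start=9.0, end=12.0, cost=120.0),
    Activity("boat tour", day=1, start=13.0, end=15.0, cost=90.0),
]
print(check_plan(plan, budget=200.0, num_days=2))
```

In this example both activities fit the schedule and each costs less than the budget, yet the plan fails because the combined cost does not. That is the kind of error a step-by-step, locally plausible planner can miss.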
What’s in this collection?
We’ve curated 38 papers across 9 categories that represent the landscape of work related to DeepPlanning:
- Planning Benchmarks — other benchmarks that evaluate planning and reasoning
- Travel Planning — systems and benchmarks for travel itinerary generation
- Shopping and Search Agents — e-commerce and local search agent evaluation
- Web Agent Benchmarks — general web interaction benchmarks
- Tool Use and APIs — frameworks for tool-augmented LLMs
- Foundation Models — the models being evaluated on these tasks
- Agent Evaluation Frameworks — meta-benchmarks and evaluation methodology
- LLM Reasoning and Verification — work on self-verification limitations
Key resources
- DeepPlanning on arXiv
- Qwen DeepPlanning Leaderboard
- Sierra tau-bench Blog Post
- OpenAI BrowseComp Writeup
Stay tuned
We’ll be adding more posts covering individual papers, benchmark comparisons, and insights from the collection.