Welcome to the DeepPlanning Paper Collection

This site is a companion to the DeepPlanning paper — a benchmark for practical long-horizon agent planning with verifiable constraints.

What is DeepPlanning?

DeepPlanning evaluates how well AI agents handle complex, real-world planning tasks that require global constrained optimization. It features two core task types:

Multi-day travel itineraries — agents must plan trips satisfying budget, time, preference, and logistical constraints simultaneously.
Multi-product shopping tasks — agents must research and select products meeting multiple interacting requirements.

These tasks go beyond simple instruction-following. They require agents to search for information, reason over constraints, and produce plans that are globally optimal — not just locally plausible.
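
To make the gap between "locally plausible" and "globally optimal" concrete, here is a toy sketch of a shopping-style task. It is not taken from the DeepPlanning benchmark: the catalog, prices, ratings, budget, and function names are all hypothetical, chosen only to show how a step-by-step greedy choice can satisfy the budget and still miss the best overall plan.

```python
from itertools import product

# Hypothetical catalog (illustration only): category -> [(name, price, rating)]
CATALOG = {
    "laptop": [("laptop_A", 900, 9), ("laptop_B", 600, 7)],
    "monitor": [("monitor_C", 400, 9), ("monitor_D", 200, 6)],
}
BUDGET = 1100  # hypothetical total budget


def total_rating(plan):
    """Sum of ratings over the selected items."""
    return sum(item[2] for item in plan)


def greedy_plan(catalog, budget):
    """Pick the best-rated affordable item one category at a time."""
    remaining = budget
    plan = []
    for items in catalog.values():
        affordable = [item for item in items if item[1] <= remaining]
        if not affordable:
            return None  # the greedy choices left no feasible item
        best = max(affordable, key=lambda item: item[2])
        plan.append(best)
        remaining -= best[1]
    return plan


def exhaustive_plan(catalog, budget):
    """Check every combination and keep the best one within budget."""
    best_plan, best_score = None, -1
    for combo in product(*catalog.values()):
        cost = sum(item[1] for item in combo)
        score = sum(item[2] for item in combo)
        if cost <= budget and score > best_score:
            best_plan, best_score = list(combo), score
    return best_plan


greedy = greedy_plan(CATALOG, BUDGET)
optimal = exhaustive_plan(CATALOG, BUDGET)
print("greedy: ", [item[0] for item in greedy], total_rating(greedy))
print("optimal:", [item[0] for item in optimal], total_rating(optimal))
```

In this toy case the greedy plan takes the top-rated laptop first and is then forced into the weaker monitor, for a total rating of 15, while the exhaustive search finds a cheaper laptop paired with the better monitor, for a total of 16. Both plans respect the budget; only one is globally best. Exhaustive search is used here purely because the example is tiny, and realistic tasks would need a smarter search.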

What’s in this collection?

We’ve curated 38 papers across 9 categories that represent the landscape of work related to DeepPlanning:

Planning Benchmarks — other benchmarks that evaluate planning and reasoning
Travel Planning — systems and benchmarks for travel itinerary generation
Shopping and Search Agents — e-commerce and local search agent evaluation
Web Agent Benchmarks — general web interaction benchmarks
Tool Use and APIs — frameworks for tool-augmented LLMs
Foundation Models — the models being evaluated on these tasks
Agent Evaluation Frameworks — meta-benchmarks and evaluation methodology
LLM Reasoning and Verification — work on self-verification limitations

Key resources

Stay tuned

We’ll be adding more posts covering individual papers, benchmark comparisons, and insights from the collection.