Welcome to the DeepPlanning Paper Collection

Tags: benchmark, planning, agents

This site is a companion to the DeepPlanning paper — a benchmark for practical long-horizon agent planning with verifiable constraints.

What is DeepPlanning?

DeepPlanning evaluates how well AI agents handle complex, real-world planning tasks that require global constrained optimization. It features two core task types:

  • Multi-day travel itineraries — agents must plan trips satisfying budget, time, preference, and logistical constraints simultaneously.
  • Multi-product shopping tasks — agents must research and select products meeting multiple interacting requirements.

These tasks go beyond simple instruction-following. They require agents to search for information, reason over constraints, and produce plans that are globally optimal — not just locally plausible.
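To make the "globally optimal, not just locally plausible" distinction concrete, here is a minimal sketch in Python. It is not DeepPlanning's task format or scoring code: the two-day itinerary, the option names, costs, preference scores, and the budget are all hypothetical values chosen for illustration. It compares a greedy planner that picks the best-looking option each day with an exhaustive search that respects a budget shared across the whole trip.

```python
from itertools import product

# Hypothetical toy data (not from DeepPlanning): for each day, a list of
# (name, cost, preference_score) options. The budget covers the whole trip.
DAY_OPTIONS = [
    [("museum", 40, 5), ("city walk", 0, 3)],
    [("boat tour", 80, 9), ("park", 10, 4)],
]
BUDGET = 90


def total(plan, idx):
    """Sum one field (1 = cost, 2 = preference score) over a plan."""
    return sum(item[idx] for item in plan)


def greedy_plan(days):
    """Locally plausible: take the highest-scoring option each day,
    without looking at the budget shared across days."""
    return [max(options, key=lambda o: o[2]) for options in days]


def global_plan(days, budget):
    """Enumerate every combination, keep the ones within budget,
    and return the one with the highest total preference score."""
    feasible = [p for p in product(*days) if total(p, 1) <= budget]
    if not feasible:
        return None
    return list(max(feasible, key=lambda p: total(p, 2)))


greedy = greedy_plan(DAY_OPTIONS)
best = global_plan(DAY_OPTIONS, BUDGET)

print("greedy:", [o[0] for o in greedy], "cost", total(greedy, 1))
print("global:", [o[0] for o in best], "cost", total(best, 1))
```

In this toy case the greedy plan (museum, then boat tour) costs 120 and breaks the 90 budget, while the search gives up the better day-one option to afford the boat tour and stays within budget. Exhaustive enumeration only works at this tiny scale; it is used here because it makes the constraint check easy to verify by hand.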

What’s in this collection?

We’ve curated 38 papers across 9 categories that represent the landscape of work related to DeepPlanning:

  • Planning Benchmarks — other benchmarks that evaluate planning and reasoning
  • Travel Planning — systems and benchmarks for travel itinerary generation
  • Shopping and Search Agents — e-commerce and local search agent evaluation
  • Web Agent Benchmarks — general web interaction benchmarks
  • Tool Use and APIs — frameworks for tool-augmented LLMs
  • Foundation Models — the models being evaluated on these tasks
  • Agent Evaluation Frameworks — meta-benchmarks and evaluation methodology
  • LLM Reasoning and Verification — work on self-verification limitations

Stay tuned

We’ll be adding more posts covering individual papers, benchmark comparisons, and insights from the collection.