Benchmark for practical long-horizon agent planning, featuring multi-day travel and multi-product shopping tasks that require optimization under global constraints.
Papers
38 papers across 9 categories. Each links to arXiv and includes a local PDF.
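
To give a purely illustrative sense of what "optimization under global constraints" means for the primary benchmark's tasks, the sketch below shows a minimal feasibility check in which a multi-day itinerary and a multi-product shopping basket compete for one shared budget. All class names, fields, and numbers are hypothetical and are not taken from the benchmark or any listed paper.

```python
# Hypothetical illustration only: data model and constraints are invented,
# not drawn from the benchmark or any paper in this collection.
from dataclasses import dataclass

@dataclass
class DayPlan:
    city: str
    hotel_cost: float
    activity_cost: float

@dataclass
class Purchase:
    item: str
    price: float

def plan_is_feasible(days: list[DayPlan], basket: list[Purchase],
                     total_budget: float, required_items: set[str]) -> bool:
    """Check a combined travel-and-shopping plan against global constraints.

    "Global" here means the constraints couple otherwise independent
    decisions: every day's spending and every purchase draw on a single
    shared budget, and the basket must cover all required items.
    """
    travel_cost = sum(d.hotel_cost + d.activity_cost for d in days)
    shopping_cost = sum(p.price for p in basket)
    covers_needs = required_items <= {p.item for p in basket}
    return covers_needs and (travel_cost + shopping_cost) <= total_budget

# Example: a 2-day trip plus two purchases under a single $900 budget.
days = [DayPlan("Lisbon", 120.0, 60.0), DayPlan("Porto", 100.0, 40.0)]
basket = [Purchase("hiking boots", 150.0), Purchase("rain jacket", 90.0)]
print(plan_is_feasible(days, basket, total_budget=900.0,
                       required_items={"hiking boots", "rain jacket"}))
```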
Primary Paper
1 paper
Planning Benchmarks
5 papers
Proposes code generation over chain-of-thought for multi-constraint planning problems.
Scales planning capabilities in deep research agents using advantage shaping in reinforcement learning.
Benchmark focused on planning under temporal constraints, published at EMNLP 2025.
Time-aware simulation for evaluating multitasking efficiency in language agents (ACL 2024).
Extensible benchmark evaluating LLMs on classical planning and reasoning about state changes.
Travel Planning
8 papers
Uses constraint-gated RL and multi-agent competition for travel planning.
Multi-turn travel planning with 10 real-world tools, infeasible request handling, and sandbox evaluation.
Multi-agent framework with progressive refinement for trip planning.
Unified scoring metric for travel plan quality with RL integration via GRPO.
Evaluates personalization and reasonableness of itineraries (ACL 2025 Findings).
Chinese travel planning benchmark with stricter constraints and DSL-based evaluation.
Case study on personalized travel planning with LLM agents (EMNLP 2024 Industry).
First travel planning benchmark for LLM agents, featuring multi-day itinerary construction (ICML 2024).
Shopping and Search Agents
3 papers
Evaluates agentic search for local services like restaurants and businesses.
Benchmark for shopping agents that must deeply research products before purchasing.
Simulated e-commerce environment for grounded web interaction (NeurIPS 2022).
Web Agent Benchmarks
4 papers
Challenging benchmark from OpenAI testing browsing agent capabilities.
End-to-end web agent using multimodal models for real-world web tasks (ACL 2024).
Self-hosted realistic web environment for end-to-end agent evaluation (ICLR 2024).
Large-scale dataset for building generalist web agents across diverse websites (NeurIPS 2023).
Tool Use and APIs
3 papers
Open-world environment for testing tool-using agents at scale.
Framework for training LLMs to use 16,000+ real-world APIs (ICLR 2024).
Benchmark for evaluating LLMs on API/tool usage (EMNLP 2023).
Foundation Models
6 papers
DeepSeek's frontier open LLM.
Mixture-of-experts model (355B parameters) with strong performance on agentic benchmarks, including TAU-Bench.
Open-weight model focused on agentic intelligence from Moonshot AI.
Google's frontier model with next-generation agentic capabilities.
Technical report for the Qwen3 model family from Alibaba.
ByteDance's reasoning model trained with RL for advanced reasoning tasks.
Agent Evaluation Frameworks
7 papers
Studies whether LLM agents know and respect their own limitations.
Evaluates autonomous agents handling million-token real-world contexts.
Versatile interactive task benchmark for LLM agents in real applications.
Meta's framework for scaling agent environments and evaluation.
Interactive environment for evaluating user-centric agent behaviors.
Benchmark for conversational agents under dual-control conditions.
Dynamic policy-guided benchmark for tool-agent-user interaction.
LLM Reasoning and Verification
1 paper
Demonstrates that LLMs struggle to verify their own reasoning and planning outputs (ICLR 2025).