Benchmark for practical long-horizon agent planning, featuring multi-day travel and multi-product shopping tasks that require optimization under global constraints.
Papers
38 papers across 9 categories. Each links to arXiv and includes a local PDF.
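
To give a purely illustrative sense of what "optimization under global constraints" means for the primary benchmark's tasks, the sketch below shows a minimal feasibility check in which a multi-day itinerary and a multi-product shopping basket compete for one shared budget. All class names, fields, and numbers are hypothetical and are not taken from the benchmark or any listed paper.

```python
# Hypothetical illustration only: data model and constraints are invented,
# not drawn from the benchmark or any paper in this collection.
from dataclasses import dataclass

@dataclass
class DayPlan:
    city: str
    hotel_cost: float
    activity_cost: float

@dataclass
class Purchase:
    item: str
    price: float

def plan_is_feasible(days: list[DayPlan], basket: list[Purchase],
                     total_budget: float, required_items: set[str]) -> bool:
    """Check a combined travel-and-shopping plan against global constraints.

    "Global" here means the constraints couple otherwise independent
    decisions: every day's spending and every purchase draw on a single
    shared budget, and the basket must cover all required items.
    """
    travel_cost = sum(d.hotel_cost + d.activity_cost for d in days)
    shopping_cost = sum(p.price for p in basket)
    covers_needs = required_items <= {p.item for p in basket}
    return covers_needs and (travel_cost + shopping_cost) <= total_budget

# Example: a 2-day trip plus two purchases under a single $900 budget.
days = [DayPlan("Lisbon", 120.0, 60.0), DayPlan("Porto", 100.0, 40.0)]
basket = [Purchase("hiking boots", 150.0), Purchase("rain jacket", 90.0)]
print(plan_is_feasible(days, basket, total_budget=900.0,
                       required_items={"hiking boots", "rain jacket"}))
```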
Primary Paper
1 paper
Planning Benchmarks
5 papers
Proposes code generation over chain-of-thought for multi-constraint planning problems.
Scales planning capabilities in deep research agents using advantage shaping in reinforcement learning.
Benchmark focused on planning under temporal constraints, published at EMNLP 2025.
Time-aware simulation for evaluating multitasking efficiency in language agents (ACL 2024).
Extensible benchmark evaluating LLMs on classical planning and reasoning about state changes.
Travel Planning
8 papers
Uses constraint-gated RL and multi-agent competition for travel planning.
Multi-turn travel planning with 10 real-world tools, infeasible request handling, and sandbox evaluation.
Multi-agent framework with progressive refinement for trip planning.
Unified scoring metric for travel plan quality with RL integration via GRPO.
Evaluates personalization and reasonableness of itineraries (ACL 2025 Findings).
Chinese travel planning benchmark with stricter constraints and DSL-based evaluation.
Case study on personalized travel planning with LLM agents (EMNLP 2024 Industry).
First travel planning benchmark for LLM agents, featuring multi-day itinerary construction (ICML 2024).
Shopping and Search Agents
3 papers
Evaluates agentic search for local services like restaurants and businesses.
Benchmark for shopping agents that must deeply research products before purchasing.
Simulated e-commerce environment for grounded web interaction (NeurIPS 2022).
Web Agent Benchmarks
4 papers
Challenging benchmark from OpenAI testing browsing agent capabilities.
End-to-end web agent using multimodal models for real-world web tasks (ACL 2024).
Self-hosted realistic web environment for end-to-end agent evaluation (ICLR 2024).
Large-scale dataset for building generalist web agents across diverse websites (NeurIPS 2023).
Tool Use and APIs
3 papers
Open-world environment for testing tool-using agents at scale.
Framework for training LLMs to use 16,000+ real-world APIs (ICLR 2024).
Benchmark for evaluating LLMs on API/tool usage (EMNLP 2023).
Foundation Models
6 papers
DeepSeek's frontier open LLM.
Mixture-of-experts model (355B parameters) with strong performance on agentic benchmarks, including TAU-Bench.
Open-weight model focused on agentic intelligence from Moonshot AI.
Google's frontier model with next-generation agentic capabilities.
Technical report for the Qwen3 model family from Alibaba.
ByteDance's reasoning model trained with RL for advanced reasoning tasks.
Agent Evaluation Frameworks
7 papers
Studies whether LLM agents know and respect their own limitations.
Evaluates autonomous agents handling million-token real-world contexts.
Versatile interactive task benchmark for LLM agents in real applications.
Meta's framework for scaling agent environments and evaluation.
Interactive environment for evaluating user-centric agent behaviors.
Benchmark for conversational agents under dual-control conditions.
Dynamic policy-guided benchmark for tool-agent-user interaction.
LLM Reasoning and Verification
1 paper
Demonstrates that LLMs struggle to verify their own reasoning and planning outputs (ICLR 2025).