On March 31, 2026, Yinuo Liu and nine co-authors published ATP-Bench, a benchmark built to measure a specific capability in multimodal large language models (MLLMs): planning when and how to call external tools to generate responses that interleave text and images. The work targets a gap between two existing approaches, image generation and retrieval augmentation, which have until now been treated as separate, mutually exclusive paths rather than unified capabilities.
- ATP-Bench contains 7,702 QA pairs — including 1,592 VQA pairs — spanning 8 categories and 25 visual-critical intents, all human-verified.
- A new Multi-Agent MLLM-as-a-Judge (MAM) system evaluates tool-call precision without requiring ground-truth execution results.
- Tests on 10 state-of-the-art MLLMs found consistent struggles with coherent interleaved planning and significant variation in tool-use behavior.
- Dataset and code are publicly available via the paper’s arXiv submission.
What Happened
Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang submitted ATP-Bench to arXiv on March 31, 2026 (arXiv:2603.29902). The paper introduces both a benchmark dataset and a judge framework aimed at evaluating a capability the authors call Agentic Tool Planning — the ability of a multimodal model to act as a central controller that autonomously decides when, where, and which tools to invoke while generating interleaved text-and-image responses.
The paper argues that the field’s next milestone requires models to move beyond choosing either generation or retrieval, and instead unify both under a single planning layer. “The next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries,” the authors write in the abstract.
Why It Matters
Interleaved text-and-image generation requires a model to solve two distinct problems at once: maintaining factual accuracy and producing relevant visuals. Retrieval augmentation typically improves factuality but constrains visual output to existing images; generative approaches allow novel visuals but can introduce inaccuracies. Current systems handle one or the other — not both adaptively.
Without a dedicated benchmark, the field has lacked a standard method for comparing models on this specific planning capability or tracking progress over time. Prior evaluations of multimodal models have focused on end-to-end output quality, conflating planning decisions with execution results and making it difficult to isolate where failures originate.
Technical Details
ATP-Bench comprises 7,702 QA pairs, of which 1,592 are visual question-answering (VQA) pairs. The full dataset spans eight categories and 25 visual-critical intents, with all queries and ground truths human-verified. The benchmark is specifically designed to evaluate planning behavior independently from tool execution, so that a model’s score is not contingent on which backend tools happen to be available at runtime.
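To make the planning/execution split concrete, a benchmark item of this kind can be pictured as a query paired with a ground-truth plan rather than a ground-truth image. The abstract reports the counts and the category/intent structure but not the exact schema, so every field name and value below is an assumed illustration, not the dataset's actual format.

```python
# Hypothetical ATP-Bench-style record; field names and values are assumptions.
example_pair = {
    "category": "landmarks",           # one of 8 categories (name invented here)
    "intent": "appearance_reference",  # one of 25 visual-critical intents (invented)
    "question": "What does the Sagrada Familia look like today?",
    # Ground truth is a tool plan, not an executed image, so scoring
    # does not depend on which backend tools are available at runtime.
    "ground_truth_plan": [{"tool": "image_retrieval", "position": 1}],
    "is_vqa": False,                   # 1,592 of the 7,702 pairs are VQA
}

assert example_pair["ground_truth_plan"][0]["tool"] == "image_retrieval"
```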
To enable this separation, the authors introduce a Multi-Agent MLLM-as-a-Judge (MAM) system. MAM evaluates three dimensions: tool-call precision (whether the model correctly identifies which tools to invoke), missed tool opportunities (cases where a tool should have been used but was not), and overall response quality — all without requiring ground-truth execution results. This architecture is intended to keep the benchmark stable as underlying tool backends evolve.
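The first two MAM dimensions can be sketched as set comparisons between a model's predicted tool calls and the human-verified plan. The actual MAM system uses multiple MLLM judge agents rather than exact matching, and the abstract does not define these formulas, so the sketch below only illustrates the shape of the metrics under that assumption.

```python
# Illustrative metric shapes only; not the paper's actual scoring rules.
def judge(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Score a predicted tool plan against a reference plan."""
    correct = predicted & expected
    # Tool-call precision: fraction of invoked tools that were warranted.
    precision = len(correct) / len(predicted) if predicted else 1.0
    # Missed opportunities: fraction of warranted tools never invoked.
    missed = len(expected - predicted) / len(expected) if expected else 0.0
    return {"tool_call_precision": precision, "missed_opportunity_rate": missed}

scores = judge(predicted={"image_generation"},
               expected={"image_retrieval", "image_generation"})
print(scores)  # precision 1.0, but half the expected calls were missed
```

The third dimension, overall response quality, is inherently judgmental and is where the MLLM judges (rather than rule-based comparison) carry the weight; no execution results are needed for any of the three.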
Experiments conducted across 10 state-of-the-art MLLMs showed that models “struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior,” according to the paper. The abstract does not name specific models or report per-model numerical scores, but the authors state the findings provide “actionable guidance for advancing interleaved generation.”
Who’s Affected
The benchmark is directly relevant to teams building or evaluating multimodal models that generate mixed-media responses — including AI assistants, automated document generation systems, and visual Q&A applications. Researchers comparing MLLM capabilities now have a dataset with human-verified ground truths and a judge framework that does not require live tool execution, lowering the barrier for standardized evaluation.
The 10 tested models are not named in the abstract. Their developers will face public comparison once the full paper and any associated leaderboard results circulate.
What’s Next
The dataset and code are publicly available at the link provided in the arXiv submission. The MAM framework is designed to function as an ongoing evaluation tool as new MLLMs are released, since it does not depend on fixed tool backends. A key limitation implicit in the benchmark’s design is that it measures planning decisions rather than execution quality: a high ATP-Bench score does not guarantee correct final outputs when tools are actually invoked.
The authors indicate the results highlight “substantial room for improvement” across all tested models, suggesting follow-up work benchmarking newer or fine-tuned systems against ATP-Bench as a baseline. Institutional affiliations for the research team were not listed in the arXiv abstract at time of publication.
