While augmenting Multimodal Large Language Models (MLLMs) with tools is a promising direction, existing approaches face critical limitations. They often rely on single, atomic tools, lack support for multi-turn planning, and fail to equip models with the ability to select and coordinate effective tool combinations for complex tasks. To overcome these challenges, we introduce AdaReasoner, a new family of models designed for dynamic tool orchestration in iterative visual reasoning. Our paradigm integrates three key components: a scalable data curation methodology, a tailored Tool-GRPO optimization algorithm, and an adaptive learning mechanism. As a result, AdaReasoner achieves state-of-the-art performance, delivering substantial improvements over its base models (e.g., +24.9% on average for the 7B variant) and even surpassing strong proprietary models such as GPT-5 on challenging benchmarks like VSP and Jigsaw.
Learns to autonomously adopt beneficial tools, discard irrelevant ones, and modulate usage frequency.
Supports complex multi-turn tool interactions with reflection and backtracking capabilities (see the loop sketch after these highlights).
Generalizes to unseen tools and novel tasks beyond training distribution.
AdaReasoner achieves substantial gains across diverse benchmarks, outperforming both base models and strong proprietary systems like GPT-5 and Claude Sonnet 4.
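To make the multi-turn interaction pattern described in the highlights concrete, here is a minimal sketch of an iterative, tool-assisted reasoning loop with a reflection step that can discard an unhelpful tool call (backtracking). The names used here (`model.propose`, `model.reflect`, `model.answer`, the `tools` registry) are hypothetical placeholders for illustration, not the AdaReasoner API.

```python
# Minimal sketch of a multi-turn, tool-assisted reasoning loop with reflection
# and backtracking. All model/tool interfaces are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    thought: str                 # free-form reasoning for this turn
    tool: Optional[str]          # tool name chosen this turn, or None for a final answer
    tool_input: dict             # arguments passed to the tool
    observation: str = ""        # tool output fed back to the model


@dataclass
class Trajectory:
    question: str
    image: object
    steps: list = field(default_factory=list)


def run_episode(model, tools: dict, question: str, image, max_turns: int = 8) -> str:
    """Each turn, the model proposes a tool call or a final answer; tool outputs
    are appended to the context, and a reflection check may drop the last step
    (backtracking) before the next turn."""
    traj = Trajectory(question=question, image=image)
    for _ in range(max_turns):
        action = model.propose(traj)               # hypothetical API: returns a Step
        if action.tool is None:                    # model chose to answer directly
            return action.thought
        fn = tools.get(action.tool)
        action.observation = fn(image, **action.tool_input) if fn else "unknown tool"
        traj.steps.append(action)
        if model.reflect(traj):                    # hypothetical check: was the step useful?
            continue
        traj.steps.pop()                           # backtrack: discard an unhelpful tool call
    return model.answer(traj)                      # fall back to a final answer
```

The loop also captures the adaptive-usage behavior above: returning `None` as the tool lets the model skip tools entirely when they add no value.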
Our framework consists of three main stages: (a) Tool Cold Start (TC) with high-quality trajectory data curation, (b) Adaptive Learning for improved generalization, and (c) Multi-Turn Tool GRPO (TG) for reinforcement learning with tool interaction.
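As a rough illustration of how stage (c) could score grouped rollouts, the sketch below computes group-relative advantages from a task reward plus a tool-interaction bonus. The `beta` weighting and the bonus term are assumptions made for illustration; the paper's actual Tool-GRPO reward design is not reproduced here.

```python
# Generic GRPO-style group-relative advantage sketch with a hypothetical
# tool-use bonus; the real Tool-GRPO reward shaping may differ.
import numpy as np


def group_relative_advantages(task_rewards, tool_bonuses, beta: float = 0.1):
    """For a group of G rollouts of the same prompt, combine task reward with a
    small tool-interaction bonus, then normalize within the group (subtract the
    mean, divide by the std), as in GRPO-style training."""
    r = np.asarray(task_rewards, dtype=float) + beta * np.asarray(tool_bonuses, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)


# Example: 4 rollouts of one prompt; two are correct, and one correct rollout
# also issued well-formed tool calls, so it receives the highest advantage.
adv = group_relative_advantages(task_rewards=[1, 0, 1, 0], tool_bonuses=[1, 0, 0, 1])
print(adv)
```

In GRPO-style training, these normalized advantages would then weight the policy-gradient update applied to each rollout's generated tokens, including its tool-call turns.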
AdaReasoner-7B demonstrates advanced capabilities for multi-turn, tool-assisted reasoning and reflection.
@article{adareasoner2025,
  title={AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning},
  author={Song, Mingyang and Sun, Haoyu and Gu, Jiawei and Li, Linjie and Krishna, Ranjay and Cheng, Yu},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}