[Paper] Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Published: November 26, 2025 at 10:45 AM EST
4 min read
Source: arXiv

Overview

The paper introduces Tool‑RoCo, a new benchmark that puts large language models (LLMs) through their paces in long‑term, multi‑robot cooperation scenarios. By treating other agents as tools that can be called on demand, the authors expose how well LLM‑driven agents can self‑organize, activate, deactivate, and coordinate without a pre‑written orchestration script.

Key Contributions

  • Agent‑as‑Tool paradigm – Reframes inter‑agent communication as tool‑calling, enabling quantitative measurement of cooperation (see the tool‑schema sketch after this list).
  • Four autonomy levels – Defines centralized cooperation, centralized self‑organization, decentralized cooperation, and fully decentralized self‑organization to compare how much “decision‑making” is left to the LLMs.
  • Three realistic robot tasks – SORT (object sorting), PACK (box packing), and CABINET (assembly) provide diverse, long‑horizon challenges.
  • Comprehensive metrics – Evaluates both task‑specific output quality (format & parameter accuracy) and coordination quality (tool‑usage patterns).
  • Open‑source release – Benchmark code, task definitions, and evaluation scripts are publicly available on GitHub.
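To make the Agent‑as‑Tool idea concrete, here is a minimal sketch of what the two tool families might look like when expressed as OpenAI‑style function‑calling schemas. The tool names, fields, and schema layout are illustrative assumptions, not the benchmark's actual interface.

```python
# Hypothetical tool schemas illustrating the Agent-as-Tool paradigm.
# Names and fields are assumptions, not Tool-RoCo's released API.

# Cooperative tool: one agent asks a peer to perform a subtask.
REQUEST_ASSISTANCE = {
    "type": "function",
    "function": {
        "name": "request_assistance",
        "description": "Ask another robot to perform a subtask.",
        "parameters": {
            "type": "object",
            "properties": {
                "target_agent": {"type": "string", "description": "ID of the robot to call."},
                "subtask": {"type": "string", "description": "What the peer should do, in natural language."},
            },
            "required": ["target_agent", "subtask"],
        },
    },
}

# Activation tool: turn a peer agent on or off for subsequent rounds.
SET_ACTIVATION = {
    "type": "function",
    "function": {
        "name": "set_activation",
        "description": "Activate or deactivate another robot.",
        "parameters": {
            "type": "object",
            "properties": {
                "target_agent": {"type": "string"},
                "active": {"type": "boolean"},
            },
            "required": ["target_agent", "active"],
        },
    },
}
```

Framing cooperation this way is what lets the benchmark count and classify calls: every request for help or activation change shows up as a structured tool invocation in the logs.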

Methodology

  1. Benchmark foundation – The authors start from RoCo, an established multi‑robot cooperation suite, and augment it with a tool interface that each LLM‑controlled agent can invoke.
  2. Tool taxonomy – Two main tool families are defined:
    • Cooperative tools – Calls that request another agent’s assistance (e.g., “ask robot B to fetch item X”).
    • Activation tools – Calls that turn agents on or off (e.g., “activate robot C”).
  3. Agent paradigms
    • Centralized cooperation: One “master” LLM decides which tool each robot should use.
    • Centralized self‑organization: The master LLM also decides which robots stay active.
    • Decentralized cooperation: Every robot runs its own LLM and picks tools based on its local view.
    • Decentralized self‑organization: Any robot can start a collaboration chain by calling activation tools for others.
  4. Evaluation loop – For each task, agents repeatedly (a) observe the current state, (b) select a tool from the candidate set, (c) receive the tool’s response, and (d) update their plan. This loop (sketched in code after this list) runs until the task succeeds or a timeout occurs.
  5. Metrics collection – The system logs tool‑call frequencies, success rates, and the quality of the final robot actions (e.g., correct sorting order, packing density).
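The evaluation loop in step 4 can be summarized with a short sketch. The class and function names below are hypothetical stand‑ins for the benchmark's components (an LLM‑backed policy, the simulated environment, a success check), meant to show the observe / select / respond / replan cycle rather than the released code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str    # e.g. "request_assistance" or "set_activation"
    args: dict   # tool-specific arguments chosen by the LLM


@dataclass
class Agent:
    agent_id: str
    policy: Callable[[dict], ToolCall]  # LLM-backed in practice; any callable here
    active: bool = True


def run_episode(agents, env_step, env_observe, is_success, max_rounds=50):
    """Run one cooperative episode until success or timeout.

    env_step(call)     -> dict: applies the tool call and returns its response
    env_observe()      -> dict: current state visible to the agents
    is_success(state)  -> bool: task completion check
    """
    log = []                                  # tool-call log, later mined for metrics
    for round_idx in range(max_rounds):
        state = env_observe()                 # (a) observe the current state
        if is_success(state):
            return True, log
        for agent in [a for a in agents if a.active]:
            call = agent.policy(state)        # (b) select a tool from the candidate set
            response = env_step(call)         # (c) receive the tool's response
            log.append((round_idx, agent.agent_id, call, response))
            # (d) a full system would update the agent's plan from `response`;
            # here the policy simply sees the refreshed state next round.
    return False, log                         # timeout
```

The returned `log` is the kind of record step 5 relies on: counting how many entries are cooperative versus activation calls yields the tool‑usage statistics reported in the results.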

Results & Findings

  • Tool usage is sparse – Across all LLMs tested, cooperative tools were invoked only 7.09 % of the time, indicating that agents rarely ask peers for help.
  • Agents stay “always on” – Activation tools made up 96.42 % of calls, showing a strong bias toward keeping all robots active rather than dynamically deactivating them.
  • Performance gap among paradigms – Centralized cooperation achieved the highest task‑completion rates, while fully decentralized self‑organization lagged behind, revealing that current LLMs still need stronger autonomous coordination capabilities.
  • Model size matters – Larger LLMs (e.g., GPT‑4‑style) produced slightly more cooperative calls than smaller models, but the overall proportion remained low.

Practical Implications

  • Designing LLM‑driven robot fleets – Engineers should not assume that LLM agents will naturally delegate work; explicit tool‑calling APIs or higher‑level coordination layers may be required.
  • Resource management – Since LLMs tend to keep all agents active, real‑world deployments should add external throttling or cost‑aware activation policies (sketched after this list) to avoid unnecessary power and compute consumption.
  • Benchmark‑driven development – Tool‑RoCo offers a ready‑made testbed for evaluating new prompting strategies, fine‑tuning datasets, or custom tool‑call handlers before deploying to physical robots.
  • Hybrid orchestration – A practical approach could combine a lightweight central scheduler (to handle activation) with decentralized LLM agents (to handle local decisions), leveraging the strengths observed in the benchmark’s four paradigms.
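As a rough illustration of the cost‑aware activation and hybrid orchestration ideas above, the sketch below layers a simple idle‑based deactivation rule on top of the `Agent` type from the earlier loop sketch. The threshold, helper names, and the idea that an idle agent returns `None` are assumptions for illustration, not recommendations from the paper.

```python
def schedule_activation(agents, idle_rounds, max_idle=3):
    """Central, lightweight rule: deactivate agents idle for too many rounds."""
    for agent in agents:
        if idle_rounds.get(agent.agent_id, 0) >= max_idle:
            agent.active = False  # stop paying LLM/compute cost for an idle robot


def hybrid_round(agents, state, idle_rounds, env_step):
    """One hybrid round: central activation decision + decentralized tool choices."""
    schedule_activation(agents, idle_rounds)
    for agent in agents:
        if not agent.active:
            continue
        call = agent.policy(state)           # decentralized local decision
        if call is None:                     # agent chose to do nothing this round
            idle_rounds[agent.agent_id] = idle_rounds.get(agent.agent_id, 0) + 1
            continue
        idle_rounds[agent.agent_id] = 0
        env_step(call)
```

The split mirrors the benchmark's findings: activation decisions, which current LLMs handle poorly, are delegated to a cheap deterministic scheduler, while the LLM agents keep the local, task‑level decisions they handle reasonably well.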

Limitations & Future Work

  • Synthetic environment – The benchmark runs in simulation; real‑world noise, latency, and hardware failures may affect tool‑calling behavior differently.
  • Tool set simplicity – Only two tool families were explored; richer interaction primitives (e.g., shared memory, negotiation protocols) could reveal deeper cooperation patterns.
  • LLM prompting constraints – The study used off‑the‑shelf prompting; custom fine‑tuning or reinforcement learning from tool‑use feedback might dramatically change the observed low cooperation rates.
  • Scalability – Experiments were limited to three robots; scaling to larger swarms could expose new coordination challenges that the current benchmark does not capture.

Tool‑RoCo opens the door to systematic, quantitative research on LLM autonomy in multi‑agent robotics. By treating other agents as callable tools, it gives developers a concrete way to measure—and eventually improve—the collaborative intelligence of LLM‑powered systems.

Authors

  • Ke Zhang
  • Xiaoning Zhao
  • Ce Zheng
  • Jiahong Ning
  • Dandan Zhu
  • Wenqi Zhang
  • Chen Sun
  • Toshiharu Sugawara

Paper Information

  • arXiv ID: 2511.21510v1
  • Categories: cs.MA, cs.AI
  • Published: November 26, 2025