[Paper] Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Published: November 26, 2025 at 10:45 AM EST
4 min read
Source: arXiv

Overview

The paper introduces Tool‑RoCo, a new benchmark that puts large language models (LLMs) through their paces in long‑term, multi‑robot cooperation scenarios. By treating other agents as tools that can be called on demand, the authors expose how well LLM‑driven agents can self‑organize, activate, deactivate, and coordinate without a pre‑written orchestration script.

Key Contributions

  • Agent‑as‑Tool paradigm – Reframes inter‑agent communication as tool‑calling, enabling quantitative measurement of cooperation (see the tool‑schema sketch after this list).
  • Four autonomy levels – Defines centralized cooperation, centralized self‑organization, decentralized cooperation, and fully decentralized self‑organization to compare how much “decision‑making” is left to the LLMs.
  • Three realistic robot tasks – SORT (object sorting), PACK (box packing), and CABINET (assembly) provide diverse, long‑horizon challenges.
  • Comprehensive metrics – Evaluates both task‑specific output quality (format & parameter accuracy) and coordination quality (tool‑usage patterns).
  • Open‑source release – Benchmark code, task definitions, and evaluation scripts are publicly available on GitHub.
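To make the Agent‑as‑Tool idea concrete, here is a minimal sketch of what the two tool families might look like when expressed as OpenAI‑style function‑calling schemas. The tool names, fields, and schema layout are illustrative assumptions, not the benchmark's actual interface.

```python
# Hypothetical tool schemas illustrating the Agent-as-Tool paradigm.
# Names and fields are assumptions, not Tool-RoCo's released API.

# Cooperative tool: one agent asks a peer to perform a subtask.
REQUEST_ASSISTANCE = {
    "type": "function",
    "function": {
        "name": "request_assistance",
        "description": "Ask another robot to perform a subtask.",
        "parameters": {
            "type": "object",
            "properties": {
                "target_agent": {"type": "string", "description": "ID of the robot to call."},
                "subtask": {"type": "string", "description": "What the peer should do, in natural language."},
            },
            "required": ["target_agent", "subtask"],
        },
    },
}

# Activation tool: turn a peer agent on or off for subsequent rounds.
SET_ACTIVATION = {
    "type": "function",
    "function": {
        "name": "set_activation",
        "description": "Activate or deactivate another robot.",
        "parameters": {
            "type": "object",
            "properties": {
                "target_agent": {"type": "string"},
                "active": {"type": "boolean"},
            },
            "required": ["target_agent", "active"],
        },
    },
}
```

Framing cooperation this way is what lets the benchmark count and classify calls: every request for help or activation change shows up as a structured tool invocation in the logs.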

Methodology

  1. Benchmark foundation – The authors start from RoCo, an established multi‑robot cooperation suite, and augment it with a tool interface that each LLM‑controlled agent can invoke.
  2. Tool taxonomy – Two main tool families are defined:
    • Cooperative tools – Calls that request another agent’s assistance (e.g., “ask robot B to fetch item X”).
    • Activation tools – Calls that turn agents on or off (e.g., “activate robot C”).
  3. Agent paradigms
    • Centralized cooperation: One “master” LLM decides which tool each robot should use.
    • Centralized self‑organization: The master LLM also decides which robots stay active.
    • Decentralized cooperation: Every robot runs its own LLM and picks tools based on its local view.
    • Decentralized self‑organization: Any robot can start a collaboration chain by calling activation tools for others.
  4. Evaluation loop – For each task, agents repeatedly (a) observe the current state, (b) select a tool from the candidate set, (c) receive the tool’s response, and (d) update their plan. This loop (sketched in code after this list) runs until the task succeeds or a timeout occurs.
  5. Metrics collection – The system logs tool‑call frequencies, success rates, and the quality of the final robot actions (e.g., correct sorting order, packing density).
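The evaluation loop in step 4 can be summarized with a short sketch. The class and function names below are hypothetical stand‑ins for the benchmark's components (an LLM‑backed policy, the simulated environment, a success check), meant to show the observe / select / respond / replan cycle rather than the released code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str    # e.g. "request_assistance" or "set_activation"
    args: dict   # tool-specific arguments chosen by the LLM


@dataclass
class Agent:
    agent_id: str
    policy: Callable[[dict], ToolCall]  # LLM-backed in practice; any callable here
    active: bool = True


def run_episode(agents, env_step, env_observe, is_success, max_rounds=50):
    """Run one cooperative episode until success or timeout.

    env_step(call)     -> dict: applies the tool call and returns its response
    env_observe()      -> dict: current state visible to the agents
    is_success(state)  -> bool: task completion check
    """
    log = []                                  # tool-call log, later mined for metrics
    for round_idx in range(max_rounds):
        state = env_observe()                 # (a) observe the current state
        if is_success(state):
            return True, log
        for agent in [a for a in agents if a.active]:
            call = agent.policy(state)        # (b) select a tool from the candidate set
            response = env_step(call)         # (c) receive the tool's response
            log.append((round_idx, agent.agent_id, call, response))
            # (d) a full system would update the agent's plan from `response`;
            # here the policy simply sees the refreshed state next round.
    return False, log                         # timeout
```

The returned `log` is the kind of record step 5 relies on: counting how many entries are cooperative versus activation calls yields the tool‑usage statistics reported in the results.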

Results & Findings

  • Tool usage is sparse – Across all LLMs tested, cooperative tools were invoked only 7.09 % of the time, indicating that agents rarely ask peers for help.
  • Agents stay “always on” – Activation tools made up 96.42 % of calls, showing a strong bias toward keeping all robots active rather than dynamically deactivating them.
  • Performance gap among paradigms – Centralized cooperation achieved the highest task‑completion rates, while fully decentralized self‑organization lagged behind, revealing that current LLMs still need stronger autonomous coordination capabilities.
  • Model size matters – Larger LLMs (e.g., GPT‑4‑style) produced slightly more cooperative calls than smaller models, but the overall proportion remained low.

Practical Implications

  • Designing LLM‑driven robot fleets – Engineers should not assume that LLM agents will naturally delegate work; explicit tool‑calling APIs or higher‑level coordination layers may be required.
  • Resource management – Since LLMs tend to keep all agents active, real‑world deployments should add external throttling or cost‑aware activation policies (sketched after this list) to avoid unnecessary power and compute consumption.
  • Benchmark‑driven development – Tool‑RoCo offers a ready‑made testbed for evaluating new prompting strategies, fine‑tuning datasets, or custom tool‑call handlers before deploying to physical robots.
  • Hybrid orchestration – A practical approach could combine a lightweight central scheduler (to handle activation) with decentralized LLM agents (to handle local decisions), leveraging the strengths observed in the benchmark’s four paradigms.
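As a rough illustration of the cost‑aware activation and hybrid orchestration ideas above, the sketch below layers a simple idle‑based deactivation rule on top of the `Agent` type from the earlier loop sketch. The threshold, helper names, and the idea that an idle agent returns `None` are assumptions for illustration, not recommendations from the paper.

```python
def schedule_activation(agents, idle_rounds, max_idle=3):
    """Central, lightweight rule: deactivate agents idle for too many rounds."""
    for agent in agents:
        if idle_rounds.get(agent.agent_id, 0) >= max_idle:
            agent.active = False  # stop paying LLM/compute cost for an idle robot


def hybrid_round(agents, state, idle_rounds, env_step):
    """One hybrid round: central activation decision + decentralized tool choices."""
    schedule_activation(agents, idle_rounds)
    for agent in agents:
        if not agent.active:
            continue
        call = agent.policy(state)           # decentralized local decision
        if call is None:                     # agent chose to do nothing this round
            idle_rounds[agent.agent_id] = idle_rounds.get(agent.agent_id, 0) + 1
            continue
        idle_rounds[agent.agent_id] = 0
        env_step(call)
```

The split mirrors the benchmark's findings: activation decisions, which current LLMs handle poorly, are delegated to a cheap deterministic scheduler, while the LLM agents keep the local, task‑level decisions they handle reasonably well.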

Limitations & Future Work

  • Synthetic environment – The benchmark runs in simulation; real‑world noise, latency, and hardware failures may affect tool‑calling behavior differently.
  • Tool set simplicity – Only two tool families were explored; richer interaction primitives (e.g., shared memory, negotiation protocols) could reveal deeper cooperation patterns.
  • LLM prompting constraints – The study used off‑the‑shelf prompting; custom fine‑tuning or reinforcement learning from tool‑use feedback might dramatically change the observed low cooperation rates.
  • Scalability – Experiments were limited to three robots; scaling to larger swarms could expose new coordination challenges that the current benchmark does not capture.

Tool‑RoCo opens the door to systematic, quantitative research on LLM autonomy in multi‑agent robotics. By treating other agents as callable tools, it gives developers a concrete way to measure—and eventually improve—the collaborative intelligence of LLM‑powered systems.

Authors

  • Ke Zhang
  • Xiaoning Zhao
  • Ce Zheng
  • Jiahong Ning
  • Dandan Zhu
  • Wenqi Zhang
  • Chen Sun
  • Toshiharu Sugawara

Paper Information

  • arXiv ID: 2511.21510v1
  • Categories: cs.MA, cs.AI
  • Published: November 26, 2025