[Paper] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Published: December 29, 2025 at 12:12 PM EST
4 min read
Source: arXiv - 2512.23611v1

Overview

A new paper, “Close the Loop: Synthesizing Infinite Tool-Use Data via Multi‑Agent Role‑Playing,” proposes a fully autonomous pipeline—InfTool—that can teach large language models (LLMs) to call external APIs without any human‑written examples. By letting three specialized agents generate, verify, and refine tool‑calling trajectories, the system repeatedly improves itself, turning raw API specs into massive, high‑quality training data.

Key Contributions

  • InfTool framework: A closed‑loop, multi‑agent system that synthesizes unlimited tool‑use examples from only API documentation.
  • Three cooperating agents:
    1. User Simulator – creates realistic user requests.
    2. Tool‑Calling Assistant – decides which API to invoke and how.
    3. MCP Server – executes calls, checks results, and provides feedback.
  • Group Relative Policy Optimization (GRPO): A reinforcement‑learning‑style update that trains the assistant with gated rewards, encouraging it to fill its own capability gaps.
  • Zero‑human annotation: All data are generated, verified, and used for training without any manual labeling.
  • State‑of‑the‑art performance: A 32‑billion‑parameter model jumps from 19.8 % to 70.9 % accuracy on the Berkeley Function‑Calling Leaderboard—outperforming much larger commercial models.
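
To make these pieces concrete, here is a minimal sketch (not taken from the paper) of what one synthesized tool‑use record might look like, assuming JSON‑style function calls and a pass/fail flag from the MCP Server; the class and field names are purely illustrative.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCall:
    """One function call proposed by the Tool-Calling Assistant."""
    name: str                  # API/function name taken from the spec
    arguments: dict[str, Any]  # JSON arguments for the call

@dataclass
class Trajectory:
    """One synthesized tool-use example produced by the role-playing loop."""
    user_request: str          # drafted by the User Simulator
    calls: list[ToolCall]      # may span multiple turns and APIs
    responses: list[Any]       # raw results returned by the MCP Server
    verified: bool             # True only if the MCP check passed

# Illustrative record; only verified trajectories enter the training set.
example = Trajectory(
    user_request="Show me the weather in Tokyo tomorrow",
    calls=[ToolCall("get_weather", {"city": "Tokyo", "date": "tomorrow"})],
    responses=[{"temp_c": 18, "condition": "cloudy"}],
    verified=True,
)
print(example)
```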

Methodology

  1. Input – API specs: The system starts with OpenAPI‑style descriptions (endpoints, parameters, return types). No examples are needed.
  2. Role‑playing loop:
    • The User Simulator drafts a natural‑language request that could plausibly need one of the APIs (e.g., “Show me the weather in Tokyo tomorrow”).
    • The Tool‑Calling Assistant (an LLM) interprets the request, selects the appropriate API, and generates the exact function call (JSON arguments, HTTP method, etc.).
    • The MCP Server (a lightweight execution sandbox) runs the call against a mock or real service, returns the response, and flags any mismatch or error.
  3. Self‑verification & filtering: Only trajectories that pass the MCP check are kept; the rest are fed back as negative examples.
  4. Training via GRPO: The assistant’s policy is updated using a group‑wise relative reward that compares each new trajectory against a baseline set, rewarding novel, correct, and diverse calls while penalizing repeats or failures (a toy reward sketch appears after the paragraph below).
  5. Iterative improvement: The freshly trained assistant now produces higher‑quality requests, which the loop repeats—hence the “closed loop.”
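
The following is a toy, self‑contained sketch of one pass through that loop, with stub agents and a mocked execution sandbox standing in for the LLM‑based components; names such as `UserSimulator` and `MCPServer.execute` are assumptions made for illustration, not the paper's code.

```python
import json

# Toy stand-in for an OpenAPI-style spec: tool name -> expected parameter types.
API_SPECS = {"get_weather": {"city": str, "date": str}}

class UserSimulator:
    def draft_request(self) -> str:
        # The paper uses an LLM here; a canned request stands in for it.
        return "Show me the weather in Tokyo tomorrow"

class ToolCallingAssistant:
    def propose_call(self, request: str) -> dict:
        # The paper trains this policy; here it is a fixed stub.
        return {"name": "get_weather",
                "arguments": {"city": "Tokyo", "date": "tomorrow"}}

class MCPServer:
    def execute(self, call: dict):
        params = API_SPECS.get(call["name"])
        if params is None:
            return None, False                      # unknown tool
        for name, typ in params.items():
            if not isinstance(call["arguments"].get(name), typ):
                return None, False                  # missing or malformed argument
        return {"temp_c": 18, "condition": "cloudy"}, True  # mocked response

def loop_iteration(sim, assistant, server, dataset, negatives):
    request = sim.draft_request()                   # 1. simulate a user request
    call = assistant.propose_call(request)          # 2. generate the tool call
    response, ok = server.execute(call)             # 3. execute and verify
    record = {"request": request, "call": call, "response": response}
    (dataset if ok else negatives).append(record)   # 4. keep, or recycle as a negative

dataset, negatives = [], []
loop_iteration(UserSimulator(), ToolCallingAssistant(), MCPServer(), dataset, negatives)
print(json.dumps(dataset, indent=2))
```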

The whole pipeline runs automatically on commodity GPU clusters, producing millions of verified examples in a few days.
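
The paper's exact GRPO objective is not reproduced in this summary; the sketch below only illustrates the general idea behind step 4, namely a verification‑gated reward and advantages computed relative to the group of trajectories sampled for the same request, using assumed, simplified reward values.

```python
import statistics

def gated_reward(passed_check: bool, quality: float) -> float:
    """Reward is gated on MCP verification: a failed call earns nothing."""
    return quality if passed_check else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each trajectory relative to its sampled group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Three trajectories sampled for the same request (illustrative scores).
rewards = [gated_reward(True, 1.0), gated_reward(True, 0.6), gated_reward(False, 0.9)]
print(group_relative_advantages(rewards))  # above-average calls get positive advantage
```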

Results & Findings

  • BFCL accuracy: 19.8 % (32B baseline) → 70.9 % after InfTool, a +258 % relative gain
  • Training data: 100 % synthetic (zero human-annotated examples)
  • Model size needed for comparable performance: 32 B with InfTool vs. 320 B (Claude‑Opus)

Key observations

  • Diversity matters: The multi‑turn, multi‑API sequences generated by the agents cover edge cases that single‑model synthetic pipelines miss.
  • Self‑targeted learning: GRPO pushes the assistant to explore APIs it currently struggles with, automatically balancing the dataset (one way this could be realized is sketched after this list).
  • No human bottleneck: The entire improvement comes from automatically generated data, eliminating costly annotation cycles.
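
The paper does not spell out the sampling mechanics in this summary's level of detail, but one plausible reading of "fill its own capability gaps" is to practice APIs in inverse proportion to the assistant's current success rate; the snippet below is purely an assumption‑based illustration.

```python
import random

# Per-API success rate measured on recent verified trajectories (illustrative).
success_rate = {"get_weather": 0.92, "book_flight": 0.40, "query_database": 0.15}

def sample_api_to_practice(rates: dict[str, float]) -> str:
    """Weight APIs by failure rate so weak spots get practiced more often."""
    apis = list(rates)
    weights = [1.0 - rates[api] for api in apis]
    return random.choices(apis, weights=weights, k=1)[0]

print(sample_api_to_practice(success_rate))  # 'query_database' is drawn most often
```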

Practical Implications

  • Rapid prototyping of tool‑enabled agents: Developers can feed only API docs into InfTool and obtain a ready‑to‑fine‑tune model that reliably calls those services.
  • Cost‑effective scaling: Companies can bootstrap tool‑use capabilities for internal LLMs without hiring annotators, saving millions in data‑labeling budgets.
  • Continuous improvement pipelines: As new APIs are added, the same loop can auto‑generate fresh training data, keeping the assistant up‑to‑date without manual regression testing.
  • Better sandbox testing: The MCP Server acts like an automated integration test suite, catching mismatches early in the development cycle (a schema‑check sketch follows this list).
  • Open‑source potential: If released as a library, InfTool could become a standard component in LLM‑as‑a‑service platforms (e.g., LangChain, LlamaIndex) for auto‑generating function‑calling datasets.
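
As an illustration of the kind of check such a sandbox can run, the sketch below validates generated call arguments against an OpenAPI‑style parameter schema using the widely available `jsonschema` package; this is an assumed setup, not the paper's implementation.

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Parameter schema derived from an OpenAPI-style description (illustrative).
GET_WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
    "required": ["city", "date"],
    "additionalProperties": False,
}

def check_call(arguments: dict) -> bool:
    """Return True if generated call arguments satisfy the API's schema."""
    try:
        validate(instance=arguments, schema=GET_WEATHER_SCHEMA)
        return True
    except ValidationError:
        return False

print(check_call({"city": "Tokyo", "date": "tomorrow"}))  # True
print(check_call({"city": "Tokyo"}))                      # False: 'date' is missing
```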

Limitations & Future Work

  • Reliance on accurate API specs: Incomplete or ambiguous documentation can lead to malformed trajectories that the loop may not detect.
  • Mock vs. real services: The MCP Server often uses mocked responses; bridging the gap to live production APIs (rate limits, authentication) remains an engineering challenge.
  • Scalability of verification: While the loop is automated, verifying extremely large or stateful workflows could become computationally expensive.
  • Generalization to non‑REST interfaces: The current design targets HTTP/JSON APIs; extending to GraphQL, gRPC, or custom SDKs is future work.
  • Safety & bias checks: Synthetic data may still inherit biases from the base LLM; integrating explicit safety filters into the loop is an open research direction.

Overall, InfTool demonstrates that a self‑sustaining, multi‑agent role‑playing system can close the data gap for tool‑use in LLMs, opening a path toward truly autonomous AI assistants that can be deployed at scale with minimal human overhead.

Authors

  • Yuwen Li
  • Wei Zhang
  • Zelong Huang
  • Mason Yang
  • Jiajun Wu
  • Shawn Guo
  • Huahao Hu
  • Lingyi Sun
  • Jian Yang
  • Mingjie Tang
  • Byran Dai

Paper Information

  • arXiv ID: 2512.23611v1
  • Categories: cs.CL
  • Published: December 29, 2025