[Paper] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Published: December 29, 2025 at 12:12 PM EST
4 min read
Source: arXiv - 2512.23611v1

Overview

A new paper, “Close the Loop: Synthesizing Infinite Tool-Use Data via Multi‑Agent Role‑Playing,” proposes a fully autonomous pipeline—InfTool—that can teach large language models (LLMs) to call external APIs without any human‑written examples. By letting three specialized agents generate, verify, and refine tool‑calling trajectories, the system repeatedly improves itself, turning raw API specs into massive, high‑quality training data.

Key Contributions

  • InfTool framework: A closed‑loop, multi‑agent system that synthesizes unlimited tool‑use examples from only API documentation.
  • Three cooperating agents:
    1. User Simulator – creates realistic user requests.
    2. Tool‑Calling Assistant – decides which API to invoke and how.
    3. MCP Server – executes calls, checks results, and provides feedback.
  • Group Relative Policy Optimization (GRPO): A reinforcement‑learning‑style update that trains the assistant with gated rewards, encouraging it to fill its own capability gaps.
  • Zero‑human annotation: All data are generated, verified, and used for training without any manual labeling.
  • State‑of‑the‑art performance: A 32‑billion‑parameter model jumps from 19.8 % to 70.9 % accuracy on the Berkeley Function‑Calling Leaderboard—outperforming much larger commercial models.
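
To make these pieces concrete, here is a minimal sketch (not taken from the paper) of what one synthesized tool‑use record might look like, assuming JSON‑style function calls and a pass/fail flag from the MCP Server; the class and field names are purely illustrative.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCall:
    """One function call proposed by the Tool-Calling Assistant."""
    name: str                  # API/function name taken from the spec
    arguments: dict[str, Any]  # JSON arguments for the call

@dataclass
class Trajectory:
    """One synthesized tool-use example produced by the role-playing loop."""
    user_request: str          # drafted by the User Simulator
    calls: list[ToolCall]      # may span multiple turns and APIs
    responses: list[Any]       # raw results returned by the MCP Server
    verified: bool             # True only if the MCP check passed

# Illustrative record; only verified trajectories enter the training set.
example = Trajectory(
    user_request="Show me the weather in Tokyo tomorrow",
    calls=[ToolCall("get_weather", {"city": "Tokyo", "date": "tomorrow"})],
    responses=[{"temp_c": 18, "condition": "cloudy"}],
    verified=True,
)
print(example)
```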

Methodology

  1. Input – API specs: The system starts with OpenAPI‑style descriptions (endpoints, parameters, return types). No examples are needed.
  2. Role‑playing loop:
    • The User Simulator drafts a natural‑language request that could plausibly need one of the APIs (e.g., “Show me the weather in Tokyo tomorrow”).
    • The Tool‑Calling Assistant (an LLM) interprets the request, selects the appropriate API, and generates the exact function call (JSON arguments, HTTP method, etc.).
    • The MCP Server (a lightweight execution sandbox) runs the call against a mock or real service, returns the response, and flags any mismatch or error.
  3. Self‑verification & filtering: Only trajectories that pass the MCP check are kept; the rest are fed back as negative examples.
  4. Training via GRPO: The assistant’s policy is updated using a group‑wise relative reward that compares each new trajectory against a baseline set, rewarding novel, correct, and diverse calls while penalizing repeats or failures (a toy reward sketch appears after the paragraph below).
  5. Iterative improvement: The freshly trained assistant now produces higher‑quality requests, which the loop repeats—hence the “closed loop.”
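
The following is a toy, self‑contained sketch of one pass through that loop, with stub agents and a mocked execution sandbox standing in for the LLM‑based components; names such as `UserSimulator` and `MCPServer.execute` are assumptions made for illustration, not the paper's code.

```python
import json

# Toy stand-in for an OpenAPI-style spec: tool name -> expected parameter types.
API_SPECS = {"get_weather": {"city": str, "date": str}}

class UserSimulator:
    def draft_request(self) -> str:
        # The paper uses an LLM here; a canned request stands in for it.
        return "Show me the weather in Tokyo tomorrow"

class ToolCallingAssistant:
    def propose_call(self, request: str) -> dict:
        # The paper trains this policy; here it is a fixed stub.
        return {"name": "get_weather",
                "arguments": {"city": "Tokyo", "date": "tomorrow"}}

class MCPServer:
    def execute(self, call: dict):
        params = API_SPECS.get(call["name"])
        if params is None:
            return None, False                      # unknown tool
        for name, typ in params.items():
            if not isinstance(call["arguments"].get(name), typ):
                return None, False                  # missing or malformed argument
        return {"temp_c": 18, "condition": "cloudy"}, True  # mocked response

def loop_iteration(sim, assistant, server, dataset, negatives):
    request = sim.draft_request()                   # 1. simulate a user request
    call = assistant.propose_call(request)          # 2. generate the tool call
    response, ok = server.execute(call)             # 3. execute and verify
    record = {"request": request, "call": call, "response": response}
    (dataset if ok else negatives).append(record)   # 4. keep, or recycle as a negative

dataset, negatives = [], []
loop_iteration(UserSimulator(), ToolCallingAssistant(), MCPServer(), dataset, negatives)
print(json.dumps(dataset, indent=2))
```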

The whole pipeline runs automatically on commodity GPU clusters, producing millions of verified examples in a few days.
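
The paper's exact GRPO objective is not reproduced in this summary; the sketch below only illustrates the general idea behind step 4, namely a verification‑gated reward and advantages computed relative to the group of trajectories sampled for the same request, using assumed, simplified reward values.

```python
import statistics

def gated_reward(passed_check: bool, quality: float) -> float:
    """Reward is gated on MCP verification: a failed call earns nothing."""
    return quality if passed_check else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each trajectory relative to its sampled group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Three trajectories sampled for the same request (illustrative scores).
rewards = [gated_reward(True, 1.0), gated_reward(True, 0.6), gated_reward(False, 0.9)]
print(group_relative_advantages(rewards))  # above-average calls get positive advantage
```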

Results & Findings

  • BFCL accuracy: 19.8 % (32B baseline) → 70.9 % after InfTool, a +258 % relative gain
  • Training data: 100 % synthetic (zero human-annotated examples)
  • Model size needed for comparable performance: 32 B with InfTool vs. 320 B (Claude‑Opus)

Key observations

  • Diversity matters: The multi‑turn, multi‑API sequences generated by the agents cover edge cases that single‑model synthetic pipelines miss.
  • Self‑targeted learning: GRPO pushes the assistant to explore APIs it currently struggles with, automatically balancing the dataset (one way this could be realized is sketched after this list).
  • No human bottleneck: The entire improvement comes from automatically generated data, eliminating costly annotation cycles.
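
The paper does not spell out the sampling mechanics in this summary's level of detail, but one plausible reading of "fill its own capability gaps" is to practice APIs in inverse proportion to the assistant's current success rate; the snippet below is purely an assumption‑based illustration.

```python
import random

# Per-API success rate measured on recent verified trajectories (illustrative).
success_rate = {"get_weather": 0.92, "book_flight": 0.40, "query_database": 0.15}

def sample_api_to_practice(rates: dict[str, float]) -> str:
    """Weight APIs by failure rate so weak spots get practiced more often."""
    apis = list(rates)
    weights = [1.0 - rates[api] for api in apis]
    return random.choices(apis, weights=weights, k=1)[0]

print(sample_api_to_practice(success_rate))  # 'query_database' is drawn most often
```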

Practical Implications

  • Rapid prototyping of tool‑enabled agents: Developers can feed only API docs into InfTool and obtain a ready‑to‑fine‑tune model that reliably calls those services.
  • Cost‑effective scaling: Companies can bootstrap tool‑use capabilities for internal LLMs without hiring annotators, saving millions in data‑labeling budgets.
  • Continuous improvement pipelines: As new APIs are added, the same loop can auto‑generate fresh training data, keeping the assistant up‑to‑date without manual regression testing.
  • Better sandbox testing: The MCP Server acts like an automated integration test suite, catching mismatches early in the development cycle (a schema‑check sketch follows this list).
  • Open‑source potential: If released as a library, InfTool could become a standard component in LLM‑as‑a‑service platforms (e.g., LangChain, LlamaIndex) for auto‑generating function‑calling datasets.
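
As an illustration of the kind of check such a sandbox can run, the sketch below validates generated call arguments against an OpenAPI‑style parameter schema using the widely available `jsonschema` package; this is an assumed setup, not the paper's implementation.

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Parameter schema derived from an OpenAPI-style description (illustrative).
GET_WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
    "required": ["city", "date"],
    "additionalProperties": False,
}

def check_call(arguments: dict) -> bool:
    """Return True if generated call arguments satisfy the API's schema."""
    try:
        validate(instance=arguments, schema=GET_WEATHER_SCHEMA)
        return True
    except ValidationError:
        return False

print(check_call({"city": "Tokyo", "date": "tomorrow"}))  # True
print(check_call({"city": "Tokyo"}))                      # False: 'date' is missing
```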

Limitations & Future Work

  • Reliance on accurate API specs: Incomplete or ambiguous documentation can lead to malformed trajectories that the loop may not detect.
  • Mock vs. real services: The MCP Server often uses mocked responses; bridging the gap to live production APIs (rate limits, authentication) remains an engineering challenge.
  • Scalability of verification: While the loop is automated, verifying extremely large or stateful workflows could become computationally expensive.
  • Generalization to non‑REST interfaces: The current design targets HTTP/JSON APIs; extending to GraphQL, gRPC, or custom SDKs is future work.
  • Safety & bias checks: Synthetic data may still inherit biases from the base LLM; integrating explicit safety filters into the loop is an open research direction.

Overall, InfTool demonstrates that a self‑sustaining, multi‑agent role‑playing system can close the data gap for tool‑use in LLMs, opening a path toward truly autonomous AI assistants that can be deployed at scale with minimal human overhead.

Authors

  • Yuwen Li
  • Wei Zhang
  • Zelong Huang
  • Mason Yang
  • Jiajun Wu
  • Shawn Guo
  • Huahao Hu
  • Lingyi Sun
  • Jian Yang
  • Mingjie Tang
  • Byran Dai

Paper Information

  • arXiv ID: 2512.23611v1
  • Categories: cs.CL
  • Published: December 29, 2025