[Paper] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Source: arXiv - 2601.21972v1
Overview
The paper introduces CoLLM, a suite of Multi‑Agent Actor‑Critic (MAAC) techniques for training large language models (LLMs) to collaborate without a central controller. By moving from Monte‑Carlo fine‑tuning to actor‑critic learning, the authors show how decentralized LLM teams can be trained more sample‑efficiently, especially on complex, long‑horizon tasks.
Key Contributions
- Two novel MAAC frameworks for LLM collaboration:
  - CoLLM‑CC – a centralized critic that evaluates the joint observations and actions of all agents.
  - CoLLM‑DC – decentralized critics, one per agent, each estimating value for its own agent.
- Theoretical analysis of when centralized vs. decentralized critics provide advantages (e.g., reward sparsity, horizon length).
- Comprehensive empirical study across three domains—creative writing, code generation, and multi‑agent game playing—highlighting trade‑offs between Monte‑Carlo, CoLLM‑CC, and CoLLM‑DC.
- Open‑source implementation (v1.3.2) that integrates with popular LLM toolkits, enabling reproducibility and rapid experimentation.
Methodology
- Problem Setup – A team of LLM agents is modeled as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP); each agent receives its own prompt/context and produces a text output as its action.
- Actor‑Critic Design
  - Actor: Each LLM is fine‑tuned with a policy head that maps its hidden states to token probabilities.
  - Critic:
    - CoLLM‑CC: A single transformer‑based critic receives the concatenated observations and actions of all agents and outputs a joint state‑value estimate.
    - CoLLM‑DC: Each agent has its own lightweight critic that sees only its local observation/action, approximating a local value function.
- Training Loop
  - Run parallel inference episodes (no central scheduler needed).
  - Collect trajectories and compute advantage estimates with Generalized Advantage Estimation (GAE) to reduce variance.
  - Update actors with a PPO‑style clipped surrogate loss; update critics with a mean‑squared error loss on the bootstrapped returns.
- Baselines – Standard Monte‑Carlo policy‑gradient fine‑tuning (no critic) and a fully centralized execution protocol (where a master node orchestrates the agents).
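The advantage estimation step above can be made concrete. Below is a minimal, plain-Python sketch of GAE over a single trajectory (illustrative variable names; the paper's released code is PyTorch-based and batched, so this is a simplification for clarity, not the authors' implementation):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t]   reward received after step t
    values[t]    critic's value estimate for the state at step t
    last_value   bootstrap value for the state after the final step
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Bootstrapped returns serve as regression targets for the critic's MSE loss
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=1` recovers Monte‑Carlo returns (high variance, no bias), while `lam=0` gives one‑step TD estimates (low variance, more bias); intermediate values trade the two off, which is what gives the actor‑critic variants their sample-efficiency edge over pure Monte‑Carlo fine‑tuning.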
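The actor update in the loop above uses the standard PPO clipped surrogate objective. A scalar, per-sample sketch (again illustrative, not the paper's code):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate loss for a single action sample.

    Returns the loss to minimize, i.e. the negated clipped objective.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting the advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    # Pessimistic (smaller) objective, negated so gradient descent improves it
    return -min(unclipped, clipped)
```

The clipping is what keeps the fine-tuned LLM policy from drifting too far from the behavior policy in a single update, which matters when each sample is an expensive LLM rollout.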
Results & Findings
| Domain | Horizon / Reward Density | Monte‑Carlo | CoLLM‑DC | CoLLM‑CC |
|---|---|---|---|---|
| Writing (short story) | Short, dense | Comparable | Comparable | Best |
| Code synthesis (single function) | Medium, dense | Slightly worse | Comparable | Best |
| Turn‑based strategy game | Long, sparse | Needs ~3× more samples | Fails to converge reliably | Clear win |
- Sample Efficiency: Both MAAC variants cut the number of required fine‑tuning steps by 30–50% on dense‑reward tasks.
- Stability: The centralized critic (CoLLM‑CC) consistently yields lower‑variance gradients, leading to smoother training curves on sparse‑reward problems.
- Scalability: CoLLM‑DC scales better with the number of agents (communication overhead stays local), but its performance degrades when the global reward signal is weak or delayed.
Practical Implications
- Parallel Deployments: Teams of LLM‑powered micro‑services (e.g., a “research assistant + code reviewer + documentation writer” pipeline) can be trained offline with CoLLM‑DC and then run completely independently at inference time—no need for a coordinating server.
- Reduced Cloud Costs: Actor‑critic fine‑tuning converges with fewer API calls to expensive LLM endpoints, translating into lower compute bills for enterprises experimenting with multi‑agent workflows.
- Better Long‑Term Planning: For applications like automated game testing, multi‑step troubleshooting, or multi‑turn dialogue agents, CoLLM‑CC offers a practical route to teach LLMs to anticipate future outcomes without hand‑crafted reward shaping.
- Plug‑and‑Play: The released code wraps the critic logic in a lightweight PyTorch module that can be attached to any Hugging Face transformer, making it straightforward for developers to prototype decentralized collaboration in their own stacks.
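To make the "no coordinating server" point concrete: at inference time, decentralized execution just means each agent acts from its own local context, with nothing ordering or mediating their calls. A toy sketch with stub policies (all names here are hypothetical and stand in for fine-tuned LLM calls; this is not the paper's released API):

```python
from concurrent.futures import ThreadPoolExecutor

def make_agent(role):
    """Stub for a fine-tuned LLM policy; a real agent would invoke a model."""
    def act(local_observation):
        # Each agent conditions only on its own prompt/context -- no shared state.
        return f"[{role}] response to: {local_observation}"
    return act

agents = {
    "researcher": make_agent("researcher"),
    "reviewer": make_agent("reviewer"),
    "writer": make_agent("writer"),
}

def run_team(observations):
    """Run every agent in parallel; no central scheduler orders their calls."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(policy, observations[name])
                   for name, policy in agents.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Because the critics exist only during training, nothing in this deployment path depends on them: the trained actors are ordinary standalone models.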
Limitations & Future Work
- Centralized Critic Bottleneck: CoLLM‑CC still requires a global view of all agents during training, which can become a memory bottleneck for very large teams (>10 agents).
- Sparse‑Reward Sensitivity: While CoLLM‑CC outperforms other methods on sparse rewards, it still needs careful reward shaping or curriculum learning to avoid dead‑ends.
- Evaluation Scope: Experiments focus on text‑centric tasks; extending to multimodal agents (e.g., vision‑language) remains an open question.
- Future Directions: The authors suggest hierarchical critics that blend centralized and local information, meta‑learning to adapt critics across domains, and off‑policy actor‑critic variants to further reduce sample requirements.
Authors
- Shuo Liu
- Tianle Chen
- Ryan Amiri
- Christopher Amato
Paper Information
- arXiv ID: 2601.21972v1
- Categories: cs.AI, cs.DC, cs.MA
- Published: January 29, 2026