[Paper] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Source: arXiv - 2601.21972v1
Overview
The paper introduces CoLLM, a suite of Multi‑Agent Actor‑Critic (MAAC) techniques for training large language models (LLMs) to collaborate without a central controller. By moving from Monte‑Carlo fine‑tuning to actor‑critic learning, the authors show how decentralized LLM teams can be trained more sample‑efficiently, especially on complex, long‑horizon tasks.
Key Contributions
- Two novel MAAC frameworks for LLM collaboration:
  - CoLLM‑CC – a centralized critic that evaluates the joint observations and actions of all agents.
  - CoLLM‑DC – decentralized critics, one per agent, each estimating value for its own agent.
- Theoretical analysis of when centralized vs. decentralized critics provide advantages (e.g., reward sparsity, horizon length).
- Comprehensive empirical study across three domains—creative writing, code generation, and multi‑agent game playing—highlighting trade‑offs between Monte‑Carlo, CoLLM‑CC, and CoLLM‑DC.
- Open‑source implementation (v1.3.2) that integrates with popular LLM toolkits, enabling reproducibility and rapid experimentation.
Methodology
- Problem Setup – A team of LLM agents is modeled as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP); each agent receives its own prompt/context and produces a text output as its action.
- Actor‑Critic Design
  - Actor: Each LLM is fine‑tuned with a policy head that maps its hidden states to token probabilities.
  - Critic:
    - CoLLM‑CC: A single transformer‑based critic receives the concatenated observations and actions of all agents and outputs a joint state‑value estimate.
    - CoLLM‑DC: Each agent has its own lightweight critic that sees only its local observation/action, approximating a local value function.
- Training Loop
  - Run parallel inference episodes (no central scheduler needed).
  - Collect trajectories and compute advantage estimates with Generalized Advantage Estimation (GAE) to reduce variance.
  - Update actors with a PPO‑style clipped surrogate loss; update critics with a mean‑squared error loss on the bootstrapped returns.
- Baselines – Standard Monte‑Carlo policy‑gradient fine‑tuning (no critic) and a fully centralized execution protocol (where a master node orchestrates the agents).
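The advantage estimation step above can be made concrete. Below is a minimal, plain-Python sketch of GAE over a single trajectory (illustrative variable names; the paper's released code is PyTorch-based and batched, so this is a simplification for clarity, not the authors' implementation):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t]   reward received after step t
    values[t]    critic's value estimate for the state at step t
    last_value   bootstrap value for the state after the final step
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Bootstrapped returns serve as regression targets for the critic's MSE loss
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=1` recovers Monte‑Carlo returns (high variance, no bias), while `lam=0` gives one‑step TD estimates (low variance, more bias); intermediate values trade the two off, which is what gives the actor‑critic variants their sample-efficiency edge over pure Monte‑Carlo fine‑tuning.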
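The actor update in the loop above uses the standard PPO clipped surrogate objective. A scalar, per-sample sketch (again illustrative, not the paper's code):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate loss for a single action sample.

    Returns the loss to minimize, i.e. the negated clipped objective.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting the advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    # Pessimistic (smaller) objective, negated so gradient descent improves it
    return -min(unclipped, clipped)
```

The clipping is what keeps the fine-tuned LLM policy from drifting too far from the behavior policy in a single update, which matters when each sample is an expensive LLM rollout.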
Results & Findings
| Domain | Horizon / Reward Density | Monte‑Carlo | CoLLM‑DC | CoLLM‑CC |
|---|---|---|---|---|
| Writing (short story) | Short, dense | Comparable | Comparable | Best |
| Code synthesis (single function) | Medium, dense | Slightly worse | Comparable | Best |
| Turn‑based strategy game | Long, sparse | Needs ~3× more samples | Fails to converge reliably | Clear win |
- Sample Efficiency: Both MAAC variants cut the number of required fine‑tuning steps by 30–50% on dense‑reward tasks.
- Stability: The centralized critic (CoLLM‑CC) consistently yields lower‑variance gradients, leading to smoother training curves on sparse‑reward problems.
- Scalability: CoLLM‑DC scales better with the number of agents (communication overhead stays local), but its performance degrades when the global reward signal is weak or delayed.
Practical Implications
- Parallel Deployments: Teams of LLM‑powered micro‑services (e.g., a “research assistant + code reviewer + documentation writer” pipeline) can be trained offline with CoLLM‑DC and then run completely independently at inference time—no need for a coordinating server.
- Reduced Cloud Costs: Actor‑critic fine‑tuning converges with fewer API calls to expensive LLM endpoints, translating into lower compute bills for enterprises experimenting with multi‑agent workflows.
- Better Long‑Term Planning: For applications like automated game testing, multi‑step troubleshooting, or multi‑turn dialogue agents, CoLLM‑CC offers a practical route to teach LLMs to anticipate future outcomes without hand‑crafted reward shaping.
- Plug‑and‑Play: The released code wraps the critic logic in a lightweight PyTorch module that can be attached to any Hugging Face transformer, making it straightforward for developers to prototype decentralized collaboration in their own stacks.
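To make the "no coordinating server" point concrete: at inference time, decentralized execution just means each agent acts from its own local context, with nothing ordering or mediating their calls. A toy sketch with stub policies (all names here are hypothetical and stand in for fine-tuned LLM calls; this is not the paper's released API):

```python
from concurrent.futures import ThreadPoolExecutor

def make_agent(role):
    """Stub for a fine-tuned LLM policy; a real agent would invoke a model."""
    def act(local_observation):
        # Each agent conditions only on its own prompt/context -- no shared state.
        return f"[{role}] response to: {local_observation}"
    return act

agents = {
    "researcher": make_agent("researcher"),
    "reviewer": make_agent("reviewer"),
    "writer": make_agent("writer"),
}

def run_team(observations):
    """Run every agent in parallel; no central scheduler orders their calls."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(policy, observations[name])
                   for name, policy in agents.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Because the critics exist only during training, nothing in this deployment path depends on them: the trained actors are ordinary standalone models.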
Limitations & Future Work
- Centralized Critic Bottleneck: CoLLM‑CC still requires a global view of all agents during training, which can become a memory bottleneck for very large teams (>10 agents).
- Sparse‑Reward Sensitivity: While CoLLM‑CC outperforms other methods on sparse rewards, it still needs careful reward shaping or curriculum learning to avoid dead‑ends.
- Evaluation Scope: Experiments focus on text‑centric tasks; extending to multimodal agents (e.g., vision‑language) remains an open question.
- Future Directions: The authors suggest hierarchical critics that blend centralized and local information, meta‑learning to adapt critics across domains, and off‑policy actor‑critic variants to further reduce sample requirements.
Authors
- Shuo Liu
- Tianle Chen
- Ryan Amiri
- Christopher Amato
Paper Information
- arXiv ID: 2601.21972v1
- Categories: cs.AI, cs.DC, cs.MA
- Published: January 29, 2026