[Paper] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Published: January 29, 2026 at 11:50 AM EST
3 min read
Source: arXiv


Overview

The paper introduces CoLLM, a suite of Multi‑Agent Actor‑Critic (MAAC) techniques for training large language models (LLMs) to collaborate without a central controller. By moving from Monte‑Carlo fine‑tuning to actor‑critic learning, the authors show how decentralized LLM teams can be trained more sample‑efficiently, especially on complex, long‑horizon tasks.

Key Contributions

  • Two novel MAAC frameworks for LLM collaboration:
    • CoLLM‑CC – a centralized critic that evaluates the joint actions of all agents.
    • CoLLM‑DC – decentralized critics, each estimating value for its own agent.
  • Theoretical analysis of when centralized vs. decentralized critics provide advantages (e.g., reward sparsity, horizon length).
  • Comprehensive empirical study across three domains—creative writing, code generation, and multi‑agent game playing—highlighting trade‑offs between Monte‑Carlo, CoLLM‑CC, and CoLLM‑DC.
  • Open‑source implementation (v1.3.2) that integrates with popular LLM toolkits, enabling reproducibility and rapid experimentation.

Methodology

  1. Problem Setup – Model a team of LLM agents as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP). Each agent receives its own prompt/context and produces a text output (action).
  2. Actor‑Critic Design
    • Actor: Each LLM is fine‑tuned with a policy head that maps its hidden states to token probabilities.
    • Critic:
      • CoLLM‑CC: A single transformer‑based critic receives the concatenated observations and actions of all agents and outputs a joint state‑value estimate.
      • CoLLM‑DC: Each agent has its own lightweight critic that only sees its local observation/action, approximating a local value function.
  3. Training Loop
    • Run parallel inference episodes (no central scheduler needed).
    • Collect trajectories, compute advantage estimates using Generalized Advantage Estimation (GAE) to reduce variance.
    • Update actors with PPO‑style clipped surrogate loss; update critics with a mean‑squared error loss on the bootstrapped returns.
  4. Baselines – Standard Monte‑Carlo policy‑gradient fine‑tuning (no critic) and a fully centralized execution protocol (where a master node orchestrates the agents).
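The training loop described above (GAE advantage estimation, a PPO-style clipped actor update, and an MSE critic update) can be sketched as follows. This is an illustrative reimplementation under common defaults, not the authors' released code; the function names and hyperparameters are assumptions:

```python
import torch
import torch.nn.functional as F


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: tensor of shape [T]
    values:  tensor of shape [T+1] (bootstrapped with the final state's value)
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual at step t, then discounted accumulation of residuals.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv


def ppo_actor_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO clipped surrogate loss for one agent's policy update."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # Negate because optimizers minimize; PPO maximizes the clipped objective.
    return -torch.min(unclipped, clipped).mean()


def critic_loss(values_pred, returns):
    """Mean-squared error against bootstrapped returns."""
    return F.mse_loss(values_pred, returns)
```

In a CoLLM-DC setup each agent would apply these updates to its own actor and critic using only local trajectories; in CoLLM-CC the same actor loss is used, but the advantages come from the shared joint critic.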

Results & Findings

| Domain | Horizon / Reward Density | Monte-Carlo | CoLLM-DC | CoLLM-CC |
|---|---|---|---|---|
| Writing (short story) | Short, dense | Comparable | Comparable | Best |
| Code synthesis (single function) | Medium, dense | Slightly worse | Comparable | Best |
| Turn-based strategy game | Long, sparse | Needs ~3× more samples | Fails to converge reliably | Clear win |
  • Sample Efficiency: Both MAAC variants cut the number of required fine‑tuning steps by 30‑50 % on dense‑reward tasks.
  • Stability: The centralized critic (CoLLM‑CC) consistently yields lower variance gradients, leading to smoother training curves on sparse‑reward problems.
  • Scalability: CoLLM‑DC scales better with the number of agents (communication overhead stays local), but its performance degrades when the global reward signal is weak or delayed.
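The scalability trade-off above comes down to what each critic must consume: CoLLM-CC conditions on all agents' features jointly, while a CoLLM-DC critic sees only its own agent. A minimal sketch of the two input shapes (all sizes and module structures here are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

HIDDEN = 64  # illustrative per-agent feature size (observation/action embedding)


class CentralizedCritic(nn.Module):
    """CoLLM-CC style: one critic over the concatenated features of all agents."""

    def __init__(self, n_agents: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * HIDDEN, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, agent_feats: torch.Tensor) -> torch.Tensor:
        # agent_feats: [batch, n_agents, HIDDEN] -> joint value [batch]
        joint = agent_feats.flatten(start_dim=1)
        return self.net(joint).squeeze(-1)


class DecentralizedCritic(nn.Module):
    """CoLLM-DC style: each agent's critic sees only its local features."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HIDDEN, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, local_feat: torch.Tensor) -> torch.Tensor:
        # local_feat: [batch, HIDDEN] -> local value [batch]
        return self.net(local_feat).squeeze(-1)
```

The centralized critic's input grows linearly with the number of agents, which explains the memory bottleneck for large teams; the decentralized critic's input is constant, but it cannot observe how a weak global reward should be credited across agents.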

Practical Implications

  • Parallel Deployments: Teams of LLM‑powered micro‑services (e.g., a “research assistant + code reviewer + documentation writer” pipeline) can be trained offline with CoLLM‑DC and then run completely independently at inference time—no need for a coordinating server.
  • Reduced Cloud Costs: Actor‑critic fine‑tuning converges with fewer API calls to expensive LLM endpoints, translating into lower compute bills for enterprises experimenting with multi‑agent workflows.
  • Better Long‑Term Planning: For applications like automated game testing, multi‑step troubleshooting, or multi‑turn dialogue agents, CoLLM‑CC offers a practical route to teach LLMs to anticipate future outcomes without hand‑crafted reward shaping.
  • Plug‑and‑Play: The released code wraps the critic logic as a lightweight PyTorch module that can be attached to any Hugging Face transformer, making it straightforward for devs to prototype decentralized collaboration in their own stacks.
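A critic head of the kind described in the Plug-and-Play point can be sketched as a small PyTorch module that maps a transformer backbone's final hidden states to a scalar value. This is a minimal sketch of the general pattern, not the released CoLLM code; the class name, pooling choice, and sizes are assumptions:

```python
import torch
import torch.nn as nn


class ValueHead(nn.Module):
    """Lightweight critic head: maps a transformer's final hidden states
    to one scalar value estimate per sequence.

    hidden_size must match the backbone's hidden dimension
    (e.g. 768 for many base-sized Hugging Face transformers).
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size]
        # Mean-pool over the sequence, then project to a scalar per sample.
        pooled = hidden_states.mean(dim=1)
        return self.proj(pooled).squeeze(-1)  # shape: [batch]
```

In practice the head would be fed `outputs.last_hidden_state` (or the equivalent) from the backbone's forward pass and trained with the critic's MSE loss while the backbone serves as the actor.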

Limitations & Future Work

  • Centralized Critic Bottleneck: CoLLM‑CC still requires a global view of all agents during training, which can become a memory bottleneck for very large teams (>10 agents).
  • Sparse‑Reward Sensitivity: While CoLLM‑CC outperforms other methods on sparse rewards, it still needs careful reward shaping or curriculum learning to avoid dead‑ends.
  • Evaluation Scope: Experiments focus on text‑centric tasks; extending to multimodal agents (e.g., vision‑language) remains an open question.
  • Future Directions suggested by the authors include: hierarchical critics that blend centralized and local information, meta‑learning to adapt critics across domains, and exploring off‑policy actor‑critic variants to further reduce sample needs.

Authors

  • Shuo Liu
  • Tianle Chen
  • Ryan Amiri
  • Christopher Amato

Paper Information

  • arXiv ID: 2601.21972v1
  • Categories: cs.AI, cs.DC, cs.MA
  • Published: January 29, 2026