[Paper] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Published: January 14, 2026 at 12:57 PM EST
4 min read
Source: arXiv - 2601.09667v1

Overview

The paper presents Multi‑Agent Test‑Time Reinforcement Learning (MATTRL), a novel framework that lets a team of large language model (LLM) agents improve their reasoning at inference time by pulling in relevant “experience” from previous test‑time interactions. By turning the inference stage into a lightweight, collaborative deliberation process, MATTRL sidesteps the costly and unstable training loops that traditionally plague multi‑agent reinforcement learning (MARL).

Key Contributions

  • Test‑time experience injection: Introduces a mechanism for agents to retrieve and reuse textual snippets from prior dialogue turns, effectively turning inference into a form of on‑the‑fly learning (a toy sketch follows this list).
  • Multi‑expert deliberation: Builds a structured team of specialist agents that discuss, cross‑check, and reach consensus before producing a final answer.
  • Turn‑level credit assignment: Proposes a credit‑assignment scheme that evaluates the usefulness of each retrieved experience, feeding that signal back into the deliberation loop.
  • Robust performance gains: Demonstrates consistent accuracy improvements (≈ 3.7 % over multi‑agent baselines, ≈ 8.7 % over strong single‑agent baselines) across diverse domains such as medicine, mathematics, and education.
  • Stability without extra training: Shows that the approach yields distribution‑shift‑robust reasoning without the need for additional fine‑tuning or expensive MARL training cycles.
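
A minimal sketch of how the first and third contributions could fit together, assuming a turn‑level experience pool with a binary credit signal and a token‑overlap retriever as a stand‑in for semantic search; the class names, fields, and scoring heuristic below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an experience pool with turn-level credit assignment.
# The field names, the Jaccard-overlap retriever, and the binary reward are
# illustrative stand-ins, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class ExperienceEntry:
    query: str           # question that produced this dialogue turn
    turn_text: str       # one agent utterance from that dialogue
    reward: float = 0.0  # turn-level credit toward the final correct answer


class ExperiencePool:
    def __init__(self) -> None:
        self.entries: list[ExperienceEntry] = []

    def log_turn(self, query: str, turn_text: str, contributed: bool) -> None:
        # Turn-level credit assignment, reduced here to a binary signal:
        # did this turn contribute to the final correct answer?
        self.entries.append(
            ExperienceEntry(query, turn_text, reward=1.0 if contributed else 0.0)
        )

    def retrieve(self, new_query: str, k: int = 3) -> list[ExperienceEntry]:
        # Stand-in for semantic similarity search: token-overlap (Jaccard)
        # similarity, weighted by the stored turn-level reward.
        q = set(new_query.lower().split())

        def score(entry: ExperienceEntry) -> float:
            t = set(entry.query.lower().split())
            overlap = len(q & t) / max(len(q | t), 1)
            return overlap * (0.5 + entry.reward)

        return sorted(self.entries, key=score, reverse=True)[:k]
```

Retrieved entries would then be injected into each specialist's prompt before deliberation begins.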

Methodology

  1. Forming the Agent Team – A pool of LLM‑based specialists is assembled, each tuned (or prompted) for a particular sub‑task (e.g., fact‑checking, calculation, domain knowledge).
  2. Experience Pool Construction – During inference, every turn of the multi‑turn dialogue is logged along with a lightweight reward signal derived from turn‑level credit assignment (e.g., how much a turn contributed to the final correct answer).
  3. Retrieval at Test Time – When faced with a new query, the system retrieves the most relevant past turns from the experience pool using semantic similarity search.
  4. Deliberation Loop – The agents ingest the retrieved snippets, discuss the problem in a structured multi‑turn chat, and iteratively refine their reasoning.
  5. Consensus Decision – After a fixed number of deliberation rounds, a voting or weighted‑averaging scheme produces the final answer.

The entire pipeline runs at inference time only, so no extra gradient updates or policy‑gradient training are required.
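
Continuing the illustration, the sketch below wires retrieval, a fixed number of deliberation rounds, and a majority‑vote consensus around generic agent callables, reusing the hypothetical ExperiencePool from the earlier sketch; the Agent interface, prompt layout, and plain majority vote are assumptions, and the paper's discussion protocol and weighting scheme may be richer.

```python
# Hypothetical inference-only pipeline: retrieval -> deliberation -> consensus vote.
# The Agent callable signature and the prompt layout are illustrative assumptions.
from collections import Counter
from typing import Callable

Agent = Callable[[str], str]  # prompt in, answer (possibly with reasoning) out


def deliberate(query: str, agents: list[Agent], pool: ExperiencePool,
               rounds: int = 2) -> str:
    # Step 3: inject relevant past turns retrieved from the experience pool.
    experiences = pool.retrieve(query)
    context = "\n".join(e.turn_text for e in experiences)
    transcript: list[str] = []

    # Step 4: structured multi-turn discussion; each agent sees the transcript so far.
    for _ in range(rounds):
        for i, agent in enumerate(agents):
            prompt = (
                f"Question: {query}\n"
                f"Relevant past experience:\n{context}\n"
                f"Discussion so far:\n" + "\n".join(transcript)
            )
            transcript.append(f"Agent {i}: {agent(prompt)}")

    # Step 5: consensus by simple majority vote over the final round of answers.
    final_round = transcript[-len(agents):]
    votes = Counter(turn.split(":", 1)[1].strip() for turn in final_round)
    return votes.most_common(1)[0][0]
```

No gradient step appears anywhere in this loop, which is the point: all adaptation comes from what gets retrieved and discussed.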

Results & Findings

  • Benchmarks: Tested on three challenging suites—medical question answering, grade‑school math problems, and educational concept explanations.
  • Accuracy Gains: MATTRL lifts average accuracy by 3.67 % over a strong multi‑agent baseline that lacks test‑time experience, and by 8.67 % over the best single‑agent LLM baseline.
  • Ablation Insights:
    • Removing the credit‑assignment step drops performance by ~2 %, confirming its role in surfacing high‑utility experiences.
    • Using a naïve random retrieval instead of similarity‑based retrieval reduces gains to ~1 %, highlighting the importance of relevance matching.
  • Stability: Across multiple random seeds, variance in performance is markedly lower than with traditional MARL training, indicating more predictable inference behavior.

Practical Implications

  • Plug‑and‑play reasoning boost: Developers can wrap existing LLM APIs with MATTRL’s deliberation layer to get immediate accuracy improvements without retraining models (see the sketch after this list).
  • Domain‑specific assistants: In regulated fields like healthcare, the ability to cite and reuse prior vetted reasoning steps can aid compliance and auditability.
  • Cost‑effective scaling: Since the heavy lifting happens at inference, organizations avoid the massive compute budgets typically required for MARL training, making the approach attractive for SaaS products and edge deployments.
  • Robustness to distribution shift: By leveraging a dynamic experience pool, the system can adapt to new question styles or emerging knowledge without explicit model updates.
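
To make the plug‑and‑play point above concrete, a wrapper along these lines could sit in front of an existing LLM endpoint, building on the earlier sketches; call_llm is a placeholder for whatever API client is already in use, make_specialist is a hypothetical helper, and none of this is the paper's reference code.

```python
# Hypothetical plug-and-play wrapper: an existing LLM client becomes one
# specialist in a small MATTRL-style team, with no retraining involved.
# call_llm is a placeholder for whatever API client is already in use.

def call_llm(prompt: str, system: str) -> str:
    raise NotImplementedError("plug in your existing LLM API call here")


def make_specialist(role: str) -> Agent:
    # Each specialist is the same base model behind a different system prompt.
    return lambda prompt: call_llm(prompt, system=f"You are a {role} specialist.")


pool = ExperiencePool()
team = [make_specialist(r) for r in ("fact-checking", "calculation", "domain knowledge")]
# answer = deliberate("Which vitamin deficiency causes scurvy?", team, pool)
```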

Limitations & Future Work

  • Experience pool size: The method relies on a sufficiently rich repository of past dialogues; sparse or domain‑novel queries may suffer if relevant experiences are unavailable.
  • Latency overhead: Multi‑turn deliberation and retrieval introduce extra inference latency, which may be problematic for real‑time applications.
  • Credit assignment heuristics: The current turn‑level reward signals are handcrafted; learning more nuanced credit mechanisms could further boost performance.
  • Scalability to many agents: Managing coordination among a large number of specialist agents could become complex; future work may explore hierarchical or dynamic team formation.

MATTRL opens a promising avenue for turning inference into a collaborative, experience‑driven process, offering developers a practical tool to enhance LLM reasoning without the heavy cost of traditional multi‑agent reinforcement learning.

Authors

  • Zhiyuan Hu
  • Yunhai Hu
  • Juncheng Liu
  • Shuyue Stella Li
  • Yucheng Wang
  • Zhen Xu
  • See‑Kiong Ng
  • Anh Tuan Luu
  • Xinxing Xu
  • Bryan Hooi
  • Cynthia Breazeal
  • Hae Won Park

Paper Information

  • arXiv ID: 2601.09667v1
  • Categories: cs.AI, cs.CL
  • Published: January 14, 2026