[Paper] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Source: arXiv - 2601.09667v1
Overview
The paper presents Multi‑Agent Test‑Time Reinforcement Learning (MATTRL), a novel framework that lets a team of large language model (LLM) agents improve their reasoning at inference time by pulling in relevant “experience” from previous test‑time interactions. By turning the inference stage into a lightweight, collaborative deliberation process, MATTRL sidesteps the costly and unstable training loops that traditionally plague multi‑agent reinforcement learning (MARL).
Key Contributions
- Test‑time experience injection: Introduces a mechanism for agents to retrieve and reuse textual snippets from prior dialogue turns, effectively turning inference into a form of on‑the‑fly learning.
- Multi‑expert deliberation: Builds a structured team of specialist agents that discuss, cross‑check, and reach consensus before producing a final answer.
- Turn‑level credit assignment: Proposes a credit‑assignment scheme that evaluates the usefulness of each retrieved experience, feeding that signal back into the deliberation loop (a toy illustration follows this list).
- Robust performance gains: Demonstrates consistent accuracy improvements (≈ 3.7 % over multi‑agent baselines, ≈ 8.7 % over strong single‑agent baselines) across diverse domains such as medicine, mathematics, and education.
- Stability without extra training: Shows that the approach yields distribution‑shift‑robust reasoning without the need for additional fine‑tuning or expensive MARL training cycles.
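To make the turn‑level credit‑assignment idea concrete, here is a minimal sketch. The paper does not ship a reference implementation, so the scoring heuristic and all names below (`Turn`, `turn_level_credit`, `keep_for_experience_pool`) are illustrative assumptions: each dialogue turn gets a scalar reward reflecting whether it steered the team toward the final, verified answer, and only sufficiently rewarded turns are kept as reusable experience.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    agent: str         # which specialist produced this turn
    text: str          # the turn's content
    answer_after: str  # the team's working answer right after this turn

def turn_level_credit(turns: list[Turn], final_answer: str, correct: bool) -> list[float]:
    """Toy heuristic: a turn earns credit if the working answer it left behind
    matches the team's final answer, with the sign set by whether that final
    answer turned out to be correct. A stand-in for the paper's actual scheme."""
    base = 1.0 if correct else -0.5
    return [
        base if turn.answer_after.strip() == final_answer.strip() else 0.0
        for turn in turns
    ]

def keep_for_experience_pool(turns: list[Turn], rewards: list[float],
                             threshold: float = 0.5) -> list[Turn]:
    """Only turns whose credit clears the threshold are stored as reusable experience."""
    return [t for t, r in zip(turns, rewards) if r >= threshold]
```

In the paper's framing, these stored snippets are what later queries retrieve and inject into the deliberation.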
Methodology
- Forming the Agent Team – A pool of LLM‑based specialists is assembled, each tuned (or prompted) for a particular sub‑task (e.g., fact‑checking, calculation, domain knowledge).
- Experience Pool Construction – During inference, every turn of the multi‑turn dialogue is logged along with a lightweight reward signal derived from turn‑level credit assignment (e.g., how much a turn contributed to the final correct answer).
- Retrieval at Test Time – When faced with a new query, the system retrieves the most relevant past turns from the experience pool using semantic similarity search.
- Deliberation Loop – The agents ingest the retrieved snippets, discuss the problem in a structured multi‑turn chat, and iteratively refine their reasoning.
- Consensus Decision – After a fixed number of deliberation rounds, a voting or weighted‑averaging scheme produces the final answer.
The entire pipeline runs at inference time only, so no extra gradient updates or policy‑gradient training are required.
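Below is a minimal end‑to‑end sketch of this inference‑time pipeline. The paper does not specify an implementation, so the bag‑of‑words similarity, the `call_llm` placeholder, the prompt format, and the majority‑vote consensus are all assumptions made to keep the example self‑contained; they stand in for a real sentence encoder, actual LLM API calls, and the paper's consensus scheme.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExperiencePool:
    """Logged dialogue turns plus their turn-level credit scores."""
    def __init__(self):
        self.entries = []  # list of (embedding, text, reward) tuples

    def add(self, text: str, reward: float):
        self.entries.append((embed(text), text, reward))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Rank stored turns by semantic similarity weighted by their credit."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]) * e[2], reverse=True)
        return [text for _, text, _ in ranked[:k]]

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call (assumed, not from the paper)."""
    raise NotImplementedError

def deliberate(query: str, roles: list[str], pool: ExperiencePool, rounds: int = 2) -> str:
    """Retrieve experience, run a fixed number of deliberation rounds, then vote."""
    transcript = [f"[experience] {e}" for e in pool.retrieve(query)]
    answers = {}
    for _ in range(rounds):
        for role in roles:
            prompt = "\n".join(
                [f"Question: {query}", *transcript,
                 f"As the {role} specialist, give your reasoning, then your answer on the last line."]
            )
            reply = call_llm(role, prompt)
            transcript.append(f"[{role}] {reply}")
            answers[role] = reply.splitlines()[-1]  # crude final-answer extraction
    # Consensus: simple majority vote over the specialists' latest answers.
    return Counter(answers.values()).most_common(1)[0][0]
```

Once the team's answer has been scored, the corresponding turns can be pushed back into the pool via `ExperiencePool.add`, closing the test‑time learning loop without any gradient updates.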
Results & Findings
- Benchmarks: Tested on three challenging suites—medical question answering, grade‑school math problems, and educational concept explanations.
- Accuracy Gains: MATTRL lifts average accuracy by 3.67 % over a strong multi‑agent baseline that lacks test‑time experience, and by 8.67 % over the best single‑agent LLM baseline.
- Ablation Insights:
  - Removing the credit‑assignment step drops performance by ~2 %, confirming its role in surfacing high‑utility experiences.
  - Using naïve random retrieval instead of similarity‑based retrieval reduces gains to ~1 %, highlighting the importance of relevance matching.
- Stability: Across multiple random seeds, performance variance is markedly lower than with traditional MARL training, indicating more predictable inference behavior.
Practical Implications
- Plug‑and‑play reasoning boost: Developers can wrap existing LLM APIs with MATTRL’s deliberation layer to get immediate accuracy improvements without retraining models (see the sketch after this list).
- Domain‑specific assistants: In regulated fields like healthcare, the ability to cite and reuse prior vetted reasoning steps can aid compliance and auditability.
- Cost‑effective scaling: Since the heavy lifting happens at inference, organizations avoid the massive compute budgets typically required for MARL training, making the approach attractive for SaaS products and edge deployments.
- Robustness to distribution shift: By leveraging a dynamic experience pool, the system can adapt to new question styles or emerging knowledge without explicit model updates.
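As an illustration of the plug‑and‑play point above, the deliberation layer can sit in front of an existing chat‑completion client. The class below reuses the `deliberate` and `ExperiencePool` sketches from the Methodology section; the `llm_client` callable and the fallback behavior are hypothetical, not something the paper prescribes.

```python
class MATTRLWrapper:
    """Hypothetical wrapper that routes queries through the deliberation layer,
    falling back to a single direct model call if no agent backend is wired up."""

    def __init__(self, llm_client, roles, pool):
        self.llm_client = llm_client  # any existing prompt -> completion callable
        self.roles = roles            # e.g. ["fact-checker", "calculator", "domain expert"]
        self.pool = pool              # shared test-time ExperiencePool

    def answer(self, query: str) -> str:
        try:
            return deliberate(query, self.roles, self.pool)
        except NotImplementedError:
            # call_llm above is only a placeholder; fall back to the wrapped client.
            return self.llm_client(query)
```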
Limitations & Future Work
- Experience pool size: The method relies on a sufficiently rich repository of past dialogues; sparse or domain‑novel queries may suffer if relevant experiences are unavailable.
- Latency overhead: Multi‑turn deliberation and retrieval introduce extra inference latency, which may be problematic for real‑time applications.
- Credit assignment heuristics: The current turn‑level reward signals are handcrafted; learning more nuanced credit mechanisms could further boost performance.
- Scalability to many agents: Managing coordination among a large number of specialist agents could become complex; future work may explore hierarchical or dynamic team formation.
MATTRL opens a promising avenue for turning inference into a collaborative, experience‑driven process, offering developers a practical tool to enhance LLM reasoning without the heavy cost of traditional multi‑agent reinforcement learning.
Authors
- Zhiyuan Hu
- Yunhai Hu
- Juncheng Liu
- Shuyue Stella Li
- Yucheng Wang
- Zhen Xu
- See‑Kiong Ng
- Anh Tuan Luu
- Xinxing Xu
- Bryan Hooi
- Cynthia Breazeal
- Hae Won Park
Paper Information
- arXiv ID: 2601.09667v1
- Categories: cs.AI, cs.CL
- Published: January 14, 2026