[Paper] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Source: arXiv - 2601.09667v1
Overview
The paper presents Multi‑Agent Test‑Time Reinforcement Learning (MATTRL), a novel framework that lets a team of large language model (LLM) agents improve their reasoning at inference time by pulling in relevant “experience” from previous test‑time interactions. By turning the inference stage into a lightweight, collaborative deliberation process, MATTRL sidesteps the costly and unstable training loops that traditionally plague multi‑agent reinforcement learning (MARL).
Key Contributions
- Test‑time experience injection: Introduces a mechanism for agents to retrieve and reuse textual snippets from prior dialogue turns, effectively turning inference into a form of on‑the‑fly learning.
- Multi‑expert deliberation: Builds a structured team of specialist agents that discuss, cross‑check, and reach consensus before producing a final answer.
- Turn‑level credit assignment: Proposes a credit‑assignment scheme that evaluates the usefulness of each retrieved experience, feeding that signal back into the deliberation loop (a toy illustration follows this list).
- Robust performance gains: Demonstrates consistent accuracy improvements (≈ 3.7 % over multi‑agent baselines, ≈ 8.7 % over strong single‑agent baselines) across diverse domains such as medicine, mathematics, and education.
- Stability without extra training: Shows that the approach yields distribution‑shift‑robust reasoning without the need for additional fine‑tuning or expensive MARL training cycles.
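To make the turn‑level credit‑assignment idea concrete, here is a minimal sketch. The paper does not ship a reference implementation, so the scoring heuristic and all names below (`Turn`, `turn_level_credit`, `keep_for_experience_pool`) are illustrative assumptions: each dialogue turn gets a scalar reward reflecting whether it steered the team toward the final, verified answer, and only sufficiently rewarded turns are kept as reusable experience.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    agent: str         # which specialist produced this turn
    text: str          # the turn's content
    answer_after: str  # the team's working answer right after this turn

def turn_level_credit(turns: list[Turn], final_answer: str, correct: bool) -> list[float]:
    """Toy heuristic: a turn earns credit if the working answer it left behind
    matches the team's final answer, with the sign set by whether that final
    answer turned out to be correct. A stand-in for the paper's actual scheme."""
    base = 1.0 if correct else -0.5
    return [
        base if turn.answer_after.strip() == final_answer.strip() else 0.0
        for turn in turns
    ]

def keep_for_experience_pool(turns: list[Turn], rewards: list[float],
                             threshold: float = 0.5) -> list[Turn]:
    """Only turns whose credit clears the threshold are stored as reusable experience."""
    return [t for t, r in zip(turns, rewards) if r >= threshold]
```

In the paper's framing, these stored snippets are what later queries retrieve and inject into the deliberation.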
Methodology
- Forming the Agent Team – A pool of LLM‑based specialists is assembled, each tuned (or prompted) for a particular sub‑task (e.g., fact‑checking, calculation, domain knowledge).
- Experience Pool Construction – During inference, every turn of the multi‑turn dialogue is logged along with a lightweight reward signal derived from turn‑level credit assignment (e.g., how much a turn contributed to the final correct answer).
- Retrieval at Test Time – When faced with a new query, the system retrieves the most relevant past turns from the experience pool using semantic similarity search.
- Deliberation Loop – The agents ingest the retrieved snippets, discuss the problem in a structured multi‑turn chat, and iteratively refine their reasoning.
- Consensus Decision – After a fixed number of deliberation rounds, a voting or weighted‑averaging scheme produces the final answer.
The entire pipeline runs at inference time only, so no extra gradient updates or policy‑gradient training are required.
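Below is a minimal end‑to‑end sketch of this inference‑time pipeline. The paper does not specify an implementation, so the bag‑of‑words similarity, the `call_llm` placeholder, the prompt format, and the majority‑vote consensus are all assumptions made to keep the example self‑contained; they stand in for a real sentence encoder, actual LLM API calls, and the paper's consensus scheme.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExperiencePool:
    """Logged dialogue turns plus their turn-level credit scores."""
    def __init__(self):
        self.entries = []  # list of (embedding, text, reward) tuples

    def add(self, text: str, reward: float):
        self.entries.append((embed(text), text, reward))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Rank stored turns by semantic similarity weighted by their credit."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]) * e[2], reverse=True)
        return [text for _, text, _ in ranked[:k]]

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call (assumed, not from the paper)."""
    raise NotImplementedError

def deliberate(query: str, roles: list[str], pool: ExperiencePool, rounds: int = 2) -> str:
    """Retrieve experience, run a fixed number of deliberation rounds, then vote."""
    transcript = [f"[experience] {e}" for e in pool.retrieve(query)]
    answers = {}
    for _ in range(rounds):
        for role in roles:
            prompt = "\n".join(
                [f"Question: {query}", *transcript,
                 f"As the {role} specialist, give your reasoning, then your answer on the last line."]
            )
            reply = call_llm(role, prompt)
            transcript.append(f"[{role}] {reply}")
            answers[role] = reply.splitlines()[-1]  # crude final-answer extraction
    # Consensus: simple majority vote over the specialists' latest answers.
    return Counter(answers.values()).most_common(1)[0][0]
```

Once the team's answer has been scored, the corresponding turns can be pushed back into the pool via `ExperiencePool.add`, closing the test‑time learning loop without any gradient updates.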
Results & Findings
- Benchmarks: Tested on three challenging suites—medical question answering, grade‑school math problems, and educational concept explanations.
- Accuracy Gains: MATTRL lifts average accuracy by 3.67 % over a strong multi‑agent baseline that lacks test‑time experience, and by 8.67 % over the best single‑agent LLM baseline.
- Ablation Insights:
  - Removing the credit‑assignment step drops performance by ~2 %, confirming its role in surfacing high‑utility experiences.
  - Using naïve random retrieval instead of similarity‑based retrieval reduces gains to ~1 %, highlighting the importance of relevance matching.
- Stability: Across multiple random seeds, performance variance is markedly lower than with traditional MARL training, indicating more predictable inference behavior.
Practical Implications
- Plug‑and‑play reasoning boost: Developers can wrap existing LLM APIs with MATTRL’s deliberation layer to get immediate accuracy improvements without retraining models (see the sketch after this list).
- Domain‑specific assistants: In regulated fields like healthcare, the ability to cite and reuse prior vetted reasoning steps can aid compliance and auditability.
- Cost‑effective scaling: Since the heavy lifting happens at inference, organizations avoid the massive compute budgets typically required for MARL training, making the approach attractive for SaaS products and edge deployments.
- Robustness to distribution shift: By leveraging a dynamic experience pool, the system can adapt to new question styles or emerging knowledge without explicit model updates.
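As an illustration of the plug‑and‑play point above, the deliberation layer can sit in front of an existing chat‑completion client. The class below reuses the `deliberate` and `ExperiencePool` sketches from the Methodology section; the `llm_client` callable and the fallback behavior are hypothetical, not something the paper prescribes.

```python
class MATTRLWrapper:
    """Hypothetical wrapper that routes queries through the deliberation layer,
    falling back to a single direct model call if no agent backend is wired up."""

    def __init__(self, llm_client, roles, pool):
        self.llm_client = llm_client  # any existing prompt -> completion callable
        self.roles = roles            # e.g. ["fact-checker", "calculator", "domain expert"]
        self.pool = pool              # shared test-time ExperiencePool

    def answer(self, query: str) -> str:
        try:
            return deliberate(query, self.roles, self.pool)
        except NotImplementedError:
            # call_llm above is only a placeholder; fall back to the wrapped client.
            return self.llm_client(query)
```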
Limitations & Future Work
- Experience pool size: The method relies on a sufficiently rich repository of past dialogues; sparse or domain‑novel queries may suffer if relevant experiences are unavailable.
- Latency overhead: Multi‑turn deliberation and retrieval introduce extra inference latency, which may be problematic for real‑time applications.
- Credit assignment heuristics: The current turn‑level reward signals are handcrafted; learning more nuanced credit mechanisms could further boost performance.
- Scalability to many agents: Managing coordination among a large number of specialist agents could become complex; future work may explore hierarchical or dynamic team formation.
MATTRL opens a promising avenue for turning inference into a collaborative, experience‑driven process, offering developers a practical tool to enhance LLM reasoning without the heavy cost of traditional multi‑agent reinforcement learning.
Authors
- Zhiyuan Hu
- Yunhai Hu
- Juncheng Liu
- Shuyue Stella Li
- Yucheng Wang
- Zhen Xu
- See‑Kiong Ng
- Anh Tuan Luu
- Xinxing Xu
- Bryan Hooi
- Cynthia Breazeal
- Hae Won Park
Paper Information
- arXiv ID: 2601.09667v1
- Categories: cs.AI, cs.CL
- Published: January 14, 2026