[Paper] $π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
Source: arXiv - 2604.14054v1
Overview
The paper introduces π‑Play, a self‑play framework in which multiple agents teach each other without any external labeled data. It treats the question‑construction path (the hidden reasoning steps that generate a task) as privileged information: a “teacher” model that sees this path produces dense feedback, which speeds up learning for a “student” model that does not see it. The result is a data‑free system that reportedly outperforms fully supervised search agents and learns 2–3× faster than conventional self‑play.
Key Contributions
- Privileged Self‑Distillation: Identifies the intermediate question‑construction path (QCP) produced during self‑play as a rich source of privileged context for teacher‑student training.
- π‑Play Framework: Designs a multi‑agent loop in which an examiner creates tasks and their QCPs, a teacher consumes the QCPs to generate dense supervision, and a student learns via self‑distillation.
- Data‑Free Training: Demonstrates that the system can be trained from scratch without any human‑annotated data or external reward signals.
- Efficiency Gains: Empirically shows 2–3× faster evolutionary progress and higher final performance compared to conventional sparse‑reward self‑play and even supervised search agents.
- Broad Applicability: Targets both language and agent‑learning settings (the paper is listed under cs.CL and cs.LG, where cs.LG denotes machine learning in general rather than reinforcement learning specifically), suggesting a general recipe for self‑evolving agents.
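The examiner, teacher, and student roles listed above can be pictured with a toy sketch. Everything in it is a hypothetical stand-in chosen for illustration (`Task`, `toy_examiner`, `toy_teacher`, `ToyStudent`, `run_round`, and the addition task itself); none of these names or behaviors come from the paper. The point it tries to show is the information asymmetry: the teacher reads the QCP, while the student's training pairs contain only the query and the teacher's output.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    query: str        # what the student is allowed to see
    qcp: List[str]    # question-construction path: privileged, teacher-only


def toy_examiner(seed: int) -> Task:
    """Hypothetical examiner: builds an addition question and logs how it was built."""
    a, b = seed + 1, 2 * seed + 3
    qcp = [f"choose a={a}", f"choose b={b}", f"answer={a + b}"]
    return Task(query=f"What is {a} + {b}?", qcp=qcp)


def toy_teacher(query: str, qcp: List[str]) -> str:
    """Hypothetical teacher: reads the answer off the privileged path."""
    return qcp[-1].split("=", 1)[1]


class ToyStudent:
    """Hypothetical student: a lookup table standing in for a trainable model."""

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def distill(self, pairs: List[Tuple[str, str]]) -> None:
        # Training pairs hold (query, teacher output) only; no QCP reaches the student.
        self.memory.update(pairs)

    def answer(self, query: str) -> str:
        return self.memory.get(query, "unknown")


def run_round(student: ToyStudent, n_tasks: int) -> List[Tuple[str, str]]:
    """One examiner -> teacher -> student pass over n_tasks generated tasks."""
    pairs: List[Tuple[str, str]] = []
    for seed in range(n_tasks):
        task = toy_examiner(seed)
        target = toy_teacher(task.query, task.qcp)
        pairs.append((task.query, target))  # the QCP is deliberately dropped here
    student.distill(pairs)
    return pairs
```

In a real system the lookup table would be replaced by a gradient update on a model, and the teacher would emit richer outputs than a single answer string; the sketch only fixes who sees what.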
Methodology
- Task Generation (Examiner): The examiner creates a query (e.g., a question) and simultaneously records the question‑construction path (QCP), i.e., the step‑by‑step reasoning that leads to the final query.
- Privileged Context (Teacher): The teacher model receives the raw query plus the QCP. Because the QCP reveals the hidden solution process, the teacher can produce a much richer supervision signal (e.g., token‑level logits, intermediate states).
- Self‑Distillation (Student): The student only sees the query (no QCP). It is trained to mimic the teacher’s outputs using a standard distillation loss, effectively learning the dense knowledge encoded in the QCP.
- Evolution Loop: After the student improves, it replaces the examiner in the next round, generating new tasks and QCPs. This creates a closed loop of self‑evolution where each generation benefits from the privileged information of the previous one.
- No External Data: All signals come from the agents themselves; there is no need for human‑written examples, reward shaping, or curated privileged datasets.
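The self‑distillation step above mentions a standard distillation loss. One common choice is a per‑token KL divergence between the teacher's and the student's next‑token distributions; the sketch below assumes that form, which this summary does not confirm as the paper's exact objective. The function names (`log_softmax`, `token_kl`, `distillation_loss`) and the `temperature` parameter are illustrative assumptions. Teacher logits are taken to come from a forward pass conditioned on query plus QCP, student logits from a pass conditioned on the query alone.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one token position."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def token_kl(teacher_logits: Sequence[float],
             student_logits: Sequence[float],
             temperature: float = 1.0) -> float:
    """KL(teacher || student) at a single token position."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher saw query + QCP
    log_q = log_softmax(student_logits, temperature)  # student saw query only
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq: Sequence[Sequence[float]],
                      student_seq: Sequence[Sequence[float]],
                      temperature: float = 1.0) -> float:
    """Mean per-token KL over a response (assumed loss form, not the paper's)."""
    if len(teacher_seq) != len(student_seq) or not teacher_seq:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(token_kl(t, s, temperature) for t, s in zip(teacher_seq, student_seq))
    return total / len(teacher_seq)
```

The loss is zero when the student already matches the teacher at every position and grows as the two distributions diverge, so every token contributes a training signal rather than only the final outcome.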
Results & Findings
- Performance: π‑Play agents achieve higher success rates on complex information‑seeking tasks than state‑of‑the‑art supervised search agents.
- Learning Speed: The dense feedback from QCPs reduces the number of self‑play iterations needed to reach a given performance level by roughly 2–3×.
- Robustness: The framework remains effective across different model sizes and task domains, indicating that the privileged self‑distillation signal is not tied to a specific architecture.
- Ablation: Removing the QCP (i.e., reverting to sparse outcome rewards) drops performance back to conventional self‑play levels, confirming the central role of privileged context.
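To make the sparse‑versus‑dense contrast in the findings above concrete, the following illustrative sketch counts how many token positions receive feedback under each scheme. The helper names and the convention of using `None` for "no feedback at this position" are assumptions made for the example, not details from the paper.

```python
from typing import List, Optional, Sequence


def sparse_signal(num_tokens: int, success: bool) -> List[Optional[float]]:
    """Outcome reward: a single scalar on the final token, nothing elsewhere."""
    if num_tokens <= 0:
        return []
    signal: List[Optional[float]] = [None] * num_tokens
    signal[-1] = 1.0 if success else 0.0
    return signal


def dense_signal(per_token_divergence: Sequence[float]) -> List[Optional[float]]:
    """Distillation-style feedback: one value per token (negated divergence)."""
    return [-d for d in per_token_divergence]


def feedback_positions(signal: Sequence[Optional[float]]) -> int:
    """Number of token positions that carry any feedback at all."""
    return sum(1 for s in signal if s is not None)
```

For a five‑token response, `sparse_signal` yields feedback at one position while a dense signal covers all five, which is the intuition behind the faster learning the summary reports.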
Practical Implications
- Reduced Data Costs: Companies can bootstrap powerful search or reasoning agents without costly data annotation pipelines.
- Faster Prototyping: Development cycles shrink because agents improve quickly through self‑evolution, enabling rapid iteration on new tasks (e.g., code generation, QA, planning).
- Scalable Multi‑Agent Systems: π‑Play’s multi‑agent loop can be deployed in distributed settings (cloud or edge) where agents continuously generate and refine tasks autonomously.
- Improved Credit Assignment: Dense supervision mitigates the classic sparse‑reward problem, making it easier to integrate such agents into existing reinforcement‑learning pipelines.
- Potential for Continual Learning: Since the examiner continuously creates fresh tasks, the system is a natural fit for lifelong‑learning setups; whether it also avoids catastrophic forgetting is not established here.
Limitations & Future Work
- Quality of QCPs: The approach assumes the examiner can produce meaningful construction paths; noisy or trivial QCPs could degrade teacher supervision.
- Computational Overhead: Generating and storing QCPs adds extra compute and memory compared to plain self‑play, which may be a bottleneck for very large models.
- Domain Transfer: While the paper targets language and agent‑learning settings, applying π‑Play to domains with highly structured or non‑sequential tasks (e.g., robotics) may require additional engineering.
- Future Directions: The authors suggest exploring automated filtering of low‑quality QCPs, scaling to multi‑modal tasks (vision‑language), and integrating human‑in‑the‑loop verification to further boost reliability.
Authors
- Yaocheng Zhang
- Yuanheng Zhu
- Wenyue Chong
- Songjun Tu
- Qichao Zhang
- Jiajun Chai
- Xiaohan Wang
- Wei Lin
- Guojun Yin
- Dongbin Zhao
Paper Information
- arXiv ID: 2604.14054v1
- Categories: cs.LG, cs.CL
- Published: April 15, 2026