[Paper] $π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Published: April 15, 2026 at 12:34 PM EDT

Source: arXiv - 2604.14054v1

Overview

The paper introduces π‑Play, a self‑play framework in which multiple agents train each other without any external labeled data. It treats the question‑construction path (QCP), the intermediate reasoning steps the examiner follows to build a task, as privileged information: a “teacher” model that sees the QCP provides dense feedback to a “student” model that does not. According to the paper, the resulting data‑free system outperforms fully supervised search agents and evolves 2–3× faster than conventional self‑play.

Key Contributions

Privileged Self‑Distillation: Identifies the intermediate question‑construction path (QCP) produced during self‑play as a rich source of privileged context for teacher‑student training.
π‑Play Framework: Designs a multi‑agent loop in which an examiner creates tasks together with their QCPs, a teacher uses the QCPs to generate dense supervision, and a student learns from that supervision via self‑distillation.
Data‑Free Training: Demonstrates that the system can be trained from scratch without any human‑annotated data or external reward signals.
Efficiency Gains: Empirically shows 2–3× faster evolutionary progress and higher final performance compared to conventional sparse‑reward self‑play and even supervised search agents.
Broad Applicability: Reports results across language and reinforcement‑learning settings (the paper is filed under arXiv's cs.CL, Computation and Language, and cs.LG, Machine Learning), suggesting a general recipe for self‑evolving agents.
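
The split between privileged and non‑privileged context can be made concrete with a small data structure. The sketch below is only an illustration: the `Task` class, its field names, the prompt layout, and the example question are all assumptions made here, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """One examiner output. Field names are illustrative, not from the paper."""
    query: str        # the final question, visible to teacher and student
    qcp: List[str]    # question-construction path: steps that produced the query


def teacher_view(task: Task) -> str:
    """Teacher prompt: the query plus the privileged construction path."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(task.qcp, start=1))
    return f"Construction path:\n{steps}\n\nQuestion: {task.query}"


def student_view(task: Task) -> str:
    """Student prompt: the query alone, with no privileged context."""
    return f"Question: {task.query}"


# Invented example task, used only to show the two views.
example = Task(
    query="Which river flows through the capital of the country where X was born?",
    qcp=[
        "Pick entity X",
        "Look up the birth country of X",
        "Find the capital of that country",
        "Find the river through that capital",
    ],
)

if __name__ == "__main__":
    print(teacher_view(example))
    print("---")
    print(student_view(example))
```

Keeping the two views as separate functions makes it hard to leak the QCP into the student's input by accident, which is the property the framework relies on.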

Methodology

Task Generation (Examiner): The examiner agent creates a query (e.g., a question) and simultaneously records the question‑construction path (QCP), the step‑by‑step reasoning that leads to the final query.
Privileged Context (Teacher): The teacher model receives the raw query plus the QCP. Because the QCP reveals the hidden solution process, the teacher can produce a much richer supervision signal (e.g., token‑level logits, intermediate states).
Self‑Distillation (Student): The student only sees the query (no QCP). It is trained to mimic the teacher’s outputs using a standard distillation loss, effectively learning the dense knowledge encoded in the QCP.
Evolution Loop: After the student improves, it replaces the examiner in the next round, generating new tasks and QCPs. This creates a closed loop of self‑evolution where each generation benefits from the privileged information of the previous one.
No External Data: All signals come from the agents themselves; there is no need for human‑written examples, reward shaping, or curated privileged datasets.
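
The self‑distillation step above can be sketched as a token‑level loss. This summary does not state the paper's exact objective, so the code assumes a common choice: the forward KL divergence from the teacher's per‑token distribution (conditioned on query plus QCP) to the student's (conditioned on the query only), averaged over positions. The function names and toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(
    teacher_seq: Sequence[Sequence[float]],
    student_seq: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL over a sequence; assumes aligned token positions."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student sequences must have equal length")
    if not teacher_seq:
        raise ValueError("sequences must be non-empty")
    total = sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))
    return total / len(teacher_seq)


# Toy logits over a 3-token vocabulary for a 2-token answer (invented values).
teacher_logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]   # teacher saw query + QCP
student_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # student saw query only

if __name__ == "__main__":
    print(distillation_loss(teacher_logits, student_logits))
```

Every token position contributes a loss term, in contrast to a single end‑of‑episode reward; that is the sense in which the supervision is dense.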

Results & Findings

Performance: π‑Play agents achieve higher success rates on complex information‑seeking tasks than state‑of‑the‑art supervised search agents.
Learning Speed: The dense feedback from QCPs reduces the number of self‑play iterations needed to reach a given performance level by roughly 2–3×.
Robustness: The framework remains effective across different model sizes and task domains, indicating that the privileged self‑distillation signal is not tied to a specific architecture.
Ablation: Removing the QCP (i.e., reverting to sparse outcome rewards) drops performance back to conventional self‑play levels, confirming the central role of privileged context.
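
One way to read the 2–3× learning‑speed claim is as a ratio of iterations needed to reach a fixed success‑rate threshold. The sketch below shows that calculation under this assumed definition; the helper names are hypothetical and the two curves are made‑up numbers for illustration, not results from the paper.

```python
from typing import Optional, Sequence


def iterations_to_reach(curve: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based iteration at which the curve first meets the threshold."""
    for i, score in enumerate(curve, start=1):
        if score >= threshold:
            return i
    return None  # threshold never reached


def speedup(
    baseline: Sequence[float],
    candidate: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Ratio of baseline iterations to candidate iterations at the same threshold."""
    b = iterations_to_reach(baseline, threshold)
    c = iterations_to_reach(candidate, threshold)
    if b is None or c is None:
        return None
    return b / c


# Made-up success-rate curves for illustration only; NOT data from the paper.
sparse_curve = [0.10, 0.18, 0.25, 0.31, 0.38, 0.44, 0.50, 0.55]
dense_curve = [0.15, 0.33, 0.47, 0.56]

if __name__ == "__main__":
    print(speedup(sparse_curve, dense_curve, threshold=0.45))
```

Returning `None` when either curve never reaches the threshold avoids reporting a speedup for a performance level one of the methods did not attain.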

Practical Implications

Reduced Data Costs: Companies can bootstrap powerful search or reasoning agents without costly data annotation pipelines.
Faster Prototyping: Development cycles shrink because agents improve quickly through self‑evolution, enabling rapid iteration on new tasks (e.g., code generation, QA, planning).
Scalable Multi‑Agent Systems: π‑Play’s multi‑agent loop can be deployed in distributed settings (cloud or edge) where agents continuously generate and refine tasks autonomously.
Improved Credit Assignment: Dense supervision mitigates the classic sparse‑reward problem, making it easier to integrate such agents into existing reinforcement‑learning pipelines.
Potential for Continual Learning: Because the examiner keeps creating fresh tasks, the loop lends itself to lifelong learning, although whether it avoids catastrophic forgetting is not established here.

Limitations & Future Work

Quality of QCPs: The approach assumes the examiner can produce meaningful construction paths; noisy or trivial QCPs could degrade teacher supervision.
Computational Overhead: Generating and storing QCPs adds extra compute and memory compared to plain self‑play, which may be a bottleneck for very large models.
Domain Transfer: While experiments span language and RL, applying π‑Play to domains with highly structured or non‑sequential tasks (e.g., robotics) may require additional engineering.
Future Directions: The authors suggest exploring automated filtering of low‑quality QCPs, scaling to multi‑modal tasks (vision‑language), and integrating human‑in‑the‑loop verification to further boost reliability.

Authors

Yaocheng Zhang
Yuanheng Zhu
Wenyue Chong
Songjun Tu
Qichao Zhang
Jiajun Chai
Xiaohan Wang
Wei Lin
Guojun Yin
Dongbin Zhao

Paper Information

arXiv ID: 2604.14054v1
Categories: cs.LG, cs.CL
Published: April 15, 2026
