[Paper] Agentic Policy Optimization via Instruction-Policy Co-Evolution
Source: arXiv - 2512.01945v1
Overview
The paper presents INSPO (Instruction‑Policy Co‑evolution), a new framework that lets large language model (LLM) agents continuously refine both what they are told to do (the instructions) and how they act (the policy) during reinforcement learning. By treating instructions as a dynamic, learnable component rather than a static prompt, INSPO unlocks more efficient multi‑turn reasoning and tool‑use, delivering noticeable performance gains on retrieval‑augmented and complex reasoning benchmarks.
Key Contributions
- Co‑evolutionary Loop: Introduces a closed‑loop system where instructions and policies are optimized together, allowing each to inform the other’s improvement.
- Instruction Population Management: Maintains a diverse pool of candidate instructions, automatically attributing RL rewards to each and pruning low performers (a minimal data-structure sketch follows this list).
- On‑policy Reflection Optimizer: Leverages an LLM‑based optimizer that analyses replay‑buffer experiences to generate and verify new, higher‑quality instructions.
- Empirical Validation: Shows substantial gains over strong static‑instruction baselines on multi‑turn retrieval and reasoning tasks, with only a modest increase in compute.
- Interpretability Boost: The evolved instructions often reveal novel prompting strategies that guide the agent toward more strategic reasoning paths.
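The population-management contribution above can be illustrated with a small data structure. The sketch below is a minimal, assumed implementation: the class names (`InstructionCandidate`, `InstructionPool`), the running-mean reward statistic, uniform sampling, and the pruning rule are illustrative choices, not the authors' code.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstructionCandidate:
    """One instruction in the pool, plus the episode rewards attributed to it."""
    text: str
    rewards: list = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


class InstructionPool:
    """Candidate-instruction pool with reward attribution and pruning."""

    def __init__(self, seed_instructions):
        self.candidates = [InstructionCandidate(t) for t in seed_instructions]

    def sample(self) -> InstructionCandidate:
        # Uniform sampling; a reward-weighted scheme is a natural alternative.
        return random.choice(self.candidates)

    def attribute_reward(self, candidate: InstructionCandidate, reward: float) -> None:
        # Log the episode reward against the instruction that produced it.
        candidate.rewards.append(reward)

    def prune(self, keep: int) -> None:
        # Remove the lowest-scoring instructions, keeping the top `keep`.
        self.candidates.sort(key=lambda c: c.mean_reward, reverse=True)
        self.candidates = self.candidates[:keep]
```

A pool seeded with a few prompts can then be sampled each episode, credited with the observed reward, and pruned periodically, as the methodology section below spells out.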
Methodology
- Initial Setup – Start with a base LLM and a small set of seed instructions (e.g., “Answer the question step‑by‑step”).
- Population of Instructions – Keep a dynamic pool of instruction candidates. Each episode, an instruction is sampled and paired with the current policy to interact with the environment (e.g., a retrieval‑augmented QA system).
- Reward Attribution – The RL reward from the episode is used to update the policy and is also logged against the sampled instruction, so each instruction accumulates its own performance record.
- Pruning & Generation – Periodically, the lowest‑scoring instructions are removed. A dedicated LLM “reflection” module reviews the replay buffer, identifies failure patterns, and synthesizes new instruction candidates that could better steer the policy.
- Verification – New instructions are briefly tested on a validation set; only those that improve the reward signal are admitted to the pool.
- Policy Update – The policy is updated via standard RL algorithms (e.g., PPO) using the collected trajectories, now conditioned on the evolving instruction set.
The whole process repeats, allowing the instruction set to adapt as the policy becomes more capable, and vice versa. A minimal code sketch of one iteration of this loop follows.
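The sketch reuses the `InstructionPool`/`InstructionCandidate` classes from the Key Contributions section. The environment interaction, PPO update, reflection step, and validation check are passed in as placeholder callables (`run_episode`, `update_policy`, `reflect`, `verify`); their names and signatures are assumptions made for illustration, not the authors' implementation.

```python
def co_evolution_step(pool, policy, run_episode, update_policy, reflect, verify,
                      episodes_per_iter: int = 32, pool_size: int = 8):
    """One INSPO-style iteration: collect, attribute, update, prune, generate, verify.

    The callables are placeholders:
      run_episode(policy, instruction_text) -> (trajectory, reward)
      update_policy(policy, batch)          -> updated policy (e.g., one PPO update)
      reflect(batch)                        -> list of proposed instruction strings
      verify(policy, instruction_text)      -> bool (passes the validation check)
    """
    batch = []

    # 1) Roll out episodes, each conditioned on a sampled instruction.
    for _ in range(episodes_per_iter):
        instruction = pool.sample()
        trajectory, reward = run_episode(policy, instruction.text)
        batch.append((instruction.text, trajectory, reward))
        # 2) Reward attribution: the same reward is logged against the instruction.
        pool.attribute_reward(instruction, reward)

    # 3) Policy update on the collected trajectories (e.g., PPO),
    #    conditioned on the instructions that produced them.
    policy = update_policy(policy, batch)

    # 4) Pruning: drop the weakest instructions.
    pool.prune(keep=pool_size)

    # 5) Generation: an LLM-based reflection step proposes new candidates
    #    from recent experience (e.g., observed failure patterns).
    # 6) Verification: only proposals that pass a validation check are admitted.
    for text in reflect(batch):
        if verify(policy, text):
            pool.candidates.append(InstructionCandidate(text))

    return pool, policy
```

Repeating this step gives the co-evolution described above; everything beyond the pool bookkeeping is deliberately abstracted behind the placeholder callables.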
Results & Findings
| Task | Baseline (static instr.) | INSPO | Relative ↑ |
|---|---|---|---|
| Multi‑turn Retrieval QA | 71.3 % EM | 78.9 % EM | +10.6 % |
| Complex Reasoning (CoT) | 64.5 % Acc | 72.1 % Acc | +11.8 % |
| Tool‑integrated Reasoning | 58.2 % Success | 65.4 % Success | +12.4 % |
- Instruction Diversity: The evolved instruction pool converged on prompts like “First locate the most relevant source, then verify each claim before answering,” which were not present in the seed set.
- Compute Overhead: Adding the instruction‑generation step increased wall‑clock time by ~15 % compared to a static‑instruction RL loop, a modest overhead relative to the reported performance gains.
- Stability: The co‑evolution process remained stable across random seeds, with variance in final scores dropping by ~30 % relative to static baselines.
Practical Implications
- Better Prompt Engineering: Developers can offload the tedious trial‑and‑error of hand‑crafting prompts to an automated co‑evolution loop, saving time and uncovering non‑obvious prompting strategies.
- Adaptive Agents: In production systems where the environment evolves (e.g., changing APIs or knowledge bases), INSPO can continuously adapt both its policy and its “operating manual,” keeping performance robust without manual re‑prompting.
- Tool‑Use Integration: For agents that need to call external services (search engines, calculators, code interpreters), dynamically refined instructions can guide more efficient tool selection and sequencing, reducing API costs and latency.
- Transferability: The instruction pool learned on one task can be used to seed related tasks, providing a warm start that accelerates learning in new domains (an illustrative serialization sketch follows this list).
- Debugging Aid: The evolved instructions serve as interpretable artifacts that explain why an agent chose a particular reasoning path, aiding compliance and safety audits.
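As a purely illustrative reading of the transferability point above, the evolved pool could be serialized after training on a source task and used to seed the pool for a related task. The JSON format and helper names below are assumptions layered on the earlier `InstructionPool` sketch, not anything specified in the paper.

```python
import json


def save_pool(pool, path: str) -> None:
    # Persist each evolved instruction with its mean attributed reward.
    with open(path, "w") as f:
        json.dump(
            [{"text": c.text, "mean_reward": c.mean_reward} for c in pool.candidates],
            f, indent=2,
        )


def warm_start_pool(path: str, top_k: int = 5):
    # Seed a new task's pool with the top-k instructions evolved elsewhere.
    with open(path) as f:
        saved = json.load(f)
    saved.sort(key=lambda d: d["mean_reward"], reverse=True)
    return InstructionPool([d["text"] for d in saved[:top_k]])
```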
Limitations & Future Work
- LLM Dependency: The reflection optimizer itself is an LLM, so the quality of generated instructions hinges on the underlying model’s capabilities and may inherit its biases.
- Scalability to Very Large Pools: While a modest instruction population works well, scaling to hundreds of candidates could increase memory and compute demands, requiring smarter sampling strategies.
- Domain Specificity: Experiments focused on retrieval and reasoning; applying INSPO to domains like robotics or dialogue systems may need task‑specific reward shaping.
- Future Directions: The authors suggest exploring meta‑learning approaches to transfer instruction evolution across tasks, integrating human‑in‑the‑loop verification for safety‑critical applications, and reducing reliance on LLM‑based optimizers by using lighter‑weight models for instruction generation.
Authors
- Han Zhou
- Xingchen Wan
- Ivan Vulić
- Anna Korhonen
Paper Information
- arXiv ID: 2512.01945v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: December 1, 2025