[Paper] Agentic Policy Optimization via Instruction-Policy Co-Evolution
Source: arXiv - 2512.01945v1
Overview
The paper presents INSPO (Instruction‑Policy Co‑evolution), a new framework that lets large language model (LLM) agents continuously refine both what they are told to do (the instructions) and how they act (the policy) during reinforcement learning. By treating instructions as a dynamic, learnable component rather than a static prompt, INSPO unlocks more efficient multi‑turn reasoning and tool‑use, delivering noticeable performance gains on retrieval‑augmented and complex reasoning benchmarks.
Key Contributions
- Co‑evolutionary Loop: Introduces a closed‑loop system where instructions and policies are optimized together, allowing each to inform the other’s improvement.
- Instruction Population Management: Maintains a diverse pool of candidate instructions, automatically attributing RL rewards to each and pruning low performers (a minimal data-structure sketch follows this list).
- On‑policy Reflection Optimizer: Leverages an LLM‑based optimizer that analyses replay‑buffer experiences to generate and verify new, higher‑quality instructions.
- Empirical Validation: Shows substantial gains over strong static‑instruction baselines on multi‑turn retrieval and reasoning tasks, with only a modest increase in compute.
- Interpretability Boost: The evolved instructions often reveal novel prompting strategies that guide the agent toward more strategic reasoning paths.
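The population-management contribution above can be illustrated with a small data structure. The sketch below is a minimal, assumed implementation: the class names (`InstructionCandidate`, `InstructionPool`), the running-mean reward statistic, uniform sampling, and the pruning rule are illustrative choices, not the authors' code.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstructionCandidate:
    """One instruction in the pool, plus the episode rewards attributed to it."""
    text: str
    rewards: list = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


class InstructionPool:
    """Candidate-instruction pool with reward attribution and pruning."""

    def __init__(self, seed_instructions):
        self.candidates = [InstructionCandidate(t) for t in seed_instructions]

    def sample(self) -> InstructionCandidate:
        # Uniform sampling; a reward-weighted scheme is a natural alternative.
        return random.choice(self.candidates)

    def attribute_reward(self, candidate: InstructionCandidate, reward: float) -> None:
        # Log the episode reward against the instruction that produced it.
        candidate.rewards.append(reward)

    def prune(self, keep: int) -> None:
        # Remove the lowest-scoring instructions, keeping the top `keep`.
        self.candidates.sort(key=lambda c: c.mean_reward, reverse=True)
        self.candidates = self.candidates[:keep]
```

A pool seeded with a few prompts can then be sampled each episode, credited with the observed reward, and pruned periodically, as the methodology section below spells out.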
Methodology
- Initial Setup – Start with a base LLM and a small set of seed instructions (e.g., “Answer the question step‑by‑step”).
- Population of Instructions – Keep a dynamic pool of instruction candidates. Each episode, an instruction is sampled and paired with the current policy to interact with the environment (e.g., a retrieval‑augmented QA system).
- Reward Attribution – The RL reward from the episode is used to update the policy and is also logged against the sampled instruction, so each instruction accumulates its own performance record.
- Pruning & Generation – Periodically, the lowest‑scoring instructions are removed. A dedicated LLM “reflection” module reviews the replay buffer, identifies failure patterns, and synthesizes new instruction candidates that could better steer the policy.
- Verification – New instructions are briefly tested on a validation set; only those that improve the reward signal are admitted to the pool.
- Policy Update – The policy is updated via standard RL algorithms (e.g., PPO) using the collected trajectories, now conditioned on the evolving instruction set.
The whole process repeats, allowing the instruction set to adapt as the policy becomes more capable, and vice versa. A minimal code sketch of one iteration of this loop follows.
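The sketch reuses the `InstructionPool`/`InstructionCandidate` classes from the Key Contributions section. The environment interaction, PPO update, reflection step, and validation check are passed in as placeholder callables (`run_episode`, `update_policy`, `reflect`, `verify`); their names and signatures are assumptions made for illustration, not the authors' implementation.

```python
def co_evolution_step(pool, policy, run_episode, update_policy, reflect, verify,
                      episodes_per_iter: int = 32, pool_size: int = 8):
    """One INSPO-style iteration: collect, attribute, update, prune, generate, verify.

    The callables are placeholders:
      run_episode(policy, instruction_text) -> (trajectory, reward)
      update_policy(policy, batch)          -> updated policy (e.g., one PPO update)
      reflect(batch)                        -> list of proposed instruction strings
      verify(policy, instruction_text)      -> bool (passes the validation check)
    """
    batch = []

    # 1) Roll out episodes, each conditioned on a sampled instruction.
    for _ in range(episodes_per_iter):
        instruction = pool.sample()
        trajectory, reward = run_episode(policy, instruction.text)
        batch.append((instruction.text, trajectory, reward))
        # 2) Reward attribution: the same reward is logged against the instruction.
        pool.attribute_reward(instruction, reward)

    # 3) Policy update on the collected trajectories (e.g., PPO),
    #    conditioned on the instructions that produced them.
    policy = update_policy(policy, batch)

    # 4) Pruning: drop the weakest instructions.
    pool.prune(keep=pool_size)

    # 5) Generation: an LLM-based reflection step proposes new candidates
    #    from recent experience (e.g., observed failure patterns).
    # 6) Verification: only proposals that pass a validation check are admitted.
    for text in reflect(batch):
        if verify(policy, text):
            pool.candidates.append(InstructionCandidate(text))

    return pool, policy
```

Repeating this step gives the co-evolution described above; everything beyond the pool bookkeeping is deliberately abstracted behind the placeholder callables.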
Results & Findings
| Task | Baseline (static instr.) | INSPO | Relative ↑ |
|---|---|---|---|
| Multi‑turn Retrieval QA | 71.3 % EM | 78.9 % EM | +10.6 % |
| Complex Reasoning (CoT) | 64.5 % Acc | 72.1 % Acc | +11.8 % |
| Tool‑integrated Reasoning | 58.2 % Success | 65.4 % Success | +12.4 % |
- Instruction Diversity: The evolved instruction pool converged on prompts like “First locate the most relevant source, then verify each claim before answering,” which were not present in the seed set.
- Compute Overhead: Adding the instruction‑generation step increased wall‑clock time by ~15 % compared to a static‑instruction RL loop, a modest overhead relative to the reported performance gains.
- Stability: The co‑evolution process remained stable across random seeds, with variance in final scores dropping by ~30 % relative to static baselines.
Practical Implications
- Better Prompt Engineering: Developers can offload the tedious trial‑and‑error of hand‑crafting prompts to an automated co‑evolution loop, saving time and uncovering non‑obvious prompting strategies.
- Adaptive Agents: In production systems where the environment evolves (e.g., changing APIs or knowledge bases), INSPO can continuously adapt both its policy and its “operating manual,” keeping performance robust without manual re‑prompting.
- Tool‑Use Integration: For agents that need to call external services (search engines, calculators, code interpreters), dynamically refined instructions can guide more efficient tool selection and sequencing, reducing API costs and latency.
- Transferability: The instruction pool learned on one task can be used to seed related tasks, providing a warm start that accelerates learning in new domains (an illustrative serialization sketch follows this list).
- Debugging Aid: The evolved instructions serve as interpretable artifacts that explain why an agent chose a particular reasoning path, aiding compliance and safety audits.
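As a purely illustrative reading of the transferability point above, the evolved pool could be serialized after training on a source task and used to seed the pool for a related task. The JSON format and helper names below are assumptions layered on the earlier `InstructionPool` sketch, not anything specified in the paper.

```python
import json


def save_pool(pool, path: str) -> None:
    # Persist each evolved instruction with its mean attributed reward.
    with open(path, "w") as f:
        json.dump(
            [{"text": c.text, "mean_reward": c.mean_reward} for c in pool.candidates],
            f, indent=2,
        )


def warm_start_pool(path: str, top_k: int = 5):
    # Seed a new task's pool with the top-k instructions evolved elsewhere.
    with open(path) as f:
        saved = json.load(f)
    saved.sort(key=lambda d: d["mean_reward"], reverse=True)
    return InstructionPool([d["text"] for d in saved[:top_k]])
```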
Limitations & Future Work
- LLM Dependency: The reflection optimizer itself is an LLM, so the quality of generated instructions hinges on the underlying model’s capabilities and may inherit its biases.
- Scalability to Very Large Pools: While a modest instruction population works well, scaling to hundreds of candidates could increase memory and compute demands, requiring smarter sampling strategies.
- Domain Specificity: Experiments focused on retrieval and reasoning; applying INSPO to domains like robotics or dialogue systems may need task‑specific reward shaping.
- Future Directions: The authors suggest exploring meta‑learning approaches to transfer instruction evolution across tasks, integrating human‑in‑the‑loop verification for safety‑critical applications, and reducing reliance on LLM‑based optimizers by using lighter‑weight models for instruction generation.
Authors
- Han Zhou
- Xingchen Wan
- Ivan Vulić
- Anna Korhonen
Paper Information
- arXiv ID: 2512.01945v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: December 1, 2025