[Paper] Internalizing Agency from Reflective Experience
Source: arXiv - 2603.16843v1
Overview
Large language models (LLMs) are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes in complex environments (e.g., coding assistants, game bots). The paper introduces LEAFE, a learning framework that lets agents reflect on the rich feedback they receive during interaction and turn it into concrete recovery strategies, rather than chasing only a final success signal.
Key Contributions
- Feedback‑grounded agency: Proposes a method for agents to internalize environment feedback (error messages, partial scores, hints) and use it to improve recovery behavior.
- Reflective experience loop: During exploration, the agent summarizes feedback, backtracks to earlier decision points, and re‑explores alternative actions guided by the summary.
- Supervised fine‑tuning from reflections: The corrected trajectories are distilled into the LLM via supervised fine‑tuning, enabling the model to recover without extra search at inference time.
- Empirical gains on long‑horizon tasks: Across interactive coding benchmarks and other agentic tasks, LEAFE raises Pass@1 and Pass@k (up to Pass@128) by as much as 14 percentage points over strong outcome‑driven baselines such as GRPO and Early Experience.
- Budget‑aware improvement: Demonstrates consistent benefits under fixed interaction budgets, showing that smarter recovery can outweigh simply taking more steps.
Methodology
- Exploration Phase – The agent interacts with the environment (e.g., writes code, executes it) and collects rich feedback (error traces, test failures, partial scores).
- Reflection Phase – A lightweight summarizer compresses this feedback into a short “experience note” that highlights what went wrong and what could be tried next.
- Backtrack & Re‑explore – The agent rewinds to a prior decision point (e.g., the last line of code) and, using the experience note, generates alternative actions. This creates a corrected trajectory that successfully resolves the earlier failure.
- Distillation – All corrected trajectories are gathered into a dataset. The base LLM is then fine‑tuned with standard supervised learning (input = original state, target = corrected action) so the model learns to anticipate and fix mistakes on its own.
- Inference – The fine‑tuned model can now recover from errors without an explicit backtrack loop, staying within the same interaction budget.
The pipeline is deliberately simple: it reuses capabilities the LLM already has (summarization, generation) together with standard fine‑tuning tooling, making it easy to plug into existing agent stacks.
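The five phases above can be sketched end to end on a toy problem. Everything here is an illustrative assumption rather than the authors' code: the environment, the `reflect` summarizer, and the `backtrack_and_reexplore` helper are hypothetical stand‑ins for the paper's components.

```python
# Hypothetical sketch of a LEAFE-style explore/reflect/backtrack/distill loop.
# All names (ToyEnv, reflect, backtrack_and_reexplore) are illustrative; the
# paper's actual implementation may differ substantially.
from dataclasses import dataclass


@dataclass
class Step:
    state: str      # observation the agent saw
    action: str     # action it chose
    feedback: str   # rich environment feedback (error trace, score, hint)
    ok: bool        # did the step succeed?


class ToyEnv:
    """Tiny stand-in environment where the correct action per state is known."""
    SOLUTION = {"s0": "a", "s1": "b"}

    def run(self, state: str, action: str) -> Step:
        if self.SOLUTION[state] == action:
            return Step(state, action, "ok", True)
        return Step(state, action, f"error: '{action}' failed at {state}", False)


def reflect(step: Step) -> str:
    """Reflection phase: compress feedback into a short 'experience note'."""
    return f"At {step.state}, avoid '{step.action}' ({step.feedback})."


def explore(env: ToyEnv, policy: dict) -> list:
    """Exploration phase: roll out the current policy and keep all feedback."""
    return [env.run(s, policy.get(s, "a")) for s in ("s0", "s1")]


def backtrack_and_reexplore(env, traj, note, candidates=("a", "b", "c")):
    """Rewind to the failed decision point and retry actions the note allows."""
    failed = next(s for s in traj if not s.ok)
    for alt in candidates:
        if f"avoid '{alt}'" in note:      # the experience note rules this out
            continue
        retry = env.run(failed.state, alt)
        if retry.ok:
            return retry                  # corrected step for distillation
    return None


# Distillation phase: corrected steps become (input state, target action)
# pairs for standard supervised fine-tuning of the base model.
env = ToyEnv()
traj = explore(env, policy={"s0": "a", "s1": "c"})  # second step fails
sft_dataset = []
for step in traj:
    if step.ok:
        continue
    corrected = backtrack_and_reexplore(env, traj, reflect(step))
    if corrected:
        sft_dataset.append({"input": corrected.state, "target": corrected.action})

print(sft_dataset)  # -> [{'input': 's1', 'target': 'b'}]
```

After fine-tuning on such pairs, the model would ideally produce the corrected action directly at inference time, with no explicit backtrack loop, which is what keeps LEAFE inside a fixed interaction budget.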
Results & Findings
| Benchmark | Metric | Baseline (GRPO) | LEAFE | Δ |
|---|---|---|---|---|
| Interactive coding (Pass@1) | Success rate | 42 % | 48 % | +6 pp |
| Interactive coding (Pass@128) | Success rate | 68 % | 82 % | +14 pp |
| Agentic navigation tasks | Completion score | 0.71 | 0.78 | +0.07 |
- Higher Pass@k: LEAFE consistently outperforms outcome‑driven methods across k values, indicating better diversity and robustness of solutions.
- Better sample efficiency: With the same number of interaction steps, LEAFE achieves higher success, confirming that reflective recovery is more budget‑friendly than simply taking more actions.
- Generalization: The same framework works for both code generation (where feedback is compile/runtime errors) and navigation‑style tasks (where feedback is distance or collision signals), suggesting broad applicability.
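The paper does not reproduce its evaluation code here, but Pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal implementation, assuming that convention:

```python
# Unbiased Pass@k estimator (Chen et al., 2021): given n samples per problem
# of which c pass, estimate the probability that at least one of k randomly
# drawn samples passes.
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a passing sample is guaranteed
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# With 200 samples and 84 passing, Pass@1 is just the pass fraction:
print(pass_at_k(200, 84, 1))  # -> 0.42
```

Because the estimator grows toward 1 as k increases, comparing methods at large k (such as Pass@128 above) stresses solution diversity rather than single-shot accuracy.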
Practical Implications
- Developer tools: Coding assistants can now suggest fixes on the fly, turning compile errors into actionable suggestions without needing a separate “debug” loop.
- Autonomous bots: Game AI, robotics, or web‑automation agents can use error messages or partial rewards to self‑correct, reducing the need for hand‑crafted reward shaping.
- Cost savings: Since LEAFE improves performance under a fixed interaction budget, services that charge per API call (e.g., OpenAI, Anthropic) can deliver higher quality results for the same cost.
- Simplified pipelines: Teams can adopt LEAFE by adding a reflection‑summarization step and a periodic fine‑tuning job—no reinforcement‑learning infrastructure is required.
- Safety & reliability: By explicitly training on failure cases, agents become less likely to repeat catastrophic mistakes, a step toward more trustworthy LLM‑driven automation.
Limitations & Future Work
- Reflection quality depends on the summarizer: Poorly summarized feedback can misguide backtracking, limiting gains.
- Backtrack depth is heuristic: Deciding how far to rewind is currently a rule‑based choice; learning an optimal backtrack policy could improve results.
- Scalability to massive state spaces: The current experiments focus on tasks with relatively compact histories; extending to long‑running simulations may require more efficient memory handling.
- Future directions include automated curriculum generation for reflective experiences, integrating learned backtrack policies, and testing LEAFE on real‑world robotics or multi‑agent coordination scenarios.
Authors
- Rui Ge
- Yichao Fu
- Yuyang Qian
- Junda Su
- Yiming Zhao
- Peng Zhao
- Hao Zhang
Paper Information
- arXiv ID: 2603.16843v1
- Categories: cs.AI
- Published: March 17, 2026