[Paper] Internalizing Agency from Reflective Experience
Source: arXiv - 2603.16843v1
Overview
Large language models (LLMs) are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes in complex environments (e.g., coding assistants, game bots). The paper introduces LEAFE, a learning framework that lets agents reflect on the rich feedback they receive during interaction and turn it into concrete recovery strategies, rather than chasing only a final success signal.
Key Contributions
- Feedback‑grounded agency: Proposes a method for agents to internalize environment feedback (error messages, partial scores, hints) and use it to improve recovery behavior.
- Reflective experience loop: During exploration, the agent summarizes feedback, backtracks to earlier decision points, and re‑explores alternative actions guided by the summary.
- Supervised fine‑tuning from reflections: The corrected trajectories are distilled into the LLM via supervised fine‑tuning, enabling the model to recover without extra search at inference time.
- Empirical gains on long‑horizon tasks: Across interactive coding benchmarks and other agentic tasks, LEAFE raises Pass@1 and Pass@k (up to Pass@128) by as much as 14 percentage points over strong outcome‑driven baselines such as GRPO and Early Experience.
- Budget‑aware improvement: Demonstrates consistent benefits under fixed interaction budgets, showing that smarter recovery can outweigh simply taking more steps.
Methodology
- Exploration Phase – The agent interacts with the environment (e.g., writes code, executes it) and collects rich feedback (error traces, test failures, partial scores).
- Reflection Phase – A lightweight summarizer compresses this feedback into a short “experience note” that highlights what went wrong and what could be tried next.
- Backtrack & Re‑explore – The agent rewinds to a prior decision point (e.g., the last line of code) and, using the experience note, generates alternative actions. This creates a corrected trajectory that successfully resolves the earlier failure.
- Distillation – All corrected trajectories are gathered into a dataset. The base LLM is then fine‑tuned with standard supervised learning (input = original state, target = corrected action) so the model learns to anticipate and fix mistakes on its own.
- Inference – The fine‑tuned model can now recover from errors without an explicit backtrack loop, staying within the same interaction budget.
The pipeline is deliberately simple: it reuses capabilities the LLM already has (summarization, generation) together with standard fine‑tuning tooling, making it easy to plug into existing agent stacks.
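The five phases above can be sketched end to end on a toy problem. Everything here is an illustrative assumption rather than the authors' code: the environment, the `reflect` summarizer, and the `backtrack_and_reexplore` helper are hypothetical stand‑ins for the paper's components.

```python
# Hypothetical sketch of a LEAFE-style explore/reflect/backtrack/distill loop.
# All names (ToyEnv, reflect, backtrack_and_reexplore) are illustrative; the
# paper's actual implementation may differ substantially.
from dataclasses import dataclass


@dataclass
class Step:
    state: str      # observation the agent saw
    action: str     # action it chose
    feedback: str   # rich environment feedback (error trace, score, hint)
    ok: bool        # did the step succeed?


class ToyEnv:
    """Tiny stand-in environment where the correct action per state is known."""
    SOLUTION = {"s0": "a", "s1": "b"}

    def run(self, state: str, action: str) -> Step:
        if self.SOLUTION[state] == action:
            return Step(state, action, "ok", True)
        return Step(state, action, f"error: '{action}' failed at {state}", False)


def reflect(step: Step) -> str:
    """Reflection phase: compress feedback into a short 'experience note'."""
    return f"At {step.state}, avoid '{step.action}' ({step.feedback})."


def explore(env: ToyEnv, policy: dict) -> list:
    """Exploration phase: roll out the current policy and keep all feedback."""
    return [env.run(s, policy.get(s, "a")) for s in ("s0", "s1")]


def backtrack_and_reexplore(env, traj, note, candidates=("a", "b", "c")):
    """Rewind to the failed decision point and retry actions the note allows."""
    failed = next(s for s in traj if not s.ok)
    for alt in candidates:
        if f"avoid '{alt}'" in note:      # the experience note rules this out
            continue
        retry = env.run(failed.state, alt)
        if retry.ok:
            return retry                  # corrected step for distillation
    return None


# Distillation phase: corrected steps become (input state, target action)
# pairs for standard supervised fine-tuning of the base model.
env = ToyEnv()
traj = explore(env, policy={"s0": "a", "s1": "c"})  # second step fails
sft_dataset = []
for step in traj:
    if step.ok:
        continue
    corrected = backtrack_and_reexplore(env, traj, reflect(step))
    if corrected:
        sft_dataset.append({"input": corrected.state, "target": corrected.action})

print(sft_dataset)  # -> [{'input': 's1', 'target': 'b'}]
```

After fine-tuning on such pairs, the model would ideally produce the corrected action directly at inference time, with no explicit backtrack loop, which is what keeps LEAFE inside a fixed interaction budget.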
Results & Findings
| Benchmark | Metric | Baseline (GRPO) | LEAFE | Δ |
|---|---|---|---|---|
| Interactive coding (Pass@1) | Success rate | 42 % | 48 % | +6 pp |
| Interactive coding (Pass@128) | Success rate | 68 % | 82 % | +14 pp |
| Agentic navigation tasks | Completion score | 0.71 | 0.78 | +0.07 |
- Higher Pass@k: LEAFE consistently outperforms outcome‑driven methods across k values, indicating better diversity and robustness of solutions.
- Better sample efficiency: With the same number of interaction steps, LEAFE achieves higher success, confirming that reflective recovery is more budget‑friendly than simply taking more actions.
- Generalization: The same framework works for both code generation (where feedback is compile/runtime errors) and navigation‑style tasks (where feedback is distance or collision signals), suggesting broad applicability.
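The paper does not reproduce its evaluation code here, but Pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal implementation, assuming that convention:

```python
# Unbiased Pass@k estimator (Chen et al., 2021): given n samples per problem
# of which c pass, estimate the probability that at least one of k randomly
# drawn samples passes.
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a passing sample is guaranteed
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# With 200 samples and 84 passing, Pass@1 is just the pass fraction:
print(pass_at_k(200, 84, 1))  # -> 0.42
```

Because the estimator grows toward 1 as k increases, comparing methods at large k (such as Pass@128 above) stresses solution diversity rather than single-shot accuracy.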
Practical Implications
- Developer tools: Coding assistants can now suggest fixes on the fly, turning compile errors into actionable suggestions without needing a separate “debug” loop.
- Autonomous bots: Game AI, robotics, or web‑automation agents can use error messages or partial rewards to self‑correct, reducing the need for hand‑crafted reward shaping.
- Cost savings: Since LEAFE improves performance under a fixed interaction budget, services that charge per API call (e.g., OpenAI, Anthropic) can deliver higher quality results for the same cost.
- Simplified pipelines: Teams can adopt LEAFE by adding a reflection‑summarization step and a periodic fine‑tuning job—no reinforcement‑learning infrastructure is required.
- Safety & reliability: By explicitly training on failure cases, agents become less likely to repeat catastrophic mistakes, a step toward more trustworthy LLM‑driven automation.
Limitations & Future Work
- Reflection quality depends on the summarizer: Poorly summarized feedback can misguide backtracking, limiting gains.
- Backtrack depth is heuristic: Deciding how far to rewind is currently a rule‑based choice; learning an optimal backtrack policy could improve results.
- Scalability to massive state spaces: The current experiments focus on tasks with relatively compact histories; extending to long‑running simulations may require more efficient memory handling.
- Future directions include automated curriculum generation for reflective experiences, integrating learned backtrack policies, and testing LEAFE on real‑world robotics or multi‑agent coordination scenarios.
Authors
- Rui Ge
- Yichao Fu
- Yuyang Qian
- Junda Su
- Yiming Zhao
- Peng Zhao
- Hao Zhang
Paper Information
- arXiv ID: 2603.16843v1
- Categories: cs.AI
- Published: March 17, 2026