[Paper] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Source: arXiv - 2602.21158v1
Overview
The paper SELAUR: Self‑Evolving LLM Agent via Uncertainty‑aware Rewards proposes a new way to train large language models (LLMs) that act as autonomous agents. By weaving the model’s own uncertainty into the reward signal, SELAUR lets agents explore more intelligently and learn faster, leading to higher success rates on complex decision‑making tasks such as interactive household simulation (ALFWorld) and web‑based shopping (WebShop).
Key Contributions
- Uncertainty‑driven reward design – combines entropy, least‑confidence, and margin metrics into a single token‑level uncertainty score that directly influences both step‑wise and trajectory‑level rewards.
- Failure‑aware reward reshaping – injects uncertainty signals when an episode fails, turning “mistakes” into useful learning cues rather than pure penalties.
- Dense, confidence‑aligned supervision – provides richer feedback than sparse binary rewards, improving credit assignment across long action sequences.
- Empirical gains on two diverse benchmarks – SELAUR consistently outperforms strong RL‑from‑human‑feedback (RLHF) and PPO baselines on ALFWorld (embodied tasks) and WebShop (web navigation).
- Comprehensive ablations – demonstrate the individual impact of each uncertainty component and the robustness benefits of the failure‑aware reshaping.
Methodology
1. Token‑level uncertainty estimation
- For every generated token, three classic uncertainty measures are computed:
- Entropy – captures overall distribution spread.
- Least confidence – 1 − max probability, highlighting the most doubtful prediction.
- Margin – the gap between the top‑2 token probabilities; a small gap indicates the model is nearly torn between two alternative choices.
- These scores are normalized and summed to produce a single uncertainty value per token, which is then aggregated (e.g., averaged) over the tokens that constitute an action step.
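The three measures above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the equal‑weight averaging, the min‑max style normalization, and the helper names (`token_uncertainty`, `step_uncertainty`) are all assumptions.

```python
import numpy as np

def token_uncertainty(probs: np.ndarray) -> float:
    """Combine entropy, least-confidence, and margin into one score in [0, 1].

    `probs` is the softmax distribution over the vocabulary for one token.
    The equal-weight average below is illustrative, not the paper's formula.
    """
    # Entropy, normalized by its maximum (the uniform distribution).
    entropy = -np.sum(probs * np.log(probs + 1e-12)) / np.log(len(probs))
    # Least confidence: 1 minus the top probability.
    least_conf = 1.0 - probs.max()
    # Margin: a small gap between the top-2 probabilities means high uncertainty,
    # so we invert it.
    top2 = np.sort(probs)[-2:]
    margin = 1.0 - (top2[1] - top2[0])
    return float((entropy + least_conf + margin) / 3.0)

def step_uncertainty(token_probs: list[np.ndarray]) -> float:
    """Aggregate (here: average) token-level uncertainty over one action step."""
    return float(np.mean([token_uncertainty(p) for p in token_probs]))
```

A sharply peaked distribution yields a score near 0, while a uniform distribution yields a score near 1, matching the intuition that spread‑out predictions signal doubt.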
2. Uncertainty‑aware reward shaping
- Step‑level reward: the base task reward (e.g., +1 for success, 0 otherwise) is modulated by the inverse of the step’s uncertainty, rewarding confident correct actions and penalizing over‑confident mistakes.
- Trajectory‑level reward: when an episode ends in failure, the accumulated uncertainty across the trajectory is used to redistribute reward, encouraging the agent to revisit high‑uncertainty regions in future attempts.
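The two shaping mechanisms can be sketched as below. The specific modulation (linear scaling by `alpha`) and the proportional redistribution of a terminal penalty are assumptions made for illustration; the paper's exact reshaping functions may differ.

```python
def shaped_step_reward(base_reward: float, uncertainty: float,
                       success: bool, alpha: float = 1.0) -> float:
    """Uncertainty-modulated step reward (illustrative `alpha` weighting).

    Confident correct actions keep most of the base reward; confident
    mistakes are penalized harder than uncertain ones.
    """
    if success:
        return base_reward * (1.0 - alpha * uncertainty)
    return -alpha * (1.0 - uncertainty)  # over-confidence when wrong costs more

def redistribute_failure_reward(step_uncertainties: list[float],
                                penalty: float = -1.0) -> list[float]:
    """On a failed episode, spread the terminal penalty in proportion to each
    step's uncertainty, so high-uncertainty steps carry the learning signal."""
    total = sum(step_uncertainties) or 1.0
    return [penalty * u / total for u in step_uncertainties]
```

Under this sketch, a confident correct step keeps its full reward, while a failed episode pushes the strongest corrective signal onto the steps the model was least sure about.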
3. RL loop
- The agent is fine‑tuned with Proximal Policy Optimization (PPO) where the policy gradient is computed using the uncertainty‑aware rewards.
- The LLM’s parameters are updated jointly with a value head that also receives the uncertainty‑augmented signal, stabilizing learning.
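Inside the PPO loop, the only SELAUR‑specific change is that the advantage estimates are computed over the uncertainty‑aware rewards rather than the raw task rewards. A standard generalized advantage estimation (GAE) step, assumed here as the advantage estimator, looks like this:

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation over uncertainty-shaped rewards.

    This is plain PPO machinery; SELAUR's contribution enters through
    `rewards`, which are assumed to be the uncertainty-aware rewards.
    """
    adv = np.zeros_like(rewards, dtype=float)
    last = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with the next state's value, or 0 at episode end.
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

The value head mentioned above would supply `values`; training it on the same uncertainty‑augmented signal keeps the critic and the shaped rewards consistent.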
4. Self‑evolution
- As training proceeds, the model’s uncertainty naturally shrinks on familiar sub‑tasks, shifting exploration toward the remaining “unknown” parts of the environment—hence the “self‑evolving” behavior.
Results & Findings
| Benchmark | Baseline (PPO) Success | SELAUR Success | Relative Gain |
|---|---|---|---|
| ALFWorld (Household tasks) | 42.3 % | 55.8 % | +13.5 pp |
| WebShop (Web navigation) | 31.7 % | 44.2 % | +12.5 pp |
- Exploration efficiency: SELAUR reaches comparable performance in roughly 30‑40 % fewer training steps, thanks to uncertainty‑guided exploration.
- Stability: Variance across random seeds drops noticeably, indicating that the uncertainty signal reduces catastrophic policy swings.
- Ablation insights: Removing any of the three uncertainty components hurts performance (entropy ≈ ‑3 pp, least‑confidence ≈ ‑2 pp, margin ≈ ‑1 pp). The failure‑aware reshaping contributes the largest single boost (+5 pp).
Practical Implications
- Better autonomous assistants – Developers building chat‑oriented bots that must plan multi‑turn actions (e.g., scheduling, troubleshooting) can adopt SELAUR’s reward scheme to make the agents more self‑aware of their confidence, leading to fewer dead‑ends.
- Reduced need for exhaustive human feedback – By extracting learning signals from the model’s own uncertainty, teams can cut down on costly RLHF data collection, especially for niche domains where labeled trajectories are scarce.
- Improved safety in high‑stakes deployments – Uncertainty‑aware rewards naturally penalize over‑confident mistakes, which is valuable for agents operating in regulated environments (finance, healthcare) where blind confidence can be dangerous.
- Plug‑and‑play integration – The uncertainty computation works on top of any decoder‑only LLM that provides token logits, meaning existing pipelines (OpenAI API, Hugging Face Transformers) can incorporate SELAUR with minimal code changes.
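To illustrate the plug‑and‑play point: any decoder‑only model exposing raw token logits (e.g., the `logits` tensor returned by Hugging Face Transformers models) can feed a vectorized version of the uncertainty score directly. The function below is a sketch under that assumption; its name and the equal‑weight combination are illustrative, not an official API.

```python
import numpy as np

def logits_to_uncertainty(logits: np.ndarray) -> np.ndarray:
    """Per-token uncertainty from raw logits of shape [seq_len, vocab_size].

    Mirrors the entropy / least-confidence / margin mix, vectorized so an
    entire generated sequence is scored in one pass.
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    entropy = -(probs * np.log(probs + 1e-12)).sum(-1) / np.log(probs.shape[-1])
    least_conf = 1.0 - probs.max(-1)
    top2 = np.sort(probs, axis=-1)[:, -2:]
    margin = 1.0 - (top2[:, 1] - top2[:, 0])
    return (entropy + least_conf + margin) / 3.0
```

Because it only consumes logits, this sits downstream of generation and requires no changes to the model or decoding loop.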
Limitations & Future Work
- Scalability of uncertainty computation – Calculating three metrics per token adds overhead; the authors note a ~15 % slowdown in training throughput. Optimizing this (e.g., approximations or batch‑wise caching) is an open avenue.
- Domain transfer – Experiments focus on simulated environments; it remains to be seen how well the approach generalizes to real‑world web APIs or physical robots with noisy observations.
- Reward design still task‑specific – While uncertainty is universal, the exact weighting of step‑ vs. trajectory‑level reshaping may need tuning per domain. Future work could explore meta‑learning the weighting automatically.
- Long‑horizon credit assignment – For extremely long episodes (hundreds of steps), uncertainty alone may not fully resolve delayed reward issues; combining with hierarchical RL could be promising.
Overall, SELAUR opens a practical path for developers to make LLM‑based agents that learn from their own confidence signals, delivering more robust and efficient autonomous systems.
Authors
- Dengjia Zhang
- Xiaoou Liu
- Lu Cheng
- Yaqing Wang
- Kenton Murray
- Hua Wei
Paper Information
- arXiv ID: 2602.21158v1
- Categories: cs.LG, cs.CL
- Published: February 24, 2026