[Paper] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Source: arXiv - 2602.21158v1
Overview
The paper SELAUR: Self‑Evolving LLM Agent via Uncertainty‑aware Rewards proposes a new way to train large language models (LLMs) that act as autonomous agents. By weaving the model’s own uncertainty into the reward signal, SELAUR lets agents explore more intelligently and learn faster, leading to higher success rates on complex decision‑making tasks such as interactive household simulation (ALFWorld) and web‑based shopping (WebShop).
Key Contributions
- Uncertainty‑driven reward design – combines entropy, least‑confidence, and margin metrics into a single token‑level uncertainty score that directly influences both step‑wise and trajectory‑level rewards.
- Failure‑aware reward reshaping – injects uncertainty signals when an episode fails, turning “mistakes” into useful learning cues rather than pure penalties.
- Dense, confidence‑aligned supervision – provides richer feedback than sparse binary rewards, improving credit assignment across long action sequences.
- Empirical gains on two diverse benchmarks – SELAUR consistently outperforms strong RL‑from‑human‑feedback (RLHF) and PPO baselines on ALFWorld (embodied tasks) and WebShop (web navigation).
- Comprehensive ablations – demonstrate the individual impact of each uncertainty component and the robustness benefits of the failure‑aware reshaping.
Methodology
1. Token‑level uncertainty estimation
- For every generated token, three classic uncertainty measures are computed:
- Entropy – captures overall distribution spread.
- Least confidence – 1 − max probability, highlighting the most doubtful prediction.
- Margin – the gap between the top‑2 token probabilities; a small gap indicates the model is nearly torn between two alternative choices.
- These scores are normalized and summed to produce a single uncertainty value per token, which is then aggregated (e.g., averaged) over the tokens that constitute an action step.
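The three measures above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the equal‑weight averaging, the min‑max style normalization, and the helper names (`token_uncertainty`, `step_uncertainty`) are all assumptions.

```python
import numpy as np

def token_uncertainty(probs: np.ndarray) -> float:
    """Combine entropy, least-confidence, and margin into one score in [0, 1].

    `probs` is the softmax distribution over the vocabulary for one token.
    The equal-weight average below is illustrative, not the paper's formula.
    """
    # Entropy, normalized by its maximum (the uniform distribution).
    entropy = -np.sum(probs * np.log(probs + 1e-12)) / np.log(len(probs))
    # Least confidence: 1 minus the top probability.
    least_conf = 1.0 - probs.max()
    # Margin: a small gap between the top-2 probabilities means high uncertainty,
    # so we invert it.
    top2 = np.sort(probs)[-2:]
    margin = 1.0 - (top2[1] - top2[0])
    return float((entropy + least_conf + margin) / 3.0)

def step_uncertainty(token_probs: list[np.ndarray]) -> float:
    """Aggregate (here: average) token-level uncertainty over one action step."""
    return float(np.mean([token_uncertainty(p) for p in token_probs]))
```

A sharply peaked distribution yields a score near 0, while a uniform distribution yields a score near 1, matching the intuition that spread‑out predictions signal doubt.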
2. Uncertainty‑aware reward shaping
- Step‑level reward: the base task reward (e.g., +1 for success, 0 otherwise) is modulated by the inverse of the step’s uncertainty, rewarding confident correct actions and penalizing over‑confident mistakes.
- Trajectory‑level reward: when an episode ends in failure, the accumulated uncertainty across the trajectory is used to redistribute reward, encouraging the agent to revisit high‑uncertainty regions in future attempts.
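The two shaping mechanisms can be sketched as below. The specific modulation (linear scaling by `alpha`) and the proportional redistribution of a terminal penalty are assumptions made for illustration; the paper's exact reshaping functions may differ.

```python
def shaped_step_reward(base_reward: float, uncertainty: float,
                       success: bool, alpha: float = 1.0) -> float:
    """Uncertainty-modulated step reward (illustrative `alpha` weighting).

    Confident correct actions keep most of the base reward; confident
    mistakes are penalized harder than uncertain ones.
    """
    if success:
        return base_reward * (1.0 - alpha * uncertainty)
    return -alpha * (1.0 - uncertainty)  # over-confidence when wrong costs more

def redistribute_failure_reward(step_uncertainties: list[float],
                                penalty: float = -1.0) -> list[float]:
    """On a failed episode, spread the terminal penalty in proportion to each
    step's uncertainty, so high-uncertainty steps carry the learning signal."""
    total = sum(step_uncertainties) or 1.0
    return [penalty * u / total for u in step_uncertainties]
```

Under this sketch, a confident correct step keeps its full reward, while a failed episode pushes the strongest corrective signal onto the steps the model was least sure about.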
3. RL loop
- The agent is fine‑tuned with Proximal Policy Optimization (PPO) where the policy gradient is computed using the uncertainty‑aware rewards.
- The LLM’s parameters are updated jointly with a value head that also receives the uncertainty‑augmented signal, stabilizing learning.
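Inside the PPO loop, the only SELAUR‑specific change is that the advantage estimates are computed over the uncertainty‑aware rewards rather than the raw task rewards. A standard generalized advantage estimation (GAE) step, assumed here as the advantage estimator, looks like this:

```python
import numpy as np

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation over uncertainty-shaped rewards.

    This is plain PPO machinery; SELAUR's contribution enters through
    `rewards`, which are assumed to be the uncertainty-aware rewards.
    """
    adv = np.zeros_like(rewards, dtype=float)
    last = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with the next state's value, or 0 at episode end.
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

The value head mentioned above would supply `values`; training it on the same uncertainty‑augmented signal keeps the critic and the shaped rewards consistent.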
4. Self‑evolution
- As training proceeds, the model’s uncertainty naturally shrinks on familiar sub‑tasks, shifting exploration toward the remaining “unknown” parts of the environment—hence the “self‑evolving” behavior.
Results & Findings
| Benchmark | Baseline (PPO) Success | SELAUR Success | Relative Gain |
|---|---|---|---|
| ALFWorld (Household tasks) | 42.3 % | 55.8 % | +13.5 pp |
| WebShop (Web navigation) | 31.7 % | 44.2 % | +12.5 pp |
- Exploration efficiency: SELAUR reaches comparable performance in roughly 30‑40 % fewer training steps, thanks to uncertainty‑guided exploration.
- Stability: Variance across random seeds drops noticeably, indicating that the uncertainty signal reduces catastrophic policy swings.
- Ablation insights: Removing any of the three uncertainty components hurts performance (entropy ≈ ‑3 pp, least‑confidence ≈ ‑2 pp, margin ≈ ‑1 pp). The failure‑aware reshaping contributes the largest single boost (+5 pp).
Practical Implications
- Better autonomous assistants – Developers building chat‑oriented bots that must plan multi‑turn actions (e.g., scheduling, troubleshooting) can adopt SELAUR’s reward scheme to make the agents more self‑aware of their confidence, leading to fewer dead‑ends.
- Reduced need for exhaustive human feedback – By extracting learning signals from the model’s own uncertainty, teams can cut down on costly RLHF data collection, especially for niche domains where labeled trajectories are scarce.
- Improved safety in high‑stakes deployments – Uncertainty‑aware rewards naturally penalize over‑confident mistakes, which is valuable for agents operating in regulated environments (finance, healthcare) where blind confidence can be dangerous.
- Plug‑and‑play integration – The uncertainty computation works on top of any decoder‑only LLM that provides token logits, meaning existing pipelines (OpenAI API, Hugging Face Transformers) can incorporate SELAUR with minimal code changes.
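To illustrate the plug‑and‑play point: any decoder‑only model exposing raw token logits (e.g., the `logits` tensor returned by Hugging Face Transformers models) can feed a vectorized version of the uncertainty score directly. The function below is a sketch under that assumption; its name and the equal‑weight combination are illustrative, not an official API.

```python
import numpy as np

def logits_to_uncertainty(logits: np.ndarray) -> np.ndarray:
    """Per-token uncertainty from raw logits of shape [seq_len, vocab_size].

    Mirrors the entropy / least-confidence / margin mix, vectorized so an
    entire generated sequence is scored in one pass.
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    entropy = -(probs * np.log(probs + 1e-12)).sum(-1) / np.log(probs.shape[-1])
    least_conf = 1.0 - probs.max(-1)
    top2 = np.sort(probs, axis=-1)[:, -2:]
    margin = 1.0 - (top2[:, 1] - top2[:, 0])
    return (entropy + least_conf + margin) / 3.0
```

Because it only consumes logits, this sits downstream of generation and requires no changes to the model or decoding loop.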
Limitations & Future Work
- Scalability of uncertainty computation – Calculating three metrics per token adds overhead; the authors note a ~15 % slowdown in training throughput. Optimizing this (e.g., approximations or batch‑wise caching) is an open avenue.
- Domain transfer – Experiments focus on simulated environments; it remains to be seen how well the approach generalizes to real‑world web APIs or physical robots with noisy observations.
- Reward design still task‑specific – While uncertainty is universal, the exact weighting of step‑ vs. trajectory‑level reshaping may need tuning per domain. Future work could explore meta‑learning the weighting automatically.
- Long‑horizon credit assignment – For extremely long episodes (hundreds of steps), uncertainty alone may not fully resolve delayed reward issues; combining with hierarchical RL could be promising.
Overall, SELAUR opens a practical path for developers to make LLM‑based agents that learn from their own confidence signals, delivering more robust and efficient autonomous systems.
Authors
- Dengjia Zhang
- Xiaoou Liu
- Lu Cheng
- Yaqing Wang
- Kenton Murray
- Hua Wei
Paper Information
- arXiv ID: 2602.21158v1
- Categories: cs.LG, cs.CL
- Published: February 24, 2026