[Paper] RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System
Source: arXiv - 2602.02488v1
Overview
The paper introduces RLAnything, a new reinforcement‑learning (RL) framework that treats the environment, the policy, and the reward model as co‑evolving components. By closing the loop between them, the system can continuously amplify learning signals and adapt to any large language model (LLM) or agentic scenario without hand‑crafted reward functions or static simulators. The authors demonstrate that this dynamic trio yields sizable performance jumps on several benchmark tasks, suggesting a practical path toward more autonomous, self‑improving AI agents.
Key Contributions
- Closed‑loop co‑optimization of environment, policy, and reward model, allowing each to improve the others during training.
- Integrated feedback that combines step‑wise (per‑action) signals with high‑level outcome signals for richer policy supervision.
- Consistency‑driven reward learning: the reward model is trained to stay consistent with both the policy’s behavior and critic feedback, reducing reliance on costly human annotations.
- Automatic environment adaptation: the simulated environment is dynamically tuned using critic feedback, enabling the system to learn from its own experience rather than a fixed simulator.
- Theoretical grounding: the authors provide convergence guarantees and show how the dynamic components jointly reduce variance in the RL objective.
- Empirical gains across diverse tasks:
  - +9.1 % on OSWorld (visual‑language reasoning) for Qwen3‑VL‑8B‑Thinking.
  - +18.7 % on AlfWorld and +11.9 % on LiveBench for Qwen2.5‑7B‑Instruct.
- Open‑source release of the codebase (https://github.com/Gen-Verse/Open-AgentRL) to foster reproducibility and community extensions.
Methodology
- Policy Training – The policy (an LLM or agent) receives two streams of feedback:
  - Step‑wise signals (e.g., action‑level rewards, attention to intermediate states).
  - Outcome signals (final task success/failure).
  These are fused into a single loss that the policy optimizes via standard RL algorithms (e.g., PPO).
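As a concrete illustration of the fused signal, the sketch below blends per‑step rewards with a trajectory‑level outcome into discounted returns suitable for a standard policy‑gradient update. The fusion weight `alpha`, the discount `gamma`, and the helper name `fused_returns` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: fuse step-wise and outcome feedback into per-step
# returns for a policy-gradient (e.g., PPO) update. Weights are assumptions.
from typing import List


def fused_returns(step_rewards: List[float],
                  outcome_reward: float,
                  alpha: float = 0.5,
                  gamma: float = 0.99) -> List[float]:
    """Blend per-action rewards with a trajectory-level outcome signal,
    then compute discounted return-to-go for each step."""
    # Spread the outcome signal across all steps (one simple fusion choice).
    blended = [alpha * r + (1 - alpha) * outcome_reward for r in step_rewards]

    returns = [0.0] * len(blended)
    running = 0.0
    for t in reversed(range(len(blended))):
        running = blended[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Toy trajectory: three intermediate rewards, task ultimately succeeded.
    print(fused_returns([0.1, 0.0, 0.2], outcome_reward=1.0))
```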
- Reward Model (RM) Learning – Instead of static human‑labeled rewards, the RM is trained jointly with the policy. It receives consistency feedback: the RM should assign higher scores to trajectories that the policy and a learned critic deem better, and lower scores otherwise. This creates a self‑reinforcing loop where a better RM yields a better policy, which in turn provides cleaner signals for the RM.
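A minimal sketch of what consistency‑driven reward learning can look like: a pairwise (Bradley‑Terry style) loss that penalizes the RM whenever its ranking of two trajectories disagrees with the critic's preference. The function name and the choice of a pairwise loss are assumptions for illustration; the paper's exact training objective may differ.

```python
# Hypothetical consistency loss for the reward model: agree with the critic's
# preference between two trajectories instead of human labels.
import math


def consistency_loss(rm_score_a: float, rm_score_b: float,
                     critic_prefers_a: bool) -> float:
    """Pairwise loss: the RM should score the critic-preferred trajectory higher."""
    margin = rm_score_a - rm_score_b if critic_prefers_a else rm_score_b - rm_score_a
    # -log sigmoid(margin): small when the RM agrees with the critic.
    return math.log(1.0 + math.exp(-margin))


if __name__ == "__main__":
    print(consistency_loss(2.0, 0.5, critic_prefers_a=True))   # agrees -> ~0.20
    print(consistency_loss(0.5, 2.0, critic_prefers_a=True))   # disagrees -> ~1.70
```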
- Dynamic Environment Adaptation – The environment simulator is not fixed. A critic evaluates the current environment’s difficulty and suggests adjustments (e.g., altering task parameters or noise levels). The environment parameters are updated to keep the learning signal informative: neither too easy (no learning) nor too hard (no convergence).
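The sketch below shows one simple way such adaptation can be realized: a scalar difficulty parameter is nudged up or down based on the policy's recent success rate so that tasks stay in an informative band. The target band, step size, and function name are illustrative assumptions rather than values from the paper.

```python
# Hypothetical critic-driven environment adaptation: keep the success rate in
# an informative band by adjusting a single difficulty knob in [0, 1].


def adapt_difficulty(difficulty: float,
                     recent_success_rate: float,
                     target_low: float = 0.3,
                     target_high: float = 0.7,
                     step: float = 0.05) -> float:
    """Return an updated difficulty clipped to [0, 1]."""
    if recent_success_rate > target_high:      # too easy -> raise difficulty
        difficulty += step
    elif recent_success_rate < target_low:     # too hard -> lower difficulty
        difficulty -= step
    return min(1.0, max(0.0, difficulty))


if __name__ == "__main__":
    d = 0.5
    for success_rate in [0.9, 0.9, 0.2, 0.5]:
        d = adapt_difficulty(d, success_rate)
        print(f"success={success_rate:.1f} -> difficulty={d:.2f}")
```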
- Closed‑Loop Optimization – All three components are updated iteratively:
  - Policy → generates trajectories.
  - RM → scores trajectories, providing reward signals.
  - Critic → evaluates both policy and environment, feeding back to adjust the environment and refine the RM.
  The loop continues until performance plateaus.
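Putting the pieces together, the following high‑level sketch mirrors the closed loop described above with every component stubbed out. The class names, the toy success model, and the update rules are illustrative assumptions; the authors' actual implementation lives in the released codebase (https://github.com/Gen-Verse/Open-AgentRL).

```python
# High-level sketch of the closed loop (all components are placeholders).
import random


class Policy:
    def rollout(self, env_difficulty: float) -> dict:
        # Toy trajectory: success is more likely when the task is easier.
        return {"success": random.random() > env_difficulty}

    def update(self, trajectories, rewards):
        pass  # e.g., a PPO step on the fused reward signal


class RewardModel:
    def score(self, trajectory) -> float:
        return 1.0 if trajectory["success"] else 0.0

    def update(self, trajectories, critic_feedback):
        pass  # consistency-driven RM refinement


class Critic:
    def evaluate(self, trajectories) -> float:
        # Feedback here is simply the empirical success rate.
        return sum(t["success"] for t in trajectories) / len(trajectories)


def train(iterations: int = 20, batch: int = 32):
    policy, rm, critic, difficulty = Policy(), RewardModel(), Critic(), 0.5
    for it in range(iterations):
        trajs = [policy.rollout(difficulty) for _ in range(batch)]  # policy -> trajectories
        rewards = [rm.score(t) for t in trajs]                      # RM -> reward signals
        feedback = critic.evaluate(trajs)                           # critic -> feedback
        policy.update(trajs, rewards)
        rm.update(trajs, feedback)
        # Critic feedback also retunes the environment (cf. the adaptation sketch above).
        if feedback > 0.7:
            difficulty = min(1.0, difficulty + 0.05)
        elif feedback < 0.3:
            difficulty = max(0.0, difficulty - 0.05)
        print(f"iter {it}: success_rate={feedback:.2f}, difficulty={difficulty:.2f}")


if __name__ == "__main__":
    train()
```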
- Theoretical Analysis – The authors prove that, under mild assumptions, the joint optimization converges to a stationary point of a joint objective that balances policy performance, reward consistency, and environment relevance.
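For intuition, one plausible shape of such a joint objective is written out below. The specific terms, the critic baseline V, and the weights λ₁, λ₂ are assumptions for illustration only and are not taken from the paper.

```latex
% Illustrative (assumed) joint objective: \pi is the policy, r_\phi the learned
% reward model, \theta_{env} the environment parameters, V_{crit} the critic,
% and \mathcal{R} a penalty keeping the environment relevant/informative.
\max_{\pi,\; r_\phi,\; \theta_{\mathrm{env}}}\quad
  \underbrace{\mathbb{E}_{\tau \sim \pi,\, \theta_{\mathrm{env}}}\!\big[\, r_\phi(\tau) \,\big]}_{\text{policy performance}}
  \;-\; \lambda_1\, \underbrace{\mathbb{E}_{\tau}\!\big[\, \big( r_\phi(\tau) - V_{\mathrm{crit}}(\tau) \big)^2 \,\big]}_{\text{reward consistency}}
  \;-\; \lambda_2\, \underbrace{\mathcal{R}\!\left(\theta_{\mathrm{env}}\right)}_{\text{environment relevance}}
```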
Results & Findings
| Model / Task | Baseline | RLAnything (+Δ) |
|---|---|---|
| Qwen3‑VL‑8B‑Thinking on OSWorld | 71.2 % | 80.3 % (+9.1 %) |
| Qwen2.5‑7B‑Instruct on AlfWorld | 62.5 % | 81.2 % (+18.7 %) |
| Qwen2.5‑7B‑Instruct on LiveBench | 68.4 % | 80.3 % (+11.9 %) |
- Reward Model vs. Human Labels – The learned RM consistently outperformed a reward signal derived from human annotations, indicating that the consistency‑driven approach can replace expensive labeling pipelines.
- Ablation Studies – Removing any of the three dynamic components (policy‑only RL, static RM, or fixed environment) caused a noticeable drop (5‑12 %) in final performance, confirming the synergy of the closed‑loop design.
- Stability – Training curves showed smoother convergence and lower variance when the environment adaptation was enabled, suggesting better sample efficiency.
Practical Implications
- Reduced Annotation Costs – Developers can train task‑specific agents without building large human‑rated reward datasets; the system self‑generates reliable reward signals.
- Rapid Prototyping of New Tasks – By plugging a new environment description into RLAnything, the framework automatically tunes difficulty and reward shaping, cutting the time needed to get a functional agent.
- Scalable Agentic Systems – For products that rely on LLM‑driven agents (e.g., autonomous assistants, code generation bots, or game AI), RLAnything offers a plug‑and‑play way to continuously improve policies as usage data streams in.
- Better Generalization – Dynamic environment adaptation forces the policy to handle a broader distribution of scenarios, which can translate to more robust behavior in real‑world deployments.
- Open‑source Toolkit – The released code includes ready‑made adapters for popular LLM backbones (Qwen, LLaMA, etc.), making it straightforward for engineers to experiment on their own domains.
Limitations & Future Work
- Computational Overhead – Jointly training three interacting modules requires more GPU memory and longer wall‑clock time compared to a standard RL pipeline.
- Environment Design Dependency – While the environment adapts automatically, it still needs an initial parametric simulator; tasks lacking a reasonable simulator may need additional engineering.
- Theoretical Assumptions – Convergence guarantees rely on smoothness and boundedness assumptions that may not hold for extremely large LLMs or highly stochastic environments.
- Future Directions suggested by the authors include:
  - Scaling RLAnything to multi‑agent settings where multiple policies co‑evolve.
  - Exploring meta‑learning techniques to accelerate environment adaptation across tasks.
  - Integrating human‑in‑the‑loop corrections to further refine reward consistency when edge‑case failures arise.
RLAnything showcases how a fully dynamic RL loop can turn the traditionally static components of reinforcement learning into adaptable, self‑improving modules—opening a practical route for developers to build smarter, less hand‑tuned AI agents.
Authors
- Yinjie Wang
- Tianbao Xie
- Ke Shen
- Mengdi Wang
- Ling Yang
Paper Information
- arXiv ID: 2602.02488v1
- Categories: cs.LG, cs.CL
- Published: February 2, 2026