[Paper] Direct Soft-Policy Sampling via Langevin Dynamics
Source: arXiv - 2602.07873v1
Overview
The paper introduces Langevin Q‑Learning (LQL), a novel way to sample soft policies—i.e., Boltzmann‑distributed actions—directly from the Q‑function using Langevin dynamics. By sidestepping explicit policy networks, the authors obtain a mathematically clean bridge between reinforcement learning (RL) theory and practical exploration, and they further improve scalability with Noise‑Conditioned LQL (NC‑LQL), which smooths the Q‑landscape with multi‑scale noise.
Key Contributions
- Direct soft‑policy sampling: Shows how Langevin dynamics driven by the gradient of the Q‑function can generate actions that follow the exact Boltzmann distribution without a parametric policy.
- Langevin Q‑Learning (LQL): Formalizes the above idea into a complete RL algorithm that updates the Q‑function and samples actions on‑the‑fly.
- Noise‑Conditioned extension (NC‑LQL): Introduces a learnable, noise‑conditioned Q‑function that creates a hierarchy of smoothed value landscapes, dramatically accelerating mixing in high‑dimensional, non‑convex action spaces.
- Empirical validation: Demonstrates competitive performance on the MuJoCo continuous control suite, matching or surpassing recent diffusion‑based RL methods while being conceptually simpler and more computationally efficient.
- Theoretical insight: Connects soft‑policy RL objectives to stochastic differential equations, offering a new analytical lens for future algorithm design.
Methodology
Soft‑policy objective – In soft RL, the optimal policy is a Boltzmann distribution over Q‑values:
\[ \pi(a \mid s) \propto \exp\bigl(Q(s,a)/\tau\bigr) \]
where τ is a temperature controlling exploration.
Langevin dynamics for sampling – The authors treat the action a as a particle moving in the Q‑landscape under stochastic dynamics:
\[ a_{t+1} = a_t + \frac{\epsilon}{2}\,\nabla_a Q(s, a_t) + \sqrt{\epsilon}\,\xi_t \]
with step size ε and standard Gaussian noise ξₜ. This Langevin diffusion has the Boltzmann distribution above as its stationary distribution (for τ = 1; other temperatures simply rescale the gradient term), so iterating the update yields samples from the desired soft policy; a minimal code sketch of the sampler appears after the algorithm steps below.
LQL algorithm –
- Q‑learning: Standard off‑policy Bellman updates with a replay buffer.
- Action sampling: For each decision, run a short Langevin chain (a few gradient‑ascent + noise steps) starting from the previous action or a random seed.
- No explicit policy network: The only learned function is the Q‑network, simplifying architecture and eliminating policy‑entropy estimation errors.
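To make this concrete, here is a minimal PyTorch sketch of the per-decision Langevin chain, assuming a differentiable Q‑network `q_net(state, action)` that returns a scalar; the function name, chain length, step size, and action clamping are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def langevin_sample_action(q_net, state, action_dim, n_steps=20,
                           step_size=0.01, temperature=1.0, init_action=None):
    """Sample an action approximately from pi(a|s) ~ exp(Q(s,a)/tau) via a short Langevin chain.

    q_net: maps (state, action) -> scalar Q-value, differentiable in the action.
    All hyper-parameters here are illustrative placeholders.
    """
    if init_action is None:
        # Start from a neutral action; the paper warm-starts from the previous action.
        action = torch.zeros(action_dim)
    else:
        action = init_action.clone()

    for _ in range(n_steps):
        a = action.detach().requires_grad_(True)
        q_value = q_net(state, a)
        # Gradient of Q w.r.t. the action only; the Q-network weights stay fixed.
        grad_a = torch.autograd.grad(q_value, a)[0]
        noise = torch.randn_like(action)
        # Unadjusted Langevin update: drift toward higher Q plus Gaussian noise.
        action = (a + 0.5 * step_size * grad_a / temperature
                  + (step_size ** 0.5) * noise).detach()
        action = action.clamp(-1.0, 1.0)  # assumed bounded action range

    return action
```

In practice the chain would be batched across parallel environments and warm-started from the previous action, as described above.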
Addressing slow mixing – Direct Langevin sampling can get stuck in local modes, especially in high‑dimensional continuous control. NC‑LQL solves this by:
- Learning a noise‑conditioned Q‑function Q_φ(s, a, σ), where σ parameterizes the amount of injected Gaussian smoothing.
- Curriculum of σ: Start with large σ (highly smoothed landscape → easy global exploration) and gradually anneal to small σ (sharp landscape → precise exploitation).
- Multi‑scale sampling: Run Langevin steps with decreasing σ, effectively “zooming in” on promising regions of the action space (see the sketch after this list).
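A hedged sketch of this coarse-to-fine schedule, assuming a noise-conditioned network `q_net(state, action, sigma)` and a hand-picked geometric σ schedule; all names and values are placeholders rather than the paper's settings.

```python
import torch

def nc_langevin_sample(q_net, state, action_dim,
                       sigmas=(1.0, 0.5, 0.25, 0.1, 0.02),
                       steps_per_sigma=5, step_size=0.01, temperature=1.0):
    """Anneal sigma from coarse to fine, running a short Langevin chain at each level.

    q_net(state, action, sigma) -> scalar smoothed Q-value; the signature and the
    sigma schedule are assumptions for illustration.
    """
    action = torch.zeros(action_dim)  # could also warm-start from the previous action
    for sigma in sigmas:              # large sigma: smooth landscape, global moves
        sigma_t = torch.tensor(sigma)
        for _ in range(steps_per_sigma):
            a = action.detach().requires_grad_(True)
            q_value = q_net(state, a, sigma_t)
            grad_a = torch.autograd.grad(q_value, a)[0]
            noise = torch.randn_like(action)
            action = (a + 0.5 * step_size * grad_a / temperature
                      + (step_size ** 0.5) * noise).detach().clamp(-1.0, 1.0)
    return action                     # the final small-sigma sample is the executed action
```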
Training details – The Q‑network is trained jointly with a standard TD‑error loss and a consistency loss across σ values, ensuring that the smoothed Q‑functions remain aligned with the true Q‑values; a sketch of this joint objective follows.
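The summary does not spell out the exact form of the consistency term, so the sketch below pairs the TD objective with one plausible choice (matching the smoothed Q on noise-perturbed actions to the sharp Q); treat the signatures, tensor shapes, and the consistency term itself as assumptions.

```python
import torch
import torch.nn.functional as F

def lql_losses(q_net, target_q_net, batch, sigmas, gamma=0.99):
    """Illustrative joint objective: TD error at sigma = 0 plus a cross-sigma consistency term.

    batch = (state, action, reward, next_state, next_action, done), each of shape [B, ...];
    next_action is assumed to come from the Langevin sampler. The consistency form is a guess.
    """
    state, action, reward, next_state, next_action, done = batch
    zero = torch.zeros(state.shape[0], 1)

    # Standard off-policy TD target, evaluated at the sharpest (unsmoothed) level.
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * target_q_net(next_state, next_action, zero)
    td_loss = F.mse_loss(q_net(state, action, zero), target)

    # Consistency across noise levels: smoothed Q-values should track the sharp one.
    consistency_loss = 0.0
    sharp_q = q_net(state, action, zero).detach()
    for sigma in sigmas:
        sigma_col = torch.full((state.shape[0], 1), sigma)
        # Evaluate the smoothed Q on Gaussian-perturbed actions.
        noisy_action = action + sigma * torch.randn_like(action)
        consistency_loss = consistency_loss + F.mse_loss(
            q_net(state, noisy_action, sigma_col), sharp_q)

    return td_loss + consistency_loss / len(sigmas)
```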
Results & Findings
| Environment (MuJoCo) | NC‑LQL Score | Diffusion‑RL (e.g., Diffusion‑QL) | PPO (baseline) |
|---|---|---|---|
| Hopper-v2 | 3,560 | 3,520 | 2,800 |
| Walker2d-v2 | 4,800 | 4,730 | 3,900 |
| HalfCheetah-v2 | 9,200 | 9,050 | 7,800 |
| Ant-v2 | 5,900 | 5,850 | 4,600 |
- Competitive performance: NC‑LQL matches or slightly exceeds the latest diffusion‑based methods on all benchmarks.
- Sample efficiency: Achieves similar final returns with ~30 % fewer environment steps, thanks to the fast global exploration enabled by large‑σ Langevin steps.
- Computational simplicity: Requires only a single Q‑network and a few gradient steps per action, resulting in lower GPU memory usage and faster wall‑clock time than diffusion models that need many denoising steps.
Practical Implications
- Simpler RL pipelines: Developers can replace a separate actor‑critic architecture with a single Q‑network plus Langevin sampling, reducing code complexity and hyper‑parameter tuning (no separate entropy coefficient).
- Better exploration in continuous control: The multi‑scale noise schedule offers a principled way to balance global exploration and fine‑grained exploitation without handcrafted exploration bonuses.
- Potential for offline / batch RL: Since the policy is implicit, one can generate diverse action proposals from a fixed Q‑function, useful for data augmentation or policy evaluation in safety‑critical domains.
- Hardware‑friendly: Langevin steps are just gradient evaluations; they map well to existing deep‑learning accelerators and can be batched across environments, making the approach attractive for large‑scale training on cloud GPUs or edge devices.
- Foundation for hybrid methods: The noise‑conditioned Q could be combined with model‑based planners or hierarchical RL, providing a smooth “soft‑policy” primitive that other modules can query.
Limitations & Future Work
- Mixing time in extremely high dimensions: Even with noise conditioning, the number of Langevin steps may need to grow for tasks with > 100 action dimensions (e.g., dexterous hand manipulation).
- Sensitivity to temperature τ and step size ε: While less fragile than entropy‑regularized policies, the algorithm still requires careful scaling of these hyper‑parameters for stable training.
- Assumes differentiable Q: The method relies on back‑propagating through the Q‑network for each Langevin step, which can be costly for very deep architectures.
- Future directions suggested by the authors include:
- Adaptive scheduling of σ based on measured mixing or gradient variance.
- Extending the framework to discrete action spaces via Gumbel‑softmax approximations.
- Combining Langevin sampling with learned dynamics models for model‑based RL.
Overall, the paper offers a fresh, theoretically grounded, and practically viable route to soft‑policy RL that could streamline many real‑world reinforcement‑learning systems.
Authors
- Donghyeon Ki
- Hee-Jun Ahn
- Kyungyoon Kim
- Byung-Jun Lee
Paper Information
- arXiv ID: 2602.07873v1
- Categories: cs.LG, cs.AI
- Published: February 8, 2026