[Paper] Direct Soft-Policy Sampling via Langevin Dynamics
Source: arXiv - 2602.07873v1
Overview
The paper introduces Langevin Q‑Learning (LQL), a novel way to sample soft policies—i.e., Boltzmann‑distributed actions—directly from the Q‑function using Langevin dynamics. By sidestepping explicit policy networks, the authors obtain a mathematically clean bridge between reinforcement learning (RL) theory and practical exploration, and they further improve scalability with Noise‑Conditioned LQL (NC‑LQL), which smooths the Q‑landscape with multi‑scale noise.
Key Contributions
- Direct soft‑policy sampling: Shows how Langevin dynamics driven by the gradient of the Q‑function can generate actions that follow the exact Boltzmann distribution without a parametric policy.
- Langevin Q‑Learning (LQL): Formalizes the above idea into a complete RL algorithm that updates the Q‑function and samples actions on‑the‑fly.
- Noise‑Conditioned extension (NC‑LQL): Introduces a learnable, noise‑conditioned Q‑function that creates a hierarchy of smoothed value landscapes, dramatically accelerating mixing in high‑dimensional, non‑convex action spaces.
- Empirical validation: Demonstrates competitive performance on the MuJoCo continuous control suite, matching or surpassing recent diffusion‑based RL methods while being conceptually simpler and more computationally efficient.
- Theoretical insight: Connects soft‑policy RL objectives to stochastic differential equations, offering a new analytical lens for future algorithm design.
Methodology
Soft‑policy objective – In soft RL, the optimal policy is a Boltzmann distribution over Q‑values:
\[ \pi(a \mid s) \propto \exp\bigl(Q(s,a)/\tau\bigr) \]
where τ is a temperature controlling exploration.
Langevin dynamics for sampling – The authors treat the action a as a particle moving in the Q‑landscape under stochastic dynamics:
\[ a_{t+1} = a_t + \frac{\epsilon}{2}\,\nabla_a Q(s, a_t) + \sqrt{\epsilon}\,\xi_t \]
with step size ε and standard Gaussian noise ξₜ. This Langevin diffusion has the Boltzmann distribution above as its stationary distribution (for τ = 1; other temperatures simply rescale the gradient term), so iterating the update yields samples from the desired soft policy; a minimal code sketch of the sampler appears after the algorithm steps below.
LQL algorithm –
- Q‑learning: Standard off‑policy Bellman updates with a replay buffer.
- Action sampling: For each decision, run a short Langevin chain (a few gradient‑ascent + noise steps) starting from the previous action or a random seed.
- No explicit policy network: The only learned function is the Q‑network, simplifying architecture and eliminating policy‑entropy estimation errors.
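To make this concrete, here is a minimal PyTorch sketch of the per-decision Langevin chain, assuming a differentiable Q‑network `q_net(state, action)` that returns a scalar; the function name, chain length, step size, and action clamping are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def langevin_sample_action(q_net, state, action_dim, n_steps=20,
                           step_size=0.01, temperature=1.0, init_action=None):
    """Sample an action approximately from pi(a|s) ~ exp(Q(s,a)/tau) via a short Langevin chain.

    q_net: maps (state, action) -> scalar Q-value, differentiable in the action.
    All hyper-parameters here are illustrative placeholders.
    """
    if init_action is None:
        # Start from a neutral action; the paper warm-starts from the previous action.
        action = torch.zeros(action_dim)
    else:
        action = init_action.clone()

    for _ in range(n_steps):
        a = action.detach().requires_grad_(True)
        q_value = q_net(state, a)
        # Gradient of Q w.r.t. the action only; the Q-network weights stay fixed.
        grad_a = torch.autograd.grad(q_value, a)[0]
        noise = torch.randn_like(action)
        # Unadjusted Langevin update: drift toward higher Q plus Gaussian noise.
        action = (a + 0.5 * step_size * grad_a / temperature
                  + (step_size ** 0.5) * noise).detach()
        action = action.clamp(-1.0, 1.0)  # assumed bounded action range

    return action
```

In practice the chain would be batched across parallel environments and warm-started from the previous action, as described above.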
Addressing slow mixing – Direct Langevin sampling can get stuck in local modes, especially in high‑dimensional continuous control. NC‑LQL solves this by:
- Learning a noise‑conditioned Q‑function Q_φ(s, a, σ), where σ parameterizes the amount of injected Gaussian smoothing.
- Curriculum of σ: Start with large σ (highly smoothed landscape → easy global exploration) and gradually anneal to small σ (sharp landscape → precise exploitation).
- Multi‑scale sampling: Run Langevin steps with decreasing σ, effectively “zooming in” on promising regions of the action space (see the sketch after this list).
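A hedged sketch of this coarse-to-fine schedule, assuming a noise-conditioned network `q_net(state, action, sigma)` and a hand-picked geometric σ schedule; all names and values are placeholders rather than the paper's settings.

```python
import torch

def nc_langevin_sample(q_net, state, action_dim,
                       sigmas=(1.0, 0.5, 0.25, 0.1, 0.02),
                       steps_per_sigma=5, step_size=0.01, temperature=1.0):
    """Anneal sigma from coarse to fine, running a short Langevin chain at each level.

    q_net(state, action, sigma) -> scalar smoothed Q-value; the signature and the
    sigma schedule are assumptions for illustration.
    """
    action = torch.zeros(action_dim)  # could also warm-start from the previous action
    for sigma in sigmas:              # large sigma: smooth landscape, global moves
        sigma_t = torch.tensor(sigma)
        for _ in range(steps_per_sigma):
            a = action.detach().requires_grad_(True)
            q_value = q_net(state, a, sigma_t)
            grad_a = torch.autograd.grad(q_value, a)[0]
            noise = torch.randn_like(action)
            action = (a + 0.5 * step_size * grad_a / temperature
                      + (step_size ** 0.5) * noise).detach().clamp(-1.0, 1.0)
    return action                     # the final small-sigma sample is the executed action
```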
Training details – The Q‑network is trained jointly with a standard TD‑error loss and a consistency loss across σ values, ensuring that the smoothed Q‑functions remain aligned with the true Q‑values; a sketch of this joint objective follows.
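The summary does not spell out the exact form of the consistency term, so the sketch below pairs the TD objective with one plausible choice (matching the smoothed Q on noise-perturbed actions to the sharp Q); treat the signatures, tensor shapes, and the consistency term itself as assumptions.

```python
import torch
import torch.nn.functional as F

def lql_losses(q_net, target_q_net, batch, sigmas, gamma=0.99):
    """Illustrative joint objective: TD error at sigma = 0 plus a cross-sigma consistency term.

    batch = (state, action, reward, next_state, next_action, done), each of shape [B, ...];
    next_action is assumed to come from the Langevin sampler. The consistency form is a guess.
    """
    state, action, reward, next_state, next_action, done = batch
    zero = torch.zeros(state.shape[0], 1)

    # Standard off-policy TD target, evaluated at the sharpest (unsmoothed) level.
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * target_q_net(next_state, next_action, zero)
    td_loss = F.mse_loss(q_net(state, action, zero), target)

    # Consistency across noise levels: smoothed Q-values should track the sharp one.
    consistency_loss = 0.0
    sharp_q = q_net(state, action, zero).detach()
    for sigma in sigmas:
        sigma_col = torch.full((state.shape[0], 1), sigma)
        # Evaluate the smoothed Q on Gaussian-perturbed actions.
        noisy_action = action + sigma * torch.randn_like(action)
        consistency_loss = consistency_loss + F.mse_loss(
            q_net(state, noisy_action, sigma_col), sharp_q)

    return td_loss + consistency_loss / len(sigmas)
```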
Results & Findings
| Environment (MuJoCo) | NC‑LQL Score | Diffusion‑RL (e.g., Diffusion‑QL) | PPO (baseline) |
|---|---|---|---|
| Hopper-v2 | 3,560 | 3,520 | 2,800 |
| Walker2d-v2 | 4,800 | 4,730 | 3,900 |
| HalfCheetah-v2 | 9,200 | 9,050 | 7,800 |
| Ant-v2 | 5,900 | 5,850 | 4,600 |
- Competitive performance: NC‑LQL matches or slightly exceeds the latest diffusion‑based methods on all benchmarks.
- Sample efficiency: Achieves similar final returns with ~30 % fewer environment steps, thanks to the fast global exploration enabled by large‑σ Langevin steps.
- Computational simplicity: Requires only a single Q‑network and a few gradient steps per action, resulting in lower GPU memory usage and faster wall‑clock time than diffusion models that need many denoising steps.
Practical Implications
- Simpler RL pipelines: Developers can replace a separate actor‑critic architecture with a single Q‑network plus Langevin sampling, reducing code complexity and hyper‑parameter tuning (no separate entropy coefficient).
- Better exploration in continuous control: The multi‑scale noise schedule offers a principled way to balance global exploration and fine‑grained exploitation without handcrafted exploration bonuses.
- Potential for offline / batch RL: Since the policy is implicit, one can generate diverse action proposals from a fixed Q‑function, useful for data augmentation or policy evaluation in safety‑critical domains.
- Hardware‑friendly: Langevin steps are just gradient evaluations; they map well to existing deep‑learning accelerators and can be batched across environments, making the approach attractive for large‑scale training on cloud GPUs or edge devices.
- Foundation for hybrid methods: The noise‑conditioned Q could be combined with model‑based planners or hierarchical RL, providing a smooth “soft‑policy” primitive that other modules can query.
Limitations & Future Work
- Mixing time in extremely high dimensions: Even with noise conditioning, the number of Langevin steps may need to grow for tasks with > 100 action dimensions (e.g., dexterous hand manipulation).
- Sensitivity to temperature τ and step size ε: While less fragile than entropy‑regularized policies, the algorithm still requires careful scaling of these hyper‑parameters for stable training.
- Assumes differentiable Q: The method relies on back‑propagating through the Q‑network for each Langevin step, which can be costly for very deep architectures.
- Future directions suggested by the authors include:
- Adaptive scheduling of σ based on measured mixing or gradient variance.
- Extending the framework to discrete action spaces via Gumbel‑softmax approximations.
- Combining Langevin sampling with learned dynamics models for model‑based RL.
Overall, the paper offers a fresh, theoretically grounded, and practically viable route to soft‑policy RL that could streamline many real‑world reinforcement‑learning systems.
Authors
- Donghyeon Ki
- Hee-Jun Ahn
- Kyungyoon Kim
- Byung-Jun Lee
Paper Information
- arXiv ID: 2602.07873v1
- Categories: cs.LG, cs.AI
- Published: February 8, 2026