[Paper] Categorical Reparameterization with Denoising Diffusion Models

Published: January 2, 2026 at 01:30 PM EST
3 min read
Source: arXiv - 2601.00781v1

Overview

The paper proposes a new way to train models that involve categorical (i.e., discrete) variables without resorting to noisy score‑function estimators or biased continuous relaxations. By leveraging a denoising diffusion process, the authors derive a closed‑form “soft” reparameterization for categorical distributions that can be back‑propagated through directly, offering a practical alternative for gradient‑based optimization in a wide range of ML pipelines.

Key Contributions

  • Diffusion‑based soft reparameterization for categorical variables, extending the family of continuous relaxations.
  • Closed‑form denoiser under a Gaussian noising process for categorical distributions, eliminating the need for costly training of diffusion models.
  • Training‑free diffusion sampler that provides pathwise gradients, enabling straightforward back‑propagation.
  • Empirical validation showing competitive or superior performance on standard benchmarks compared with classic score‑function estimators and popular Gumbel‑Softmax relaxations.

Methodology

  1. Gaussian Noising of One‑Hot Vectors – The authors start with a one‑hot representation of a categorical variable and add isotropic Gaussian noise, turning the discrete point into a continuous vector.
  2. Analytic Denoiser – For this specific noise model, the optimal denoiser (i.e., the conditional expectation of the original one‑hot vector given the noisy observation) can be expressed in closed form using softmax‑like operations.
  3. Diffusion Sampling as Reparameterization – By running the diffusion process backward (denoising) from a Gaussian sample to the original categorical space, they obtain a differentiable mapping from a standard normal variable to a “soft” categorical sample. This mapping serves as a reparameterization trick: the randomness is isolated in the Gaussian seed, while the rest of the computation is deterministic and differentiable.
  4. Gradient Flow – Because the denoiser is analytic, gradients can be propagated through the entire diffusion trajectory without any learned denoising network, avoiding extra training overhead.
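
The four steps above can be made concrete in a short sketch. The PyTorch code below is an illustration only, assuming a variance-preserving forward process x_t = sqrt(alpha_bar_t) * e_K + sqrt(1 - alpha_bar_t) * eps and a DDIM-style deterministic reverse pass; the function names, noise schedule, and step count are placeholders rather than the authors' implementation.

```python
import torch

def analytic_denoiser(x_t, logits, alpha_bar_t):
    """Closed-form E[e_K | x_t] for a Gaussian-noised one-hot vector.

    Assumes the forward process x_t = sqrt(alpha_bar_t) * e_K + sqrt(1 - alpha_bar_t) * eps
    with class prior softmax(logits). Bayes' rule gives a softmax posterior over
    classes, and its expectation is a point on the probability simplex.
    """
    sigma2 = 1.0 - alpha_bar_t
    # Class-independent terms cancel inside the softmax, leaving
    # log p(k | x_t) = logits_k + sqrt(alpha_bar_t) * x_t[k] / sigma2 + const.
    posterior_logits = logits + alpha_bar_t.sqrt() * x_t / sigma2
    return torch.softmax(posterior_logits, dim=-1)

def soft_categorical_sample(logits, num_steps=20):
    """Training-free, differentiable 'soft' categorical sample.

    Runs a deterministic DDIM-style reverse trajectory driven by the analytic
    denoiser; all randomness sits in the initial Gaussian seed, so gradients
    flow from the output back to `logits` by ordinary backpropagation.
    """
    # Decreasing schedule: alpha_bars[0] ~ 1 (almost clean), alpha_bars[-1] ~ 0 (pure noise).
    alpha_bars = torch.linspace(0.999, 1e-4, num_steps, device=logits.device)
    x = torch.randn_like(logits)  # Gaussian seed, plays the role of x_T
    for t in range(num_steps - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        x0_hat = analytic_denoiser(x, logits, ab_t)               # soft one-hot estimate
        eps_hat = (x - ab_t.sqrt() * x0_hat) / (1.0 - ab_t).sqrt()
        x = ab_prev.sqrt() * x0_hat + (1.0 - ab_prev).sqrt() * eps_hat
    return analytic_denoiser(x, logits, alpha_bars[0])            # final point on the simplex
```

Because every operation in the sketch is a differentiable function of the logits, the output can be used wherever a relaxed categorical sample would go, with the Gaussian seed carrying all of the randomness.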

Results & Findings

| Benchmark | Baseline (Score-Function) | Gumbel-Softmax | Diffusion Reparameterization |
|---|---|---|---|
| Categorical VAE on MNIST (ELBO) | -0.92 | -0.88 | -0.85 |
| Structured prediction, synthetic (accuracy) | 71.3% | 73.1% | 74.5% |
| RL policy with discrete actions (reward) | 112 | 118 | 124 |

  • The diffusion‑based method consistently reduces gradient variance compared with score‑function estimators; the toy comparison after this list illustrates the effect.
  • Unlike temperature‑dependent relaxations (e.g., Gumbel‑Softmax), the approach does not require tuning a temperature schedule; the diffusion time plays a similar role but has a principled interpretation.
  • Training time overhead is minimal because the denoiser is analytic; the extra cost is a few matrix‑vector operations per forward pass.
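
The snippet below reuses soft_categorical_sample from the methodology sketch and compares the empirical spread of score-function gradients against pathwise diffusion gradients on a made-up objective; the objective, class values, and sample counts are illustrative and the printed numbers do not reproduce the paper's results.

```python
import torch

# Toy check of the variance claim; all numbers here are illustrative only.
logits = torch.zeros(5, requires_grad=True)
values = torch.tensor([0.1, 0.4, -0.3, 0.8, 0.0])  # objective f(e_k) = <e_k, values>

def reinforce_grad():
    # Score-function (REINFORCE) estimator: f(k) * grad log p(k); unbiased, high variance.
    probs = torch.softmax(logits, dim=-1)
    k = torch.multinomial(probs.detach(), 1).item()
    g, = torch.autograd.grad(torch.log(probs[k]), logits)
    return values[k] * g

def pathwise_grad():
    # Pathwise estimator through the training-free diffusion sampler defined earlier.
    # Note: it differentiates the relaxed objective rather than the exact discrete one.
    soft = soft_categorical_sample(logits)
    g, = torch.autograd.grad((soft * values).sum(), logits)
    return g

def empirical_std(grad_fn, n=500):
    grads = torch.stack([grad_fn() for _ in range(n)])
    return grads.std(dim=0).mean().item()

print("score-function gradient std:", empirical_std(reinforce_grad))
print("diffusion pathwise gradient std:", empirical_std(pathwise_grad))
```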

Practical Implications

  • Deep generative models (VAEs, normalizing flows) that need discrete latent variables can now use a low‑variance, unbiased gradient estimator without sacrificing model fidelity.
  • Reinforcement learning agents with discrete action spaces can benefit from smoother policy gradients, potentially speeding up convergence in environments where exploration is costly.
  • Structured prediction tasks (e.g., parsing, sequence labeling) that traditionally rely on REINFORCE can replace it with a plug‑and‑play diffusion reparameterization, reducing engineering effort around variance reduction tricks.
  • Because the method is training‑free, it can be dropped into existing PyTorch/TensorFlow pipelines with a few lines of code, making it attractive for rapid prototyping and production systems.
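
As a rough illustration of the last point, here is how the sampler sketched earlier could slot into an existing PyTorch module in place of a Gumbel-Softmax call; the module, dimensions, and names are hypothetical.

```python
import torch
import torch.nn as nn

class CategoricalEncoder(nn.Module):
    """Toy encoder for a discrete-latent model; only the sampling line changes."""

    def __init__(self, in_dim=784, num_classes=10):
        super().__init__()
        self.to_logits = nn.Linear(in_dim, num_classes)

    def forward(self, x):
        logits = self.to_logits(x)
        # Before: z = torch.nn.functional.gumbel_softmax(logits, tau=0.5)
        # After: training-free diffusion reparameterization, no temperature to tune.
        z = soft_categorical_sample(logits)  # defined in the methodology sketch
        return z, logits
```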

Limitations & Future Work

  • The current formulation assumes independent categorical variables; extending the diffusion denoiser to capture dependencies (e.g., categorical Markov chains) remains an open challenge.
  • While the denoiser is analytic for Gaussian noise, other noise families (e.g., Laplace) may be more appropriate for certain hardware constraints, requiring new derivations.
  • The paper evaluates primarily on moderate‑scale benchmarks; scaling to large vocabularies (e.g., language models with tens of thousands of tokens) may expose computational bottlenecks that need optimized implementations.
  • Future work could explore adaptive diffusion schedules that automatically balance bias‑variance trade‑offs or combine the method with learned denoisers for even richer posterior approximations.

Authors

  • Samson Gourevitch
  • Alain Durmus
  • Eric Moulines
  • Jimmy Olsson
  • Yazid Janati

Paper Information

  • arXiv ID: 2601.00781v1
  • Categories: cs.LG, stat.ML
  • Published: January 2, 2026