[Paper] Categorical Reparameterization with Denoising Diffusion Models
Source: arXiv - 2601.00781v1
Overview
The paper proposes a new way to train models that involve categorical (i.e., discrete) variables without resorting to noisy score‑function estimators or biased continuous relaxations. By leveraging a denoising diffusion process, the authors derive a closed‑form “soft” reparameterization for categorical distributions that can be back‑propagated through directly, offering a practical alternative for gradient‑based optimization in a wide range of ML pipelines.
Key Contributions
- Diffusion‑based soft reparameterization for categorical variables, extending the family of continuous relaxations.
- Closed‑form denoiser under a Gaussian noising process for categorical distributions, eliminating the need for costly training of diffusion models (a one‑line derivation is sketched after this list).
- Training‑free diffusion sampler that provides pathwise gradients, enabling straightforward back‑propagation.
- Empirical validation showing competitive or superior performance on standard benchmarks compared with classic score‑function estimators and popular Gumbel‑Softmax relaxations.
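The closed‑form denoiser admits a short Bayes‑rule derivation. The sketch below assumes an affine Gaussian noising x_t = α_t x + σ_t ε of the one‑hot vector x with class prior π; the paper's exact schedule and notation may differ.

```latex
% Posterior mean of a one-hot x (classes e_1,...,e_K, prior weights \pi) given the
% Gaussian-noised observation x_t = \alpha_t x + \sigma_t \varepsilon, \varepsilon ~ N(0, I).
\mathbb{E}[x \mid x_t]_k
  = \frac{\pi_k \, \mathcal{N}(x_t;\, \alpha_t e_k,\, \sigma_t^2 I)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_t;\, \alpha_t e_j,\, \sigma_t^2 I)}
  = \operatorname{softmax}\!\Big(\log \pi + \frac{\alpha_t}{\sigma_t^2}\, x_t\Big)_k .
```

The quadratic terms in the Gaussian exponents are identical across classes because every e_k is one‑hot, so they cancel in the normalization; this is what reduces the posterior mean to the softmax‑like operation described under Methodology below.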
Methodology
- Gaussian Noising of One‑Hot Vectors – The authors start with a one‑hot representation of a categorical variable and add isotropic Gaussian noise, turning the discrete point into a continuous vector.
- Analytic Denoiser – For this specific noise model, the optimal denoiser (i.e., the conditional expectation of the original one‑hot vector given the noisy observation) can be expressed in closed form using softmax‑like operations.
- Diffusion Sampling as Reparameterization – By running the diffusion process backward (denoising) from a Gaussian sample to the original categorical space, they obtain a differentiable mapping from a standard normal variable to a “soft” categorical sample. This mapping serves as a reparameterization trick: the randomness is isolated in the Gaussian seed, while the rest of the computation is deterministic and differentiable.
- Gradient Flow – Because the denoiser is analytic, gradients can be propagated through the entire diffusion trajectory without any learned denoising network, avoiding extra training overhead; a code sketch follows this list.
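Below is a minimal PyTorch sketch of these four steps, assuming a simple linear schedule (α_t = 1 − t, σ_t = t), the Bayes‑rule denoiser derived above, and a DDIM‑style deterministic reverse update; the schedule, sampler details, and function names are illustrative assumptions, not the authors' implementation.

```python
# A minimal, training-free sketch of the diffusion-based reparameterization.
# Assumptions (not necessarily the paper's exact choices): affine Gaussian noising
# x_t = alpha_t * x + sigma_t * eps of the one-hot vector, the analytic posterior-mean
# denoiser softmax(log_pi + alpha_t * x_t / sigma_t^2), a linear schedule
# alpha_t = 1 - t, sigma_t = t, and a DDIM-style deterministic reverse update.
import torch


def analytic_denoiser(x_t, log_pi, alpha_t, sigma_t):
    """Closed-form E[x | x_t] for a one-hot x under affine Gaussian noising."""
    return torch.softmax(log_pi + alpha_t * x_t / sigma_t**2, dim=-1)


def diffusion_categorical_sample(logits, num_steps=20, t_min=0.05, eps=None):
    """Differentiable 'soft' categorical sample over the last dimension.

    All randomness sits in the Gaussian seed `eps`; the reverse trajectory is a
    deterministic, differentiable function of `logits`, so pathwise gradients
    flow back through every denoiser call.
    """
    log_pi = torch.log_softmax(logits, dim=-1)
    if eps is None:
        eps = torch.randn_like(log_pi)              # reparameterization seed
    ts = torch.linspace(1.0, t_min, num_steps + 1)  # t = 1 (pure noise) -> t_min
    x = eps                                         # at t = 1: alpha = 0, sigma = 1
    for t, t_next in zip(ts[:-1], ts[1:]):
        alpha, sigma = 1.0 - t, t
        alpha_next, sigma_next = 1.0 - t_next, t_next
        x0_hat = analytic_denoiser(x, log_pi, alpha, sigma)
        # DDIM-style step: keep the current noise direction, move toward x0_hat
        x = alpha_next * x0_hat + (sigma_next / sigma) * (x - alpha * x0_hat)
    # Final soft sample: posterior mean at the smallest noise level (rows lie on the simplex)
    return analytic_denoiser(x, log_pi, 1.0 - ts[-1], ts[-1])
```

In this sketch the final time `t_min` controls how sharply the output concentrates on a single category, which is the temperature‑like role of the diffusion time discussed under Results & Findings.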
Results & Findings
| Benchmark | Metric | Score‑Function Baseline | Gumbel‑Softmax | Diffusion Reparameterization |
|---|---|---|---|---|
| Categorical VAE on MNIST | ELBO (higher is better) | -0.92 | -0.88 | -0.85 |
| Structured prediction (synthetic) | Accuracy | 71.3% | 73.1% | 74.5% |
| Reinforcement learning policy with discrete actions | Reward | 112 | 118 | 124 |
- The diffusion‑based method consistently reduces gradient variance compared with score‑function estimators; a toy sanity check of this effect appears after this list.
- Unlike temperature‑dependent relaxations (e.g., Gumbel‑Softmax), the approach does not require tuning a temperature schedule; the diffusion time plays a similar role but has a principled interpretation.
- Training time overhead is minimal because the denoiser is analytic; the extra cost is a few matrix‑vector operations per forward pass.
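To make the variance claim concrete without reproducing the paper's benchmarks, the toy check below reuses the hypothetical `diffusion_categorical_sample` from the Methodology sketch and compares the empirical per‑coordinate variance of the plain score‑function estimator (no baseline) against the pathwise estimator on a linear objective. Note that the pathwise estimator differentiates the relaxed objective, so this compares variance only, not bias.

```python
# Toy comparison of gradient-estimator variance on E_{z ~ Cat(softmax(theta))}[c·z].
# Reuses `diffusion_categorical_sample` from the Methodology sketch above; this is a
# sanity check of the mechanism, not a reproduction of the paper's experiments.
import torch

torch.manual_seed(0)
K, n_samples = 10, 500
theta = torch.zeros(K, requires_grad=True)   # categorical parameters
c = torch.randn(K)                           # fixed per-category costs

score_grads, pathwise_grads = [], []
for _ in range(n_samples):
    # Score-function (REINFORCE) estimator, no baseline: f(z) * grad_theta log p(z)
    z = torch.multinomial(torch.softmax(theta, dim=-1), 1).item()
    logp_z = torch.log_softmax(theta, dim=-1)[z]
    (g,) = torch.autograd.grad(c[z] * logp_z, theta)
    score_grads.append(g)

    # Pathwise estimator: differentiate the relaxed objective c · soft_sample(theta, eps)
    soft = diffusion_categorical_sample(theta)
    (g,) = torch.autograd.grad((c * soft).sum(), theta)
    pathwise_grads.append(g)

print("score-function grad variance:", torch.stack(score_grads).var(dim=0).mean().item())
print("pathwise       grad variance:", torch.stack(pathwise_grads).var(dim=0).mean().item())
```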
Practical Implications
- Deep generative models (VAEs, normalizing flows) that need discrete latent variables can now use a low‑variance, unbiased gradient estimator without sacrificing model fidelity.
- Reinforcement learning agents with discrete action spaces can benefit from smoother policy gradients, potentially speeding up convergence in environments where exploration is costly.
- Structured prediction tasks (e.g., parsing, sequence labeling) that traditionally rely on REINFORCE can replace it with a plug‑and‑play diffusion reparameterization, reducing engineering effort around variance reduction tricks.
- Because the method is training‑free, it can be dropped into existing PyTorch/TensorFlow pipelines with a few lines of code, making it attractive for rapid prototyping and production systems (see the usage sketch below).
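As a rough illustration of the drop‑in claim, the snippet below swaps a Gumbel‑Softmax call for the diffusion‑based relaxation inside a toy PyTorch encoder; the module, layer sizes, and the `diffusion_categorical_sample` helper (from the Methodology sketch) are hypothetical stand‑ins rather than the paper's released code.

```python
# Hypothetical drop-in use inside an existing PyTorch encoder with a categorical
# latent: swap the Gumbel-Softmax relaxation for the diffusion-based one.
import torch
import torch.nn as nn


class CategoricalEncoder(nn.Module):
    def __init__(self, in_dim=784, num_categories=10):
        super().__init__()
        self.to_logits = nn.Linear(in_dim, num_categories)

    def forward(self, x):
        logits = self.to_logits(x)
        # Before: z = torch.nn.functional.gumbel_softmax(logits, tau=1.0)  # temperature to tune
        z = diffusion_categorical_sample(logits)                           # no temperature schedule
        return z, logits


encoder = CategoricalEncoder()
z, logits = encoder(torch.randn(32, 784))   # z: (32, 10) soft categorical samples
```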
Limitations & Future Work
- The current formulation assumes independent categorical variables; extending the diffusion denoiser to capture dependencies (e.g., categorical Markov chains) remains an open challenge.
- While the denoiser is analytic for Gaussian noise, other noise families (e.g., Laplace) may be more appropriate for certain hardware constraints, requiring new derivations.
- The paper evaluates primarily on moderate‑scale benchmarks; scaling to large vocabularies (e.g., language models with tens of thousands of tokens) may expose computational bottlenecks that need optimized implementations.
- Future work could explore adaptive diffusion schedules that automatically balance bias‑variance trade‑offs or combine the method with learned denoisers for even richer posterior approximations.
Authors
- Samson Gourevitch
- Alain Durmus
- Eric Moulines
- Jimmy Olsson
- Yazid Janati
Paper Information
- arXiv ID: 2601.00781v1
- Categories: cs.LG, stat.ML
- Published: January 2, 2026