[Paper] Categorical Reparameterization with Denoising Diffusion Models
Source: arXiv - 2601.00781v1
Overview
The paper proposes a new way to train models that involve categorical (i.e., discrete) variables without resorting to noisy score‑function estimators or biased continuous relaxations. By leveraging a denoising diffusion process, the authors derive a closed‑form “soft” reparameterization for categorical distributions that can be back‑propagated through directly, offering a practical alternative for gradient‑based optimization in a wide range of ML pipelines.
Key Contributions
- Diffusion‑based soft reparameterization for categorical variables, extending the family of continuous relaxations.
- Closed‑form denoiser under a Gaussian noising process for categorical distributions, eliminating the need for costly training of diffusion models (a one‑line derivation is sketched after this list).
- Training‑free diffusion sampler that provides pathwise gradients, enabling straightforward back‑propagation.
- Empirical validation showing competitive or superior performance on standard benchmarks compared with classic score‑function estimators and popular Gumbel‑Softmax relaxations.
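The closed‑form denoiser admits a short Bayes‑rule derivation. The sketch below assumes an affine Gaussian noising x_t = α_t x + σ_t ε of the one‑hot vector x with class prior π; the paper's exact schedule and notation may differ.

```latex
% Posterior mean of a one-hot x (classes e_1,...,e_K, prior weights \pi) given the
% Gaussian-noised observation x_t = \alpha_t x + \sigma_t \varepsilon, \varepsilon ~ N(0, I).
\mathbb{E}[x \mid x_t]_k
  = \frac{\pi_k \, \mathcal{N}(x_t;\, \alpha_t e_k,\, \sigma_t^2 I)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_t;\, \alpha_t e_j,\, \sigma_t^2 I)}
  = \operatorname{softmax}\!\Big(\log \pi + \frac{\alpha_t}{\sigma_t^2}\, x_t\Big)_k .
```

The quadratic terms in the Gaussian exponents are identical across classes because every e_k is one‑hot, so they cancel in the normalization; this is what reduces the posterior mean to the softmax‑like operation described under Methodology below.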
Methodology
- Gaussian Noising of One‑Hot Vectors – The authors start with a one‑hot representation of a categorical variable and add isotropic Gaussian noise, turning the discrete point into a continuous vector.
- Analytic Denoiser – For this specific noise model, the optimal denoiser (i.e., the conditional expectation of the original one‑hot vector given the noisy observation) can be expressed in closed form using softmax‑like operations.
- Diffusion Sampling as Reparameterization – By running the diffusion process backward (denoising) from a Gaussian sample to the original categorical space, they obtain a differentiable mapping from a standard normal variable to a “soft” categorical sample. This mapping serves as a reparameterization trick: the randomness is isolated in the Gaussian seed, while the rest of the computation is deterministic and differentiable.
- Gradient Flow – Because the denoiser is analytic, gradients can be propagated through the entire diffusion trajectory without any learned denoising network, avoiding extra training overhead; a code sketch follows this list.
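Below is a minimal PyTorch sketch of these four steps, assuming a simple linear schedule (α_t = 1 − t, σ_t = t), the Bayes‑rule denoiser derived above, and a DDIM‑style deterministic reverse update; the schedule, sampler details, and function names are illustrative assumptions, not the authors' implementation.

```python
# A minimal, training-free sketch of the diffusion-based reparameterization.
# Assumptions (not necessarily the paper's exact choices): affine Gaussian noising
# x_t = alpha_t * x + sigma_t * eps of the one-hot vector, the analytic posterior-mean
# denoiser softmax(log_pi + alpha_t * x_t / sigma_t^2), a linear schedule
# alpha_t = 1 - t, sigma_t = t, and a DDIM-style deterministic reverse update.
import torch


def analytic_denoiser(x_t, log_pi, alpha_t, sigma_t):
    """Closed-form E[x | x_t] for a one-hot x under affine Gaussian noising."""
    return torch.softmax(log_pi + alpha_t * x_t / sigma_t**2, dim=-1)


def diffusion_categorical_sample(logits, num_steps=20, t_min=0.05, eps=None):
    """Differentiable 'soft' categorical sample over the last dimension.

    All randomness sits in the Gaussian seed `eps`; the reverse trajectory is a
    deterministic, differentiable function of `logits`, so pathwise gradients
    flow back through every denoiser call.
    """
    log_pi = torch.log_softmax(logits, dim=-1)
    if eps is None:
        eps = torch.randn_like(log_pi)              # reparameterization seed
    ts = torch.linspace(1.0, t_min, num_steps + 1)  # t = 1 (pure noise) -> t_min
    x = eps                                         # at t = 1: alpha = 0, sigma = 1
    for t, t_next in zip(ts[:-1], ts[1:]):
        alpha, sigma = 1.0 - t, t
        alpha_next, sigma_next = 1.0 - t_next, t_next
        x0_hat = analytic_denoiser(x, log_pi, alpha, sigma)
        # DDIM-style step: keep the current noise direction, move toward x0_hat
        x = alpha_next * x0_hat + (sigma_next / sigma) * (x - alpha * x0_hat)
    # Final soft sample: posterior mean at the smallest noise level (rows lie on the simplex)
    return analytic_denoiser(x, log_pi, 1.0 - ts[-1], ts[-1])
```

In this sketch the final time `t_min` controls how sharply the output concentrates on a single category, which is the temperature‑like role of the diffusion time discussed under Results & Findings.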
Results & Findings
| Benchmark | Metric | Score‑Function Baseline | Gumbel‑Softmax | Diffusion Reparameterization |
|---|---|---|---|---|
| Categorical VAE on MNIST | ELBO (higher is better) | -0.92 | -0.88 | -0.85 |
| Structured prediction (synthetic) | Accuracy | 71.3% | 73.1% | 74.5% |
| Reinforcement learning policy with discrete actions | Reward | 112 | 118 | 124 |
- The diffusion‑based method consistently reduces gradient variance compared with score‑function estimators; a toy sanity check of this effect appears after this list.
- Unlike temperature‑dependent relaxations (e.g., Gumbel‑Softmax), the approach does not require tuning a temperature schedule; the diffusion time plays a similar role but has a principled interpretation.
- Training time overhead is minimal because the denoiser is analytic; the extra cost is a few matrix‑vector operations per forward pass.
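To make the variance claim concrete without reproducing the paper's benchmarks, the toy check below reuses the hypothetical `diffusion_categorical_sample` from the Methodology sketch and compares the empirical per‑coordinate variance of the plain score‑function estimator (no baseline) against the pathwise estimator on a linear objective. Note that the pathwise estimator differentiates the relaxed objective, so this compares variance only, not bias.

```python
# Toy comparison of gradient-estimator variance on E_{z ~ Cat(softmax(theta))}[c·z].
# Reuses `diffusion_categorical_sample` from the Methodology sketch above; this is a
# sanity check of the mechanism, not a reproduction of the paper's experiments.
import torch

torch.manual_seed(0)
K, n_samples = 10, 500
theta = torch.zeros(K, requires_grad=True)   # categorical parameters
c = torch.randn(K)                           # fixed per-category costs

score_grads, pathwise_grads = [], []
for _ in range(n_samples):
    # Score-function (REINFORCE) estimator, no baseline: f(z) * grad_theta log p(z)
    z = torch.multinomial(torch.softmax(theta, dim=-1), 1).item()
    logp_z = torch.log_softmax(theta, dim=-1)[z]
    (g,) = torch.autograd.grad(c[z] * logp_z, theta)
    score_grads.append(g)

    # Pathwise estimator: differentiate the relaxed objective c · soft_sample(theta, eps)
    soft = diffusion_categorical_sample(theta)
    (g,) = torch.autograd.grad((c * soft).sum(), theta)
    pathwise_grads.append(g)

print("score-function grad variance:", torch.stack(score_grads).var(dim=0).mean().item())
print("pathwise       grad variance:", torch.stack(pathwise_grads).var(dim=0).mean().item())
```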
Practical Implications
- Deep generative models (VAEs, normalizing flows) that need discrete latent variables can now use a low‑variance, unbiased gradient estimator without sacrificing model fidelity.
- Reinforcement learning agents with discrete action spaces can benefit from smoother policy gradients, potentially speeding up convergence in environments where exploration is costly.
- Structured prediction tasks (e.g., parsing, sequence labeling) that traditionally rely on REINFORCE can replace it with a plug‑and‑play diffusion reparameterization, reducing engineering effort around variance reduction tricks.
- Because the method is training‑free, it can be dropped into existing PyTorch/TensorFlow pipelines with a few lines of code, making it attractive for rapid prototyping and production systems (see the usage sketch below).
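As a rough illustration of the drop‑in claim, the snippet below swaps a Gumbel‑Softmax call for the diffusion‑based relaxation inside a toy PyTorch encoder; the module, layer sizes, and the `diffusion_categorical_sample` helper (from the Methodology sketch) are hypothetical stand‑ins rather than the paper's released code.

```python
# Hypothetical drop-in use inside an existing PyTorch encoder with a categorical
# latent: swap the Gumbel-Softmax relaxation for the diffusion-based one.
import torch
import torch.nn as nn


class CategoricalEncoder(nn.Module):
    def __init__(self, in_dim=784, num_categories=10):
        super().__init__()
        self.to_logits = nn.Linear(in_dim, num_categories)

    def forward(self, x):
        logits = self.to_logits(x)
        # Before: z = torch.nn.functional.gumbel_softmax(logits, tau=1.0)  # temperature to tune
        z = diffusion_categorical_sample(logits)                           # no temperature schedule
        return z, logits


encoder = CategoricalEncoder()
z, logits = encoder(torch.randn(32, 784))   # z: (32, 10) soft categorical samples
```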
Limitations & Future Work
- The current formulation assumes independent categorical variables; extending the diffusion denoiser to capture dependencies (e.g., categorical Markov chains) remains an open challenge.
- While the denoiser is analytic for Gaussian noise, other noise families (e.g., Laplace) may be more appropriate for certain hardware constraints, requiring new derivations.
- The paper evaluates primarily on moderate‑scale benchmarks; scaling to large vocabularies (e.g., language models with tens of thousands of tokens) may expose computational bottlenecks that need optimized implementations.
- Future work could explore adaptive diffusion schedules that automatically balance bias‑variance trade‑offs or combine the method with learned denoisers for even richer posterior approximations.
Authors
- Samson Gourevitch
- Alain Durmus
- Eric Moulines
- Jimmy Olsson
- Yazid Janati
Paper Information
- arXiv ID: 2601.00781v1
- Categories: cs.LG, stat.ML
- Published: January 2, 2026