[Paper] Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation
Source: arXiv - 2602.15022v1
Overview
This paper revisits how diffusion‑based generative models handle symmetry—think permutations of atoms or rotations of molecules. Instead of building heavy equivariant architectures, the authors propose canonicalization: first put every data point into a standard “pose,” train a regular (non‑equivariant) diffusion model on these canonical forms, and then re‑apply a random symmetry at generation time. The result is a simpler, faster, and more expressive way to generate 3‑D molecular graphs.
Key Contributions
- Canonical diffusion framework: Formal theory showing that training on a canonical slice of the data manifold is correct (recovers the original invariant distribution) and more expressive than directly enforcing equivariance.
- Training efficiency gains: Demonstrates that canonicalization removes the mixture‑of‑symmetries term in diffusion scores, lowering variance and speeding up convergence for both diffusion and flow‑matching models.
- Unified view with aligned priors & optimal transport: Shows how these complementary techniques further accelerate learning when combined with canonicalization.
- Practical instantiation for molecules: Implements a geometric‑spectra‑based canonicalizer for the combined permutation × SE(3) symmetry of molecular graphs.
- State‑of‑the‑art results: The CanonFlow model beats existing equivariant baselines on the GEOM‑DRUG benchmark, even with fewer diffusion steps and comparable compute.
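The training‑efficiency claim (second bullet) can be made concrete with a small derivation. The following is a sketch under simplifying assumptions of my own — a finite group G acting by orthogonal maps — not the paper's exact formulation:

```latex
% Symmetrizing a base density p over a finite group G:
p_G(x) = \frac{1}{|G|} \sum_{g \in G} p(g^{-1} x)

% Its score is a data-dependent mixture of transported base scores,
% with weights w_g(x) = p(g^{-1}x) / \sum_{h \in G} p(h^{-1}x):
\nabla_x \log p_G(x) = \sum_{g \in G} w_g(x)\, g\, \nabla \log p(g^{-1} x)

% A model trained on canonicalized data only has to match the single
% term \nabla \log p on the canonical slice; the mixture over G, and
% the gradient variance it contributes, disappears.
```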
Methodology
- Identify the symmetry group – For molecules the relevant group is S_n (atom permutations) × SE(3) (3‑D rotations & translations).
- Canonicalization step – Each molecule is transformed into a canonical pose:
  - Compute a rotation that aligns the molecule’s geometric spectrum (eigenvalues of a distance‑based matrix).
  - Order atoms deterministically (e.g., by sorted eigenvector components) to fix the permutation.
  This yields a unique representative for every symmetry orbit.
- Train an unconstrained generative model – A standard diffusion (or flow‑matching) network is trained on the canonical data, without any equivariance constraints.
- Sampling – After generating a canonical sample, a random symmetry transform (random rotation + random permutation) is sampled and applied, producing a molecule that follows the original invariant distribution.
- Enhancements – The authors add aligned priors (matching the latent prior to the canonical distribution) and optimal‑transport‑based flow matching to further reduce training variance.
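The canonicalization step above can be sketched in code. This is a minimal illustration using principal‑axis alignment and a lexicographic atom sort — not the paper's exact geometric‑spectra canonicalizer — and `canonicalize` plus its tie‑breaking rules are my own assumptions; it presumes distinct eigenvalues and nonzero third moments, which hold for generic molecules:

```python
import numpy as np

def canonicalize(coords: np.ndarray) -> np.ndarray:
    """Map an (n, 3) array of atom positions to a canonical pose.

    Sketch only: translation is removed by centering, rotation by
    principal-axis alignment, and permutation by a lexicographic sort.
    """
    # Remove translation: center the molecule at its mean position.
    x = coords - coords.mean(axis=0)

    # Remove rotation: align principal axes (eigenvectors of the 3x3
    # covariance) with the coordinate axes, largest variance first.
    _, eigvecs = np.linalg.eigh(x.T @ x)
    x = x @ eigvecs[:, ::-1]

    # Eigenvectors are defined only up to sign; fix each axis so its
    # third moment (sum of cubes) is positive, making the pose unique.
    signs = np.sign((x ** 3).sum(axis=0))
    signs[signs == 0] = 1.0
    x = x * signs

    # Remove permutation: order atoms lexicographically by position.
    return x[np.lexsort((x[:, 2], x[:, 1], x[:, 0]))]
```

Applying this to any rotated, translated, or atom‑relabelled copy of a molecule returns the same array — exactly the "unique representative per orbit" property the training set needs.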
Results & Findings
| Metric (GEOM‑DRUG) | CanonFlow (full steps) | CanonFlow (few steps) | Prior equivariant baselines |
|---|---|---|---|
| Validity (%) | 99.2 | 98.5 | 96–97 |
| Uniqueness (%) | 94.1 | 92.8 | 88–90 |
| Diversity (KL) | 1.12 | 1.08 | 0.95–1.00 |
| Training time (GPU‑hrs) | ≈0.8× of equivariant model | — | baseline |
- Expressivity: Canonical models can represent any invariant distribution that equivariant models can, and often capture finer details because they are not limited by architectural symmetry constraints.
- Speed: Removing the group‑mixture term in the diffusion score reduces gradient variance, leading to ~20 % fewer training epochs for comparable performance.
- Few‑step generation: Even with as few as 10 diffusion steps (vs. 100+ typical), CanonFlow retains high validity and diversity, making it attractive for real‑time applications.
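The sampling loop ends with the symmetry re‑application described under Methodology. Here is a hedged sketch of that final step — function names are mine, and translations are omitted by keeping molecules centered:

```python
import numpy as np

def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a uniformly distributed rotation via QR decomposition of a
    Gaussian matrix, sign-corrected so that det(R) = +1."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # make the QR factorization unique
    if np.linalg.det(q) < 0:      # ensure a proper rotation, not a reflection
        q[:, 0] = -q[:, 0]
    return q

def randomize_symmetry(canonical: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Re-apply a random atom permutation and rotation to a canonical
    sample, so outputs follow the original invariant distribution."""
    perm = rng.permutation(canonical.shape[0])
    return canonical[perm] @ random_rotation(rng).T
```

Because the transform is drawn uniformly from the symmetry group, the resulting pose and atom labelling are random, but the molecule itself (its pairwise geometry) is unchanged.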
Practical Implications
- Simpler model pipelines: Developers can reuse off‑the‑shelf diffusion libraries (e.g., Hugging Face Diffusers) without writing custom equivariant layers, cutting engineering overhead.
- Faster prototyping: Reduced training variance means quicker hyper‑parameter sweeps and lower GPU costs—critical for startups or labs with limited compute budgets.
- Better integration with downstream tools: Since samples can be kept in canonical form (by skipping the final random symmetry transform), downstream tasks (e.g., docking, property prediction) can cache or batch‑process them more efficiently.
- Few‑step sampling opens real‑time design: Drug‑discovery pipelines that need rapid candidate generation (e.g., active‑learning loops) can now afford to sample on‑the‑fly without sacrificing quality.
- Extensible to other domains: Any generative problem with known symmetry groups (point clouds, protein structures, physics simulations) can adopt the same canonical‑first approach, potentially replacing heavyweight equivariant networks.
Limitations & Future Work
- Canonicalizer design: The current spectral alignment works well for small‑to‑medium molecules but may struggle with very large or highly flexible structures where eigen‑spectra are ambiguous.
- Group coverage: The framework assumes the symmetry group is known and tractable; extending to continuous groups beyond SE(3) (e.g., scaling, shear) requires new canonicalization tricks.
- Sampling bias: Randomly re‑applying symmetry transforms at generation time is unbiased in theory, but in practice finite‑sample effects could introduce subtle distribution shifts—an area for tighter statistical analysis.
- Broader benchmarks: While GEOM‑DRUG is a strong testbed, evaluating on other chemistry datasets (e.g., QM9, MOSES) and on non‑chemical symmetric data would solidify the claim of universal applicability.
Bottom line: By flipping the conventional wisdom—canonicalize first, then generate—the authors deliver a more accessible, efficient, and powerful recipe for symmetry‑aware generative modeling, with immediate benefits for molecular AI and beyond.
Authors
- Cai Zhou
- Zijie Chen
- Zian Li
- Jike Wang
- Kaiyi Jiang
- Pan Li
- Rose Yu
- Muhan Zhang
- Stephen Bates
- Tommi Jaakkola
Paper Information
- arXiv ID: 2602.15022v1
- Categories: cs.LG, cs.AI, math.GR, q-bio.BM
- Published: February 16, 2026