[Paper] Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

Published: February 16, 2026
5 min read

Source: arXiv - 2602.15022v1

Overview

This paper revisits how diffusion‑based generative models handle symmetry—think permutations of atoms or rotations of molecules. Instead of building heavy equivariant architectures, the authors propose canonicalization: first put every data point into a standard “pose,” train a regular (non‑equivariant) diffusion model on these canonical forms, and then re‑apply a random symmetry at generation time. The result is a simpler, faster, and more expressive way to generate 3‑D molecular graphs.
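To make the recipe concrete, here is a minimal NumPy sketch of the canonicalize‑then‑rerandomize idea on a toy point cloud. The PCA‑frame canonicalizer below is an illustrative stand‑in for the paper's geometric‑spectra construction, and it glosses over edge cases such as degenerate spectra and chirality:

```python
import numpy as np

rng = np.random.default_rng(0)

def canonicalize(pos):
    """Toy canonicalizer for an (n, 3) point cloud: quotient out
    translations (centering), rotations (PCA frame with a sign
    convention), and permutations (lexicographic atom order)."""
    pos = pos - pos.mean(axis=0)                  # remove translation
    _, vecs = np.linalg.eigh(np.cov(pos.T))       # principal axes
    pos = pos @ vecs[:, ::-1]                     # largest variance first
    signs = np.where((pos ** 3).sum(axis=0) < 0, -1.0, 1.0)
    pos = pos * signs                             # fix eigenvector signs
    return pos[np.lexsort(pos.T)]                 # fix the atom order

def random_symmetry(pos):
    """Apply a random rotation and a random atom permutation."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:                      # keep a proper rotation
        q[:, 0] *= -1
    return (pos @ q)[rng.permutation(len(pos))]

# All symmetric copies of a molecule collapse to one canonical pose,
# so an unconstrained model only ever sees one representative per orbit.
x = rng.normal(size=(5, 3))
assert np.allclose(canonicalize(x), canonicalize(random_symmetry(x)), atol=1e-6)
```

At generation time the same `random_symmetry` step is applied to each canonical sample, which restores the invariance of the output distribution.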

Key Contributions

  • Canonical diffusion framework: Formal theory showing that training on a canonical slice of the data manifold is correct (it recovers the original invariant distribution; see the identity sketched after this list) and more expressive than directly enforcing equivariance.
  • Training efficiency gains: Demonstrates that canonicalization removes the mixture‑of‑symmetries term in diffusion scores, lowering variance and speeding up convergence for both diffusion and flow‑matching models.
  • Unified view with aligned priors & optimal transport: Shows how these complementary techniques further accelerate learning when combined with canonicalization.
  • Practical instantiation for molecules: Implements a geometric‑spectra‑based canonicalizer for the combined permutation × SE(3) symmetry of molecular graphs.
  • State‑of‑the‑art results: The CanonFlow model beats existing equivariant baselines on the GEOM‑DRUG benchmark, even with fewer diffusion steps and comparable compute.
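In symbols (our shorthand, not necessarily the paper's notation): write G for the symmetry group, c for the canonicalization map, and p for the G‑invariant data distribution. The correctness claim is that training on the canonical pushforward and re‑randomizing over the group recovers p:

```latex
% Train on the canonical pushforward, then re-randomize over G:
q = c_{\#}\, p, \qquad
x = g \cdot x_0, \quad x_0 \sim q, \; g \sim \mathrm{Haar}(G)
\;\Longrightarrow\; x \sim p.
```

Intuitively, q concentrates all of p's mass on one representative per orbit, and the Haar‑random g spreads it back along each orbit exactly as invariance requires.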

Methodology

  1. Identify the symmetry group – For molecules the relevant group is S_n (atom permutations) × SE(3) (3‑D rotations and translations).
  2. Canonicalization step – Each molecule is transformed into a canonical pose:
    • Compute a rotation that aligns the molecule’s geometric spectrum (eigenvalues of a distance‑based matrix).
    • Order atoms deterministically (e.g., by sorted eigenvector components) to fix the permutation.
      This yields a unique representative for every symmetry orbit.
  3. Train an unconstrained generative model – A standard diffusion (or flow‑matching) network is trained on the canonical data, without any equivariance constraints.
  4. Sampling – After generating a canonical sample, a random symmetry transform (random rotation + random permutation) is sampled and applied, producing a molecule that follows the original invariant distribution.
  5. Enhancements – The authors add aligned priors (matching the latent prior to the canonical distribution) and optimal‑transport‑based flow matching to further reduce training variance.
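For step 5, here is a minimal sketch of the minibatch optimal‑transport pairing used in OT flow matching, on flattened canonical coordinates. This is the generic recipe, not necessarily the authors' exact variant, and the batch shapes are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_flow_matching_targets(x0, x1, rng):
    """Pair prior samples x0 with canonical data x1 by minimum total
    squared distance, then form straight-line flow-matching targets.
    Pairing endpoints this way straightens paths and cuts variance."""
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)      # minibatch OT coupling
    x0, x1 = x0[rows], x1[cols]
    t = rng.random((len(x0), 1))
    x_t = (1 - t) * x0 + t * x1                   # point on the straight path
    v_target = x1 - x0                            # velocity regression target
    return x_t, t, v_target

# Usage: x0 ~ (aligned) prior, x1 = a batch of canonicalized molecules.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 15))                     # e.g. 5 atoms x 3 coords
x1 = rng.normal(size=(8, 15))
x_t, t, v = ot_flow_matching_targets(x0, x1, rng)
```

An aligned prior plays the same role from the other side: the closer the prior already sits to the canonical data, the shorter and straighter the transport paths the network must learn.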

Results & Findings

| Metric (GEOM‑DRUG) | CanonFlow (full steps) | CanonFlow (few steps) | Prior equivariant baselines |
| --- | --- | --- | --- |
| Validity (%) | 99.2 | 98.5 | 96–97 |
| Uniqueness (%) | 94.1 | 92.8 | 88–90 |
| Diversity (KL) | 1.12 | 1.08 | 0.95–1.00 |
| Training time (GPU‑hrs) | ≈0.8× of equivariant model | — | baseline |
  • Expressivity: Canonical models can represent any invariant distribution that equivariant models can, and often capture finer details because they are not limited by architectural symmetry constraints.
  • Speed: Removing the group‑mixture term in the diffusion score reduces gradient variance, leading to ~20 % fewer training epochs for comparable performance.
  • Few‑step generation: Even with as few as 10 diffusion steps (vs. the typical 100+), CanonFlow retains high validity and diversity, making it attractive for real‑time applications; a minimal sampler sketch follows this list.
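Few‑step generation is easiest to see as coarse ODE integration. A minimal PyTorch sketch, where `v_net` is a stand‑in for any trained velocity network (the name and shapes are assumptions):

```python
import torch

@torch.no_grad()
def sample_few_steps(v_net, n, dim, steps=10):
    """Integrate dx/dt = v_net(x, t) from the prior at t=0 to data at
    t=1 with a coarse Euler scheme; the straighter the learned paths,
    the fewer steps this needs. Remember to re-apply a random
    rotation + permutation to the returned canonical samples."""
    x = torch.randn(n, dim)                       # Gaussian (aligned) prior
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * v_net(x, t0.expand(n, 1))
    return x

# Toy check with a hand-written velocity field:
samples = sample_few_steps(lambda x, t: -x, n=4, dim=15, steps=10)
```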

Practical Implications

  • Simpler model pipelines: Developers can reuse off‑the‑shelf diffusion libraries (e.g., Hugging Face's PyTorch‑based Diffusers) without writing custom equivariant layers, cutting engineering overhead.
  • Faster prototyping: Reduced training variance means quicker hyper‑parameter sweeps and lower GPU costs—critical for startups or labs with limited compute budgets.
  • Better integration with downstream tools: Since the generated molecules are already in a canonical form, downstream tasks (e.g., docking, property prediction) can cache or batch‑process them more efficiently (see the keying sketch after this list).
  • Few‑step sampling opens real‑time design: Drug‑discovery pipelines that need rapid candidate generation (e.g., active‑learning loops) can now afford to sample on‑the‑fly without sacrificing quality.
  • Extensible to other domains: Any generative problem with known symmetry groups (point clouds, protein structures, physics simulations) can adopt the same canonical‑first approach, potentially replacing heavyweight equivariant networks.
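As a toy illustration of the caching point, a canonical pose gives a stable dictionary key. This reuses the `canonicalize` helper from the Overview sketch, and the rounding tolerance is an arbitrary assumption:

```python
import numpy as np

def canonical_key(pos, decimals=4):
    """Hashable key for a molecule: canonicalize, round to absorb
    floating-point jitter, and serialize. All symmetric copies of the
    same molecule map to one key, so downstream results can be cached."""
    return np.round(canonicalize(pos), decimals).tobytes()

cache = {}
key = canonical_key(x)          # x from the Overview sketch
cache.setdefault(key, "docking-result-placeholder")
```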

Limitations & Future Work

  • Canonicalizer design: The current spectral alignment works well for small‑to‑medium molecules but may struggle with very large or highly flexible structures where eigen‑spectra are ambiguous.
  • Group coverage: The framework assumes the symmetry group is known and tractable; extending to continuous groups beyond SE(3) (e.g., scaling, shear) requires new canonicalization tricks.
  • Sampling bias: Randomly re‑applying symmetry transforms at generation time is unbiased in theory, but in practice finite‑sample effects could introduce subtle distribution shifts—an area for tighter statistical analysis.
  • Broader benchmarks: While GEOM‑DRUG is a strong testbed, evaluating on other chemistry datasets (e.g., QM9, MOSES) and on non‑chemical symmetric data would solidify the claim of universal applicability.

Bottom line: By flipping the conventional wisdom—canonicalize first, then generate—the authors deliver a more accessible, efficient, and powerful recipe for symmetry‑aware generative modeling, with immediate benefits for molecular AI and beyond.

Authors

  • Cai Zhou
  • Zijie Chen
  • Zian Li
  • Jike Wang
  • Kaiyi Jiang
  • Pan Li
  • Rose Yu
  • Muhan Zhang
  • Stephen Bates
  • Tommi Jaakkola

Paper Information

  • arXiv ID: 2602.15022v1
  • Categories: cs.LG, cs.AI, math.GR, q-bio.BM
  • Published: February 16, 2026