[Paper] Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation
Source: arXiv - 2602.15022v1
Overview
This paper revisits how diffusion‑based generative models handle symmetry—think permutations of atoms or rotations of molecules. Instead of building heavy equivariant architectures, the authors propose canonicalization: first put every data point into a standard “pose,” train a regular (non‑equivariant) diffusion model on these canonical forms, and then re‑apply a random symmetry at generation time. The result is a simpler, faster, and more expressive way to generate 3‑D molecular graphs.
Key Contributions
- Canonical diffusion framework: Formal theory showing that training on a canonical slice of the data manifold is correct (recovers the original invariant distribution) and more expressive than directly enforcing equivariance.
- Training efficiency gains: Demonstrates that canonicalization removes the mixture‑of‑symmetries term in diffusion scores, lowering variance and speeding up convergence for both diffusion and flow‑matching models.
- Unified view with aligned priors & optimal transport: Shows how these complementary techniques further accelerate learning when combined with canonicalization.
- Practical instantiation for molecules: Implements a geometric‑spectra‑based canonicalizer for the combined permutation × SE(3) symmetry of molecular graphs.
- State‑of‑the‑art results: The CanonFlow model beats existing equivariant baselines on the GEOM‑DRUG benchmark, even with fewer diffusion steps and comparable compute.
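The training‑efficiency claim (second bullet) can be made concrete with a small derivation. The following is a sketch under simplifying assumptions of my own — a finite group G acting by orthogonal maps — not the paper's exact formulation:

```latex
% Symmetrizing a base density p over a finite group G:
p_G(x) = \frac{1}{|G|} \sum_{g \in G} p(g^{-1} x)

% Its score is a data-dependent mixture of transported base scores,
% with weights w_g(x) = p(g^{-1}x) / \sum_{h \in G} p(h^{-1}x):
\nabla_x \log p_G(x) = \sum_{g \in G} w_g(x)\, g\, \nabla \log p(g^{-1} x)

% A model trained on canonicalized data only has to match the single
% term \nabla \log p on the canonical slice; the mixture over G, and
% the gradient variance it contributes, disappears.
```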
Methodology
- Identify the symmetry group – For molecules the relevant group is S_n (atom permutations) × SE(3) (3‑D rotations & translations).
- Canonicalization step – Each molecule is transformed into a canonical pose:
  - Compute a rotation that aligns the molecule’s geometric spectrum (eigenvalues of a distance‑based matrix).
  - Order atoms deterministically (e.g., by sorted eigenvector components) to fix the permutation.
  This yields a unique representative for every symmetry orbit.
- Train an unconstrained generative model – A standard diffusion (or flow‑matching) network is trained on the canonical data, without any equivariance constraints.
- Sampling – After generating a canonical sample, a random symmetry transform (random rotation + random permutation) is sampled and applied, producing a molecule that follows the original invariant distribution.
- Enhancements – The authors add aligned priors (matching the latent prior to the canonical distribution) and optimal‑transport‑based flow matching to further reduce training variance.
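The canonicalization step above can be sketched in code. This is a minimal illustration using principal‑axis alignment and a lexicographic atom sort — not the paper's exact geometric‑spectra canonicalizer — and `canonicalize` plus its tie‑breaking rules are my own assumptions; it presumes distinct eigenvalues and nonzero third moments, which hold for generic molecules:

```python
import numpy as np

def canonicalize(coords: np.ndarray) -> np.ndarray:
    """Map an (n, 3) array of atom positions to a canonical pose.

    Sketch only: translation is removed by centering, rotation by
    principal-axis alignment, and permutation by a lexicographic sort.
    """
    # Remove translation: center the molecule at its mean position.
    x = coords - coords.mean(axis=0)

    # Remove rotation: align principal axes (eigenvectors of the 3x3
    # covariance) with the coordinate axes, largest variance first.
    _, eigvecs = np.linalg.eigh(x.T @ x)
    x = x @ eigvecs[:, ::-1]

    # Eigenvectors are defined only up to sign; fix each axis so its
    # third moment (sum of cubes) is positive, making the pose unique.
    signs = np.sign((x ** 3).sum(axis=0))
    signs[signs == 0] = 1.0
    x = x * signs

    # Remove permutation: order atoms lexicographically by position.
    return x[np.lexsort((x[:, 2], x[:, 1], x[:, 0]))]
```

Applying this to any rotated, translated, or atom‑relabelled copy of a molecule returns the same array — exactly the "unique representative per orbit" property the training set needs.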
Results & Findings
| Metric (GEOM‑DRUG) | CanonFlow (full steps) | CanonFlow (few steps) | Prior equivariant baselines |
|---|---|---|---|
| Validity (%) | 99.2 | 98.5 | 96–97 |
| Uniqueness (%) | 94.1 | 92.8 | 88–90 |
| Diversity (KL) | 1.12 | 1.08 | 0.95–1.00 |
| Training time (GPU‑hrs) | ≈0.8× of equivariant model | — | baseline |
- Expressivity: Canonical models can represent any invariant distribution that equivariant models can, and often capture finer details because they are not limited by architectural symmetry constraints.
- Speed: Removing the group‑mixture term in the diffusion score reduces gradient variance, leading to ~20 % fewer training epochs for comparable performance.
- Few‑step generation: Even with as few as 10 diffusion steps (vs. 100+ typical), CanonFlow retains high validity and diversity, making it attractive for real‑time applications.
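The sampling loop ends with the symmetry re‑application described under Methodology. Here is a hedged sketch of that final step — function names are mine, and translations are omitted by keeping molecules centered:

```python
import numpy as np

def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a uniformly distributed rotation via QR decomposition of a
    Gaussian matrix, sign-corrected so that det(R) = +1."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # make the QR factorization unique
    if np.linalg.det(q) < 0:      # ensure a proper rotation, not a reflection
        q[:, 0] = -q[:, 0]
    return q

def randomize_symmetry(canonical: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Re-apply a random atom permutation and rotation to a canonical
    sample, so outputs follow the original invariant distribution."""
    perm = rng.permutation(canonical.shape[0])
    return canonical[perm] @ random_rotation(rng).T
```

Because the transform is drawn uniformly from the symmetry group, the resulting pose and atom labelling are random, but the molecule itself (its pairwise geometry) is unchanged.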
Practical Implications
- Simpler model pipelines: Developers can reuse off‑the‑shelf diffusion libraries (e.g., Hugging Face Diffusers) without writing custom equivariant layers, cutting engineering overhead.
- Faster prototyping: Reduced training variance means quicker hyper‑parameter sweeps and lower GPU costs—critical for startups or labs with limited compute budgets.
- Better integration with downstream tools: Since samples can be kept in canonical form (by skipping the final random symmetry transform), downstream tasks (e.g., docking, property prediction) can cache or batch‑process them more efficiently.
- Few‑step sampling opens real‑time design: Drug‑discovery pipelines that need rapid candidate generation (e.g., active‑learning loops) can now afford to sample on‑the‑fly without sacrificing quality.
- Extensible to other domains: Any generative problem with known symmetry groups (point clouds, protein structures, physics simulations) can adopt the same canonical‑first approach, potentially replacing heavyweight equivariant networks.
Limitations & Future Work
- Canonicalizer design: The current spectral alignment works well for small‑to‑medium molecules but may struggle with very large or highly flexible structures where eigen‑spectra are ambiguous.
- Group coverage: The framework assumes the symmetry group is known and tractable; extending to continuous groups beyond SE(3) (e.g., scaling, shear) requires new canonicalization tricks.
- Sampling bias: Randomly re‑applying symmetry transforms at generation time is unbiased in theory, but in practice finite‑sample effects could introduce subtle distribution shifts—an area for tighter statistical analysis.
- Broader benchmarks: While GEOM‑DRUG is a strong testbed, evaluating on other chemistry datasets (e.g., QM9, MOSES) and on non‑chemical symmetric data would solidify the claim of universal applicability.
Bottom line: By flipping the conventional wisdom—canonicalize first, then generate—the authors deliver a more accessible, efficient, and powerful recipe for symmetry‑aware generative modeling, with immediate benefits for molecular AI and beyond.
Authors
- Cai Zhou
- Zijie Chen
- Zian Li
- Jike Wang
- Kaiyi Jiang
- Pan Li
- Rose Yu
- Muhan Zhang
- Stephen Bates
- Tommi Jaakkola
Paper Information
- arXiv ID: 2602.15022v1
- Categories: cs.LG, cs.AI, math.GR, q-bio.BM
- Published: February 16, 2026