[Paper] Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders

Published: February 10, 2026 at 01:58 PM EST
5 min read
Source: arXiv - 2602.10099v1

Overview

The paper “Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders” shows why conventional diffusion‑transformer models stumble when they try to generate data directly from high‑level representation encoders (e.g., CLIP, DINO). The authors reveal that the problem is not a lack of model capacity but a geometric mismatch between the Euclidean diffusion dynamics and the hyperspherical manifold on which encoder features live. By redesigning the diffusion process to respect the underlying Riemannian geometry, they enable a standard 131 M‑parameter Diffusion Transformer (DiT‑B) to train successfully and reach state‑of‑the‑art image synthesis quality.

Key Contributions

  • Identify Geometric Interference: Demonstrate that Euclidean flow‑matching forces probability mass through the low‑density interior of the encoder’s hyperspherical feature space, causing training collapse.
  • Riemannian Flow Matching with Jacobi Regularization (RJF): Introduce a diffusion formulation that follows geodesics on the feature manifold and compensates for curvature‑induced error propagation.
  • No Width Scaling Required: Show that the standard DiT‑B architecture (131 M parameters) converges without the costly width‑increase tricks previously deemed necessary.
  • Empirical Validation: Achieve an FID of 3.37 on ImageNet‑256, a regime where prior diffusion‑transformer approaches diverge.
  • Open‑Source Release: Provide a clean PyTorch implementation (https://github.com/amandpkr/RJF) for reproducibility and downstream research.
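
The first bullet can be made concrete with a toy NumPy sketch (ours, not the paper's): straight-line interpolation between two points on the unit hypersphere immediately leaves the sphere, and in high dimensions the midpoint sits deep in the low-density interior where no encoder features live.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # illustrative dimension, similar to common CLIP/DINO embeddings

# Two random points on the unit hypersphere.
x0 = rng.standard_normal(d)
x0 /= np.linalg.norm(x0)
x1 = rng.standard_normal(d)
x1 /= np.linalg.norm(x1)

# The Euclidean (straight-line) midpoint falls inside the sphere:
# random high-dimensional unit vectors are nearly orthogonal, so the
# midpoint's norm is close to 1/sqrt(2), far from the unit shell.
mid = 0.5 * (x0 + x1)
print(round(float(np.linalg.norm(mid)), 2))  # ≈ 0.71
```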

Methodology

  1. Problem Setup – The authors start from a representation encoder that maps images to points on a high‑dimensional hypersphere (e.g., normalized CLIP embeddings). Traditional diffusion models define a Euclidean stochastic differential equation (SDE) that interpolates between noise and data in the ambient space.

  2. Geometric Analysis – By visualizing the density of encoder features, they observe that most mass lives on the surface of the sphere, while the Euclidean diffusion trajectory spends a large fraction of its time inside the sphere where no real data exist. This “geometric interference” leads to poor gradient signals and training failure.

  3. Riemannian Flow Matching – Instead of a Euclidean SDE, they formulate a Riemannian flow on the manifold:

    • The diffusion path follows geodesics (shortest paths on the sphere).
    • The velocity field is defined via Riemannian optimal transport, ensuring that probability mass stays on the manifold throughout training.

  4. Jacobi Regularization – Curvature can amplify small errors when integrating along geodesics. The authors borrow the Jacobi equation from differential geometry to regularize the learned vector field, stabilizing the flow against curvature‑induced drift.

  5. Training Pipeline – The RJF loss replaces the standard flow‑matching loss in a vanilla Diffusion Transformer (DiT‑B). No architectural changes, extra layers, or width scaling are required.
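
Steps 3–5 can be sketched in PyTorch. This is an illustrative reconstruction of spherical flow matching, not the authors' code: `slerp` follows the great-circle geodesic, `geodesic_velocity` is its exact time derivative (the regression target), and `model` stands for any DiT-style network predicting a velocity from a noisy point and a timestep. The Jacobi regularization term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def slerp(x0, x1, t, eps=1e-7):
    """Geodesic (great-circle) interpolation between unit vectors x0, x1."""
    cos = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    a = torch.sin((1 - t) * theta) / torch.sin(theta)
    b = torch.sin(t * theta) / torch.sin(theta)
    return a * x0 + b * x1

def geodesic_velocity(x0, x1, t, eps=1e-7):
    """Time derivative of slerp: the target vector field along the geodesic."""
    cos = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    da = -theta * torch.cos((1 - t) * theta) / torch.sin(theta)
    db = theta * torch.cos(t * theta) / torch.sin(theta)
    return da * x0 + db * x1

def riemannian_fm_loss(model, x1):
    """Schematic flow-matching loss on the hypersphere.

    x1: a batch of unit-norm encoder features, shape (batch, dim).
    model: any network mapping (x_t, t) -> predicted velocity.
    """
    x0 = F.normalize(torch.randn_like(x1), dim=-1)  # noise on the sphere
    t = torch.rand(x1.shape[0], 1, device=x1.device)
    xt = slerp(x0, x1, t)                 # stays on the manifold
    v_target = geodesic_velocity(x0, x1, t)
    return F.mse_loss(model(xt, t), v_target)
```

Because `slerp` keeps every intermediate point on the unit sphere, the regression targets are always tangent vectors on the manifold, which is the geometric property the Euclidean formulation lacks.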

Results & Findings

| Model (Params) | Training Setup | FID (ImageNet‑256) | Remarks |
| --- | --- | --- | --- |
| DiT‑B (131 M) + Euclidean Flow | Standard | Did not converge | Collapse due to geometric interference |
| DiT‑B (131 M) + RJF (proposed) | Same hyper‑parameters | 3.37 | Matches or exceeds prior width‑scaled baselines |
| DiT‑L (~300 M) + Euclidean Flow (baseline) | Wider model | ~3.5 | Requires >2× parameters for comparable quality |

Key Takeaways

  • Geometric alignment is the primary bottleneck, not raw capacity.
  • RJF restores stable training with the same model size, cutting compute and memory roughly in half compared with width‑scaled alternatives.
  • Qualitative samples show sharper textures and fewer artifacts, especially in regions where the encoder’s manifold curvature is high.

Practical Implications

  • Cost‑Effective High‑Fidelity Generation: Developers can now deploy diffusion‑transformer pipelines on commodity GPUs without inflating model size, making large‑scale image synthesis more affordable.
  • Plug‑and‑Play with Existing Encoders: RJF works with any normalized representation encoder (CLIP, DINO, SimCLR), opening the door to conditional generation from semantic embeddings without retraining the encoder.
  • Better Integration in Multi‑Modal Systems: Since the generative process respects the encoder’s geometry, downstream tasks (e.g., text‑to‑image, style transfer) that rely on the same embeddings become more coherent.
  • Reduced Training Instability: Teams can avoid the trial‑and‑error of scaling widths or adding ad‑hoc tricks; the RJF loss is a drop‑in replacement for the standard diffusion loss.
  • Potential for Other Manifolds: The same Riemannian flow‑matching idea could be adapted to graph embeddings, hyperbolic spaces, or any latent space with known geometry, broadening its impact beyond vision.
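
The "plug‑and‑play" point reduces to a one-line preprocessing step. A minimal sketch (ours; random features stand in for a real encoder's output): project any encoder's features onto the unit hypersphere before using them as generation targets.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def to_sphere(features: torch.Tensor) -> torch.Tensor:
    """Project raw encoder outputs onto the unit hypersphere,
    the manifold that RJF-style training assumes."""
    return F.normalize(features.float(), dim=-1)

# Stand-in for encoder(images) from CLIP, DINO, SimCLR, etc.
feats = torch.randn(16, 768)
targets = to_sphere(feats)
print(bool(torch.allclose(targets.norm(dim=-1), torch.ones(16))))  # True
```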

Limitations & Future Work

  • Manifold Assumption: RJF presumes that encoder outputs lie on a well‑behaved hypersphere. Encoders that produce non‑normalized or highly anisotropic embeddings may need additional preprocessing.
  • Computational Overhead of Jacobi Regularization: While the model size stays the same, computing curvature‑aware regularization adds a modest constant factor to training time.
  • Scalability to Ultra‑High Resolutions: Experiments are limited to 256×256 images; extending to 1024×1024 or video generation may expose new geometric challenges.
  • Broader Manifold Types: Future work could explore adaptive manifold learning (learning the metric jointly with the diffusion process) or apply RJF to non‑spherical manifolds such as product manifolds for multimodal embeddings.

If you’re interested in trying RJF yourself, the authors provide a ready‑to‑run implementation and pretrained checkpoints on GitHub. The approach offers a clean, geometry‑aware upgrade to existing diffusion‑transformer pipelines, turning a theoretical insight into a practical performance boost.

Authors

  • Amandeep Kumar
  • Vishal M. Patel

Paper Information

  • arXiv ID: 2602.10099v1
  • Categories: cs.LG, cs.CV
  • Published: February 10, 2026
