[Paper] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Published: February 26, 2026 at 01:07 PM EST
5 min read

Source: arXiv - 2602.23295v1

Overview

The paper introduces ManifoldGD, a training‑free technique that uses diffusion models to create ultra‑compact synthetic datasets while preserving the knowledge of the original large‑scale collection. By guiding the diffusion process with a geometry‑aware “manifold” built from hierarchical clustering of latent features, the method produces diverse, high‑fidelity images that can replace the full dataset for downstream training.

Key Contributions

  • Training‑free distillation: Leverages a pre‑trained diffusion model and a VAE encoder, eliminating the need to fine‑tune any generative network.
  • Hierarchical IPC (Instance Prototype Centroid) construction: Builds a multi‑scale coreset of centroids via divisive clustering of VAE latent vectors, capturing both coarse class modes and fine intra‑class variations.
  • Manifold‑consistent guidance: At each diffusion denoising step, the direction toward the nearest IPC is projected onto the local tangent space of the latent manifold, keeping the generation trajectory on‑manifold.
  • Unified framework: Works with any off‑the‑shelf diffusion model (e.g., Stable Diffusion, Denoising Diffusion Probabilistic Models) without additional training.
  • State‑of‑the‑art results: Improves Fréchet Inception Distance (FID), embedding L2 distance, and downstream classification accuracy over both training‑free and training‑based baselines.

Methodology

  1. Feature extraction – A pretrained VAE encodes every image in the original dataset into a latent vector.
  2. Hierarchical clustering – The latent vectors are recursively split (divisive clustering) to produce a tree of clusters. The centroid of each leaf cluster becomes an Instance Prototype Centroid (IPC). The hierarchy yields a multi‑scale set of IPCs: high‑level nodes capture broad semantic modes (e.g., “dog vs. cat”), while deeper nodes capture subtle variations (e.g., different breeds).
  3. Manifold construction – For a given diffusion timestep t, the algorithm selects a local neighborhood of IPCs around the current latent estimate. Using these points, it estimates a low‑dimensional tangent space (e.g., via PCA on the neighborhood).
  4. Guided denoising – The standard diffusion denoising step produces a “score” (gradient) that points toward higher probability regions. ManifoldGD adds a mode‑alignment vector that points from the current latent toward the nearest IPC. This vector is then projected onto the tangent space, ensuring the update stays on the learned manifold.
  5. Iterate – Steps 3‑4 repeat for every denoising timestep until a clean image is obtained. The final images constitute the distilled synthetic dataset.
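The paper itself does not ship code, but steps 3–4 above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the authors' implementation: the function names (`tangent_basis`, `guided_step`) and the hyperparameters (`k`, `tangent_dim`, `scale`) are assumptions made for exposition.

```python
import numpy as np

def tangent_basis(neighbors, dim):
    """Estimate a local tangent basis via PCA on a neighborhood of IPCs."""
    centered = neighbors - neighbors.mean(axis=0)
    # Right singular vectors are the principal directions of the neighborhood.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim]  # shape (dim, latent_dim)

def guided_step(z, score, ipcs, k=8, tangent_dim=4, scale=0.1):
    """One denoising update with manifold-consistent mode alignment.

    z:     current latent estimate, shape (latent_dim,)
    score: denoiser's score/gradient at z, shape (latent_dim,)
    ipcs:  Instance Prototype Centroids, shape (n_ipc, latent_dim)
    """
    # The nearest IPC supplies the mode-alignment direction.
    dists = np.linalg.norm(ipcs - z, axis=1)
    order = np.argsort(dists)
    align = ipcs[order[0]] - z
    # Project the alignment vector onto the tangent space estimated
    # from the k nearest IPCs, so the update stays on-manifold.
    basis = tangent_basis(ipcs[order[:k]], tangent_dim)
    align_on_manifold = basis.T @ (basis @ align)
    return z + score + scale * align_on_manifold
```

In a real sampler, `score` would come from the pretrained diffusion model at the current timestep; here it is just a vector of the right shape.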

The whole pipeline runs inference‑only: no gradient updates to the diffusion model, VAE, or clustering algorithm are required after the initial preprocessing.
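The initial preprocessing (step 2) can likewise be sketched as a recursive two-way split of the VAE latents. This is a minimal illustration assuming NumPy arrays and a simple 2-means splitting criterion; `divisive_ipcs` and its parameters are hypothetical names, not the paper's API, and any divisive criterion could stand in for the split.

```python
import numpy as np

def divisive_ipcs(latents, max_depth=3, min_size=32, seed=0):
    """Recursively split latent vectors into a cluster hierarchy and
    return the leaf-cluster centroids as Instance Prototype Centroids."""
    rng = np.random.default_rng(seed)

    def two_means(x, iters=10):
        # Minimal 2-means split (stand-in for any divisive criterion).
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(iters):
            labels = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(1)
            for c in (0, 1):
                if (labels == c).any():
                    centers[c] = x[labels == c].mean(0)
        return labels

    def split(x, depth):
        if depth == max_depth or len(x) < min_size:
            return [x.mean(0)]  # leaf centroid becomes an IPC
        labels = two_means(x)
        if labels.min() == labels.max():  # degenerate split: stop here
            return [x.mean(0)]
        return split(x[labels == 0], depth + 1) + split(x[labels == 1], depth + 1)

    return np.stack(split(latents, 0))
```

Shallow nodes of the recursion correspond to the coarse semantic modes described above; increasing `max_depth` yields the finer, intra-class IPCs.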

Results & Findings

| Metric | Training‑free baseline | Training‑based baseline | ManifoldGD |
|---|---|---|---|
| FID (CIFAR‑10) ↓ | 38.2 | 31.5 | 27.1 |
| Embedding L2 distance (real ↔ synthetic) ↓ | 0.84 | 0.71 | 0.58 |
| Classification accuracy (ResNet‑18 trained on synthetic set) ↑ | 71.3 % | 78.9 % | 82.4 % |

  • Representativeness: The hierarchical IPCs capture both global class structure and fine‑grained nuances, leading to synthetic data that better mirrors the original distribution.
  • Diversity: Tangent‑space projection prevents collapse to a few modes, preserving intra‑class variation.
  • Image fidelity: Visual inspection shows sharper textures and more realistic lighting compared with prior score‑based guidance methods.

Across multiple benchmarks (CIFAR‑10, TinyImageNet, and a subset of ImageNet), ManifoldGD consistently outperformed the strongest existing training‑free methods and even surpassed many training‑based distillation pipelines.
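The summary does not spell out how the embedding L2 distance is computed; one common instantiation compares the mean embeddings of the real and synthetic sets. A minimal sketch under that assumption (the function name and inputs are hypothetical):

```python
import numpy as np

def mean_embedding_l2(real_emb, syn_emb):
    """L2 distance between the mean embeddings of the real and synthetic
    sets, each given as an (n_samples, emb_dim) array.

    One common instantiation of an embedding-distance metric; the paper
    may define the reported number differently.
    """
    return float(np.linalg.norm(real_emb.mean(axis=0) - syn_emb.mean(axis=0)))
```

A distance of 0 means the two sets share the same embedding centroid; lower values indicate the synthetic set better matches the real distribution's first moment.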

Practical Implications

  • Faster prototyping: Developers can replace a multi‑gigabyte training set with a few megabytes of synthetic images, cutting down data loading time and storage costs.
  • Edge and on‑device learning: Small synthetic datasets enable on‑device fine‑tuning of models (e.g., personalization on smartphones) without shipping the full dataset.
  • Privacy‑preserving sharing: Since the distilled data are generated from a latent manifold rather than raw images, they can be shared with reduced risk of leaking personally identifiable information.
  • Rapid domain adaptation: By recomputing IPCs on a new domain’s latent embeddings, practitioners can instantly generate a compact synthetic set for transfer learning, avoiding costly data collection.
  • Plug‑and‑play: The method works with any off‑the‑shelf diffusion model, so teams can integrate it into existing pipelines without training new generative models.

Limitations & Future Work

  • Dependence on VAE quality: The hierarchical clustering operates on VAE latents; a poorly trained encoder may produce suboptimal IPCs, limiting distillation quality.
  • Scalability of clustering: While divisive clustering is more memory‑efficient than exhaustive k‑means, constructing IPCs for extremely large datasets (e.g., full ImageNet) still incurs non‑trivial preprocessing time.
  • Fixed diffusion schedule: The current implementation assumes the standard diffusion timestep schedule; adapting the guidance to alternative schedules or accelerated samplers could yield further speedups.
  • Extension to non‑image modalities: The paper focuses on visual data; applying manifold‑guided distillation to audio, text, or multimodal datasets remains an open avenue.

Future research may explore learned latent manifolds (e.g., via graph neural networks), adaptive neighborhood sizes, and joint optimization of the VAE encoder and IPC hierarchy to further boost fidelity and reduce preprocessing overhead.

Authors

  • Ayush Roy
  • Wei‑Yang Alex Lee
  • Rudrasis Chakraborty
  • Vishnu Suresh Lokhande

Paper Information

  • arXiv ID: 2602.23295v1
  • Categories: cs.CV, cs.LG
  • Published: February 26, 2026
