[Paper] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Source: arXiv - 2512.10954v1
Overview
The paper introduces Group Diffusion, a novel twist on diffusion‑based image generators that lets multiple samples “talk” to each other during inference. By sharing attention across a batch of images, the model can coordinate its denoising steps, yielding noticeably higher visual fidelity, with up to a 32% reduction in FID on ImageNet‑256×256. This work opens a new avenue for improving generative AI without retraining the underlying model.
Key Contributions
- Cross‑sample attention: Extends the transformer‑style attention mechanism from intra‑image patches to inter‑image patches, enabling collaborative denoising.
- Group Diffusion framework: A plug‑and‑play inference‑time modification that works with any diffusion transformer (e.g., Stable Diffusion, Imagen).
- Scaling analysis: Demonstrates a monotonic relationship between group size and generation quality—larger groups yield stronger cross‑sample signals.
- Diagnostic metric: Proposes a simple “cross‑sample attention strength” measure that correlates tightly with FID and can serve as a diagnostic tool for practitioners.
- Empirical gains: Achieves up to 32.2 % lower FID on ImageNet‑256×256 compared to the baseline diffusion model, with no extra training data.
Methodology
- Baseline diffusion transformer: The model follows the standard denoising diffusion probabilistic model (DDPM) pipeline, in which a transformer denoiser predicts the noise for each image patch at every timestep.
- Group formation: Instead of processing a single image, the inference engine stacks N images into a “group.”
- Shared attention: The self‑attention layers are modified so that the query, key, and value tensors are concatenated across the group dimension; consequently, each patch can attend to patches in any image of the group, not just its own (see the sketch after this list).
- Joint denoising: The model performs the usual reverse diffusion steps, but the noise prediction for each image now incorporates information from its peers (a minimal sampling loop is sketched below).
- Scaling & measurement: Experiments vary the group size (e.g., 2, 4, 8, 16) and compute the proposed cross‑sample attention strength metric, showing a strong linear correlation with the final FID.
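To make the shared‑attention step concrete, here is a minimal PyTorch sketch of a self‑attention layer that folds the group dimension into the token dimension, so every patch token attends to the tokens of all images in the group. The module name `GroupSelfAttention`, the shapes, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GroupSelfAttention(nn.Module):
    """Self-attention in which every patch token can attend to the tokens
    of all images in the group, not only its own image (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (group_size, tokens_per_image, dim)
        n, t, d = x.shape
        # Fold the group dimension into the token dimension so attention is
        # computed over all group_size * tokens_per_image tokens at once.
        x = x.reshape(1, n * t, d)
        qkv = self.qkv(x).reshape(1, n * t, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (1, heads, n*t, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)  # cross-sample attention
        out = out.transpose(1, 2).reshape(1, n * t, d)
        out = self.proj(out)
        return out.reshape(n, t, d)                    # back to per-image layout


# Toy usage: a group of 4 "images", each tokenized into 256 patch tokens.
tokens = torch.randn(4, 256, 768)
attn = GroupSelfAttention(dim=768, num_heads=8)
print(attn(tokens).shape)  # torch.Size([4, 256, 768])
```

In a real pipeline, a layer of this kind would replace or wrap the existing self‑attention blocks of the pretrained denoiser at inference time.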
The approach requires no retraining; it is a pure inference‑time change, making it straightforward to drop into existing pipelines.
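As a sketch of how joint denoising could slot into a standard sampler, the loop below runs a plain DDPM reverse process over a whole group at once. Here `eps_model(x_t, t)` is a hypothetical stand‑in for any pretrained noise‑prediction network whose attention layers have been patched as above, and the schedule handling is deliberately simplified.

```python
import torch


@torch.no_grad()
def group_denoise(eps_model, group_size, sample_shape, alphas_cumprod):
    """Plain DDPM reverse process applied to a whole group at once.

    eps_model(x_t, t): stand-in for a pretrained noise predictor whose
        self-attention has been patched to attend across the group.
    alphas_cumprod: 1-D tensor of cumulative alpha-bar values from the
        model's noise schedule.
    """
    x = torch.randn(group_size, *sample_shape)        # N independent noise seeds
    num_steps = alphas_cumprod.numel()
    for t in reversed(range(num_steps)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones(())
        alpha_t = a_bar / a_bar_prev
        eps = eps_model(x, t)                         # cross-sample info flows here
        # Standard DDPM posterior mean for x_{t-1}.
        x = (x - (1 - alpha_t) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(alpha_t)
        if t > 0:
            # DDPM posterior variance (beta-tilde term).
            sigma = torch.sqrt((1 - a_bar_prev) / (1 - a_bar) * (1 - alpha_t))
            x = x + sigma * torch.randn_like(x)
    return x


# Toy usage with a dummy predictor and a 10-step linear schedule.
betas = torch.linspace(1e-4, 0.02, 10)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy = lambda x, t: torch.zeros_like(x)
out = group_denoise(dummy, group_size=4, sample_shape=(3, 32, 32), alphas_cumprod=alphas_cumprod)
print(out.shape)  # torch.Size([4, 3, 32, 32])
```

Note that nothing in the loop itself is group‑specific; the collaboration enters only through the patched attention inside the denoiser, which is what makes the method a pure inference‑time change.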
Results & Findings
| Setting | FID (ImageNet‑256) | Relative FID Reduction |
|---|---|---|
| Baseline diffusion transformer (single‑sample) | 13.8 | — |
| Group Diffusion, group size = 4 | 11.9 | 13% |
| Group Diffusion, group size = 8 | 10.8 | 22% |
| Group Diffusion, group size = 16 | 9.3 | 32% |
- Cross‑sample attention strength rises with group size and mirrors the FID drop, confirming that the metric captures the underlying signal (one plausible instantiation of the measure is sketched after this list).
- Visual inspection shows sharper textures, more coherent object boundaries, and fewer artifacts, especially in complex scenes with multiple objects.
- The method works across different diffusion backbones (e.g., Stable Diffusion v1.4, Imagen‑like models), indicating broad applicability.
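The summary does not reproduce the paper's exact definition of cross‑sample attention strength. One natural instantiation, assumed here purely for illustration, is the average fraction of attention mass a query token places on tokens belonging to other images in the group.

```python
import torch


def cross_sample_attention_strength(attn, group_size, tokens_per_image):
    """Average fraction of attention mass that a query token places on key
    tokens belonging to *other* images in the group (illustrative definition).

    attn: softmaxed attention weights of shape
          (heads, group_size * tokens_per_image, group_size * tokens_per_image),
          with each row summing to 1.
    """
    total = group_size * tokens_per_image
    image_id = torch.arange(total) // tokens_per_image
    same_image = image_id[:, None] == image_id[None, :]         # (total, total)
    cross_mass = attn.masked_fill(same_image, 0.0).sum(dim=-1)  # (heads, total)
    return cross_mass.mean().item()


# Sanity check: uniform attention over a group of 4 puts 3/4 of its mass
# on the other three images.
g, t = 4, 64
uniform = torch.full((8, g * t, g * t), 1.0 / (g * t))
print(cross_sample_attention_strength(uniform, g, t))  # 0.75
```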
Practical Implications
- Higher‑quality outputs without extra training data: Companies can boost the fidelity of existing diffusion services (e.g., image‑to‑image editing, content creation) simply by batching requests together.
- Cost‑effective scaling: Since the improvement comes from inference alone, no retraining is required; the marginal cost is mainly extra memory for the larger batch and additional attention computation, which grows with group size because every token attends to all tokens in the group.
- Better batch utilization: Cloud providers can schedule inference jobs in groups, turning idle GPU capacity into a quality boost for end users.
- Potential for multimodal collaboration: The same principle could be extended to text‑to‑image or video generation, where multiple prompts or frames share attention, opening doors to synchronized storytelling or style consistency across frames.
- Diagnostic tool: The cross‑sample attention strength metric can be used to monitor model health or to decide optimal group sizes dynamically based on hardware constraints.
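As a toy illustration of the last point, the helper below picks a group size under a memory budget using a deliberately crude cost model; the function name, the candidate sizes, and the fp16 byte accounting are assumptions for the sketch, not something taken from the paper.

```python
def pick_group_size(tokens_per_image, hidden_dim, num_layers,
                    mem_budget_bytes, bytes_per_elem=2,
                    candidates=(16, 8, 4, 2, 1)):
    """Return the largest candidate group size whose rough activation-memory
    estimate fits the budget. Deliberately crude: only the (N*T) x (N*T)
    attention matrices and the (N*T) x D token activations per layer are
    counted; weights, caches, and framework overhead are ignored."""
    for n in candidates:
        total_tokens = n * tokens_per_image
        attn_mem = num_layers * total_tokens ** 2 * bytes_per_elem
        act_mem = num_layers * total_tokens * hidden_dim * bytes_per_elem
        if attn_mem + act_mem <= mem_budget_bytes:
            return n
    return 1


# Example: 256 tokens per image, hidden size 1152, 28 layers, 8 GiB of headroom.
print(pick_group_size(256, 1152, 28, mem_budget_bytes=8 * 1024**3))  # 16
```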
Limitations & Future Work
- Memory overhead: Grouping many high‑resolution images can exceed GPU memory limits, requiring careful group sizing or memory‑efficient attention implementations.
- Diminishing returns: After a certain group size (≈16 in the paper), gains plateau, suggesting a sweet spot rather than “bigger is always better.”
- Applicability to non‑transformer diffusion models: The current design leverages transformer attention; adapting it to convolution‑based diffusion backbones may need additional engineering.
- Theoretical understanding: While empirical correlation with FID is strong, a deeper analysis of why cross‑sample attention improves the learned distribution remains open.
Future research could explore adaptive group formation (e.g., grouping images with similar semantic content), extending the idea to video diffusion, or integrating cross‑sample signals into the training loop for even larger gains.
Authors
- Sicheng Mo
- Thao Nguyen
- Richard Zhang
- Nick Kolkin
- Siddharth Srinivasan Iyer
- Eli Shechtman
- Krishna Kumar Singh
- Yong Jae Lee
- Bolei Zhou
- Yuheng Li
Paper Information
- arXiv ID: 2512.10954v1
- Categories: cs.CV
- Published: December 11, 2025