[Paper] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Source: arXiv - 2512.10954v1
Overview
The paper introduces Group Diffusion, a novel twist on diffusion‑based image generators that lets multiple samples “talk” to each other during inference. By sharing attention across a batch of images, the model can coordinate its denoising steps, yielding noticeably higher visual fidelity, with up to a 32% reduction in FID on ImageNet‑256×256. This work opens a new avenue for improving generative AI without retraining the underlying model.
Key Contributions
- Cross‑sample attention: Extends the transformer‑style attention mechanism from intra‑image patches to inter‑image patches, enabling collaborative denoising.
- Group Diffusion framework: A plug‑and‑play inference‑time modification that works with any diffusion transformer (e.g., Stable Diffusion, Imagen).
- Scaling analysis: Demonstrates a monotonic relationship between group size and generation quality—larger groups yield stronger cross‑sample signals.
- Diagnostic metric: Proposes a simple “cross‑sample attention strength” measure that correlates tightly with FID and can serve as a diagnostic tool for practitioners.
- Empirical gains: Achieves up to 32.2 % lower FID on ImageNet‑256×256 compared to the baseline diffusion model, with no extra training data.
Methodology
- Baseline diffusion transformer: The model follows the standard denoising diffusion probabilistic model (DDPM) pipeline, in which a transformer denoiser predicts the noise for each image patch at every timestep.
- Group formation: Instead of processing a single image, the inference engine stacks N images into a “group.”
- Shared attention: The self‑attention layers are modified so that the query, key, and value tensors are concatenated across the group dimension; consequently, each patch can attend to patches in any image of the group, not just its own (see the sketch after this list).
- Joint denoising: The model performs the usual reverse diffusion steps, but the noise prediction for each image now incorporates information from its peers (a minimal sampling loop is sketched below).
- Scaling & measurement: Experiments vary the group size (e.g., 2, 4, 8, 16) and compute the proposed cross‑sample attention strength metric, showing a strong linear correlation with the final FID.
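To make the shared‑attention step concrete, here is a minimal PyTorch sketch of a self‑attention layer that folds the group dimension into the token dimension, so every patch token attends to the tokens of all images in the group. The module name `GroupSelfAttention`, the shapes, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GroupSelfAttention(nn.Module):
    """Self-attention in which every patch token can attend to the tokens
    of all images in the group, not only its own image (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (group_size, tokens_per_image, dim)
        n, t, d = x.shape
        # Fold the group dimension into the token dimension so attention is
        # computed over all group_size * tokens_per_image tokens at once.
        x = x.reshape(1, n * t, d)
        qkv = self.qkv(x).reshape(1, n * t, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (1, heads, n*t, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)  # cross-sample attention
        out = out.transpose(1, 2).reshape(1, n * t, d)
        out = self.proj(out)
        return out.reshape(n, t, d)                    # back to per-image layout


# Toy usage: a group of 4 "images", each tokenized into 256 patch tokens.
tokens = torch.randn(4, 256, 768)
attn = GroupSelfAttention(dim=768, num_heads=8)
print(attn(tokens).shape)  # torch.Size([4, 256, 768])
```

In a real pipeline, a layer of this kind would replace or wrap the existing self‑attention blocks of the pretrained denoiser at inference time.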
The approach requires no retraining; it is a pure inference‑time change, making it straightforward to drop into existing pipelines.
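As a sketch of how joint denoising could slot into a standard sampler, the loop below runs a plain DDPM reverse process over a whole group at once. Here `eps_model(x_t, t)` is a hypothetical stand‑in for any pretrained noise‑prediction network whose attention layers have been patched as above, and the schedule handling is deliberately simplified.

```python
import torch


@torch.no_grad()
def group_denoise(eps_model, group_size, sample_shape, alphas_cumprod):
    """Plain DDPM reverse process applied to a whole group at once.

    eps_model(x_t, t): stand-in for a pretrained noise predictor whose
        self-attention has been patched to attend across the group.
    alphas_cumprod: 1-D tensor of cumulative alpha-bar values from the
        model's noise schedule.
    """
    x = torch.randn(group_size, *sample_shape)        # N independent noise seeds
    num_steps = alphas_cumprod.numel()
    for t in reversed(range(num_steps)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones(())
        alpha_t = a_bar / a_bar_prev
        eps = eps_model(x, t)                         # cross-sample info flows here
        # Standard DDPM posterior mean for x_{t-1}.
        x = (x - (1 - alpha_t) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(alpha_t)
        if t > 0:
            # DDPM posterior variance (beta-tilde term).
            sigma = torch.sqrt((1 - a_bar_prev) / (1 - a_bar) * (1 - alpha_t))
            x = x + sigma * torch.randn_like(x)
    return x


# Toy usage with a dummy predictor and a 10-step linear schedule.
betas = torch.linspace(1e-4, 0.02, 10)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy = lambda x, t: torch.zeros_like(x)
out = group_denoise(dummy, group_size=4, sample_shape=(3, 32, 32), alphas_cumprod=alphas_cumprod)
print(out.shape)  # torch.Size([4, 3, 32, 32])
```

Note that nothing in the loop itself is group‑specific; the collaboration enters only through the patched attention inside the denoiser, which is what makes the method a pure inference‑time change.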
Results & Findings
| Setting | FID (ImageNet‑256) | Relative FID Reduction |
|---|---|---|
| Baseline diffusion transformer (single‑sample) | 13.8 | — |
| Group Diffusion, group size = 4 | 11.9 | 13% |
| Group Diffusion, group size = 8 | 10.8 | 22% |
| Group Diffusion, group size = 16 | 9.3 | 32% |
- Cross‑sample attention strength rises with group size and mirrors the FID drop, confirming that the metric captures the underlying signal (one plausible instantiation of the measure is sketched after this list).
- Visual inspection shows sharper textures, more coherent object boundaries, and fewer artifacts, especially in complex scenes with multiple objects.
- The method works across different diffusion backbones (e.g., Stable Diffusion v1.4, Imagen‑like models), indicating broad applicability.
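The summary does not reproduce the paper's exact definition of cross‑sample attention strength. One natural instantiation, assumed here purely for illustration, is the average fraction of attention mass a query token places on tokens belonging to other images in the group.

```python
import torch


def cross_sample_attention_strength(attn, group_size, tokens_per_image):
    """Average fraction of attention mass that a query token places on key
    tokens belonging to *other* images in the group (illustrative definition).

    attn: softmaxed attention weights of shape
          (heads, group_size * tokens_per_image, group_size * tokens_per_image),
          with each row summing to 1.
    """
    total = group_size * tokens_per_image
    image_id = torch.arange(total) // tokens_per_image
    same_image = image_id[:, None] == image_id[None, :]         # (total, total)
    cross_mass = attn.masked_fill(same_image, 0.0).sum(dim=-1)  # (heads, total)
    return cross_mass.mean().item()


# Sanity check: uniform attention over a group of 4 puts 3/4 of its mass
# on the other three images.
g, t = 4, 64
uniform = torch.full((8, g * t, g * t), 1.0 / (g * t))
print(cross_sample_attention_strength(uniform, g, t))  # 0.75
```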
Practical Implications
- Higher‑quality outputs without extra training data: Companies can boost the fidelity of existing diffusion services (e.g., image‑to‑image editing, content creation) simply by batching requests together.
- Cost‑effective scaling: Since the improvement comes from inference alone, no retraining is required; the marginal cost is mainly extra memory for the larger batch and additional attention computation, which grows with group size because every token attends to all tokens in the group.
- Better batch utilization: Cloud providers can schedule inference jobs in groups, turning idle GPU capacity into a quality boost for end users.
- Potential for multimodal collaboration: The same principle could be extended to text‑to‑image or video generation, where multiple prompts or frames share attention, opening doors to synchronized storytelling or style consistency across frames.
- Diagnostic tool: The cross‑sample attention strength metric can be used to monitor model health or to decide optimal group sizes dynamically based on hardware constraints.
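As a toy illustration of the last point, the helper below picks a group size under a memory budget using a deliberately crude cost model; the function name, the candidate sizes, and the fp16 byte accounting are assumptions for the sketch, not something taken from the paper.

```python
def pick_group_size(tokens_per_image, hidden_dim, num_layers,
                    mem_budget_bytes, bytes_per_elem=2,
                    candidates=(16, 8, 4, 2, 1)):
    """Return the largest candidate group size whose rough activation-memory
    estimate fits the budget. Deliberately crude: only the (N*T) x (N*T)
    attention matrices and the (N*T) x D token activations per layer are
    counted; weights, caches, and framework overhead are ignored."""
    for n in candidates:
        total_tokens = n * tokens_per_image
        attn_mem = num_layers * total_tokens ** 2 * bytes_per_elem
        act_mem = num_layers * total_tokens * hidden_dim * bytes_per_elem
        if attn_mem + act_mem <= mem_budget_bytes:
            return n
    return 1


# Example: 256 tokens per image, hidden size 1152, 28 layers, 8 GiB of headroom.
print(pick_group_size(256, 1152, 28, mem_budget_bytes=8 * 1024**3))  # 16
```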
Limitations & Future Work
- Memory overhead: Grouping many high‑resolution images can exceed GPU memory limits, requiring careful group sizing or memory‑efficient attention implementations.
- Diminishing returns: After a certain group size (≈16 in the paper), gains plateau, suggesting a sweet spot rather than “bigger is always better.”
- Applicability to non‑transformer diffusion models: The current design leverages transformer attention; adapting it to convolution‑based diffusion backbones may need additional engineering.
- Theoretical understanding: While empirical correlation with FID is strong, a deeper analysis of why cross‑sample attention improves the learned distribution remains open.
Future research could explore adaptive group formation (e.g., grouping images with similar semantic content), extending the idea to video diffusion, or integrating cross‑sample signals into the training loop for even larger gains.
Authors
- Sicheng Mo
- Thao Nguyen
- Richard Zhang
- Nick Kolkin
- Siddharth Srinivasan Iyer
- Eli Shechtman
- Krishna Kumar Singh
- Yong Jae Lee
- Bolei Zhou
- Yuheng Li
Paper Information
- arXiv ID: 2512.10954v1
- Categories: cs.CV
- Published: December 11, 2025