[Paper] Taming Outlier Tokens in Diffusion Transformers
Source: arXiv - 2605.05206v1
Overview
The paper “Taming Outlier Tokens in Diffusion Transformers” examines a failure mode in diffusion-based image generators built on Vision Transformers (ViTs). It shows that both the encoder and the denoising transformer can produce a handful of “outlier” tokens: vectors with unusually high norm that dominate attention while carrying little useful visual information. The authors propose a lightweight register-based fix, Dual-Stage Registers (DSR), which reduces the resulting artifacts and improves generation quality on ImageNet and on large-scale text-to-image models.
Key Contributions
- Identify outlier tokens in Diffusion Transformers (DiTs). Demonstrates that high‑norm tokens appear not only in pretrained ViT encoders but also emerge internally during diffusion denoising, especially in middle layers.
- Show that naive masking fails. Simply zeroing out high-norm tokens does not improve results, which indicates that the problem is semantic corruption rather than extreme values alone.
- Propose Dual‑Stage Registers (DSR). A two‑phase, register‑based intervention:
  - Training‑time registers that learn to replace or correct outlier tokens during model training.
  - Recursive test‑time registers that detect and replace outliers on‑the‑fly during inference, plus a specialized diffusion register for the denoiser.
- Extensive empirical validation. Across standard ImageNet generation and large‑scale text‑to‑image benchmarks, DSR consistently reduces visual artifacts and improves FID/IS scores.
- Open a new research direction. Highlights outlier‑token control as a crucial, previously overlooked component for building robust diffusion‑based generative models.
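The first two contributions can be made concrete with a small sketch. It computes per-token L2 norms, flags tokens above a threshold, and applies the naive zero-out baseline that the paper reports as ineffective. The flagging rule (norm above `ratio` times the median norm), the function names, and the toy data are assumptions for illustration; this summary does not state the paper's exact criterion.

```python
import math
from statistics import median


def token_norms(tokens):
    """Return the L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outliers(tokens, ratio=5.0):
    """Indices of tokens whose norm exceeds ratio * median norm (assumed rule)."""
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


def mask_outliers(tokens, ratio=5.0):
    """Naive baseline: zero out flagged tokens and copy the rest unchanged."""
    flagged = set(find_outliers(tokens, ratio))
    return [
        [0.0] * len(tok) if i in flagged else list(tok)
        for i, tok in enumerate(tokens)
    ]


# Toy sequence of four 3-dimensional tokens; the third is artificially large.
tokens = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.1],
    [40.0, -35.0, 28.0],
    [-0.2, 0.1, 0.6],
]
outlier_ids = find_outliers(tokens)
masked = mask_outliers(tokens)
```

Masking removes the large values but also discards whatever the token position was supposed to encode, which is consistent with the finding that the issue is corrupted semantics rather than magnitude alone.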
Methodology
- Diagnosing the problem
  - The authors first analyze token norms throughout the encoder‑decoder pipeline of a Representation Autoencoder‑DiT (RAE‑DiT).
  - They visualize attention maps and find that a few tokens dominate the attention distribution while representing vague or noisy patches.
- Baseline experiments
  - Simple masking (zero‑out tokens above a norm threshold) and norm‑clipping are applied, showing negligible or even negative impact on generation quality.
- Dual‑Stage Registers (DSR)
  - Training‑stage registers: Small learnable vectors (the “registers”) are appended to the token sequence. During training, a gating network learns when to substitute an outlier token with a register entry, effectively “repairing” corrupted semantics.
  - Test‑time registers: At inference, a recursive detection module scans each layer for high‑norm tokens, replaces them with the most appropriate register entry, and re‑feeds the corrected sequence into subsequent layers.
  - Diffusion registers: A dedicated register set is trained specifically for the denoising transformer, allowing it to correct outliers that arise from the stochastic diffusion process itself.
- Evaluation
  - The pipeline is tested on unconditional ImageNet generation (256×256) and on a large text‑to‑image model (e.g., Stable Diffusion‑like architecture).
  - Standard metrics (FID, IS, CLIP‑Score) and qualitative visual inspection are used to assess improvement.
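The test-time step described above (scan each layer's output for high-norm tokens, substitute a register entry, pass the corrected sequence on) can be sketched roughly as follows. This is a toy illustration of that description, not the paper's implementation: the median-based threshold, the choice of "most appropriate" register by cosine similarity, the fixed register values, and all names are assumptions, and the learned gating network is not modeled.

```python
import math
from statistics import median


def norm(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def replace_outliers(tokens, registers, ratio=5.0):
    """Swap each high-norm token for the register closest to it in direction."""
    norms = [norm(t) for t in tokens]
    cutoff = ratio * median(norms)  # assumed detection rule
    corrected = []
    for tok, n in zip(tokens, norms):
        if n > cutoff:
            best = max(registers, key=lambda r: cosine(tok, r))
            corrected.append(list(best))
        else:
            corrected.append(list(tok))
    return corrected


def run_with_registers(tokens, registers, layers, ratio=5.0):
    """Apply each layer, then re-check its output before the next layer."""
    for layer in layers:
        tokens = replace_outliers(layer(tokens), registers, ratio)
    return tokens


def blow_up_first(tokens):
    """Hypothetical layer that inflates the first token by a factor of 100."""
    return [[100.0 * x for x in tokens[0]]] + [list(t) for t in tokens[1:]]


def identity(tokens):
    """Hypothetical layer that leaves the sequence unchanged."""
    return [list(t) for t in tokens]


registers = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for learned register vectors
tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.7]]
result = run_with_registers(tokens, registers, [blow_up_first, identity])
```

Running the check after every layer means an outlier created mid-network is corrected before later layers can attend to it, which is the behavior the "recursive" wording suggests.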
Results & Findings
| Benchmark (metric) | Baseline | DSR‑enhanced | Δ |
|---|---|---|---|
| ImageNet‑256, unconditional (FID, lower is better) | 7.8 | 6.4 | -1.4 |
| Text‑to‑image, COCO‑style (FID, lower is better) | 12.3 | 10.7 | -1.6 |
| CLIP‑Score (higher is better) | 0.312 | 0.337 | +0.025 |
- Visual quality: Samples generated with DSR exhibit fewer “blobby” or “checkerboard” artifacts that were previously traced back to outlier tokens.
- Attention distribution: Post‑DSR attention maps become more balanced, with a smoother spread across patches, confirming that the registers successfully dilute the dominance of outlier tokens.
- Efficiency: The register modules add less than 2% to inference time, which keeps them practical for real-world deployment.
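One way to put a number on "more balanced" attention is the Shannon entropy of a query's attention weights: near-uniform attention gives high entropy, while attention captured by a single token gives entropy close to zero. The sketch below uses scaled dot-product attention on toy vectors. The choice of entropy as the metric and all names and values are assumptions; this summary does not say how the authors measured attention balance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(query, keys):
    """Shannon entropy (in nats) of one query's attention over the keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return -sum(w * math.log(w) for w in weights if w > 0.0)


query = [1.0, 1.0]
balanced_keys = [[0.5, 0.4], [0.4, 0.5], [0.5, 0.5], [0.4, 0.4]]
# Same keys, except the third is replaced by a high-norm outlier.
outlier_keys = [[0.5, 0.4], [0.4, 0.5], [20.0, 20.0], [0.4, 0.4]]

h_balanced = attention_entropy(query, balanced_keys)  # close to log(4)
h_outlier = attention_entropy(query, outlier_keys)    # close to 0
```

With four keys the maximum possible entropy is `math.log(4)`, so comparing a layer's average entropy against `log(num_tokens)` before and after an intervention gives a simple, scale-free indicator of how concentrated attention is.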
Practical Implications
- Cleaner outputs for production‑grade generators. Companies building AI‑powered image creation tools (e.g., design assistants, content‑generation platforms) can integrate DSR to reduce glitchy artifacts without retraining the entire model.
- Improved downstream tasks. Better‑quality latent representations translate to higher fidelity in downstream pipelines such as image editing, in‑painting, or style transfer that rely on diffusion models.
- Low‑cost upgrade path. Since DSR works as a plug‑in (registers can be trained on top of an existing checkpoint), developers can retrofit legacy diffusion models with minimal compute budget.
- More stable fine‑tuning. When adapting a large diffusion model to a new domain (e.g., medical imaging), DSR can mitigate the emergence of outlier tokens that often cause training instability.
Limitations & Future Work
- Scope of token types. The study focuses on visual tokens; extending the analysis to multimodal diffusion models (e.g., text‑image or video) remains open.
- Register capacity. A fixed small set of registers may eventually saturate for extremely large or highly diverse datasets; adaptive or hierarchical registers could be explored.
- Theoretical understanding. While empirical results are strong, a deeper theoretical explanation of why outlier tokens arise in diffusion dynamics is still lacking.
- Real‑time constraints. Although overhead is modest, ultra‑low‑latency applications (e.g., mobile inference) may need further optimization of the recursive detection step.
Bottom line: By drawing attention to outlier tokens, a subtle but pervasive issue in diffusion transformers, this work gives developers a practical tool (DSR) for making generative models more reliable and their outputs cleaner.
Authors
- Xiaoyu Wu
- Yifei Wang
- Tsu-Jui Fu
- Liang-Chieh Chen
- Zhe Gan
- Chen Wei
Paper Information
- arXiv ID: 2605.05206v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: May 6, 2026