[Paper] Taming Outlier Tokens in Diffusion Transformers

Published: May 6, 2026 at 01:59 PM EDT

Source: arXiv - 2605.05206v1

Overview

The paper “Taming Outlier Tokens in Diffusion Transformers” uncovers a hidden flaw in modern diffusion‑based image generators that use Vision Transformers (ViTs). It shows that both the encoder and the denoising transformer can produce a handful of “outlier” tokens: vectors with unusually high magnitude that dominate attention while carrying little useful visual information. By introducing a lightweight register‑based fix called Dual‑Stage Registers (DSR), the authors reduce these artifacts and improve generation quality on ImageNet and on large‑scale text‑to‑image models.

Key Contributions

Identify outlier tokens in Diffusion Transformers (DiTs). Demonstrates that high‑norm tokens appear not only in pretrained ViT encoders but also emerge internally during diffusion denoising, especially in middle layers.
Show that naive masking fails. Simply zeroing out high‑norm tokens does not improve results, indicating that the problem is semantic corruption rather than extreme values alone.
Propose Dual‑Stage Registers (DSR). A two‑phase, register‑based intervention:
1. Training‑time registers that learn to replace or correct outlier tokens during model training.
2. Recursive test‑time registers that detect and replace outliers on‑the‑fly during inference, plus a specialized diffusion register for the denoiser.
Extensive empirical validation. Across standard ImageNet generation and large‑scale text‑to‑image benchmarks, DSR consistently reduces visual artifacts and improves FID/IS scores.
Open a new research direction. Highlights outlier‑token control as a crucial, previously overlooked component for building robust diffusion‑based generative models.

Methodology

Diagnosing the problem
- The authors first analyze token norms throughout the encoder‑decoder pipeline of a Representation Autoencoder‑DiT (RAE‑DiT).
- They visualize attention maps and find that a few tokens dominate the attention distribution while representing vague or noisy patches.
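
The norm analysis described above can be illustrated with a minimal sketch. It flags tokens whose L2 norm is far above the median norm of the sequence. The median-based rule, the `ratio` default, and the function names are assumptions made for illustration; the summary does not state which detection criterion the authors use.

```python
import math
from statistics import median


def l2_norm(vec):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))


def find_outlier_tokens(tokens, ratio=5.0):
    """Return indices of tokens whose norm exceeds ratio * median norm.

    The median-based cutoff and the default ratio are assumptions for this
    illustration, not values taken from the paper.
    """
    norms = [l2_norm(t) for t in tokens]
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy sequence: five ordinary tokens and one high-norm token at index 3.
tokens = [
    [0.5, -0.2, 0.1, 0.3],
    [0.4, 0.1, -0.3, 0.2],
    [-0.1, 0.6, 0.2, -0.4],
    [9.0, -7.5, 8.2, 6.1],
    [0.2, 0.3, -0.5, 0.1],
    [-0.3, -0.2, 0.4, 0.5],
]
outliers = find_outlier_tokens(tokens)
print(outliers)  # [3]
```

In a real model the same check would presumably run per layer on the hidden states, which is how one could see whether outliers concentrate in particular layers.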
Baseline experiments
- Simple masking (zeroing out tokens above a norm threshold) and norm clipping are applied as baselines; both have a negligible or even negative effect on generation quality.
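
A minimal sketch of the two baselines, assuming a fixed norm threshold (the value `2.0` and the function names are hypothetical). It shows why both are purely magnitude fixes: masking discards the token, and clipping keeps its direction, so neither restores meaningful content.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))


def mask_outliers(tokens, threshold):
    """Zero out every token whose norm exceeds the threshold."""
    return [[0.0] * len(t) if l2_norm(t) > threshold else list(t) for t in tokens]


def clip_norms(tokens, threshold):
    """Rescale tokens so that no norm exceeds the threshold; direction is kept."""
    clipped = []
    for t in tokens:
        n = l2_norm(t)
        scale = threshold / n if n > threshold else 1.0
        clipped.append([x * scale for x in t])
    return clipped


# The middle token has norm 10; the other two have norm 0.5.
tokens = [[0.3, 0.4], [6.0, 8.0], [-0.5, 0.0]]
masked = mask_outliers(tokens, threshold=2.0)
clipped = clip_norms(tokens, threshold=2.0)
print(masked)
print(clipped)
```

After clipping, the outlier still points in the same direction as before, which is consistent with the summary's reading that the issue is corrupted semantics rather than scale alone.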
Dual‑Stage Registers (DSR)
- Training‑stage registers: Small learnable vectors (the “registers”) are appended to the token sequence. During training, a gating network learns when to substitute an outlier token with a register entry, effectively “repairing” corrupted semantics.
- Test‑time registers: At inference, a recursive detection module scans each layer for high‑norm tokens, replaces them with the most appropriate register entry, and re‑feeds the corrected sequence into subsequent layers.
- Diffusion registers: A dedicated register set is trained specifically for the denoising transformer, allowing it to correct outliers that arise from the stochastic diffusion process itself.
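
The test-time stage can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' implementation: registers are fixed vectors instead of learned ones, "most appropriate" is interpreted as highest cosine similarity, the threshold is arbitrary, and the toy layers and all function names are hypothetical. The training-time gating network and the diffusion registers are not modeled.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def replace_outliers(tokens, registers, threshold):
    """Swap each high-norm token for the register closest to it in direction."""
    repaired, replaced = [], []
    for i, t in enumerate(tokens):
        if l2_norm(t) > threshold:
            best = max(registers, key=lambda r: cosine(t, r))
            repaired.append(list(best))
            replaced.append(i)
        else:
            repaired.append(list(t))
    return repaired, replaced


def run_layers(tokens, layers, registers, threshold):
    """Check after every layer and feed the repaired tokens to the next one."""
    log = []
    for layer in layers:
        tokens = layer(tokens)
        tokens, replaced = replace_outliers(tokens, registers, threshold)
        log.append(replaced)
    return tokens, log


def amplify_token_1(tokens):
    """Toy layer that blows up the token at index 1 by a factor of 20."""
    return [[x * 20 for x in t] if i == 1 else list(t) for i, t in enumerate(tokens)]


def identity(tokens):
    """Toy layer that leaves all tokens unchanged."""
    return [list(t) for t in tokens]


registers = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for learned register vectors
tokens = [[0.3, 0.4], [0.1, 0.5], [-0.5, 0.2]]
final, log = run_layers(tokens, [amplify_token_1, identity], registers, threshold=3.0)
print(log)    # [[1], []]
print(final)  # [[0.3, 0.4], [0.0, 1.0], [-0.5, 0.2]]
```

The log shows the token at index 1 being replaced after the first layer and no further replacements afterwards, which mirrors the idea of passing a corrected sequence on to subsequent layers.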
Evaluation
- The pipeline is tested on unconditional ImageNet generation (256×256) and on a large text‑to‑image model (e.g., a Stable Diffusion‑like architecture).
- Standard metrics (FID, IS, CLIP‑Score) and qualitative visual inspection are used to assess improvement.

Results & Findings

Benchmark (metric)	Baseline	DSR‑Enhanced	Δ (Improvement)
ImageNet‑256, unconditional (FID, lower is better)	7.8	6.4	-1.4
Text‑to‑Image, COCO‑style (FID, lower is better)	12.3	10.7	-1.6
CLIP‑Score (higher is better)	0.312	0.337	+0.025
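
To put the absolute deltas in the table on a common scale, the short calculation below converts them to relative changes. The numbers are copied from the table as reported in this summary; the dictionary keys and the helper name are illustrative only.

```python
# (baseline, DSR) pairs as reported in the table above.
rows = {
    "ImageNet-256 FID": (7.8, 6.4),
    "Text-to-image FID": (12.3, 10.7),
    "CLIP-Score": (0.312, 0.337),
}


def relative_change(baseline, new):
    """Signed change relative to the baseline value."""
    return (new - baseline) / baseline


changes = {name: relative_change(b, n) for name, (b, n) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
# ImageNet-256 FID: -17.9%
# Text-to-image FID: -13.0%
# CLIP-Score: +8.0%
```

For FID a negative change is an improvement, while for CLIP-Score a positive change is.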

Visual quality: Samples generated with DSR exhibit fewer “blobby” or “checkerboard” artifacts that were previously traced back to outlier tokens.
Attention distribution: Post‑DSR attention maps become more balanced, with a smoother spread across patches, confirming that the registers successfully dilute the dominance of outlier tokens.
Efficiency: The register modules add less than 2% overhead to inference time, making them practical for real‑world deployment.
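
The attention-distribution finding can be made concrete with a toy example. The sketch below computes scaled dot-product attention weights for one query and uses Shannon entropy as a rough measure of how balanced the distribution is. The vectors are invented, and entropy is an assumed stand-in for whatever balance measure the authors actually report.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def entropy(weights):
    """Shannon entropy in nats; higher means a more even spread."""
    return -sum(w * math.log(w) for w in weights if w > 0)


query = [1.0, 1.0]
keys = [[0.5, 0.2], [0.1, 0.6], [0.4, 0.4], [6.0, 6.0]]  # last key is high-norm
with_outlier = attention_weights(query, keys)
# Same sequence with the high-norm key swapped for an ordinary-norm vector.
without_outlier = attention_weights(query, keys[:3] + [[0.3, 0.3]])

print([round(w, 4) for w in with_outlier])
print([round(w, 4) for w in without_outlier])
print(entropy(with_outlier) < entropy(without_outlier))  # True
```

With the high-norm key present, nearly all attention mass lands on it; once it is replaced, the weights are close to uniform and the entropy rises.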

Practical Implications

Cleaner outputs for production‑grade generators. Companies building AI‑powered image creation tools (e.g., design assistants, content‑generation platforms) can integrate DSR to reduce glitchy artifacts without retraining the entire model.
Improved downstream tasks. Better‑quality latent representations translate to higher fidelity in downstream pipelines such as image editing, in‑painting, or style transfer that rely on diffusion models.
Low‑cost upgrade path. Since DSR works as a plug‑in (registers can be trained on top of an existing checkpoint), developers can retrofit legacy diffusion models with minimal compute budget.
More stable fine‑tuning. When adapting a large diffusion model to a new domain (e.g., medical imaging), DSR can mitigate the emergence of outlier tokens that often cause training instability.

Limitations & Future Work

Scope of token types. The study focuses on visual tokens; extending the analysis to multimodal diffusion models (e.g., text‑image or video) remains open.
Register capacity. A fixed small set of registers may eventually saturate for extremely large or highly diverse datasets; adaptive or hierarchical registers could be explored.
Theoretical understanding. While empirical results are strong, a deeper theoretical explanation of why outlier tokens arise in diffusion dynamics is still lacking.
Real‑time constraints. Although overhead is modest, ultra‑low‑latency applications (e.g., mobile inference) may need further optimization of the recursive detection step.

Bottom line: By shining a light on outlier tokens, a subtle but pervasive issue in diffusion transformers, this work gives developers a practical tool (DSR) for making generative models more reliable and their outputs cleaner.

Authors

Xiaoyu Wu
Yifei Wang
Tsu-Jui Fu
Liang-Chieh Chen
Zhe Gan
Chen Wei

Paper Information

arXiv ID: 2605.05206v1
Categories: cs.CV, cs.AI, cs.LG
Published: May 6, 2026
