[Paper] Taming Outlier Tokens in Diffusion Transformers

Published: May 6, 2026 at 01:59 PM EDT
Source: arXiv - 2605.05206v1

Overview

The paper “Taming Outlier Tokens in Diffusion Transformers” examines a failure mode in diffusion‑based image generators built on Vision Transformers (ViTs). It shows that both the encoder and the denoising transformer can produce a handful of “outlier” tokens: vectors with unusually high norm that dominate attention while carrying little useful visual information. The authors introduce a lightweight register‑based fix called Dual‑Stage Registers (DSR), which reduces the resulting artifacts and improves generation quality on ImageNet and on large‑scale text‑to‑image models.

Key Contributions

  • Identify outlier tokens in Diffusion Transformers (DiTs). High‑norm tokens appear not only in pretrained ViT encoders but also emerge inside the denoiser during diffusion, especially in its middle layers.
  • Show that naive masking fails. Simply zeroing out high‑norm tokens does not improve results, which indicates that the problem is semantic corruption rather than extreme values alone.
  • Propose Dual‑Stage Registers (DSR). A two‑phase, register‑based intervention:
    1. Training‑time registers that learn to replace or correct outlier tokens during model training.
    2. Recursive test‑time registers that detect and replace outliers on‑the‑fly during inference, plus a specialized diffusion register for the denoiser.
  • Extensive empirical validation. Across standard ImageNet generation and large‑scale text‑to‑image benchmarks, DSR consistently reduces visual artifacts and improves FID/IS scores.
  • Open a new research direction. Highlights outlier‑token control as a crucial, previously overlooked component for building robust diffusion‑based generative models.
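
The first two contributions can be made concrete with a small NumPy sketch: flag tokens whose L2 norm is far above the rest, then apply the naive zero‑masking baseline that the paper reports as ineffective. This is an illustration only. The median/MAD rule, the default `k = 5.0`, and the function names `find_outlier_tokens` and `mask_outliers` are assumptions made here, not the authors' detection criterion.

```python
import numpy as np

def token_norms(tokens: np.ndarray) -> np.ndarray:
    """L2 norm of each token; `tokens` has shape (num_tokens, dim)."""
    return np.linalg.norm(tokens, axis=-1)

def find_outlier_tokens(tokens: np.ndarray, k: float = 5.0) -> np.ndarray:
    """Boolean mask of tokens whose norm exceeds median + k * MAD.

    The median/MAD rule and k=5 are illustrative assumptions,
    not the criterion used in the paper.
    """
    norms = token_norms(tokens)
    median = np.median(norms)
    mad = np.median(np.abs(norms - median))
    return norms > median + k * max(mad, 1e-8)

def mask_outliers(tokens: np.ndarray, outliers: np.ndarray) -> np.ndarray:
    """Naive baseline: zero out the flagged tokens, leave the rest untouched."""
    masked = tokens.copy()
    masked[outliers] = 0.0
    return masked

# Toy sequence: 16 tokens of dimension 8 with smoothly varying norms,
# plus one token (index 3) scaled up to act as a high-norm outlier.
scales = np.linspace(0.5, 1.5, 16)[:, None]
tokens = np.tile(scales, (1, 8))
tokens[3] *= 50.0

outliers = find_outlier_tokens(tokens)
masked = mask_outliers(tokens, outliers)
print("flagged token indices:", np.flatnonzero(outliers))
```

Masking of this kind removes the extreme values but leaves a hole where a patch representation should be, which is consistent with the summary's point that the damage is semantic and not just a matter of scale.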

Methodology

  1. Diagnosing the problem

    • The authors first analyze token norms throughout the encoder‑decoder pipeline of a Representation Autoencoder‑DiT (RAE‑DiT).
    • They visualize attention maps and find that a few tokens dominate the attention distribution while representing vague or noisy patches.

  2. Baseline experiments

    • Simple masking (zeroing out tokens above a norm threshold) and norm clipping are applied; both have a negligible or even negative effect on generation quality.

  3. Dual‑Stage Registers (DSR)

    • Training‑stage registers: Small learnable vectors (the “registers”) are appended to the token sequence. During training, a gating network learns when to substitute an outlier token with a register entry, effectively “repairing” corrupted semantics.
    • Test‑time registers: At inference, a recursive detection module scans each layer for high‑norm tokens, replaces them with the most appropriate register entry, and re‑feeds the corrected sequence into subsequent layers.
    • Diffusion registers: A dedicated register set is trained specifically for the denoising transformer, allowing it to correct outliers that arise from the stochastic diffusion process itself.

  4. Evaluation

    • The pipeline is tested on unconditional ImageNet generation (256×256) and on a large text‑to‑image model (e.g., a Stable Diffusion‑like architecture).
    • Standard metrics (FID, IS, CLIP‑Score) and qualitative visual inspection are used to assess improvement.
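
The test‑time step described above (detect high‑norm tokens after a layer, swap each for a register entry, pass the repaired sequence on) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the summary does not say how the "most appropriate" register is chosen, so cosine similarity is assumed here, along with a fixed `norm_threshold`. The learned gating network and the training procedure are not reproduced, and the names `replace_outliers_with_registers` and `run_layers` are hypothetical.

```python
import numpy as np

def replace_outliers_with_registers(tokens, registers, norm_threshold):
    """Swap each token whose L2 norm exceeds `norm_threshold` for the register
    with the highest cosine similarity to it (selection rule is an assumption).

    Returns (repaired_tokens, outlier_mask).
    """
    norms = np.linalg.norm(tokens, axis=-1)
    outliers = norms > norm_threshold
    repaired = tokens.copy()
    if outliers.any():
        unit_tokens = tokens[outliers] / norms[outliers, None]
        unit_regs = registers / np.linalg.norm(registers, axis=-1, keepdims=True)
        best = np.argmax(unit_tokens @ unit_regs.T, axis=-1)
        repaired[outliers] = registers[best]
    return repaired, outliers

def run_layers(tokens, layers, registers, norm_threshold):
    """Apply the check after every layer so later layers see repaired tokens."""
    for layer in layers:
        tokens = layer(tokens)
        tokens, _ = replace_outliers_with_registers(tokens, registers, norm_threshold)
    return tokens

def amplify_first(x):
    """Stand-in for a transformer layer that blows up the norm of token 0."""
    y = x.copy()
    y[0] = y[0] * 100.0
    return y

tokens = np.eye(4)                       # four unit-norm toy tokens
registers = np.array([[0.6, 0.8, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])  # stand-ins for learned registers

out = run_layers(tokens, [amplify_first, amplify_first], registers, norm_threshold=10.0)
print(out)
```

In this toy run, token 0 is inflated by each stand‑in layer and replaced by the first register both times, while the other tokens pass through unchanged.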

Results & Findings

Benchmark | Baseline FID | DSR‑Enhanced FID | Δ (Improvement)
ImageNet‑256 (unconditional) | 7.8 | 6.4 | ‑1.4
Text‑to‑Image (COCO‑style) | 12.3 | 10.7 | ‑1.6
CLIP‑Score (higher is better) | 0.312 | 0.337 | +0.025
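
The absolute deltas in the table are easier to compare as relative changes. The short calculation below uses only the numbers reported above (baseline 7.8 vs. 6.4, 12.3 vs. 10.7, and 0.312 vs. 0.337); the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change from `baseline` to `new` (negative means a decrease)."""
    return (new - baseline) / baseline * 100.0

# (baseline, DSR-enhanced) pairs copied from the results table.
results = {
    "ImageNet-256 FID (lower is better)": (7.8, 6.4),
    "Text-to-Image FID (lower is better)": (12.3, 10.7),
    "CLIP-Score (higher is better)": (0.312, 0.337),
}

for name, (baseline, new) in results.items():
    print(f"{name}: {relative_change(baseline, new):+.1f}%")
# ImageNet-256 FID (lower is better): -17.9%
# Text-to-Image FID (lower is better): -13.0%
# CLIP-Score (higher is better): +8.0%
```
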
  • Visual quality: Samples generated with DSR exhibit fewer “blobby” or “checkerboard” artifacts that were previously traced back to outlier tokens.
  • Attention distribution: Post‑DSR attention maps become more balanced, with a smoother spread across patches, confirming that the registers successfully dilute the dominance of outlier tokens.
  • Efficiency: The register modules add less than 2% overhead to inference time, making them practical for real‑world deployment.
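
One way to put a number on "more balanced" attention is to compare the entropy of each query's attention distribution and the share of attention captured by the most‑attended token. The sketch below is a generic diagnostic written for this summary, not a metric taken from the paper; `attention_entropy` and `max_key_share` are hypothetical names, and the two toy attention maps are synthetic.

```python
import numpy as np

def softmax(scores: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) over the rows of an attention matrix.

    `attn` has shape (num_queries, num_keys) and each row sums to 1.
    Higher entropy means attention is spread over more tokens.
    """
    safe = np.clip(attn, 1e-12, 1.0)
    return float(-(attn * np.log(safe)).sum(axis=-1).mean())

def max_key_share(attn: np.ndarray) -> float:
    """Largest average fraction of attention received by any single key token."""
    return float(attn.mean(axis=0).max())

# Balanced case: every query attends uniformly to all 6 tokens.
uniform = softmax(np.zeros((6, 6)))

# Outlier-dominated case: token 2 gets a much larger score from every query.
peaked_scores = np.zeros((6, 6))
peaked_scores[:, 2] = 10.0
peaked = softmax(peaked_scores)

for label, attn in [("uniform", uniform), ("peaked", peaked)]:
    print(label, round(attention_entropy(attn), 4), round(max_key_share(attn), 4))
```

A map dominated by one token shows low entropy and a key share close to 1, while a balanced map approaches the maximum entropy of log(num_keys).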

Practical Implications

  • Cleaner outputs for production‑grade generators. Companies building AI‑powered image creation tools (e.g., design assistants, content‑generation platforms) can integrate DSR to reduce glitchy artifacts without retraining the entire model.
  • Improved downstream tasks. Better‑quality latent representations translate to higher fidelity in downstream pipelines such as image editing, in‑painting, or style transfer that rely on diffusion models.
  • Low‑cost upgrade path. Since DSR works as a plug‑in (registers can be trained on top of an existing checkpoint), developers can retrofit legacy diffusion models with minimal compute budget.
  • More stable fine‑tuning. When adapting a large diffusion model to a new domain (e.g., medical imaging), DSR can mitigate the emergence of outlier tokens that often cause training instability.

Limitations & Future Work

  • Scope of token types. The study focuses on visual tokens; extending the analysis to multimodal diffusion models (e.g., text‑image or video) remains open.
  • Register capacity. A fixed small set of registers may eventually saturate for extremely large or highly diverse datasets; adaptive or hierarchical registers could be explored.
  • Theoretical understanding. While empirical results are strong, a deeper theoretical explanation of why outlier tokens arise in diffusion dynamics is still lacking.
  • Real‑time constraints. Although overhead is modest, ultra‑low‑latency applications (e.g., mobile inference) may need further optimization of the recursive detection step.

Bottom line: By drawing attention to outlier tokens, a subtle but pervasive issue in diffusion transformers, this work gives developers a practical tool (DSR) for making generative models more reliable and their outputs cleaner.

Authors

  • Xiaoyu Wu
  • Yifei Wang
  • Tsu-Jui Fu
  • Liang-Chieh Chen
  • Zhe Gan
  • Chen Wei

Paper Information

  • arXiv ID: 2605.05206v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: May 6, 2026