[Paper] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
Source: arXiv - 2602.06886v1
Overview
The paper “Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers” uncovers a subtle but important flaw in today’s state‑of‑the‑art text‑to‑image models (e.g., SD3, SD3.5, FLUX.1). As the diffusion process proceeds, the models gradually “forget” the original textual prompt, which can lead to images that drift away from the user’s intent. The authors propose a training‑free “prompt reinjection” technique that restores the prompt’s influence in deeper layers, yielding noticeably better alignment between instructions and generated images.
Key Contributions
- Identification of Prompt Forgetting: Empirical analysis shows that the semantic strength of the prompt representation decays across the text branch of Multimodal Diffusion Transformers (MMDiTs).
- Prompt Reinjection Mechanism: A simple, inference‑only method that copies early‑layer prompt embeddings into later layers, effectively “reminding” the model of the original instruction.
- Broad Empirical Validation: Experiments on three benchmark suites (GenEval, DPG, T2I‑CompBench++) demonstrate consistent improvements in instruction following, aesthetic preference, and overall generation quality across three major MMDiT families.
- Training‑Free Deployment: The technique requires no extra fine‑tuning or additional parameters, making it instantly applicable to existing pipelines.
Methodology
- Probing Prompt Representations:
  - The authors extract the hidden states of the text branch at each diffusion step for three popular MMDiTs.
  - Linguistic probes (e.g., part-of-speech, sentiment, and semantic-similarity classifiers) quantify how much of the original prompt's meaning survives as depth increases.
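The probing step can be approximated with a simple similarity measure. The sketch below is a hypothetical helper, not the paper's code: it scores each layer's text-token embeddings against the first layer's, so a falling score is a rough stand-in for the decay the probes detect.

```python
import numpy as np

def prompt_retention(layer_states):
    """Rough proxy for the paper's linguistic probes: mean cosine
    similarity between each layer's text-token embeddings and the
    first layer's.  A score that falls with depth signals that the
    prompt representation is weakening ("prompt forgetting").

    layer_states: list of (tokens, dim) arrays, one per layer.
    """
    ref = layer_states[0]
    scores = []
    for state in layer_states:
        num = (ref * state).sum(axis=1)
        den = np.linalg.norm(ref, axis=1) * np.linalg.norm(state, axis=1) + 1e-8
        scores.append(float(np.mean(num / den)))
    return scores
```

A score near 1.0 at early layers that decays toward deeper layers would mirror the forgetting pattern the authors report.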
- Prompt Reinjection Design:
  - Choose a "source" layer (typically an early layer where the prompt representation is still strong).
  - At every subsequent layer, concatenate or add the source prompt embedding to the current text-token embeddings.
  - The operation is performed only at inference time, leaving the trained weights untouched.
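The reinjection rule can be sketched as a weighted additive blend. This is a minimal illustration under assumed choices (the function name, the fixed `alpha` weight, and the list-of-arrays interface are all assumptions, not the authors' API):

```python
import numpy as np

def reinject_prompt(layer_states, source_layer=2, alpha=0.5):
    """Copy the 'source' layer's prompt embedding into every deeper
    layer via a weighted add, leaving earlier layers untouched.

    layer_states: per-layer text-token embeddings, each (tokens, dim).
    alpha: reinjection strength; alpha=0 recovers the baseline model.
    """
    source = layer_states[source_layer]
    out = [state.copy() for state in layer_states]
    for i in range(source_layer + 1, len(out)):
        out[i] = (1 - alpha) * out[i] + alpha * source
    return out
```

In a real pipeline this blend would run inside the forward pass of each deeper text-branch block at inference time, with the trained weights left unchanged.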
- Evaluation Protocol:
  - GenEval (general text-to-image generation), DPG (directed prompt generation), and T2I-CompBench++ (compositional benchmark) assess instruction adherence, aesthetic scores (e.g., CLIP-based preference), and standard image-quality metrics (FID, IS).
Results & Findings
| Model | Metric | Baseline | Reinjection | Δ |
|---|---|---|---|---|
| SD3 | CLIP-Score | 0.68 | 0.74 | +0.06 |
| SD3.5 | Human preference | 62% | 71% | +9 pp |
| FLUX.1 | FID (lower is better) | 28 | 24 | −4 |
- Instruction Following: Prompt reinjection raises the proportion of images that correctly reflect nuanced prompts (e.g., "a cat wearing a vintage astronaut helmet") by 8–12%.
- Aesthetic & Preference Gains: Human raters consistently prefer reinjection outputs, indicating that the technique improves both relevance and visual appeal.
- Cross‑Model Consistency: All three MMDiTs benefit, confirming that prompt forgetting is a general phenomenon rather than an architecture‑specific bug.
Practical Implications
- Instant Upgrade for Existing Services: Companies running SD3/FLUX‑based APIs can integrate prompt reinjection with a single line of code, delivering sharper, more faithful images without retraining.
- Better User Experience in Creative Apps: Designers and marketers who rely on precise textual cues (e.g., “minimalist logo with teal accents”) will see fewer off‑target results, reducing iteration cycles.
- Improved Safety & Alignment: By keeping the model anchored to the original prompt, the risk of unintended or harmful content drift is lowered—important for moderation pipelines.
- Foundation for Future Research: The reinjection idea could inspire similar “memory‑preserving” tricks in other multimodal transformers (e.g., video generation, audio‑text synthesis).
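To make the "instant upgrade" claim concrete, here is one way an inference-only wrapper could slot around an existing block stack. Everything here is hypothetical: the toy `TextBlock` stands in for a real MMDiT text-branch layer, and the class and parameter names are illustrative, not any library's API.

```python
class TextBlock:
    """Toy stand-in for one MMDiT text-branch block; the 0.9 factor
    mimics the gradual decay of the prompt signal across depth."""
    def __call__(self, tokens):
        return [t * 0.9 for t in tokens]

class ReinjectedStack:
    """Wrap a list of blocks: cache the source layer's output and
    blend it into every deeper layer's output at inference time."""
    def __init__(self, blocks, source_layer=0, alpha=0.5):
        self.blocks = blocks
        self.source_layer = source_layer
        self.alpha = alpha

    def __call__(self, tokens):
        cache = None
        for i, block in enumerate(self.blocks):
            tokens = block(tokens)
            if i == self.source_layer:
                cache = list(tokens)  # remember the still-strong prompt
            elif cache is not None:
                tokens = [(1 - self.alpha) * t + self.alpha * c
                          for t, c in zip(tokens, cache)]
        return tokens
```

Because no weights change, such a wrapper could be toggled per request, which is what makes training-free deployment attractive for running services.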
Limitations & Future Work
- Layer Selection Heuristics: The current approach picks a fixed early layer; adaptive selection based on prompt complexity could yield further gains.
- Potential Over‑reinforcement: For very short or ambiguous prompts, repeatedly injecting the same embedding might amplify noise; balancing reinforcement strength remains an open question.
- Evaluation Scope: While benchmarks cover a range of prompts, real‑world usage (e.g., multi‑sentence instructions, interactive editing) needs deeper study.
- Extending Beyond Diffusion Transformers: The authors suggest exploring prompt reinjection in autoregressive multimodal models and in multimodal retrieval systems.
Bottom line: Prompt reinjection shines a light on a hidden weakness of modern text‑to‑image diffusion models and offers a plug‑and‑play fix that boosts fidelity, aesthetics, and safety—all without the cost of retraining. For developers building the next generation of AI‑powered creative tools, it’s a low‑effort upgrade worth trying right away.
Authors
- Yuxuan Yao
- Yuxuan Chen
- Hui Li
- Kaihui Cheng
- Qipeng Guo
- Yuwei Sun
- Zilong Dong
- Jingdong Wang
- Siyu Zhu
Paper Information
- arXiv ID: 2602.06886v1
- Categories: cs.CV
- Published: February 6, 2026