[Paper] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
Source: arXiv - 2602.06886v1
Overview
The paper “Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers” uncovers a subtle but important flaw in today’s state‑of‑the‑art text‑to‑image models (e.g., SD3, SD3.5, FLUX.1). As the diffusion process proceeds, the models gradually “forget” the original textual prompt, which can lead to images that drift away from the user’s intent. The authors propose a training‑free “prompt reinjection” technique that restores the prompt’s influence in deeper layers, yielding noticeably better alignment between instructions and generated images.
Key Contributions
- Identification of Prompt Forgetting: Empirical analysis shows that the semantic strength of the prompt representation decays across the text branch of Multimodal Diffusion Transformers (MMDiTs).
- Prompt Reinjection Mechanism: A simple, inference‑only method that copies early‑layer prompt embeddings into later layers, effectively “reminding” the model of the original instruction.
- Broad Empirical Validation: Experiments on three benchmark suites (GenEval, DPG, T2I‑CompBench++) demonstrate consistent improvements in instruction following, aesthetic preference, and overall generation quality across three major MMDiT families.
- Training‑Free Deployment: The technique requires no extra fine‑tuning or additional parameters, making it instantly applicable to existing pipelines.
Methodology
- Probing Prompt Representations:
  - The authors extract the hidden states of the text branch at each diffusion step for three popular MMDiTs.
  - Linguistic probes (e.g., part-of-speech, sentiment, and semantic-similarity classifiers) quantify how much of the original prompt's meaning survives as depth increases.
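The probing step can be approximated with a simple similarity measure. The sketch below is a hypothetical helper, not the paper's code: it scores each layer's text-token embeddings against the first layer's, so a falling score is a rough stand-in for the decay the probes detect.

```python
import numpy as np

def prompt_retention(layer_states):
    """Rough proxy for the paper's linguistic probes: mean cosine
    similarity between each layer's text-token embeddings and the
    first layer's.  A score that falls with depth signals that the
    prompt representation is weakening ("prompt forgetting").

    layer_states: list of (tokens, dim) arrays, one per layer.
    """
    ref = layer_states[0]
    scores = []
    for state in layer_states:
        num = (ref * state).sum(axis=1)
        den = np.linalg.norm(ref, axis=1) * np.linalg.norm(state, axis=1) + 1e-8
        scores.append(float(np.mean(num / den)))
    return scores
```

A score near 1.0 at early layers that decays toward deeper layers would mirror the forgetting pattern the authors report.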
- Prompt Reinjection Design:
  - Choose a "source" layer (typically an early layer where the prompt representation is still strong).
  - At every subsequent layer, concatenate or add the source prompt embedding to the current text-token embeddings.
  - The operation is performed only at inference time, leaving the trained weights untouched.
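The reinjection rule can be sketched as a weighted additive blend. This is a minimal illustration under assumed choices (the function name, the fixed `alpha` weight, and the list-of-arrays interface are all assumptions, not the authors' API):

```python
import numpy as np

def reinject_prompt(layer_states, source_layer=2, alpha=0.5):
    """Copy the 'source' layer's prompt embedding into every deeper
    layer via a weighted add, leaving earlier layers untouched.

    layer_states: per-layer text-token embeddings, each (tokens, dim).
    alpha: reinjection strength; alpha=0 recovers the baseline model.
    """
    source = layer_states[source_layer]
    out = [state.copy() for state in layer_states]
    for i in range(source_layer + 1, len(out)):
        out[i] = (1 - alpha) * out[i] + alpha * source
    return out
```

In a real pipeline this blend would run inside the forward pass of each deeper text-branch block at inference time, with the trained weights left unchanged.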
- Evaluation Protocol:
  - GenEval (general text-to-image generation), DPG (directed prompt generation), and T2I-CompBench++ (compositional benchmark) assess instruction adherence, aesthetic scores (e.g., CLIP-based preference), and standard image-quality metrics (FID, IS).
Results & Findings
| Model | Metric | Baseline | Reinjection | Δ |
|---|---|---|---|---|
| SD3 | CLIP-Score | 0.68 | 0.74 | +0.06 |
| SD3.5 | Human preference | 62% | 71% | +9 pp |
| FLUX.1 | FID (lower is better) | 28 | 24 | −4 |
- Instruction Following: Prompt reinjection raises the proportion of images that correctly reflect nuanced prompts (e.g., "a cat wearing a vintage astronaut helmet") by 8–12%.
- Aesthetic & Preference Gains: Human raters consistently prefer reinjection outputs, indicating that the technique improves both relevance and visual appeal.
- Cross‑Model Consistency: All three MMDiTs benefit, confirming that prompt forgetting is a general phenomenon rather than an architecture‑specific bug.
Practical Implications
- Instant Upgrade for Existing Services: Companies running SD3/FLUX‑based APIs can integrate prompt reinjection with a single line of code, delivering sharper, more faithful images without retraining.
- Better User Experience in Creative Apps: Designers and marketers who rely on precise textual cues (e.g., “minimalist logo with teal accents”) will see fewer off‑target results, reducing iteration cycles.
- Improved Safety & Alignment: By keeping the model anchored to the original prompt, the risk of unintended or harmful content drift is lowered—important for moderation pipelines.
- Foundation for Future Research: The reinjection idea could inspire similar “memory‑preserving” tricks in other multimodal transformers (e.g., video generation, audio‑text synthesis).
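To make the "instant upgrade" claim concrete, here is one way an inference-only wrapper could slot around an existing block stack. Everything here is hypothetical: the toy `TextBlock` stands in for a real MMDiT text-branch layer, and the class and parameter names are illustrative, not any library's API.

```python
class TextBlock:
    """Toy stand-in for one MMDiT text-branch block; the 0.9 factor
    mimics the gradual decay of the prompt signal across depth."""
    def __call__(self, tokens):
        return [t * 0.9 for t in tokens]

class ReinjectedStack:
    """Wrap a list of blocks: cache the source layer's output and
    blend it into every deeper layer's output at inference time."""
    def __init__(self, blocks, source_layer=0, alpha=0.5):
        self.blocks = blocks
        self.source_layer = source_layer
        self.alpha = alpha

    def __call__(self, tokens):
        cache = None
        for i, block in enumerate(self.blocks):
            tokens = block(tokens)
            if i == self.source_layer:
                cache = list(tokens)  # remember the still-strong prompt
            elif cache is not None:
                tokens = [(1 - self.alpha) * t + self.alpha * c
                          for t, c in zip(tokens, cache)]
        return tokens
```

Because no weights change, such a wrapper could be toggled per request, which is what makes training-free deployment attractive for running services.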
Limitations & Future Work
- Layer Selection Heuristics: The current approach picks a fixed early layer; adaptive selection based on prompt complexity could yield further gains.
- Potential Over‑reinforcement: For very short or ambiguous prompts, repeatedly injecting the same embedding might amplify noise; balancing reinforcement strength remains an open question.
- Evaluation Scope: While benchmarks cover a range of prompts, real‑world usage (e.g., multi‑sentence instructions, interactive editing) needs deeper study.
- Extending Beyond Diffusion Transformers: The authors suggest exploring prompt reinjection in autoregressive multimodal models and in multimodal retrieval systems.
Bottom line: Prompt reinjection shines a light on a hidden weakness of modern text‑to‑image diffusion models and offers a plug‑and‑play fix that boosts fidelity, aesthetics, and safety—all without the cost of retraining. For developers building the next generation of AI‑powered creative tools, it’s a low‑effort upgrade worth trying right away.
Authors
- Yuxuan Yao
- Yuxuan Chen
- Hui Li
- Kaihui Cheng
- Qipeng Guo
- Yuwei Sun
- Zilong Dong
- Jingdong Wang
- Siyu Zhu
Paper Information
- arXiv ID: 2602.06886v1
- Categories: cs.CV
- Published: February 6, 2026