[Paper] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Source: arXiv - 2512.08922v1
Overview
The paper introduces UniT, a unified framework that combines diffusion‑based image generation, vision‑language understanding, and OCR‑style text spotting to restore images whose textual content has been corrupted (e.g., blurry, low‑resolution, or noisy scans). By feeding explicit linguistic cues back into the diffusion denoising loop, UniT dramatically reduces the “text hallucination” problem that plagues generic diffusion restorers, delivering crisp, readable text in the output.
Key Contributions
- Unified Diffusion Transformer (DiT) + Vision‑Language Model (VLM) + Text Spotting Module (TSM): A tightly coupled pipeline where each component informs the others during iterative denoising.
- Explicit textual guidance: The VLM extracts semantic text from the degraded input and injects it as conditioning signals for the diffusion process.
- Iterative OCR feedback: TSM predicts intermediate OCR results from diffusion features at every denoising step, allowing the VLM to refine its guidance on‑the‑fly.
- State‑of‑the‑art performance: On the SA‑Text and Real‑Text benchmarks, UniT achieves the highest end‑to‑end F1 scores for text‑aware image restoration, while markedly cutting down hallucinated characters.
- Generalizable architecture: Individual components can be swapped for other diffusion backbones or language models, making the design a reusable building block for any text‑centric restoration task.
Methodology
Input & Degradation
- The system receives a low‑quality image (e.g., compressed, blurred, or partially occluded) that contains textual regions.
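For context, here is a minimal sketch (not from the paper) of how such degradations are commonly simulated when building low-quality/high-quality training pairs; it assumes standard OpenCV and NumPy operations and illustrative parameter values.

```python
import cv2
import numpy as np

def degrade(image: np.ndarray, blur_ksize: int = 7, jpeg_quality: int = 25,
            noise_sigma: float = 8.0) -> np.ndarray:
    """Simulate a low-quality scan: blur, JPEG compression, and sensor noise."""
    # Gaussian blur smears the fine strokes of small characters.
    out = cv2.GaussianBlur(image, (blur_ksize, blur_ksize), 0)
    # A JPEG round-trip introduces block artifacts around text edges.
    _, buf = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    out = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    # Additive Gaussian noise mimics low-light capture.
    noise = np.random.normal(0, noise_sigma, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```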
Diffusion Transformer (DiT)
- A latent‑space diffusion model that progressively denoises a noisy latent representation of the image.
- Unlike vanilla diffusion, DiT accepts conditioning tokens that carry extra information beyond pure pixel statistics.
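The conditioning mechanism can be pictured as a transformer block that attends jointly over image-latent tokens and extra guidance tokens. The sketch below is an illustrative PyTorch stand-in, not the paper's exact DiT block; the dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """One transformer block that mixes conditioning tokens into the image
    latent via joint self-attention (a common DiT-style design)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, latent_tokens: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate conditioning tokens so attention can inject text guidance
        # into every latent patch.
        x = torch.cat([latent_tokens, cond_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        # Keep only the image-latent positions for the next block.
        return x[:, : latent_tokens.shape[1]]
```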
Vision‑Language Model (VLM)
- A pretrained multimodal encoder (e.g., CLIP or BLIP) that processes the current noisy image estimate and extracts a textual embedding describing the visible characters.
- This embedding is turned into a set of guidance tokens that are concatenated to the diffusion transformer’s input at each step.
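Since CLIP and BLIP are named only as examples, the following sketch uses Hugging Face's CLIP vision encoder plus a hypothetical learned projection to produce guidance tokens; the checkpoint name, token count, and projection layer are assumptions, not the paper's components.

```python
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

class TextGuidanceEncoder(nn.Module):
    """Turns the current image estimate into guidance tokens for the DiT.
    Hypothetical glue code: a CLIP vision encoder stands in for the VLM."""

    def __init__(self, num_tokens: int = 8, dit_dim: int = 768):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        clip_dim = self.clip.config.projection_dim  # 512 for this checkpoint
        self.proj = nn.Linear(clip_dim, num_tokens * dit_dim)
        self.num_tokens, self.dit_dim = num_tokens, dit_dim

    def forward(self, pil_image):
        inputs = self.processor(images=pil_image, return_tensors="pt")
        feat = self.clip.get_image_features(**inputs)            # (1, clip_dim)
        tokens = self.proj(feat).view(1, self.num_tokens, self.dit_dim)
        return tokens  # concatenated with the DiT input tokens at each step
```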
Text Spotting Module (TSM)
- A lightweight OCR head trained on diffusion feature maps.
- At every denoising iteration it predicts an intermediate transcription (character‑level or word‑level).
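A lightweight OCR head of this kind could look like the sketch below; the layer sizes, CTC-style output, and vocabulary size are assumptions, and the paper's actual TSM architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class TextSpottingHead(nn.Module):
    """Lightweight OCR head over diffusion feature maps (illustrative only)."""

    def __init__(self, feat_dim: int = 768, num_chars: int = 97):  # 96 chars + CTC blank
        super().__init__()
        self.reduce = nn.Conv2d(feat_dim, 256, kernel_size=1)
        self.rnn = nn.GRU(256, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(256, num_chars)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) diffusion features. Collapse height, read left to right.
        x = self.reduce(feat).mean(dim=2)           # (B, 256, W)
        x = x.permute(0, 2, 1)                      # (B, W, 256)
        x, _ = self.rnn(x)                          # (B, W, 256) bidirectional
        return self.classifier(x).log_softmax(-1)   # per-position char log-probs (CTC-style)
```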
Iterative Loop
At each denoising step k:
- DiT produces a slightly less noisy latent.
- The VLM reads the latent and outputs a textual embedding.
- The TSM reads the same latent and outputs a provisional OCR string.
- The OCR string is fed back (via tokenization) to the VLM, sharpening its textual embedding.
- The refined embedding is injected back into the DiT for the next denoising step.
This closed loop continues until the diffusion process converges, yielding a high‑fidelity image whose text matches the original content.
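Putting the pieces together, the loop can be summarized at the interface level as follows; `dit`, `vlm`, `tsm`, and `decode` are hypothetical stand-ins for the paper's components, and their method names are illustrative rather than the actual API.

```python
def restore_with_text_feedback(dit, vlm, tsm, decode, noisy_latent, num_steps=50):
    """Closed-loop restoration sketch with iterative OCR feedback."""
    latent = noisy_latent
    guidance = vlm.encode(decode(latent))  # initial textual guidance from the degraded input
    for step in reversed(range(num_steps)):
        # 1. One DiT denoising update, conditioned on the current text guidance.
        latent = dit.denoise_step(latent, guidance, step)
        # 2. Provisional OCR read from the current diffusion features.
        ocr_text = tsm.transcribe(latent)
        # 3. Feed the OCR string back to the VLM to sharpen its guidance.
        guidance = vlm.encode(decode(latent), hint=ocr_text)
    return decode(latent)  # final image with restored, readable text
```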
Results & Findings
| Dataset | End‑to‑end F1 ↑ | Hallucination Rate ↓ | Visual Quality (PSNR / SSIM) |
|---|---|---|---|
| SA‑Text (synthetic) | 0.92 (↑ +7.4% vs. prior SOTA) | 0.03 (↓ 45%) | 31.8 dB / 0.94 |
| Real‑Text (real‑world scans) | 0.88 (↑ +6.1%) | 0.05 (↓ 38%) | 29.5 dB / 0.91 |
- Text fidelity: The OCR‑derived F1 scores show that UniT restores the exact characters far more reliably than diffusion‑only baselines.
- Hallucination suppression: By conditioning on explicit text, the model avoids inventing characters that never existed in the source.
- Ablation: Removing the TSM feedback loop drops F1 by ~4 points, confirming the importance of iterative OCR guidance.
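To make these metrics concrete, here is one simple way a character-level F1 and a hallucination rate could be computed from an OCR readout of the restored image; the paper's exact end-to-end definitions (which couple spotting with recognition) may differ.

```python
from collections import Counter

def char_f1_and_hallucination(pred: str, truth: str):
    """Illustrative character-level metrics (not the paper's exact formulas)."""
    pred_counts, truth_counts = Counter(pred), Counter(truth)
    overlap = sum((pred_counts & truth_counts).values())  # matched characters
    precision = overlap / max(len(pred), 1)
    recall = overlap / max(len(truth), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    hallucination_rate = 1.0 - precision  # predicted chars with no support in the ground truth
    return f1, hallucination_rate

print(char_f1_and_hallucination("OPEN 24 HOURS", "OPEN 24 HOURS"))  # (1.0, 0.0)
print(char_f1_and_hallucination("OPEN 2A HOURS", "OPEN 24 HOURS"))  # lower F1, non-zero hallucination
```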
Practical Implications
- Document digitization pipelines: Companies scanning legacy paperwork can plug UniT into existing OCR workflows to boost recognition accuracy on noisy or low‑resolution scans without manual re‑annotation.
- Augmented reality (AR) overlays: Real‑time text restoration on camera feeds (e.g., reading faded signs) becomes feasible, improving readability for translation or accessibility apps.
- Content moderation & forensic analysis: Restoring obscured text in images (e.g., watermarks, blurred license plates) can aid automated analysis while preserving evidentiary integrity.
- Developer‑friendly integration: Because UniT’s components are modular (DiT, VLM, TSM), developers can replace any part with a preferred model (e.g., Stable Diffusion, OpenAI CLIP) and still benefit from the iterative guidance mechanism.
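As an example of the document-digitization use case, a UniT-style restorer can sit in front of an unchanged OCR engine. In the sketch below, `restore` is a hypothetical callable wrapping the restoration model, and pytesseract stands in for whatever OCR backend an existing pipeline already uses.

```python
import pytesseract
from PIL import Image

def digitize(path: str, restore) -> str:
    """Drop-in use of a text-aware restorer ahead of an existing OCR step."""
    scan = Image.open(path)
    cleaned = restore(scan)                      # text-aware restoration of the noisy scan
    return pytesseract.image_to_string(cleaned)  # downstream OCR stays unchanged
```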
Limitations & Future Work
- Computation cost: Running a diffusion model with multiple conditioning passes and OCR feedback is GPU‑intensive; real‑time deployment may require model distillation or pruning.
- Language coverage: The current VLM and TSM are trained primarily on English text; extending to multilingual or handwritten scripts will need additional data and possibly different tokenizers.
- Robustness to extreme degradation: When the input image is severely corrupted (e.g., >70 % pixel loss), the VLM struggles to extract any reliable textual cue, limiting restoration quality.
- Future directions: The authors suggest exploring lightweight diffusion alternatives, integrating large‑scale language models for richer semantic guidance, and expanding the framework to video‑frame restoration where temporal consistency of text matters.
Authors
- Jin Hyeon Kim
- Paul Hyunbin Cho
- Claire Kim
- Jaewon Min
- Jaeeun Lee
- Jihye Park
- Yeji Choi
- Seungryong Kim
Paper Information
- arXiv ID: 2512.08922v1
- Categories: cs.CV
- Published: December 9, 2025