[Paper] MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance
Source: arXiv - 2512.08789v1
Overview
The paper introduces MatteViT, a new deep‑learning framework that removes shadows from scanned or photographed documents while keeping the crispness of text and line art. By blending spatial cues with frequency‑domain processing, the authors achieve state‑of‑the‑art results that translate into better OCR accuracy—a win for any workflow that relies on clean digital documents.
Key Contributions
- Matte Vision Transformer (MatteViT): a transformer‑based architecture that jointly exploits spatial information and high‑frequency details for shadow removal.
- High‑Frequency Amplification Module (HFAM): a lightweight plug‑in that isolates and adaptively boosts high‑frequency components (edges, strokes) before reconstruction.
- Continuous luminance‑based shadow matte: a novel, densely‑valued shadow mask generated from a custom matte dataset, providing precise guidance from the first network layer.
- Comprehensive benchmark evaluation: achieves new best scores on the RDD and Kligler shadow‑document datasets, and demonstrably improves downstream OCR performance.
Methodology
- Input preprocessing – The raw document image is fed into a shadow matte generator that predicts a continuous matte (a per-pixel shadow-intensity map). This matte acts as a soft "shadow stencil" that tells the network where shadows are strongest.
- High-frequency extraction – Using a simple wavelet-like decomposition, the image is split into low-frequency (overall illumination) and high-frequency (edges, fine text) components.
- HFAM – The high-frequency branch passes through the High-Frequency Amplification Module, which learns pixel-wise scaling factors to selectively enhance faint edges that shadows have dulled.
- Transformer backbone – The matte-guided low-frequency map and the amplified high-frequency map are concatenated and processed by a Vision Transformer; the self-attention mechanism lets the model reason globally about illumination while still preserving local detail.
- Reconstruction – The transformer outputs a shadow-free image by recombining the refined low- and high-frequency streams. The whole pipeline is trained end-to-end with a combination of L1, perceptual, and matte-consistency losses.
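The frequency-split-and-amplify idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a single-level Haar-style block decomposition in place of the paper's wavelet-like split, and a hand-set matte-weighted gain in place of HFAM's learned per-pixel scaling factors.

```python
import numpy as np

def haar_split(img):
    """Single-level Haar-style decomposition of a 2D image into a
    low-frequency approximation and a high-frequency residual.
    (Illustrative stand-in for the paper's wavelet-like split.)"""
    h, w = img.shape
    h2, w2 = h // 2 * 2, w // 2 * 2  # crop to even dimensions
    img = img[:h2, :w2]
    # 2x2 block averages form the low-frequency (illumination) band
    low = img.reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
    # Upsample the low band and subtract: the remainder is high frequency
    low_up = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)
    high = img - low_up
    return low_up, high

def hfam_amplify(high, matte, base_gain=1.5):
    """Toy stand-in for HFAM: boost high-frequency detail more strongly
    where the shadow matte (values in [0, 1]) marks dense shadow.
    In the paper these per-pixel gains are learned, not hand-set."""
    gain = 1.0 + (base_gain - 1.0) * matte  # stronger boost inside shadow
    return high * gain

# Tiny demo on a synthetic "document" whose right half is in shadow
img = np.ones((8, 8))
img[:, 4:] *= 0.5            # shadowed region is darker
matte = np.zeros((8, 8))
matte[:, 4:] = 1.0           # matte marks the shadowed region
low, high = haar_split(img)
restored = low + hfam_amplify(high, matte)
```

With a gain of 1 everywhere, `low + high` reconstructs the input exactly, which is the property the reconstruction stage relies on; the matte then steers extra amplification only into shadowed areas.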
Results & Findings
- Quantitative gains: MatteViT reduces the mean absolute error (MAE) on the RDD benchmark by ~12 % and improves PSNR/SSIM over the previous best method by 1.8 dB / 0.03, respectively.
- OCR boost: When the cleaned documents are fed to Tesseract and a modern deep OCR model, character error rates drop by 9 % and 7 % compared to the strongest baseline.
- Ablation studies: Removing the HFAM or the continuous matte each degrades performance by ~5 % in MAE, confirming that both high‑frequency amplification and matte guidance are essential.
- Speed: HFAM adds < 2 ms per 512 × 512 image on a single RTX 3080, keeping overall inference time under 50 ms, fast enough for real-time scanning apps.
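For readers wanting to reproduce the kind of comparison reported above, the two image-quality metrics cited (MAE and PSNR) are standard and easy to compute. A minimal NumPy sketch using the textbook definitions follows; this is not the authors' evaluation code.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between a restored image and ground truth."""
    return np.mean(np.abs(pred - target))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a restoration that is off by a uniform 0.1 everywhere
target = np.full((4, 4), 0.5)
pred = target + 0.1
# mae(pred, target)  -> 0.1
# psnr(pred, target) -> 10 * log10(1 / 0.01) ≈ 20 dB
```

SSIM, the third metric cited, is more involved (windowed luminance/contrast/structure comparison) and is typically taken from a library such as scikit-image rather than reimplemented.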
Practical Implications
- Document digitization pipelines – Integrating MatteViT can dramatically improve the quality of scanned archives, legal documents, and receipts, reducing manual cleanup.
- Mobile scanning apps – The lightweight HFAM and efficient transformer design make it feasible to run on modern smartphones, delivering near‑instant shadow removal for users.
- Improved downstream AI – Cleaner inputs boost the reliability of OCR, layout analysis, and even downstream NLP tasks that ingest scanned text.
- Enterprise automation – Companies that automate invoice processing or contract analysis can expect higher extraction accuracy and lower error‑handling costs.
Limitations & Future Work
- Dataset bias – The custom matte dataset focuses on typical office lighting; performance on extreme outdoor shadows or highly textured paper may degrade.
- Model size – While inference is fast, the transformer backbone still requires ~120 MB of GPU memory, which could be a hurdle for low‑end edge devices.
- Future directions – The authors suggest exploring knowledge distillation to shrink the model, extending the matte generation to handle colored shadows, and adapting the framework to video streams for continuous document capture.
Authors
- Chaewon Kim
- Seoyeon Lee
- Jonghyuk Park
Paper Information
- arXiv ID: 2512.08789v1
- Categories: cs.CV, cs.AI
- Published: December 9, 2025