[Paper] InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Published: June 3, 2026 at 12:30 PM EDT

Source: arXiv

Overview

InstantRetouch tackles language‑guided photo editing with a method that follows text instructions and runs in real time. By moving the heavy lifting from pixel‑level diffusion to a compact bilateral‑grid representation, the authors preserve the content of the input image (no unwanted content drift) while cutting latency sharply, which makes the technique practical for everyday developer tools and consumer apps.

Key Contributions

Bilateral‑grid based retouching: Predicts a low‑resolution grid of affine transforms that are sliced and applied to the full‑resolution image, preserving geometry and texture.
Variational Score Distillation (VSD): A novel distillation pipeline that transfers the strong priors of a multi‑step diffusion model into the lightweight grid framework.
Prompt‑alignment loss: Ensures the generated edits follow natural‑language instructions accurately.
Comprehensive benchmark: Introduces a new evaluation suite covering fidelity, instruction adherence, and runtime efficiency.
State‑of‑the‑art performance: Beats recent diffusion‑based retouchers (e.g., Gemini‑2.5‑Flash) on content preservation and latency while delivering comparable visual quality.

Methodology

Bilateral Space Representation – Instead of editing each pixel or the latent vector of a diffusion model, the system predicts a coarse bilateral grid (think of a 3‑D lookup table) where each cell stores an affine color transform.
Guidance Map Slicing – A learned guidance map determines how to slice the grid for each pixel, effectively selecting the right transform based on local content (edges, textures, etc.).
Application to Full‑Resolution Image – The sliced transforms are applied back to the original high‑resolution image, yielding a retouched output without any down‑sampling artifacts.
Distillation from Diffusion – A pre‑trained diffusion model (the “teacher”) generates high‑quality retouch examples. Using Variational Score Distillation, the student grid model learns to mimic the teacher’s score function, inheriting its aesthetic priors while staying lightweight.
Instruction Alignment – A contrastive loss aligns the textual prompt embedding with the predicted grid, encouraging the model to respect the user’s natural‑language command (e.g., “make the sky warmer”).
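
The first three steps above (predict a coarse grid, slice it with a guidance map, apply the result at full resolution) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `slice_grid` and `apply_affine`, the grid shape (Gh, Gw, Gd, 3, 4), the luminance guide standing in for the learned guidance map, and the hand‑built grid are all assumptions made for the example.

```python
import numpy as np

def slice_grid(grid, guide):
    """Trilinearly sample a (Gh, Gw, Gd, 3, 4) grid at every pixel.

    guide is an (H, W) map in [0, 1]; it picks the depth coordinate, so pixels
    with different guide values can receive different affine transforms.
    """
    sizes = grid.shape[:3]
    h, w = guide.shape
    # Continuous grid coordinates per pixel (pixel centres mapped to cell centres).
    yy, xx = np.meshgrid((np.arange(h) + 0.5) / h * sizes[0] - 0.5,
                         (np.arange(w) + 0.5) / w * sizes[1] - 0.5,
                         indexing="ij")
    zz = guide * sizes[2] - 0.5
    coords = [np.clip(c, 0, n - 1) for c, n in zip((yy, xx, zz), sizes)]
    lo = [np.floor(c).astype(int) for c in coords]
    hi = [np.minimum(l + 1, n - 1) for l, n in zip(lo, sizes)]
    frac = [c - l for c, l in zip(coords, lo)]

    coeffs = np.zeros((h, w, 3, 4))
    for corner in np.ndindex(2, 2, 2):  # the 8 neighbouring grid cells
        idx = [hi[a] if corner[a] else lo[a] for a in range(3)]
        wgt = np.ones((h, w))
        for a in range(3):
            wgt = wgt * (frac[a] if corner[a] else 1.0 - frac[a])
        coeffs += wgt[..., None, None] * grid[idx[0], idx[1], idx[2]]
    return coeffs

def apply_affine(image, coeffs):
    """Apply a per-pixel 3x4 affine colour transform to an (H, W, 3) image."""
    ones = np.ones(image.shape[:2] + (1,))
    rgb1 = np.concatenate([image, ones], axis=-1)  # homogeneous colour, (H, W, 4)
    return np.einsum("hwij,hwj->hwi", coeffs, rgb1)

rng = np.random.default_rng(0)
image = rng.random((270, 480, 3))                 # stand-in for a full-resolution photo
guide = image @ np.array([0.299, 0.587, 0.114])   # luminance as a placeholder guide

grid = np.zeros((16, 16, 8, 3, 4))                # hypothetical grid resolution
grid[..., :3, :3] = np.eye(3)                     # start from the identity transform
grid[:, :, :3, :3, :3] *= 1.4                     # brighten only the darkest guide bins

retouched = apply_affine(image, slice_grid(grid, guide))
print(retouched.shape)
```

Because each output pixel is a colour transform of the input pixel at the same location, this operation cannot move or invent content, which is consistent with the geometry‑preserving behaviour described above.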

The whole pipeline runs in a single forward pass, eliminating the iterative sampling that makes diffusion retouching slow.
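
The summary does not give the form of the instruction‑alignment loss. As one plausible reading of "contrastive", the sketch below uses a symmetric InfoNCE loss (the CLIP‑style objective) over a batch of prompt embeddings and grid embeddings. The function name `info_nce`, the temperature of 0.07, and the premise that each predicted grid is projected into the same space as the text embedding are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def info_nce(text_emb, grid_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each (B, D) input is a matching pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = grid_emb / np.linalg.norm(grid_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature  # (B, B); matching pairs sit on the diagonal

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-grid and grid-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
prompts = rng.normal(size=(8, 32))                        # hypothetical prompt embeddings
aligned = prompts + 0.05 * rng.normal(size=prompts.shape)  # grids that match their prompts
shuffled = np.roll(aligned, 1, axis=0)                    # grids paired with the wrong prompt

print("aligned loss: ", info_nce(prompts, aligned))
print("shuffled loss:", info_nce(prompts, shuffled))
```

The loss is small when each grid embedding is closest to its own prompt and grows when the pairing is wrong, which is the behaviour an alignment term needs in order to penalise edits that ignore the instruction.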

Results & Findings

Metric	InstantRetouch	Gemini‑2.5‑Flash (Nano‑Banana)	Diffusion‑Only Baseline
FID (fidelity, lower is better)	0.87	1.34	1.12
Instruction accuracy (CLIP‑based)	92%	78%	84%
Latency per 1080p image	≈ 45 ms (GPU)	320 ms	1,200 ms
Content drift	Negligible	Noticeable artifacts	Moderate

Visual quality: Users reported that InstantRetouch’s edits look as natural as diffusion outputs but without the occasional “hallucinated” details.
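Speed: The bilateral‑grid approach yields a 7×‑10× speedup over competing diffusion methods, making it viable for interactive UI. By the latencies in the table above, the gap is about 7× against Gemini‑2.5‑Flash (320 ms versus 45 ms) and larger still against the 1.2 s diffusion‑only baseline.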
Instruction fidelity: The prompt‑alignment loss significantly improves the model’s ability to follow nuanced language cues.
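
To make the latency comparison concrete, the short sketch below recomputes ratios and throughput from the latencies reported in the table. The helper names `speedup` and `frames_per_second` are made up for this example; the numbers are the ones quoted above.

```python
# Per-image latencies in milliseconds, as reported in the results table.
latency_ms = {
    "InstantRetouch": 45.0,
    "Gemini-2.5-Flash": 320.0,
    "Diffusion-only baseline": 1200.0,
}

def speedup(baseline_ms, ours_ms):
    """How many times longer the baseline takes than our method."""
    return baseline_ms / ours_ms

def frames_per_second(ms_per_image):
    """Images processed per second at a given per-image latency."""
    return 1000.0 / ms_per_image

ours = latency_ms["InstantRetouch"]
for name, ms in latency_ms.items():
    print(f"{name}: {frames_per_second(ms):.1f} images/s, "
          f"{speedup(ms, ours):.1f}x InstantRetouch's latency")
```

At 45 ms per image the model processes roughly 22 images per second, which is the figure that matters for an interactive slider or text box.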

Practical Implications

Real‑time photo editors: Mobile and web apps can now offer AI‑driven retouching that reacts instantly to voice or text commands, opening up new UX possibilities (e.g., “brighten my portrait” with immediate feedback).
Batch processing pipelines: Studios can integrate InstantRetouch into automated asset pipelines, achieving consistent color grading across thousands of images without the compute cost of full diffusion.
Edge‑device deployment: The lightweight grid model (≈ 10 MB) fits comfortably on modern smartphones and even some embedded GPUs, enabling on‑device privacy‑preserving editing.
Extension to video: Because the bilateral grid operates per‑frame and preserves geometry, it can be adapted for temporally consistent video retouching with minimal additional overhead.
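
The ≈ 10 MB figure and the idea of a coarse grid can be put in rough numbers. The sketch below is back‑of‑the‑envelope arithmetic only: it assumes 1 MB = 1,000,000 bytes, that the whole file is weights, and a hypothetical 16×16×8 grid of 3×4 affine transforms (the summary does not state the grid resolution). The helper names are likewise invented for the example.

```python
def params_from_size(size_mb, bytes_per_param):
    """Parameters that fit in size_mb megabytes (1 MB = 1,000,000 bytes)."""
    return int(size_mb * 1_000_000 // bytes_per_param)

def grid_coefficients(gh, gw, gd, coeffs_per_cell=12):
    """Number of values in a bilateral grid of 3x4 affine transforms."""
    return gh * gw * gd * coeffs_per_cell

print("fp32 parameters in 10 MB:", params_from_size(10, 4))
print("fp16 parameters in 10 MB:", params_from_size(10, 2))

grid_values = grid_coefficients(16, 16, 8)  # hypothetical grid size
image_values = 1920 * 1080 * 3              # RGB values in one 1080p image
print("grid values: ", grid_values)
print("image values:", image_values)
print("image/grid ratio:", image_values / grid_values)
```

Under these assumptions, 10 MB corresponds to about 2.5 million fp32 or 5 million fp16 parameters, and the grid holds 24,576 values against roughly 6.2 million in a 1080p RGB image, which is why predicting the grid is far cheaper than predicting pixels.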

Limitations & Future Work

Resolution of the bilateral grid: While the current low‑resolution grid works well for most edits, extremely fine‑grained texture manipulations (e.g., subtle grain addition) may be beyond what a coarse grid of affine transforms can represent.
Dependence on teacher diffusion model: The quality ceiling is bounded by the teacher’s capabilities; improvements in diffusion priors would directly benefit InstantRetouch.
Prompt generalization: Very complex or ambiguous instructions can lead to sub‑optimal alignment; future work could explore richer language models or multi‑turn dialog.
Extending beyond color/tonal edits: The authors note that the current affine‑transform grid is tailored to retouching; expanding to geometric transformations or style transfer would require a more expressive grid representation.

InstantRetouch demonstrates that clever architectural choices (here, a bilateral grid coupled with diffusion distillation) can bridge the gap between high‑fidelity AI editing and real‑world performance constraints, paving the way for the next generation of developer‑friendly, language‑driven imaging tools.

Authors

Jiarui Wu
Yujin Wang
Ruikang Li
Fan Zhang
Mingde Yao
Tianfan Xue

Paper Information

arXiv ID: 2606.05071v1
Categories: cs.CV
Published: June 3, 2026
