[Paper] InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Published: June 3, 2026 at 12:30 PM EDT

Source: arXiv - 2606.05071v1

Overview

The paper InstantRetouch tackles the growing demand for language‑guided photo editing by proposing a method that delivers high‑quality, instruction‑following retouches in real time. By moving the heavy lifting from pixel‑level diffusion to a compact bilateral‑grid representation, the authors achieve both visual fidelity (no unwanted content drift) and a dramatic speed boost, making the technique practical for everyday developer tools and consumer apps.

Key Contributions

  • Bilateral‑grid based retouching: Predicts a low‑resolution grid of affine transforms that are sliced and applied to the full‑resolution image, preserving geometry and texture.
  • Variational Score Distillation (VSD): A novel distillation pipeline that transfers the strong priors of a multi‑step diffusion model into the lightweight grid framework.
  • Prompt‑alignment loss: Ensures the generated edits follow natural‑language instructions accurately.
  • Comprehensive benchmark: Introduces a new evaluation suite covering fidelity, instruction adherence, and runtime efficiency.
  • State‑of‑the‑art performance: Beats recent diffusion‑based retouchers (e.g., Gemini‑2.5‑Flash) on content preservation and latency while delivering comparable visual quality.

Methodology

  1. Bilateral Space Representation – Instead of editing each pixel or the latent vector of a diffusion model, the system predicts a coarse bilateral grid (think of a 3‑D lookup table) where each cell stores an affine color transform.
  2. Guidance Map Slicing – A learned guidance map determines how to slice the grid for each pixel, effectively selecting the right transform based on local content (edges, textures, etc.).
  3. Application to Full‑Resolution Image – The sliced transforms are applied back to the original high‑resolution image, yielding a retouched output without any down‑sampling artifacts.
  4. Distillation from Diffusion – A pre‑trained diffusion model (the “teacher”) generates high‑quality retouch examples. Using Variational Score Distillation, the student grid model learns to mimic the teacher’s score function, inheriting its aesthetic priors while staying lightweight.
  5. Instruction Alignment – A contrastive loss aligns the textual prompt embedding with the predicted grid, encouraging the model to respect the user’s natural‑language command (e.g., “make the sky warmer”).
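
Steps 1 to 3 can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the grid shape, the use of luminance as the guidance value (the paper uses a learned guidance map), and all function names below are assumptions made for illustration. Each grid cell holds a 3x4 affine color matrix, the grid is trilinearly interpolated ("sliced") at each pixel's position and guidance value, and the resulting matrix is applied to the original full-resolution pixel.

```python
from itertools import product

def identity_affine():
    """3x4 affine colour transform that leaves RGB unchanged."""
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

def luminance(rgb):
    """Stand-in guidance value in [0, 1]; an assumption, not the learned map."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def _corners(pos, size):
    """Lower/upper cell index and the upper cell's weight along one axis."""
    pos = min(max(pos, 0.0), size - 1.0)
    lo = min(int(pos), size - 2) if size > 1 else 0
    hi = min(lo + 1, size - 1)
    return lo, hi, pos - lo

def slice_grid(grid, y, x, z):
    """Trilinearly interpolate a 3x4 affine matrix at fractional (y, x, z)."""
    axes = [_corners(y, len(grid)),
            _corners(x, len(grid[0])),
            _corners(z, len(grid[0][0]))]
    out = [[0.0] * 4 for _ in range(3)]
    for picks in product((0, 1), repeat=3):  # the 8 surrounding cells
        weight, idx = 1.0, []
        for (lo, hi, frac), pick in zip(axes, picks):
            weight *= frac if pick else 1.0 - frac
            idx.append(hi if pick else lo)
        cell = grid[idx[0]][idx[1]][idx[2]]
        for r in range(3):
            for c in range(4):
                out[r][c] += weight * cell[r][c]
    return out

def apply_grid(image, grid):
    """Slice the grid per pixel and apply the affine transform at full resolution."""
    h, w = len(image), len(image[0])
    gy, gx, gz = len(grid), len(grid[0]), len(grid[0][0])
    result = []
    for i, row in enumerate(image):
        out_row = []
        for j, rgb in enumerate(row):
            y = i / max(h - 1, 1) * (gy - 1)
            x = j / max(w - 1, 1) * (gx - 1)
            z = luminance(rgb) * (gz - 1)
            a = slice_grid(grid, y, x, z)
            out_row.append(tuple(
                a[r][0] * rgb[0] + a[r][1] * rgb[1] + a[r][2] * rgb[2] + a[r][3]
                for r in range(3)))
        result.append(out_row)
    return result

def lift_shadows_grid(amount=0.2):
    """Hypothetical 1x1x2 grid: add `amount` at guidance 0, nothing at guidance 1."""
    dark = identity_affine()
    for row in dark:
        row[3] = amount
    return [[[dark, identity_affine()]]]

# One row of three pixels: black, mid-gray, white.
image = [[(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]]
retouched = apply_grid(image, lift_shadows_grid())
```

In this toy grid the offset fades out linearly with brightness, so dark pixels are lifted the most and white pixels are left essentially unchanged. Because only a per-pixel color transform is applied, pixel positions never move, which is the property the summary credits for the lack of content drift.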

The whole pipeline runs in a single forward pass, eliminating the iterative sampling that makes diffusion retouching slow.
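
The summary describes the instruction-alignment term only as "a contrastive loss" and does not give its exact form. As a rough illustration, the sketch below uses an InfoNCE-style objective, a common choice for contrastive alignment; the embedding vectors, the temperature value, and the function names are all assumptions and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(edit_embs, prompt_embs, temperature=0.07):
    """Mean cross-entropy of matching edit i to prompt i among all prompts in the batch."""
    total = 0.0
    for i, edit in enumerate(edit_embs):
        logits = [cosine(edit, p) / temperature for p in prompt_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(edit_embs)

# Hypothetical 2-D embeddings for two (edit, prompt) pairs.
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], prompts)   # edits match their prompts
swapped_loss = info_nce([[0.0, 1.0], [1.0, 0.0]], prompts)   # edits match the wrong prompts
```

The loss is close to zero when each edit embedding sits nearest its own prompt and grows large when the pairing is wrong, which is the pressure a contrastive term would put on the grid predictor.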

Results & Findings

| Metric | InstantRetouch | Gemini‑2.5‑Flash (Nano‑Banana) | Diffusion‑Only Baseline |
| --- | --- | --- | --- |
| FID (fidelity, lower is better) | 0.87 | 1.34 | 1.12 |
| Instruction Accuracy (CLIP‑based) | 92% | 78% | 84% |
| Latency (per 1080p image) | ≈ 45 ms (GPU) | 320 ms | 1.2 s |
| Content Drift | Negligible | Noticeable artifacts | Moderate |

  • Visual quality: Users reported that InstantRetouch’s edits look as natural as diffusion outputs but without the occasional “hallucinated” details.
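  • Speed: Going by the latencies in the table, the bilateral‑grid approach is roughly 7× faster than Gemini‑2.5‑Flash (≈ 45 ms vs. 320 ms) and about 27× faster than the diffusion‑only baseline (≈ 45 ms vs. 1.2 s), making it viable for interactive UI.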
  • Instruction fidelity: The prompt‑alignment loss significantly improves the model’s ability to follow nuanced language cues.
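
The speedup ratios follow directly from the latency row of the table. The short calculation below uses only the three reported per-image latencies; the dictionary keys are just labels chosen for this sketch.

```python
# Reported latency per 1080p image, in milliseconds (from the results table).
latency_ms = {
    "InstantRetouch": 45.0,
    "Gemini-2.5-Flash": 320.0,
    "Diffusion-only baseline": 1200.0,
}

ours = latency_ms["InstantRetouch"]
speedup = {name: ms / ours for name, ms in latency_ms.items() if name != "InstantRetouch"}
frames_per_second = 1000.0 / ours

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x slower than InstantRetouch")
# Gemini-2.5-Flash: 7.1x slower than InstantRetouch
# Diffusion-only baseline: 26.7x slower than InstantRetouch
print(f"Throughput at 45 ms per image: {frames_per_second:.1f} images/s")
# Throughput at 45 ms per image: 22.2 images/s
```

At roughly 22 images per second, a single edit comfortably fits within an interactive feedback loop, which is the basis for the real-time claim.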

Practical Implications

  • Real‑time photo editors: Mobile and web apps can now offer AI‑driven retouching that reacts instantly to voice or text commands, opening up new UX possibilities (e.g., “brighten my portrait” with immediate feedback).
  • Batch processing pipelines: Studios can integrate InstantRetouch into automated asset pipelines, achieving consistent color grading across thousands of images without the compute cost of full diffusion.
  • Edge‑device deployment: The lightweight grid model (≈ 10 MB) fits comfortably on modern smartphones and even some embedded GPUs, enabling on‑device privacy‑preserving editing.
  • Extension to video: Because the bilateral grid operates per‑frame and preserves geometry, it can be adapted for temporally consistent video retouching with minimal additional overhead.

Limitations & Future Work

  • Resolution of the bilateral grid: While the current low‑resolution grid works well for most edits, extremely fine‑grained texture manipulations (e.g., subtle grain addition) may still suffer.
  • Dependence on teacher diffusion model: The quality ceiling is bounded by the teacher’s capabilities; improvements in diffusion priors would directly benefit InstantRetouch.
  • Prompt generalization: Very complex or ambiguous instructions can lead to sub‑optimal alignment; future work could explore richer language models or multi‑turn dialog.
  • Extending beyond color/tonal edits: The authors note that the current affine‑transform grid is tailored to retouching; expanding to geometric transformations or style transfer would require a more expressive grid representation.

InstantRetouch demonstrates that clever architectural choices—here, a bilateral‑grid coupled with diffusion distillation—can bridge the gap between high‑fidelity AI editing and real‑world performance constraints, paving the way for the next generation of developer‑friendly, language‑driven imaging tools.

Authors

  • Jiarui Wu
  • Yujin Wang
  • Ruikang Li
  • Fan Zhang
  • Mingde Yao
  • Tianfan Xue

Paper Information

  • arXiv ID: 2606.05071v1
  • Categories: cs.CV
  • Published: June 3, 2026
  • PDF: Download PDF