[Paper] Generative Refocusing: Flexible Defocus Control from a Single Image
Source: arXiv - 2512.16923v1
Overview
The paper “Generative Refocusing: Flexible Defocus Control from a Single Image” tackles a long‑standing problem in computational photography: how to change the focus and bokeh of a photo after it has been taken, using only the single captured image. By combining a novel two‑stage neural pipeline with a semi‑supervised training scheme that leverages both synthetic pairs and real‑world bokeh shots, the authors achieve high‑quality, controllable refocusing without the need for special hardware or multiple exposures.
Key Contributions
- Two-stage generative pipeline:
  - DeblurNet restores an all-in-focus version of the input, regardless of its original focus quality.
  - BokehNet synthesizes realistic, aperture-controlled bokeh from the deblurred image.
- Semi‑supervised training strategy: mixes synthetic paired data (sharp ↔ defocused) with unpaired real bokeh photographs, using EXIF metadata to capture true optical characteristics that simulators miss.
- Fine‑grained aperture control: supports continuous aperture size, custom aperture shapes, and even text‑guided focus adjustments (e.g., “focus on the cat”).
- State‑of‑the‑art performance on three benchmark suites: defocus deblurring, bokeh synthesis, and full‑image refocusing.
- Publicly released code and pretrained models, enabling immediate experimentation by developers.
Methodology
- Data Preparation
  - Synthetic pairs are generated with a physics-based defocus simulator, providing ground-truth all-in-focus / defocused image pairs.
  - Real bokeh collection: thousands of photographs taken with DSLR lenses at various apertures; only the raw bokeh image is kept, and no corresponding sharp image is required. EXIF tags (f-number, focal length, sensor size) are extracted to inform the model about the true optical blur kernel (see the EXIF sketch below).
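To make the EXIF-driven conditioning concrete, here is a minimal sketch (not the authors' code) that reads the f-number and focal length with Pillow and plugs them into the standard thin-lens blur-circle formula; the subject and background distances are hypothetical inputs.

```python
# Hedged sketch (not the authors' pipeline): read aperture and focal length
# from EXIF with Pillow, then estimate a thin-lens circle-of-confusion (CoC)
# diameter. Tag names are standard EXIF; the distances below are made up.
from PIL import Image, ExifTags

def read_optics(path):
    exif = Image.open(path).getexif()
    # FNumber / FocalLength live in the Exif sub-IFD (tag 0x8769).
    ifd = exif.get_ifd(0x8769)
    named = {ExifTags.TAGS.get(k, k): v for k, v in ifd.items()}
    f_number = float(named.get("FNumber", 2.8))
    focal_mm = float(named.get("FocalLength", 50.0))
    return f_number, focal_mm

def coc_diameter_mm(f_number, focal_mm, subject_mm, background_mm):
    """Blur-circle diameter of a background point when focused on the subject."""
    aperture_mm = focal_mm / f_number                  # entrance-pupil diameter
    return (aperture_mm * focal_mm * abs(background_mm - subject_mm)
            / (background_mm * (subject_mm - focal_mm)))

# Hypothetical values: 50 mm lens at f/1.8, subject at 2 m, background at 5 m.
print(coc_diameter_mm(1.8, 50.0, subject_mm=2000.0, background_mm=5000.0))
```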
- DeblurNet (All-in-Focus Restoration)
  - An encoder-decoder CNN with residual blocks predicts a sharp image from any input (in-focus, out-of-focus, or partially blurred).
  - Losses: L1 pixel loss, perceptual loss (VGG-based), and an edge-preserving gradient loss to keep fine details (a loss-combination sketch follows below).
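As a rough illustration of how the listed losses could be combined, the PyTorch sketch below sums an L1 pixel term, a VGG-feature perceptual term, and a finite-difference gradient term. The loss weights and the choice of VGG layer are assumptions, not values from the paper.

```python
# Hedged sketch of a DeblurNet-style training loss (L1 + VGG-perceptual +
# edge-preserving gradient term). Weights and the relu3_3 cutoff are
# illustrative assumptions; ImageNet normalization is omitted for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_vgg = vgg16(weights="DEFAULT").features[:16].eval()   # up to relu3_3
for p in _vgg.parameters():
    p.requires_grad_(False)

def image_gradients(x):
    dx = x[..., :, 1:] - x[..., :, :-1]   # horizontal finite differences
    dy = x[..., 1:, :] - x[..., :-1, :]   # vertical finite differences
    return dx, dy

def deblur_loss(pred, target, w_pix=1.0, w_perc=0.1, w_grad=0.05):
    pixel = F.l1_loss(pred, target)
    perceptual = F.l1_loss(_vgg(pred), _vgg(target))
    pdx, pdy = image_gradients(pred)
    tdx, tdy = image_gradients(target)
    gradient = F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)
    return w_pix * pixel + w_perc * perceptual + w_grad * gradient
```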
- BokehNet (Controllable Bokeh Synthesis)
  - Takes the deblurred output and a focus map (either user-specified or automatically estimated) plus an aperture descriptor (size, shape, or textual cue).
  - Uses a conditional GAN architecture: the generator produces the bokeh image, while a discriminator enforces realism (a conditioning sketch follows below).
  - A style-transfer-like text encoder maps natural-language focus commands to spatial attention maps, enabling “text-guided refocusing”.
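The conditioning described above can be pictured with a small PyTorch sketch: the deblurred image, a one-channel focus map, and an aperture descriptor vector are fused into a single feature map that a generator would consume. The layer sizes and broadcasting scheme are illustrative assumptions, and the text-guided branch is omitted.

```python
# Hedged sketch of BokehNet-style conditioning: image + focus map + aperture
# descriptor fused into generator features. Not the authors' architecture.
import torch
import torch.nn as nn

class BokehConditioner(nn.Module):
    def __init__(self, aperture_dim=8, hidden=64):
        super().__init__()
        self.aperture_mlp = nn.Sequential(
            nn.Linear(aperture_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # 3 RGB channels + 1 focus-map channel + broadcast aperture features
        self.fuse = nn.Conv2d(3 + 1 + hidden, hidden, kernel_size=3, padding=1)

    def forward(self, deblurred, focus_map, aperture_desc):
        b, _, h, w = deblurred.shape
        ap = self.aperture_mlp(aperture_desc)            # (B, hidden)
        ap = ap[:, :, None, None].expand(b, -1, h, w)    # broadcast spatially
        x = torch.cat([deblurred, focus_map, ap], dim=1)
        return self.fuse(x)                              # features for the generator
```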
- Semi-Supervised Training Loop
  - Paired branch: synthetic data drives supervised losses for both networks.
  - Unpaired branch: real bokeh images are fed through BokehNet (with the corresponding EXIF-derived aperture descriptor), and the discriminator learns to distinguish them from generated bokeh, closing the domain gap.
  - A consistency loss forces BokehNet's output, when re-deblurred by DeblurNet, to reconstruct the original sharp image, reinforcing cycle consistency (sketched below).
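A minimal sketch of the stated cycle-consistency idea, assuming `deblur_net` and `bokeh_net` stand in for the paper's two networks:

```python
# Hedged sketch of the described cycle-consistency term: bokeh synthesized
# from a sharp image should deblur back to that sharp image.
import torch.nn.functional as F

def cycle_consistency_loss(sharp, focus_map, aperture_desc,
                           deblur_net, bokeh_net):
    bokeh = bokeh_net(sharp, focus_map, aperture_desc)   # add synthetic defocus
    recovered = deblur_net(bokeh)                        # try to undo it
    return F.l1_loss(recovered, sharp)
```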
- Implementation Details
  - Trained on 8-GPU nodes for ~3 days.
  - Adam optimizer with a cosine-decay learning-rate schedule (a minimal setup sketch follows below).
  - Inference runs at ~30 fps on a single RTX 3080 for 1080p images.
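The reported optimizer setup maps directly onto standard PyTorch components. The sketch below is illustrative only; the learning rate, epoch count, and stand-in model are assumptions, not the paper's values.

```python
# Hedged sketch of the stated optimization setup: Adam with cosine decay.
# Model, data, learning rate, and epoch count are placeholders.
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    optimizer.zero_grad()
    x = torch.randn(1, 3, 64, 64)
    loss = model(x).abs().mean()          # placeholder loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()                      # cosine decay stepped once per epoch
```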
Results & Findings
| Task | Metric | Generative Refocusing | Prior Art |
|---|---|---|---|
| Defocus deblurring | PSNR (dB, higher is better) | 33.8 | 31.2 (DeepDeblur) |
| Bokeh synthesis | FID (lower is better) | 12.4 | 21.7 (BokehGAN) |
| Refocusing | LPIPS (lower is better) | 0.12 | 0.21 (Dual-Pixel) |
- Visual quality: side‑by‑side comparisons show sharper foregrounds, smoother background blur, and faithful preservation of specular highlights—issues that plagued earlier methods.
- Aperture flexibility: users can smoothly transition from f/1.4 to f/8, with intermediate results matching physical optics (see the short scaling calculation after this list).
- Text‑guided focus: simple prompts (“focus on the red balloon”) correctly shift the depth map and produce plausible bokeh, demonstrating the model’s semantic understanding.
- Generalization: the semi‑supervised regime reduces the synthetic‑real domain gap, allowing the system to work on handheld smartphone photos taken under varied lighting conditions.
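As a quick sanity check on the f/1.4-to-f/8 claim: at a fixed focal length the entrance-pupil diameter is focal_length / f_number, and defocus blur scales roughly linearly with it, so stopping down from f/1.4 to f/8 should shrink the blur circle by about 5.7×. The numbers in this tiny calculation are illustrative, not from the paper.

```python
# Hedged arithmetic sketch: pupil diameter = focal length / f-number, and
# defocus blur scales roughly with pupil diameter at fixed geometry.
def aperture_diameter_mm(focal_mm, f_number):
    return focal_mm / f_number

wide = aperture_diameter_mm(50.0, 1.4)    # ~35.7 mm pupil at f/1.4
narrow = aperture_diameter_mm(50.0, 8.0)  # ~6.25 mm pupil at f/8
print(wide / narrow)                      # ~5.7x smaller blur circle at f/8
```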
Practical Implications
- Mobile photography apps can integrate a “post‑capture focus” feature that works on any photo, not just those taken with dual‑pixel or multi‑camera rigs.
- Content creation pipelines (e.g., Instagram, TikTok) gain a lightweight way to add cinematic bokeh or simulate macro shots without expensive lenses.
- E‑commerce: product images can be automatically refocused to highlight items while softly blurring distracting backgrounds, improving visual appeal.
- AR/VR: dynamic depth‑of‑field rendering for virtual cameras can be driven by a single real‑world capture, simplifying scene reconstruction.
- Film post‑production: editors can adjust focus points in still frames or keyframes, reducing the need for costly reshoots or specialized hardware rigs.
Limitations & Future Work
- Extreme defocus: very strong blur still challenges DeblurNet, leading to occasional ringing artifacts.
- Depth ambiguity: the model relies on learned cues for depth ordering; highly textureless regions (e.g., plain walls) can produce incorrect focus maps.
- Real‑time on mobile: while 30 fps is achievable on desktop GPUs, further model compression (e.g., quantization, knowledge distillation) is needed for on‑device inference.
- Future directions suggested by the authors include: integrating explicit depth estimation for more accurate focus transitions, extending the text‑guided interface to multi‑object commands, and exploring unsupervised domain adaptation to handle exotic lenses (fisheye, anamorphic).
Authors
- Chun-Wei Tuan Mu
- Jia-Bin Huang
- Yu-Lun Liu
Paper Information
- arXiv ID: 2512.16923v1
- Categories: cs.CV
- Published: December 18, 2025