[Paper] Continuous Control of Editing Models via Adaptive-Origin Guidance
Source: arXiv - 2602.03826v1
Overview
Diffusion‑based editing models can change the semantics of images and videos from a textual prompt, but they’ve lacked a smooth “dial” to control how strong the edit should be. This paper introduces Adaptive‑Origin Guidance (AdaOr), a technique that lets you continuously tune edit intensity—think of a slider that truly interpolates between the original media and the fully edited output—without retraining the model or building custom datasets.
Key Contributions
- Identifies the root cause of why traditional Classifier‑Free Guidance (CFG) fails to provide smooth edit strength control in diffusion editors.
- Proposes Adaptive‑Origin Guidance (AdaOr), which replaces the static unconditional prediction with an identity‑conditioned prediction that respects the input content.
- Implements a simple interpolation scheme between the identity prediction and the unconditional prediction, giving a continuous, monotonic transition from “no edit” to “full edit”.
- Demonstrates broad applicability on both image and video editing tasks, outperforming existing slider‑based methods in smoothness and consistency.
- Keeps the training pipeline unchanged, requiring only an extra “identity” instruction at inference time, thus avoiding per‑edit fine‑tuning or specialized data.
Methodology
Background – CFG in diffusion
Standard CFG mixes an unconditional model output (no prompt) with a conditional one (with prompt) to push generations toward the text. In editing models, the unconditional output is not the original image but an arbitrary diffusion of it, which breaks smooth control.
Adaptive Origin
- Introduce an identity instruction (e.g., “keep the original image unchanged”) and feed it to the diffusion model alongside the usual unconditional token.
- The model now produces two “origins”:
  - U – the classic unconditional prediction (arbitrary noise).
  - I – the identity‑conditioned prediction that aims to reconstruct the input faithfully.
Guidance Interpolation
- Define a strength parameter s ∈ [0, 1].
- Compute the blended origin O_s = (1 − s)·U + s·I.
- Apply standard CFG using O_s as the base: x_t = O_s + λ·(cond − O_s), where λ is the usual guidance scale.
- When s = 0, the origin is purely unconditional → strong edit; when s = 1, the origin is the identity prediction → the output stays close to the input.
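The guidance interpolation above reduces to two lines of arithmetic. The following is a minimal sketch of a single guidance step; the function name `adaor_guidance` and the use of NumPy arrays are illustrative choices, not the authors' implementation.

```python
import numpy as np

def adaor_guidance(uncond, ident, cond, s, lam):
    """One guidance step with an adaptively blended origin.

    uncond -- U, the classic unconditional prediction
    ident  -- I, the identity-conditioned prediction
    cond   -- the text-conditioned (edit) prediction
    s      -- edit-strength slider in [0, 1]; s = 1 stays close to the input
    lam    -- the usual CFG guidance scale lambda
    """
    origin = (1.0 - s) * uncond + s * ident   # O_s = (1 - s)*U + s*I
    return origin + lam * (cond - origin)     # x_t = O_s + lam*(cond - O_s)
```

Because O_s is linear in s, sweeping the slider moves the guidance origin monotonically from U (full edit) to I (no edit), which is the source of the near-linear progression reported in the results.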
Implementation
No extra training is needed; the identity instruction is added to the prompt vocabulary, and the same diffusion checkpoint is used at inference. The method works for both static images and frame‑wise video diffusion pipelines.
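At inference time the method only adds one extra forward pass per step (the identity-conditioned prediction). A hedged sketch of the sampling loop, where `denoise` stands in for the unchanged diffusion checkpoint and the direct assignment of the guided prediction stands in for a real scheduler update:

```python
import numpy as np

IDENTITY_PROMPT = "keep the original image unchanged"  # example identity instruction

def edit_with_slider(denoise, x_init, edit_prompt, s, lam=7.5, steps=4):
    """Hypothetical sampling loop: same checkpoint, three passes per step.

    denoise(x, t, prompt) -- the frozen editing model; prompt=None means
                             the classic unconditional pass.
    s -- edit strength in [0, 1]; s = 1 reproduces the input, s = 0 edits fully.
    """
    x = x_init
    for t in reversed(range(steps)):
        u = denoise(x, t, None)              # U: unconditional prediction
        i = denoise(x, t, IDENTITY_PROMPT)   # I: identity-conditioned prediction
        c = denoise(x, t, edit_prompt)       # conditional (edit) prediction
        origin = (1.0 - s) * u + s * i       # O_s
        x = origin + lam * (c - origin)      # guided update (scheduler step elided)
    return x
```

The same loop applies per frame in a video pipeline; nothing about the checkpoint or training data changes.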
Results & Findings
- Quantitative smoothness: Measured L2 distance between successive edit strengths; AdaOr shows near‑linear progression, whereas vanilla CFG exhibits abrupt jumps.
- User study: Participants rated AdaOr edits as more predictable and easier to control (average 4.6/5 vs. 3.2/5 for baseline sliders).
- Cross‑modal validation: The technique works on video diffusion models, preserving temporal consistency while still offering fine‑grained control over motion or style changes.
- No degradation in fidelity: At full strength (s = 0), AdaOr matches or exceeds the visual quality of existing editing pipelines, confirming that the added identity conditioning does not hurt the model’s expressive power.
Practical Implications
- Developer-friendly APIs: Integrate a single “edit_strength” parameter into existing diffusion‑based editing services without touching the training code.
- Interactive UI/UX: Build real‑time sliders for image/video editors (e.g., Photoshop plugins, video post‑production tools) that feel truly continuous, improving user confidence and reducing trial‑and‑error cycles.
- Automation pipelines: Scripted batch edits can now vary strength per asset (e.g., gradually increasing a brand logo’s prominence across frames) with deterministic results.
- Cost efficiency: Since no per‑edit fine‑tuning is required, cloud inference costs stay low while offering richer control—valuable for SaaS platforms offering AI‑powered media manipulation.
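The batch-editing use case above amounts to scheduling a per-asset strength value. A minimal sketch, where `edit_fn` is a hypothetical wrapper around any AdaOr-enabled editing call:

```python
import numpy as np

def batch_edit(frames, edit_fn, start=1.0, end=0.0):
    """Apply edit_fn(frame, s) with a linearly ramped edit strength per frame.

    start=1.0 (no edit) -> end=0.0 (full edit) gives a gradual, deterministic
    increase in edit prominence across a sequence of frames.
    """
    strengths = np.linspace(start, end, len(frames))
    return [edit_fn(frame, s) for frame, s in zip(frames, strengths)]
```

Because the slider is continuous and monotonic, the ramp needs no per-frame tuning or fine-tuning passes.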
Limitations & Future Work
- Identity instruction dependence: The quality of the identity‑conditioned prediction hinges on how well the model learned to interpret the “keep original” token; extremely complex scenes may still drift.
- Guidance scale interaction: While AdaOr decouples edit strength from CFG scale, selecting an optimal CFG λ still requires some experimentation for different domains.
- Extending beyond diffusion: The authors note that the adaptive‑origin concept could benefit other generative families (e.g., autoregressive or GAN‑based editors), but this remains unexplored.
- Dataset bias: The method assumes the base diffusion model was trained on data where “identity” concepts are present; specialized domains (medical imaging, satellite data) may need a custom identity token or modest fine‑tuning.
Bottom line: Adaptive‑Origin Guidance offers a plug‑and‑play solution for developers who want precise, smooth control over text‑driven image and video edits, opening the door to more intuitive AI‑assisted creative tools.
Authors
- Alon Wolf
- Chen Katzir
- Kfir Aberman
- Or Patashnik
Paper Information
- arXiv ID: 2602.03826v1
- Categories: cs.CV, cs.GR
- Published: February 3, 2026