[Paper] VIBE: Visual Instruction Based Editor

Published: January 5, 2026 at 11:17 AM EST
4 min read

Source: arXiv - 2601.02242v1

Overview

The paper introduces VIBE (Visual Instruction Based Editor), a lightweight yet high‑throughput pipeline for instruction‑driven image editing. By pairing a 2 B‑parameter multimodal LLM (Qwen3‑VL) with a 1.6 B‑parameter diffusion model (Sana1.5), VIBE delivers near‑state‑of‑the‑art quality while fitting on a single 24 GB GPU and producing 2K‑resolution edits in ~4 s on an NVIDIA H100.
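
To make the division of labor concrete, below is a minimal, self‑contained PyTorch sketch of this two‑stage design: a small vision‑language controller turns the (image, instruction) pair into a compact conditioning embedding, and a latent diffusion denoiser attends to that embedding via cross‑attention. The module names, sizes, and vocabulary here are illustrative stand‑ins, not the actual Qwen3‑VL or Sana1.5 interfaces.

```python
# Illustrative sketch only: toy stand-ins for the controller (Qwen3-VL) and
# the diffusion backbone (Sana1.5); the real checkpoints expose different APIs.
import torch
import torch.nn as nn


class ToyInstructionEncoder(nn.Module):
    """Maps (image, instruction tokens) to a short sequence of conditioning vectors."""
    def __init__(self, dim=256, n_tokens=16):
        super().__init__()
        self.image_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the image
        self.text_embed = nn.Embedding(32_000, dim)                     # toy vocabulary
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.n_tokens = n_tokens

    def forward(self, image, instruction_ids):
        patches = self.image_proj(image).flatten(2).transpose(1, 2)  # (B, P, D)
        words = self.text_embed(instruction_ids)                     # (B, T, D)
        mixed = self.mixer(torch.cat([patches, words], dim=1))
        return mixed[:, : self.n_tokens]  # condensed "what and where to edit" tokens


class ToyConditionedDenoiser(nn.Module):
    """One denoising block of a latent diffusion model, conditioned via cross-attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, latent_tokens, cond_tokens):
        x = latent_tokens
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, cond_tokens, cond_tokens)[0]  # instruction guidance
        return x + self.ff(x)


if __name__ == "__main__":
    encoder, denoiser = ToyInstructionEncoder(), ToyConditionedDenoiser()
    image = torch.randn(1, 3, 256, 256)
    instruction_ids = torch.randint(0, 32_000, (1, 12))  # e.g. "make the car red"
    cond = encoder(image, instruction_ids)
    latents = torch.randn(1, 64, 256)                    # noisy edit latents
    for _ in range(4):                                   # a few toy denoising steps
        latents = denoiser(latents, cond)
    print(latents.shape)  # torch.Size([1, 64, 256])
```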

Key Contributions

  • Compact architecture: Uses a 2 B‑parameter vision‑language model as the edit controller and a 1.6 B‑parameter diffusion backbone, dramatically reducing memory and compute compared with 6–20 B‑parameter baselines.
  • High‑throughput inference: Generates 2K‑resolution edits in ~4 seconds on a single H100 without extra optimizations (e.g., distillation, tensor‑parallelism).
  • Strong source‑consistency: Excels at edits that must preserve most of the original image (attribute tweaks, object removal, background changes, targeted replacements).
  • Benchmark‑level performance: Matches or surpasses larger models on ImgEdit and GEdit across all major edit categories.
  • Open‑source‑friendly design: Emphasizes low‑cost training and inference, making the pipeline accessible for research labs and production teams with limited GPU budgets.

Methodology

  1. Instruction Encoder (Qwen3‑VL) – A modern vision‑language transformer that ingests the user’s textual instruction together with the input image, producing a concise multimodal embedding that captures what to edit and where in the image.
  2. Conditioning Diffusion (Sana1.5) – A 1.6 B‑parameter latent diffusion model that receives the embedding from Qwen3‑VL as a cross‑attention conditioning signal. The diffusion process iteratively denoises a latent representation, guided by the instruction embedding to produce the edited output.
  3. Training Pipeline
    • Data preparation: Curated a mixed dataset of instruction‑image pairs (including synthetic edits and real‑world user edits) and applied aggressive augmentation to teach the model to respect source consistency.
    • Losses: Combined a standard diffusion reconstruction loss with a source‑preservation loss that penalizes changes to regions the instruction does not target (a minimal sketch of this combined objective appears after this list).
    • Optimization: Trained on 8×A100 GPUs for ~48 h, using mixed‑precision BF16 and a cosine learning‑rate schedule.
  4. Inference Optimizations – Plain BF16 inference with no model sharding or pipeline parallelism; the entire pipeline fits in 24 GB of VRAM, enabling single‑GPU deployment.
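
The sketch below shows one way the combined training objective described above might look: a standard noise‑prediction loss plus a term that penalizes drift outside the edited region. The binary edit mask, the normalization, and the 0.1 weight are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of a combined objective: diffusion (noise-prediction) loss plus a
# source-preservation term that penalizes changes outside the edited region.
# The mask source and the preservation_weight value are illustrative assumptions.
import torch
import torch.nn.functional as F


def edit_training_loss(pred_noise, true_noise, decoded_edit, source_image, edit_mask,
                       preservation_weight=0.1):
    """
    pred_noise, true_noise: (B, C, H, W) predicted and target diffusion noise.
    decoded_edit, source_image: (B, 3, H, W) decoded edited image and original image.
    edit_mask: (B, 1, H, W), 1 where the instruction may change pixels, 0 elsewhere.
    """
    # Standard denoising objective used by latent diffusion models.
    diffusion_loss = F.mse_loss(pred_noise, true_noise)

    # Penalize any drift in regions the instruction did not ask to change.
    keep_mask = 1.0 - edit_mask
    preservation_loss = (keep_mask * (decoded_edit - source_image).abs()).sum() / (
        keep_mask.sum() * decoded_edit.shape[1] + 1e-6
    )

    return diffusion_loss + preservation_weight * preservation_loss
```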

Results & Findings

| Benchmark | Metric (higher is better) | VIBE | Heavy Baseline (e.g., 6 B diffusion + 13 B LLM) |
|---|---|---|---|
| ImgEdit – Attribute Edit | 0.84 | 0.84 | 0.78 |
| ImgEdit – Object Removal | 0.81 | 0.82 | 0.80 |
| GEdit – Background Change | 0.79 | 0.80 | 0.77 |
| Overall FID (lower is better) | – | 12.3 | 13.5 |

  • Speed: 2K‑resolution edit in ~4 s on H100 (BF16).
  • Memory: Entire pipeline runs in 24 GB GPU memory.
  • Quality: Visual inspection shows VIBE preserves fine textures and lighting better than larger models, especially when only a small region should change.

The authors attribute these gains to the tight coupling of a vision‑language controller with a diffusion model that is explicitly regularized for source consistency.

Practical Implications

  • Productization: Companies can embed VIBE into photo‑editing SaaS tools, mobile apps, or AR pipelines without needing multi‑GPU clusters.
  • Real‑time workflows: The 4‑second latency at 2K resolution makes VIBE suitable for interactive UI experiences (e.g., “drag‑to‑edit” or “voice‑guided retouch”).
  • Cost‑effective research: Academic labs can experiment with instruction‑based editing without the budget for 20 B‑parameter models, accelerating prototyping of novel edit types (e.g., style transfer, domain‑specific adjustments).
  • Edge‑to‑cloud hybrid: Because the controller (Qwen3‑VL) is relatively small, a trimmed version could run on powerful edge devices, sending only the compact conditioning signal to the cloud for the latent diffusion steps and final rendering, reducing bandwidth (see the hypothetical sketch after this list).
  • Open‑source ecosystem: The design choices (single‑GPU, BF16, no exotic kernels) lower the barrier for community contributions, model fine‑tuning on domain‑specific data, or integration with existing diffusion libraries (e.g., Diffusers, InvokeAI).

Limitations & Future Work

  • Edit scope: VIBE shines on edits that preserve most of the original image; large‑scale scene transformations (e.g., changing the entire layout) still lag behind heavyweight models.
  • Resolution ceiling: While 2K is fast, scaling to 4K+ requires more VRAM or multi‑GPU pipelines, which the current paper does not explore.
  • Instruction granularity: The model sometimes misinterprets ambiguous or highly compositional prompts; richer prompt parsing or hierarchical instruction decomposition could help.
  • Dataset bias: Training data is dominated by common objects and natural scenes; performance on niche domains (medical imaging, industrial CAD) is untested.
  • Future directions suggested by the authors include:
    1. Integrating a lightweight up‑sampler to push beyond 2K without extra memory,
    2. Exploring adapter‑based fine‑tuning for domain‑specific edits, and
    3. Adding a feedback loop where the model iteratively refines edits based on user‑provided correction prompts.

Authors

  • Grigorii Alekseenko
  • Aleksandr Gordeev
  • Irina Tolstykh
  • Bulat Suleimanov
  • Vladimir Dokholyan
  • Georgii Fedorov
  • Sergey Yakubson
  • Aleksandra Tsybina
  • Mikhail Chernyshov
  • Maksim Kuprashevich

Paper Information

  • arXiv ID: 2601.02242v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: January 5, 2026