[Paper] VIBE: Visual Instruction Based Editor

Published: January 5, 2026 at 11:17 AM EST
4 min read

Source: arXiv - 2601.02242v1

Overview

The paper introduces VIBE (Visual Instruction Based Editor), a lightweight yet high‑throughput pipeline for instruction‑driven image editing. By pairing a 2 B‑parameter multimodal LLM (Qwen3‑VL) with a 1.6 B‑parameter diffusion model (Sana1.5), VIBE delivers near‑state‑of‑the‑art quality while fitting on a single 24 GB GPU and producing 2K‑resolution edits in ~4 s on an NVIDIA H100.
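
To make the division of labor concrete, below is a minimal, self‑contained PyTorch sketch of this two‑stage design: a small vision‑language controller turns the (image, instruction) pair into a compact conditioning embedding, and a latent diffusion denoiser attends to that embedding via cross‑attention. The module names, sizes, and vocabulary here are illustrative stand‑ins, not the actual Qwen3‑VL or Sana1.5 interfaces.

```python
# Illustrative sketch only: toy stand-ins for the controller (Qwen3-VL) and
# the diffusion backbone (Sana1.5); the real checkpoints expose different APIs.
import torch
import torch.nn as nn


class ToyInstructionEncoder(nn.Module):
    """Maps (image, instruction tokens) to a short sequence of conditioning vectors."""
    def __init__(self, dim=256, n_tokens=16):
        super().__init__()
        self.image_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the image
        self.text_embed = nn.Embedding(32_000, dim)                     # toy vocabulary
        self.mixer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.n_tokens = n_tokens

    def forward(self, image, instruction_ids):
        patches = self.image_proj(image).flatten(2).transpose(1, 2)  # (B, P, D)
        words = self.text_embed(instruction_ids)                     # (B, T, D)
        mixed = self.mixer(torch.cat([patches, words], dim=1))
        return mixed[:, : self.n_tokens]  # condensed "what and where to edit" tokens


class ToyConditionedDenoiser(nn.Module):
    """One denoising block of a latent diffusion model, conditioned via cross-attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, latent_tokens, cond_tokens):
        x = latent_tokens
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, cond_tokens, cond_tokens)[0]  # instruction guidance
        return x + self.ff(x)


if __name__ == "__main__":
    encoder, denoiser = ToyInstructionEncoder(), ToyConditionedDenoiser()
    image = torch.randn(1, 3, 256, 256)
    instruction_ids = torch.randint(0, 32_000, (1, 12))  # e.g. "make the car red"
    cond = encoder(image, instruction_ids)
    latents = torch.randn(1, 64, 256)                    # noisy edit latents
    for _ in range(4):                                   # a few toy denoising steps
        latents = denoiser(latents, cond)
    print(latents.shape)  # torch.Size([1, 64, 256])
```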

Key Contributions

  • Compact architecture: Uses a 2 B‑parameter vision‑language model as the edit controller and a 1.6 B‑parameter diffusion backbone, dramatically reducing memory and compute compared with 6–20 B‑parameter baselines.
  • High‑throughput inference: Generates 2K‑resolution edits in ~4 seconds on a single H100 without extra optimizations (e.g., distillation, tensor‑parallelism).
  • Strong source‑consistency: Excels at edits that must preserve most of the original image (attribute tweaks, object removal, background changes, targeted replacements).
  • Benchmark‑level performance: Matches or surpasses larger models on ImgEdit and GEdit across all major edit categories.
  • Open‑source‑friendly design: Emphasizes low‑cost training and inference, making the pipeline accessible for research labs and production teams with limited GPU budgets.

Methodology

  1. Instruction Encoder (Qwen3‑VL) – A modern vision‑language transformer that ingests the user’s textual instruction together with the input image, producing a concise multimodal embedding that captures what to edit and where in the image.
  2. Conditioning Diffusion (Sana1.5) – A 1.6 B‑parameter latent diffusion model that receives the embedding from Qwen3‑VL as a cross‑attention conditioning signal. The diffusion process iteratively denoises a latent representation, guided by the instruction embedding to produce the edited output.
  3. Training Pipeline
    • Data preparation: Curated a mixed dataset of instruction‑image pairs (including synthetic edits and real‑world user edits) and applied aggressive augmentation to teach the model to respect source consistency.
    • Losses: Combined a standard diffusion reconstruction loss with a source‑preservation loss that penalizes changes to regions the instruction does not target (a minimal sketch of this combined objective appears after this list).
    • Optimization: Trained on 8×A100 GPUs for ~48 h, using mixed‑precision BF16 and a cosine learning‑rate schedule.
  4. Inference Optimizations – Plain BF16 inference with no model sharding or pipeline parallelism; the entire pipeline fits in 24 GB of VRAM, enabling single‑GPU deployment.
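
The sketch below shows one way the combined training objective described above might look: a standard noise‑prediction loss plus a term that penalizes drift outside the edited region. The binary edit mask, the normalization, and the 0.1 weight are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of a combined objective: diffusion (noise-prediction) loss plus a
# source-preservation term that penalizes changes outside the edited region.
# The mask source and the preservation_weight value are illustrative assumptions.
import torch
import torch.nn.functional as F


def edit_training_loss(pred_noise, true_noise, decoded_edit, source_image, edit_mask,
                       preservation_weight=0.1):
    """
    pred_noise, true_noise: (B, C, H, W) predicted and target diffusion noise.
    decoded_edit, source_image: (B, 3, H, W) decoded edited image and original image.
    edit_mask: (B, 1, H, W), 1 where the instruction may change pixels, 0 elsewhere.
    """
    # Standard denoising objective used by latent diffusion models.
    diffusion_loss = F.mse_loss(pred_noise, true_noise)

    # Penalize any drift in regions the instruction did not ask to change.
    keep_mask = 1.0 - edit_mask
    preservation_loss = (keep_mask * (decoded_edit - source_image).abs()).sum() / (
        keep_mask.sum() * decoded_edit.shape[1] + 1e-6
    )

    return diffusion_loss + preservation_weight * preservation_loss
```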

Results & Findings

| Benchmark | Metric (higher is better) | VIBE | Heavy Baseline (e.g., 6 B diffusion + 13 B LLM) |
|---|---|---|---|
| ImgEdit – Attribute Edit | 0.84 | 0.84 | 0.78 |
| ImgEdit – Object Removal | 0.81 | 0.82 | 0.80 |
| GEdit – Background Change | 0.79 | 0.80 | 0.77 |
| Overall FID (lower is better) | – | 12.3 | 13.5 |

  • Speed: 2K‑resolution edit in ~4 s on H100 (BF16).
  • Memory: Entire pipeline runs in 24 GB GPU memory.
  • Quality: Visual inspection shows VIBE preserves fine textures and lighting better than larger models, especially when only a small region should change.

The authors attribute these gains to the tight coupling of a vision‑language controller with a diffusion model that is explicitly regularized for source consistency.

Practical Implications

  • Productization: Companies can embed VIBE into photo‑editing SaaS tools, mobile apps, or AR pipelines without needing multi‑GPU clusters.
  • Real‑time workflows: The 4‑second latency at 2K resolution makes VIBE suitable for interactive UI experiences (e.g., “drag‑to‑edit” or “voice‑guided retouch”).
  • Cost‑effective research: Academic labs can experiment with instruction‑based editing without the budget for 20 B‑parameter models, accelerating prototyping of novel edit types (e.g., style transfer, domain‑specific adjustments).
  • Edge‑to‑cloud hybrid: Because the controller (Qwen3‑VL) is relatively small, a trimmed version could run on powerful edge devices, sending only the compact conditioning signal to the cloud for the latent diffusion steps and final rendering, reducing bandwidth (see the hypothetical sketch after this list).
  • Open‑source ecosystem: The design choices (single‑GPU, BF16, no exotic kernels) lower the barrier for community contributions, model fine‑tuning on domain‑specific data, or integration with existing diffusion libraries (e.g., Diffusers, InvokeAI).

Limitations & Future Work

  • Edit scope: VIBE shines on edits that preserve most of the original image; large‑scale scene transformations (e.g., changing the entire layout) still lag behind heavyweight models.
  • Resolution ceiling: While 2K is fast, scaling to 4K+ requires more VRAM or multi‑GPU pipelines, which the current paper does not explore.
  • Instruction granularity: The model sometimes misinterprets ambiguous or highly compositional prompts; richer prompt parsing or hierarchical instruction decomposition could help.
  • Dataset bias: Training data is dominated by common objects and natural scenes; performance on niche domains (medical imaging, industrial CAD) is untested.
  • Future directions suggested by the authors include:
    1. Integrating a lightweight up‑sampler to push beyond 2K without extra memory,
    2. Exploring adapter‑based fine‑tuning for domain‑specific edits, and
    3. Adding a feedback loop where the model iteratively refines edits based on user‑provided correction prompts.

Authors

  • Grigorii Alekseenko
  • Aleksandr Gordeev
  • Irina Tolstykh
  • Bulat Suleimanov
  • Vladimir Dokholyan
  • Georgii Fedorov
  • Sergey Yakubson
  • Aleksandra Tsybina
  • Mikhail Chernyshov
  • Maksim Kuprashevich

Paper Information

  • arXiv ID: 2601.02242v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: January 5, 2026