[Paper] VIBE: Visual Instruction Based Editor
Source: arXiv - 2601.02242v1
Overview
The paper introduces VIBE (Visual Instruction Based Editor), a lightweight yet high‑throughput pipeline for instruction‑driven image editing. By pairing a 2 B‑parameter multimodal LLM (Qwen3‑VL) with a 1.6 B‑parameter diffusion model (Sana1.5), VIBE delivers near‑state‑of‑the‑art quality while fitting in 24 GB of GPU memory and producing 2K‑resolution edits in ~4 s on a single NVIDIA H100.
Key Contributions
- Compact architecture: Uses a 2 B‑parameter vision‑language model as the edit controller and a 1.6 B‑parameter diffusion backbone, dramatically reducing memory and compute compared with 6–20 B‑parameter baselines.
- High‑throughput inference: Generates 2K‑resolution edits in ~4 seconds on a single H100 without extra optimizations (e.g., distillation, tensor‑parallelism).
- Strong source‑consistency: Excels at edits that must preserve most of the original image (attribute tweaks, object removal, background changes, targeted replacements).
- Benchmark‑level performance: Matches or surpasses larger models on ImgEdit and GEdit across all major edit categories.
- Open‑source‑friendly design: Emphasizes low‑cost training and inference, making the pipeline accessible for research labs and production teams with limited GPU budgets.
Methodology
- Instruction Encoder (Qwen3‑VL) – A modern vision‑language transformer that ingests the user’s textual instruction together with the input image, producing a concise multimodal embedding that captures what to edit and where in the image.
- Conditioning Diffusion (Sana1.5) – A 1.6 B‑parameter latent diffusion model that receives the Qwen3‑VL embedding as a cross‑attention conditioning signal. The diffusion process iteratively denoises a latent representation, guided by the instruction embedding, to produce the edited output (a toy sketch of this conditioning flow follows this list).
- Training Pipeline
  - Data preparation: Curated a mixed dataset of instruction‑image pairs (including synthetic edits and real‑world user edits) and applied aggressive augmentation to teach the model to respect source consistency.
  - Losses: Combined a standard diffusion reconstruction loss with a source‑preservation loss that penalizes unnecessary changes to regions the edit should leave untouched (a loss sketch also follows this list).
  - Optimization: Trained on 8×A100 GPUs for ~48 h, using mixed‑precision BF16 and a cosine learning‑rate schedule.
- Inference Optimizations – Simple BF16 inference, no need for model sharding or pipeline parallelism; the entire pipeline fits in 24 GB VRAM, enabling single‑GPU deployment.
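The conditioning flow described above can be illustrated with a toy sketch. The module and function names below (TinyVLMEncoder, TinyConditionedDenoiser, edit), the tensor shapes, and the simplified denoising update are placeholders chosen so the example runs end to end; they are not the paper's actual Qwen3‑VL or Sana1.5 code, only a minimal picture of how a multimodal embedding can steer cross‑attention‑conditioned denoising.

```python
# Toy, hypothetical sketch of a VIBE-style conditioning flow: a vision-language
# encoder produces a multimodal embedding, and a latent diffusion denoiser
# cross-attends to that embedding at every denoising step. All names, shapes,
# and the update rule are placeholders, not the paper's implementation.
import torch
import torch.nn as nn

class TinyVLMEncoder(nn.Module):
    """Stand-in for the 2 B-parameter vision-language controller (e.g., Qwen3-VL)."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(3 * 32 * 32, dim)   # toy image projection
        self.txt_proj = nn.Embedding(1000, dim)       # toy token embedding

    def forward(self, image, token_ids):
        img_tokens = self.img_proj(image.flatten(1)).unsqueeze(1)  # (B, 1, D)
        txt_tokens = self.txt_proj(token_ids)                      # (B, T, D)
        return torch.cat([img_tokens, txt_tokens], dim=1)          # multimodal embedding

class TinyConditionedDenoiser(nn.Module):
    """Stand-in for the 1.6 B-parameter diffusion backbone (e.g., Sana1.5)."""
    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, latent, cond):
        # latent: (B, N, latent_dim) noisy latent tokens; cond: (B, T, cond_dim)
        attended, _ = self.attn(latent, cond, cond)   # cross-attend to the instruction embedding
        return self.mlp(latent + attended)            # toy noise prediction

@torch.no_grad()
def edit(image, token_ids, encoder, denoiser, steps=20):
    """Simplified BF16-style inference loop (run here in default precision)."""
    cond = encoder(image, token_ids)
    latent = torch.randn(image.size(0), 16, 64)       # toy latent grid
    for _ in range(steps):
        noise_pred = denoiser(latent, cond)
        latent = latent - noise_pred / steps          # placeholder update, not a real scheduler
    return latent

enc, den = TinyVLMEncoder(), TinyConditionedDenoiser()
out = edit(torch.rand(1, 3, 32, 32), torch.randint(0, 1000, (1, 8)), enc, den)
print(out.shape)  # torch.Size([1, 16, 64])
```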
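The training objective can be sketched in a similarly hedged way. The paper only states that a standard diffusion reconstruction loss is combined with a source‑preservation penalty on regions that should stay unchanged; the function name, mask convention, and weighting below are illustrative assumptions.

```python
# Hypothetical sketch of the combined objective: diffusion reconstruction loss
# plus a masked penalty on regions the edit should leave untouched. The mask
# format and the weighting factor are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def vibe_style_loss(noise_pred, noise_target, decoded_edit, source_image,
                    unchanged_mask, preservation_weight=0.5):
    """
    noise_pred / noise_target: (B, C, H, W) model output vs. true noise.
    decoded_edit / source_image: (B, 3, H, W) decoded edited image and original.
    unchanged_mask: (B, 1, H, W), 1 where the edit must not alter the source.
    """
    # Standard denoising (reconstruction) loss.
    diffusion_loss = F.mse_loss(noise_pred, noise_target)

    # Penalize deviations from the source only inside the unchanged regions.
    preservation_loss = (unchanged_mask * (decoded_edit - source_image).abs()).mean()

    return diffusion_loss + preservation_weight * preservation_loss

# Toy usage with random tensors.
B, C, H, W = 2, 4, 16, 16
loss = vibe_style_loss(
    noise_pred=torch.randn(B, C, H, W),
    noise_target=torch.randn(B, C, H, W),
    decoded_edit=torch.rand(B, 3, 64, 64),
    source_image=torch.rand(B, 3, 64, 64),
    unchanged_mask=(torch.rand(B, 1, 64, 64) > 0.5).float(),
)
print(loss.item())
```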
Results & Findings
| Benchmark (score; higher is better unless noted) | VIBE | Heavy baseline (e.g., 6‑B diffusion + 13‑B LLM) |
|---|---|---|
| ImgEdit – Attribute Edit | 0.84 | 0.78 |
| ImgEdit – Object Removal | 0.82 | 0.80 |
| GEdit – Background Change | 0.80 | 0.77 |
| Overall FID (lower is better) | 12.3 | 13.5 |
- Speed: 2K‑resolution edit in ~4 s on H100 (BF16).
- Memory: Entire pipeline runs in 24 GB GPU memory.
- Quality: Visual inspection shows VIBE preserves fine textures and lighting better than larger models, especially when only a small region should change.
The authors attribute these gains to the tight coupling of a vision‑language controller with a diffusion model that is explicitly regularized for source consistency.
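As a quick sanity check, the reported speed and memory figures can be converted into throughput terms. The numbers below come straight from the bullets above; the edits-per-hour estimate ignores batching, I/O, and any pre- or post-processing, so it is only a rough upper bound.

```python
# Back-of-envelope throughput from the reported figures (~4 s per 2K edit, 24 GB footprint).
seconds_per_edit = 4.0                      # reported 2K edit latency on one H100 (BF16)
vram_footprint_gb = 24                      # reported end-to-end memory footprint

edits_per_gpu_hour = 3600 / seconds_per_edit
print(f"~{edits_per_gpu_hour:.0f} edits per GPU-hour at 2K resolution")   # ~900
print(f"Pipeline fits in {vram_footprint_gb} GB of VRAM, so no sharding is needed")
```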
Practical Implications
- Productization: Companies can embed VIBE into photo‑editing SaaS tools, mobile apps, or AR pipelines without needing multi‑GPU clusters.
- Real‑time workflows: The 4‑second latency at 2K resolution makes VIBE suitable for interactive UI experiences (e.g., “drag‑to‑edit” or “voice‑guided retouch”).
- Cost‑effective research: Academic labs can experiment with instruction‑based editing without the budget for 20 B‑parameter models, accelerating prototyping of novel edit types (e.g., style transfer, domain‑specific adjustments).
- Edge‑to‑cloud hybrid: Because the controller (Qwen3‑VL) is relatively small, a trimmed version could run on powerful edge devices, sending only compact conditioning and latents to the cloud for the diffusion steps and final rendering, reducing bandwidth (a hypothetical sketch follows this list).
- Open‑source ecosystem: The design choices (single‑GPU, BF16, no exotic kernels) lower the barrier for community contributions, model fine‑tuning on domain‑specific data, or integration with existing diffusion libraries (e.g., Diffusers, InvokeAI).
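One possible shape for the edge‑to‑cloud split mentioned above is sketched here. Everything in this snippet is hypothetical: the function names, payload contents, and tensor shapes are illustrative, and a real deployment would still need to get the source image (or a compressed latent of it) to wherever the diffusion backbone runs.

```python
# Hypothetical edge-to-cloud split: the compact controller runs on-device and only
# its conditioning (instruction embedding plus a compressed source latent) crosses
# the network; the diffusion backbone renders server-side. Names and shapes are
# placeholders, not part of the paper.
import io
import torch

def edge_encode(image: torch.Tensor, instruction_tokens: torch.Tensor) -> bytes:
    """Run the small controller on-device and serialize its compact conditioning."""
    # Inputs are unused in this toy placeholder; a real controller would consume them.
    embedding = torch.randn(1, 9, 256)            # placeholder controller output
    source_latent = torch.randn(1, 4, 64, 64)     # placeholder compressed source latent
    buffer = io.BytesIO()
    torch.save({"embedding": embedding, "source_latent": source_latent}, buffer)
    return buffer.getvalue()                      # far smaller than uploading a full 2K image

def cloud_render(payload: bytes) -> torch.Tensor:
    """Deserialize the conditioning and run the diffusion backbone server-side."""
    state = torch.load(io.BytesIO(payload), weights_only=True)
    cond, source_latent = state["embedding"], state["source_latent"]
    return torch.rand(1, 3, 2048, 2048)           # placeholder for the 2K edited image

payload = edge_encode(torch.rand(1, 3, 512, 512), torch.randint(0, 1000, (1, 8)))
edited = cloud_render(payload)
print(len(payload), edited.shape)
```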
Limitations & Future Work
- Edit scope: VIBE shines on edits that preserve most of the original image; large‑scale scene transformations (e.g., changing the entire layout) still lag behind heavyweight models.
- Resolution ceiling: While 2K is fast, scaling to 4K+ requires more VRAM or multi‑GPU pipelines, which the current paper does not explore.
- Instruction granularity: The model sometimes misinterprets ambiguous or highly compositional prompts; richer prompt parsing or hierarchical instruction decomposition could help.
- Dataset bias: Training data is dominated by common objects and natural scenes; performance on niche domains (medical imaging, industrial CAD) is untested.
- Future directions suggested by the authors include:
  - Integrating a lightweight up‑sampler to push beyond 2K without extra memory,
  - Exploring adapter‑based fine‑tuning for domain‑specific edits, and
  - Adding a feedback loop where the model iteratively refines edits based on user‑provided correction prompts.
Authors
- Grigorii Alekseenko
- Aleksandr Gordeev
- Irina Tolstykh
- Bulat Suleimanov
- Vladimir Dokholyan
- Georgii Fedorov
- Sergey Yakubson
- Aleksandra Tsybina
- Mikhail Chernyshov
- Maksim Kuprashevich
Paper Information
- arXiv ID: 2601.02242v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 5, 2026