[Paper] ReasonEdit: Editing Vision-Language Models using Human Reasoning

Published: February 2, 2026 at 01:06 PM EST
4 min read
Source: arXiv - 2602.02408v1

Overview

The paper ReasonEdit introduces a novel way to “edit” large vision‑language models (VLMs) by injecting human reasoning directly into the model’s knowledge base. Instead of merely tweaking weights to fix a single mistake, ReasonEdit stores the rationale behind a correction and uses it to guide future predictions, dramatically improving the model’s ability to generalize the edit to new, related queries.

Key Contributions

  • First reasoning‑aware editor for VLMs – enables users to supply natural‑language explanations (the “why”) alongside the desired output (the “what”); a hypothetical edit record is sketched after this list.
  • Codebook of human reasoning – a continuously updated repository that captures concise reasoning facts extracted from user edits.
  • Topology‑balanced multimodal embedding – a graph‑theoretic retrieval mechanism that selects the most relevant reasoning facts at inference time, ensuring balanced influence across visual and textual modalities.
  • State‑of‑the‑art performance – across four popular VLMs (CLIP‑ViT, BLIP, OFA, Flamingo) and several rationale‑based VQA benchmarks, ReasonEdit outperforms existing editors by a large margin.
  • Demonstrated edit generalization – edits propagate to unseen questions that require the same line of reasoning, confirming that the stored rationales act as reusable knowledge snippets.
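
For concreteness, an edit in this framework pairs the desired output with its rationale. Below is a minimal sketch of what such an edit record could look like; the class and field names are illustrative assumptions, not an interface released with the paper.

```python
# Hypothetical shape of a single reasoning-aware edit; field names are
# illustrative assumptions, not taken from any released ReasonEdit code.
from dataclasses import dataclass

@dataclass
class ReasoningEdit:
    image_path: str       # input the model currently gets wrong
    wrong_answer: str     # the erroneous prediction
    correct_answer: str   # the desired output (the "what")
    rationale: str        # the human explanation (the "why")

edit = ReasoningEdit(
    image_path="wrist_xray_042.png",
    wrong_answer="no fracture",
    correct_answer="hairline fracture",
    rationale="A thin radiolucent line along the distal radius indicates a fracture.",
)
```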

Methodology

  1. Edit Input – A developer supplies a triplet: (image, erroneous answer, correct answer) together with a short natural‑language explanation of why the original answer is wrong.
  2. Reasoning Codebook Construction – The explanation is encoded into a dense vector and stored in a codebook alongside a lightweight identifier of the associated image region. The codebook grows incrementally as more edits are made (a minimal code sketch of steps 2–4 follows this list).
  3. Topology‑Balanced Retrieval – At inference, the model builds a multimodal graph where nodes are image patches, text tokens, and codebook entries. Edges are weighted by similarity, and a balance term (derived from network‑science concepts like node degree and betweenness) ensures that no single modality dominates the retrieval. The top‑k most relevant reasoning facts are fetched.
  4. Fusion & Prediction – Retrieved reasoning vectors are injected into the VLM’s transformer layers via a simple additive bias or a learned gating module. The model then produces its answer, now informed by both its original knowledge and the human‑provided rationale.
  5. Continuous Learning – After each edit, the codebook is updated, and the retrieval module is fine‑tuned with a lightweight contrastive loss to keep the graph topology aligned with the evolving reasoning space.
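
The sketch below illustrates steps 2–4 (codebook construction, topology‑balanced retrieval, and additive‑bias fusion). The class and function names, the 0.5 similarity threshold, and the degree‑based balance term are assumptions for illustration; the paper describes its balance term only as drawing on graph statistics such as node degree and betweenness, and step 5 (contrastive fine‑tuning of the retriever) is omitted here.

```python
# Minimal sketch of steps 2-4; names and the balance heuristic are assumptions.
import numpy as np

class ReasoningCodebook:
    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim))  # encoded rationale vectors
        self.entries = []                  # (image_region_id, rationale_text)

    def add(self, rationale_vec: np.ndarray, region_id: str, text: str) -> None:
        """Step 2: store an encoded rationale with a lightweight region id."""
        self.vectors = np.vstack([self.vectors, rationale_vec[None, :]])
        self.entries.append((region_id, text))

    def retrieve(self, image_patches: np.ndarray, text_tokens: np.ndarray,
                 k: int = 3) -> np.ndarray:
        """Step 3: topology-balanced top-k retrieval.

        Each codebook entry is scored by its similarity to the image patches and
        the text tokens; dividing each modality's score by the entry's "degree"
        (its number of strong connections) in that modality keeps either
        modality from dominating the retrieval.
        """
        if not self.entries:
            return np.array([], dtype=int)
        sim_img = self.vectors @ image_patches.T   # (entries, patches)
        sim_txt = self.vectors @ text_tokens.T     # (entries, tokens)
        deg_img = (sim_img > 0.5).sum(axis=1) + 1  # +1 avoids division by zero
        deg_txt = (sim_txt > 0.5).sum(axis=1) + 1
        score = sim_img.sum(axis=1) / deg_img + sim_txt.sum(axis=1) / deg_txt
        return np.argsort(-score)[:k]              # indices of top-k entries

def fuse_additive_bias(hidden_states: np.ndarray, reasoning_vecs: np.ndarray,
                       gate: float = 0.1) -> np.ndarray:
    """Step 4: inject retrieved reasoning as a simple additive bias on the
    transformer hidden states (the paper also mentions a learned gating module)."""
    return hidden_states + gate * reasoning_vecs.mean(axis=0, keepdims=True)

# Usage (shapes illustrative): idx = cb.retrieve(patches, tokens, k=3)
#                              fused = fuse_additive_bias(hidden, cb.vectors[idx])
```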

Results & Findings

| Model | Baseline VQA Accuracy | ReasonEdit Accuracy (after edit) | Δ Generalization (unseen questions) |
| --- | --- | --- | --- |
| CLIP‑ViT | 62.1 % | 78.4 % | +12.3 % |
| BLIP | 68.5 % | 84.1 % | +15.0 % |
| OFA | 70.2 % | 86.7 % | +14.5 % |
| Flamingo | 73.8 % | 89.2 % | +16.1 % |
  • Edit Success Rate (the model gives the corrected answer on the edited instance) exceeds 95 % for all four VLMs.
  • Generalization: When presented with new questions that require the same reasoning chain, ReasonEdit’s answers improve by 12–16 % absolute over unedited baselines, confirming that the stored rationales act as reusable “knowledge patches.”
  • Ablation: Removing the topology‑balancing term drops generalization performance by ~5 %, highlighting its role in preventing over‑reliance on either visual or textual cues.
  • Efficiency: The codebook lookup adds < 15 ms per query, making the approach viable for real‑time applications.

Practical Implications

  • Rapid Model Fixes: Developers can correct a VLM’s mistake (e.g., misinterpreting a medical image) without full fine‑tuning, simply by supplying a short explanation.
  • Regulatory Compliance: In domains where auditability is required, the reasoning codebook provides a transparent log of why a model was edited, satisfying documentation needs.
  • Reusable Knowledge Modules: Reasoning facts can be shared across projects; once a rationale for “why a red traffic light means stop” is stored, any VLM using ReasonEdit can instantly apply it to new traffic‑scene queries (a hypothetical export/import sketch follows this list).
  • Edge‑Device Adaptation: Because the edit is stored as a compact vector rather than a full weight update, ReasonEdit can be deployed on devices with limited compute (e.g., AR glasses) to personalize VLM behavior on‑the‑fly.
  • Improved Human‑in‑the‑Loop Workflows: QA teams can iteratively refine VLMs by annotating errors with explanations, turning the editing process into a collaborative debugging session rather than a black‑box retraining pipeline.
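
To illustrate the reusable‑knowledge point above, here is one hypothetical way codebook entries could be exported from one project and loaded into another. It assumes the codebook object from the earlier sketch (a `vectors` array and an `entries` list); the on‑disk format and helper names are assumptions, not an interface described in the paper.

```python
# Hypothetical export/import of reasoning codebook entries between projects.
import json
import numpy as np

def export_codebook(codebook, path: str) -> None:
    """Write rationale vectors and their (region_id, text) entries to disk."""
    np.savez(path + "_vectors.npz", vectors=codebook.vectors)
    with open(path + "_entries.json", "w") as f:
        json.dump(codebook.entries, f)

def import_codebook(codebook_cls, path: str, dim: int):
    """Load a shared codebook so another ReasonEdit-equipped VLM can reuse it."""
    cb = codebook_cls(dim)
    cb.vectors = np.load(path + "_vectors.npz")["vectors"]
    with open(path + "_entries.json") as f:
        cb.entries = [tuple(e) for e in json.load(f)]
    return cb
```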

Limitations & Future Work

  • Scalability of the Codebook: As the number of edits grows, retrieval may become slower; the authors suggest hierarchical clustering or pruning strategies as next steps (a toy pruning sketch follows this list).
  • Reasoning Quality Dependence: The editor’s success hinges on the clarity and correctness of the human explanation; noisy or ambiguous rationales can degrade performance.
  • Domain Transfer: Experiments focus on VQA datasets; applying ReasonEdit to other multimodal tasks (e.g., image captioning, visual grounding) remains an open question.
  • Robustness to Adversarial Edits: The paper does not explore whether malicious rationales could be used to inject biased behavior—future work should investigate safeguards.
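
As a rough illustration of the pruning direction mentioned above, a usage‑based pruning pass might look like the following. The hit‑count tracking and keep fraction are assumptions for illustration, not something the paper specifies.

```python
# Toy usage-based pruning: keep only the most frequently retrieved entries.
import numpy as np

def prune_codebook(vectors: np.ndarray, entries: list, hit_counts: np.ndarray,
                   keep_fraction: float = 0.5):
    """Drop the least-retrieved entries to bound lookup latency as edits grow."""
    n_keep = max(1, int(len(entries) * keep_fraction))
    keep = np.argsort(-hit_counts)[:n_keep]
    return vectors[keep], [entries[i] for i in keep]
```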

Overall, ReasonEdit opens a promising avenue for making large vision‑language models more maintainable, explainable, and adaptable by leveraging the very reasoning that humans naturally provide when correcting mistakes.

Authors

  • Jiaxing Qiu
  • Kaihua Hou
  • Roxana Daneshjou
  • Ahmed Alaa
  • Thomas Hartvigsen

Paper Information

  • arXiv ID: 2602.02408v1
  • Categories: cs.CV, cs.AI
  • Published: February 2, 2026