[Paper] ReasonEdit: Editing Vision-Language Models using Human Reasoning
Source: arXiv - 2602.02408v1
Overview
The paper ReasonEdit introduces a novel way to “edit” large vision‑language models (VLMs) by injecting human reasoning into an editable knowledge store rather than the model’s weights. Instead of merely tweaking parameters to fix a single mistake, ReasonEdit records the rationale behind a correction and uses it to guide future predictions, markedly improving the model’s ability to generalize the edit to new, related queries.
Key Contributions
- First reasoning‑aware editor for VLMs – enables users to supply natural‑language explanations (the “why”) alongside the desired output (the “what”).
- Codebook of human reasoning – a continuously updated repository that captures concise reasoning facts extracted from user edits.
- Topology‑balanced multimodal embedding – a graph‑theoretic retrieval mechanism that selects the most relevant reasoning facts at inference time, ensuring balanced influence across visual and textual modalities.
- State‑of‑the‑art performance – across four popular VLMs (e.g., CLIP‑ViT, BLIP, OFA, Flamingo) and several rationale‑based VQA benchmarks, ReasonEdit outperforms existing editors by a large margin.
- Demonstrated edit generalization – edits propagate to unseen questions that require the same line of reasoning, confirming that the stored rationales act as reusable knowledge snippets.
Methodology
- Edit Input – A developer supplies the image, the model’s erroneous answer, and the desired correct answer, together with a short natural‑language explanation of why the original answer is wrong.
- Reasoning Codebook Construction – The explanation is encoded into a dense vector and stored in a codebook alongside a lightweight identifier of the associated image region. The codebook grows incrementally as more edits are made.
- Topology‑Balanced Retrieval – At inference, the model builds a multimodal graph whose nodes are image patches, text tokens, and codebook entries. Edges are weighted by similarity, and a balance term (derived from network‑science quantities such as node degree and betweenness) ensures that no single modality dominates the retrieval. The top‑k most relevant reasoning facts are fetched (the first sketch after this list illustrates the codebook and this retrieval step).
- Fusion & Prediction – Retrieved reasoning vectors are injected into the VLM’s transformer layers via a simple additive bias or a learned gating module. The model then produces its answer, now informed by both its original knowledge and the human‑provided rationale.
- Continuous Learning – After each edit, the codebook is updated, and the retrieval module is fine‑tuned with a lightweight contrastive loss to keep the graph topology aligned with the evolving reasoning space (see the contrastive‑loss sketch below).
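The summary does not include the authors’ code, so the following is only a minimal sketch of the codebook and retrieval steps described above. The class `ReasoningCodebook`, the function `topology_balanced_topk`, and the use of a plain degree-style sum as the balance term are illustrative assumptions; the paper’s actual mechanism builds a full multimodal graph and also uses betweenness.

```python
import numpy as np


class ReasoningCodebook:
    """Toy stand-in for the paper's reasoning codebook (illustrative only).

    Each entry pairs an encoded natural-language explanation with a lightweight
    identifier of the image region the edit refers to.
    """

    def __init__(self):
        self.vectors = []     # unit-normalized explanation embeddings
        self.region_ids = []  # identifier of the associated image region

    def add(self, explanation_vec, region_id):
        self.vectors.append(explanation_vec / np.linalg.norm(explanation_vec))
        self.region_ids.append(region_id)

    def matrix(self):
        return np.stack(self.vectors) if self.vectors else np.empty((0, 0))


def topology_balanced_topk(patch_embs, token_embs, codebook, k=3):
    """Retrieve the k codebook entries most relevant to the current query.

    Query nodes are image patches and text tokens; dividing each modality's
    scores by its total edge weight (a degree-style quantity) is used here as a
    simple stand-in for the paper's degree/betweenness balance term.
    """
    entries = codebook.matrix()
    if entries.size == 0:
        return []

    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    patches, tokens = unit(patch_embs), unit(token_embs)

    # Similarity "edges" from each modality's nodes to every codebook entry.
    sim_v = patches @ entries.T  # (num_patches, num_entries)
    sim_t = tokens @ entries.T   # (num_tokens, num_entries)

    # Total edge weight per modality; dividing by it keeps either the visual
    # or the textual side from dominating the retrieval score.
    deg_v, deg_t = np.abs(sim_v).sum(), np.abs(sim_t).sum()
    score = sim_v.max(axis=0) / max(deg_v, 1e-6) + sim_t.max(axis=0) / max(deg_t, 1e-6)

    top = np.argsort(-score)[:k]
    return [(int(i), codebook.region_ids[i]) for i in top]
```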
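For the fusion step, one plausible form of the learned gating module is sketched below in PyTorch. The module name `ReasoningGate`, the mean-pooling of the retrieved rationales, and the per-dimension sigmoid gate are assumptions made for illustration; the paper states only that retrieved vectors enter the transformer layers via an additive bias or a learned gate.

```python
import torch
import torch.nn as nn


class ReasoningGate(nn.Module):
    """Illustrative gated injection of retrieved reasoning vectors into one
    transformer layer's hidden states (an assumed design, not the authors' code)."""

    def __init__(self, hidden_dim, reasoning_dim):
        super().__init__()
        self.project = nn.Linear(reasoning_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden_states, reasoning_vecs):
        # hidden_states: (seq_len, hidden_dim); reasoning_vecs: (k, reasoning_dim)
        # Pool the top-k retrieved rationales into a single bias vector.
        bias = self.project(reasoning_vecs.mean(dim=0))
        bias = bias.expand_as(hidden_states)  # broadcast the bias over all tokens
        # A learned gate decides, per token and per dimension, how much of the
        # rationale signal to mix into the hidden state.
        g = torch.sigmoid(self.gate(torch.cat([hidden_states, bias], dim=-1)))
        return hidden_states + g * bias
```

When the gate saturates near zero, the layer behaves like the unedited model, which is one way such a design could keep an edit’s influence localized to queries where the retrieved rationale is actually relevant.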
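Finally, the “lightweight contrastive loss” used to keep retrieval aligned after each edit is only described at a high level; an InfoNCE-style objective is one common choice and is sketched below purely as an assumed form.

```python
import torch.nn.functional as F


def retrieval_contrastive_loss(query_embs, entry_embs, positive_idx, temperature=0.07):
    """Assumed InfoNCE-style objective: each query from a recent edit should
    score its own codebook entry above every other entry."""
    q = F.normalize(query_embs, dim=-1)            # (batch, dim) edit queries
    e = F.normalize(entry_embs, dim=-1)            # (num_entries, dim) codebook
    logits = q @ e.T / temperature                 # (batch, num_entries)
    return F.cross_entropy(logits, positive_idx)   # positive_idx: matching entry per query
```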
Results & Findings
| Model | Baseline VQA Accuracy | ReasonEdit Accuracy (after edit) | Δ Generalization (unseen questions) |
|---|---|---|---|
| CLIP‑ViT | 62.1 % | 78.4 % | +12.3 % |
| BLIP | 68.5 % | 84.1 % | +15.0 % |
| OFA | 70.2 % | 86.7 % | +14.5 % |
| Flamingo | 73.8 % | 89.2 % | +16.1 % |
- Edit Success Rate (the model gives the corrected answer on the edited instance) exceeds 95 % for all four VLMs.
- Generalization: When presented with new questions that require the same reasoning chain, ReasonEdit’s answers improve by 12–16 % absolute over unedited baselines, confirming that the stored rationales act as reusable “knowledge patches.”
- Ablation: Removing the topology‑balancing term drops generalization performance by ~5 %, highlighting its role in preventing over‑reliance on either visual or textual cues.
- Efficiency: The codebook lookup adds < 15 ms per query, making the approach viable for real‑time applications.
Practical Implications
- Rapid Model Fixes: Developers can correct a VLM’s mistake (e.g., misinterpreting a medical image) without full fine‑tuning, simply by supplying a short explanation.
- Regulatory Compliance: In domains where auditability is required, the reasoning codebook provides a transparent log of why a model was edited, satisfying documentation needs.
- Reusable Knowledge Modules: Reasoning facts can be shared across projects—once a rationale for “why a red traffic light means stop” is stored, any VLM using ReasonEdit can instantly apply it to new traffic‑scene queries.
- Edge‑Device Adaptation: Because the edit is stored as a compact vector rather than a full weight update, ReasonEdit can be deployed on devices with limited compute (e.g., AR glasses) to personalize VLM behavior on‑the‑fly.
- Improved Human‑in‑the‑Loop Workflows: QA teams can iteratively refine VLMs by annotating errors with explanations, turning the editing process into a collaborative debugging session rather than a black‑box retraining pipeline.
Limitations & Future Work
- Scalability of the Codebook: As the number of edits grows, retrieval may become slower; the authors suggest hierarchical clustering or pruning strategies as next steps.
- Reasoning Quality Dependence: The editor’s success hinges on the clarity and correctness of the human explanation; noisy or ambiguous rationales can degrade performance.
- Domain Transfer: Experiments focus on VQA datasets; applying ReasonEdit to other multimodal tasks (e.g., image captioning, visual grounding) remains an open question.
- Robustness to Adversarial Edits: The paper does not explore whether malicious rationales could be used to inject biased behavior—future work should investigate safeguards.
Overall, ReasonEdit opens a promising avenue for making large vision‑language models more maintainable, explainable, and adaptable by leveraging the very reasoning that humans naturally provide when correcting mistakes.
Authors
- Jiaxing Qiu
- Kaihua Hou
- Roxana Daneshjou
- Ahmed Alaa
- Thomas Hartvigsen
Paper Information
- arXiv ID: 2602.02408v1
- Categories: cs.CV, cs.AI
- Published: February 2, 2026