[Paper] Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
Source: arXiv - 2605.08031v1
Overview
The paper introduces HFRU (Hallucination‑Free Reinforcement Unlearning), a method for making vision‑language models (VLMs) forget specific visual concepts without hallucinating “ghost” objects in their place or degrading the model’s general abilities. By targeting the vision encoder rather than only the language decoder, the authors aim to erase the unwanted knowledge at the representation level while keeping the model useful for downstream tasks.
Key Contributions
- Deep‑encoder unlearning: Presented as the first framework that directly modifies the vision encoder to remove visual semantics, rather than only suppressing the concept at the decoder output.
- Two‑stage reinforcement pipeline:
  - Alignment disruption – breaks the tight coupling between visual features and textual tokens for the target concepts.
  - GRPO‑based optimization – uses a composite reward (alignment and abstraction rewards plus a hallucination penalty) to guide the encoder toward a state that no longer encodes the target concepts.
- Abstraction reward: Encourages the model to replace erased objects with semantically valid alternatives (e.g., “a vehicle” instead of a specific car model), dramatically reducing object hallucination.
- Empirical results: Reports 98.3 % forgetting on target classes and a face‑verification AUC of 0.12 after unlearning, while retaining 95.7 % accuracy on non‑target classes (96.1 % without unlearning).
- Open‑source release: Full code, pretrained checkpoints, and reproducibility scripts are provided.
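As a rough illustration of the abstraction idea in the list above, the sketch below scores a generated caption against a forgotten concept. The hypernym table, the function name `abstraction_reward`, and the -1 / 0 / +1 scoring are all assumptions made for this example; the paper's actual reward may be defined quite differently.

```python
# Hypothetical hypernym table: maps a forgotten concept to acceptable
# higher-level descriptions. The entries are illustrative only.
HYPERNYMS = {
    "golden retriever": {"dog", "animal"},
    "sports car": {"car", "vehicle"},
}


def abstraction_reward(caption: str, target: str) -> float:
    """Toy scoring rule (an assumption, not the paper's formula).

    Returns -1.0 if the caption still names the target concept,
    1.0 if it uses an allowed abstraction instead, and 0.0 otherwise.
    """
    text = caption.lower()
    if target in text:
        return -1.0
    # Split into words so that "car" does not match inside "carpet".
    words = set(text.replace(".", " ").replace(",", " ").split())
    if words & HYPERNYMS.get(target, set()):
        return 1.0
    return 0.0


leaked = abstraction_reward("A golden retriever on the grass.", "golden retriever")
abstracted = abstraction_reward("An animal on the grass.", "golden retriever")
neither = abstraction_reward("A patch of grass.", "golden retriever")
```

In this toy setup a caption that leaks the concept is penalized, one that falls back to a valid abstraction is rewarded, and one that does neither is left neutral.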
Methodology
- Problem Setup – Given a pre‑trained VLM and a set of “sensitive” visual concepts (e.g., a particular person’s face or copyrighted artwork), the goal is to erase any trace of these concepts from the model’s internal representations.
- Stage 1 – Alignment Disruption
  - The vision encoder’s output vectors for the target concepts are perturbed using a contrastive loss that pushes them away from their original textual embeddings.
  - This step creates a “gap” between visual features and the language decoder, making the model less likely to retrieve the banned concept.
- Stage 2 – Reinforcement‑guided Optimization (GRPO)
  - Reward Design:
    - Alignment Reward – penalizes residual similarity between the altered visual features and the original text tokens.
    - Abstraction Reward – gives credit when the model substitutes the erased concept with a higher‑level, semantically correct description (e.g., “animal” instead of “dog”).
    - Hallucination Penalty – discourages the generation of unrelated objects that often appear after naive unlearning.
  - A policy‑gradient algorithm, GRPO (Group Relative Policy Optimization), updates the encoder weights to maximize the composite reward, in effect retraining the encoder to forget while staying semantically coherent.
- Evaluation Protocol – The authors test forgetting on two fronts: (a) Object Recognition (e.g., ImageNet‑style classification) and (b) Face Identity Retrieval (matching faces across views). Retention is measured on a held‑out set of concepts that should remain intact.
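The reward design and the GRPO step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the linear combination, the weights `w_align`, `w_abs`, `w_hall`, and the function names are hypothetical and not taken from the paper. Only the group‑relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composite_reward(visual_feat, text_emb, abstraction, hallucinated,
                     w_align=1.0, w_abs=1.0, w_hall=1.0):
    """Combine the three terms; the weighting scheme is an assumption."""
    # Less residual similarity to the original text embedding -> higher reward.
    alignment = -cosine(visual_feat, text_emb)
    # `hallucinated` is a count of spurious objects in the sampled output.
    return w_align * alignment + w_abs * abstraction - w_hall * hallucinated


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three sampled outputs for one image (all values made up).
text_emb = [1.0, 0.0]
rewards = [
    composite_reward([0.0, 1.0], text_emb, abstraction=1.0, hallucinated=0),
    composite_reward([1.0, 0.0], text_emb, abstraction=0.0, hallucinated=0),
    composite_reward([0.0, 1.0], text_emb, abstraction=0.0, hallucinated=2),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed. The sample that drops the concept and abstracts cleanly receives the largest advantage, while the one that hallucinates two objects receives the smallest.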
Results & Findings
| Metric | HFRU | Prior Decoder‑Only Unlearning | Baseline (No Unlearning) |
|---|---|---|---|
| Forgetting (Top‑1 drop on target class) | 98.3 % | 71.4 % | 0 % |
| Retention (Accuracy on non‑target classes) | 95.7 % | 88.2 % | 96.1 % |
| Object Hallucination (spurious object rate) | 0.9 % | 6.8 % | 0.5 % |
| Face‑ID removal (verification AUC) | 0.12 (below the 0.5 chance level) | 0.34 | 0.99 |
- Deep forgetting: By operating on the encoder, HFRU eliminates the visual fingerprint of the target concepts, not just the textual label.
- Minimal side‑effects: The abstraction reward keeps the model’s output sensible, preventing the “hallucinated” objects that plagued earlier methods.
- Scalability: Experiments with up to 5 % of ImageNet classes removed show the same trend, indicating the approach can handle larger unlearning scopes.
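To help read the verification AUC row in the table above, the sketch below computes AUC as the probability that a genuine (same‑identity) pair scores higher than an impostor pair. The function name and the toy scores are assumptions for illustration. The point is only that 0.5 corresponds to chance, so values below 0.5 mean genuine pairs tend to score lower than impostor pairs.

```python
def verification_auc(genuine_scores, impostor_scores):
    """Fraction of (genuine, impostor) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for g in genuine_scores:
        for i in impostor_scores:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(genuine_scores) * len(impostor_scores))


# Toy similarity scores (made up for this example).
separable = verification_auc([0.9, 0.8], [0.2, 0.1])      # 1.0: identities fully recognizable
uninformative = verification_auc([0.5, 0.5], [0.5, 0.5])  # 0.5: chance level
inverted = verification_auc([0.1, 0.2], [0.8, 0.9])       # 0.0: ranking fully reversed
```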
Practical Implications
- Privacy‑compliant AI services: Companies could retroactively remove what a VLM has learned about specific user‑submitted content (e.g., faces, copyrighted art) without rebuilding the whole model.
- Copyright enforcement: Media platforms can remove specific copyrighted objects from a model’s knowledge base, reducing legal risk while retaining overall performance.
- Bias mitigation: Sensitive demographic attributes could be unlearned from a VLM, helping to curb inadvertent bias in downstream applications such as captioning or visual search.
- Developer workflow: HFRU can be integrated as a plug‑in step after fine‑tuning a VLM, requiring only a modest amount of additional compute (≈0.3× the original training cost).
- Open‑source tooling: The released repository includes scripts for defining custom “forget lists,” making it straightforward for engineers to adopt the method in production pipelines.
Limitations & Future Work
- Computational overhead: Although cheaper than full re‑training, the two‑stage process still adds noticeable compute cost for large‑scale models (e.g., CLIP‑ViT‑L/14).
- Scope of abstraction: The abstraction reward works well for generic categories but may struggle with highly nuanced concepts (e.g., specific medical imaging findings).
- Evaluation breadth: The paper focuses on classification and face‑ID tasks; applying HFRU to generative VLMs (e.g., image‑to‑text generation) remains an open question.
- Future directions: The authors suggest exploring more efficient policy‑gradient variants, extending the framework to multimodal generative models, and automating the selection of abstraction vocabularies to further reduce hallucination risk.
Authors
- Kaidi Jia
- Yujie Lin
- Chengyi Yang
- Jiayao Ma
- Jinsong Su
Paper Information
- arXiv ID: 2605.08031v1
- Categories: cs.CV
- Published: May 8, 2026