[Paper] Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion

Published: January 6, 2026 at 12:52 PM EST
4 min read
Source: arXiv - 2601.03213v1

Overview

This paper tackles a practical problem that’s becoming increasingly important as text‑to‑image diffusion models (think Stable Diffusion, DALL·E 2, etc.) see real‑world deployment: how to “unlearn” a specific concept—for example a copyrighted style or a harmful visual motif—without breaking the model’s overall capabilities. The authors propose a reinforcement‑learning (RL) formulation that treats the diffusion denoising process as a sequential decision‑making problem and introduces a timestep‑aware critic to guide the unlearning more stably than prior RL attempts.

Key Contributions

  • RL‑based unlearning framework that models each denoising step as an action, enabling fine‑grained credit assignment.
  • Timestep‑aware critic built on a CLIP‑trained reward predictor that evaluates noisy latent representations at every diffusion step, providing dense per‑step feedback.
  • Policy‑gradient updates for the reverse diffusion kernel that can reuse off‑policy data, making the method compatible with existing diffusion pipelines.
  • Empirical validation across several target concepts showing equal or better forgetting compared to strong supervised baselines while preserving image fidelity and prompt compliance.
  • Open‑source release of code, evaluation scripts, and pretrained critics to accelerate reproducibility and future research.

Methodology

  1. Sequential View of Diffusion – The reverse diffusion process (turning noise into an image) is cast as a Markov decision process:

    • State: the current noisy latent at timestep t.
    • Action: the model’s predicted denoising direction (the diffusion kernel’s output).
    • Transition: applying the diffusion step to move to the next timestep.
  2. Critic Design – A CLIP‑based network is fine‑tuned to predict a scalar “unlearning reward” from a noisy latent and the target concept text. Crucially, the critic receives the noisy latent (not the clean image), so it can provide a learning signal at every diffusion step. A minimal sketch of such a critic appears after this list.

  3. Reward Signal – The reward is high when the latent is far from the unwanted concept (as judged by CLIP similarity) and low otherwise. Because the critic works on noisy latents, the reward is naturally noisy and varies across timesteps, which helps the policy learn where in the diffusion trajectory the concept is most vulnerable.

  4. Policy Update – Using the per‑step rewards, the authors compute advantage estimates and apply a standard REINFORCE‑style policy gradient to adjust the diffusion kernel’s parameters. Off‑policy samples (e.g., latents generated by the original model) can be reused, improving sample efficiency. A sketch of the full update step also follows the list.

  5. Training Loop – The process alternates between:

    • Sampling a batch of prompts containing the target concept.
    • Running the diffusion process while collecting states, actions, and critic rewards.
    • Updating the critic (periodically) and the diffusion policy via the computed advantages.
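
Items 2–3 describe a critic that maps a noisy latent plus the target‑concept text to a scalar reward. Below is a minimal PyTorch sketch of one way such a critic could be wired up; the class name `UnlearnCritic`, the projection‑head sizes, and the use of a precomputed CLIP text embedding for the concept are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn


class UnlearnCritic(nn.Module):
    """Scores a noisy latent against the target-concept embedding (sketch, not the authors' code)."""

    def __init__(self, latent_dim=4 * 64 * 64, embed_dim=512):
        super().__init__()
        # Small head mapping (flattened noisy latent, timestep) into CLIP's embedding space.
        self.latent_proj = nn.Sequential(
            nn.Linear(latent_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, noisy_latent, timestep, concept_emb):
        # noisy_latent: (B, C, H, W) latent at step t; timestep: (B,) step indices;
        # concept_emb: (embed_dim,) frozen CLIP text embedding of the concept to forget.
        x = torch.cat([noisy_latent.flatten(1), timestep.float().unsqueeze(1)], dim=1)
        latent_emb = nn.functional.normalize(self.latent_proj(x), dim=-1)
        concept_emb = nn.functional.normalize(concept_emb, dim=-1)
        similarity = (latent_emb * concept_emb).sum(dim=-1)  # CLIP-style cosine similarity
        # Reward is high when the noisy latent looks unlike the unwanted concept.
        return -similarity
```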
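Items 4–5 outline the training loop: roll out the reverse diffusion process, score each noisy latent with the critic, turn rewards into advantages, and take a REINFORCE‑style gradient step on the diffusion kernel. The sketch below is a hedged approximation under standard assumptions: `sample_trajectory` is a hypothetical helper that returns per‑step latents, stochastic‑step log‑probabilities, and timesteps, and the reward‑to‑go/baseline choices are illustrative; the paper’s off‑policy reuse is omitted for brevity.

```python
import torch


def unlearning_step(unet, critic, scheduler, prompts_with_concept, concept_emb, optimizer):
    # MDP view (item 1): state = noisy latent z_t, action = predicted denoising direction,
    # transition = one scheduler step, reward = critic score of the resulting noisy latent.

    # 1. Roll out reverse diffusion, keeping per-step log-probs of the sampled denoising
    #    actions (hypothetical helper, e.g. wrapping a stochastic DDPM step).
    latents, log_probs, timesteps = sample_trajectory(unet, scheduler, prompts_with_concept)

    # 2. Dense rewards: the critic scores every noisy latent along the trajectory.
    with torch.no_grad():
        rewards = torch.stack(
            [critic(z_t, t, concept_emb) for z_t, t in zip(latents, timesteps)]
        )  # shape (T, B); each t is a (B,) tensor of step indices

    # 3. Advantages: reward-to-go minus a simple per-step batch-mean baseline.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])
    advantages = returns - returns.mean(dim=1, keepdim=True)

    # 4. REINFORCE-style policy gradient on the diffusion kernel's parameters.
    loss = -(torch.stack(log_probs) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```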

Results & Findings

| Metric | Proposed RL‑Unlearn | Supervised Weight Edit | Global Penalty Baseline |
| --- | --- | --- | --- |
| Forgetting (CLIP similarity drop) | −0.78 | −0.71 | −0.65 |
| Image quality (FID) | 12.3 | 13.1 | 14.5 |
| Prompt fidelity (text‑image alignment) | 0.84 | 0.81 | 0.78 |

  • The timestep‑aware critic dramatically reduces variance in the gradient updates, leading to more stable training and faster convergence (≈30 % fewer diffusion steps to reach a target forgetting level).
  • Ablation studies confirm that (i) removing per‑step critics and (ii) using a clean‑image‑only reward both degrade performance, causing either under‑unlearning or noticeable artifacts.
  • Qualitative examples show that the model can erase a specific artist’s style while still generating high‑quality images for unrelated prompts.

Practical Implications

  • Compliance & IP Management – Companies can retroactively strip copyrighted or trademarked visual elements from a deployed diffusion model without re‑training from scratch.
  • Safety & Moderation – Harmful or disallowed visual concepts (e.g., extremist symbols) can be removed on‑the‑fly, reducing the risk of accidental generation.
  • Modular Updates – Because the method works as a plug‑in policy‑gradient layer on top of existing diffusion backbones, developers can integrate it into CI pipelines for continuous “concept hygiene.”
  • Sample Efficiency – Off‑policy reuse means you can leverage logs of previously generated images, lowering the compute cost compared to full supervised fine‑tuning.

Limitations & Future Work

  • Reward Dependence on CLIP – The critic inherits CLIP’s biases; if CLIP misclassifies a concept, the unlearning signal may be noisy or misdirected.
  • Scalability to Many Concepts – The current setup trains a separate critic per target concept; extending to simultaneous multi‑concept unlearning remains an open challenge.
  • Theoretical Guarantees – While empirical forgetting is strong, formal bounds on how much of a concept is removed are not provided.
  • Future Directions suggested by the authors include: exploring multi‑task critics, integrating more robust reward models (e.g., diffusion‑based classifiers), and studying the trade‑off between forgetting speed and downstream task performance.

Authors

  • Mykola Vysotskyi
  • Zahar Kohut
  • Mariia Shpir
  • Taras Rumezhak
  • Volodymyr Karpiv

Paper Information

  • arXiv ID: 2601.03213v1
  • Categories: cs.LG
  • Published: January 6, 2026