[Paper] Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion

Published: January 6, 2026 at 12:52 PM EST
4 min read
Source: arXiv - 2601.03213v1

Overview

This paper tackles a practical problem that’s becoming increasingly important as text‑to‑image diffusion models (think Stable Diffusion, DALL·E 2, etc.) see real‑world deployment: how to “unlearn” a specific concept—for example a copyrighted style or a harmful visual motif—without breaking the model’s overall capabilities. The authors propose a reinforcement‑learning (RL) formulation that treats the diffusion denoising process as a sequential decision‑making problem and introduces a timestep‑aware critic to guide the unlearning more stably than prior RL attempts.

Key Contributions

  • RL‑based unlearning framework that models each denoising step as an action, enabling fine‑grained credit assignment.
  • Timestep‑aware critic built on a CLIP‑trained reward predictor that evaluates noisy latent representations at every diffusion step, providing dense per‑step feedback.
  • Policy‑gradient updates for the reverse diffusion kernel that can reuse off‑policy data, making the method compatible with existing diffusion pipelines.
  • Empirical validation across several target concepts showing equal or better forgetting compared to strong supervised baselines while preserving image fidelity and prompt compliance.
  • Open‑source release of code, evaluation scripts, and pretrained critics to accelerate reproducibility and future research.

Methodology

  1. Sequential View of Diffusion – The reverse diffusion process (turning noise into an image) is cast as a Markov decision process:

    • State: the current noisy latent at timestep t.
    • Action: the model’s predicted denoising direction (the diffusion kernel’s output).
    • Transition: applying the diffusion step to move to the next timestep.
  2. Critic Design – A CLIP‑based network is fine‑tuned to predict a scalar “unlearning reward” from a noisy latent and the target concept text. Crucially, the critic receives the noisy latent (not the clean image), so it can provide a learning signal at every diffusion step. A minimal sketch of such a critic appears after this list.

  3. Reward Signal – The reward is high when the latent is far from the unwanted concept (as judged by CLIP similarity) and low otherwise. Because the critic works on noisy latents, the reward is naturally noisy and varies across timesteps, which helps the policy learn where in the diffusion trajectory the concept is most vulnerable.

  4. Policy Update – Using the per‑step rewards, the authors compute advantage estimates and apply a standard REINFORCE‑style policy gradient to adjust the diffusion kernel’s parameters. Off‑policy samples (e.g., latents generated by the original model) can be reused, improving sample efficiency. A sketch of the full update step also follows the list.

  5. Training Loop – The process alternates between:

    • Sampling a batch of prompts containing the target concept.
    • Running the diffusion process while collecting states, actions, and critic rewards.
    • Updating the critic (periodically) and the diffusion policy via the computed advantages.
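
Items 2–3 describe a critic that maps a noisy latent plus the target‑concept text to a scalar reward. Below is a minimal PyTorch sketch of one way such a critic could be wired up; the class name `UnlearnCritic`, the projection‑head sizes, and the use of a precomputed CLIP text embedding for the concept are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn


class UnlearnCritic(nn.Module):
    """Scores a noisy latent against the target-concept embedding (sketch, not the authors' code)."""

    def __init__(self, latent_dim=4 * 64 * 64, embed_dim=512):
        super().__init__()
        # Small head mapping (flattened noisy latent, timestep) into CLIP's embedding space.
        self.latent_proj = nn.Sequential(
            nn.Linear(latent_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, noisy_latent, timestep, concept_emb):
        # noisy_latent: (B, C, H, W) latent at step t; timestep: (B,) step indices;
        # concept_emb: (embed_dim,) frozen CLIP text embedding of the concept to forget.
        x = torch.cat([noisy_latent.flatten(1), timestep.float().unsqueeze(1)], dim=1)
        latent_emb = nn.functional.normalize(self.latent_proj(x), dim=-1)
        concept_emb = nn.functional.normalize(concept_emb, dim=-1)
        similarity = (latent_emb * concept_emb).sum(dim=-1)  # CLIP-style cosine similarity
        # Reward is high when the noisy latent looks unlike the unwanted concept.
        return -similarity
```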
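Items 4–5 outline the training loop: roll out the reverse diffusion process, score each noisy latent with the critic, turn rewards into advantages, and take a REINFORCE‑style gradient step on the diffusion kernel. The sketch below is a hedged approximation under standard assumptions: `sample_trajectory` is a hypothetical helper that returns per‑step latents, stochastic‑step log‑probabilities, and timesteps, and the reward‑to‑go/baseline choices are illustrative; the paper’s off‑policy reuse is omitted for brevity.

```python
import torch


def unlearning_step(unet, critic, scheduler, prompts_with_concept, concept_emb, optimizer):
    # MDP view (item 1): state = noisy latent z_t, action = predicted denoising direction,
    # transition = one scheduler step, reward = critic score of the resulting noisy latent.

    # 1. Roll out reverse diffusion, keeping per-step log-probs of the sampled denoising
    #    actions (hypothetical helper, e.g. wrapping a stochastic DDPM step).
    latents, log_probs, timesteps = sample_trajectory(unet, scheduler, prompts_with_concept)

    # 2. Dense rewards: the critic scores every noisy latent along the trajectory.
    with torch.no_grad():
        rewards = torch.stack(
            [critic(z_t, t, concept_emb) for z_t, t in zip(latents, timesteps)]
        )  # shape (T, B); each t is a (B,) tensor of step indices

    # 3. Advantages: reward-to-go minus a simple per-step batch-mean baseline.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])
    advantages = returns - returns.mean(dim=1, keepdim=True)

    # 4. REINFORCE-style policy gradient on the diffusion kernel's parameters.
    loss = -(torch.stack(log_probs) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```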

Results & Findings

| Metric | Proposed RL‑Unlearn | Supervised Weight Edit | Global Penalty Baseline |
| --- | --- | --- | --- |
| Forgetting (CLIP similarity drop) | −0.78 | −0.71 | −0.65 |
| Image quality (FID) | 12.3 | 13.1 | 14.5 |
| Prompt fidelity (text‑image alignment) | 0.84 | 0.81 | 0.78 |

  • The timestep‑aware critic dramatically reduces variance in the gradient updates, leading to more stable training and faster convergence (≈30 % fewer diffusion steps to reach a target forgetting level).
  • Ablation studies confirm that (i) removing per‑step critics and (ii) using a clean‑image‑only reward both degrade performance, causing either under‑unlearning or noticeable artifacts.
  • Qualitative examples show that the model can erase a specific artist’s style while still generating high‑quality images for unrelated prompts.

Practical Implications

  • Compliance & IP Management – Companies can retroactively strip copyrighted or trademarked visual elements from a deployed diffusion model without re‑training from scratch.
  • Safety & Moderation – Harmful or disallowed visual concepts (e.g., extremist symbols) can be removed on‑the‑fly, reducing the risk of accidental generation.
  • Modular Updates – Because the method works as a plug‑in policy‑gradient layer on top of existing diffusion backbones, developers can integrate it into CI pipelines for continuous “concept hygiene.”
  • Sample Efficiency – Off‑policy reuse means you can leverage logs of previously generated images, lowering the compute cost compared to full supervised fine‑tuning.

Limitations & Future Work

  • Reward Dependence on CLIP – The critic inherits CLIP’s biases; if CLIP misclassifies a concept, the unlearning signal may be noisy or misdirected.
  • Scalability to Many Concepts – The current setup trains a separate critic per target concept; extending to simultaneous multi‑concept unlearning remains an open challenge.
  • Theoretical Guarantees – While empirical forgetting is strong, formal bounds on how much of a concept is removed are not provided.
  • Future Directions suggested by the authors include: exploring multi‑task critics, integrating more robust reward models (e.g., diffusion‑based classifiers), and studying the trade‑off between forgetting speed and downstream task performance.

Authors

  • Mykola Vysotskyi
  • Zahar Kohut
  • Mariia Shpir
  • Taras Rumezhak
  • Volodymyr Karpiv

Paper Information

  • arXiv ID: 2601.03213v1
  • Categories: cs.LG
  • Published: January 6, 2026