[Paper] Antidistillation Fingerprinting

Published: February 3, 2026 at 01:15 PM EST
5 min read
Source: arXiv - 2602.03812v1

Overview

Distilling large language models (LLMs) into smaller “student” models is a common way to cut costs while preserving performance. But when a third party reuses a proprietary model’s outputs for training, the original owner needs a reliable way to prove that the student was trained on those outputs. The paper Antidistillation Fingerprinting proposes a new, principled fingerprinting technique that can be embedded in a teacher model’s generations and later detected in any student model trained on them—without sacrificing the quality of the generated text.

Key Contributions

  • Antidistillation Fingerprinting (ADFP): A gradient‑based method that designs fingerprints specifically to survive the student’s fine‑tuning process.
  • Proxy‑model optimization: Uses a lightweight surrogate model to predict how a student will internalize the fingerprint, then samples tokens that maximize expected detectability.
  • Pareto‑optimal trade‑off: Demonstrates markedly higher detection confidence than prior watermarking approaches while keeping utility loss (e.g., answer correctness, fluency) negligible.
  • Architecture‑agnostic detection: Works even when the attacker’s student model architecture is unknown, a first for fingerprinting research.
  • Empirical validation: Experiments on the GSM8K math‑reasoning benchmark and the OASST1 instruction‑following dataset show consistent gains across model sizes and training regimes.

Methodology

  1. Fingerprint Objective Alignment – Instead of sprinkling arbitrary perturbations into the teacher’s output (the classic watermark approach), ADFP defines a loss that directly measures how easy it will be for a downstream student to learn the fingerprint.
  2. Proxy Model Construction – A small, differentiable “shadow” student is trained on a batch of teacher outputs. Because gradients can be back‑propagated through this proxy, the method can evaluate how changes to the teacher’s token distribution affect the fingerprint’s learnability.
  3. Antidistillation Sampling – During generation, the teacher model selects tokens that increase the proxy’s fingerprint loss gradient, i.e., tokens that make the fingerprint more salient to the student. This is done via a simple beam‑search‑style scoring that adds a fingerprint term to the usual language‑model log‑probability (a toy version of this scoring follows this list).
  4. Detection Phase – After a suspect model is fine‑tuned, a verifier runs a short inference pass on a held‑out prompt set and applies a statistical test (e.g., a likelihood‑ratio test) to check whether the fingerprint pattern appears more often than chance (a minimal example follows the paragraph after this list).
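
A minimal sketch of the antidistillation sampling step (items 2–3 above), written against a generic Hugging Face-style causal LM. The names (`fingerprint_signal`, `score_candidates`, `lambda_fp`) are illustrative, and the gradient-norm stand-in for the proxy term is an assumption of this sketch, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def fingerprint_signal(proxy, candidate_ids):
    """Crude stand-in for ADFP's proxy-based term: how strongly this
    candidate continuation would update the shadow student. The paper
    optimizes a dedicated fingerprint loss through the proxy; here we
    simply measure the gradient magnitude of the proxy's LM loss
    (an illustrative assumption, not the paper's objective)."""
    proxy.zero_grad()
    loss = proxy(candidate_ids, labels=candidate_ids).loss
    loss.backward()
    return sum(p.grad.norm() for p in proxy.parameters()
               if p.grad is not None).detach()

def score_candidates(teacher, proxy, input_ids, lambda_fp=1.0, top_k=20):
    """Pick the next token by adding a fingerprint term to the usual
    language-model log-probability, in the spirit of the paper's
    beam-search-style scoring. lambda_fp trades fluency against
    detectability."""
    with torch.no_grad():
        logits = teacher(input_ids).logits[:, -1, :]   # next-token logits
    log_probs = F.log_softmax(logits, dim=-1)[0]
    topk_lp, topk_ids = log_probs.topk(top_k)          # restrict to plausible tokens

    scores = []
    for lp, tok in zip(topk_lp, topk_ids):
        candidate = torch.cat([input_ids, tok.view(1, 1)], dim=1)
        scores.append(lp + lambda_fp * fingerprint_signal(proxy, candidate))
    return topk_ids[torch.stack(scores).argmax()]      # highest combined score
```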

The whole pipeline is fully differentiable, allowing the fingerprint to be optimized end‑to‑end without manually tuning heuristics.
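
For the detection phase, a toy verifier might look like the following. It runs a one-sided binomial test on the rate of fingerprint “hits” across scored tokens; the null hit rate `p_null` and the threshold `alpha` are assumptions of this sketch, and the paper's verifier uses a likelihood-ratio test rather than this simplification:

```python
from scipy.stats import binomtest

def detect_distillation(hits, total, p_null=0.5, alpha=1e-3):
    """Flag a suspect model if fingerprint-pattern tokens appear more
    often than chance. p_null is the hit rate expected from a model
    never exposed to fingerprinted text (an assumed parameter)."""
    result = binomtest(hits, total, p_null, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# e.g. 5,700 fingerprint hits across 10,000 scored tokens
flagged, p_value = detect_distillation(5700, 10_000)
print(f"fingerprint detected: {flagged} (p = {p_value:.2e})")
```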

Results & Findings

| Benchmark | Teacher → Student (size) | Detection AUC (ADFP) | Detection AUC (baseline watermark) | Utility Δ (accuracy / BLEU) |
| --- | --- | --- | --- | --- |
| GSM8K (math) | 13B → 2.7B | 0.92 | 0.71 | −0.3 % |
| OASST1 (dialog) | 7B → 1.3B | 0.88 | 0.64 | −0.2 % |

  • Pareto improvement: ADFP consistently pushes the detection curve upward while keeping the drop in task performance under 0.5 %.
  • Robustness to architecture mismatch: Even when the student was an encoder‑decoder Transformer while the proxy was a decoder‑only model, detection AUC stayed above 0.85.
  • Low overhead: The additional computation during generation is roughly a 5 % increase in inference time, far less than the 20‑30 % overhead typical of watermarking schemes.

Practical Implications

  • IP protection for AI SaaS: Companies can embed ADFP fingerprints in their API responses, giving them forensic evidence if a competitor releases a “copycat” model.
  • Compliance & audit trails: Regulators could require fingerprinted outputs to verify that proprietary data isn’t being redistributed without consent.
  • Developer tooling: Open‑source libraries could expose a simple add_fingerprint() wrapper around any Hugging Face model, making adoption as easy as toggling a flag (a hypothetical sketch follows this list).
  • Mitigating model theft in the wild: Since the detection works without knowing the student’s architecture, it can be applied to black‑box models accessed only via an API, enabling third‑party auditors to spot illicit distillation.
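
No such library exists yet, but a hypothetical `add_fingerprint()` convenience wrapper, reusing the `score_candidates` scorer sketched in the Methodology section, might look like this; every name and checkpoint below is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def add_fingerprint(teacher, proxy, tokenizer, lambda_fp=1.0):
    """Hypothetical convenience wrapper: returns a generate() function
    that runs the antidistillation sampling loop token by token."""
    def generate(prompt, max_new_tokens=64):
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            # score_candidates() is the fingerprint-aware scorer from
            # the earlier Methodology sketch
            next_tok = score_candidates(teacher, proxy, input_ids, lambda_fp)
            input_ids = torch.cat([input_ids, next_tok.view(1, 1)], dim=1)
            if next_tok.item() == tokenizer.eos_token_id:
                break
        return tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return generate

# stand-in checkpoints for illustration only
tokenizer = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")
proxy = AutoModelForCausalLM.from_pretrained("distilgpt2")
generate = add_fingerprint(teacher, proxy, tokenizer)
print(generate("Solve: 12 * 7 ="))
```

Because all of the fingerprinting logic lives in the sampling loop, a wrapper like this would leave the underlying model weights untouched.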

Overall, ADFP offers a realistic path to “digital watermarks” for LLMs that don’t cripple the user experience—clearing a key hurdle for industry adoption.

Limitations & Future Work

  • Assumes access to teacher outputs: The method requires the original model to generate fingerprinted text; it cannot retroactively protect already‑released corpora.
  • Proxy fidelity: While the proxy model works well in experiments, extreme architectural gaps (e.g., student using retrieval‑augmented generation) could reduce detection power.
  • Adversarial countermeasures: A determined attacker might fine‑tune with adversarial regularization to suppress the fingerprint; the authors suggest exploring robust training objectives as a next step.
  • Scalability to massive corpora: Extending ADFP to billions of tokens and multi‑modal data (e.g., vision‑language models) remains an open challenge.

The authors plan to release a toolkit for easy integration and to investigate adaptive fingerprints that evolve over time to stay ahead of adversarial stripping.

Authors

  • Yixuan Even Xu
  • John Kirchenbauer
  • Yash Savani
  • Asher Trockman
  • Alexander Robey
  • Tom Goldstein
  • Fei Fang
  • J. Zico Kolter

Paper Information

  • arXiv ID: 2602.03812v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: February 3, 2026