[Paper] How Good is Post-Hoc Watermarking With Language Model Rephrasing?
Source: arXiv - 2512.16904v1
Overview
The paper investigates post‑hoc watermarking, a technique that lets a language model rewrite already‑written text while embedding a hidden statistical signal (a “watermark”). This approach could help protect copyrighted material, flag AI‑generated content used in training pipelines, or detect the presence of watermarked text in retrieval‑augmented generation (RAG) systems. By moving the watermarking step from generation time to a re‑phrasing stage, the authors explore new levers—larger re‑writer models, beam search, multi‑candidate generation, and entropy‑based filtering—that can improve the balance between text quality and watermark detectability.
Key Contributions
- Introduces post‑hoc watermarking as a practical alternative to generation‑time watermarking for existing documents.
- Systematically evaluates how compute allocation (model size, beam width, candidate count, detection‑time filtering) influences the quality‑detectability trade‑off.
- Shows that simple Gumbel‑max sampling outperforms more sophisticated watermarking schemes under nucleus sampling.
- Demonstrates strong detectability and semantic fidelity on long‑form, open‑ended text (e.g., books).
- Reveals a surprising limitation: for highly verifiable text like source code, smaller re‑writer models actually watermark more reliably than larger ones.
- Provides a set of practical recipes (beam search + entropy filtering, multi‑candidate voting) that can be adopted by developers today.
Methodology
- Baseline Generation‑Time Watermark – The authors start from a standard watermark that biases token selection during generation (e.g., “green‑list” vs. “red‑list” tokens); minimal sketches of this bias and of the Gumbel‑max rule appear after this list.
- Post‑Hoc Re‑writing Pipeline – An LLM (the re‑writer) is prompted to paraphrase an existing passage while the watermarking logic is applied during its decoding, so the rewrite carries the hidden signal.
- Compute‑Allocation Strategies
- Model Size: Experiments with re‑writer models ranging from 0.7B to 13B parameters.
- Beam Search: Varying beam widths (1, 4, 8) to explore diverse yet high‑probability rewrites.
- Multi‑Candidate Generation: Produce several paraphrases per input and select the one with the strongest watermark signal.
- Entropy Filtering at Detection: Discard low‑entropy (high‑certainty) tokens when scoring, since their near‑deterministic choices dilute the watermark’s statistical signature.
- Evaluation Metrics
- Detectability: Measured by the “radioactivity” score (how strongly the watermark can be recovered).
- Semantic Fidelity: Assessed with BLEU, ROUGE, and human judgments on meaning preservation.
- Domain Split: Separate test sets for open‑ended prose (books) and highly verifiable code snippets.
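To make the green‑list mechanism concrete, here is a minimal sketch of the logit‑bias step, assuming the green list is re‑seeded from the previous token and a secret key at every step; the `key`, `gamma`, and `delta` values are illustrative defaults, not the paper’s configuration.

```python
import torch

def greenlist_bias(logits, prev_token, vocab_size, key=42, gamma=0.5, delta=2.0):
    """Bias the next-token logits toward a pseudorandom 'green' vocabulary subset.

    The green list is derived from (key, previous token), so a detector that
    knows the key can recompute it and count how often the text lands on it.
    Sketch only: gamma/delta are assumed defaults, not the paper's settings.
    """
    g = torch.Generator().manual_seed(key * 1_000_003 + int(prev_token))
    green = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # favor green tokens; red tokens are left untouched
    return biased
```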
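For the Gumbel‑max scheme that the results single out, a sketch of the per‑step selection rule follows: draw key‑seeded pseudorandom values r in [0, 1] over the vocabulary and pick argmax rᵢ^(1/pᵢ). Seeding on just the previous token is a simplification of the usual k‑token context hash.

```python
import torch

def gumbel_max_select(probs, prev_token, key=42):
    """One step of an Aaronson-style Gumbel-max watermark.

    Picks t = argmax_i r_i ** (1 / p_i) with r pseudorandom but reproducible
    from (key, context); because r is uniform, the chosen token is an exact
    sample from probs, so the step does not distort the model's distribution.
    """
    g = torch.Generator().manual_seed(key * 1_000_003 + int(prev_token))
    r = torch.rand(probs.shape[-1], generator=g)
    # argmax of r**(1/p) equals argmax of log(r)/p, which is numerically safer
    scores = torch.log(r) / probs.clamp_min(1e-9)
    return int(torch.argmax(scores))
```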
Results & Findings
| Setting | Detectability (↑) | Semantic Fidelity (↑) | Notable Observation |
|---|---|---|---|
| Gumbel‑max + nucleus sampling | ★★★★★ | ★★★★☆ | Outperforms newer schemes despite its simplicity. |
| Beam search (beam = 8) | +15% radioactivity vs. greedy | +8% ROUGE | Beam search consistently boosts both signal and quality. |
| Multi‑candidate voting (k = 5) | +10% radioactivity | –2% BLEU (minor meaning drift) | Trade‑off: stronger watermark at slight fidelity loss. |
| Entropy filtering (threshold = 0.7) | +12% detection recall | No measurable fidelity loss | Effective “noise‑reduction” at detection time. |
| Code domain | ↓ with larger re‑writers (≥6B), ↑ with smaller ones (≤1B) | — | Counter‑intuitive: over‑parameterized rewrites introduce too much variance, breaking the watermark. |
Overall, the best‑performing recipe for prose was Gumbel‑max + beam = 8 + entropy filtering, achieving >90% detection recall while keeping BLEU >0.85 relative to the original text.
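As a rough illustration of the detection side, the sketch below combines a standard green‑list z‑score test with the entropy filter; the per‑token entropies are assumed to come from a scoring model run over the text, and the 0.7 threshold mirrors the table above rather than a universal constant.

```python
import math
import torch

def detect_zscore(token_ids, entropies, vocab_size, key=42,
                  gamma=0.5, entropy_threshold=0.7):
    """Score a token sequence for the green-list watermark.

    Low-entropy positions (where any model would emit the same token) are
    skipped so they cannot dilute the statistic. Returns a z-score; values
    above ~4 indicate a watermark with very low false-positive probability.
    """
    hits, n = 0, 0
    for prev, tok, h in zip(token_ids, token_ids[1:], entropies[1:]):
        if h < entropy_threshold:  # entropy filter applied at detection time
            continue
        g = torch.Generator().manual_seed(key * 1_000_003 + int(prev))
        green = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
        hits += int(tok in set(green.tolist()))
        n += 1
    if n == 0:
        return 0.0
    # under the no-watermark null, hits ~ Binomial(n, gamma)
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
```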
Practical Implications
- Copyright Protection: Publishers can run a lightweight re‑writer on their manuscripts before distribution, embedding a hidden tag that survives downstream transformations (e.g., OCR, summarization).
- Training‑Data Auditing: Companies can scan large corpora for “watermark radioactivity” to flag content that may have been derived from protected sources, helping enforce data‑use policies.
- RAG Safeguards: Retrieval‑augmented pipelines can discard or down‑weight documents that carry a strong watermark, reducing the risk of unintentionally leaking proprietary text into generated answers.
- Tooling Integration: The study’s recipes are compatible with existing open‑source LLM stacks (e.g., Hugging Face Transformers). Entropy filtering adds negligible latency at detection time, while beam search and multi‑candidate generation scale generation cost with beam width and candidate count (see the sketch after this list).
- Code‑Specific Use Cases: For source‑code repositories, a smaller re‑writer (≈1B parameters) should be used to retain watermark detectability, suggesting a “dual‑model” strategy—large model for prose, small model for code.
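As noted in the tooling item above, the beam‑search and multi‑candidate recipes map onto standard Hugging Face `generate` arguments. The sketch below wraps the earlier green‑list bias as a `LogitsProcessor` and keeps the strongest‑scoring candidate; the model name, prompt, and `score_fn` (e.g., tokenize the text and run the z‑score detector above) are illustrative stand‑ins, not the paper’s exact setup.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class GreenListProcessor(LogitsProcessor):
    """The green-list bias from the earlier sketch, wrapped for HF generate."""

    def __init__(self, vocab_size, key=42, gamma=0.5, delta=2.0):
        self.vocab_size, self.key, self.gamma, self.delta = vocab_size, key, gamma, delta

    def __call__(self, input_ids, scores):
        for b in range(input_ids.shape[0]):  # one row per beam during beam search
            g = torch.Generator().manual_seed(self.key * 1_000_003 + int(input_ids[b, -1]))
            green = torch.randperm(self.vocab_size, generator=g)[: int(self.gamma * self.vocab_size)]
            scores[b, green] += self.delta
        return scores

name = "Qwen/Qwen2.5-1.5B-Instruct"  # illustrative re-writer, not the paper's
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def rewrite_with_strongest_watermark(passage, score_fn, k=5, beams=8):
    """Beam-search k watermarked paraphrases and return the best-scoring one."""
    inputs = tok(f"Paraphrase, preserving meaning:\n{passage}\n", return_tensors="pt")
    outs = model.generate(
        **inputs,
        num_beams=beams,
        num_return_sequences=k,  # HF requires k <= num_beams
        max_new_tokens=256,
        logits_processor=LogitsProcessorList(
            [GreenListProcessor(model.config.vocab_size)]
        ),
    )
    new_tokens = outs[:, inputs["input_ids"].shape[1]:]
    candidates = [tok.decode(t, skip_special_tokens=True) for t in new_tokens]
    return max(candidates, key=score_fn)  # pick the strongest watermark signal
```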
Limitations & Future Work
- Domain Sensitivity: The approach struggles with highly deterministic text (e.g., code, legal clauses) where even minor paraphrasing can break functional correctness.
- Adversarial Removal: An attacker could apply aggressive paraphrasing or back‑translation to dilute the watermark; robustness against such attacks remains an open question.
- Scalability: While beam search improves results, it multiplies compute cost; real‑time services may need to balance latency vs. watermark strength.
- Evaluation Scope: Experiments were limited to English prose and Python code; multilingual and cross‑language scenarios need exploration.
Bottom line: Post‑hoc watermarking opens a practical pathway for embedding traceable signals into existing text, offering developers a new lever to protect IP and monitor data usage—provided they respect the method’s current constraints and continue to monitor emerging research on robustness and scalability.
Authors
- Pierre Fernandez
- Tom Sander
- Hady Elsahar
- Hongyan Chang
- Tomáš Souček
- Valeriu Lacatusu
- Tuan Tran
- Sylvestre‑Alvise Rebuffi
- Alexandre Mourachko
Paper Information
- arXiv ID: 2512.16904v1
- Categories: cs.CR, cs.CL
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16904v1