[Paper] MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking
Source: arXiv - 2512.04044v1
Overview
The paper introduces MarkTune, a new way to watermark the outputs of open‑weight large language models (LLMs). By fine‑tuning the model itself rather than tweaking its inference process, MarkTune achieves a much better balance between preserving text quality and making the hidden watermark reliably detectable with a secret key.
Key Contributions
- On‑policy fine‑tuning framework that treats the existing GaussMark watermark signal as a reward while explicitly penalizing quality loss.
- Theoretical justification showing why MarkTune improves upon GaussMark’s weight‑perturbation approach.
- Empirical evidence that MarkTune pushes the quality‑detectability frontier close to that of inference‑time watermarks, even though the model weights are public.
- Robustness analysis demonstrating resistance to paraphrasing and downstream fine‑tuning attacks, along with strong generalization to unseen datasets.
- Practical recipe for developers to embed durable watermarks into any open‑weight LLM without sacrificing generation fluency.
Methodology
- Start from GaussMark, a lightweight weight‑perturbation watermark that adds small, secret‑key‑derived Gaussian noise to selected model parameters, creating a hidden signal detectable with that key.
- Define a reward function that measures how strongly the GaussMark signal appears in generated text (e.g., the detector’s test statistic computed with the secret key).
- Add a quality regularizer that penalizes deviations in standard language‑model metrics (perplexity, BLEU, or human‑rated fluency).
- Fine‑tune the model on‑policy: run the model, sample text, compute the combined reward (watermark strength − λ × quality loss), and back‑propagate to update the weights, as illustrated in the sketches after this list.
- Iterate until the watermark detection rate meets a target while the quality metric stays within an acceptable drop (typically < 2 % perplexity increase).
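To make these ingredients concrete, here is a minimal sketch of a GaussMark‑style setup: a toy stand‑in model, a keyed Gaussian perturbation of one weight matrix, a correlation‑style watermark score, and the combined reward. Everything in it is an illustrative assumption (the TinyLM toy model, choosing the output head as the watermarked layer, a KL‑to‑reference quality penalty standing in for the paper’s perplexity regularizer), not the paper’s implementation.

```python
# Illustrative sketch only: a toy model plus simplified GaussMark-style pieces.
# The watermarked layer, score, and reward here are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """Toy next-token model standing in for an open-weight LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB, bias=False)  # layer we choose to watermark

    def forward(self, tokens):                # tokens: (batch, seq) of token ids
        return self.head(self.embed(tokens))  # logits: (batch, seq, vocab)

def keyed_noise(shape, secret_key: int) -> torch.Tensor:
    """Gaussian direction derived deterministically from the secret key."""
    gen = torch.Generator().manual_seed(secret_key)
    return torch.randn(shape, generator=gen)

def apply_gaussmark(model: TinyLM, secret_key: int, sigma: float = 0.02) -> torch.Tensor:
    """GaussMark-style perturbation: W <- W + sigma * xi(key) on one weight matrix."""
    xi = keyed_noise(model.head.weight.shape, secret_key)
    with torch.no_grad():
        model.head.weight.add_(sigma * xi)
    return xi

def watermark_score(model: TinyLM, tokens: torch.Tensor, xi: torch.Tensor) -> float:
    """Simplified detection statistic: cosine similarity between the gradient of
    the text log-likelihood at the watermarked layer and the secret direction xi."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    ll = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()
    (grad,) = torch.autograd.grad(ll, model.head.weight)
    return F.cosine_similarity(grad.flatten(), xi.flatten(), dim=0).item()

def combined_reward(model, ref_model, tokens, xi, lam: float = 0.5) -> float:
    """Reward = watermark strength - lambda * quality penalty. The penalty here is
    a per-token KL to a frozen copy of the original model, a stand-in for the
    paper's perplexity/fluency regularizer."""
    with torch.no_grad():
        p = F.log_softmax(model(tokens[:, :-1]), dim=-1)
        q = F.log_softmax(ref_model(tokens[:, :-1]), dim=-1)
        kl = (p.exp() * (p - q)).sum(-1).mean().item()
    return watermark_score(model, tokens, xi) - lam * kl
```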
Because the fine‑tuning loop directly observes the watermark’s detectability, it can make fine‑grained adjustments in the representation space, avoiding the blunt, large‑scale weight changes that hurt fluency in earlier methods.
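Building on the ingredients above (and reusing TinyLM, apply_gaussmark, and combined_reward from that sketch), a bare‑bones on‑policy loop might look like the following. This is a plain REINFORCE‑style update with placeholder hyperparameters; the paper’s actual fine‑tuning objective and optimizer settings may differ.

```python
# Bare-bones on-policy loop, reusing TinyLM, apply_gaussmark and combined_reward
# from the sketch above; hyperparameters are placeholders, not the paper's.
import copy

model = TinyLM()
ref_model = copy.deepcopy(model).eval()       # frozen copy of the original weights
for p in ref_model.parameters():
    p.requires_grad_(False)
xi = apply_gaussmark(model, secret_key=1234)  # embed the keyed perturbation first
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(200):
    # 1. Sample text on-policy from the current (watermarked) model.
    tokens = torch.randint(0, VOCAB, (4, 1))  # random prompts for the toy model
    for _ in range(31):
        with torch.no_grad():
            probs = F.softmax(model(tokens)[:, -1], dim=-1)
        tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)

    # 2. Score the samples: watermark strength minus lambda * quality penalty.
    reward = combined_reward(model, ref_model, tokens, xi, lam=0.5)

    # 3. REINFORCE-style update: scale the samples' log-probability by the reward.
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    seq_logp = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()
    loss = -reward * seq_logp
    opt.zero_grad()
    loss.backward()
    opt.step()
```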
Results & Findings
| Metric | GaussMark (baseline) | MarkTune | Inference‑time watermark* |
|---|---|---|---|
| Detection accuracy (key‑known) | 78 % | 92 % | 95 % |
| Perplexity increase | +6 % | +1.8 % | +0.5 % |
| Robustness to paraphrase (drop in detection) | –15 % | –3 % | –2 % |
| Cross‑dataset transfer (trained on Wiki, tested on News) | 65 % | 84 % | 86 % |
*Inference‑time watermark refers to methods that modify token sampling at generation time (e.g., green‑list/red‑list schemes).
Key takeaways
- MarkTune narrows the gap to inference‑time watermarks while keeping the model’s generation quality virtually intact.
- The watermark survives common attacks such as paraphrasing or additional fine‑tuning on downstream tasks.
- A single MarkTune run on one corpus yields a watermark that remains detectable on completely different text domains.
Practical Implications
- Open‑source model distributors can embed a verifiable provenance tag directly into the model weights, giving downstream users a way to prove authenticity without altering runtime pipelines.
- Compliance & audit tools can query a model’s outputs with the secret key to confirm whether a piece of text was generated by a watermarked model, aiding in IP protection and misinformation detection.
- Deployments on edge devices (where inference‑time interventions may be costly or impossible) can rely on the pre‑watermarked model, simplifying integration.
- Fine‑tuning services (e.g., custom instruction‑tuning) can adopt MarkTune as a pre‑step to ensure that any derived model inherits the watermark, preserving traceability across model forks.
Overall, MarkTune offers a plug‑and‑play solution: run a short fine‑tuning job (often a few thousand steps) and obtain a model that behaves like the original but carries a strong, secret‑key‑detectable signature.
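To make the audit scenario concrete, the toy check below turns the correlation statistic from the methodology sketch into a detection decision by comparing the secret key’s score against scores obtained with random wrong keys. It reuses keyed_noise and watermark_score (and the trained toy model) from the earlier sketches; the empirical null comparison is an illustrative stand‑in, not the paper’s calibrated GaussMark test.

```python
# Toy audit check, reusing keyed_noise and watermark_score from the methodology
# sketch; the empirical null comparison is illustrative, not the paper's test.
def is_watermarked(model, tokens, secret_key: int,
                   n_null: int = 200, alpha: float = 0.01) -> bool:
    """Empirical p-value: how often does a random (wrong) key score at least as
    high as the secret key? Flag the text only if that is rare."""
    xi_true = keyed_noise(model.head.weight.shape, secret_key)
    observed = watermark_score(model, tokens, xi_true)
    null_scores = [
        watermark_score(model, tokens,
                        keyed_noise(model.head.weight.shape, secret_key + 1 + k))
        for k in range(n_null)
    ]
    p_value = sum(s >= observed for s in null_scores) / n_null
    return p_value < alpha

# Usage: audit token sequences attributed to the suspect model (random ids here
# purely to exercise the interface; a real audit would tokenize the disputed text).
suspect_tokens = torch.randint(0, VOCAB, (4, 32))
print(is_watermarked(model, suspect_tokens, secret_key=1234))
```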
Limitations & Future Work
- Computation cost: Although far cheaper than full model retraining, MarkTune still requires an on‑policy fine‑tuning loop, which may be non‑trivial for very large models (e.g., > 70 B parameters).
- Secret‑key management: Detection hinges on keeping the key confidential; the paper does not address key rotation or revocation strategies.
- Adversarial adaptation: While robust to basic paraphrasing and fine‑tuning attacks, a determined adversary could train a dedicated “watermark‑removal” model; future work could explore adversarial training to harden the signal.
- Evaluation breadth: Experiments focus on English corpora; extending to multilingual models and code generation domains remains an open question.
The authors suggest exploring more efficient fine‑tuning algorithms (e.g., LoRA or adapters) and formalizing security guarantees against adaptive attackers as next steps.
Authors
- Yizhou Zhao
- Zhiwei Steven Wu
- Adam Block
Paper Information
- arXiv ID: 2512.04044v1
- Categories: cs.LG, cs.AI, cs.CR
- Published: December 3, 2025
- PDF: https://arxiv.org/pdf/2512.04044v1