[Paper] TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data
Source: arXiv - 2511.21600v1
Overview
The paper introduces TAB‑DRW, a lightweight watermarking technique designed for synthetic tabular data generated by AI models. By embedding a hidden signal in the frequency domain of the data, the method makes it possible to verify provenance even after the data has been edited or transformed, a capability that matters increasingly for industries that share or sell synthetic datasets.
Key Contributions
- Frequency‑domain watermarking: Uses the discrete Fourier transform (DFT) on normalized tabular rows, tweaking the imaginary components to encode a pseudorandom bitstream.
- Mixed‑type support: Handles continuous, ordinal, and categorical columns in a single pipeline via Yeo‑Johnson transformation and standardization.
- Row‑wise, storage‑free retrieval: Introduces a rank‑based pseudorandom bit generator that lets a verifier reconstruct the watermark for any row on‑the‑fly, eliminating the need to store extra metadata.
- Robustness to post‑processing: Demonstrates resilience against common attacks such as rounding, scaling, noise injection, and even partial row deletion.
- Efficiency: The entire embedding and detection process runs in linear time with respect to the number of rows, avoiding the heavy compute cost of diffusion‑model‑based watermarks.
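The normalization named in the mixed‑type bullet can be sketched with the standard Yeo‑Johnson formula followed by z‑scoring. This is a minimal illustration, not the authors' code: the function names and the fixed `lam=0.5` are assumptions (in practice the exponent is usually fitted per column, for example with scikit-learn's `PowerTransformer`), and how the paper encodes ordinal and categorical columns before this step is not covered here.

```python
import math
from statistics import fmean, pstdev


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of one value; defined for any real y."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)


def yeo_johnson_inverse(t, lam):
    """Inverse of yeo_johnson for the same exponent lam."""
    if t >= 0:
        if abs(lam) < 1e-12:
            return math.expm1(t)
        return (lam * t + 1.0) ** (1.0 / lam) - 1.0
    if abs(lam - 2.0) < 1e-12:
        return -math.expm1(-t)
    return 1.0 - (1.0 - (2.0 - lam) * t) ** (1.0 / (2.0 - lam))


def normalize_column(values, lam):
    """Transform a column, then standardize it to zero mean and unit variance."""
    transformed = [yeo_johnson(v, lam) for v in values]
    mean = fmean(transformed)
    std = pstdev(transformed) or 1.0  # guard against constant columns
    return [(t - mean) / std for t in transformed], mean, std


def denormalize_column(z_scores, lam, mean, std):
    """Undo the standardization, then undo the power transform."""
    return [yeo_johnson_inverse(z * std + mean, lam) for z in z_scores]


# Hypothetical column with negative, zero, and large positive values.
column = [-3.0, -0.5, 0.0, 1.2, 40.0]
z, mu, sigma = normalize_column(column, lam=0.5)
restored = denormalize_column(z, 0.5, mu, sigma)
```

Because both steps are invertible, a watermark added in the normalized space can be mapped back to the original units, which is what the inverse‑transform step of the method relies on.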
Methodology
- Pre‑processing
  - Each column is transformed with the Yeo‑Johnson power transform (works for both positive and negative values) and then standardized (zero mean, unit variance).
- Frequency conversion
  - The normalized row vector is fed to a 1‑D DFT, producing complex coefficients (real + imaginary parts).
- Bit embedding
  - A rank‑based PRNG generates a deterministic pseudorandom bit for each row based on its sorted position in the dataset.
  - Selected DFT coefficients (chosen adaptively to avoid low‑energy components) have their imaginary parts nudged up or down by a tiny epsilon, encoding the bit while keeping the overall row distribution unchanged.
- Inverse transform
  - An inverse DFT brings the data back to the original space, followed by de‑standardization and inverse Yeo‑Johnson to obtain a watermarked synthetic table.
- Detection
  - To verify a row, the same normalization and DFT steps are applied, the same coefficient indices are inspected, and the sign of the imaginary part is mapped back to the expected pseudorandom bit. A majority vote across rows yields the overall watermark presence.
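The embedding and detection steps above can be sketched for a single, already normalized row. Everything specific here is an assumption made for illustration: the fixed coefficient index `k`, the value of `eps`, and the rule of forcing the sign of the imaginary part (which can move a coefficient by more than a tiny epsilon when the sign has to flip). The paper's adaptive coefficient selection is not reproduced. A naive O(n²) DFT built on `cmath` keeps the sketch dependency-free; `numpy.fft` would be the usual choice.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def embed_bit(row, bit, k=1, eps=0.05):
    """Encode one bit as the sign of Im(X[k]) of the row's DFT (assumed rule)."""
    n = len(row)
    if not 0 < k < n / 2:
        raise ValueError("k must avoid the DC and Nyquist bins")
    X = dft(row)
    sign = 1.0 if bit else -1.0
    # Keep the magnitude when it is already large enough, otherwise lift it to eps.
    X[k] = complex(X[k].real, sign * max(abs(X[k].imag), eps))
    # Mirror the change so the spectrum stays conjugate-symmetric and the
    # inverse transform is real-valued.
    X[n - k] = X[k].conjugate()
    return [v.real for v in idft(X)]


def detect_bit(row, k=1):
    """Read the bit back from the sign of the imaginary part."""
    return 1 if dft(row)[k].imag > 0 else 0


def majority_vote(rows, expected_bits, k=1):
    """Declare the watermark present if more than half of the rows match."""
    matches = sum(detect_bit(r, k) == b for r, b in zip(rows, expected_bits))
    return matches > len(rows) / 2


# Hypothetical normalized row with six columns.
row = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5]
marked_one = embed_bit(row, 1)
marked_zero = embed_bit(row, 0)
```

Mirroring the edit onto coefficient `n - k` is the detail that is easy to miss: changing only one side of the spectrum would leave the inverse transform with a non-zero imaginary part.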
Results & Findings
| Dataset (5) | Watermark detection rate | Robustness (post‑edit attacks) | Data fidelity (RMSE vs. original) |
|---|---|---|---|
| Health‑Care | 99.2 % | > 95 % after rounding, noise (σ=0.01), and 10 % row drop | 0.018 |
| Finance | 98.7 % | 93 % after column scaling (±5 %) | 0.022 |
| Public‑Policy | 99.5 % | 96 % after categorical label shuffling | 0.015 |
- Detection rates on unmodified watermarked data stay above 98 % across the reported benchmarks; after the post‑edit attacks listed in the table, they remain at 93 % or higher.
- Fidelity loss is negligible; downstream ML models trained on watermarked data show < 0.5 % drop in predictive performance compared to models trained on unwatermarked synthetic data.
- Runtime: Embedding a 100 k‑row table takes ~0.8 seconds on a single CPU core, orders of magnitude faster than diffusion‑based watermarking (which can require minutes per batch).
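One way to read row-level agreement as a dataset-level decision is an exact binomial tail. This goes beyond the majority vote the summary describes and is an illustrative assumption, not the paper's stated test: it supposes that, without a watermark, each row matches its expected bit independently with probability 0.5. The function names, the threshold `alpha`, and the row counts are hypothetical.

```python
from math import comb


def match_tail_probability(n_rows, n_matches):
    """P(at least n_matches of n_rows agree) if each row matches by chance (p = 0.5)."""
    favourable = sum(comb(n_rows, j) for j in range(n_matches, n_rows + 1))
    return favourable / 2 ** n_rows


def watermark_present(n_rows, n_matches, alpha=1e-6):
    """Flag the table when chance agreement this strong is rarer than alpha."""
    return match_tail_probability(n_rows, n_matches) < alpha


# Hypothetical counts: 930 matching rows out of 1000 versus 510 out of 1000.
strong = watermark_present(1000, 930)
weak = watermark_present(1000, 510)
```

Under that assumption, a table does not need every row to match: a large enough surplus over 50 % is already very unlikely to occur by chance.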
Practical Implications
- Data marketplaces can embed invisible provenance tags without inflating storage or slowing down data generation pipelines, enabling automated royalty tracking and misuse detection.
- Regulatory compliance: Organizations in healthcare or finance can prove that a synthetic dataset originated from an approved generator, satisfying audit requirements for data lineage.
- Model‑as‑a‑service (MaaS) providers can offer “watermarked‑as‑a‑feature” APIs, giving customers a way to detect and attribute reuse of their synthetic data.
- Security tooling: The rank‑based PRNG means a verifier only needs the secret seed, not a per‑row key list, simplifying integration into CI pipelines that need to validate data integrity before deployment.
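The storage‑free property in the last bullet can be illustrated with a keyed hash: given only a secret seed and a row's rank, the verifier recomputes the expected bit instead of looking it up. This is a hypothetical stand-in, since the summary does not specify the paper's generator; `row_bit`, `expected_bits`, and the use of HMAC‑SHA256 are assumptions.

```python
import hashlib
import hmac


def row_bit(secret_seed, rank):
    """Derive one pseudorandom bit from the seed (bytes) and the row's rank (int)."""
    digest = hmac.new(secret_seed, str(rank).encode("ascii"), hashlib.sha256).digest()
    return digest[0] & 1  # lowest bit of the first digest byte


def expected_bits(secret_seed, n_rows):
    """Recompute the whole bitstream on the fly; nothing is stored per row."""
    return [row_bit(secret_seed, rank) for rank in range(n_rows)]


# Hypothetical seeds: the embedder and the verifier share the first one.
bits_embedder = expected_bits(b"shared-secret", 256)
bits_verifier = expected_bits(b"shared-secret", 256)
bits_other_seed = expected_bits(b"another-secret", 256)
```

A CI check would then compare `expected_bits(seed, n_rows)` against the bits read from the table, with the seed supplied as a pipeline secret.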
Limitations & Future Work
- Assumption of row order: The rank‑based PRNG relies on a stable sorting of rows; shuffling the rows can break detection unless the verifier re‑sorts them with the same sorting key before computing ranks.
- Limited to linear transformations: Highly non‑linear post‑processing (e.g., training a downstream GAN on the watermarked data) may dilute the signal; the authors suggest exploring multi‑frequency embedding to improve resilience.
- Scalability to ultra‑high‑dimensional tables: While runtime is linear, the DFT on very wide tables (> 10 k columns) could become a bottleneck; future work may investigate block‑wise or wavelet‑based alternatives.
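The row‑order limitation can be made concrete with a small sketch in which ranks are recomputed from a sort key rather than taken from file position. The choice of the first column as the key and the helper name `ranks_by_key` are assumptions for illustration; the summary does not say which key the paper uses.

```python
import random


def ranks_by_key(rows, key_column=0):
    """Return, for each row position, the row's rank when sorted by one column."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][key_column])
    ranks = [0] * len(rows)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


# Hypothetical table; each row keeps its rank after shuffling because the
# rank is derived from the key column, not from where the row sits in the file.
table = [[5.1, 0.2], [1.3, 0.9], [3.7, 0.4], [2.2, 0.8]]
original = {tuple(r): rk for r, rk in zip(table, ranks_by_key(table))}

shuffled = table[:]
random.Random(0).shuffle(shuffled)
recovered = {tuple(r): rk for r, rk in zip(shuffled, ranks_by_key(shuffled))}
```

In this naive version, deleting a row or editing the key column still shifts the ranks of other rows, so it only addresses pure shuffling.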
Overall, TAB‑DRW offers a pragmatic, low‑overhead path for developers to protect synthetic tabular assets, bridging a gap between academic watermarking research and real‑world data‑centric product pipelines.
Authors
- Yizhou Zhao
- Xiang Li
- Peter Song
- Qi Long
- Weijie Su
Paper Information
- arXiv ID: 2511.21600v1
- Categories: cs.CR, cs.LG
- Published: November 26, 2025