[Paper] TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data

Published: November 26, 2025 at 12:16 PM EST

Source: arXiv - 2511.21600v1

Overview

The paper introduces TAB‑DRW, a lightweight watermarking technique for synthetic tabular data generated by AI models. The method embeds a hidden signal in the frequency domain of the data, so provenance can still be verified after the data has been edited or transformed. This capability is increasingly important for industries that share or sell synthetic datasets.

Key Contributions

Frequency‑domain watermarking: Uses the discrete Fourier transform (DFT) on normalized tabular rows, tweaking the imaginary components to encode a pseudorandom bitstream.
Mixed‑type support: Handles continuous, ordinal, and categorical columns in a single pipeline via Yeo‑Johnson transformation and standardization.
Row‑wise, storage‑free retrieval: Introduces a rank‑based pseudorandom bit generator that lets a verifier reconstruct the watermark for any row on‑the‑fly, eliminating the need to store extra metadata.
Robustness to post‑processing: Demonstrates resilience against common attacks such as rounding, scaling, noise injection, and even partial row deletion.
Efficiency: The entire embedding and detection process runs in linear time with respect to the number of rows, avoiding the heavy compute cost of diffusion‑model‑based watermarks.
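
To make the mixed‑type bullet above more concrete, here is a minimal sketch of the Yeo‑Johnson transform, its inverse, and column standardization, using only the Python standard library. The function names and the fixed lambda value are assumptions made for illustration; the summary does not say how the paper selects lambda, and libraries such as SciPy usually estimate it per column by maximum likelihood.

```python
import math
import statistics

def yeo_johnson(y: float, lam: float) -> float:
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    return -math.log1p(-y) if lam == 2 else -((1 - y) ** (2 - lam) - 1) / (2 - lam)

def yeo_johnson_inverse(z: float, lam: float) -> float:
    """Inverse of yeo_johnson for the same lambda."""
    if z >= 0:
        return math.expm1(z) if lam == 0 else (lam * z + 1) ** (1 / lam) - 1
    return -math.expm1(-z) if lam == 2 else 1 - (1 - (2 - lam) * z) ** (1 / (2 - lam))

def standardize(col):
    """Return the column scaled to zero mean and unit variance, plus (mean, std)."""
    mu = statistics.fmean(col)
    sigma = statistics.pstdev(col) or 1.0  # guard against constant columns
    return [(v - mu) / sigma for v in col], mu, sigma

# Usage: lam = 0.5 is an arbitrary illustrative choice, not a value from the paper.
column = [-2.0, -0.5, 0.0, 1.5, 4.0]
transformed = [yeo_johnson(v, 0.5) for v in column]
normalized, mu, sigma = standardize(transformed)
restored = [yeo_johnson_inverse(v * sigma + mu, 0.5) for v in normalized]
```

The returned mean and standard deviation are kept so that the de‑standardization and inverse transform can be applied after the watermark is embedded. For some lambda values the inverse is only defined on a bounded range, so a real pipeline would need to handle perturbed values that fall outside it.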

Methodology

Pre‑processing
- Each column is transformed with the Yeo‑Johnson power transform (works for both positive and negative values) and then standardized (zero mean, unit variance).
Frequency conversion
- The normalized row vector is fed to a 1‑D DFT, producing complex coefficients (real + imaginary parts).
Bit embedding
- A rank‑based PRNG generates a deterministic pseudorandom bit for each row based on its sorted position in the dataset.
- Selected DFT coefficients (chosen adaptively to avoid low‑energy components) have their imaginary parts nudged up or down by a tiny epsilon, encoding the bit while keeping the overall row distribution unchanged.
Inverse transform
- An inverse DFT brings the data back to the original space, followed by de‑standardization and inverse Yeo‑Johnson to obtain a watermarked synthetic table.
Detection
- To verify a row, the same normalization and DFT steps are applied, the same coefficient indices are inspected, and the sign of the imaginary part is compared with the expected pseudorandom bit. A majority vote across rows then indicates whether the watermark is present.
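
The embedding and detection steps above can be sketched as follows, on rows that are assumed to be already normalized. This is an illustration under several assumptions, not the authors' implementation: the HMAC‑based row_bit function stands in for the rank‑based PRNG, a single fixed coefficient index k replaces the adaptive selection, and the sign of the imaginary part is forced to a minimum magnitude eps so that it can be decoded. A naive DFT is written out to keep the example dependency‑free.

```python
import cmath
import hashlib
import hmac

def row_bit(secret: bytes, rank: int) -> int:
    """Keyed pseudorandom bit for the row at sorted position `rank` (hypothetical scheme)."""
    digest = hmac.new(secret, str(rank).encode(), hashlib.sha256).digest()
    return digest[0] & 1

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def embed_bit(row, bit, k=1, eps=0.05):
    """Set the sign of Im(X[k]) to encode `bit` (1 -> positive, 0 -> negative)."""
    n = len(row)
    if not 0 < k < n / 2:
        # The DC and Nyquist coefficients of a real row have no imaginary part.
        raise ValueError("k must satisfy 0 < k < n/2")
    X = dft(row)
    sign = 1.0 if bit else -1.0
    X[k] = complex(X[k].real, sign * max(abs(X[k].imag), eps))
    X[n - k] = X[k].conjugate()  # keep conjugate symmetry so the inverse stays real
    return [v.real for v in idft(X)]

def detect_bit(row, k=1):
    """Read the bit back from the sign of Im(X[k])."""
    return 1 if dft(row)[k].imag > 0 else 0

def embed_table(rows, secret, k=1):
    """Watermark every row, using its position in `rows` as the rank."""
    return [embed_bit(r, row_bit(secret, i), k) for i, r in enumerate(rows)]

def match_rate(rows, secret, k=1):
    """Fraction of rows whose decoded bit equals the expected keyed bit."""
    hits = sum(detect_bit(r, k) == row_bit(secret, i) for i, r in enumerate(rows))
    return hits / len(rows)

# Usage with made-up normalized rows and a made-up key.
table = [[0.3, -1.2, 0.8, 0.1, -0.5],
         [1.0, 0.2, -0.7, 0.4, 0.9],
         [-0.6, 0.5, 0.0, -1.1, 0.3]]
watermarked = embed_table(table, b"demo-key")
rate = match_rate(watermarked, b"demo-key")
```

On unwatermarked data the match rate would be expected to hover around one half, so a majority vote amounts to checking that the rate sits clearly above that level. In practice a library FFT such as numpy.fft.fft would replace the naive transform.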

Results & Findings

Dataset (5)	Watermark detection rate	Robustness (post‑edit attacks)	Data fidelity (RMSE vs. original)
Health‑Care	99.2 %	> 95 % after rounding, noise (σ=0.01), and 10 % row drop	0.018
Finance	98.7 %	93 % after column scaling (±5 %)	0.022
Public‑Policy	99.5 %	96 % after categorical label shuffling	0.015

Detectability stays above 98 % across all benchmarks, even when the synthetic data undergoes aggressive sanitization.
Fidelity loss is negligible; downstream ML models trained on watermarked data show < 0.5 % drop in predictive performance compared to models trained on unwatermarked synthetic data.
Runtime: Embedding a 100 k‑row table takes ~0.8 seconds on a single CPU core, orders of magnitude faster than diffusion‑based watermarking (which can require minutes per batch).

Practical Implications

Data marketplaces can embed invisible provenance tags without inflating storage or slowing down data generation pipelines, enabling automated royalty tracking and misuse detection.
Regulatory compliance: Organizations in healthcare or finance can prove that a synthetic dataset originated from an approved generator, satisfying audit requirements for data lineage.
Model‑as‑a‑service (MaaS) providers can offer “watermarked‑as‑a‑feature” APIs, giving customers confidence that their synthetic data cannot be repurposed without attribution.
Security tooling: The rank‑based PRNG means a verifier only needs the secret seed, not a per‑row key list, simplifying integration into CI pipelines that need to validate data integrity before deployment.

Limitations & Future Work

Assumption of row order: The rank‑based PRNG relies on a stable sorting of rows; shuffling the dataset without preserving order can break detection unless the same sorting key is reapplied.
Limited to linear transformations: Highly non‑linear post‑processing (e.g., training a downstream GAN on the watermarked data) may dilute the signal; the authors suggest exploring multi‑frequency embedding to improve resilience.
Scalability to ultra‑high‑dimensional tables: While runtime is linear, the DFT on very wide tables (> 10 k columns) could become a bottleneck; future work may investigate block‑wise or wavelet‑based alternatives.
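
The row‑order limitation can be illustrated with a small sketch. It assumes, hypothetically, that ranks are derived by sorting on a single key column whose values are unique and unchanged by the watermark or by later edits; the summary does not specify the sorting key the paper uses. Under that assumption, a verifier who reapplies the same sort recovers the same rank for each row after a shuffle.

```python
import random

def ranks_by_key(rows, key_col=0):
    """Map each row index to its position after sorting by one column."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][key_col])
    rank = [0] * len(rows)
    for position, i in enumerate(order):
        rank[i] = position
    return rank

# Made-up rows: the first column plays the role of the sorting key.
rows = [[3.2, "a"], [1.1, "b"], [2.7, "c"], [0.4, "d"]]
original = {tuple(r): k for r, k in zip(rows, ranks_by_key(rows))}

shuffled = rows[:]
random.Random(0).shuffle(shuffled)
recovered = {tuple(r): k for r, k in zip(shuffled, ranks_by_key(shuffled))}
```

Here original and recovered assign the same rank to every row. The sketch does not model row deletion or edits to the key column, both of which would shift ranks in this simplified scheme.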

Overall, TAB‑DRW offers a pragmatic, low‑overhead path for developers to protect synthetic tabular assets, bridging a gap between academic watermarking research and real‑world data‑centric product pipelines.

Authors

Yizhou Zhao
Xiang Li
Peter Song
Qi Long
Weijie Su

Paper Information

arXiv ID: 2511.21600v1
Categories: cs.CR, cs.LG
Published: November 26, 2025
