95% of PII Redaction Doesn't Need an LLM. The Other 5% Is Where Your Masker Leaks.

Published: (April 21, 2026 at 07:43 AM EDT)
4 min read
Source: Dev.to

A VP at an SAP shop told me recently: “Every time we copy production to our lower environments, PII leaks. And no, we’re not throwing an LLM at it. That’s a thousand times the compute of what we already run.”

He’s right.

Most of the PII redaction problem in enterprise data isn’t a neural‑network problem. It’s a lookup‑table problem. The incumbents already solve it: SAP TDMS, Delphix, Informatica, IBM InfoSphere Optim. All are schema‑aware, row‑level, and deterministic.

The 95% Where Deterministic Wins

In a SAP production database, the schema tells you almost everything.

  • KNA1‑NAME1 → customer name
  • BSEG‑IBAN → bank account
  • USR02‑BNAME → user ID

A YAML rule says: “for this column type, replace with this pattern.” Done.
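
A minimal sketch of what such a rule table might look like once loaded, written as a Python dict rather than YAML so it runs as-is. The action names (`fixed`, `keep_prefix`) and the sample values are assumptions for illustration, not the syntax of TDMS or any other vendor.

```python
# Hypothetical rule table, keyed by TABLE-FIELD. In practice this would be
# parsed from the YAML file; the action names are made up for this sketch.
RULES = {
    "KNA1-NAME1": {"action": "fixed", "value": "TEST CUSTOMER"},
    "BSEG-IBAN": {"action": "keep_prefix", "keep": 4, "fill": "0"},
    "USR02-BNAME": {"action": "fixed", "value": "TESTUSER"},
}


def mask_value(column: str, value: str) -> str:
    """Apply the rule for `column`; columns without a rule pass through."""
    rule = RULES.get(column)
    if rule is None:
        return value
    if rule["action"] == "fixed":
        return rule["value"]
    if rule["action"] == "keep_prefix":
        keep = rule["keep"]
        # Keep the first `keep` characters, overwrite the rest, preserve length.
        return value[:keep] + rule["fill"] * max(len(value) - keep, 0)
    raise ValueError(f"unknown action: {rule['action']}")


# Illustrative cells only; the values are invented.
cells = {
    "KNA1-NAME1": "Anna Müller",
    "BSEG-IBAN": "DE89370400440532013000",
    "KNA1-LAND1": "DE",
}
masked = {column: mask_value(column, value) for column, value in cells.items()}
print(masked)
```

A fixed replacement collapses every customer into one value, which is fine for a smoke test and wrong for joins. Keeping distinct people distinct needs a keyed mapping rather than a constant.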

The math is brutal. A regex plus a lookup table costs microseconds per row. A 1.5 B‑parameter model costs 10–50 ms per row, even on a GPU – three to five orders of magnitude slower. A nightly batch copy that finishes by morning with TDMS would take weeks with an LLM in the loop.
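
To put numbers on that, here is a back-of-the-envelope calculation. The row count (100 million) and the 5 µs deterministic cost are assumptions picked for illustration; only the 10–50 ms range comes from the text above. It also assumes a single worker with no batching.

```python
import math

ROWS = 100_000_000           # hypothetical size of the copy
DETERMINISTIC_S = 5e-6       # assumed: 5 microseconds per row
LLM_S = (10e-3, 50e-3)       # 10-50 ms per row, the range quoted above
SECONDS_PER_DAY = 86_400


def wall_clock_days(rows: int, seconds_per_row: float, workers: int = 1) -> float:
    """Total runtime in days if rows are processed one at a time per worker."""
    return rows * seconds_per_row / workers / SECONDS_PER_DAY


det_minutes = ROWS * DETERMINISTIC_S / 60
llm_days = [wall_clock_days(ROWS, s) for s in LLM_S]
# Slowdown expressed in orders of magnitude (log10 of the per-row cost ratio).
slowdown_orders = [math.log10(s / DETERMINISTIC_S) for s in LLM_S]

print(f"deterministic: {det_minutes:.1f} min")
print(f"LLM: {llm_days[0]:.1f} to {llm_days[1]:.1f} days")
print(f"slowdown: 10^{slowdown_orders[0]:.1f} to 10^{slowdown_orders[1]:.1f}")
```

Under those assumptions the deterministic pass finishes in about eight minutes, while the model pass needs roughly 12 to 58 days on one worker. Adding GPUs divides the second number; it does not close the gap.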

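Referential integrity is the real argument. “Anna Müller” must become “Person_47” consistently across 200 tables (KNA1, VBAK, VBKD, BSEG, …). Deterministic pseudonymisation with an HMAC keyed by a scoped secret (the “salt”) gives you that for free: the same input always yields the same token, with no cross‑table coordination. Neural outputs drift: nothing guarantees the model masks the same name the same way twice.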
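
A minimal sketch of that idea using Python's standard `hmac` module. The function name, the normalisation step, and the 12-character truncation are my assumptions; the token here is a truncated digest, not a readable counter like “Person_47”, which would need a lookup table on top.

```python
import hashlib
import hmac


def pseudonym(value: str, key: bytes, prefix: str = "Person") -> str:
    """Derive a stable token from `value`; same value and key, same token."""
    # Normalise whitespace and case so "Anna  Müller" and "anna müller" agree.
    normalised = " ".join(value.split()).casefold()
    digest = hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps tokens short; it also raises collision odds slightly.
    return f"{prefix}_{digest[:12]}"


tenant_key = b"example-key-do-not-use"   # hypothetical; load from a secret store

in_kna1 = pseudonym("Anna Müller", tenant_key)
in_vbak = pseudonym("anna  müller", tenant_key)
other_tenant = pseudonym("Anna Müller", b"another-tenant-key")
print(in_kna1, in_vbak, other_tenant)
```

Because the token depends only on the value and the key, every table can be masked independently and the joins still line up. Scoping the key per tenant means tokens from one copy cannot be matched against another.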

Auditability matters too. A regulator may ask: “show me the rule that masked this column.” A YAML rule is defensible; a model output is not.

So for any SAP field with a known schema type, deterministic masking wins. Full stop. Don’t let anyone sell you a neural‑network‑powered “modernisation” of that layer.

Where a Fine‑Tuned Model Earns Its Compute

Here’s what deterministic tools silently miss.

  • Free‑text columns – e.g., BSEG‑SGTXT, where someone typed “Ansprechpartner Anna Müller, Tel +49‑170‑…”. The same goes for ticket descriptions from ServiceNow, email bodies stored as CLOBs, and ADRC annotations. The column type is “text”, but the content is a gold mine of PII.

  • Unstructured attachments – PDFs, scanned invoices, OCR’d contracts pulled into dev via ArchiveLink. Names and IBANs appear mid‑prose, not in a column.

  • Schema drift – consultants add Z‑tables that the data steward hasn’t classified yet. Deterministic tools don’t know the column holds PII and either wipe the whole column (destroying test fidelity) or miss the PII entirely (causing compliance incidents).

A German‑specialised redactor earns its keep here because the alternative isn’t “faster regex”; it’s “no coverage at all.”

The Hybrid Architecture

This is the part that actually ships.

  1. Classifier pass on the SAP copy. Cheap heuristics (column‑name keywords, column type, sample‑value regex) flag each column as structured_pii, free_text, or safe.
  2. Deterministic masker handles structured_pii (TDMS or whatever you already run).
  3. Fine‑tuned LLM redactor runs only on free_text, attachments, and unclassified Z‑columns.
  4. Consistency bridge – both paths share a pseudonym table keyed by HMAC(value, tenant_salt). “Anna Müller” becomes “Person_47” whether caught by regex or by the model.
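
The four steps above can be sketched roughly as follows. Everything here is illustrative: the keyword lists, the type labels, the `PseudonymTable` class, and above all `stub_detector`, which stands in for the fine‑tuned model and only finds one hard‑coded name.

```python
import hashlib
import hmac
import re
from typing import Callable

# Hypothetical heuristics; a real classifier would use a reviewed keyword list.
FREE_TEXT_NAME_HINTS = ("TXT", "TEXT", "DESCR")
FREE_TEXT_TYPES = {"STRING", "CLOB"}
PII_NAME_HINTS = ("NAME", "IBAN", "TEL", "MAIL")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

SpanDetector = Callable[[str], list[tuple[int, int]]]


def classify_column(name: str, col_type: str, samples: list[str]) -> str:
    """Step 1: flag a column as 'free_text', 'structured_pii', or 'safe'."""
    upper = name.upper()
    if col_type.upper() in FREE_TEXT_TYPES or any(h in upper for h in FREE_TEXT_NAME_HINTS):
        return "free_text"
    if any(h in upper for h in PII_NAME_HINTS):
        return "structured_pii"
    if any(IBAN_RE.search(sample) for sample in samples):
        return "structured_pii"
    return "safe"


class PseudonymTable:
    """Step 4: one table shared by both paths, keyed by HMAC(value, tenant_salt)."""

    def __init__(self, tenant_salt: bytes) -> None:
        self._salt = tenant_salt
        self._ids: dict[str, int] = {}

    def lookup(self, value: str) -> str:
        normalised = value.strip().casefold().encode("utf-8")
        key = hmac.new(self._salt, normalised, hashlib.sha256).hexdigest()
        # First sighting gets the next number; later sightings reuse it.
        number = self._ids.setdefault(key, len(self._ids) + 1)
        return f"Person_{number}"


def mask_cell(value: str, category: str, table: PseudonymTable,
              detect_spans: SpanDetector) -> str:
    """Steps 2 and 3: route a cell to the deterministic or the model path."""
    if category == "safe":
        return value
    if category == "structured_pii":
        return table.lookup(value)
    # free_text: replace detected spans right to left so offsets stay valid.
    for start, end in sorted(detect_spans(value), reverse=True):
        value = value[:start] + table.lookup(value[start:end]) + value[end:]
    return value


def stub_detector(text: str) -> list[tuple[int, int]]:
    """Placeholder for the model: finds a single hard-coded name."""
    name = "Anna Müller"
    start = text.find(name)
    return [(start, start + len(name))] if start >= 0 else []


table = PseudonymTable(b"example-tenant-salt")
structured = mask_cell("Anna Müller", classify_column("NAME1", "CHAR", []),
                       table, stub_detector)
free_text = mask_cell("Ansprechpartner Anna Müller, Tel",
                      classify_column("SGTXT", "CHAR", []), table, stub_detector)
print(structured, "|", free_text)
```

The point of the shared table is that the name gets the same pseudonym on both paths, so a join between a masked name column and a redacted note still refers to the same person. The in‑memory dict is a simplification; a real bridge would need persistent storage that both jobs can reach.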

Compute budget: the LLM runs on maybe 1–5 % of the cells. Total cost is still dominated by the deterministic layer. You’re not replacing TDMS; you’re covering its blind spots.

What I Won’t Claim

Three things I won’t sell you:

  • The LLM is cheaper than a regex. It isn’t. Ever.
  • It replaces your incumbent masking vendor. It doesn’t.
  • A benchmark against TDMS on structured columns is meaningful. You lose that benchmark. Benchmark on free‑text and attachments, where deterministic tools score near zero.

The honest pitch to the VP was this: “You’re right. For the 95 % structured case, keep TDMS. The model is the long‑tail layer. It runs only over the free‑text fields and attachments your current tools silently leak. Small job. Different problem.”

That’s the conversation that lands. Not “replace your stack.” Not “AI‑powered everything.”

Regex for the schema. LLM for the shadows.

If you want the one‑page checklist I use to classify SAP columns into structured_pii / free_text / safe before a lower‑env copy, grab it below. Then ask yourself: where does your current masker leak?
