Most PDF Redaction Is Broken. Here's What 'Real' Redaction Actually Requires.
Source: Dev.to
The Problem with Fake Redaction
Drawing a black rectangle over text does not redact it. The underlying text remains in the PDF’s content stream, so anyone can select all, copy, and paste into a plain‑text editor to reveal the hidden information.
This mistake has led to classified documents being leaked from government agencies multiple times.
Fake redaction example
Page content stream: "Salary: $120,000" ← still here
Annotation layer: [black rectangle] ← just covering it
The content stream is untouched, and any PDF parser can read the original data.
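To make that concrete, here is a minimal sketch of how little a reader needs in order to recover the text. It assumes an uncompressed content stream and a toy scanner that only understands simple `( ... ) Tj` strings; the function name `extract_shown_text` and the sample stream are made up for illustration.

```rust
/// Collects every literal string `( ... )` that is shown with the `Tj` operator.
/// Toy scanner: assumes an uncompressed stream with no escaped or nested parentheses.
fn extract_shown_text(content: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = content;
    while let Some(open) = rest.find('(') {
        let after = &rest[open + 1..];
        let Some(close) = after.find(')') else { break };
        // Only keep strings that are immediately followed by the Tj operator.
        if after[close + 1..].trim_start().starts_with("Tj") {
            out.push(after[..close].to_string());
        }
        rest = &after[close + 1..];
    }
    out
}

/// Hypothetical page content of a "fake redacted" PDF. The black rectangle is a
/// separate annotation object, so it does not appear in this stream at all.
const FAKE_REDACTED: &str = "BT /F1 12 Tf 72 700 Td (Salary: $120,000) Tj ET";
```

Calling `extract_shown_text(FAKE_REDACTED)` returns the salary line, because the covering rectangle never touched the stream.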
What Real Redaction Requires
- Find the target text in the content stream.
- Remove it from the stream entirely.
- Replace the removed region with a filled black rectangle drawn directly into the content.
- Re‑serialize the page so that no original data survives.
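Step 1 is harder than it looks, because a producer may split one visible word across several strings inside a `TJ` array to apply kerning. The sketch below is a toy scanner under strong assumptions (no escapes, no hex strings, no font encodings), and `join_tj_arrays` and `KERNED` are hypothetical names, but it shows why a plain substring search on the stream can miss the target.

```rust
/// Joins the string pieces of every `[ ... ] TJ` array in a content stream.
/// Toy scanner: no escapes, no nested brackets, no hex strings, no font encoding.
fn join_tj_arrays(content: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = content;
    while let Some(open) = rest.find('[') {
        let after = &rest[open + 1..];
        let Some(close) = after.find(']') else { break };
        let (array, tail) = (&after[..close], &after[close + 1..]);
        if tail.trim_start().starts_with("TJ") {
            // Keep characters inside ( ... ); skip the kerning numbers between them.
            let mut text = String::new();
            let mut in_string = false;
            for ch in array.chars() {
                match ch {
                    '(' if !in_string => in_string = true,
                    ')' if in_string => in_string = false,
                    c if in_string => text.push(c),
                    _ => {}
                }
            }
            out.push(text);
        }
        rest = tail;
    }
    out
}

/// Hypothetical stream where "Salary" is split by a kerning adjustment.
const KERNED: &str = "BT /F1 12 Tf 72 700 Td [(Sal) -20 (ary: $120,000)] TJ ET";
```

Here `KERNED.contains("Salary")` is false even though the page visibly shows the word, while `join_tj_arrays(KERNED)` recovers the full line.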
Example Implementation (Rust)
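use lopdf::{Document, ObjectId};

/// Area to paint over, in PDF user-space units.
pub struct RedactArea {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

pub fn redact_text(
    doc: &mut Document,
    page_id: ObjectId,
    target: &str,
    area: &RedactArea,
) -> lopdf::Result<()> {
    // The page object is a dictionary; its operators live in the /Contents stream(s).
    // Check get_page_content / change_page_content against your lopdf version.
    let content: Vec<u8> = doc.get_page_content(page_id)?;
    // Remove text operators containing target.
    // remove_text_from_content is a helper you have to write yourself (not part of lopdf).
    let mut cleaned: Vec<u8> = remove_text_from_content(&content, target);
    // Paint a black filled rectangle into the content itself.
    // The caller computes `area` from the position of the removed text.
    let redact_op = format!(
        "\nq 0 0 0 rg {} {} {} {} re f Q\n",
        area.x, area.y, area.width, area.height
    );
    cleaned.extend_from_slice(redact_op.as_bytes());
    doc.change_page_content(page_id, cleaned)
}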
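A cheap sanity check after redaction is to scan the decoded page content for the target again. The sketch below is an assumption-laden illustration, not a proof of safety: `leaks_target` is a hypothetical helper that only looks for the literal bytes and the PDF hex-string form, so a clean result still says nothing about kerned strings, custom font encodings, other objects in the file, or old object versions left behind by an incremental save.

```rust
/// Returns true if `needle` appears in `haystack` as raw bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Checks decoded page content for the target, either as a literal string
/// or as a PDF hex string such as <53616C617279>.
fn leaks_target(decoded_content: &[u8], target: &str) -> bool {
    let hex_upper: String = target.bytes().map(|b| format!("{:02X}", b)).collect();
    let hex_lower = hex_upper.to_lowercase();
    contains_bytes(decoded_content, target.as_bytes())
        || contains_bytes(decoded_content, hex_upper.as_bytes())
        || contains_bytes(decoded_content, hex_lower.as_bytes())
}
```

Treat a `true` result as a hard failure and a `false` result as "nothing obvious found".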
Understanding PDF Content Streams
PDF content streams do not store text with explicit coordinates in a simple format. Text positioning depends on:
- The current transformation matrix (CTM)
- The text matrix (Tm)
- Font metrics
All of these are stateful, so correctly parsing and modifying a stream requires a full content‑stream interpreter, not a regular‑expression search over raw bytes. The lopdf crate provides raw streams; interpreting them is left to the developer.
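As a small worked example of why the numbers in the stream are not page coordinates, the sketch below concatenates a text matrix with a CTM using the row-vector convention of the PDF specification. The type alias and function names are made up for illustration, and font size, horizontal scaling, and text rise are ignored.

```rust
/// PDF affine matrix [a b c d e f], row-vector convention: [x y 1] x M.
type Matrix = [f64; 6];

/// Returns `m1 x m2`, i.e. apply `m1` first, then `m2`.
fn concat(m1: Matrix, m2: Matrix) -> Matrix {
    [
        m1[0] * m2[0] + m1[1] * m2[2],
        m1[0] * m2[1] + m1[1] * m2[3],
        m1[2] * m2[0] + m1[3] * m2[2],
        m1[2] * m2[1] + m1[3] * m2[3],
        m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
        m1[4] * m2[1] + m1[5] * m2[3] + m2[5],
    ]
}

/// Transforms a point by a matrix.
fn apply(m: Matrix, (x, y): (f64, f64)) -> (f64, f64) {
    (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])
}

/// Where the text origin lands on the page: text space -> Tm -> CTM.
fn text_origin_on_page(tm: Matrix, ctm: Matrix) -> (f64, f64) {
    apply(concat(tm, ctm), (0.0, 0.0))
}
```

With `Tm = [1, 0, 0, 1, 72, 700]` and a CTM of `[2, 0, 0, 2, 10, 20]` (scale by 2, then translate), the text origin lands at (154, 1420), not at the (72, 700) that appears in the stream. A redaction rectangle placed from the raw operands would miss the text.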
Detecting PII Before Redaction
A typical workflow runs a pattern‑matching pass to locate personally identifiable information (PII) before any redaction occurs. Below is a Rust example that detects phone numbers and Japanese MyNumber identifiers:
use regex::Regex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PiiType {
    Phone,
    MyNumber,
}

/// Returns (start, end, kind) for each match, as byte offsets into `text`.
pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
    let mut findings = Vec::new();
    // Phone numbers (e.g., 123-456-7890)
    let phone_re = Regex::new(r"\d{2,4}-\d{2,4}-\d{4}").unwrap();
    for m in phone_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::Phone));
    }
    // Japanese MyNumber (12 digits)
    let mynumber_re = Regex::new(r"\b\d{12}\b").unwrap();
    for m in mynumber_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::MyNumber));
    }
    findings
}
The detections are reviewed manually before committing the redaction, because fully automated redaction without review introduces its own risks.
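One way to enforce that review step in code is to refuse to produce redaction spans while any finding is still undecided. This is only a sketch of the idea; the `Decision` and `Finding` types and `spans_to_redact` are hypothetical and not taken from any particular tool.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Pending,
    Approved,
    Rejected,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Finding {
    start: usize,
    end: usize,
    decision: Decision,
}

/// Returns the approved spans, or an error if any finding is still unreviewed.
fn spans_to_redact(findings: &[Finding]) -> Result<Vec<(usize, usize)>, String> {
    let pending = findings
        .iter()
        .filter(|f| f.decision == Decision::Pending)
        .count();
    if pending > 0 {
        return Err(format!("{pending} finding(s) still need review"));
    }
    Ok(findings
        .iter()
        .filter(|f| f.decision == Decision::Approved)
        .map(|f| (f.start, f.end))
        .collect())
}
```

Rejected findings (false positives such as an order number that happens to have 12 digits) are simply dropped, and nothing is redacted until every finding has a decision.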
Resources
- Hiyoko PDF Vault – a tool for secure PDF handling (author: @hiyoyok)