Most PDF Redaction Is Broken. Here's What 'Real' Redaction Actually Requires.

Published: (April 25, 2026 at 10:05 AM EDT)
Source: Dev.to

The Problem with Fake Redaction

Drawing a black rectangle over text does not redact it. The underlying text remains in the PDF’s content stream, so anyone can recover it:

  1. Select all, copy, and paste into a plain‑text editor.
  2. Read the supposedly hidden text in the pasted output.

This exact mistake has caused government agencies to leak classified material more than once.

Fake redaction example

Page content stream: "Salary: $120,000"   ← still here
Annotation layer:    [black rectangle]   ← just covering it

The content stream is untouched, and any PDF parser can read the original data.
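
To make that concrete, here is a minimal sketch of how little a "parser" needs to do to get the text back. It assumes a toy, uncompressed content stream with plain `( ... ) Tj` strings (no escapes, no nesting); `extract_shown_text` is a hypothetical name, not part of any library.

```rust
/// Collects every literal string that is shown with the `Tj` operator.
/// Toy extractor: assumes unescaped, non-nested `( ... )` strings.
fn extract_shown_text(content: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = content;
    while let Some(open) = rest.find('(') {
        let after_open = &rest[open + 1..];
        let Some(close) = after_open.find(')') else { break };
        let text = &after_open[..close];
        // The operator that follows the string decides what happens to it.
        let tail = after_open[close + 1..].trim_start();
        if tail.starts_with("Tj") {
            found.push(text.to_string());
        }
        rest = &after_open[close + 1..];
    }
    found
}
```

Given a page whose stream is `BT /F1 12 Tf 72 700 Td (Salary: $120,000) Tj ET` followed by `q 0 0 0 rg 70 695 140 16 re f Q`, the function still returns `Salary: $120,000`. The rectangle changes what is painted, not what is stored.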

What Real Redaction Requires

  1. Find the target text in the content stream.
  2. Remove it from the stream entirely.
  3. Replace the removed region with a filled black rectangle drawn directly into the content.
  4. Re‑serialize the page so that no original data survives.

Example Implementation (Rust)

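use lopdf::{Document, ObjectId};

// `rect` is [x, y, width, height] in page units. The original snippet left
// these undefined, so passing them in is an assumption of this version.
pub fn redact_text(
    doc: &mut Document,
    page_id: ObjectId,
    target: &str,
    rect: [f32; 4],
) -> lopdf::Result<()> {
    // A page object is a dictionary, not a stream; its operators live in the
    // stream(s) under /Contents, which get_page_content returns as decoded bytes.
    let content = doc.get_page_content(page_id)?;

    // Remove text operators containing target (helper not shown in the post;
    // assumed here to map decoded bytes to cleaned bytes)
    let mut cleaned: Vec<u8> = remove_text_from_content(&content, target);

    // Replace with black filled rectangle at same position
    let [x, y, width, height] = rect;
    let redact_op = format!("\nq 0 0 0 rg {x} {y} {width} {height} re f Q\n");
    cleaned.extend_from_slice(redact_op.as_bytes());

    // Write the cleaned operators back as the page's content
    doc.change_page_content(page_id, cleaned)
}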
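
The `remove_text_from_content` helper is called above but never defined in the post. One possible shape, as a rough sketch only: it assumes decoded bytes as input, handles only literal strings shown with `Tj`, and ignores `TJ` arrays, hex strings, font encodings, and text split across several operators. The helper `literal_string_end` is a hypothetical name introduced here.

```rust
/// Returns the index just past the literal string that starts at `open`
/// (which must point at `(`), honouring `\` escapes and nested parentheses.
fn literal_string_end(bytes: &[u8], open: usize) -> Option<usize> {
    let mut depth = 0usize;
    let mut i = open;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => i += 1, // skip the escaped byte
            b'(' => depth += 1,
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(i + 1);
                }
            }
            _ => {}
        }
        i += 1;
    }
    None // unterminated string
}

/// Drops every `(...) Tj` pair whose string contains `target`;
/// all other bytes are copied through unchanged.
pub fn remove_text_from_content(content: &[u8], target: &str) -> Vec<u8> {
    let needle = target.as_bytes();
    let mut out = Vec::with_capacity(content.len());
    let mut i = 0;
    while i < content.len() {
        if content[i] == b'(' {
            if let Some(end) = literal_string_end(content, i) {
                let text = &content[i + 1..end - 1];
                // Look past whitespace for the operator that consumes the string.
                let mut op = end;
                while op < content.len() && content[op].is_ascii_whitespace() {
                    op += 1;
                }
                let shown = content[op..].starts_with(b"Tj");
                let hit = !needle.is_empty()
                    && text.windows(needle.len()).any(|w| w == needle);
                if shown && hit {
                    i = op + 2; // skip the string and its `Tj`
                    continue;
                }
                out.extend_from_slice(&content[i..end]); // keep the string as-is
                i = end;
                continue;
            }
        }
        out.push(content[i]);
        i += 1;
    }
    out
}
```

Even this much only works on the easy cases. A production version would operate on parsed operations rather than bytes, for the reasons covered in the next section.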

Understanding PDF Content Streams

PDF content streams do not store text with explicit coordinates in a simple format. Text positioning depends on:

  • The current transformation matrix (CTM)
  • The text matrix (Tm)
  • Font metrics

All of these are stateful, so correctly parsing and modifying a stream requires a full content‑stream interpreter, not a regular‑expression search over raw bytes. The lopdf crate can decode a stream into a list of operators and operands, but tracking the graphics and text state across them is left to the developer.
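
A small sketch of what "stateful" means in practice. It assumes the stream has already been parsed into operations and tracks only the CTM and the text matrix; font size, text rise, and the glyph advance after each `Tj` (which needs font metrics) are left out. `Op`, `concat`, and `text_origins` are illustrative names, not a library API.

```rust
/// A PDF matrix [a b c d e f], applied to row vectors: [x y 1] × M.
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns `m × n`: apply `m` first, then `n`.
fn concat(m: Matrix, n: Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// The handful of operators this sketch understands.
#[derive(Clone, Copy)]
enum Op<'a> {
    Save,                  // q
    Restore,               // Q
    Concat(Matrix),        // cm
    BeginText,             // BT
    MoveText(f64, f64),    // Td
    SetTextMatrix(Matrix), // Tm
    ShowText(&'a str),     // Tj
}

/// Returns each shown string with the page-space origin it is drawn at.
fn text_origins<'a>(ops: &[Op<'a>]) -> Vec<(&'a str, f64, f64)> {
    let mut ctm = IDENTITY;
    let mut saved = Vec::new(); // graphics-state stack (CTM only)
    let mut tm = IDENTITY; // text matrix
    let mut tlm = IDENTITY; // text line matrix
    let mut out = Vec::new();
    for op in ops {
        match *op {
            Op::Save => saved.push(ctm),
            Op::Restore => ctm = saved.pop().unwrap_or(ctm),
            Op::Concat(m) => ctm = concat(m, ctm),
            Op::BeginText => {
                tm = IDENTITY;
                tlm = IDENTITY;
            }
            Op::MoveText(tx, ty) => {
                tlm = concat([1.0, 0.0, 0.0, 1.0, tx, ty], tlm);
                tm = tlm;
            }
            Op::SetTextMatrix(m) => {
                tm = m;
                tlm = m;
            }
            Op::ShowText(s) => {
                // Text space -> page space: text matrix first, then CTM.
                let m = concat(tm, ctm);
                out.push((s, m[4], m[5]));
            }
        }
    }
    out
}
```

The same `72 700 Td` puts text at (72, 700) under the identity CTM and at (154, 1420) after `2 0 0 2 10 20 cm`. That is why the redaction rectangle cannot be placed by reading the `Td` operands alone.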

Detecting PII Before Redaction

A typical workflow runs a pattern‑matching pass to locate personally identifiable information (PII) before any redaction occurs. Below is a Rust example that detects phone numbers and Japanese MyNumber identifiers:

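use regex::Regex;

// `PiiType` is not defined in the post; a plain enum is assumed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PiiType {
    Phone,
    MyNumber,
}

/// Returns (start, end, kind) per match; offsets are byte offsets into `text`.
pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
    let mut findings = Vec::new();

    // Phone numbers (e.g., 123-456-7890)
    let phone_re = Regex::new(r"\d{2,4}-\d{2,4}-\d{4}").unwrap();
    for m in phone_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::Phone));
    }

    // Japanese MyNumber (12 digits). `(?-u:...)` switches to ASCII rules:
    // the default Unicode `\b` sees no boundary between a kanji or kana
    // character and a digit, so numbers embedded in Japanese text were missed.
    let mynumber_re = Regex::new(r"(?-u:\b\d{12}\b)").unwrap();
    for m in mynumber_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::MyNumber));
    }

    findings
}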

The detections are reviewed manually before committing the redaction, because fully automated redaction without review introduces its own risks.
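
For the review step, each finding has to be shown to a person with enough surrounding text to judge it. A minimal sketch of that, assuming the `(start, end)` byte offsets come from a regex match (so they fall on character boundaries); `review_line`, `back_up`, and `advance` are hypothetical helpers. The context window is counted in characters rather than bytes, since slicing Japanese text at an arbitrary byte offset would panic.

```rust
/// Byte offset reached by stepping back at most `context` characters from `at`.
fn back_up(text: &str, at: usize, context: usize) -> usize {
    text[..at]
        .char_indices()
        .rev()
        .take(context)
        .last()
        .map_or(at, |(i, _)| i)
}

/// Byte offset reached by stepping forward at most `context` characters from `at`.
fn advance(text: &str, at: usize, context: usize) -> usize {
    text[at..]
        .char_indices()
        .nth(context)
        .map_or(text.len(), |(i, _)| at + i)
}

/// Formats one finding for a human reviewer: surrounding context with the
/// matched span wrapped in `[[ ]]`.
pub fn review_line(text: &str, start: usize, end: usize, context: usize) -> String {
    let from = back_up(text, start, context);
    let to = advance(text, end, context);
    format!(
        "{}[[{}]]{}",
        &text[from..start],
        &text[start..end],
        &text[end..to]
    )
}
```

A reviewer can then accept or reject each line before any content stream is touched.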

Resources

  • Hiyoko PDF Vault – a tool for secure PDF handling (author: @hiyoyok)