Benchmarking LLM Context Awareness Without Sending Raw PII
Source: Dev.to
TL;DR: I measured whether an LLM can still understand relationships and context when raw identifiers never enter the prompt. It turns out that simple redaction works poorly, but with a small tweak it nearly matches full context!
I compared three approaches:
- Full Context (baseline)
- Standard Redaction (everything becomes the same generic tag, e.g. [PERSON])
- Semantic Masking (my own simple package built on top of spaCy that generates context‑aware placeholders with IDs to keep relations, e.g. {Person_A})
The results were surprising: in a stress test for relationship reasoning, standard redaction collapsed to 27 % accuracy. Semantic masking achieved 91 % accuracy – matching the unmasked baseline almost perfectly while keeping direct identifiers local.
Scope note: This is not anonymisation. The goal is narrower but practical: keep direct identifiers (names, emails, IDs) local, while giving the model enough structure to reason intelligently.
All source code is linked at the end.
Why this matters (beyond just RAG)
People love using AI interfaces, but we often forget that an LLM is a general‑purpose engine, not a secure vault. Whether you are building a chatbot, an agent, or a RAG pipeline, passing raw data carries risks:
- Prompt logging & tracing
- Vector‑DB storage (embedding raw PII)
- Debugging screenshots
- “Fallback” calls to external providers
As a developer in the EU, I wanted to explore a mask‑first approach: transform data locally, prompt on masked text, and (optionally) re‑hydrate the response locally.
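To make that flow concrete, here is a minimal sketch of the mask → prompt → re‑hydrate round trip. The `mask_text` and `rehydrate` helpers and the hard‑coded entity map are illustrative stand‑ins, not the package API:

```python
# Minimal sketch of the mask-first round trip (hypothetical helpers, not the package API).
def mask_text(text: str, entities: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace known identifiers with placeholders; return masked text plus a reverse map."""
    reverse = {}
    for real, placeholder in entities.items():
        text = text.replace(real, placeholder)
        reverse[placeholder] = real
    return text, reverse

def rehydrate(text: str, reverse: dict[str, str]) -> str:
    """Swap placeholders back to the real identifiers, locally."""
    for placeholder, real in reverse.items():
        text = text.replace(placeholder, real)
    return text

entities = {"Anna": "{Person_A}", "Emma": "{Person_B}"}
masked, reverse = mask_text("Anna calls Emma.", entities)
print(masked)                                   # "{Person_A} calls {Person_B}." -> goes to the LLM
llm_answer = "{Person_A} initiated the call."   # stand-in for the model's response
print(rehydrate(llm_answer, reverse))           # "Anna initiated the call."
```

In production the entity map would come from local NER rather than being hard‑coded; only the masked string ever leaves the machine.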
The Problem: Context Collapse
The issue with standard redaction isn’t that the tools are bad—it’s that they destroy the information the model needs to understand who is doing what.
The “Anna & Emma” scenario
Original text: “Anna calls Emma.”
Standard Redaction → [PERSON] calls [PERSON].
- Issue: The model has zero way to distinguish who called whom. Reasoning collapses.
Semantic Masking → {Person_A} calls {Person_B}.
- Win: The model knows A and B are different people, preserving the relationship. When the answer comes back ({Person_A} initiated the call), we can swap the real names back in locally.
I wanted to measure exactly how much reasoning we lose with redaction, and whether we can fix it by adding some semantics.
Benchmarks
I ran two experiments to test this hypothesis.
| # | Benchmark | Description |
|---|---|---|
| 1 | “Who is Who” Stress Test (N = 11) | Small synthetic dataset designed to test context‑awareness of LLMs using different PII‑removal tools. Features multiple people interacting in one story and relational reasoning (e.g., “Who is the manager?”). |
| 2 | RAG QA Benchmark | Simulation of a retrieval pipeline: 1. Take a private document. 2. Mask it. 3. Ask the LLM questions based only on the masked text. |
Setup
- Model: `gpt‑4o‑mini` (temperature = 0)
- Evaluator: `gpt‑4o‑mini` used as an LLM judge in a separate evaluation prompt (temperature = 0); a sketch of such a judge call follows below
- Metric: Accuracy on relationship‑extraction questions
Note: Small‑N benchmarks are meant to expose failure modes, not claim statistical perfection. They are a “vibe check” for logic.
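For reference, this is roughly how such an LLM‑judge call could look with the OpenAI Python SDK. The grading prompt and the `judge` helper are my own assumptions, not the exact prompts from the repo:

```python
# Illustrative LLM-judge call (OpenAI Python SDK >= 1.0); the prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, expected: str, answer: str) -> bool:
    """Ask gpt-4o-mini (temperature 0) whether the answer matches the expected one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a strict grader. Reply with exactly CORRECT or WRONG."},
            {"role": "user", "content": f"Question: {question}\nExpected: {expected}\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```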
Comparing the Approaches
- Full Context (Baseline) – raw text (high privacy risk, perfect context).
- Standard Redaction – replace entities with generic tags (e.g. [PERSON]).
- Semantic Masking – my approach, which does three things differently (a minimal sketch follows after this list):
  - Consistency: “Anna” becomes {Person_hxg3}; every subsequent occurrence uses the same placeholder.
  - Entity Linking: “Anna Smith” and “Anna” are detected as the same entity and receive the same placeholder.
  - Semantic Hints: Dates aren’t just [DATE] but {Date_October_2000}, preserving timeline information without revealing the exact day.
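Here is a minimal sketch of the placeholder logic on top of spaCy’s NER. The ID scheme and the `semantic_mask` helper are illustrative and far simpler than the actual package:

```python
# Sketch: consistent, context-aware placeholders on top of spaCy NER (illustrative, not the package).
import secrets
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def semantic_mask(text: str) -> tuple[str, dict[str, str]]:
    doc = nlp(text)
    placeholders: dict[str, str] = {}   # entity text -> placeholder
    masked = text
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            if ent.text not in placeholders:
                placeholders[ent.text] = f"{{Person_{secrets.token_hex(2)}}}"
            masked = masked.replace(ent.text, placeholders[ent.text])
        elif ent.label_ == "DATE":
            # Keep a semantic hint (month/year) instead of a blank tag.
            masked = masked.replace(ent.text, f"{{Date_{ent.text.replace(' ', '_')}}}")
    return masked, placeholders

masked, mapping = semantic_mask("Anna met Emma in October 2000.")
print(masked)   # e.g. "{Person_1a2b} met {Person_3c4d} in {Date_October_2000}."
```

This only shows the core idea of stable, semantic placeholders; entity linking and other entity types are handled separately.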
The Results
Benchmark 1 – Coreference Stress Test (N = 11)
| Strategy | Accuracy | Why? |
|---|---|---|
| Full Context | 90.9 % (10/11) | Baseline (one error due to model hallucination). |
| Standard Redaction | 27.3 % (3/11) | Total collapse – the model guessed blindly because everyone was the same generic tag. |
| Semantic Masking | 90.9 % (10/11) | Context restored – performance matches raw data. |
Benchmark 2 – RAG QA
| Strategy | Context Retention |
|---|---|
| Original (Baseline) | 100 % |
| Standard Redaction | ≈ 10 % |
| Semantic Masking | 92–100 % |
Takeaway: You don’t need real names to reason. You just need structure.
What I Learned
- Structure > Content: For most AI tasks, the model doesn’t care who someone is; it cares about the graph of relationships (e.g., Person A → boss of → Person B).
- Entity Linking is Critical: Naïve find‑and‑replace fails on “Anna” vs. “Anna Smith”. You need logic that links these to the same ID, otherwise the model thinks they are two different people (see the sketch after this list).
- Privacy Enablement: This opens up use cases (HR, detailed customer support, legal) where we previously thought “we can’t use LLMs because we can’t send the data.”
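One hedged sketch of how that linking could work is a simple token‑subset heuristic; this is a simplification for illustration, not the package’s actual logic:

```python
# Sketch: link "Anna" to an already-seen "Anna Smith" by token overlap (illustrative heuristic).
def link_entity(name: str, known: dict[str, str]) -> str | None:
    """Return the placeholder of a known entity that this name is a sub-span of, if any."""
    tokens = set(name.lower().split())
    for full_name, placeholder in known.items():
        if tokens and tokens <= set(full_name.lower().split()):
            return placeholder
    return None

known = {"Anna Smith": "{Person_hxg3}"}
print(link_entity("Anna", known))   # {Person_hxg3} -> same person, same placeholder
print(link_entity("Emma", known))   # None -> new entity, mint a new placeholder
```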
Reproducibility vs. Privacy
- In Production: Use ephemeral IDs (random per session). “Anna” is {Person_X} today and {Person_Y} tomorrow, preventing cross‑session profiling.
- For Benchmarking: I used a fixed seed to make the runs comparable. (A sketch of both modes follows below.)
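For illustration, here is what the two ID modes could look like, assuming a simple hex‑suffix scheme like the one sketched earlier:

```python
# Sketch: ephemeral per-session IDs vs. seeded IDs for reproducible benchmarks (illustrative).
import random
import secrets

def new_placeholder_ephemeral(kind: str = "Person") -> str:
    # Production: a fresh random suffix every session, so "Anna" cannot be tracked across sessions.
    return f"{{{kind}_{secrets.token_hex(2)}}}"

def new_placeholder_seeded(rng: random.Random, kind: str = "Person") -> str:
    # Benchmarking: a fixed seed makes placeholder assignment identical across runs.
    return f"{{{kind}_{rng.getrandbits(16):04x}}}"

rng = random.Random(42)
print(new_placeholder_ephemeral())   # different every run
print(new_placeholder_seeded(rng))   # same every run with seed 42
```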
Resources & Code
If you want to reproduce this or stress‑test my semantic‑masking approach yourself, check out the libraries:
```bash
# Clone the repo
git clone https://github.com/Privalyse/privalyse-research.git
cd privalyse-research

# Install dependencies
pip install -r requirements.txt

# Run the coreference stress test
python benchmarks/coreference_stress_test.py

# Run the RAG QA benchmark
python benchmarks/rag_qa.py
```
Coreference Benchmark (context_research/01_coreference_benchmark.py)
# (your coreference benchmark code here)
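The actual script lives in the repo; as a stand‑in, here is a hypothetical sketch of how such a harness could be structured. The test cases and the `ask`/`judge` callables are all illustrative:

```python
# Hypothetical sketch of a coreference stress test; not the script from the repo.
CASES = [
    {
        "story": "{Person_A} calls {Person_B}. {Person_B} reports to {Person_A}.",
        "question": "Who is the manager?",
        "expected": "{Person_A}",
    },
    # ... more masked mini-stories ...
]

def run(ask, judge) -> float:
    """ask(story, question) queries the model; judge(question, expected, answer) returns a bool."""
    correct = 0
    for case in CASES:
        answer = ask(case["story"], case["question"])
        if judge(case["question"], case["expected"], answer):
            correct += 1
    return correct / len(CASES)
```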
RAG QA Benchmark (context_research/02_rag_qa_benchmark.py)
# (your RAG QA benchmark code here)
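Likewise, only a hypothetical sketch of the masked‑QA loop; the masked document, questions, and `ask`/`judge` callables are illustrative, not taken from the repo:

```python
# Hypothetical sketch of the RAG QA flow: answer questions from masked text only.
MASKED_DOC = "{Person_hxg3} is the account manager for {Person_k91f} since {Date_October_2000}."
QUESTIONS = [
    ("Who is the account manager?", "{Person_hxg3}"),
    ("Since when?", "{Date_October_2000}"),
]

def run_rag_qa(ask, judge) -> float:
    """ask(context, question) queries the model on masked text; judge(...) scores the answer."""
    correct = 0
    for question, expected in QUESTIONS:
        answer = ask(MASKED_DOC, question)   # the model never sees the raw names
        if judge(question, expected, answer):
            correct += 1
    return correct / len(QUESTIONS)          # "context retention" as accuracy on masked text
```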
Semantic Masking Library
```bash
pip install privalyse-mask
```
Limitations / Threat Model
To be fully transparent:
- ✅ Direct Identifiers are gone: Names, emails, phone numbers are masked locally.
- ❌ Re‑identification is possible: If the remaining (non‑PII) context is unique enough (e.g., “the CEO of Apple in 2010”), the model may still infer who the real person is.
- ❌ No Differential Privacy: This is a utility‑first approach, not a mathematical guarantee.
This approach is about minimizing data exposure while maximizing model intelligence, not about achieving perfect anonymity.
Discussion
I’d love to hear from others working on privacy‑preserving AI:
- Are there other tools that handle entity linking during masking?
- Do you know of standard datasets for “privacy‑preserving reasoning”?
- Are there common benchmarks for that kind of context awareness? (I only found some for long contexts)
Let’s chat in the comments! 👇