Benchmarking LLM Context Awareness Without Sending Raw PII
Source: Dev.to
TL;DR: I measured whether an LLM can still understand relationships and context when raw identifiers never enter the prompt. It turns out that simple redaction works poorly, but with a small tweak it nearly matches full context!
I compared three approaches:
- Full Context (baseline)
- Standard Redaction (everything becomes the same generic tag, e.g. [PERSON])
- Semantic Masking (my own simple package built on top of spaCy that generates context‑aware placeholders with IDs to keep relations, e.g. {Person_A})
The results were surprising: in a stress test for relationship reasoning, standard redaction collapsed to 27 % accuracy. Semantic masking achieved 91 % accuracy – matching the unmasked baseline almost perfectly while keeping direct identifiers local.
Scope note: This is not anonymisation. The goal is narrower but practical: keep direct identifiers (names, emails, IDs) local, while giving the model enough structure to reason intelligently.
All source code is linked at the end.
Why this matters (beyond just RAG)
People love using AI interfaces, but we often forget that an LLM is a general‑purpose engine, not a secure vault. Whether you are building a chatbot, an agent, or a RAG pipeline, passing raw data carries risks:
- Prompt logging & tracing
- Vector‑DB storage (embedding raw PII)
- Debugging screenshots
- “Fallback” calls to external providers
As a developer in the EU, I wanted to explore a mask‑first approach: transform data locally, prompt on masked text, and (optionally) re‑hydrate the response locally.
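To make that flow concrete, here is a minimal sketch of the mask → prompt → re‑hydrate round trip. The `mask_text` and `rehydrate` helpers and the hard‑coded entity map are illustrative stand‑ins, not the package API:

```python
# Minimal sketch of the mask-first round trip (hypothetical helpers, not the package API).
def mask_text(text: str, entities: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace known identifiers with placeholders; return masked text plus a reverse map."""
    reverse = {}
    for real, placeholder in entities.items():
        text = text.replace(real, placeholder)
        reverse[placeholder] = real
    return text, reverse

def rehydrate(text: str, reverse: dict[str, str]) -> str:
    """Swap placeholders back to the real identifiers, locally."""
    for placeholder, real in reverse.items():
        text = text.replace(placeholder, real)
    return text

entities = {"Anna": "{Person_A}", "Emma": "{Person_B}"}
masked, reverse = mask_text("Anna calls Emma.", entities)
print(masked)                                   # "{Person_A} calls {Person_B}." -> goes to the LLM
llm_answer = "{Person_A} initiated the call."   # stand-in for the model's response
print(rehydrate(llm_answer, reverse))           # "Anna initiated the call."
```

In production the entity map would come from local NER rather than being hard‑coded; only the masked string ever leaves the machine.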
The Problem: Context Collapse
The issue with standard redaction isn’t that the tools are bad—it’s that they destroy the information the model needs to understand who is doing what.
The “Anna & Emma” scenario
Original text: “Anna calls Emma.”
Standard Redaction → [PERSON] calls [PERSON].
- Issue: The model has zero way to distinguish who called whom. Reasoning collapses.
Semantic Masking → {Person_A} calls {Person_B}.
- Win: The model knows A and B are different people, preserving the relationship. When the answer comes back ({Person_A} initiated the call), we can swap the real names back in locally.
I wanted to measure exactly how much reasoning we lose with redaction, and whether we can fix it by adding some semantics.
Benchmarks
I ran two experiments to test this hypothesis.
| # | Benchmark | Description |
|---|---|---|
| 1 | “Who is Who” Stress Test (N = 11) | Small synthetic dataset designed to test context‑awareness of LLMs using different PII‑removal tools. Features multiple people interacting in one story and relational reasoning (e.g., “Who is the manager?”). |
| 2 | RAG QA Benchmark | Simulation of a retrieval pipeline: 1. Take a private document. 2. Mask it. 3. Ask the LLM questions based only on the masked text. |
Setup
- Model: `gpt‑4o‑mini` (temperature = 0)
- Evaluator: `gpt‑4o‑mini` used as an LLM judge in a separate evaluation prompt (temperature = 0); a sketch of such a judge call follows below
- Metric: Accuracy on relationship‑extraction questions
Note: Small‑N benchmarks are meant to expose failure modes, not claim statistical perfection. They are a “vibe check” for logic.
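For reference, this is roughly how such an LLM‑judge call could look with the OpenAI Python SDK. The grading prompt and the `judge` helper are my own assumptions, not the exact prompts from the repo:

```python
# Illustrative LLM-judge call (OpenAI Python SDK >= 1.0); the prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, expected: str, answer: str) -> bool:
    """Ask gpt-4o-mini (temperature 0) whether the answer matches the expected one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a strict grader. Reply with exactly CORRECT or WRONG."},
            {"role": "user", "content": f"Question: {question}\nExpected: {expected}\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```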
Comparing the Approaches
- Full Context (Baseline) – raw text (high privacy risk, perfect context).
- Standard Redaction – replace entities with generic tags (e.g. [PERSON]).
- Semantic Masking – my approach, which does three things differently (a minimal sketch follows after this list):
  - Consistency: “Anna” becomes {Person_hxg3}; every subsequent occurrence uses the same placeholder.
  - Entity Linking: “Anna Smith” and “Anna” are detected as the same entity and receive the same placeholder.
  - Semantic Hints: Dates aren’t just [DATE] but {Date_October_2000}, preserving timeline information without revealing the exact day.
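Here is a minimal sketch of the placeholder logic on top of spaCy’s NER. The ID scheme and the `semantic_mask` helper are illustrative and far simpler than the actual package:

```python
# Sketch: consistent, context-aware placeholders on top of spaCy NER (illustrative, not the package).
import secrets
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def semantic_mask(text: str) -> tuple[str, dict[str, str]]:
    doc = nlp(text)
    placeholders: dict[str, str] = {}   # entity text -> placeholder
    masked = text
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            if ent.text not in placeholders:
                placeholders[ent.text] = f"{{Person_{secrets.token_hex(2)}}}"
            masked = masked.replace(ent.text, placeholders[ent.text])
        elif ent.label_ == "DATE":
            # Keep a semantic hint (month/year) instead of a blank tag.
            masked = masked.replace(ent.text, f"{{Date_{ent.text.replace(' ', '_')}}}")
    return masked, placeholders

masked, mapping = semantic_mask("Anna met Emma in October 2000.")
print(masked)   # e.g. "{Person_1a2b} met {Person_3c4d} in {Date_October_2000}."
```

This only shows the core idea of stable, semantic placeholders; entity linking and other entity types are handled separately.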
The Results
Benchmark 1 – Coreference Stress Test (N = 11)
| Strategy | Accuracy | Why? |
|---|---|---|
| Full Context | 90.9 % (10/11) | Baseline (one error due to model hallucination). |
| Standard Redaction | 27.3 % (3/11) | Total collapse – the model guessed blindly because everyone was the same generic tag. |
| Semantic Masking | 90.9 % (10/11) | Context restored – performance matches raw data. |
Benchmark 2 – RAG QA
| Strategy | Context Retention |
|---|---|
| Original (Baseline) | 100 % |
| Standard Redaction | ≈ 10 % |
| Semantic Masking | 92–100 % |
Takeaway: You don’t need real names to reason. You just need structure.
What I Learned
- Structure > Content: For most AI tasks, the model doesn’t care who someone is; it cares about the graph of relationships (e.g., Person A → boss of → Person B).
- Entity Linking is Critical: Naïve find‑and‑replace fails on “Anna” vs. “Anna Smith”. You need logic that links these to the same ID, otherwise the model thinks they are two different people (see the sketch after this list).
- Privacy Enablement: This opens up use cases (HR, detailed customer support, legal) where we previously thought “we can’t use LLMs because we can’t send the data.”
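One hedged sketch of how that linking could work is a simple token‑subset heuristic; this is a simplification for illustration, not the package’s actual logic:

```python
# Sketch: link "Anna" to an already-seen "Anna Smith" by token overlap (illustrative heuristic).
def link_entity(name: str, known: dict[str, str]) -> str | None:
    """Return the placeholder of a known entity that this name is a sub-span of, if any."""
    tokens = set(name.lower().split())
    for full_name, placeholder in known.items():
        if tokens and tokens <= set(full_name.lower().split()):
            return placeholder
    return None

known = {"Anna Smith": "{Person_hxg3}"}
print(link_entity("Anna", known))   # {Person_hxg3} -> same person, same placeholder
print(link_entity("Emma", known))   # None -> new entity, mint a new placeholder
```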
Reproducibility vs. Privacy
- In Production: Use ephemeral IDs (random per session). “Anna” is {Person_X} today and {Person_Y} tomorrow, preventing cross‑session profiling.
- For Benchmarking: I used a fixed seed to make the runs comparable. (A sketch of both modes follows below.)
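For illustration, here is what the two ID modes could look like, assuming a simple hex‑suffix scheme like the one sketched earlier:

```python
# Sketch: ephemeral per-session IDs vs. seeded IDs for reproducible benchmarks (illustrative).
import random
import secrets

def new_placeholder_ephemeral(kind: str = "Person") -> str:
    # Production: a fresh random suffix every session, so "Anna" cannot be tracked across sessions.
    return f"{{{kind}_{secrets.token_hex(2)}}}"

def new_placeholder_seeded(rng: random.Random, kind: str = "Person") -> str:
    # Benchmarking: a fixed seed makes placeholder assignment identical across runs.
    return f"{{{kind}_{rng.getrandbits(16):04x}}}"

rng = random.Random(42)
print(new_placeholder_ephemeral())   # different every run
print(new_placeholder_seeded(rng))   # same every run with seed 42
```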
Resources & Code
If you want to reproduce this or stress‑test my semantic‑masking approach yourself, check out the libraries:
```bash
# Clone the repo
git clone https://github.com/Privalyse/privalyse-research.git
cd privalyse-research

# Install dependencies
pip install -r requirements.txt

# Run the coreference stress test
python benchmarks/coreference_stress_test.py

# Run the RAG QA benchmark
python benchmarks/rag_qa.py
```
Coreference Benchmark (context_research/01_coreference_benchmark.py)
# (your coreference benchmark code here)
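The actual script lives in the repo; as a stand‑in, here is a hypothetical sketch of how such a harness could be structured. The test cases and the `ask`/`judge` callables are all illustrative:

```python
# Hypothetical sketch of a coreference stress test; not the script from the repo.
CASES = [
    {
        "story": "{Person_A} calls {Person_B}. {Person_B} reports to {Person_A}.",
        "question": "Who is the manager?",
        "expected": "{Person_A}",
    },
    # ... more masked mini-stories ...
]

def run(ask, judge) -> float:
    """ask(story, question) queries the model; judge(question, expected, answer) returns a bool."""
    correct = 0
    for case in CASES:
        answer = ask(case["story"], case["question"])
        if judge(case["question"], case["expected"], answer):
            correct += 1
    return correct / len(CASES)
```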
RAG QA Benchmark (context_research/02_rag_qa_benchmark.py)
# (your RAG QA benchmark code here)
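Likewise, only a hypothetical sketch of the masked‑QA loop; the masked document, questions, and `ask`/`judge` callables are illustrative, not taken from the repo:

```python
# Hypothetical sketch of the RAG QA flow: answer questions from masked text only.
MASKED_DOC = "{Person_hxg3} is the account manager for {Person_k91f} since {Date_October_2000}."
QUESTIONS = [
    ("Who is the account manager?", "{Person_hxg3}"),
    ("Since when?", "{Date_October_2000}"),
]

def run_rag_qa(ask, judge) -> float:
    """ask(context, question) queries the model on masked text; judge(...) scores the answer."""
    correct = 0
    for question, expected in QUESTIONS:
        answer = ask(MASKED_DOC, question)   # the model never sees the raw names
        if judge(question, expected, answer):
            correct += 1
    return correct / len(QUESTIONS)          # "context retention" as accuracy on masked text
```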
Semantic Masking Library
```bash
pip install privalyse-mask
```
Limitations / Threat Model
To be fully transparent:
- ✅ Direct Identifiers are gone: Names, emails, phone numbers are masked locally.
- ❌ Re‑identification is possible: If the remaining (non‑PII) context is unique enough (e.g., “the CEO of Apple in 2010”), the model may still infer who the real person is.
- ❌ No Differential Privacy: This is a utility‑first approach, not a mathematical guarantee.
This approach is about minimizing data exposure while maximizing model intelligence, not about achieving perfect anonymity.
Discussion
I’d love to hear from others working on privacy‑preserving AI:
- Are there other tools that handle entity linking during masking?
- Do you know of standard datasets for “privacy‑preserving reasoning”?
- Are there common benchmarks for that kind of context awareness? (I only found some for long contexts)
Let’s chat in the comments! 👇