Benchmarking LLM Context Awareness Without Sending Raw PII

Published: January 14, 2026 at 11:40 AM EST
5 min read
Source: Dev.to

TL;DR: I measured whether an LLM can still understand relationships and context when raw identifiers never enter the prompt. It turns out simple redaction performs poorly, but with a small tweak the masked setup nearly matches full context!

I compared three approaches:

  1. Full Context (baseline)
  2. Standard Redaction (every entity becomes the same generic tag, e.g. “[PERSON]”)
  3. Semantic Masking (my own simple package built on top of spaCy that generates context‑aware placeholders with IDs to keep relations, e.g. {Person_A})

The results were surprising: in a stress test for relationship reasoning, standard redaction collapsed to 27 % accuracy. Semantic masking achieved 91 % accuracy – matching the unmasked baseline almost perfectly while keeping direct identifiers local.

Scope note: This is not anonymisation. The goal is narrower but practical: keep direct identifiers (names, emails, IDs) local, while giving the model enough structure to reason intelligently.

All source code is linked at the end.

Why this matters (beyond just RAG)

People love using AI interfaces, but we often forget that an LLM is a general‑purpose engine, not a secure vault. Whether you are building a chatbot, an agent, or a RAG pipeline, passing raw data carries risks:

  • Prompt logging & tracing
  • Vector‑DB storage (embedding raw PII)
  • Debugging screenshots
  • “Fallback” calls to external providers

As a developer in the EU, I wanted to explore a mask‑first approach: transform data locally, prompt on masked text, and (optionally) re‑hydrate the response locally.

The Problem: Context Collapse

The issue with standard redaction isn’t that the tools are bad—it’s that they destroy the information the model needs to understand who is doing what.

The “Anna & Emma” scenario

Original text: “Anna calls Emma.”

Standard Redaction: “[PERSON] calls [PERSON].” (both names become the same generic tag)

  • Issue: The model has zero way to distinguish who called whom. Reasoning collapses.

Semantic Masking: “{Person_A} calls {Person_B}.”

  • Win: The model knows A and B are different people, preserving the relationship. When the answer comes back ({Person_A} initiated the call), we can swap the real names back in locally.
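
To make that swap-back concrete, here is a minimal sketch of the local re-hydration step. The mapping dict and the rehydrate helper are my own illustration, not the privalyse-mask API; the mapping is produced locally at masking time and never leaves the machine.

# Minimal sketch of local re-hydration (hypothetical helper, not the privalyse-mask API).
# The placeholder -> name mapping is created locally and never sent to the model.
mapping = {"{Person_A}": "Anna", "{Person_B}": "Emma"}

def rehydrate(text: str, mapping: dict[str, str]) -> str:
    # Swap placeholders back to the original identifiers.
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

llm_answer = "{Person_A} initiated the call."
print(rehydrate(llm_answer, mapping))  # -> "Anna initiated the call."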

I wanted to measure: Exactly how much reasoning do we lose with redaction, and can we fix it by adding some semantics?

Benchmarks

I ran two experiments to test this hypothesis.

| # | Benchmark | Description |
| --- | --- | --- |
| 1 | “Who is Who” Stress Test (N = 11) | Small synthetic dataset designed to test context‑awareness of LLMs using different PII‑removal tools. Features multiple people interacting in one story and relational reasoning (e.g., “Who is the manager?”). |
| 2 | RAG QA Benchmark | Simulation of a retrieval pipeline: (1) take a private document, (2) mask it, (3) ask the LLM questions based only on the masked text. |

Setup

  • Model: gpt‑4o‑mini (temperature = 0)
  • Evaluator: gpt‑4o‑mini used as an LLM judge in a separate evaluation prompt (temperature = 0; sketched below)
  • Metric: Accuracy on relationship‑extraction questions

Note: Small‑N benchmarks are meant to expose failure modes, not claim statistical perfection. They are a “vibe check” for logic.
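
For context, here is a rough sketch of how such an LLM-judge call can be wired up. The prompt wording and the judge_answer helper are my own illustration, not the exact evaluation prompt used in the benchmark scripts.

# Rough sketch of an LLM-as-judge check (illustrative prompt wording, not the
# exact evaluation prompt from the benchmark repo).
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, expected: str, actual: str) -> bool:
    # Ask gpt-4o-mini (temperature 0) whether the model's answer matches the gold answer.
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model answer: {actual}\n"
        "Does the model answer express the same relationship as the expected answer? "
        "Reply with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")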

Comparing the Approaches

  1. Full Context (Baseline) – raw text (high privacy risk, perfect context).

  2. Standard Redaction – replace entities with generic tags (e.g. “[PERSON]”, “[DATE]”).

  3. Semantic Masking – my approach, which does three things differently (a toy sketch follows this list):

    • Consistency: “Anna” becomes {Person_hxg3}; every subsequent occurrence uses the same placeholder.
    • Entity Linking: “Anna Smith” and “Anna” are detected as the same entity and receive the same placeholder.
    • Semantic Hints: Dates aren’t reduced to a bare “[DATE]” tag but become {Date_October_2000}, preserving timeline information without revealing the exact day.
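
Here is a toy spaCy-based sketch of those three ideas. It is not the privalyse-mask implementation (the real entity linking and date handling are more involved); the helper name and the naive first-name linking rule are my own simplifications.

# Toy illustration of consistency, entity linking and semantic hints with spaCy.
# NOT the privalyse-mask implementation -- just a simplified sketch of the ideas.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def semantic_mask(text: str) -> tuple[str, dict[str, str]]:
    doc = nlp(text)
    mapping: dict[str, str] = {}   # placeholder -> original surface form
    people: dict[str, str] = {}    # canonical person key -> placeholder
    masked = text

    for ent in sorted(doc.ents, key=lambda e: -len(e.text)):  # longest mentions first
        if ent.label_ == "PERSON":
            # Naive entity linking: "Anna" and "Anna Smith" share the same first token,
            # so they receive the same placeholder.
            key = ent.text.split()[0].lower()
            if key not in people:
                people[key] = f"{{Person_{chr(ord('A') + len(people))}}}"
            placeholder = people[key]
        elif ent.label_ == "DATE":
            # Semantic hint: keep month/year, drop the exact day.
            hint = re.sub(r"\b\d{1,2}(st|nd|rd|th)?\b,?\s*", "", ent.text).strip()
            placeholder = "{Date_" + hint.replace(" ", "_") + "}"
        else:
            continue
        mapping.setdefault(placeholder, ent.text)
        masked = masked.replace(ent.text, placeholder)  # consistency across occurrences

    return masked, mapping

masked, mapping = semantic_mask("Anna Smith met Emma on 12 October 2000. Anna called her later.")
print(masked)
# e.g. "{Person_A} met {Person_B} on {Date_October_2000}. {Person_A} called her later."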

The Results

Benchmark 1 – Coreference Stress Test (N = 11)

| Strategy | Accuracy | Why? |
| --- | --- | --- |
| Full Context | 90.9 % (10/11) | Baseline (one error due to model hallucination). |
| Standard Redaction | 27.3 % (3/11) | Total collapse – the model guessed blindly because every entity carried the same generic tag. |
| Semantic Masking | 90.9 % (10/11) | Context restored – performance matches raw data. |

Benchmark 2 – RAG QA

| Strategy | Context Retention |
| --- | --- |
| Original (Baseline) | 100 % |
| Standard Redaction | ≈ 10 % |
| Semantic Masking | 92–100 % |

Takeaway: You don’t need real names to reason. You just need structure.

What I Learned

  • Structure > Content: For most AI tasks, the model doesn’t care who someone is; it cares about the graph of relationships (e.g., Person A → boss of → Person B).
  • Entity Linking is Critical: Naïve find‑and‑replace fails on “Anna” vs. “Anna Smith”. You need logic that links these to the same ID, otherwise the model thinks they are two different people.
  • Privacy Enablement: This opens up use cases (HR, detailed customer support, legal) where we previously thought “we can’t use LLMs because we can’t send the data.”

Reproducibility vs. Privacy

  • In Production: Use ephemeral IDs (random per session). “Anna” is {Person_X} today and {Person_Y} tomorrow, preventing cross‑session profiling.
  • For Benchmarking: I used a fixed seed to make the runs comparable.
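
As a small illustration of that trade-off (hypothetical helpers, not the library's API):

# Ephemeral vs. seeded placeholder IDs (hypothetical helpers, not the privalyse-mask API).
import hashlib
import secrets

def ephemeral_person_id() -> str:
    # Production: a fresh random ID per session, so "Anna" maps differently each time.
    return f"{{Person_{secrets.token_hex(2)}}}"

def seeded_person_id(name: str, seed: str = "benchmark-seed") -> str:
    # Benchmarking: a deterministic ID derived from a fixed seed keeps runs comparable.
    digest = hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()[:4]
    return f"{{Person_{digest}}}"

print(ephemeral_person_id())     # different on every run, e.g. {Person_9f3a}
print(seeded_person_id("Anna"))  # stable across runs with the same seed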

Resources & Code

If you want to reproduce this or stress‑test my semantic‑masking approach yourself, check out the libraries:

# Clone the repo
git clone https://github.com/Privalyse/privalyse-research.git
cd privalyse-research

# Install dependencies
pip install -r requirements.txt

# Run the coreference stress test
python benchmarks/coreference_stress_test.py

# Run the RAG QA benchmark
python benchmarks/rag_qa.py

Coreference Benchmark (context_research/01_coreference_benchmark.py)

# (your coreference benchmark code here)

RAG QA Benchmark (context_research/02_rag_qa_benchmark.py)

# (your RAG QA benchmark code here)

Semantic Masking Library

Privalyse/privalyse-mask

pip install privalyse-mask

Limitations / Threat Model

To be fully transparent:

  • Direct Identifiers are gone: Names, emails, phone numbers are masked locally.
  • Re‑identification is possible: if the remaining non‑PII context is unique enough (e.g., “The CEO of Apple in 2010”), the model might still infer a real person.
  • No Differential Privacy: This is a utility‑first approach, not a mathematical guarantee.

This approach is about minimizing data exposure while maximizing model intelligence, not about achieving perfect anonymity.

Discussion

I’d love to hear from others working on privacy‑preserving AI:

  • Are there other tools that handle entity linking during masking?
  • Do you know of standard datasets for “privacy‑preserving reasoning”?
  • Are there common benchmarks for that kind of context awareness? (I only found some for long contexts)

Let’s chat in the comments! 👇
