[Paper] CAPID: Context-Aware PII Detection for Question-Answering Systems
Source: arXiv - 2602.10074v1
Overview
The paper introduces CAPID, a privacy‑preserving pipeline that detects personally identifiable information (PII) in user queries and decides whether each piece of PII is actually needed to answer the question. By fine‑tuning a small, locally‑run language model (SLM) on a specially crafted synthetic dataset, CAPID can filter out irrelevant PII before the query reaches a large language model (LLM) for answer generation, keeping privacy high without sacrificing answer quality.
Key Contributions
- Context‑aware PII detection – a model that not only spots PII spans and their types (e.g., email, SSN) but also predicts the relevance of each span to the user’s question.
- Synthetic data generation pipeline – leverages existing LLMs to create a large, diverse, multi‑domain dataset that annotates PII with relevance labels, solving the lack of publicly available training data.
- Fine‑tuned small language model (SLM) – demonstrates that a modest‑size model (≈ 1 B parameters) can achieve high accuracy on span, type, and relevance tasks, making on‑device or on‑premise deployment feasible.
- Empirical validation – shows that relevance‑aware redaction preserves significantly more downstream QA utility compared with blanket redaction or prior baselines.
Methodology
1. Synthetic Dataset Construction
- Prompted a powerful LLM (e.g., GPT‑4) to generate realistic user queries across several domains (health, finance, travel, etc.).
- Each query was automatically annotated with:
- PII spans (name, email, phone, ID numbers, etc.)
- The PII type label
- A binary relevance flag indicating whether the PII is needed to answer the question.
- The pipeline produced tens of thousands of examples, covering a wide variety of contexts and relevance patterns.
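The annotation scheme described above (spans, types, and a binary relevance flag) can be sketched as a small, self-contained example. The record layout, field names, and the `redact_irrelevant` helper below are illustrative assumptions, not the paper's actual dataset format:

```python
# Hypothetical example record illustrating the annotation schema; field
# names and PII labels are assumptions, not the paper's released format.

def make_span(query, text, pii_type, relevant):
    """Locate `text` in `query` and build one annotated PII span."""
    start = query.find(text)
    return {"text": text, "start": start, "end": start + len(text),
            "type": pii_type, "relevant": relevant}

QUERY = ("My name is Jane Doe and my account number is 12345678. "
         "What are the fees for international transfers?")

record = {
    "domain": "finance",
    "query": QUERY,
    "pii_spans": [
        # Neither the name nor the account number is needed to answer
        # a generic question about transfer fees.
        make_span(QUERY, "Jane Doe", "NAME", relevant=False),
        make_span(QUERY, "12345678", "ACCOUNT_NUMBER", relevant=False),
    ],
}

def redact_irrelevant(rec):
    """Drop only the spans flagged as irrelevant, keeping needed PII."""
    query = rec["query"]
    # Edit right-to-left so earlier character offsets remain valid.
    for span in sorted(rec["pii_spans"], key=lambda s: -s["start"]):
        if not span["relevant"]:
            query = query[:span["start"]] + "[REDACTED]" + query[span["end"]:]
    return query

print(redact_irrelevant(record))
```

A relevance-aware detector would keep, say, an account number when the question is "what is the balance on account 12345678?", which is exactly what blanket redaction cannot do.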
2. Model Architecture & Fine‑tuning
- Started from an open‑source SLM (e.g., a LLaMA‑family model or a distilled variant at the ≈1 B‑parameter scale noted above).
- Added a lightweight token‑level classifier head that jointly predicts:
- Span (whether a token belongs to a PII entity)
- Type (the PII category)
- Relevance (needed vs. not needed).
- Trained on the synthetic dataset using a multi‑task loss that balances span detection, type classification, and relevance estimation.
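The multi-task objective above (span + type + relevance) can be illustrated with a minimal, framework-free sketch. The function names and the uniform loss weights are assumptions; the paper's exact weighting is not reproduced here:

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one token's logits and its gold label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def multi_task_loss(span_logits, span_labels,
                    type_logits, type_labels,
                    rel_logits, rel_labels,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-token objectives: span, type, relevance."""
    def avg(logits_seq, labels):
        return sum(cross_entropy(l, y)
                   for l, y in zip(logits_seq, labels)) / len(labels)
    w_span, w_type, w_rel = weights
    return (w_span * avg(span_logits, span_labels)
            + w_type * avg(type_logits, type_labels)
            + w_rel * avg(rel_logits, rel_labels))

# Toy example: one token, two classes per head, uniform logits,
# so each head contributes exactly ln 2.
loss = multi_task_loss([[0.0, 0.0]], [0],
                       [[0.0, 0.0]], [0],
                       [[0.0, 0.0]], [0])
print(round(loss, 4))  # ≈ 2.0794, i.e. three heads × ln 2
```

In practice each head would sit on top of the SLM's token embeddings and be trained jointly; the weights trade off detection recall against relevance precision.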
3. Inference Pipeline (CAPID)
- Incoming user query → CAPID SLM → outputs a redacted version that removes only irrelevant PII.
- The sanitized query is then sent to a downstream LLM (e.g., ChatGPT) for answer generation.
- Because the SLM runs locally, no raw PII ever leaves the trusted environment.
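The three-stage flow can be sketched end to end. `stub_detector` below stands in for the fine-tuned SLM and is purely illustrative; the real model predicts spans, types, and relevance:

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, bool]  # (start, end, relevant)

def capid_filter(query: str,
                 detect_pii: Callable[[str], List[Span]],
                 placeholder: str = "[REDACTED]") -> str:
    """Sanitize the query locally before it ever reaches the remote LLM."""
    sanitized = query
    # Apply edits right-to-left so character offsets stay valid.
    for start, end, relevant in sorted(detect_pii(query), reverse=True):
        if not relevant:
            sanitized = sanitized[:start] + placeholder + sanitized[end:]
    return sanitized

def stub_detector(query: str) -> List[Span]:
    """Toy stand-in for the SLM: flags a phone number as irrelevant
    and keeps the order ID, which the answer actually needs."""
    phone = query.find("555-0100")
    order = query.find("42")
    return [(phone, phone + len("555-0100"), False),
            (order, order + len("42"), True)]

query = "Call me at 555-0100 about order 42."
sanitized = capid_filter(query, stub_detector)
# Only `sanitized` would be sent to the cloud LLM for answer generation.
print(sanitized)
```

Because the filter runs entirely on trusted hardware, the cloud LLM only ever sees the placeholder where irrelevant PII used to be.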
Results & Findings
| Metric | CAPID (SLM) | Prior Span‑Only Baselines | Full Redaction |
|---|---|---|---|
| Span F1 (PII detection) | 0.94 | 0.88 | — |
| Type Accuracy | 0.91 | 0.84 | — |
| Relevance Accuracy | 0.87 | 0.62 | — |
| Downstream QA Exact‑Match (after redaction) | 0.78 | 0.63 | 0.45 |
- Higher relevance accuracy means CAPID correctly keeps the PII that the answer actually needs, yielding a 33‑percentage‑point gain in downstream QA exact match over naïve full redaction (0.78 vs. 0.45).
- The SLM runs at ~30 ms per query on a single GPU, well within real‑time constraints for most web services.
- Privacy analysis shows that only ≈12 % of the original PII is retained, dramatically reducing exposure risk while preserving answer quality.
Practical Implications
- Enterprise QA bots (customer support, internal knowledge bases) can now safely forward user questions to powerful cloud LLMs without leaking unnecessary personal data.
- Regulatory compliance (GDPR, CCPA) becomes easier: CAPID provides an auditable log of which PII was stripped and why, supporting data‑minimization principles.
- Edge deployment: Because the detection model is small and open‑source, developers can embed it on‑device (mobile, IoT) to guarantee that raw PII never leaves the user’s hardware.
- Plug‑and‑play integration: CAPID’s API can sit in front of any existing LLM‑backed service, acting as a privacy filter without requiring changes to the downstream model.
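The auditable-log idea above can be sketched as a per-decision record. The schema and field names are assumptions (the summary only says such a log exists); hashing the span keeps raw PII out of the log itself, consistent with data minimization:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(query_id: str, span_text: str, pii_type: str,
                relevant: bool) -> dict:
    """Build one hypothetical audit-log record for a redaction decision."""
    return {
        "query_id": query_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pii_type": pii_type,
        "relevant": relevant,
        "action": "kept" if relevant else "redacted",
        # Store a hash rather than the raw value so the audit trail
        # itself never becomes a secondary PII store.
        "span_sha256": hashlib.sha256(span_text.encode("utf-8")).hexdigest(),
    }

entry = audit_entry("q-001", "jane@example.com", "EMAIL", relevant=False)
print(json.dumps(entry, indent=2))
```

A compliance reviewer can then reconstruct which PII types were stripped from which queries, and why, without ever re-exposing the underlying values.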
Limitations & Future Work
- Synthetic data bias – although the generation pipeline is diverse, it may not capture rare real‑world PII patterns or adversarial phrasing; further fine‑tuning on curated real data would improve robustness.
- Relevance subjectivity – determining whether a PII piece is “needed” can depend on downstream task specifics; the current binary label may be too coarse for some applications.
- Model size trade‑off – while a 1 B‑parameter SLM works well, ultra‑lightweight models (<100 M) would enable broader edge use cases; exploring distillation or pruning is a next step.
- Cross‑language support – the study focuses on English queries; extending CAPID to multilingual settings will be essential for global products.
CAPID demonstrates that privacy‑first PII handling doesn’t have to come at the cost of answer quality, opening a practical path for developers to harness the power of LLMs responsibly.
Authors
- Mariia Ponomarenko
- Sepideh Abedini
- Masoumeh Shafieinejad
- D. B. Emerson
- Shubhankar Mohapatra
- Xi He
Paper Information
- arXiv ID: 2602.10074v1
- Categories: cs.CR, cs.CL
- Published: February 10, 2026