[Paper] CAPID: Context-Aware PII Detection for Question-Answering Systems
Source: arXiv - 2602.10074v1
Overview
The paper introduces CAPID, a privacy‑preserving pipeline that detects personally identifiable information (PII) in user queries and decides whether each piece of PII is actually needed to answer the question. By fine‑tuning a small, locally‑run language model (SLM) on a specially crafted synthetic dataset, CAPID can filter out irrelevant PII before the query reaches a large language model (LLM) for answer generation, keeping privacy high without sacrificing answer quality.
Key Contributions
- Context‑aware PII detection – a model that not only spots PII spans and their types (e.g., email, SSN) but also predicts the relevance of each span to the user’s question.
- Synthetic data generation pipeline – leverages existing LLMs to create a large, diverse, multi‑domain dataset that annotates PII with relevance labels, solving the lack of publicly available training data.
- Fine‑tuned small language model (SLM) – demonstrates that a modest‑size model (≈ 1 B parameters) can achieve high accuracy on span, type, and relevance tasks, making on‑device or on‑premise deployment feasible.
- Empirical validation – shows that relevance‑aware redaction preserves significantly more downstream QA utility compared with blanket redaction or prior baselines.
Methodology
1. Synthetic Dataset Construction
- Prompted a powerful LLM (e.g., GPT‑4) to generate realistic user queries across several domains (health, finance, travel, etc.).
- Each query was automatically annotated with:
- PII spans (name, email, phone, ID numbers, etc.)
- The PII type label
- A binary relevance flag indicating whether the PII is needed to answer the question.
- The pipeline produced tens of thousands of examples, covering a wide variety of contexts and relevance patterns.
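The annotation scheme described above (spans, types, and a binary relevance flag) can be sketched as a small, self-contained example. The record layout, field names, and the `redact_irrelevant` helper below are illustrative assumptions, not the paper's actual dataset format:

```python
# Hypothetical example record illustrating the annotation schema; field
# names and PII labels are assumptions, not the paper's released format.

def make_span(query, text, pii_type, relevant):
    """Locate `text` in `query` and build one annotated PII span."""
    start = query.find(text)
    return {"text": text, "start": start, "end": start + len(text),
            "type": pii_type, "relevant": relevant}

QUERY = ("My name is Jane Doe and my account number is 12345678. "
         "What are the fees for international transfers?")

record = {
    "domain": "finance",
    "query": QUERY,
    "pii_spans": [
        # Neither the name nor the account number is needed to answer
        # a generic question about transfer fees.
        make_span(QUERY, "Jane Doe", "NAME", relevant=False),
        make_span(QUERY, "12345678", "ACCOUNT_NUMBER", relevant=False),
    ],
}

def redact_irrelevant(rec):
    """Drop only the spans flagged as irrelevant, keeping needed PII."""
    query = rec["query"]
    # Edit right-to-left so earlier character offsets remain valid.
    for span in sorted(rec["pii_spans"], key=lambda s: -s["start"]):
        if not span["relevant"]:
            query = query[:span["start"]] + "[REDACTED]" + query[span["end"]:]
    return query

print(redact_irrelevant(record))
```

A relevance-aware detector would keep, say, an account number when the question is "what is the balance on account 12345678?", which is exactly what blanket redaction cannot do.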
2. Model Architecture & Fine‑tuning
- Started from an open‑source SLM (e.g., a LLaMA‑family model or a distilled variant at the ≈1 B‑parameter scale noted above).
- Added a lightweight token‑level classifier head that jointly predicts:
- Span (whether a token belongs to a PII entity)
- Type (the PII category)
- Relevance (needed vs. not needed).
- Trained on the synthetic dataset using a multi‑task loss that balances span detection, type classification, and relevance estimation.
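The multi-task objective above (span + type + relevance) can be illustrated with a minimal, framework-free sketch. The function names and the uniform loss weights are assumptions; the paper's exact weighting is not reproduced here:

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one token's logits and its gold label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def multi_task_loss(span_logits, span_labels,
                    type_logits, type_labels,
                    rel_logits, rel_labels,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-token objectives: span, type, relevance."""
    def avg(logits_seq, labels):
        return sum(cross_entropy(l, y)
                   for l, y in zip(logits_seq, labels)) / len(labels)
    w_span, w_type, w_rel = weights
    return (w_span * avg(span_logits, span_labels)
            + w_type * avg(type_logits, type_labels)
            + w_rel * avg(rel_logits, rel_labels))

# Toy example: one token, two classes per head, uniform logits,
# so each head contributes exactly ln 2.
loss = multi_task_loss([[0.0, 0.0]], [0],
                       [[0.0, 0.0]], [0],
                       [[0.0, 0.0]], [0])
print(round(loss, 4))  # ≈ 2.0794, i.e. three heads × ln 2
```

In practice each head would sit on top of the SLM's token embeddings and be trained jointly; the weights trade off detection recall against relevance precision.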
3. Inference Pipeline (CAPID)
- Incoming user query → CAPID SLM → outputs a redacted version that removes only irrelevant PII.
- The sanitized query is then sent to a downstream LLM (e.g., ChatGPT) for answer generation.
- Because the SLM runs locally, no raw PII ever leaves the trusted environment.
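The three-stage flow can be sketched end to end. `stub_detector` below stands in for the fine-tuned SLM and is purely illustrative; the real model predicts spans, types, and relevance:

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, bool]  # (start, end, relevant)

def capid_filter(query: str,
                 detect_pii: Callable[[str], List[Span]],
                 placeholder: str = "[REDACTED]") -> str:
    """Sanitize the query locally before it ever reaches the remote LLM."""
    sanitized = query
    # Apply edits right-to-left so character offsets stay valid.
    for start, end, relevant in sorted(detect_pii(query), reverse=True):
        if not relevant:
            sanitized = sanitized[:start] + placeholder + sanitized[end:]
    return sanitized

def stub_detector(query: str) -> List[Span]:
    """Toy stand-in for the SLM: flags a phone number as irrelevant
    and keeps the order ID, which the answer actually needs."""
    phone = query.find("555-0100")
    order = query.find("42")
    return [(phone, phone + len("555-0100"), False),
            (order, order + len("42"), True)]

query = "Call me at 555-0100 about order 42."
sanitized = capid_filter(query, stub_detector)
# Only `sanitized` would be sent to the cloud LLM for answer generation.
print(sanitized)
```

Because the filter runs entirely on trusted hardware, the cloud LLM only ever sees the placeholder where irrelevant PII used to be.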
Results & Findings
| Metric | CAPID (SLM) | Prior Span‑Only Baselines | Full Redaction |
|---|---|---|---|
| Span F1 (PII detection) | 0.94 | 0.88 | — |
| Type Accuracy | 0.91 | 0.84 | — |
| Relevance Accuracy | 0.87 | 0.62 | — |
| Downstream QA Exact‑Match (after redaction) | 0.78 | 0.63 | 0.45 |
- Higher relevance accuracy means CAPID correctly keeps the PII that the answer actually needs, yielding a 33‑percentage‑point gain in downstream QA exact match over naïve full redaction (0.78 vs. 0.45).
- The SLM runs at ~30 ms per query on a single GPU, well within real‑time constraints for most web services.
- Privacy analysis shows that only ≈12 % of the original PII is retained, dramatically reducing exposure risk while preserving answer quality.
Practical Implications
- Enterprise QA bots (customer support, internal knowledge bases) can now safely forward user questions to powerful cloud LLMs without leaking unnecessary personal data.
- Regulatory compliance (GDPR, CCPA) becomes easier: CAPID provides an auditable log of which PII was stripped and why, supporting data‑minimization principles.
- Edge deployment: Because the detection model is small and open‑source, developers can embed it on‑device (mobile, IoT) to guarantee that raw PII never leaves the user’s hardware.
- Plug‑and‑play integration: CAPID’s API can sit in front of any existing LLM‑backed service, acting as a privacy filter without requiring changes to the downstream model.
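The auditable-log idea above can be sketched as a per-decision record. The schema and field names are assumptions (the summary only says such a log exists); hashing the span keeps raw PII out of the log itself, consistent with data minimization:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(query_id: str, span_text: str, pii_type: str,
                relevant: bool) -> dict:
    """Build one hypothetical audit-log record for a redaction decision."""
    return {
        "query_id": query_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pii_type": pii_type,
        "relevant": relevant,
        "action": "kept" if relevant else "redacted",
        # Store a hash rather than the raw value so the audit trail
        # itself never becomes a secondary PII store.
        "span_sha256": hashlib.sha256(span_text.encode("utf-8")).hexdigest(),
    }

entry = audit_entry("q-001", "jane@example.com", "EMAIL", relevant=False)
print(json.dumps(entry, indent=2))
```

A compliance reviewer can then reconstruct which PII types were stripped from which queries, and why, without ever re-exposing the underlying values.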
Limitations & Future Work
- Synthetic data bias – although the generation pipeline is diverse, it may not capture rare real‑world PII patterns or adversarial phrasing; further fine‑tuning on curated real data would improve robustness.
- Relevance subjectivity – determining whether a PII piece is “needed” can depend on downstream task specifics; the current binary label may be too coarse for some applications.
- Model size trade‑off – while a 1 B‑parameter SLM works well, ultra‑lightweight models (<100 M) would enable broader edge use cases; exploring distillation or pruning is a next step.
- Cross‑language support – the study focuses on English queries; extending CAPID to multilingual settings will be essential for global products.
CAPID demonstrates that privacy‑first PII handling doesn’t have to come at the cost of answer quality, opening a practical path for developers to harness the power of LLMs responsibly.
Authors
- Mariia Ponomarenko
- Sepideh Abedini
- Masoumeh Shafieinejad
- D. B. Emerson
- Shubhankar Mohapatra
- Xi He
Paper Information
- arXiv ID: 2602.10074v1
- Categories: cs.CR, cs.CL
- Published: February 10, 2026