[Paper] CAPID: Context-Aware PII Detection for Question-Answering Systems

Published: February 10, 2026 at 01:41 PM EST
5 min read
Source: arXiv


Overview

The paper introduces CAPID, a privacy‑preserving pipeline that detects personally identifiable information (PII) in user queries and decides whether each piece of PII is actually needed to answer the question. By fine‑tuning a small, locally‑run language model (SLM) on a specially crafted synthetic dataset, CAPID can filter out irrelevant PII before the query reaches a large language model (LLM) for answer generation, keeping privacy high without sacrificing answer quality.

Key Contributions

  • Context‑aware PII detection – a model that not only spots PII spans and their types (e.g., email, SSN) but also predicts the relevance of each span to the user’s question.
  • Synthetic data generation pipeline – leverages existing LLMs to create a large, diverse, multi‑domain dataset that annotates PII with relevance labels, solving the lack of publicly available training data.
  • Fine‑tuned small language model (SLM) – demonstrates that a modest‑size model (≈ 1 B parameters) can achieve high accuracy on span, type, and relevance tasks, making on‑device or on‑premise deployment feasible.
  • Empirical validation – shows that relevance‑aware redaction preserves significantly more downstream QA utility compared with blanket redaction or prior baselines.

Methodology

  1. Synthetic Dataset Construction

    • Prompted a powerful LLM (e.g., GPT‑4) to generate realistic user queries across several domains (health, finance, travel, etc.).
    • Each query was automatically annotated with:
      • PII spans (name, email, phone, ID numbers, etc.)
      • The PII type label
      • A binary relevance flag indicating whether the PII is needed to answer the question.
    • The pipeline produced tens of thousands of examples, covering a wide variety of contexts and relevance patterns.
  2. Model Architecture & Fine‑tuning

    • Started from an open‑source small language model (e.g., a compact LLaMA‑family model or a distilled variant, consistent with the ≈1 B‑parameter scale reported).
    • Added a lightweight token‑level classifier head that jointly predicts:
      • Span (whether a token belongs to a PII entity)
      • Type (the PII category)
      • Relevance (needed vs. not needed).
    • Trained on the synthetic dataset using a multi‑task loss that balances span detection, type classification, and relevance estimation.
  3. Inference Pipeline (CAPID)

    • Incoming user query → CAPID SLM → outputs a redacted version that removes only irrelevant PII.
    • The sanitized query is then sent to a downstream LLM (e.g., ChatGPT) for answer generation.
    • Because the SLM runs locally, no raw PII ever leaves the trusted environment.
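The redaction step in the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the SLM's output has already been aggregated into spans with character offsets, a type label, and a relevance flag. The `PIISpan` structure and `redact_irrelevant` function are hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class PIISpan:
    start: int      # character offset where the span begins
    end: int        # character offset where the span ends (exclusive)
    pii_type: str   # predicted PII category, e.g. "EMAIL", "SSN"
    relevant: bool  # relevance flag: is this PII needed to answer?

def redact_irrelevant(query: str, spans: list[PIISpan]) -> str:
    """Replace only the irrelevant PII spans with a type placeholder."""
    # Process spans right-to-left so earlier offsets stay valid after edits.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if not span.relevant:
            query = query[:span.start] + f"[{span.pii_type}]" + query[span.end:]
    return query

query = "I'm John Doe, john@example.com. What is the SSN format 123-45-6789 used for?"
spans = [
    PIISpan(4, 12, "NAME", False),    # not needed to answer
    PIISpan(14, 30, "EMAIL", False),  # not needed to answer
    PIISpan(55, 66, "SSN", True),     # the question is about this value
]
print(redact_irrelevant(query, spans))
# → I'm [NAME], [EMAIL]. What is the SSN format 123-45-6789 used for?
```

Note that the relevant SSN is kept intact, which is exactly what lets the downstream LLM answer the question, while the name and email never leave the trusted environment.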

Results & Findings

| Metric | CAPID (SLM) | Prior Span‑Only Baselines | Full Redaction |
| --- | --- | --- | --- |
| Span F1 (PII detection) | 0.94 | 0.88 | n/a |
| Type Accuracy | 0.91 | 0.84 | n/a |
| Relevance Accuracy | 0.87 | 0.62 | n/a |
| Downstream QA Exact‑Match (after redaction) | 0.78 | 0.63 | 0.45 |
  • Higher relevance accuracy means CAPID correctly keeps the PII that the answer actually needs, leading to a 33‑point absolute gain in downstream QA exact‑match (0.78 vs. 0.45) compared with naïve full redaction.
  • The SLM runs at ~30 ms per query on a single GPU, well within real‑time constraints for most web services.
  • Privacy analysis shows that only ≈12 % of the original PII is retained, dramatically reducing exposure risk while preserving answer quality.

Practical Implications

  • Enterprise QA bots (customer support, internal knowledge bases) can now safely forward user questions to powerful cloud LLMs without leaking unnecessary personal data.
  • Regulatory compliance (GDPR, CCPA) becomes easier: CAPID provides an auditable log of which PII was stripped and why, supporting data‑minimization principles.
  • Edge deployment: Because the detection model is small and open‑source, developers can embed it on‑device (mobile, IoT) to guarantee that raw PII never leaves the user’s hardware.
  • Plug‑and‑play integration: CAPID’s API can sit in front of any existing LLM‑backed service, acting as a privacy filter without requiring changes to the downstream model.
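The plug‑and‑play idea can be sketched as a thin wrapper that composes a local redaction step with a remote LLM call. Both callables below are hypothetical stand‑ins introduced for illustration: `detect_and_redact` represents the locally‑run CAPID SLM and `call_llm` represents any downstream LLM‑backed service; neither name comes from the paper.

```python
from typing import Callable

def privacy_filter(
    detect_and_redact: Callable[[str], str],
    call_llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose a local redaction step with a downstream LLM call."""
    def answer(query: str) -> str:
        sanitized = detect_and_redact(query)  # runs locally; raw PII stays on-premise
        return call_llm(sanitized)            # only sanitized text leaves the boundary
    return answer

# Toy stand-ins for the CAPID SLM and the downstream LLM:
fake_redactor = lambda q: q.replace("alice@example.com", "[EMAIL]")
fake_llm = lambda q: f"Answer to: {q}"

qa = privacy_filter(fake_redactor, fake_llm)
print(qa("My email is alice@example.com. How do I reset my password?"))
# → Answer to: My email is [EMAIL]. How do I reset my password?
```

Because the filter only touches the query string, the downstream model needs no modification, which is what makes the integration drop‑in.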

Limitations & Future Work

  • Synthetic data bias – although the generation pipeline is diverse, it may not capture rare real‑world PII patterns or adversarial phrasing; further fine‑tuning on curated real data would improve robustness.
  • Relevance subjectivity – determining whether a PII piece is “needed” can depend on downstream task specifics; the current binary label may be too coarse for some applications.
  • Model size trade‑off – while a 1 B‑parameter SLM works well, ultra‑lightweight models (<100 M) would enable broader edge use cases; exploring distillation or pruning is a next step.
  • Cross‑language support – the study focuses on English queries; extending CAPID to multilingual settings will be essential for global products.

CAPID demonstrates that privacy‑first PII handling doesn’t have to come at the cost of answer quality, opening a practical path for developers to harness the power of LLMs responsibly.

Authors

  • Mariia Ponomarenko
  • Sepideh Abedini
  • Masoumeh Shafieinejad
  • D. B. Emerson
  • Shubhankar Mohapatra
  • Xi He

Paper Information

  • arXiv ID: 2602.10074v1
  • Categories: cs.CR, cs.CL
  • Published: February 10, 2026
  • PDF: Download PDF
