[Paper] Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
Source: arXiv - 2602.15028v1
Overview
Shangding Gu’s new paper uncovers a hidden weakness in today’s large language models (LLMs): when fed very long prompts (up to 256 K tokens), they become both less personalized and more prone to leaking private information. By introducing a massive benchmark called PAPerBench, the study quantifies how context length simultaneously degrades personalization quality and amplifies the risk of exposing private data, an insight that matters for any product that relies on long‑form interactions with LLMs.
Key Contributions
- PAPerBench benchmark: ~29 K test instances covering 1 K–256 K token contexts, totaling 377 K evaluation questions that jointly measure personalization performance and privacy leakage.
- Systematic empirical study: Evaluation of several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, LLaMA‑2) across the full context spectrum, revealing a consistent degradation trend.
- Theoretical analysis of attention dilution: Formal proof that soft‑attention in fixed‑capacity Transformers spreads its focus thin as context grows, explaining the observed “long‑context, less focus” phenomenon.
- Open‑source release: Full dataset, evaluation scripts, and analysis notebooks are publicly available to enable reproducibility and further research.
Methodology
- Benchmark construction – The authors curated real‑world personalization scenarios (e.g., user‑specific recommendations, code style adaptation) and privacy‑sensitive tasks (e.g., extracting personal identifiers). Each scenario is instantiated with varying prompt lengths, from a short 1 K token snippet up to a massive 256 K token context.
- Dual‑metric evaluation –
- Personalization: measured by task‑specific accuracy or relevance scores (e.g., BLEU for style transfer, hit‑rate for recommendation).
- Privacy: measured by the model’s ability to unintentionally reveal protected information, using metrics such as extraction recall and membership inference success rate.
- Model testing – The benchmark is run on multiple closed‑source and open‑source LLMs, all using their default inference settings (no fine‑tuning or retrieval augmentation).
- Theoretical work – The paper derives an “attention dilution factor” that grows with context size: the softmax attention distribution becomes increasingly uniform, which mathematically limits the model’s capacity to focus on the most relevant tokens.
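The dilution effect can be made concrete with a toy softmax calculation. This sketch is not from the paper; the single relevant token and its logit advantage of 5 are illustrative assumptions, chosen only to show how the relevant token’s attention share shrinks as distractor tokens accumulate:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# One "relevant" token holds a fixed logit advantage over n-1 distractors.
# Its attention share decays roughly like e^gap / n as the context grows,
# mirroring the dilution argument.
for n in (1_000, 16_000, 64_000, 256_000):
    logits = [5.0] + [0.0] * (n - 1)  # hypothetical logit gap of 5
    share = softmax(logits)[0]
    print(f"context={n:>7} tokens  relevant-token attention share={share:.5f}")
```

Even a sizable logit gap cannot prevent the share from collapsing toward zero once the distractor count dominates, which is the core of the “long‑context, less focus” claim.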
Results & Findings
| Context Length | Personalization Score (↓ with length) | Privacy Leakage (↑ with length) |
|---|---|---|
| 1 K tokens | Baseline (high) | Near‑zero leakage |
| 16 K tokens | ~10 % drop | 2–3× higher leakage |
| 64 K tokens | ~25 % drop | 5–7× higher leakage |
| 256 K tokens | >40 % drop | >10× higher leakage |
- Consistent trend across all tested LLMs: longer contexts lead to weaker personalization and stronger privacy risks.
- Attention dilution explains the trend: as the number of tokens grows, each token receives a smaller share of the attention budget. This makes it harder for the model to lock onto user‑specific cues, while increasing the chance that irrelevant (and potentially sensitive) tokens are attended to.
- No simple fix: naïvely increasing model size or context window does not eliminate the gap; the core limitation stems from the soft‑attention mechanism itself.
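The leakage figures above come from metrics such as extraction recall. As a rough illustration of what that style of metric measures, here is a hypothetical helper, not the paper’s implementation; a real evaluator would also normalize casing, formats, and paraphrased disclosures:

```python
def extraction_recall(model_outputs, protected_items):
    """Fraction of protected identifiers that appear verbatim
    in any of the model's outputs (exact substring match)."""
    if not protected_items:
        return 0.0
    leaked = sum(
        1 for item in protected_items
        if any(item in out for out in model_outputs)
    )
    return leaked / len(protected_items)

# Example: two of three protected identifiers leak into the outputs.
outputs = [
    "Sure! Your account email is alice@example.com.",
    "The SSN on file ends in 123-45-6789.",
]
secrets = ["alice@example.com", "123-45-6789", "4111 1111 1111 1111"]
print(extraction_recall(outputs, secrets))  # 2 of 3 secrets leaked
```

A rising extraction recall at longer context lengths is exactly the “higher leakage” pattern reported in the table.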
Practical Implications
- Product design – Developers building chatbots, code assistants, or recommendation engines should limit the effective context window used for personalization, perhaps by summarizing or chunking older conversation turns instead of feeding the raw transcript.
- Privacy engineering – Long‑form prompts should be scrubbed or redacted before being sent to LLM APIs, especially when the model will also be asked to generate personalized output.
- Retrieval‑augmented generation (RAG) – The findings motivate a shift toward retrieval‑first pipelines where only the most relevant snippets are retrieved and fed to the model, keeping the context size manageable while preserving personalization quality.
- Model selection – When privacy compliance (e.g., GDPR, HIPAA) is a hard requirement, choosing models that internally enforce context truncation or that support privacy‑preserving attention mechanisms becomes a competitive differentiator.
- Monitoring & testing – PAPerBench can be integrated into CI pipelines to continuously monitor how new model releases or prompt‑engineering changes affect both personalization and privacy leakage.
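One way to act on the “summarize or chunk older conversation turns” advice: keep the last few turns verbatim and collapse everything older into a caller‑supplied summary. The function names and defaults here are assumptions for illustration, not a recommendation from the paper:

```python
def compress_history(turns, keep_recent=4, summarize=None):
    """Return a prompt-ready history: older turns are collapsed into
    one summary entry, recent turns are kept verbatim."""
    recent = turns[-keep_recent:] if keep_recent > 0 else []
    older = turns[:len(turns) - len(recent)]
    parts = []
    if older and summarize is not None:
        parts.append("[summary] " + summarize(older))
    # if no summarizer is supplied, older turns are simply dropped
    parts.extend(recent)
    return parts

# Toy summarizer: just report how many turns were folded away.
toy_summarize = lambda ts: f"{len(ts)} earlier turns omitted"
history = [f"turn {i}" for i in range(10)]
print(compress_history(history, keep_recent=3, summarize=toy_summarize))
```

In production the summarizer would itself be an LLM call (with the privacy‑scrubbing step above applied first), but the shape of the pipeline is the same: bound the effective context so attention is spent on the turns that matter.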
Limitations & Future Work
- Benchmark scope – While PAPerBench covers a broad range of tasks, it still focuses on English‑centric scenarios; multilingual or multimodal contexts may exhibit different scaling behavior.
- Fixed inference settings – The study does not explore fine‑tuning, instruction‑tuning, or specialized attention variants (e.g., sparse or linear‑complexity attention) that could mitigate dilution.
- Theoretical model – The attention dilution analysis assumes a standard soft‑max attention; extending the theory to newer architectures (e.g., FlashAttention, Routing Transformers) remains open.
- User‑level privacy – The privacy metrics are based on synthetic or semi‑synthetic data; real‑world deployment studies would be needed to confirm the magnitude of leakage in production systems.
The authors invite the community to build on PAPerBench, experiment with attention‑efficient designs, and develop tooling that keeps LLMs both personal and private even as context windows keep growing.
Authors
- Shangding Gu
Paper Information
- arXiv ID: 2602.15028v1
- Categories: cs.LG, cs.AI
- Published: February 16, 2026