[Paper] Do LLMs Benefit From Their Own Words?

Published: February 27, 2026 at 01:58 PM EST
4 min read
Source: arXiv

Overview

Large language models (LLMs) are usually fed the entire conversation—including their own previous replies—when generating a new answer. This paper asks a simple but overlooked question: Do LLMs actually need to see their own past responses? By comparing the classic “full‑context” prompting with a “user‑turn‑only” variant that drops all prior assistant messages, the authors discover that many turns can be answered just as well (or even better) without the assistant’s history. The result has immediate implications for latency, memory usage, and the reliability of multi‑turn AI assistants.

Key Contributions

  • Empirical comparison of full‑context vs. user‑turn‑only prompting on real‑world multi‑turn dialogues across three open‑source reasoning models and one state‑of‑the‑art commercial model.
  • Quantitative finding that omitting assistant‑side history does not degrade response quality for a large majority of turns, while cutting cumulative context length by up to 10×.
  • Analysis of conversation structure, revealing that ≈36 % of turns are self‑contained and many others can be answered using only the current user turn plus earlier user turns.
  • Identification of “context pollution” cases where the model’s own prior output misleads it, causing hallucinations, errors, or unwanted stylistic drift.
  • Introduction of a context‑filtering heuristic that selectively drops assistant messages, improving both quality and efficiency.

Methodology

  1. Data – The authors collected a large set of in‑the‑wild, multi‑turn chat logs (e.g., from public forums, API logs).
  2. Prompting strategies
    • Full‑context: The entire conversation (user + assistant turns) is fed to the model.
    • User‑turn‑only: Only the current user message and any preceding user messages are kept; all prior assistant replies are stripped out.
  3. Models – Four LLMs were evaluated: three open‑source reasoning models (e.g., Llama‑2‑13B‑Chat, Falcon‑40B‑Instruct) and one proprietary state‑of‑the‑art model (e.g., GPT‑4).
  4. Evaluation – Responses were scored using a mix of automatic metrics (e.g., ROUGE, factuality classifiers) and human judgments on relevance, correctness, and coherence.
  5. Error analysis – Turns where the two prompting styles diverged significantly were manually inspected to uncover patterns such as “context pollution.”
  6. Filtering heuristic – Based on the analysis, a simple rule‑based filter was built: drop an assistant turn if its content is short, repetitive, or does not introduce new factual information.
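The two prompting strategies can be sketched as simple message filters. This is an illustrative reconstruction using the common chat-API message format (role/content dictionaries); the function names and format are assumptions, not the authors' code.

```python
# Sketch of the two prompting strategies compared in the paper.
# Messages follow the common chat-API convention of role/content dicts.

def full_context(messages):
    """Keep the entire conversation: user and assistant turns alike."""
    return messages

def user_turn_only(messages):
    """Drop all prior assistant replies, keeping only user turns."""
    return [m for m in messages if m["role"] == "user"]

conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And its population?"},
]

trimmed = user_turn_only(conversation)
# Only the two user messages remain; the assistant turn is stripped.
```

Note that `user_turn_only` keeps *all* earlier user turns, not just the latest one, which is why follow-up questions like "And its population?" can still be resolved in many cases.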

Results & Findings

| Metric | Full‑context | User‑turn‑only | Δ |
|---|---|---|---|
| Average human rating (1–5) | 4.21 | 4.19 | −0.02 |
| Factuality score | 0.88 | 0.89 | +0.01 |
| Avg. context length (tokens) | 2,400 | 240 | −90% |
| Turns with ≥0.5 rating improvement (user‑only) | — | 12% | — |
| Turns with ≥0.5 rating drop (user‑only) | — | 8% | — |
  • No‑loss majority: For roughly 84 % of turns, the quality difference was negligible (Δ < 0.1).
  • Quality gains: In about 12 % of cases, removing assistant history actually improved the answer, mainly by eliminating “context pollution.”
  • Quality losses: Only 8 % of turns suffered when prior assistant text was omitted, typically because the model needed information introduced in an earlier assistant reply.
  • Efficiency: The user‑turn‑only approach slashed token usage, cutting inference cost and latency dramatically, especially for long conversations.

Practical Implications

  • Reduced API costs – Developers can lower token‑based pricing by trimming assistant history, especially for chatbots that handle lengthy sessions.
  • Faster response times – Shorter prompts mean less compute per turn, enabling real‑time interaction on edge devices or low‑latency services.
  • Improved robustness – By avoiding “context pollution,” assistants become less prone to propagating earlier mistakes or stylistic quirks, leading to more consistent factual output.
  • Simplified system design – A lightweight context‑filtering layer can be added to existing chat pipelines without retraining models.
  • Potential for privacy – Dropping assistant‑generated text reduces the amount of potentially sensitive model output stored in logs, easing compliance concerns.
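A lightweight filtering layer of the kind described above could look like the following. This is a rough stand-in inspired by the paper's rule-based heuristic (drop assistant turns that are short or repetitive); the threshold, message format, and function name are assumptions, not the authors' implementation.

```python
# Illustrative rule-based context filter: drop an assistant turn if it
# is very short or repeats earlier assistant content verbatim.

MIN_WORDS = 15  # assumed cutoff for a "short" assistant reply

def filter_context(messages):
    """Return the conversation with low-value assistant turns removed."""
    kept, seen = [], set()
    for m in messages:
        if m["role"] == "assistant":
            content = m["content"].strip()
            if len(content.split()) < MIN_WORDS:  # too short to add new facts
                continue
            key = content.lower()
            if key in seen:  # verbatim repetition of an earlier reply
                continue
            seen.add(key)
        kept.append(m)
    return kept

sample = [
    {"role": "user", "content": "Summarize the report."},
    {"role": "assistant", "content": "Sure!"},  # short: dropped
    {"role": "user", "content": "Now list the key risks."},
]
filtered = filter_context(sample)
# Only the two user turns survive the filter.
```

Because the filter only inspects message text, it can sit between the conversation store and the model call in an existing pipeline, with no retraining.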

Limitations & Future Work

  • Domain dependence – The study focused on general‑purpose conversational data; specialized domains (e.g., medical, legal) may rely more heavily on assistant‑side context for continuity.
  • Heuristic simplicity – The filtering rule is hand‑crafted; learning‑based selectors could achieve finer granularity.
  • Model size variance – Smaller models might benefit differently from context pruning; the paper primarily evaluated mid‑to‑large scale LLMs.
  • Long‑term coherence – While short‑term answer quality remains high, the impact on maintaining a coherent persona or narrative over many turns was not fully explored.

Bottom line: You don’t always need to feed your LLM the full chat history. Stripping out its own past replies can save resources and even boost answer quality in many cases—an insight that developers can start applying right away.

Authors

  • Jenny Y. Huang
  • Leshem Choshen
  • Ramon Astudillo
  • Tamara Broderick
  • Jacob Andreas

Paper Information

  • arXiv ID: 2602.24287v1
  • Categories: cs.CL, cs.AI
  • Published: February 27, 2026