[Paper] The Limits of Long-Context Reasoning in Automated Bug Fixing

Published: February 17, 2026
4 min read
Source: arXiv (2602.16069v1)

Overview

The paper investigates a hot claim in the AI‑for‑coding community: that today’s large language models (LLMs) can “see” and reason over entire codebases thanks to their massive context windows. By rigorously testing both agentic workflows and single‑shot generation on artificially inflated contexts, the authors reveal a stark gap between nominal token limits (64 k–128 k) and the models’ actual ability to produce correct bug‑fix patches.

Key Contributions

  • Systematic benchmark using SWE‑bench Verified to compare agentic versus single‑shot long‑context debugging.
  • Empirical evidence that successful agentic runs stay under ~20 k tokens, despite models advertising much larger windows.
  • Controlled long‑context experiment that inflates input size while guaranteeing perfect file retrieval, isolating pure reasoning capacity.
  • Quantitative results showing a dramatic drop in resolve rates (e.g., GPT‑5‑nano 0 % at 64 k tokens, Qwen3‑Coder‑30B‑A3B 7 %).
  • Qualitative failure taxonomy (hallucinated diffs, wrong file targets, malformed patch headers) that clarifies why longer contexts break down.
  • Critical insight that current agentic coding benchmarks do not truly measure long‑context reasoning.

Methodology

  1. Agentic Harness (mini‑SWE‑agent) – The authors wrap state‑of‑the‑art LLMs (GPT‑5‑nano, Deepseek‑R1‑0528, etc.) in a loop that can retrieve files, run tests, and iteratively refine patches.
  2. Token‑level tracking – For each successful run, they record the cumulative token count to see how much context the model actually consumes.
  3. Long‑Context Pipeline – They construct a “stretched” version of each bug‑fix task by concatenating all repository files (up to 128 k tokens) while ensuring the relevant file is present, thus removing retrieval errors.
  4. Single‑Shot Generation – The model receives the massive context and is asked to output a patch in one go, without any iterative feedback.
  5. Evaluation – Resolve rate (percentage of tasks where the generated patch passes the test suite) is the primary metric; additional manual inspection surfaces systematic error patterns.
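The "stretched context" construction in step 3 can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the `# FILE:` delimiter, and the whitespace-based token count (a crude stand-in for a real tokenizer) are all assumptions.

```python
# Hypothetical sketch of the long-context pipeline: concatenate repository
# files up to a token budget while guaranteeing the file containing the bug
# is present, so retrieval errors are eliminated by construction.

def build_stretched_context(repo_files, bug_file, budget_tokens):
    """repo_files: dict {path: text}; bug_file: path that must be included."""
    def n_tokens(text):
        return len(text.split())  # crude whitespace proxy for a tokenizer

    # The relevant file goes in first (perfect retrieval by construction).
    parts = [f"# FILE: {bug_file}\n{repo_files[bug_file]}"]
    used = n_tokens(parts[0])

    # Pad with distractor files until the budget is exhausted.
    for path, text in sorted(repo_files.items()):
        if path == bug_file:
            continue
        chunk = f"# FILE: {path}\n{text}"
        cost = n_tokens(chunk)
        if used + cost > budget_tokens:
            break
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts), used
```

Because the buggy file is always present, any failure on the stretched input can be attributed to reasoning over long context rather than to retrieval.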

Results & Findings

| Model | Agentic Resolve Rate (≤20 k tokens) | Single‑Shot Resolve Rate (64 k) | Single‑Shot Resolve Rate (128 k) |
|---|---|---|---|
| GPT‑5‑nano | 31 % (best) | 0 % | 0 % |
| Deepseek‑R1‑0528 | ≈28 % | ~5 % | ~3 % |
| Qwen3‑Coder‑30B‑A3B | — | 7 % | <5 % |
  • Agentic success is tied to short‑context steps: even the best runs never exceed ~20 k tokens, suggesting the “long‑context” advantage is not being leveraged.
  • Performance collapses as context grows: once the input surpasses ~64 k tokens, the models’ ability to emit a correct diff plummets.
  • Failure modes: hallucinated or syntactically invalid diffs, patches applied to the wrong file, and missing/garbled patch headers become common as context length increases.
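The reported failure modes can be caught mechanically before a patch is ever applied. The sketch below is not the authors' checker; it simply classifies a candidate patch against the standard unified-diff header format, illustrating how "malformed header" and "wrong file" failures are detectable.

```python
import re

# Valid unified-diff hunk headers look like "@@ -1,2 +1,2 @@".
HUNK_RE = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@")

def classify_patch(patch_text, repo_paths):
    """Return 'ok', 'malformed-header', or 'wrong-file' for a candidate diff."""
    lines = patch_text.splitlines()
    old = [l for l in lines if l.startswith("--- ")]
    new = [l for l in lines if l.startswith("+++ ")]
    if not old or not new:
        return "malformed-header"        # missing --- / +++ file headers
    if not any(HUNK_RE.match(l) for l in lines):
        return "malformed-header"        # no valid @@ hunk header
    target = new[0][4:].split("\t")[0]
    target = target[2:] if target.startswith("b/") else target
    if target not in repo_paths:
        return "wrong-file"              # patch targets a nonexistent file
    return "ok"
```

A check like this explains why agentic loops degrade more gracefully: an invalid diff triggers another iteration, whereas in single-shot generation it simply counts as a failure.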

Practical Implications

  • Tooling designers should not assume “big context = better debugging.” Agentic pipelines need to keep each reasoning step concise and prioritize effective retrieval over dumping the whole repo.
  • LLM providers may need to rethink context window marketing. Real‑world coding assistants should expose a “usable context” metric rather than raw token limits.
  • Developers can still benefit from LLMs by structuring prompts to focus on the relevant files and using iterative test‑feedback loops, rather than attempting monolithic “all‑code‑in‑one‑prompt” approaches.
  • Open‑source model users can achieve competitive results (e.g., Deepseek‑R1) without waiting for proprietary giants, provided they adopt agentic decomposition strategies.
  • Benchmark designers should incorporate long‑context stress tests that isolate retrieval from reasoning, ensuring future evaluations truly measure the intended capability.
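The "effective retrieval over dumping the whole repo" advice can be made concrete with a minimal sketch. This is an assumption-laden toy, not a method from the paper: it ranks files by lexical overlap with the issue text (a real system would use embeddings or a code search index) and prompts with only the top-k.

```python
# Minimal retrieval-first file selection: score each repository file by
# word overlap with the bug report and keep only the k best matches,
# so each model call stays well under the ~20k-token regime that works.

def select_relevant_files(issue_text, repo_files, k=3):
    """repo_files: dict {path: text}. Returns the k most relevant paths."""
    issue_words = set(issue_text.lower().split())

    def score(text):
        return len(issue_words & set(text.lower().split()))

    ranked = sorted(repo_files.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [path for path, _ in ranked[:k]]
```

Even this naive lexical filter keeps prompts short and focused, which the paper's results suggest matters more than raw context capacity.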

Limitations & Future Work

  • The study uses a single benchmark (SWE‑bench Verified); results may differ on other domains (e.g., systems code, UI frameworks).
  • Artificially inflating context guarantees perfect file recall, but real‑world retrieval errors and noisy repositories are not captured.
  • Only a handful of LLMs were evaluated; newer models with more sophisticated memory mechanisms could behave differently.
  • Future research could explore hybrid approaches (e.g., external vector stores + LLMs) or fine‑tuning strategies aimed specifically at long‑context code reasoning.

Authors

  • Ravi Raju
  • Mengmeng Ji
  • Shubhangi Upasani
  • Bo Li
  • Urmish Thakker

Paper Information

  • arXiv ID: 2602.16069v1
  • Categories: cs.SE, cs.LG
  • Published: February 17, 2026