[Paper] The Limits of Long-Context Reasoning in Automated Bug Fixing

Published: February 17, 2026
4 min read
Source: arXiv (2602.16069v1)

Overview

The paper investigates a hot claim in the AI‑for‑coding community: that today’s large language models (LLMs) can “see” and reason over entire codebases thanks to their massive context windows. By rigorously testing both agentic workflows and single‑shot generation on artificially inflated contexts, the authors reveal a stark gap between nominal token limits (64 k–128 k) and the models’ actual ability to produce correct bug‑fix patches.

Key Contributions

  • Systematic benchmark using SWE‑bench Verified to compare agentic versus single‑shot long‑context debugging.
  • Empirical evidence that successful agentic runs stay under ~20 k tokens, despite models advertising much larger windows.
  • Controlled long‑context experiment that inflates input size while guaranteeing perfect file retrieval, isolating pure reasoning capacity.
  • Quantitative results showing a dramatic drop in resolve rates (e.g., GPT‑5‑nano 0 % at 64 k tokens, Qwen3‑Coder‑30B‑A3B 7 %).
  • Qualitative failure taxonomy (hallucinated diffs, wrong file targets, malformed patch headers) that clarifies why longer contexts break down.
  • Critical insight that current agentic coding benchmarks do not truly measure long‑context reasoning.

Methodology

  1. Agentic Harness (mini‑SWE‑agent) – The authors wrap state‑of‑the‑art LLMs (GPT‑5‑nano, Deepseek‑R1‑0528, etc.) in a loop that can retrieve files, run tests, and iteratively refine patches.
  2. Token‑level tracking – For each successful run, they record the cumulative token count to see how much context the model actually consumes.
  3. Long‑Context Pipeline – They construct a “stretched” version of each bug‑fix task by concatenating all repository files (up to 128 k tokens) while ensuring the relevant file is present, thus removing retrieval errors.
  4. Single‑Shot Generation – The model receives the massive context and is asked to output a patch in one go, without any iterative feedback.
  5. Evaluation – Resolve rate (percentage of tasks where the generated patch passes the test suite) is the primary metric; additional manual inspection surfaces systematic error patterns.
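The "stretched context" construction in step 3 can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the `# FILE:` delimiter, and the whitespace-based token count (a crude stand-in for a real tokenizer) are all assumptions.

```python
# Hypothetical sketch of the long-context pipeline: concatenate repository
# files up to a token budget while guaranteeing the file containing the bug
# is present, so retrieval errors are eliminated by construction.

def build_stretched_context(repo_files, bug_file, budget_tokens):
    """repo_files: dict {path: text}; bug_file: path that must be included."""
    def n_tokens(text):
        return len(text.split())  # crude whitespace proxy for a tokenizer

    # The relevant file goes in first (perfect retrieval by construction).
    parts = [f"# FILE: {bug_file}\n{repo_files[bug_file]}"]
    used = n_tokens(parts[0])

    # Pad with distractor files until the budget is exhausted.
    for path, text in sorted(repo_files.items()):
        if path == bug_file:
            continue
        chunk = f"# FILE: {path}\n{text}"
        cost = n_tokens(chunk)
        if used + cost > budget_tokens:
            break
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts), used
```

Because the buggy file is always present, any failure on the stretched input can be attributed to reasoning over long context rather than to retrieval.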

Results & Findings

| Model | Agentic Resolve Rate (≤20 k tokens) | Single‑Shot Resolve Rate (64 k) | Single‑Shot Resolve Rate (128 k) |
|---|---|---|---|
| GPT‑5‑nano | 31 % (best) | 0 % | 0 % |
| Deepseek‑R1‑0528 | ≈28 % | ~5 % | ~3 % |
| Qwen3‑Coder‑30B‑A3B | — | 7 % | <5 % |
  • Agentic success is tied to short‑context steps: even the best runs never exceed ~20 k tokens, suggesting the “long‑context” advantage is not being leveraged.
  • Performance collapses as context grows: once the input surpasses ~64 k tokens, the models’ ability to emit a correct diff plummets.
  • Failure modes: hallucinated or syntactically invalid diffs, patches applied to the wrong file, and missing/garbled patch headers become common as context length increases.
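The reported failure modes can be caught mechanically before a patch is ever applied. The sketch below is not the authors' checker; it simply classifies a candidate patch against the standard unified-diff header format, illustrating how "malformed header" and "wrong file" failures are detectable.

```python
import re

# Valid unified-diff hunk headers look like "@@ -1,2 +1,2 @@".
HUNK_RE = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@")

def classify_patch(patch_text, repo_paths):
    """Return 'ok', 'malformed-header', or 'wrong-file' for a candidate diff."""
    lines = patch_text.splitlines()
    old = [l for l in lines if l.startswith("--- ")]
    new = [l for l in lines if l.startswith("+++ ")]
    if not old or not new:
        return "malformed-header"        # missing --- / +++ file headers
    if not any(HUNK_RE.match(l) for l in lines):
        return "malformed-header"        # no valid @@ hunk header
    target = new[0][4:].split("\t")[0]
    target = target[2:] if target.startswith("b/") else target
    if target not in repo_paths:
        return "wrong-file"              # patch targets a nonexistent file
    return "ok"
```

A check like this explains why agentic loops degrade more gracefully: an invalid diff triggers another iteration, whereas in single-shot generation it simply counts as a failure.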

Practical Implications

  • Tooling designers should not assume “big context = better debugging.” Agentic pipelines need to keep each reasoning step concise and prioritize effective retrieval over dumping the whole repo.
  • LLM providers may need to rethink context window marketing. Real‑world coding assistants should expose a “usable context” metric rather than raw token limits.
  • Developers can still benefit from LLMs by structuring prompts to focus on the relevant files and using iterative test‑feedback loops, rather than attempting monolithic “all‑code‑in‑one‑prompt” approaches.
  • Open‑source model users can achieve competitive results (e.g., Deepseek‑R1) without waiting for proprietary giants, provided they adopt agentic decomposition strategies.
  • Benchmark designers should incorporate long‑context stress tests that isolate retrieval from reasoning, ensuring future evaluations truly measure the intended capability.
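The "effective retrieval over dumping the whole repo" advice can be made concrete with a minimal sketch. This is an assumption-laden toy, not a method from the paper: it ranks files by lexical overlap with the issue text (a real system would use embeddings or a code search index) and prompts with only the top-k.

```python
# Minimal retrieval-first file selection: score each repository file by
# word overlap with the bug report and keep only the k best matches,
# so each model call stays well under the ~20k-token regime that works.

def select_relevant_files(issue_text, repo_files, k=3):
    """repo_files: dict {path: text}. Returns the k most relevant paths."""
    issue_words = set(issue_text.lower().split())

    def score(text):
        return len(issue_words & set(text.lower().split()))

    ranked = sorted(repo_files.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [path for path, _ in ranked[:k]]
```

Even this naive lexical filter keeps prompts short and focused, which the paper's results suggest matters more than raw context capacity.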

Limitations & Future Work

  • The study uses a single benchmark (SWE‑bench Verified); results may differ on other domains (e.g., systems code, UI frameworks).
  • Artificially inflating context guarantees perfect file recall, but real‑world retrieval errors and noisy repositories are not captured.
  • Only a handful of LLMs were evaluated; newer models with more sophisticated memory mechanisms could behave differently.
  • Future research could explore hybrid approaches (e.g., external vector stores + LLMs) or fine‑tuning strategies aimed specifically at long‑context code reasoning.

Authors

  • Ravi Raju
  • Mengmeng Ji
  • Shubhangi Upasani
  • Bo Li
  • Urmish Thakker

Paper Information

  • arXiv ID: 2602.16069v1
  • Categories: cs.SE, cs.LG
  • Published: February 17, 2026