[Paper] Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Source: arXiv - 2602.17183v1
Overview
The paper investigates how well today’s large language models (LLMs) can reason about code when they have to process long contexts—think thousands of lines of source files, mixed‑language projects, or legacy codebases. By systematically tweaking the input (shuffling answer choices, removing the multiple‑choice format, and sprinkling irrelevant “distractor” code), the authors expose hidden brittleness that standard benchmarks often miss. Their extended benchmark, which now includes Python, Java, and even COBOL, offers a more realistic yardstick for developers who rely on LLMs for code review, debugging, or automated assistance.
Key Contributions
- Long‑context robustness study: First systematic evaluation of LLMs on code Q&A when the context stretches to several thousand tokens.
- Controlled ablations: Three stress‑test scenarios—shuffled multiple‑choice options, open‑ended answers, and “needle‑in‑a‑haystack” contexts with adversarial distractors.
- Dataset expansion: Added 1,200+ Java and 800+ COBOL question‑answer pairs to the existing LongCodeBench Python suite, covering both modern and legacy languages.
- Empirical findings: Demonstrated sizable performance drops (up to ~30 % absolute) across state‑of‑the‑art models (GPT‑4, Claude‑2, LLaMA‑2‑70B, CodeLlama) under the ablations.
- Benchmark release: Publicly released the extended dataset and evaluation scripts to encourage more rigorous long‑context testing in the community.
Methodology
Dataset preparation
- Started from the LongCodeBench Python benchmark (≈2 k Q&A pairs).
- Curated additional Java and COBOL tasks, each paired with a short natural‑language question and a set of four multiple‑choice answers.
- For each task, the context consists of the full source file (often >4 k tokens) plus any required imports or build scripts.
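A task in this setup can be sketched as a small record; the field names below are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CodeQATask:
    """One long-context code Q&A item (illustrative schema, not the paper's)."""
    language: str       # "python", "java", or "cobol"
    context: str        # full source file plus any imports/build scripts
    question: str       # short natural-language question
    choices: list       # four multiple-choice answer strings
    answer_index: int   # index of the correct choice

task = CodeQATask(
    language="java",
    context="public class Account { /* ... full source file ... */ }",
    question="What exception does withdraw() raise on insufficient funds?",
    choices=["IllegalStateException", "IOException",
             "InsufficientFundsException", "NullPointerException"],
    answer_index=2,
)
```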
Ablation designs
- Shuffled MC: Randomly reorder the four answer options to break any positional bias the model might have learned.
- Open‑ended: Remove the multiple‑choice list; the model must generate the answer string itself.
- Needle‑in‑a‑haystack: Insert unrelated code snippets (e.g., dead functions, unrelated language modules) that increase context length and act as distractors.
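The shuffled-MC and needle-in-a-haystack ablations are simple enough to sketch in a few lines; `shuffle_choices` and `add_distractors` are illustrative helpers, not the authors' code:

```python
import random

def shuffle_choices(choices, answer_index, rng):
    """Shuffled-MC ablation: reorder the options and track where the
    correct answer lands, so positional cues no longer help."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_index)

def add_distractors(context, distractor_snippets, rng):
    """Needle-in-a-haystack ablation: pad the context with unrelated
    code (dead functions, other-language modules) in random positions."""
    parts = [context] + list(distractor_snippets)
    rng.shuffle(parts)
    return "\n\n".join(parts)

rng = random.Random(0)
choices = ["IllegalStateException", "IOException",
           "InsufficientFundsException", "NullPointerException"]
# The correct answer travels with its text, whatever the new order.
shuffled, new_idx = shuffle_choices(choices, 2, rng)
```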
Model suite
- Evaluated GPT‑4 (Chat), Claude‑2, LLaMA‑2‑70B, and an open‑source CodeLlama model, all prompted with a consistent “answer the question given the code below” template.
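The paper reports a consistent "answer the question given the code below" template across models; the exact wording below is a hedged reconstruction, not the authors' prompt:

```python
PROMPT_TEMPLATE = """Answer the question given the code below.

### Code
{context}

### Question
{question}

### Options
{options}

Answer with the letter of the correct option."""

def build_prompt(context, question, choices):
    """Render one task into the shared prompt template (letters A-D)."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return PROMPT_TEMPLATE.format(context=context, question=question,
                                  options=options)
```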
Metrics
- Accuracy for multiple‑choice, exact match for open‑ended answers, and a robustness score that captures how much accuracy survives when distractors are present.
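A minimal sketch of the three metrics, assuming a ratio-style robustness score (the paper does not spell out its exact formula here):

```python
def mc_accuracy(preds, golds):
    """Multiple-choice accuracy: fraction of matching option indices."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def exact_match(pred, gold):
    """Open-ended exact match after light whitespace/case normalization."""
    return " ".join(pred.split()).lower() == " ".join(gold.split()).lower()

def robustness_score(clean_acc, distracted_acc):
    """Assumed definition, not the paper's exact formula: the fraction of
    clean-context accuracy retained once distractors are inserted."""
    return distracted_acc / clean_acc if clean_acc else 0.0
```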
Results & Findings
| Setting | GPT‑4 | Claude‑2 | LLaMA‑2‑70B | CodeLlama |
|---|---|---|---|---|
| Original MC (ordered) | 78 % | 73 % | 61 % | 55 % |
| Shuffled MC | 62 % | 58 % | 44 % | 38 % |
| Open‑ended | 55 % | 51 % | 38 % | 32 % |
| Needle‑in‑a‑haystack | 49 % | 45 % | 33 % | 27 % |
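The absolute drops discussed in the paper can be recomputed directly from the table:

```python
# Reported accuracies (%) from the results table.
results = {
    "GPT-4":       {"ordered": 78, "shuffled": 62, "open": 55, "needle": 49},
    "Claude-2":    {"ordered": 73, "shuffled": 58, "open": 51, "needle": 45},
    "LLaMA-2-70B": {"ordered": 61, "shuffled": 44, "open": 38, "needle": 33},
    "CodeLlama":   {"ordered": 55, "shuffled": 38, "open": 32, "needle": 27},
}

# Absolute drop from the original ordered-MC setting to each ablation.
drops = {
    model: {s: scores["ordered"] - scores[s]
            for s in ("shuffled", "open", "needle")}
    for model, scores in results.items()
}
# e.g. drops["GPT-4"] == {"shuffled": 16, "open": 23, "needle": 29}
```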
- Shuffling alone caused a 15–17 percentage‑point absolute drop, indicating that models rely heavily on answer‑position cues.
- Open‑ended generation further reduced performance, exposing difficulties in free‑form code reasoning.
- Distractor code led to the steepest decline; even the strongest model (GPT‑4) missed the correct answer in half the cases.
- The degradation was consistent across languages, showing that the problem isn’t limited to Python but also affects legacy (COBOL) and mainstream (Java) codebases.
Practical Implications
- Tooling reliability: IDE plugins or CI‑integrated LLM assistants that present multiple‑choice suggestions may appear more accurate than they truly are when the answer list is reordered or hidden.
- Security & auditing: In code‑review automation, irrelevant code snippets (e.g., generated files, third‑party libraries) can mislead the model, potentially causing missed bugs or false positives.
- Prompt engineering: Developers should avoid relying on positional hints; explicit labeling (e.g., “Option A: …”) and verification steps become essential.
- Legacy system support: The inclusion of COBOL demonstrates that LLMs are not yet ready for large‑scale modernization projects without additional fine‑tuning or retrieval‑augmented pipelines.
- Benchmarking standards: Companies evaluating LLMs for code assistance should adopt the extended LongCodeBench suite (or similar “stress‑test” setups) rather than only short‑context, clean‑code benchmarks.
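The explicit-labeling and verification advice above can be made concrete; the label scheme and parsing rules below are illustrative, not a prescription from the paper:

```python
def labeled_options(choices):
    """Attach explicit letter labels so the model cannot lean on position alone."""
    return "\n".join(f"Option {chr(65 + i)}: {c}"
                     for i, c in enumerate(choices))

def verify_answer(reply, choices):
    """Minimal verification step: accept only a label that maps to a real
    option; return its index, or None if the reply cannot be trusted."""
    reply = reply.strip().upper()
    label = reply.removeprefix("OPTION ").strip()[:1]
    idx = ord(label) - 65 if label else -1
    return idx if 0 <= idx < len(choices) else None
```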
Limitations & Future Work
- Scale of distractors: The study used a fixed number of irrelevant snippets; real‑world codebases may contain orders of magnitude more noise.
- Prompt uniformity: All models received the same prompt template; exploring model‑specific prompting could mitigate some brittleness.
- Fine‑tuning not explored: The authors evaluated off‑the‑shelf models; future work could assess whether domain‑specific fine‑tuning or retrieval‑augmented generation improves robustness.
- User‑study missing: The paper focuses on automated metrics; a follow‑up user study would clarify how these failures translate into developer productivity loss.
Authors
- Kishan Maharaj
- Nandakishore Menon
- Ashita Saxena
- Srikanth Tamilselvam
Paper Information
- arXiv ID: 2602.17183v1
- Categories: cs.SE, cs.AI
- Published: February 19, 2026