[Paper] Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Source: arXiv - 2602.17183v1
Overview
The paper investigates how well today’s large language models (LLMs) can reason about code when they have to process long contexts—think thousands of lines of source files, mixed‑language projects, or legacy codebases. By systematically tweaking the input (shuffling answer choices, removing the multiple‑choice format, and sprinkling irrelevant “distractor” code), the authors expose hidden brittleness that standard benchmarks often miss. Their extended benchmark, which now includes Python, Java, and even COBOL, offers a more realistic yardstick for developers who rely on LLMs for code review, debugging, or automated assistance.
Key Contributions
- Long‑context robustness study: First systematic evaluation of LLMs on code Q&A when the context stretches to several thousand tokens.
- Controlled ablations: Three stress‑test scenarios—shuffled multiple‑choice options, open‑ended answers, and “needle‑in‑a‑haystack” contexts with adversarial distractors.
- Dataset expansion: Added 1,200+ Java and 800+ COBOL question‑answer pairs to the existing LongCodeBench Python suite, covering both modern and legacy languages.
- Empirical findings: Demonstrated sizable performance drops (up to ~30 % absolute) across state‑of‑the‑art models (GPT‑4, Claude‑2, LLaMA‑2‑70B, CodeLlama) under the ablations.
- Benchmark release: Publicly released the extended dataset and evaluation scripts to encourage more rigorous long‑context testing in the community.
Methodology
Dataset preparation
- Started from the LongCodeBench Python benchmark (≈2 k Q&A pairs).
- Curated additional Java and COBOL tasks, each paired with a short natural‑language question and a set of four multiple‑choice answers.
- For each task, the context consists of the full source file (often >4 k tokens) plus any required imports or build scripts.
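A task in this setup can be sketched as a small record; the field names below are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CodeQATask:
    """One long-context code Q&A item (illustrative schema, not the paper's)."""
    language: str       # "python", "java", or "cobol"
    context: str        # full source file plus any imports/build scripts
    question: str       # short natural-language question
    choices: list       # four multiple-choice answer strings
    answer_index: int   # index of the correct choice

task = CodeQATask(
    language="java",
    context="public class Account { /* ... full source file ... */ }",
    question="What exception does withdraw() raise on insufficient funds?",
    choices=["IllegalStateException", "IOException",
             "InsufficientFundsException", "NullPointerException"],
    answer_index=2,
)
```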
Ablation designs
- Shuffled MC: Randomly reorder the four answer options to break any positional bias the model might have learned.
- Open‑ended: Remove the multiple‑choice list; the model must generate the answer string itself.
- Needle‑in‑a‑haystack: Insert unrelated code snippets (e.g., dead functions, unrelated language modules) that increase context length and act as distractors.
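The shuffled-MC and needle-in-a-haystack ablations are simple enough to sketch in a few lines; `shuffle_choices` and `add_distractors` are illustrative helpers, not the authors' code:

```python
import random

def shuffle_choices(choices, answer_index, rng):
    """Shuffled-MC ablation: reorder the options and track where the
    correct answer lands, so positional cues no longer help."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_index)

def add_distractors(context, distractor_snippets, rng):
    """Needle-in-a-haystack ablation: pad the context with unrelated
    code (dead functions, other-language modules) in random positions."""
    parts = [context] + list(distractor_snippets)
    rng.shuffle(parts)
    return "\n\n".join(parts)

rng = random.Random(0)
choices = ["IllegalStateException", "IOException",
           "InsufficientFundsException", "NullPointerException"]
# The correct answer travels with its text, whatever the new order.
shuffled, new_idx = shuffle_choices(choices, 2, rng)
```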
Model suite
- Evaluated GPT‑4 (Chat), Claude‑2, LLaMA‑2‑70B, and an open‑source CodeLlama model, all prompted with a consistent “answer the question given the code below” template.
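The paper reports a consistent "answer the question given the code below" template across models; the exact wording below is a hedged reconstruction, not the authors' prompt:

```python
PROMPT_TEMPLATE = """Answer the question given the code below.

### Code
{context}

### Question
{question}

### Options
{options}

Answer with the letter of the correct option."""

def build_prompt(context, question, choices):
    """Render one task into the shared prompt template (letters A-D)."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return PROMPT_TEMPLATE.format(context=context, question=question,
                                  options=options)
```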
Metrics
- Accuracy for multiple‑choice, exact match for open‑ended answers, and a robustness score that captures how much accuracy survives when distractors are present.
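A minimal sketch of the three metrics, assuming a ratio-style robustness score (the paper does not spell out its exact formula here):

```python
def mc_accuracy(preds, golds):
    """Multiple-choice accuracy: fraction of matching option indices."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def exact_match(pred, gold):
    """Open-ended exact match after light whitespace/case normalization."""
    return " ".join(pred.split()).lower() == " ".join(gold.split()).lower()

def robustness_score(clean_acc, distracted_acc):
    """Assumed definition, not the paper's exact formula: the fraction of
    clean-context accuracy retained once distractors are inserted."""
    return distracted_acc / clean_acc if clean_acc else 0.0
```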
Results & Findings
| Setting | GPT‑4 | Claude‑2 | LLaMA‑2‑70B | CodeLlama |
|---|---|---|---|---|
| Original MC (ordered) | 78 % | 73 % | 61 % | 55 % |
| Shuffled MC | 62 % | 58 % | 44 % | 38 % |
| Open‑ended | 55 % | 51 % | 38 % | 32 % |
| Needle‑in‑a‑haystack | 49 % | 45 % | 33 % | 27 % |
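The absolute drops discussed in the paper can be recomputed directly from the table:

```python
# Reported accuracies (%) from the results table.
results = {
    "GPT-4":       {"ordered": 78, "shuffled": 62, "open": 55, "needle": 49},
    "Claude-2":    {"ordered": 73, "shuffled": 58, "open": 51, "needle": 45},
    "LLaMA-2-70B": {"ordered": 61, "shuffled": 44, "open": 38, "needle": 33},
    "CodeLlama":   {"ordered": 55, "shuffled": 38, "open": 32, "needle": 27},
}

# Absolute drop from the original ordered-MC setting to each ablation.
drops = {
    model: {s: scores["ordered"] - scores[s]
            for s in ("shuffled", "open", "needle")}
    for model, scores in results.items()
}
# e.g. drops["GPT-4"] == {"shuffled": 16, "open": 23, "needle": 29}
```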
- Shuffling alone caused a 15–17 percentage‑point absolute drop, indicating that models rely heavily on answer‑position cues.
- Open‑ended generation further reduced performance, exposing difficulties in free‑form code reasoning.
- Distractor code led to the steepest decline; even the strongest model (GPT‑4) missed the correct answer in half the cases.
- The degradation was consistent across languages, showing that the problem isn’t limited to Python but also affects legacy (COBOL) and mainstream (Java) codebases.
Practical Implications
- Tooling reliability: IDE plugins or CI‑integrated LLM assistants that present multiple‑choice suggestions may appear more accurate than they truly are when the answer list is reordered or hidden.
- Security & auditing: In code‑review automation, irrelevant code snippets (e.g., generated files, third‑party libraries) can mislead the model, potentially causing missed bugs or false positives.
- Prompt engineering: Developers should avoid relying on positional hints; explicit labeling (e.g., “Option A: …”) and verification steps become essential.
- Legacy system support: The inclusion of COBOL demonstrates that LLMs are not yet ready for large‑scale modernization projects without additional fine‑tuning or retrieval‑augmented pipelines.
- Benchmarking standards: Companies evaluating LLMs for code assistance should adopt the extended LongCodeBench suite (or similar “stress‑test” setups) rather than only short‑context, clean‑code benchmarks.
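The explicit-labeling and verification advice above can be made concrete; the label scheme and parsing rules below are illustrative, not a prescription from the paper:

```python
def labeled_options(choices):
    """Attach explicit letter labels so the model cannot lean on position alone."""
    return "\n".join(f"Option {chr(65 + i)}: {c}"
                     for i, c in enumerate(choices))

def verify_answer(reply, choices):
    """Minimal verification step: accept only a label that maps to a real
    option; return its index, or None if the reply cannot be trusted."""
    reply = reply.strip().upper()
    label = reply.removeprefix("OPTION ").strip()[:1]
    idx = ord(label) - 65 if label else -1
    return idx if 0 <= idx < len(choices) else None
```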
Limitations & Future Work
- Scale of distractors: The study used a fixed number of irrelevant snippets; real‑world codebases may contain orders of magnitude more noise.
- Prompt uniformity: All models received the same prompt template; exploring model‑specific prompting could mitigate some brittleness.
- Fine‑tuning not explored: The authors evaluated off‑the‑shelf models; future work could assess whether domain‑specific fine‑tuning or retrieval‑augmented generation improves robustness.
- User‑study missing: The paper focuses on automated metrics; a follow‑up user study would clarify how these failures translate into developer productivity loss.
Authors
- Kishan Maharaj
- Nandakishore Menon
- Ashita Saxena
- Srikanth Tamilselvam
Paper Information
- arXiv ID: 2602.17183v1
- Categories: cs.SE, cs.AI
- Published: February 19, 2026