[Paper] Exploring the Potential and Limitations of Large Language Models for Novice Program Fault Localization

Published: December 2, 2025 at 10:55 PM EST
4 min read
Source: arXiv - 2512.03421v1

Overview

This paper investigates how well large language models (LLMs) can help novice programmers locate bugs in their code. By comparing a suite of commercial and open‑source LLMs against classic fault‑localization techniques, the authors show that modern LLMs can provide more context‑aware hints—especially when the models are equipped with reasoning abilities—while also exposing new challenges such as over‑explanation and high compute costs.
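
For readers less familiar with the classic baselines, spectrum‑based fault localization (SBFL) ranks program statements by how strongly their execution correlates with failing tests. The snippet below is a minimal, illustrative sketch of the widely used Ochiai score over a toy coverage matrix; it is not the paper's tooling, and the data structures are assumptions made for illustration.

```python
import math

def ochiai(failed_hits: int, passed_hits: int, total_failed: int) -> float:
    """Ochiai suspiciousness of one statement.

    failed_hits:  failing tests that execute the statement
    passed_hits:  passing tests that execute the statement
    total_failed: total number of failing tests
    """
    denom = math.sqrt(total_failed * (failed_hits + passed_hits))
    return failed_hits / denom if denom else 0.0

# Toy coverage matrix (assumed for illustration): statement -> (failed hits, passed hits)
coverage = {
    "line 12": (3, 1),
    "line 27": (1, 9),
    "line 40": (0, 5),
}
total_failed = 3

# Rank statements from most to least suspicious.
ranking = sorted(
    ((stmt, ochiai(f, p, total_failed)) for stmt, (f, p) in coverage.items()),
    key=lambda item: item[1],
    reverse=True,
)
for stmt, score in ranking:
    print(f"{stmt}: {score:.2f}")
```

An SBFL tool stops at a ranked list of lines like this, whereas the LLMs studied here are prompted with the code and failing behaviour and can additionally explain why a line is suspicious, which is the property the paper evaluates with novice programmers.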

Key Contributions

  • Comprehensive benchmark of 13 LLMs (6 closed‑source, 7 open‑source) on three fault‑localization datasets, including a newly curated “BugT” set that mitigates data‑leakage concerns.
  • Empirical evidence that reasoning‑enabled models (e.g., OpenAI o3, DeepSeek‑R1) outperform both traditional SBFL/MBFL tools and non‑reasoning LLMs (e.g., GPT‑4) with minimal prompt engineering.
  • Human‑centered evaluation showing that novice developers (≈1 year of experience) rate LLM‑generated explanations highly, confirming the educational value of the output.
  • Identification of failure modes such as “over‑reasoning” (excessive, noisy explanations) and performance degradation on harder bugs.
  • Cost analysis that quantifies the computational expense of running LLMs in a real‑time debugging workflow.

Methodology

  1. Datasets – The authors used three publicly available bug collections:
    • Codeflaws (C programs with synthetic faults)
    • Condefects (Java programs)
    • BugT – a new dataset built from real‑world student submissions, carefully filtered to avoid any overlap with LLM training data.
  2. LLM selection – Six proprietary models (e.g., OpenAI o3, GPT‑4) and seven open‑source alternatives (e.g., Llama‑2, DeepSeek‑R1) were evaluated.
  3. Prompt design – Two prompting strategies were compared: a baseline prompt (simple “find the bug”) and a reasoning prompt that asks the model to explain its thought process step‑by‑step (a sketch of both follows this list).
  4. Metrics – Fault‑localization accuracy (top‑1, top‑3, top‑5), explanation usefulness (via a Likert‑scale survey with novice programmers), and inference latency / cost.
  5. Statistical analysis – Paired t‑tests and effect‑size calculations were used to assess significance across models and datasets.
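
To make the prompting strategies (step 3) and the accuracy metric (step 4) concrete, here is a minimal sketch. The exact prompt wording and the example data are assumptions made for illustration, not the authors' templates or evaluation harness.

```python
# Two prompting strategies, roughly corresponding to the baseline vs. reasoning prompts.
BASELINE_PROMPT = """You are given a buggy program and its failing test output.
Name the faulty line(s).

Program:
{code}

Failing test output:
{test_output}
"""

REASONING_PROMPT = """You are given a buggy program and its failing test output.
Think step by step: trace the failing input through the program,
explain your reasoning, and only then name the faulty line(s).

Program:
{code}

Failing test output:
{test_output}
"""

def top_k_accuracy(predictions: list[list[int]], ground_truth: list[int], k: int) -> float:
    """Fraction of bugs whose true faulty line appears in the model's top-k suggestions."""
    hits = sum(1 for ranked, truth in zip(predictions, ground_truth) if truth in ranked[:k])
    return hits / len(ground_truth)

# Example: three buggy programs, each with a ranked list of suspected lines.
predictions = [[12, 7, 30], [4, 18, 2], [9, 21, 11]]
ground_truth = [12, 2, 25]
print(top_k_accuracy(predictions, ground_truth, 1))  # ≈0.33: only the first bug is found at top-1
print(top_k_accuracy(predictions, ground_truth, 3))  # ≈0.67: the second bug is recovered at top-3
```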

Results & Findings

| Model (Reasoning) | Top‑1 Accuracy (BugT) | Top‑3 Accuracy (BugT) | Avg. Latency (s) | Cost per 1,000 calls |
| --- | --- | --- | --- | --- |
| OpenAI o3 (R) | 78 % | 92 % | 1.8 | $0.45 |
| DeepSeek‑R1 (R) | 74 % | 89 % | 2.1 | $0.38 |
| GPT‑4 (NR) | 62 % | 81 % | 1.2 | $0.60 |
| Llama‑2‑13B (NR) | 55 % | 73 % | 3.4 | $0.12 |
  • Reasoning prompts dramatically boost performance for models that support chain‑of‑thought reasoning; non‑reasoning models need carefully crafted prompts to close the gap.
  • Accuracy drops as bug difficulty rises, but the best reasoning models retain >70 % top‑1 success even on the hardest BugT cases.
  • Over‑reasoning appears in ~15 % of outputs from GPT‑4, where explanations become verbose and obscure the actual fault line.
  • User study: 48 novice programmers rated LLM explanations an average of 4.3/5 for clarity and helpfulness, compared to 2.9/5 for SBFL tool output.
  • Compute cost: Deploying a high‑performing LLM in an interactive IDE adds ~2 seconds of latency per query and incurs non‑trivial cloud expenses, limiting feasibility for large teams.

Practical Implications

  • IDE plug‑ins: Embedding a reasoning‑enabled LLM (e.g., OpenAI o3) can give beginners instant, context‑rich hints, reducing the time spent on trial‑and‑error debugging.
  • Educational platforms: Automated tutoring systems can leverage the explanatory power of LLMs to teach debugging strategies, not just point out the faulty line.
  • Hybrid pipelines: Combining a cheap, fast spectrum‑based pass (SBFL) for coarse localization with an LLM for fine‑grained, natural‑language explanations can balance cost and accuracy (see the sketch after this list).
  • Team onboarding: New hires can use LLM‑driven assistants to accelerate familiarization with legacy codebases, especially when documentation is sparse.
  • Cost‑aware deployment: For real‑time use, caching frequent queries, batching requests, or running smaller open‑source models locally (with fine‑tuning) can mitigate latency and expense.
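
To sketch how the hybrid‑pipeline and cost‑aware ideas above could fit together, the toy code below shortlists suspicious lines with a cheap SBFL‑style pass and only then asks an LLM for an explanation, caching repeated queries to avoid paying latency and API cost twice. `sbfl_rank` and `ask_llm` are hypothetical placeholders, not an API described in the paper.

```python
from functools import lru_cache

def sbfl_rank(source: str) -> list[int]:
    """Hypothetical cheap SBFL-style pass returning line numbers ranked by suspiciousness."""
    return [12, 7, 30]  # stubbed ranking for the sketch

def ask_llm(prompt: str) -> str:
    """Hypothetical call to whichever LLM backend is deployed (cloud API or local model)."""
    return "Line 12 likely indexes past the end of the array."  # stubbed reply

@lru_cache(maxsize=1024)
def explain_bug(source: str) -> str:
    """Coarse-to-fine localization: cheap SBFL shortlist first, costly LLM explanation second."""
    suspects = sbfl_rank(source)[:3]  # fast, coarse localization
    prompt = (
        f"These lines are statistically suspicious: {suspects}.\n"
        "Explain which one is most likely faulty and why.\n\n"
        f"{source}"
    )
    return ask_llm(prompt)  # slow, costly, but explanatory step

# Repeated identical queries hit the cache instead of the LLM backend.
print(explain_bug("int a[3]; ... a[3] = 0;"))
```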

Limitations & Future Work

  • Dataset bias: Although BugT mitigates leakage, the benchmark still leans heavily toward academic exercises; performance on large, industrial codebases remains untested.
  • Model transparency: The “black‑box” nature of LLM reasoning makes it hard to verify whether the suggested fix is truly sound, raising trust concerns.
  • Scalability: Current inference costs limit large‑scale adoption; future work should explore model distillation, quantization, or on‑device inference.
  • Prompt robustness: The study shows sensitivity to prompt phrasing for non‑reasoning models; developing standardized prompting templates could improve consistency.
  • User interaction design: Further research is needed on how to present explanations (e.g., inline comments vs. separate panels) to maximize novice comprehension without overwhelming them.

Authors

  • Hexiang Xu
  • Hengyuan Liu
  • Yonghao Wu
  • Xiaolan Kang
  • Xiang Chen
  • Yong Liu

Paper Information

  • arXiv ID: 2512.03421v1
  • Categories: cs.SE
  • Published: December 3, 2025