[Paper] TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code
Source: arXiv - 2602.06875v1
Overview
Large Language Models (LLMs) can write code, but the generated snippets often contain hidden bugs that only surface at runtime. TraceCoder introduces a multi‑agent system that watches a program’s execution, pinpoints the true cause of a failure, and iteratively repairs the code—much like a human debugger would. The authors demonstrate that this trace‑driven approach can boost automated debugging success by more than 30 % compared with the strongest existing methods.
Key Contributions
- Trace‑driven instrumentation: Automatically injects lightweight probes into LLM‑generated code to collect fine‑grained runtime traces.
- Causal analysis engine: Processes the traces to locate the exact statement or data flow that triggered the failure.
- Historical Lesson Learning Mechanism (HLLM): Stores knowledge from previous repair attempts and reuses it to avoid repeating the same mistakes.
- Rollback safeguard: Guarantees that each repair iteration yields a program version no worse than its predecessor, preventing divergent repair loops.
- Empirical validation: Shows up to 34.43 % relative improvement in Pass@1 on several benchmark suites and a 65.61 % gain from the iterative repair loop alone.
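The trace-driven instrumentation idea can be sketched in plain Python with the standard `sys.settrace` hook, which records each executed line and a snapshot of local variables without altering program logic. This is a minimal illustration only; the paper's actual probes (variable watches, branch counters) are injected into the source itself and are richer than this hook-based stand-in.

```python
import sys

def collect_trace(func, *args):
    """Run `func` under a line-level tracer, recording (line_no, locals)
    snapshots. A minimal sketch of trace collection via sys.settrace; the
    paper's probes are not specified at this level of detail."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            # Snapshot local variables at each executed line of `func`.
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always remove the hook
    return result, trace

def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)

result, trace = collect_trace(mean, [1, 2, 3])
# `trace` now holds one (line, locals) entry per executed statement,
# e.g. the evolving value of `total` across loop iterations.
```

A downstream analysis agent can then scan such a trace for anomalies (unexpected `None` values, out-of-range indices) instead of reasoning from the test failure message alone.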
Methodology
- Instrumentation – When an LLM produces a candidate program, TraceCoder automatically inserts diagnostic probes (e.g., variable watches, branch counters) without altering the program’s logic.
- Execution & Trace Collection – The instrumented program runs against the test suite, producing a detailed execution trace (order of statements, variable values, exception info).
- Causal Analysis – A dedicated “analysis agent” examines the trace, correlating observed anomalies (e.g., unexpected None values, out‑of‑range indices) with the code locations that produced them.
- Repair Generation – A “repair agent” consults the LLM, prompting it with a concise, causally focused bug description instead of the original vague test failure.
- Historical Lesson Learning – Before issuing a new repair prompt, the system checks the HLLM database for similar past failures and re‑uses successful fix patterns, biasing the LLM toward proven solutions.
- Rollback & Iteration – After each repair, the updated program is re‑instrumented and re‑tested. If the new version does not improve the pass rate, the system rolls back to the previous stable version and tries a different repair strategy. This loop repeats until the test suite passes or a budget limit is reached.
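The rollback-and-iteration step can be sketched as a simple loop. `run_tests` (returning a pass rate in [0, 1]) and `propose_repair` (querying the LLM with the causal bug description) are hypothetical stand-ins for the paper's execution and repair agents:

```python
def repair_loop(program, run_tests, propose_repair, budget=5):
    """Iterative repair with a rollback safeguard: keep a candidate only
    if it does not lower the test pass rate; otherwise revert to the last
    stable version. `run_tests` and `propose_repair` are illustrative
    placeholders, not the paper's actual interfaces."""
    best_program = program
    best_rate = run_tests(program)
    for _ in range(budget):
        if best_rate == 1.0:  # all tests pass: done
            break
        candidate = propose_repair(best_program)
        rate = run_tests(candidate)
        if rate >= best_rate:  # never-worse guarantee
            best_program, best_rate = candidate, rate
        # else: roll back (keep best_program) and try another repair
    return best_program, best_rate
```

The budget limit bounds the number of LLM calls, matching the paper's observation that re-executing the program after each repair is cheaper than regenerating full solutions.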
The whole pipeline is orchestrated by a lightweight controller that treats each component as an independent “agent,” enabling parallelism and easy swapping of LLM back‑ends.
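The historical-lesson lookup can be sketched as a small store that maps past failure descriptions to successful fix patterns and retrieves the closest match before prompting the LLM. The failure-signature format and the string-similarity metric below are illustrative assumptions; the paper does not specify HLLM's indexing scheme:

```python
import difflib

class LessonStore:
    """Illustrative historical-lesson store. String similarity via
    difflib stands in for whatever matching the paper's HLLM uses."""

    def __init__(self):
        self.lessons = []  # list of (failure_description, fix_pattern)

    def add(self, failure, fix):
        """Record a failure description and the fix that resolved it."""
        self.lessons.append((failure, fix))

    def lookup(self, failure, threshold=0.6):
        """Return the fix for the most similar past failure, or None if
        no stored lesson exceeds the similarity threshold."""
        best, best_score = None, threshold
        for past_failure, fix in self.lessons:
            score = difflib.SequenceMatcher(None, failure, past_failure).ratio()
            if score > best_score:
                best, best_score = fix, score
        return best
```

A retrieved fix pattern would then be included in the repair prompt, biasing the LLM toward a strategy that has already worked on a similar failure.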
Results & Findings
| Benchmark | Baseline Pass@1 | TraceCoder Pass@1 | Relative Gain |
|---|---|---|---|
| HumanEval (LLM‑3B) | 22.1 % | 29.6 % | 34.43 % |
| MBPP (LLM‑6B) | 31.8 % | 42.5 % | 33.65 % |
| CodeContests (LLM‑13B) | 15.4 % | 20.7 % | 34.43 % |
- The iterative repair loop alone contributed a 65.61 % relative increase, confirming that multiple, informed attempts are far more effective than a single “fix‑once” approach.
- Ablation studies revealed that removing the trace collection step dropped accuracy by ~18 %, while disabling HLLM reduced gains by ~9 %, highlighting the complementary value of each component.
- Cost‑efficiency: Because TraceCoder only re‑executes the program after each repair (instead of re‑generating full solutions), the total number of LLM calls decreased by ~27 % compared with competing iterative methods.
Practical Implications
- Developer tooling: Integrated into IDE extensions, TraceCoder could automatically suggest precise patches for LLM‑generated snippets, reducing the “debug‑then‑copy‑paste” friction that many developers experience.
- CI/CD pipelines: Teams that rely on AI‑assisted code generation can embed TraceCoder as a gatekeeper, catching subtle runtime bugs before they reach production.
- Education platforms: Automated tutoring systems can use the trace‑driven feedback to teach students not just that a program fails, but why—mirroring human debugging pedagogy.
- LLM fine‑tuning: The Historical Lesson Learning Mechanism creates a reusable repository of failure–repair pairs, which can be leveraged to fine‑tune future LLMs on real‑world debugging patterns.
Limitations & Future Work
- Trace overhead: Instrumentation adds runtime cost, which may be prohibitive for extremely large or performance‑critical applications.
- Test‑suite dependence: The system’s success hinges on the quality and coverage of the supplied tests; poorly written tests can mislead the causal analysis.
- Scalability of HLLM: As the historical database grows, lookup latency could become a bottleneck; the authors suggest indexing strategies but have not fully explored them.
- Generalization to non‑Python languages: Experiments were confined to Python; extending the approach to statically typed languages (e.g., Java, C++) will require language‑specific instrumentation and analysis tools.
TraceCoder showcases how marrying classic debugging concepts with modern LLM capabilities can dramatically improve automated code repair. As AI‑generated code becomes more prevalent, trace‑driven, multi‑agent frameworks like this are poised to become essential components of the developer’s toolkit.
Authors
- Jiangping Huang
- Wenguang Ye
- Weisong Sun
- Jian Zhang
- Mingyue Zhang
- Yang Liu
Paper Information
- arXiv ID: 2602.06875v1
- Categories: cs.SE, cs.AI
- Published: February 6, 2026