[Paper] Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect
Source: arXiv - 2511.18854v1
Overview
The paper introduces “Time Travel,” a framework that plugs Large Language Models (LLMs) into the classic git bisect workflow to make fault localization more robust when dealing with flaky tests, non‑monotonic regressions, and upstream code changes. By giving the bisect process a “chain‑of‑thought” reasoning layer, the authors show that developers can pinpoint the offending commit faster and with higher success rates—even in noisy, real‑world repositories.
Key Contributions
- LLM‑augmented bisect: Extends the deterministic `git bisect` algorithm with semantic reasoning from LLMs, allowing it to handle ambiguous or flaky test outcomes.
- Commit‑level chain‑of‑thought prompting: Designs prompts that let the model explain why a particular commit might cause a failure, improving interpretability.
- Weak‑supervision pipeline: Uses a mix of automatically generated labels, human‑in‑the‑loop corrections, and self‑consistency filtering to create a training set of semantically labeled diffs with minimal manual effort.
- Fine‑tuning recipe: Demonstrates effective fine‑tuning of DeepSeekCoderV2 (via QLoRA) on the curated diff dataset, outperforming off‑the‑shelf LLMs.
- Empirical gains: Achieves a 6.4‑percentage‑point absolute improvement in bisect success rate (74.2 % → 80.6 %) and up to a 2× reduction in average bisect time across several open‑source projects.
- Practical guidelines: Provides insights on prompt engineering, temporal reasoning, and model selection for commit‑level behavior analysis.
Methodology
- Problem framing: Traditional `git bisect` assumes a binary, deterministic predicate (pass/fail). The authors model the predicate as a noisy signal and let an LLM interpret the semantic impact of each commit.
- Data collection:
- Extracted diffs from multiple repositories along with the test outcomes of each commit.
- Applied weak supervision: heuristic rules label obvious “bug‑introducing” commits; ambiguous cases are sent to developers for quick validation (a labeling sketch follows this list).
- Human reviewers correct mislabeled examples; the corrected set is fed back into the pipeline.
- Model preparation:
- Started with DeepSeekCoderV2, a code‑oriented LLM.
- Fine‑tuned using QLoRA (quantized low‑rank adaptation) on the curated diff‑label pairs, keeping GPU requirements modest (a configuration sketch follows this list).
- Prompt design: Each bisect step sends the LLM a prompt containing:
- The current commit’s diff.
- The test command and its recent outcome (pass/fail/flaky).
- A request for a chain‑of‑thought explanation (e.g., “Explain why this change could cause the failure”).
- Bisect loop integration:
- The LLM’s confidence score (derived from its explanation) is combined with the raw test result to decide the next commit to test.
- If the LLM is uncertain, the algorithm falls back to the classic binary bisect decision, ensuring safety (a decision‑function sketch follows this list).
- Evaluation: Ran the augmented bisect on 8 open‑source projects (both Java and Python) and on a proprietary internal codebase, comparing success rate, number of traversed commits, and total wall‑clock time against vanilla `git bisect`.
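The pipeline above is described at the level of design decisions rather than code; the sketches below make three of the steps concrete. First, the weak‑supervision labeling step. The rule set, the `Commit` record, and the `touches_only_docs` helper are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    message: str
    diff: str
    tests_passed: bool          # suite outcome at this commit
    parent_tests_passed: bool   # suite outcome at the parent commit

def touches_only_docs(diff: str) -> bool:
    """Crude check: every file changed by the diff is documentation."""
    paths = [line.split()[-1] for line in diff.splitlines()
             if line.startswith("+++ ")]
    return bool(paths) and all(p.endswith((".md", ".rst", ".txt")) for p in paths)

def heuristic_label(commit: Commit) -> str | None:
    """Weakly label a commit; None means 'ambiguous, send to a human'."""
    # Obvious regression: the suite flipped from pass to fail at this commit.
    if commit.parent_tests_passed and not commit.tests_passed:
        return "bug_introducing"
    # Doc-only changes are assumed benign (illustrative rule).
    if touches_only_docs(commit.diff):
        return "benign"
    # Everything else goes to human-in-the-loop review; corrected labels
    # are fed back into the training set.
    return None
```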
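Second, the fine‑tuning step. Using the common Hugging Face QLoRA stack (transformers, peft, bitsandbytes), a configuration in the spirit of the paper's recipe could look as follows; the checkpoint name, LoRA rank, and target modules are assumptions standing in for whatever the authors actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # assumed checkpoint

# 4-bit NF4 quantization keeps GPU memory modest (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable weights (the "LoRA" part).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training on the curated (diff, label/explanation) pairs then proceeds
# with a standard transformers Trainer or trl SFTTrainer loop.
```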
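Third, the bisect‑loop decision. The paper describes combining the LLM's confidence with the raw test result only qualitatively; one plausible reading is a thresholded override with a binary fallback. The prompt wording, the `query_llm` helper, and the 0.7 threshold are hypothetical:

```python
PROMPT_TEMPLATE = """You are helping git bisect locate a regression.

Commit diff:
{diff}

Test command: {test_cmd}
Recent outcome: {outcome} (pass / fail / flaky)

Explain step by step why this change could (or could not) cause the failure,
then finish with "VERDICT: good|bad" and "CONFIDENCE: <0-1>".
"""

def classify_commit(diff: str, test_cmd: str, outcome: str,
                    query_llm, threshold: float = 0.7) -> str:
    """Blend the LLM's semantic verdict with the raw test signal.

    `query_llm` is a hypothetical callable returning (verdict, confidence)
    parsed from the model's chain-of-thought answer.
    """
    verdict, confidence = query_llm(PROMPT_TEMPLATE.format(
        diff=diff, test_cmd=test_cmd, outcome=outcome))
    # A confident semantic verdict may override a flaky raw signal ...
    if confidence >= threshold:
        return verdict
    # ... otherwise fall back to the classic binary bisect decision.
    return "bad" if outcome == "fail" else "good"
```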
Results & Findings
| Metric | Vanilla `git bisect` | Time‑Travel (LLM‑augmented) |
|---|---|---|
| Success rate (finding the buggy commit) | 74.2 % | 80.6 % |
| Avg. commits examined per run | 12.4 | 7.1 |
| Avg. bisect wall‑time | 4.8 min | 2.3 min |
| Max speed‑up observed | – | 2× |
- Noise tolerance: The LLM correctly identified flaky test patterns and avoided dead‑ends that would stall a pure binary bisect.
- Semantic insight: When the failing test was unrelated to the changed lines (e.g., indirect API usage), the LLM’s explanation helped skip irrelevant commits.
- Model comparison: Fine‑tuned DeepSeekCoderV2 outperformed GPT‑4‑Turbo and Claude‑2 on this task, confirming the value of domain‑specific fine‑tuning.
Practical Implications
- Faster debugging cycles: Teams can shave minutes—or even hours—from regression investigations, especially in large monorepos where bisect traversals can be costly.
- Reduced reliance on flaky‑test mitigation: By reasoning about test noise, developers spend less time stabilizing flaky suites before bisecting.
- Better CI integration: The framework can be wrapped as a CI‑friendly tool (`git bisect-llm`) that automatically runs when a regression is detected, returning a ranked list of suspect commits with explanations (see the wrapper sketch after this list).
- Explainability for code reviews: The chain‑of‑thought output doubles as a lightweight code‑review comment, helping reviewers understand why a change might be risky.
- Low‑resource fine‑tuning: Using QLoRA means teams can adapt the model to their own codebase without massive GPU clusters, making the approach accessible to midsize companies.
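On the CI‑integration point above: stock git already supports scripted predicates through `git bisect run`, whose exit‑code contract is 0 for good, 125 for skip, and other codes from 1 to 127 for bad. A hypothetical driver in that mold might look like the sketch below, where `llm_verdict` is a stub standing in for a call to the fine‑tuned model; this is not the authors' released tool:

```python
#!/usr/bin/env python3
"""Sketch of a `git bisect run` predicate that consults an LLM on flaky tests.

Hypothetical usage:
    git bisect start <bad-sha> <good-sha>
    git bisect run ./bisect-llm.py
"""
import subprocess
import sys

TEST_CMD = ["pytest", "-x", "tests/"]  # assumed test command

def run_tests() -> bool:
    """Run the suite once; True means it passed."""
    return subprocess.run(TEST_CMD).returncode == 0

def llm_verdict(diff: str) -> str:
    """Hypothetical hook: ask the model for 'good', 'bad', or 'unsure'.

    A real deployment would call the fine-tuned classifier here; the stub
    keeps this sketch self-contained.
    """
    return "unsure"

def main() -> int:
    # Two runs expose flakiness cheaply: agreement means a stable signal.
    first, second = run_tests(), run_tests()
    if first == second:
        return 0 if first else 1
    # Disagreement: let the LLM weigh in on the current commit's diff.
    diff = subprocess.run(["git", "show", "HEAD"],
                          capture_output=True, text=True).stdout
    verdict = llm_verdict(diff)
    if verdict == "good":
        return 0
    if verdict == "bad":
        return 1
    return 125  # still unsure: tell git bisect to skip this commit

if __name__ == "__main__":
    sys.exit(main())
```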
Limitations & Future Work
- Model hallucination risk: Occasionally the LLM produced plausible‑sounding but incorrect rationales, which could misguide the bisect direction; the fallback to binary decisions mitigates but does not eliminate this risk.
- Language coverage: Experiments focused on Java and Python; extending to compiled languages (C/C++) may require additional diff preprocessing.
- Scalability of annotation pipeline: While weak supervision reduces manual effort, building a high‑quality labeled diff set still demands some developer time, especially for niche domains.
- Temporal reasoning depth: Current prompts handle immediate commit effects; deeper historical context (e.g., multi‑commit feature interactions) remains an open challenge.
Future research directions include integrating static‑analysis signals to further ground LLM reasoning, exploring multi‑modal models that ingest test logs and stack traces, and automating prompt optimization via reinforcement learning.
Authors
- Yujing Wang
- Weize Hong
Paper Information
- arXiv ID: 2511.18854v1
- Categories: cs.SE, cs.AI
- Published: November 24, 2025