[Paper] Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect
Source: arXiv - 2511.18854v1
Overview
The paper introduces “Time Travel,” a framework that plugs Large Language Models (LLMs) into the classic git bisect workflow to make fault localization more robust when dealing with flaky tests, non‑monotonic regressions, and upstream code changes. By giving the bisect process a “chain‑of‑thought” reasoning layer, the authors show that developers can pinpoint the offending commit faster and with higher success rates—even in noisy, real‑world repositories.
Key Contributions
- LLM‑augmented bisect: Extends the deterministic `git bisect` algorithm with semantic reasoning from LLMs, allowing it to handle ambiguous or flaky test outcomes.
- Commit‑level chain‑of‑thought prompting: Designs prompts that let the model explain why a particular commit might cause a failure, improving interpretability.
- Weak‑supervision pipeline: Uses a mix of automatically generated labels, human‑in‑the‑loop corrections, and self‑consistency filtering to create a training set of semantically labeled diffs with minimal manual effort.
- Fine‑tuning recipe: Demonstrates effective fine‑tuning of DeepSeekCoderV2 (via QLoRA) on the curated diff dataset, outperforming off‑the‑shelf LLMs.
- Empirical gains: Achieves a 6.4‑percentage‑point absolute improvement in bisect success rate (74.2 % → 80.6 %) and up to a 2× reduction in average bisect time across several open‑source projects.
- Practical guidelines: Provides insights on prompt engineering, temporal reasoning, and model selection for commit‑level behavior analysis.
Methodology
- Problem framing: Traditional `git bisect` assumes a binary, deterministic predicate (pass/fail). The authors model the predicate as a noisy signal and let an LLM interpret the semantic impact of each commit.
- Data collection:
- Extracted diffs from multiple repositories along with the test outcomes of each commit.
- Applied weak supervision: heuristic rules label obvious “bug‑introducing” commits; ambiguous cases are sent to developers for quick validation (a labeling sketch follows this list).
- Human reviewers correct mislabeled examples; the corrected set is fed back into the pipeline.
- Model preparation:
- Started with DeepSeekCoderV2, a code‑oriented LLM.
- Fine‑tuned using QLoRA (quantized low‑rank adaptation) on the curated diff‑label pairs, keeping GPU requirements modest (a configuration sketch follows this list).
- Prompt design: Each bisect step sends the LLM a prompt containing:
- The current commit’s diff.
- The test command and its recent outcome (pass/fail/flaky).
- A request for a chain‑of‑thought explanation (e.g., “Explain why this change could cause the failure”).
- Bisect loop integration:
- The LLM’s confidence score (derived from its explanation) is combined with the raw test result to decide the next commit to test.
- If the LLM is uncertain, the algorithm falls back to the classic binary bisect decision, ensuring safety (a decision‑function sketch follows this list).
- Evaluation: Ran the augmented bisect on 8 open‑source projects (both Java and Python) and on a proprietary internal codebase, comparing success rate, number of traversed commits, and total wall‑clock time against vanilla `git bisect`.
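The pipeline above is described at the level of design decisions rather than code; the sketches below make three of the steps concrete. First, the weak‑supervision labeling step. The rule set, the `Commit` record, and the `touches_only_docs` helper are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    message: str
    diff: str
    tests_passed: bool          # suite outcome at this commit
    parent_tests_passed: bool   # suite outcome at the parent commit

def touches_only_docs(diff: str) -> bool:
    """Crude check: every file changed by the diff is documentation."""
    paths = [line.split()[-1] for line in diff.splitlines()
             if line.startswith("+++ ")]
    return bool(paths) and all(p.endswith((".md", ".rst", ".txt")) for p in paths)

def heuristic_label(commit: Commit) -> str | None:
    """Weakly label a commit; None means 'ambiguous, send to a human'."""
    # Obvious regression: the suite flipped from pass to fail at this commit.
    if commit.parent_tests_passed and not commit.tests_passed:
        return "bug_introducing"
    # Doc-only changes are assumed benign (illustrative rule).
    if touches_only_docs(commit.diff):
        return "benign"
    # Everything else goes to human-in-the-loop review; corrected labels
    # are fed back into the training set.
    return None
```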
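Second, the fine‑tuning step. Using the common Hugging Face QLoRA stack (transformers, peft, bitsandbytes), a configuration in the spirit of the paper's recipe could look as follows; the checkpoint name, LoRA rank, and target modules are assumptions standing in for whatever the authors actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # assumed checkpoint

# 4-bit NF4 quantization keeps GPU memory modest (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable weights (the "LoRA" part).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training on the curated (diff, label/explanation) pairs then proceeds
# with a standard transformers Trainer or trl SFTTrainer loop.
```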
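Third, the bisect‑loop decision. The paper describes combining the LLM's confidence with the raw test result only qualitatively; one plausible reading is a thresholded override with a binary fallback. The prompt wording, the `query_llm` helper, and the 0.7 threshold are hypothetical:

```python
PROMPT_TEMPLATE = """You are helping git bisect locate a regression.

Commit diff:
{diff}

Test command: {test_cmd}
Recent outcome: {outcome} (pass / fail / flaky)

Explain step by step why this change could (or could not) cause the failure,
then finish with "VERDICT: good|bad" and "CONFIDENCE: <0-1>".
"""

def classify_commit(diff: str, test_cmd: str, outcome: str,
                    query_llm, threshold: float = 0.7) -> str:
    """Blend the LLM's semantic verdict with the raw test signal.

    `query_llm` is a hypothetical callable returning (verdict, confidence)
    parsed from the model's chain-of-thought answer.
    """
    verdict, confidence = query_llm(PROMPT_TEMPLATE.format(
        diff=diff, test_cmd=test_cmd, outcome=outcome))
    # A confident semantic verdict may override a flaky raw signal ...
    if confidence >= threshold:
        return verdict
    # ... otherwise fall back to the classic binary bisect decision.
    return "bad" if outcome == "fail" else "good"
```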
Results & Findings
| Metric | Vanilla `git bisect` | Time‑Travel (LLM‑augmented) |
|---|---|---|
| Success rate (finding the buggy commit) | 74.2 % | 80.6 % |
| Avg. commits examined per run | 12.4 | 7.1 |
| Avg. bisect wall‑time | 4.8 min | 2.3 min |
| Max speed‑up observed | – | 2× |
- Noise tolerance: The LLM correctly identified flaky test patterns and avoided dead‑ends that would stall a pure binary bisect.
- Semantic insight: When the failing test was unrelated to the changed lines (e.g., indirect API usage), the LLM’s explanation helped skip irrelevant commits.
- Model comparison: Fine‑tuned DeepSeekCoderV2 outperformed GPT‑4‑Turbo and Claude‑2 on this task, confirming the value of domain‑specific fine‑tuning.
Practical Implications
- Faster debugging cycles: Teams can shave minutes—or even hours—from regression investigations, especially in large monorepos where bisect traversals can be costly.
- Reduced reliance on flaky‑test mitigation: By reasoning about test noise, developers spend less time stabilizing flaky suites before bisecting.
- Better CI integration: The framework can be wrapped as a CI‑friendly tool (`git bisect-llm`) that automatically runs when a regression is detected, returning a ranked list of suspect commits with explanations (see the wrapper sketch after this list).
- Explainability for code reviews: The chain‑of‑thought output doubles as a lightweight code‑review comment, helping reviewers understand why a change might be risky.
- Low‑resource fine‑tuning: Using QLoRA means teams can adapt the model to their own codebase without massive GPU clusters, making the approach accessible to midsize companies.
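On the CI‑integration point above: stock git already supports scripted predicates through `git bisect run`, whose exit‑code contract is 0 for good, 125 for skip, and other codes from 1 to 127 for bad. A hypothetical driver in that mold might look like the sketch below, where `llm_verdict` is a stub standing in for a call to the fine‑tuned model; this is not the authors' released tool:

```python
#!/usr/bin/env python3
"""Sketch of a `git bisect run` predicate that consults an LLM on flaky tests.

Hypothetical usage:
    git bisect start <bad-sha> <good-sha>
    git bisect run ./bisect-llm.py
"""
import subprocess
import sys

TEST_CMD = ["pytest", "-x", "tests/"]  # assumed test command

def run_tests() -> bool:
    """Run the suite once; True means it passed."""
    return subprocess.run(TEST_CMD).returncode == 0

def llm_verdict(diff: str) -> str:
    """Hypothetical hook: ask the model for 'good', 'bad', or 'unsure'.

    A real deployment would call the fine-tuned classifier here; the stub
    keeps this sketch self-contained.
    """
    return "unsure"

def main() -> int:
    # Two runs expose flakiness cheaply: agreement means a stable signal.
    first, second = run_tests(), run_tests()
    if first == second:
        return 0 if first else 1
    # Disagreement: let the LLM weigh in on the current commit's diff.
    diff = subprocess.run(["git", "show", "HEAD"],
                          capture_output=True, text=True).stdout
    verdict = llm_verdict(diff)
    if verdict == "good":
        return 0
    if verdict == "bad":
        return 1
    return 125  # still unsure: tell git bisect to skip this commit

if __name__ == "__main__":
    sys.exit(main())
```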
Limitations & Future Work
- Model hallucination risk: Occasionally the LLM produced plausible‑sounding but incorrect rationales, which could misguide the bisect direction; the fallback to binary decisions mitigates but does not eliminate this risk.
- Language coverage: Experiments focused on Java and Python; extending to compiled languages (C/C++) may require additional diff preprocessing.
- Scalability of annotation pipeline: While weak supervision reduces manual effort, building a high‑quality labeled diff set still demands some developer time, especially for niche domains.
- Temporal reasoning depth: Current prompts handle immediate commit effects; deeper historical context (e.g., multi‑commit feature interactions) remains an open challenge.
Future research directions include integrating static‑analysis signals to further ground LLM reasoning, exploring multi‑modal models that ingest test logs and stack traces, and automating prompt optimization via reinforcement learning.
Authors
- Yujing Wang
- Weize Hong
Paper Information
- arXiv ID: 2511.18854v1
- Categories: cs.SE, cs.AI
- Published: November 24, 2025