[Paper] Neurosymbolic Repo-level Code Localization

Published: April 17, 2026 at 08:49 AM EDT
5 min read
Source: arXiv - 2604.16021v1

Overview

The paper “Neurosymbolic Repo-level Code Localization” shines a light on a hidden flaw in today’s code‑localization benchmarks: they are riddled with obvious keyword cues (file names, function identifiers) that let models cheat by simple string matching. To push models toward genuine structural reasoning, the authors introduce a Keyword‑Agnostic Logical Code Localization (KA‑LCL) challenge and a new benchmark, KA‑LogicQuery, that strips away all naming hints. Their proposed solution, LogicLoc, fuses large language models (LLMs) with the deterministic power of Datalog, delivering accurate, verifiable localization while using far fewer LLM tokens.

Key Contributions

  • Identification of the “Keyword Shortcut” bias in existing issue‑driven code‑localization datasets.
  • Formalization of KA‑LCL and creation of the KA‑LogicQuery benchmark that forces models to reason about program structure rather than lexical cues.
  • LogicLoc framework: a neurosymbolic pipeline that (1) extracts factual predicates from a repository, (2) uses an LLM to synthesize Datalog rules, (3) validates and refines those rules via parser‑gated checks and mutation‑based diagnostics, and (4) executes the verified Datalog program on a high‑performance inference engine.
  • Empirical evidence that state‑of‑the‑art (SOTA) LLM‑only approaches suffer catastrophic performance loss on KA‑LogicQuery, while LogicLoc regains high accuracy.
  • Efficiency gains: LogicLoc achieves comparable or better results on traditional benchmarks with substantially lower token usage and faster end‑to‑end latency thanks to offloading structural traversal to a deterministic engine.
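
The parser-gated checks mentioned above can be illustrated with a minimal sketch. The grammar below is a deliberate simplification for a single Datalog rule of the form `head(Args) :- body1(...), body2(...).`; a real validator (e.g., Soufflé's parser) enforces a full grammar, and the rule strings here are hypothetical:

```python
import re

# Illustrative syntax gate: accept only rules shaped like
#   head(Var, ...) :- atom1(Var, ...), atom2(Var, ...).
# Lowercase identifiers are predicates; capitalized identifiers are variables.
ATOM = r"[a-z]\w*\(\s*[A-Z]\w*(\s*,\s*[A-Z]\w*)*\s*\)"
RULE = re.compile(rf"^\s*{ATOM}\s*:-\s*{ATOM}(\s*,\s*{ATOM})*\s*\.\s*$")

def is_well_formed(rule: str) -> bool:
    """Reject ill-formed rules before any execution (parser-gated check)."""
    return RULE.match(rule) is not None

print(is_well_formed("target(M) :- calls(M, N), usesVar(N, V)."))  # True
print(is_well_formed("target(M) : calls(M)"))                      # False
```

Rejecting malformed rules up front is what lets the later mutation-based diagnostics assume every candidate program at least parses.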

Methodology

  1. Benchmark Construction (KA‑LogicQuery)

    • Curated a set of real‑world issues where the target code cannot be inferred from file paths or function names.
    • Each query requires reasoning about control‑flow, data‑flow, or API usage patterns (e.g., “Find the method that opens a socket and writes JSON”).
  2. Neurosymbolic Pipeline (LogicLoc)

    • Fact Extraction: A static‑analysis front‑end parses the entire repository, emitting Datalog facts such as calls(MethodA, MethodB), defines(ClassX, MethodY), usesVar(MethodZ, VarV).
    • LLM Prompting: The natural‑language issue description is fed to a large language model (e.g., GPT‑4) together with a template that asks it to generate Datalog rules that capture the required logical constraints.
    • Parser‑Gated Validation: Generated rules are first checked for syntactic correctness; ill‑formed rules are rejected before any execution.
    • Mutation‑Based Diagnostic Feedback: The system mutates parts of the rule (e.g., flips a predicate) and observes the impact on the result set, providing the LLM with concrete counter‑examples to refine its program.
    • Deterministic Execution: Once validated, the Datalog program runs on an optimized inference engine (e.g., Soufflé), producing the exact set of source files/methods that satisfy the logical query.
  3. Evaluation

    • Compared LogicLoc against leading LLM‑only baselines (e.g., Codex, GPT‑4) on both KA‑LogicQuery and standard issue‑driven datasets (e.g., Defects4J, CodeSearchNet).
    • Measured accuracy, token consumption, and wall‑clock latency.
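
Steps 1 and 4 of the pipeline can be sketched in miniature. The facts and method names below are invented for illustration, and a plain in-memory join stands in for an optimized Datalog engine such as Soufflé:

```python
# Hypothetical output of the fact-extraction front-end (step 1):
# calls(Caller, Callee) facts over a small repository.
calls = {
    ("send_report", "socket.connect"),
    ("send_report", "json.dump"),
    ("load_config", "json.load"),
    ("ping_host", "socket.connect"),
}

# A rule an LLM might synthesize (step 2) for the query
# "find the method that opens a socket and writes JSON":
#   target(M) :- calls(M, "socket.connect"), calls(M, "json.dump").
# Step 4: deterministic execution -- here a naive set intersection
# instead of a high-performance inference engine.
def evaluate(calls_facts):
    opens_socket = {m for (m, callee) in calls_facts if callee == "socket.connect"}
    writes_json = {m for (m, callee) in calls_facts if callee == "json.dump"}
    return opens_socket & writes_json

print(evaluate(calls))  # {'send_report'}
```

The key point of the architecture is visible even at this scale: the LLM only authors the short rule, while all traversal over the (potentially huge) fact base is handled deterministically.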

Results & Findings

| Dataset | Metric | SOTA LLM‑Only | LogicLoc |
|---|---|---|---|
| KA‑LogicQuery | Top‑1 Localization Accuracy | 22 % | 71 % |
| KA‑LogicQuery | Tokens per Query (≈) | 1,200 | 340 |
| Traditional Issue Benchmarks | Accuracy (average) | 84 % | 82 % |
| | End‑to‑End Latency (ms) | 1,800 | 620 |
  • Catastrophic drop: When keyword cues are removed, SOTA LLMs plummet to near‑random performance, confirming they rely heavily on lexical shortcuts.
  • LogicLoc’s recovery: By delegating structural reasoning to Datalog, LogicLoc achieves a more than 3× accuracy improvement on the hard benchmark (22 % → 71 % Top‑1).
  • Efficiency: Because the Datalog engine handles graph traversal deterministically, the LLM only needs to generate concise rule snippets, slashing token usage and inference time.
  • Robustness: On conventional benchmarks that still contain keyword hints, LogicLoc remains competitive, showing that the neurosymbolic approach does not sacrifice performance where lexical clues exist.

Practical Implications

  • More Reliable Automated Debugging: Tools that suggest the file or method responsible for a bug can now operate on deeper program semantics, reducing false positives caused by naming coincidences.
  • Scalable Code Search: Enterprises can index massive codebases with symbolic facts and run logical queries on demand, enabling developers to locate implementations of complex patterns (e.g., “all places where a JWT token is verified”).
  • Reduced Cloud Costs: By cutting LLM token consumption dramatically, LogicLoc makes large‑scale code‑localization services cheaper to run, especially in CI/CD pipelines that need to process thousands of tickets daily.
  • Explainability & Auditing: The generated Datalog rules serve as an explicit, human‑readable specification of why a particular snippet was selected, facilitating compliance reviews and knowledge transfer.
  • Foundation for Hybrid AI IDEs: IDE extensions could combine LLM‑driven natural‑language assistance with deterministic symbolic back‑ends, offering both creativity and correctness.
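
The code-search scenario above (e.g., "all places where a JWT token is verified", including indirectly through helpers) can be sketched with a recursive rule. The call-graph facts are made up, and a naive fixpoint loop stands in for a production Datalog engine:

```python
# Hypothetical call-graph facts. The query corresponds to the rules:
#   reaches(A, B) :- calls(A, B).
#   reaches(A, C) :- calls(A, B), reaches(B, C).
#   verifies_jwt(M) :- reaches(M, "jwt.verify").
calls = {
    ("handle_login", "check_token"),
    ("check_token", "jwt.verify"),
    ("handle_signup", "hash_password"),
}

def reaches(calls_facts):
    """Naive fixpoint: extend the transitive closure until no new pairs appear."""
    closure = set(calls_facts)
    while True:
        new = {(a, c) for (a, b) in closure for (b2, c) in calls_facts if b == b2}
        if new <= closure:
            return closure
        closure |= new

verifies_jwt = {a for (a, b) in reaches(calls) if b == "jwt.verify"}
print(sorted(verifies_jwt))  # ['check_token', 'handle_login']
```

Recursion like `reaches` is exactly where a symbolic engine pays off: the same query expressed purely in an LLM prompt would require the model to walk the call graph itself, token by token.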

Limitations & Future Work

  • Static‑Analysis Dependency: LogicLoc’s fact extraction relies on accurate parsing; languages with dynamic features (e.g., JavaScript’s runtime eval) may yield incomplete facts.
  • Rule Synthesis Complexity: While the LLM handles most cases, extremely intricate logical constraints can still produce malformed Datalog, requiring more sophisticated feedback loops.
  • Benchmark Scope: KA‑LogicQuery, though carefully curated, covers a limited set of reasoning patterns; broader benchmarks (e.g., cross‑language or multi‑repo queries) are needed to fully assess generality.

Future Directions

  • Extending the fact model to capture runtime information (e.g., type inference, taint analysis).
  • Integrating reinforcement learning to let the LLM iteratively improve rule generation based on execution outcomes.
  • Exploring other symbolic engines (e.g., Prolog, SMT solvers) for richer constraint domains.

Authors

  • Xiufeng Xu
  • Xiufeng Wu
  • Zejun Zhang
  • Yi Li

Paper Information

  • arXiv ID: 2604.16021v1
  • Categories: cs.SE, cs.AI
  • Published: April 17, 2026