[Paper] Less Is More: Measuring How LLM Involvement Affects Chatbot Accuracy in Static Analysis
Source: arXiv - 2604.21746v1
Overview
The paper investigates how much a large language model (LLM) should be “in charge” when it translates natural‑language questions into Joern’s static‑analysis query language, CPGQL. By systematically varying the degree of LLM involvement, the authors show that a modest, schema‑constrained intermediate format can dramatically boost accuracy—often more than letting the model generate the final query outright or letting it run an iterative, tool‑augmented “agent” loop.
Key Contributions
- Three‑tiered architecture comparison – Direct query generation, JSON‑based schema‑constrained intermediate representation, and a tool‑augmented agentic approach.
- Comprehensive benchmark – 20 static‑analysis tasks split into low, medium, and high complexity, evaluated with four open‑weight LLMs (two families × two sizes) in a 2 × 2 experimental design, each run three times.
- Empirical finding – The structured intermediate (Approach 2) outperforms direct generation by 15‑25 pp on large models and beats the agentic method while using ~⅛ of the token budget.
- Scalability insight – Benefits of the intermediate representation grow with model size; small models hit a “schema‑compliance” ceiling.
- Guidance for static‑analysis tooling – Demonstrates that constraining LLM output to a well‑typed format and delegating deterministic query construction yields the best trade‑off between accuracy and cost.
Methodology
- Task Set – 20 realistic code‑analysis queries (e.g., “find all functions that read from a network socket”) grouped into three difficulty levels.
- LLM Families & Scales – Two open‑source families (e.g., LLaMA‑style and Mistral‑style) each evaluated at a “small” (≈7 B parameters) and a “large” (≈30 B) checkpoint.
- Three Architectures
- Approach 1 (Direct) – Prompt the LLM to output a CPGQL string directly from the natural‑language request.
- Approach 2 (Intermediate JSON) – Prompt the LLM to emit a JSON object that conforms to a predefined schema (e.g., { "entity": "function", "property": "calls", "target": "socket" }). A deterministic post‑processor then translates this JSON into valid CPGQL.
- Approach 3 (Agentic) – Give the LLM access to Joern as a tool, allowing it to iteratively query, observe results, and refine its answer (similar to ReAct or tool‑use loops).
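The deterministic half of Approach 2 can be sketched as a small compiler from the intermediate JSON to a CPGQL string. The schema fields and the Joern-style queries below are illustrative assumptions, not the paper's actual schema or implementation:

```python
import json

def json_to_cpgql(payload: str) -> str:
    """Deterministically compile a schema-constrained JSON object into a
    CPGQL string (the post-processing half of Approach 2). The supported
    entity/property combinations here are a hypothetical subset."""
    obj = json.loads(payload)
    entity, prop, target = obj["entity"], obj["property"], obj["target"]
    if entity == "function" and prop == "calls":
        # Methods that contain a call to the named target function.
        return f'cpg.call("{target}").method.name.l'
    if entity == "function" and prop == "name":
        # Methods matching the given name.
        return f'cpg.method.name("{target}").l'
    raise ValueError(f"unsupported combination: {entity}/{prop}")

query = json_to_cpgql('{"entity": "function", "property": "calls", "target": "socket"}')
print(query)
```

Because the translation is a plain lookup over a closed set of field combinations, it cannot hallucinate syntax: any JSON the schema does not cover fails loudly instead of producing a malformed query.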
- Evaluation Metric – Result‑match rate: proportion of generated queries that return exactly the same set of code elements as the ground‑truth query.
- Repetitions – Each configuration run three times to smooth out stochastic variance; results aggregated across runs.
The design treats “degree of delegation” as an independent variable, enabling a clean comparison of how much reasoning should stay inside the LLM versus in deterministic code.
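The result-match metric described above reduces to set equality over query outputs. A minimal sketch, with function names and example data invented for illustration:

```python
def result_match_rate(generated_results, ground_truth_results):
    """Fraction of queries whose generated result set exactly equals the
    ground-truth result set (order-insensitive, as sets of code elements)."""
    matches = sum(
        set(gen) == set(ref)
        for gen, ref in zip(generated_results, ground_truth_results)
    )
    return matches / len(ground_truth_results)

# Hypothetical example: queries 1 and 3 reproduce the reference exactly,
# query 2 misses an element, giving a match rate of 2/3.
gen = [["main", "recv"], ["parse"], ["send", "recv"]]
ref = [["recv", "main"], ["parse", "init"], ["recv", "send"]]
print(result_match_rate(gen, ref))
```

Exact set equality is a strict criterion: a query that returns a superset or subset of the correct elements scores zero, which keeps the benchmark from rewarding near misses.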
Results & Findings
| Architecture | Small Model | Large Model |
|---|---|---|
| Direct (A1) | ~45 % match | ~55 % match |
| Intermediate JSON (A2) | ~60 % match | ~80 % match |
| Agentic (A3) | ~58 % match | ~78 % match |
- A2 beats A1 by 15‑25 pp on large models, confirming that a typed intermediate representation dramatically reduces hallucination and syntax errors.
- A2 rivals A3 while consuming roughly 1/8 of the token budget, making it far cheaper to run at scale.
- Complexity matters – Gains are most visible on medium/high‑complexity tasks where query structure is non‑trivial.
- Model size effect – Large models can better respect the JSON schema, so A2’s advantage widens; small models often produce malformed JSON, limiting the approach.
Overall, the study suggests that “less is more”: giving the LLM a narrow, well‑defined output space leads to higher fidelity than letting it generate free‑form code or orchestrate a multi‑step tool‑use loop.
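For context, the agentic baseline (A3) that A2 rivals amounts to a simple observe-act loop. The sketch below assumes a ReAct-style protocol; `llm` and `run_joern_query` are hypothetical stand-ins for a model call and a Joern invocation, not APIs from the paper:

```python
def agentic_loop(question, llm, run_joern_query, max_steps=5):
    """ReAct-style loop: the model proposes a CPGQL query, observes
    Joern's output, and refines until it emits a final answer or the
    step budget runs out. The 'QUERY'/'FINAL' action format is an
    illustrative convention."""
    transcript = [f"Task: {question}"]
    for _ in range(max_steps):
        action = llm("\n".join(transcript))
        if action.startswith("FINAL "):
            return action[len("FINAL "):]
        observation = run_joern_query(action.removeprefix("QUERY "))
        transcript.append(f"Action: {action}")
        transcript.append(f"Observation: {observation}")
    return None  # budget exhausted without a final answer
```

Each iteration re-sends the growing transcript to the model, which is why the agentic approach consumes roughly eight times the tokens of the single-shot intermediate-JSON pipeline.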
Practical Implications
- Tool Builders – When adding natural‑language front‑ends to static‑analysis engines (Joern, CodeQL, etc.), implement a schema‑validated JSON or protobuf intermediate rather than direct code generation.
- Cost Savings – Developers can achieve near‑agentic performance with a fraction of the API token usage, lowering operational expenses for SaaS products.
- Reliability – Structured intermediates make it easier to add static checks (JSON schema validation) before invoking the deterministic query compiler, reducing runtime crashes.
- Extensibility – The intermediate format can be versioned and enriched (e.g., adding “confidence” fields) without touching the LLM prompt logic.
- Team Workflow – Security‑oriented teams can audit the JSON payloads more readily than raw generated queries, facilitating compliance reviews.
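The pre-compilation check mentioned under Reliability can be as simple as a hand-rolled validator run before the deterministic compiler. The required fields and allowed entities below follow the illustrative schema used earlier, not the paper's actual one:

```python
import json

REQUIRED_FIELDS = {"entity": str, "property": str, "target": str}
ALLOWED_ENTITIES = {"function", "call", "parameter"}  # hypothetical vocabulary

def validate_intermediate(payload: str) -> list:
    """Return a list of validation errors for an LLM-emitted intermediate
    JSON payload; an empty list means it is safe to hand to the
    deterministic CPGQL compiler."""
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    errors = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], typ):
            errors.append(f"wrong type for {field}")
    if obj.get("entity") not in ALLOWED_ENTITIES:
        errors.append(f"unknown entity: {obj.get('entity')}")
    return errors

print(validate_intermediate('{"entity": "function", "property": "calls", "target": "socket"}'))
print(validate_intermediate('{"entity": "widget"}'))
```

Rejecting bad payloads here converts what would be a runtime query failure into a cheap, auditable error message, which is exactly what makes the intermediate easier to review than raw generated queries.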
In short, the findings give a concrete recipe for building LLM‑assisted code‑analysis assistants that are both accurate and economical.
Limitations & Future Work
- Schema Rigidity – The JSON schema must be manually crafted for each analysis domain; extending to new query types may require engineering effort.
- Small‑Model Bottleneck – For models under ~7 B parameters, schema compliance drops sharply, suggesting a need for better prompting or fine‑tuning.
- Benchmark Scope – The study uses 20 tasks on a single static‑analysis platform (Joern). Broader evaluations across other tools (e.g., CodeQL, Semgrep) would strengthen generalizability.
- Agentic Baseline – The agentic approach was implemented with a simple ReAct loop; more sophisticated planning or memory mechanisms could narrow the gap.
- User Study – The paper focuses on automated metrics; future work could assess developer satisfaction and productivity when interacting with each architecture.
By addressing these points, the community can refine the “structured‑intermediate” paradigm and apply it to a wider array of developer‑centric AI assistants.
Authors
- Krishna Narasimhan
Paper Information
- arXiv ID: 2604.21746v1
- Categories: cs.SE
- Published: April 23, 2026