[Paper] Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

Published: May 29, 2026 at 12:06 PM EDT

Source: arXiv - 2605.31478v1

Overview

Large language models (LLMs) are increasingly used to write code for power‑system simulations, but utilities often need to run these models on‑premise for security and compliance reasons. This paper argues that “first‑pass” code generation fails mostly not because the model cannot reason, but because it runs into API‑knowledge boundaries (e.g., inventing function names or misusing library parameters). The authors present a benchmark, a probing framework, and a lightweight “demand‑guided” intervention that substantially improves the reliability of open‑weight LLMs without any fine‑tuning.

Key Contributions

PowerCodeBench: an execution‑validated benchmark that couples natural‑language queries with real pandapower code and numerical ground‑truth results.
Documentation‑driven probing (L0‑L3): a systematic procedure to map each model’s knowledge of versioned simulation APIs, exposing where hallucinations occur.
Boundary‑aware intervention: a two‑stage technique that (1) estimates the API calls a query will need and injects concise, targeted documentation into the prompt, and (2) applies a reactive correction step if the generated code still violates the API contract.
Comprehensive evaluation: 2 000 tasks tested on ten open‑weight LLMs (1.5 B–480 B parameters) and four commercial mid‑tier APIs, showing consistent accuracy gains of 32–56 percentage points for models ≥7 B parameters.
Efficiency gains: the intervention preserves the model’s full‑context reasoning while cutting prompt‑token usage to ~41 % of a naïve “full‑documentation” prompt.
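
As a rough illustration of the documentation‑driven probing idea, the sketch below scores the parameter names a model claims for a function against that function's actual signature. This summary does not spell out what each of the L0‑L3 levels asks, so this is only one plausible kind of probe. The helper `score_signature_recall`, the sample model answer, and the local `create_bus` stand‑in (an abbreviated signature, not pandapower's full one) are all assumptions made so the example runs without pandapower installed.

```python
import inspect


def create_bus(net, vn_kv, name=None, in_service=True):
    """Abbreviated local stand-in for a documented API function (assumption:
    this is NOT the full pandapower signature, only an illustrative subset)."""


def score_signature_recall(func, claimed_params):
    """Compare parameter names a model claims against the real signature.

    Returns (correct, hallucinated, missed) as sorted lists of names.
    """
    real = set(inspect.signature(func).parameters)
    claimed = set(claimed_params)
    correct = sorted(claimed & real)
    hallucinated = sorted(claimed - real)  # names the model invented
    missed = sorted(real - claimed)        # documented names the model omitted
    return correct, hallucinated, missed


# Hypothetical model answer to "What arguments does create_bus accept?"
model_answer = ["net", "vn_kv", "name", "voltage"]

correct, hallucinated, missed = score_signature_recall(create_bus, model_answer)
print("correct:", correct)
print("hallucinated:", hallucinated)
print("missed:", missed)
```

Aggregating such scores over many functions would give a per‑model knowledge profile of the kind the probing step describes; how the paper actually weights or aggregates them is not stated here.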

Methodology

Benchmark Construction – The authors collected realistic operator queries (e.g., “run a short‑circuit analysis for bus 5”) and paired each with a hand‑crafted pandapower script and the expected numerical output. All scripts are executed to verify correctness, turning the benchmark into a ground‑truth execution check rather than a pure text match.
API Knowledge Probing – Using a four‑level (L0‑L3) probing ladder, they ask models progressively more detailed questions about the pandapower library (e.g., “What arguments does create_bus accept?”). The responses are compared against the official documentation to build a knowledge profile for each model.
Demand‑Guided Intervention –
- Demand Estimation: A lightweight classifier predicts which pandapower functions a user query will need.
- Proactive Injection: Only the relevant snippets of the official API docs are appended to the prompt, keeping the token budget low.
- Reactive Correction: After the model generates code, a quick static check (e.g., signature verification) flags mismatches; if found, a second‑stage prompt asks the model to fix the specific error.
Evaluation Loop – Each model runs the 2 000 tasks twice (baseline vs. intervention). Generated scripts are executed; success is measured by matching the numerical results within a tight tolerance.
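
The demand‑guided steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword matcher stands in for the paper's demand estimator, `API_DOCS` is a hand‑written, abbreviated documentation table (its parameter sets are illustrative subsets of the real pandapower signatures), and the function names `estimate_demand`, `build_prompt`, `check_calls`, and `correction_prompt` are hypothetical.

```python
import ast

# Abbreviated, hand-written stand-in for API documentation (assumption:
# parameter sets are illustrative subsets, not full pandapower signatures).
API_DOCS = {
    "create_bus": {
        "doc": "create_bus(net, vn_kv, name=None): add a bus with rated voltage vn_kv in kV.",
        "params": {"net", "vn_kv", "name"},
        "keywords": ("bus",),
    },
    "runpp": {
        "doc": "runpp(net, algorithm='nr'): run an AC power flow.",
        "params": {"net", "algorithm"},
        "keywords": ("power flow", "load flow"),
    },
}


def estimate_demand(query):
    """Demand estimation: guess which API functions a query needs.
    A keyword match stands in for a trained estimator."""
    q = query.lower()
    return [name for name, entry in API_DOCS.items()
            if any(k in q for k in entry["keywords"])]


def build_prompt(query):
    """Proactive injection: include docs only for the predicted functions."""
    docs = "\n".join(API_DOCS[name]["doc"] for name in estimate_demand(query))
    return f"Relevant API documentation:\n{docs}\n\nTask: {query}"


def check_calls(code, module_alias="pp"):
    """Reactive check: statically flag calls on the module alias that use
    an undocumented function name or an undocumented keyword argument."""
    problems = []
    for node in ast.walk(ast.parse(code)):
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == module_alias):
            continue
        name = node.func.attr
        if name not in API_DOCS:
            problems.append(f"unknown function: {name}")
            continue
        for kw in node.keywords:
            if kw.arg is not None and kw.arg not in API_DOCS[name]["params"]:
                problems.append(f"{name}: unknown argument '{kw.arg}'")
    return problems


def correction_prompt(code, problems):
    """Second-stage prompt that asks the model to fix only the flagged issues."""
    issues = "\n".join(f"- {p}" for p in problems)
    return f"The code below violates the documented API:\n{issues}\n\nFix only these issues:\n{code}"


query = "Add a 20 kV bus and run a power flow"
prompt = build_prompt(query)

# Hypothetical first-pass model output with two API-boundary errors.
generated = "pp.create_bus(net, voltage_kv=20.0)\npp.run_powerflow(net)\n"
problems = check_calls(generated)
if problems:
    print(correction_prompt(generated, problems))
```

The static check only parses the generated code and never runs it, which keeps it cheap but also means it cannot catch errors that surface only at execution time.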

Results & Findings

Model (Params)	Baseline Accuracy*	Post‑Intervention Accuracy	Δ Accuracy
Llama‑2‑7B	38 %	71 %	+33 pp
Llama‑2‑13B	45 %	78 %	+33 pp
Llama‑3.1‑405B	68 %	92 %	+24 pp
Qwen3‑Coder‑480B	71 %	95 %	+24 pp
Commercial API (mid‑tier)	55 %	87 %	+32 pp

*Accuracy = percentage of tasks where the generated code produced the correct numerical result.
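
For concreteness, the accuracy metric and the Δ column can be sketched as below. The tolerance values, result keys, and per‑task outcomes are made‑up placeholders (the source only says results must match "within a tight tolerance"), and `results_match` and `accuracy` are hypothetical helper names.

```python
import math


def results_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """True if both result dicts have the same keys and numerically close
    values. The tolerances are assumptions, not the paper's settings."""
    if expected.keys() != actual.keys():
        return False
    return all(
        math.isclose(expected[k], actual[k], rel_tol=rel_tol, abs_tol=abs_tol)
        for k in expected
    )


def accuracy(outcomes):
    """Percentage of tasks whose generated code reproduced the ground truth."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up ground truth and two made-up script outputs for a single task.
expected = {"vm_pu": 1.02, "p_mw": 4.75}
close_enough = results_match(expected, {"vm_pu": 1.02 + 1e-10, "p_mw": 4.75})
too_far = results_match(expected, {"vm_pu": 1.03, "p_mw": 4.75})

# Made-up pass/fail outcomes for four tasks, before and after intervention.
baseline = accuracy([True, False, False, True])
intervened = accuracy([True, True, False, True])
delta_pp = intervened - baseline  # difference in percentage points
print(f"baseline {baseline:.0f} %, intervention {intervened:.0f} %, delta {delta_pp:+.0f} pp")
```

Note that the Δ column is a difference in percentage points, not a relative improvement: going from 38 % to 71 % is +33 pp.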

With the intervention, open‑weight models in the 70 B–120 B range match the performance of commercial mid‑tier APIs.
The intervention adds no noticeable latency (the extra verification step runs in <200 ms) and reduces prompt cost by ~59 % compared with dumping the entire API reference.
Even the largest models (405 B/480 B) still benefit, indicating that raw model size alone does not solve API‑boundary errors.

Practical Implications

On‑premise deployment becomes viable: Utilities can run a 70 B open‑weight model locally and achieve commercial‑grade code‑generation reliability without paying for cloud inference or fine‑tuning.
Reduced engineering overhead: The demand‑guided prompting can be wrapped into a thin “LLM‑assistant” library that automatically injects the right docs, letting developers focus on high‑level analysis rather than debugging generated scripts.
Cost‑effective scaling: Because only a fraction of the documentation is sent per request, the per‑request token cost (for hosted APIs) and the prompt‑processing load (for local inference) both drop, making large‑scale batch simulations more affordable.
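Safety & compliance: Because the benchmark is execution‑validated, the reliability of a given model and prompt configuration is measured against known ground truth rather than by text similarity, which can support regulatory audit trails. Code generated in production is still only statically checked, so it should be validated before it reaches a grid‑operation environment.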
Transferability: The probing + intervention pipeline is not limited to pandapower; any domain with a versioned, well‑documented Python (or other) API (e.g., other power‑flow libraries, control‑system toolkits) can adopt the same approach.

Limitations & Future Work

Benchmark scope: PowerCodeBench focuses on pandapower; other power‑system tools (e.g., OpenDSS, GridLAB‑D) may expose different API patterns that need separate probing.
Static verification only: The reactive correction step checks signatures but does not execute the code before returning it; deeper semantic checks (e.g., unit consistency) could catch subtler bugs.
Demand‑estimator generality: The current estimator is trained on the same benchmark data; future work could explore zero‑shot demand prediction to avoid any task‑specific training.
Fine‑tuning vs. prompting: While the paper shows prompting works well, combining it with lightweight fine‑tuning might push accuracy even higher, especially for niche utility‑specific extensions.

Overall, the study offers a pragmatic, deployment‑time recipe for turning open‑weight LLMs into more reliable code assistants for power‑system engineers, narrowing the gap between AI research and real‑world grid operations.

Authors

Hui Wu
Xiaoyang Wang
Zhong Fan

Paper Information

arXiv ID: 2605.31478v1
Categories: cs.SE, cs.CL, eess.SY
Published: May 29, 2026
