[Paper] Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

Published: June 4, 2026 at 01:49 PM EDT

Source: arXiv - 2606.06454v1

Overview

Large language models increasingly write, review, and judge code, and a fast‑growing practice equips them with prompt “skills” that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM‑as‑a‑judge, an instrument with documented positional, self‑preference, and stylistic biases.

We ask: when such a skill appears to help, does the gain come from its Popperian content or from the structure any scaffold imposes? We pre‑register a two‑tier ablation with three controls:

a length‑matched placebo,
a labels‑only scaffold that keeps the Popperian headers but strips the procedure, and
an execution oracle (HumanEval+ unit tests),

plus a vocabulary‑halo sentinel and a same‑model self‑judge audit.

Frontier model (Claude Sonnet 4.6, N = 163): All conditions sit near the benchmark ceiling and do not separate, so the pre‑registered +5‑point improvement is not supported (a ceiling‑limited non‑detection).
Small model (Qwen2.5‑Coder‑0.5B, N = 164): Structured arms lift best‑of‑eight correctness by 20–22 points, but the full skill shows no separable benefit over a labels‑only scaffold (aggregate F@8 = L@8 vs V@8 = 34.8%). The placebo trails by only 2.4 points.
A 0.5B self‑judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index.
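
The quantities in these results can be illustrated with a minimal Python sketch. It assumes per-problem execution outcomes are available as lists of booleans (one per sampled candidate) and that the judge returns the index of its chosen candidate. The function names and the toy data are hypothetical, use four candidates per problem instead of eight for brevity, and are not taken from the paper.

```python
from collections import Counter

def best_of_k(results):
    """Fraction of problems where at least one of the k candidates passes the tests."""
    return sum(any(r) for r in results) / len(results)

def random_pick_accuracy(results):
    """Expected accuracy when one candidate per problem is chosen uniformly at random."""
    return sum(sum(r) / len(r) for r in results) / len(results)

def judge_accuracy(results, picks):
    """Accuracy when the judge's chosen index is used for each problem."""
    return sum(r[i] for r, i in zip(results, picks)) / len(results)

def top_index_share(picks):
    """Most frequently chosen index and the share of picks it receives."""
    index, count = Counter(picks).most_common(1)[0]
    return index, count / len(picks)

# Toy data (hypothetical): 4 problems, 4 candidates each; True means tests passed.
results = [
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
    [True, True, True, True],
]
# Hypothetical judge picks: index 0 chosen for three of the four problems.
picks = [0, 0, 0, 2]

print("best-of-k:", best_of_k(results))
print("random-pick baseline:", random_pick_accuracy(results))
print("judge accuracy:", judge_accuracy(results, picks))
print("top index and share:", top_index_share(picks))
```

Comparing the judge's accuracy against the random-pick baseline, rather than against zero, is what shows whether a judge adds any selection value. A high share for a single index suggests the picks follow position rather than candidate quality.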

In the two settings tested, the skill’s Popperian procedural content adds no separable execution‑correctness benefit beyond a labels‑only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt‑skill family and is not an evaluation of Popperian methodology in general.

Authors

Mehmet Iscan

Paper Information

arXiv ID: 2606.06454v1
Categories: cs.SE, cs.CL
Published: June 4, 2026
