[Paper] From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Published: April 16, 2026 at 10:55 AM EDT

Source: arXiv - 2604.15097v1

Overview

The paper investigates how to package and reuse “experience” from previous runs of scientific code‑solving systems so that it can be leveraged at test time and evolved iteratively. By running 4,590 controlled experiments across 45 problem domains, the authors discover that a compact, gene‑like representation of experience consistently outperforms larger, documentation‑style “skill” packages. In short, how you encode past knowledge matters far more than how much you give the system.

Key Contributions

Empirical benchmark: 4,590 trials on 45 scientific code‑solving tasks, providing a rare large‑scale evaluation of experience reuse.
Comparison of representations: Shows that “Skill” packages (rich documentation) are unstable and often degrade performance, while a minimal “Gene” encoding yields the best average results.
Evolution‑ready design: Demonstrates that Genes are superior carriers for iterative learning—failure histories, compact warnings, and editable structure all boost downstream performance.
Quantified gains: On the CritPt benchmark, Gene‑evolved systems raise performance from 9.1 % to 18.57 % in one setting and from 17.7 % to 27.14 % in another.
Design insight: Highlights that the core challenge is encoding experience as a compact, control‑oriented object rather than simply adding more data.

Methodology

Task suite – 45 scientific code‑solving scenarios (e.g., symbolic integration, differential equation solving).
Experience formats
- Skill: A documentation‑style bundle containing free‑form text, examples, and auxiliary code.
- Gene: A tightly structured, compact record that captures essential control signals (e.g., parameter tweaks, concise warnings).
Controlled trials – For each scenario, the authors perform multiple test‑time runs with either a Skill or a Gene attached, measuring success rate, runtime, and stability under structural perturbations (e.g., shuffling fields, adding noise).
Iterative evolution – After an initial run, failure information is recorded and fed back into the experience object. The authors compare three ways of doing this: naive text append, structured failure logs, and compact warning tokens.
Metrics – Primary metric is the average success improvement over a baseline model; secondary metrics include robustness to representation changes and the cost of encoding (size, parsing overhead).
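
To make the Skill/Gene distinction and the feedback step more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' schema: the `Gene` class, its field names (`task_family`, `params`, `warnings`, `max_warnings`), and the `naive_append` helper are all hypothetical. It contrasts folding a failure into a compact, de-duplicated warning with naively appending a raw log to free-form text.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical compact experience record; all field names are illustrative."""
    task_family: str
    params: dict[str, float] = field(default_factory=dict)  # parameter tweaks
    warnings: list[str] = field(default_factory=list)       # compact warnings
    max_warnings: int = 5                                    # cap keeps the record small

    def add_warning(self, warning: str) -> None:
        """Record a failure as a short warning, skipping exact duplicates."""
        if warning in self.warnings:
            return
        self.warnings.append(warning)
        # Drop the oldest entries once the cap is exceeded.
        del self.warnings[:-self.max_warnings]

    def render(self) -> str:
        """Serialize the Gene into a few short lines for the solver's context."""
        lines = [f"task_family: {self.task_family}"]
        lines += [f"param {k} = {v}" for k, v in sorted(self.params.items())]
        lines += [f"avoid: {w}" for w in self.warnings]
        return "\n".join(lines)


def naive_append(skill_text: str, failure_log: str) -> str:
    """Comparison point: append the raw failure log to free-form Skill text."""
    return skill_text + "\n\n" + failure_log


# Example values are made up for illustration.
gene = Gene(task_family="ode_solving", params={"rtol": 1e-8})
gene.add_warning("division-by-zero at step 3")
gene.add_warning("division-by-zero at step 3")  # duplicate, ignored
print(gene.render())
```

In this sketch the Gene stays bounded in size however many runs feed into it, whereas the appended text grows with every failure; that is one plausible reading of why compact warnings are easier to evolve.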

Results & Findings

Representation	Avg. Success ↑ (vs. baseline)	Robustness to Perturbation	Effect of Adding Docs
Gene	+10.2 % (overall)	High – minimal drop when fields are shuffled	Adding extra docs degrades performance
Skill (full)	+3.4 % (average)	Low – performance collapses with minor noise	More docs → no benefit or negative impact
Skill (fragment)	+5.1 %	Moderate	Same trend

Iterative accumulation: When failure histories are encoded as compact warnings inside a Gene, subsequent runs improve by roughly a further 5 % compared with using raw text logs.
Structural edits matter: Changing the order or nesting of Gene fields has a smaller impact than doing the same to Skill bundles, confirming that the Gene’s design is inherently more control‑oriented.
CritPt benchmark: Gene‑evolved models achieve 18.57 % and 27.14 % success rates, up from baselines of 9.1 % and 17.7 %, i.e., about 2.0× and 1.5× the respective baseline scores.
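
The size of the CritPt gains can be checked with a few lines of arithmetic. The snippet below uses only the four percentages reported in this summary; the helper name `gains` is ours, not the paper's.

```python
def gains(baseline: float, evolved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, multiplicative factor)."""
    return evolved - baseline, evolved / baseline


# (baseline %, Gene-evolved %) pairs on CritPt, as reported in this summary.
critpt = [(9.1, 18.57), (17.7, 27.14)]

for baseline, evolved in critpt:
    points, factor = gains(baseline, evolved)
    print(f"{baseline}% -> {evolved}%: +{points:.2f} points, {factor:.2f}x")
# 9.1% -> 18.57%: +9.47 points, 2.04x
# 17.7% -> 27.14%: +9.44 points, 1.53x
```

Both settings gain about 9.4 to 9.5 percentage points, which corresponds to a doubling only where the baseline is lower.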

Practical Implications

Tooling for developers: When building AI‑assisted scientific software (e.g., symbolic math assistants, automated theorem provers), expose a compact “experience API” rather than dumping large documentation blobs.
Runtime efficiency: Genes are tiny (often under a kilobyte) compared with Skill packages (often over a megabyte), which reduces parsing time and memory footprint, a critical factor for edge or cloud‑function deployments.
Continuous improvement pipelines: Systems can automatically ingest failure warnings (e.g., “division‑by‑zero at step 3”) into the Gene, enabling online refinement without retraining the whole model.
Version control & reproducibility: Because Genes are structured, they can be diff‑tracked (like code) and rolled back, making audit trails for scientific computation more manageable.
Cross‑domain transfer: A well‑designed Gene can be ported between related problem families (e.g., from ODE solving to PDE discretization) with minimal adaptation, accelerating productization of research prototypes.
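
As a rough illustration of the version-control point, the sketch below diffs two versions of a Gene using only the Python standard library. It assumes a Gene can be represented as a JSON-serializable dict; the keys and values are hypothetical and are not the paper's schema. Serializing with sorted keys means the diff reflects content changes rather than key order.

```python
import difflib
import json


def canonical(gene: dict) -> list[str]:
    """Serialize with sorted keys so diffs reflect content, not key order."""
    return json.dumps(gene, indent=2, sort_keys=True).splitlines()


def gene_diff(old: dict, new: dict) -> str:
    """Return a unified diff between two Gene versions (empty if identical)."""
    return "\n".join(
        difflib.unified_diff(
            canonical(old), canonical(new),
            fromfile="gene@v1", tofile="gene@v2", lineterm="",
        )
    )


# Hypothetical Gene versions before and after ingesting one failure warning.
v1 = {"task_family": "ode_solving", "params": {"rtol": 1e-6}, "warnings": []}
v2 = {
    "task_family": "ode_solving",
    "params": {"rtol": 1e-8},
    "warnings": ["division-by-zero at step 3"],
}
print(gene_diff(v1, v2))
```

The same canonical text could be committed to a repository, so rolling back a Gene would amount to reverting a small, readable change.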

Limitations & Future Work

Domain scope: Experiments focus on scientific code‑solving; results may not directly translate to NLP or vision tasks without further validation.
Gene design heuristics: The paper proposes a specific Gene schema; discovering optimal schemas for other domains remains an open question.
Scalability of evolution: While compact warnings work well now, the authors note that handling large, heterogeneous failure logs may require hierarchical Gene structures.
Human interpretability: Genes are intentionally terse, which can make manual debugging harder; future work could explore hybrid representations that retain compactness while offering richer explanations.

Bottom line: For developers building AI systems that need to learn from past runs, the takeaway is clear—encode experience as a small, well‑structured “gene” rather than a bulky documentation package. This not only yields better immediate performance but also sets the stage for efficient, iterative improvement in production environments.

Authors

Junjie Wang
Yiming Ren
Haoyang Zhang

Paper Information

arXiv ID: 2604.15097v1
Categories: cs.SE, cs.CL
Published: April 16, 2026
