[Paper] From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
Source: arXiv - 2604.15097v1
Overview
The paper investigates how to package and reuse “experience” from previous runs of scientific code‑solving systems so that it can be applied at test time and evolved iteratively. Across 4,590 controlled trials on 45 scientific code‑solving scenarios, the authors find that a compact, gene‑like representation of experience gives the best average results and beats larger, documentation‑style “skill” packages. In short, how past knowledge is encoded matters far more than how much of it the system receives.
Key Contributions
- Empirical benchmark: 4,590 trials on 45 scientific code‑solving tasks, providing a rare large‑scale evaluation of experience reuse.
- Comparison of representations: Shows that “Skill” packages (rich documentation) are unstable and often degrade performance, while a minimal “Gene” encoding yields the best average results.
- Evolution‑ready design: Demonstrates that Genes are better carriers for iterative learning: failure histories, compact warnings, and editable structure all boost downstream performance.
- Quantified gains: On the CritPt benchmark, Gene‑evolved systems raise success rates from 9.1 % to 18.57 % and from 17.7 % to 27.14 %.
- Design insight: Highlights that the core challenge is encoding experience as a compact, control‑oriented object rather than simply adding more data.
Methodology
- Task suite – 45 scientific code‑solving scenarios (e.g., symbolic integration, differential equation solving).
- Experience formats:
  - Skill: A documentation‑style bundle containing free‑form text, examples, and auxiliary code.
  - Gene: A tightly‑structured, low‑dimensional vector/record that captures essential control signals (e.g., parameter tweaks, concise warnings).
- Controlled trials – For each scenario, the authors perform multiple test‑time runs with either a Skill or a Gene attached, measuring success rate, runtime, and stability under structural perturbations (e.g., shuffling fields, adding noise).
- Iterative evolution – After an initial run, failure information is recorded and fed back into the experience object. The authors compare three ways of doing this: naive text append, structured failure logs, and compact warning tokens.
- Metrics – Primary metric is the average success improvement over a baseline model; secondary metrics include robustness to representation changes and the cost of encoding (size, parsing overhead).
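To make the two experience formats and the failure‑feedback step more concrete, here is a minimal Python sketch. The `Skill` and `Gene` classes, the field names (`task_family`, `params`, `warnings`), the `max_warnings` cap, and the helper methods are all hypothetical illustrations, not the paper's actual schema. The sketch only assumes that a Skill is free‑form text that grows by appending raw logs, while a Gene is a small structured record that stores short, de‑duplicated warnings.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical documentation-style bundle: free-form text that only grows."""
    text: str = ""

    def append_failure(self, log: str) -> None:
        # Naive text append: the raw failure log is concatenated verbatim.
        self.text += "\n" + log


@dataclass
class Gene:
    """Hypothetical compact record holding control signals."""
    task_family: str
    params: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)
    max_warnings: int = 5  # assumed cap that keeps the record compact

    def add_warning(self, warning: str) -> None:
        # Compact warning: normalize the message and skip duplicates.
        token = warning.strip().lower()
        if token in self.warnings:
            return
        self.warnings.append(token)
        # Drop the oldest warning once the assumed cap is exceeded.
        if len(self.warnings) > self.max_warnings:
            self.warnings.pop(0)


gene = Gene(task_family="ode_solving", params={"step_size": 0.01})
gene.add_warning("Division-by-zero at step 3")
gene.add_warning("division-by-zero at step 3")  # duplicate, ignored

skill = Skill(text="How to solve ODEs ...")
skill.append_failure("Traceback: ZeroDivisionError at step 3 ...")

print(gene.warnings)  # the Gene keeps a single normalized warning
```

The design choice illustrated here is that the Gene stays bounded in size no matter how many runs fail, whereas the Skill text keeps growing with every appended log.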
Results & Findings
| Representation | Avg. Success ↑ (vs. baseline) | Robustness to Perturbation | Effect of Adding Docs |
|---|---|---|---|
| Gene | +10.2 % | High – minimal drop when fields are shuffled | Adding extra docs degrades performance |
| Skill (full) | +3.4 % | Low – performance collapses with minor noise | More docs → no benefit or negative impact |
| Skill (fragment) | +5.1 % | Moderate | Same trend as Skill (full) |
- Iterative accumulation: When failure histories are encoded as compact warnings inside a Gene, subsequent runs improve by roughly a further 5 % compared with using raw text logs.
- Structural edits matter: Changing the order or nesting of Gene fields has a smaller impact than doing the same to Skill bundles, confirming that the Gene’s design is inherently more control‑oriented.
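- CritPt benchmark: Gene‑evolved models achieve 18.57 % and 27.14 % success rates, up from baselines of 9.1 % and 17.7 %. The first result roughly doubles its baseline; the second is a gain of about 9.4 percentage points.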
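The CritPt figures can be checked with simple arithmetic. The short sketch below takes the two reported baseline and Gene‑evolved success rates and computes the absolute gain (in percentage points) and the relative improvement factor. The function name `gains` is only an illustrative helper.

```python
# Reported CritPt success rates in percent: (baseline, Gene-evolved).
results = [(9.1, 18.57), (17.7, 27.14)]


def gains(baseline: float, evolved: float) -> tuple:
    """Return (absolute gain in percentage points, relative factor)."""
    absolute = evolved - baseline
    relative = evolved / baseline
    return absolute, relative


for baseline, evolved in results:
    absolute, relative = gains(baseline, evolved)
    print(f"{baseline:.2f} -> {evolved:.2f}: +{absolute:.2f} points, x{relative:.2f}")

# 9.10 -> 18.57: +9.47 points, x2.04
# 17.70 -> 27.14: +9.44 points, x1.53
```

Both settings gain a similar number of percentage points, but only the first amounts to a doubling; the second is closer to a 1.5x improvement.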
Practical Implications
- Tooling for developers: When building AI‑assisted scientific software (e.g., symbolic math assistants, automated theorem provers), expose a compact “experience API” rather than dumping large documentation blobs.
- Runtime efficiency: Genes are tiny (often under a kilobyte) compared to Skill packages (often over a megabyte), which reduces parsing time and memory footprint; this matters for edge or cloud‑function deployments.
- Continuous improvement pipelines: Systems can automatically ingest failure warnings (e.g., “division‑by‑zero at step 3”) into the Gene, enabling online refinement without retraining the whole model.
- Version control & reproducibility: Because Genes are structured, they can be diff‑tracked (like code) and rolled back, making audit trails for scientific computation more manageable.
- Cross‑domain transfer: A well‑designed Gene can be ported between related problem families (e.g., from ODE solving to PDE discretization) with minimal adaptation, accelerating productization of research prototypes.
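As an illustration of the version‑control point, the sketch below serializes two versions of a Gene to canonical JSON and diffs them with Python's standard `difflib`. The Gene contents, the field names, and the file labels `gene_v1.json` and `gene_v2.json` are assumptions made for this example; the paper's actual schema may differ.

```python
import difflib
import json


def serialize(gene: dict) -> list:
    """Render a Gene as canonical JSON lines so diffs show only real changes."""
    # sort_keys fixes the field order regardless of how the dict was built.
    return json.dumps(gene, indent=2, sort_keys=True).splitlines()


# Hypothetical Gene before and after ingesting one failure warning.
old_gene = {
    "task_family": "ode_solving",
    "params": {"step_size": 0.01},
    "warnings": [],
}
new_gene = {
    "task_family": "ode_solving",
    "params": {"step_size": 0.01},
    "warnings": ["division-by-zero at step 3"],
}

diff = list(
    difflib.unified_diff(
        serialize(old_gene),
        serialize(new_gene),
        fromfile="gene_v1.json",
        tofile="gene_v2.json",
        lineterm="",
    )
)
print("\n".join(diff))
```

Because the serialized form is small and deterministic, the resulting diff is short enough to review by hand, and rolling back simply means restoring the earlier file.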
Limitations & Future Work
- Domain scope: Experiments focus on scientific code‑solving; results may not directly translate to NLP or vision tasks without further validation.
- Gene design heuristics: The paper proposes a specific Gene schema; discovering optimal schemas for other domains remains an open question.
- Scalability of evolution: While compact warnings work well now, the authors note that handling large, heterogeneous failure logs may require hierarchical Gene structures.
- Human interpretability: Genes are intentionally terse, which can make manual debugging harder; future work could explore hybrid representations that retain compactness while offering richer explanations.
Bottom line: For developers building AI systems that need to learn from past runs, the takeaway is clear: encode experience as a small, well‑structured “gene” rather than a bulky documentation package. This not only yields better immediate performance but also sets the stage for efficient, iterative improvement in production environments.
Authors
- Junjie Wang
- Yiming Ren
- Haoyang Zhang
Paper Information
- arXiv ID: 2604.15097v1
- Categories: cs.SE, cs.CL
- Published: April 16, 2026