[Paper] ThetaEvolve: Test-time Learning on Open Problems
Source: arXiv - 2511.23473v1
Overview
ThetaEvolve is an open‑source framework that lets a single large language model (LLM) learn while it solves open‑ended mathematical optimization problems. By combining test‑time in‑context learning with reinforcement‑learning (RL) updates, the model can iteratively improve its own problem‑solving strategies and achieve record‑breaking bounds on classic challenges such as circle packing and the first auto‑correlation inequality.
Key Contributions
- Unified test‑time learning loop: Merges in‑context prompting and RL updates into a single pipeline that runs at inference time.
- Single‑model efficiency: Demonstrates that an 8‑billion‑parameter open‑source model (DeepSeek‑R1‑0528‑Qwen3‑8B) can surpass the performance of much larger, closed‑source ensembles used by AlphaEvolve.
- Scalable exploration: Introduces a massive program database and batch sampling to dramatically increase throughput during search.
- Stability tricks: Implements lazy penalties to discourage repetitive outputs and optional reward shaping for smoother RL signals.
- Generalization evidence: Shows that RL‑trained checkpoints not only excel on the trained task but also transfer to unseen open problems.
Methodology
- Program Database – A large pool of candidate programs (e.g., mathematical constructions) serves as the search’s memory; the LLM samples from this pool as a starting point for each trial.
- In‑Context Prompting – For each batch, the model receives a prompt that includes the current best solution, a few recent attempts, and the problem definition. This lets the model “reason” about what worked and what didn’t.
- Batch Sampling – Rather than running a single sequential search, ThetaEvolve samples many candidate programs from the LLM in parallel, boosting throughput.
- Reward Computation – Each generated program is executed (or analytically evaluated) to compute a numeric reward (e.g., a tighter packing density); this scoring step, together with the lazy penalty and optional shaping below, is sketched in code after this list.
- Lazy Penalties – If a batch produces duplicate or stagnant solutions, a small penalty is subtracted from the reward to push the model toward novelty.
- RL Update at Test Time – Using a lightweight policy‑gradient algorithm (e.g., REINFORCE), the model’s parameters are nudged toward the actions that yielded higher rewards, even while the model continues to serve inference requests (see the loop sketch below).
- Optional Reward Shaping – For especially noisy tasks, a smoothed version of the reward (e.g., moving‑average baseline) can be supplied to reduce variance.
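To make the scoring step concrete, here is a minimal Python sketch of reward computation with a lazy penalty and an optional moving‑average baseline. It is an illustration under assumptions, not the paper’s implementation: the function names (`score_batch`, `evaluate_program`), the hash‑based duplicate check, and the deque‑based baseline are all placeholders chosen for clarity.

```python
from collections import deque

def score_batch(programs, evaluate_program, seen_hashes,
                lazy_penalty=0.1, baseline=None):
    """Score a batch of candidate programs.

    evaluate_program(program) -> float : task-specific objective,
        e.g. achieved packing density (assumed interface).
    seen_hashes : set of hashes of previously scored programs,
        used to flag duplicate ("lazy") outputs.
    baseline : optional deque of recent rewards, used as a
        moving-average baseline for reward shaping.
    """
    shaped = []
    for prog in programs:
        reward = evaluate_program(prog)          # raw objective value

        # Lazy penalty: discourage resubmitting duplicate programs.
        h = hash(prog)
        if h in seen_hashes:
            reward -= lazy_penalty
        seen_hashes.add(h)

        # Optional shaping: subtract a moving average of recent rewards
        # to reduce the variance of the RL signal.
        if baseline is not None:
            mean = sum(baseline) / len(baseline) if baseline else 0.0
            shaped.append(reward - mean)
            baseline.append(reward)
        else:
            shaped.append(reward)
    return shaped

# Toy usage (reward = program length, purely illustrative):
rewards = score_batch(["prog_a", "prog_bb", "prog_a"],
                      evaluate_program=lambda p: 0.01 * len(p),
                      seen_hashes=set(),
                      baseline=deque(maxlen=64))
```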
The whole loop runs repeatedly until a stopping criterion (time budget or convergence) is met, allowing the model to “evolve” its own solving tactics on the fly.
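Putting the pieces together, the loop can be pictured schematically as follows. This is a sketch under stated assumptions, not the released code: `generate_candidates`, `score_batch`, and `policy_update` are hypothetical callables standing in for the LLM backend, the evaluator, and the test‑time RL step, and the prompt format is invented for illustration.

```python
import time

def theta_evolve_loop(problem, database, generate_candidates, score_batch,
                      policy_update, batch_size=16, time_budget_s=3600):
    """Schematic test-time learning loop (not the released implementation).

    database            : list of (program, reward) pairs that grows during search.
    generate_candidates : callable (prompt, n) -> list[str]; wraps the LLM backend.
    score_batch         : callable (programs) -> list[float]; executes/evaluates them.
    policy_update       : callable (prompts, programs, rewards); a REINFORCE-style
                          parameter update applied while serving continues.
    """
    start = time.time()
    best_program, best_reward = "", float("-inf")

    while time.time() - start < time_budget_s:
        # 1. In-context prompt: problem statement, current best, recent attempts.
        recent = [prog for prog, _ in database[-3:]]
        prompt = (f"Problem: {problem}\n"
                  f"Best solution so far (reward {best_reward}):\n{best_program}\n"
                  "Recent attempts:\n" + "\n---\n".join(recent) +
                  "\nPropose an improved program.")

        # 2. Batch sampling: draw many candidates in parallel for throughput.
        candidates = generate_candidates(prompt, batch_size)

        # 3. Reward computation (lazy penalty / shaping live inside score_batch).
        rewards = score_batch(candidates)

        # 4. Grow the program database and track the running best.
        for prog, r in zip(candidates, rewards):
            database.append((prog, r))
            if r > best_reward:
                best_program, best_reward = prog, r

        # 5. Test-time RL update: reinforce the higher-reward generations.
        policy_update([prompt] * len(candidates), candidates, rewards)

    return best_program, best_reward
```

In practice, `generate_candidates` would wrap an inference server and `policy_update` a lightweight trainer; the sketch only shows how sampling, scoring, database growth, and the test‑time update interleave.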
Results & Findings
- Record bounds: ThetaEvolve with the 8B‑parameter model surpassed the best results reported by AlphaEvolve on two benchmark problems (circle packing and the first auto‑correlation inequality).
- Consistent gains: Across two LLMs and four open‑ended tasks, the RL‑augmented version outperformed pure inference baselines by 10‑30 % in final reward.
- Faster convergence: RL‑trained checkpoints reached high‑quality solutions in fewer iterations than the baseline, indicating that the model internalized useful heuristics.
- Cross‑task transfer: Checkpoints fine‑tuned on one problem also showed improved performance on other, previously unseen problems, suggesting that the learned “evolutionary” behavior is somewhat generic.
Practical Implications
- Cost‑effective research: Smaller, open‑source models can now compete with massive proprietary ensembles, lowering the barrier for academic and industry teams to explore automated theorem proving or combinatorial optimization.
- Continuous improvement services: Developers can embed ThetaEvolve into SaaS platforms where the model keeps learning from user‑submitted challenges, delivering ever‑better solutions without retraining from scratch.
- Automated design pipelines: Fields such as chip layout, material packing, or signal processing often involve open optimization problems; ThetaEvolve could act as a plug‑and‑play optimizer that self‑tunes during deployment.
- Open‑source ecosystem: The publicly released code and program database invite outside contributions, fostering a collaborative “evolutionary AI” community.
Limitations & Future Work
- Scalability ceiling: While batch sampling improves throughput, the approach still hinges on executing many candidate programs, which becomes a bottleneck when individual evaluations are expensive.
- Reward noise: For problems where the objective is noisy or hard to compute exactly, RL updates may be unstable despite the lazy‑penalty and shaping tricks.
- Model size trade‑offs: The current results are demonstrated on an 8B‑parameter model; it remains unclear how the method scales down to much smaller models or up to much larger systems.
- Generalization scope: Transfer to completely different domains (e.g., symbolic integration) needs systematic study. Future work could explore meta‑learning across diverse problem families and integrate more sophisticated exploration strategies (e.g., curiosity‑driven sampling).
ThetaEvolve opens a practical pathway for developers to harness the adaptive power of LLMs in solving open‑ended, mathematically intensive tasks—turning inference‑only models into self‑improving problem solvers.
Authors
- Yiping Wang
- Shao‑Rong Su
- Zhiyuan Zeng
- Eva Xu
- Liliang Ren
- Xinyu Yang
- Zeyi Huang
- Xuehai He
- Luyao Ma
- Baolin Peng
- Hao Cheng
- Pengcheng He
- Weizhu Chen
- Shuohang Wang
- Simon Shaolei Du
- Yelong Shen
Paper Information
- arXiv ID: 2511.23473v1
- Categories: cs.LG, cs.CL
- Published: November 28, 2025