[Paper] ThetaEvolve: Test-time Learning on Open Problems
Source: arXiv - 2511.23473v1
Overview
ThetaEvolve is an open‑source framework that lets a single large language model (LLM) learn while it solves open‑ended mathematical optimization problems. By combining test‑time in‑context learning with reinforcement‑learning (RL) updates, the model can iteratively improve its own problem‑solving strategies and achieve record‑breaking bounds on classic challenges such as circle packing and the first auto‑correlation inequality.
Key Contributions
- Unified test‑time learning loop: Merges in‑context prompting and RL updates into a single pipeline that runs at inference time.
- Single‑model efficiency: Demonstrates that an 8‑billion‑parameter open‑source model (DeepSeek‑R1‑0528‑Qwen3‑8B) can surpass the performance of much larger, closed‑source ensembles used by AlphaEvolve.
- Scalable exploration: Introduces a massive program database and batch sampling to dramatically increase throughput during search.
- Stability tricks: Implements lazy penalties to discourage repetitive outputs and optional reward shaping for smoother RL signals.
- Generalization evidence: Shows that RL‑trained checkpoints not only excel on the trained task but also transfer to unseen open problems.
Methodology
- Program Database – A large pool of candidate programs (e.g., mathematical constructions) serves as the search’s memory; the LLM samples from this pool as a starting point for each trial.
- In‑Context Prompting – For each batch, the model receives a prompt that includes the current best solution, a few recent attempts, and the problem definition. This lets the model “reason” about what worked and what didn’t.
- Batch Sampling – Rather than running a single sequential search, ThetaEvolve samples many candidate programs from the LLM in parallel, boosting throughput.
- Reward Computation – Each generated program is executed (or analytically evaluated) to compute a numeric reward (e.g., a tighter packing density); this scoring step, together with the lazy penalty and optional shaping below, is sketched in code after this list.
- Lazy Penalties – If a batch produces duplicate or stagnant solutions, a small penalty is subtracted from the reward to push the model toward novelty.
- RL Update at Test Time – Using a lightweight policy‑gradient algorithm (e.g., REINFORCE), the model’s parameters are nudged toward the actions that yielded higher rewards, even while the model continues to serve inference requests (see the loop sketch below).
- Optional Reward Shaping – For especially noisy tasks, a smoothed version of the reward (e.g., moving‑average baseline) can be supplied to reduce variance.
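To make the scoring step concrete, here is a minimal Python sketch of reward computation with a lazy penalty and an optional moving‑average baseline. It is an illustration under assumptions, not the paper’s implementation: the function names (`score_batch`, `evaluate_program`), the hash‑based duplicate check, and the deque‑based baseline are all placeholders chosen for clarity.

```python
from collections import deque

def score_batch(programs, evaluate_program, seen_hashes,
                lazy_penalty=0.1, baseline=None):
    """Score a batch of candidate programs.

    evaluate_program(program) -> float : task-specific objective,
        e.g. achieved packing density (assumed interface).
    seen_hashes : set of hashes of previously scored programs,
        used to flag duplicate ("lazy") outputs.
    baseline : optional deque of recent rewards, used as a
        moving-average baseline for reward shaping.
    """
    shaped = []
    for prog in programs:
        reward = evaluate_program(prog)          # raw objective value

        # Lazy penalty: discourage resubmitting duplicate programs.
        h = hash(prog)
        if h in seen_hashes:
            reward -= lazy_penalty
        seen_hashes.add(h)

        # Optional shaping: subtract a moving average of recent rewards
        # to reduce the variance of the RL signal.
        if baseline is not None:
            mean = sum(baseline) / len(baseline) if baseline else 0.0
            shaped.append(reward - mean)
            baseline.append(reward)
        else:
            shaped.append(reward)
    return shaped

# Toy usage (reward = program length, purely illustrative):
rewards = score_batch(["prog_a", "prog_bb", "prog_a"],
                      evaluate_program=lambda p: 0.01 * len(p),
                      seen_hashes=set(),
                      baseline=deque(maxlen=64))
```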
The whole loop runs repeatedly until a stopping criterion (time budget or convergence) is met, allowing the model to “evolve” its own solving tactics on the fly.
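Putting the pieces together, the loop can be pictured schematically as follows. This is a sketch under stated assumptions, not the released code: `generate_candidates`, `score_batch`, and `policy_update` are hypothetical callables standing in for the LLM backend, the evaluator, and the test‑time RL step, and the prompt format is invented for illustration.

```python
import time

def theta_evolve_loop(problem, database, generate_candidates, score_batch,
                      policy_update, batch_size=16, time_budget_s=3600):
    """Schematic test-time learning loop (not the released implementation).

    database            : list of (program, reward) pairs that grows during search.
    generate_candidates : callable (prompt, n) -> list[str]; wraps the LLM backend.
    score_batch         : callable (programs) -> list[float]; executes/evaluates them.
    policy_update       : callable (prompts, programs, rewards); a REINFORCE-style
                          parameter update applied while serving continues.
    """
    start = time.time()
    best_program, best_reward = "", float("-inf")

    while time.time() - start < time_budget_s:
        # 1. In-context prompt: problem statement, current best, recent attempts.
        recent = [prog for prog, _ in database[-3:]]
        prompt = (f"Problem: {problem}\n"
                  f"Best solution so far (reward {best_reward}):\n{best_program}\n"
                  "Recent attempts:\n" + "\n---\n".join(recent) +
                  "\nPropose an improved program.")

        # 2. Batch sampling: draw many candidates in parallel for throughput.
        candidates = generate_candidates(prompt, batch_size)

        # 3. Reward computation (lazy penalty / shaping live inside score_batch).
        rewards = score_batch(candidates)

        # 4. Grow the program database and track the running best.
        for prog, r in zip(candidates, rewards):
            database.append((prog, r))
            if r > best_reward:
                best_program, best_reward = prog, r

        # 5. Test-time RL update: reinforce the higher-reward generations.
        policy_update([prompt] * len(candidates), candidates, rewards)

    return best_program, best_reward
```

In practice, `generate_candidates` would wrap an inference server and `policy_update` a lightweight trainer; the sketch only shows how sampling, scoring, database growth, and the test‑time update interleave.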
Results & Findings
- Record bounds: ThetaEvolve with the 8B‑parameter model surpassed the best results reported by AlphaEvolve on two benchmark problems (circle packing and the first auto‑correlation inequality).
- Consistent gains: Across two LLMs and four open‑ended tasks, the RL‑augmented version outperformed pure inference baselines by 10‑30 % in final reward.
- Faster convergence: RL‑trained checkpoints reached high‑quality solutions in fewer iterations than the baseline, indicating that the model internalized useful heuristics.
- Cross‑task transfer: Checkpoints fine‑tuned on one problem also showed improved performance on other, previously unseen problems, suggesting that the learned “evolutionary” behavior is somewhat generic.
Practical Implications
- Cost‑effective research: Smaller, open‑source models can now compete with massive proprietary ensembles, lowering the barrier for academic and industry teams to explore automated theorem proving or combinatorial optimization.
- Continuous improvement services: Developers can embed ThetaEvolve into SaaS platforms where the model keeps learning from user‑submitted challenges, delivering ever‑better solutions without retraining from scratch.
- Automated design pipelines: Fields such as chip layout, material packing, or signal processing often involve open optimization problems; ThetaEvolve could act as a plug‑and‑play optimizer that self‑tunes during deployment.
- Open‑source ecosystem: The publicly released code and program database invite outside contributions, fostering a collaborative “evolutionary AI” community.
Limitations & Future Work
- Scalability ceiling: While batch sampling improves throughput, the approach still hinges on executing many candidate programs, which becomes a bottleneck when individual evaluations are expensive.
- Reward noise: For problems where the objective is noisy or hard to compute exactly, RL updates may be unstable despite the lazy‑penalty and shaping tricks.
- Model size trade‑offs: The current results are demonstrated on an 8B‑parameter model; it remains unclear how the method scales down to much smaller models or up to much larger systems.
- Generalization scope: Transfer to completely different domains (e.g., symbolic integration) needs systematic study. Future work could explore meta‑learning across diverse problem families and integrate more sophisticated exploration strategies (e.g., curiosity‑driven sampling).
ThetaEvolve opens a practical pathway for developers to harness the adaptive power of LLMs in solving open‑ended, mathematically intensive tasks—turning inference‑only models into self‑improving problem solvers.
Authors
- Yiping Wang
- Shao‑Rong Su
- Zhiyuan Zeng
- Eva Xu
- Liliang Ren
- Xinyu Yang
- Zeyi Huang
- Xuehai He
- Luyao Ma
- Baolin Peng
- Hao Cheng
- Pengcheng He
- Weizhu Chen
- Shuohang Wang
- Simon Shaolei Du
- Yelong Shen
Paper Information
- arXiv ID: 2511.23473v1
- Categories: cs.LG, cs.CL
- Published: November 28, 2025