[Paper] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Source: arXiv - 2605.08083v1
Overview
The paper introduces AutoTTS, an automated framework that discovers test‑time scaling (TTS) strategies for large language models (LLMs). Instead of hand‑crafting heuristics for allocating extra computation during inference, AutoTTS lets an agent explore a compact “environment” and learn when to expand, prune, or stop reasoning, achieving better accuracy‑cost trade‑offs on math‑reasoning tasks.
Key Contributions
- Environment‑driven TTS discovery: Shifts the design focus from static heuristics to a searchable environment where strategies can be automatically synthesized.
- Controller synthesis formulation: Models width‑depth TTS as a controller that decides actions (branch, continue, probe, prune, stop) over pre‑collected reasoning trajectories and cheap probe signals.
- Beta‑parameterization: Introduces a tractable, fine‑grained representation of controller policies that makes the search space manageable.
- Trace‑level feedback: Provides inexpensive, frequent diagnostics that help the search algorithm understand why a candidate TTS program fails.
- Empirical gains: Discovered strategies outperform strong hand‑crafted baselines on several mathematical reasoning benchmarks while using only $39.9 of compute and 160 minutes of search time.
- Generalization: The learned policies transfer to unseen benchmarks and larger model sizes without re‑training.
Methodology
- Data Collection – The authors first run an LLM on a set of math problems, recording full reasoning trajectories (the sequence of intermediate steps) and lightweight probe signals (e.g., confidence scores).
- Environment Construction – These trajectories become a simulated “world” where a controller can experiment with different TTS actions without invoking the LLM again, dramatically reducing evaluation cost.
- Controller Design – The controller is a small program that, at each step, chooses one of five actions:
  - Branch – explore multiple reasoning paths (width).
  - Continue – keep the current path (depth).
  - Probe – request a cheap signal to gauge progress.
  - Prune – discard unpromising branches.
  - Stop – output the answer.
- Beta Parameterization – Instead of searching over arbitrary programs, the policy is expressed as a set of beta‑distributed probabilities governing each action, turning the search into a continuous optimization problem.
- Search Algorithm – A gradient‑based or evolutionary optimizer explores the beta‑parameter space, using the cheap trace feedback to evaluate each candidate quickly.
- Evaluation – The best discovered controllers are then run on the real LLM (full inference) to measure true accuracy and compute cost.
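The replay idea behind the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` schema, the threshold-based controller, the function names `replay`, `evaluate`, and `grid_search`, and the toy data are all hypothetical. In particular, a plain grid search stands in for the paper's beta-parameterized search, which is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass
class Trajectory:
    """One pre-collected reasoning path (hypothetical schema)."""
    answer: str   # final answer reached at the end of the path
    probes: list  # cheap confidence signal in [0, 1], one per step (non-empty)


def replay(trajs, width, prune_below, stop_votes):
    """Replay recorded paths under a simple rule-based controller.

    Returns (answer or None, steps consumed). No LLM is called.
    """
    # Branch: open the first `width` recorded paths.
    active = [(i, 0) for i in range(min(width, len(trajs)))]
    votes, cost = Counter(), 0
    while active:
        survivors = []
        for idx, step in active:
            traj = trajs[idx]
            cost += 1                    # Continue: consume one recorded step
            signal = traj.probes[step]   # Probe: read the cheap signal
            if step + 1 == len(traj.probes):
                votes[traj.answer] += 1  # path finished, record its answer
            elif signal >= prune_below:
                survivors.append((idx, step + 1))
            # otherwise Prune: the path is dropped
        # Stop: enough finished paths agree on one answer.
        if votes and votes.most_common(1)[0][1] >= stop_votes:
            break
        active = survivors
    answer = votes.most_common(1)[0][0] if votes else None
    return answer, cost


def evaluate(dataset, params):
    """Mean accuracy and mean step cost over (trajectories, gold) pairs."""
    results = [(replay(trajs, *params), gold) for trajs, gold in dataset]
    acc = sum(ans == gold for (ans, _), gold in results) / len(results)
    cost = sum(c for (_, c), _ in results) / len(results)
    return acc, cost


def grid_search(dataset, cost_weight=0.01):
    """Exhaustive search over a tiny grid (stand-in for the paper's search)."""
    grid = product([1, 2, 3], [0.0, 0.3, 0.5], [1, 2])

    def score(params):
        acc, cost = evaluate(dataset, params)
        return acc - cost_weight * cost  # reward accuracy, penalize compute

    return max(grid, key=score)


# Toy data: one problem with three recorded paths (made up for illustration).
toy = [
    ([Trajectory("42", [0.6, 0.7, 0.9]),
      Trajectory("41", [0.2, 0.3, 0.4, 0.5]),
      Trajectory("42", [0.5, 0.8])], "42"),
]
best = grid_search(toy)
print(best, evaluate(toy, best))
```

Because every candidate controller is scored by replaying stored steps and signals, evaluating a new setting costs only a loop over cached data, which is what makes a search over many candidates affordable.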
Results & Findings
| Benchmark | Baseline (hand‑crafted TTS) | AutoTTS (discovered) | Change vs. baseline (accuracy, compute) |
|---|---|---|---|
| GSM‑8K (LLM‑7B) | 71.2 % @ 1.0× compute | 74.8 % @ 0.85× compute | +3.6 pp accuracy, –15 % compute |
| MATH (LLM‑13B) | 44.5 % @ 1.2× compute | 48.1 % @ 1.0× compute | +3.6 pp accuracy, ≈ –17 % compute |
| Held‑out benchmark (LLM‑13B) | 38.0 % | 41.2 % | +3.2 pp accuracy (no extra tuning) |
- The discovery process cost only $39.9 in cloud compute and finished in ≈160 minutes.
- Policies learned on a 7B model transferred to a 13B model with negligible loss.
- Ablation studies showed that beta‑parameterization and trace feedback each contributed roughly 1 percentage point of accuracy.
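The accuracy and compute deltas in the table follow from simple arithmetic on the reported figures. The snippet below is a small sketch (the helper name `compare` is hypothetical) that recomputes them: accuracy gain as a difference in percentage points, and compute change as the ratio of the two compute multipliers. The table reports the compute change as a whole-number percentage.

```python
def compare(base_acc, base_cost, new_acc, new_cost):
    """Return (accuracy gain in percentage points, relative compute change in %)."""
    gain_pp = new_acc - base_acc
    compute_change = (new_cost / base_cost - 1.0) * 100.0
    return round(gain_pp, 1), round(compute_change, 1)


# (baseline accuracy %, baseline compute x, AutoTTS accuracy %, AutoTTS compute x)
rows = {
    "GSM-8K": (71.2, 1.0, 74.8, 0.85),
    "MATH": (44.5, 1.2, 48.1, 1.0),
}
for name, row in rows.items():
    print(name, compare(*row))
# GSM-8K (3.6, -15.0)
# MATH (3.6, -16.7)
```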
Practical Implications
- Developer Tooling – AutoTTS could be packaged as a plug‑in for inference pipelines (e.g., LangChain, llama.cpp) that automatically decides when to request extra reasoning steps, saving compute without sacrificing answer quality.
- Cost‑Effective Scaling – Cloud providers and SaaS AI platforms can adopt the framework to offer “smart scaling” options, charging users only for the compute that truly improves outcomes.
- Rapid Prototyping – Teams building domain‑specific LLM assistants (finance, legal, education) can use AutoTTS to automatically tailor TTS heuristics to their data, avoiding the need to hand‑craft them.
- Benchmarking & Research – The environment‑driven approach provides a low‑cost sandbox for testing novel TTS ideas, accelerating research on adaptive inference.
Limitations & Future Work
- Domain Specificity – Experiments focus on mathematical reasoning; it remains to be shown how well the approach works for open‑ended generation or retrieval‑augmented tasks.
- Environment Fidelity – The simulated environment relies on pre‑collected trajectories; if the underlying LLM changes (e.g., new version), the environment may need to be rebuilt.
- Search Scalability – While cheap for the studied models, scaling the discovery process to multi‑modal LLMs or extremely large models could require more sophisticated optimization techniques.
- User Control – The discovered policies are opaque; future work could add interpretability layers so developers can understand and constrain the controller’s behavior.
AutoTTS demonstrates that letting an agent explore a well‑designed inference environment can automatically uncover smarter ways to allocate compute at test time, opening a path toward more efficient, cost‑aware LLM deployments.
Authors
- Tong Zheng
- Haolin Liu
- Chengsong Huang
- Huiwen Bao
- Sheng Zhang
- Rui Liu
- Runpeng Dai
- Ruibo Chen
- Chenxi Liu
- Tianyi Xiong
- Xidong Wu
- Hongming Zhang
- Heng Huang
Paper Information
- arXiv ID: 2605.08083v1
- Categories: cs.CL
- Published: May 8, 2026