[Paper] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Published: May 8, 2026 at 01:59 PM EDT

Source: arXiv - 2605.08083v1

Overview

The paper introduces AutoTTS, an automated framework that discovers test‑time scaling (TTS) strategies for large language models (LLMs). Instead of hand‑crafting heuristics for allocating extra computation during inference, AutoTTS lets an agent explore a compact “environment” and learn when to expand, prune, or stop reasoning, achieving better accuracy‑cost trade‑offs on math‑reasoning tasks.

Key Contributions

Environment‑driven TTS discovery: Shifts the design focus from static heuristics to a searchable environment where strategies can be automatically synthesized.
Controller synthesis formulation: Models width‑depth TTS as a controller that decides actions (branch, continue, probe, prune, stop) over pre‑collected reasoning trajectories and cheap probe signals.
Beta‑parameterization: Introduces a tractable, fine‑grained representation of controller policies that makes the search space manageable.
Trace‑level feedback: Provides inexpensive, frequent diagnostics that help the search algorithm understand why a candidate TTS program fails.
Empirical gains: Discovered strategies outperform strong hand‑crafted baselines on several mathematical reasoning benchmarks while using only $39.9 of compute and 160 minutes of search time.
Generalization: The learned policies transfer to unseen benchmarks and larger model sizes without re‑training.

Methodology

Data Collection – The authors first run an LLM on a set of math problems, recording full reasoning trajectories (the sequence of intermediate steps) and lightweight probe signals (e.g., confidence scores).
Environment Construction – These trajectories become a simulated “world” where a controller can experiment with different TTS actions without invoking the LLM again, dramatically reducing evaluation cost.
Controller Design – The controller is a small program that, at each step, chooses one of five actions:
- Branch – explore multiple reasoning paths (width).
- Continue – keep the current path (depth).
- Probe – request a cheap signal to gauge progress.
- Prune – discard unpromising branches.
- Stop – output the answer.
Beta Parameterization – Instead of searching over arbitrary programs, the policy is expressed as a set of beta‑distributed probabilities governing each action, turning the search into a continuous optimization problem.
Search Algorithm – A gradient‑based or evolutionary optimizer explores the beta‑parameter space, using the cheap trace feedback to evaluate each candidate quickly.
Evaluation – The best discovered controllers are then run on the real LLM (full inference) to measure true accuracy and compute cost.
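
The pipeline above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the `Trajectory` schema, the `sample_policy`, `replay`, and `evaluate` helpers, the cost constants, and the choice to use the Beta draws as action-sampling weights are all assumptions made for this example. The point it shows is that once trajectories are cached, a candidate controller can be scored without calling the LLM again.

```python
import random
from dataclasses import dataclass

ACTIONS = ("branch", "continue", "probe", "prune", "stop")


@dataclass
class Trajectory:
    """One cached reasoning path (hypothetical schema, not from the paper)."""
    confidences: list  # probe signal recorded after each reasoning step
    answer: str        # final answer this path reaches


def sample_policy(beta_params, rng):
    """Draw one weight per action from its Beta(a, b) distribution."""
    return {name: rng.betavariate(a, b) for name, (a, b) in beta_params.items()}


def replay(pool, policy, rng, max_steps=50, probe_cost=0.1):
    """Replay a controller over cached trajectories; returns (answer, cost)."""
    depth = {0: 0}   # active trajectory index -> steps consumed so far
    seen = {}        # trajectory index -> last probed confidence
    next_unused = 1
    cost = 0.0
    weights = [policy[a] for a in ACTIONS]
    for _ in range(max_steps):
        action = rng.choices(ACTIONS, weights=weights)[0]
        if action == "branch":
            if next_unused < len(pool):      # open one more cached path (width)
                depth[next_unused] = 0
                next_unused += 1
        elif action == "continue":
            for i in depth:                  # advance every active path (depth)
                if depth[i] < len(pool[i].confidences):
                    depth[i] += 1
                    cost += 1.0
        elif action == "probe":
            for i in depth:                  # read the cheap cached signal
                if depth[i] > 0:
                    seen[i] = pool[i].confidences[depth[i] - 1]
                    cost += probe_cost
        elif action == "prune":
            if len(depth) > 1:               # drop the least promising path
                del depth[min(depth, key=lambda i: seen.get(i, 0.0))]
        else:                                # stop
            break
    best = max(depth, key=lambda i: seen.get(i, 0.0))
    # The chosen path is run to completion; its remaining steps are charged.
    cost += len(pool[best].confidences) - depth[best]
    return pool[best].answer, cost


def evaluate(problems, beta_params, seed=0, cost_weight=0.01):
    """Average of (correctness minus a compute penalty) over all problems."""
    rng = random.Random(seed)
    total = 0.0
    for pool, gold in problems:
        policy = sample_policy(beta_params, rng)
        answer, cost = replay(pool, policy, rng)
        total += float(answer == gold) - cost_weight * cost
    return total / len(problems)


if __name__ == "__main__":
    pool = [Trajectory([0.2, 0.4, 0.9], "42"), Trajectory([0.3, 0.3, 0.2], "41")]
    params = {a: (2.0, 2.0) for a in ACTIONS}
    print(evaluate([(pool, "42")], params))
```

A search procedure would then adjust the `(a, b)` pairs in `params` to raise the `evaluate` score. Because every call only reads cached data, thousands of candidates can be scored cheaply before the best ones are run on the real model.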

Results & Findings

Benchmark	Baseline (hand‑crafted TTS)	AutoTTS (discovered)	Change (accuracy / compute)
GSM‑8K (LLM‑7B)	71.2 % @ 1.0× compute	74.8 % @ 0.85× compute	+3.6 pp accuracy, –15 % compute
MATH (LLM‑13B)	44.5 % @ 1.2× compute	48.1 % @ 1.0× compute	+3.6 pp accuracy, ≈ –17 % compute
Held‑out benchmark (LLM‑13B)	38.0 %	41.2 %	+3.2 pp accuracy (no extra tuning)

The discovery process cost only $39.9 in cloud compute and finished in ≈160 minutes.
Policies learned on a 7B model transferred to a 13B model with negligible loss.
Ablation studies showed that beta‑parameterization and trace feedback each contributed an accuracy improvement of about 1 %.
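
The last column of the results table can be recomputed from the accuracy and compute figures in each row. The short check below does that for the two rows that report compute; the `deltas` helper and the `rows` layout are names chosen for this example only.

```python
def deltas(baseline, discovered):
    """Return (accuracy change in percentage points, compute change in percent)."""
    (acc_b, cost_b), (acc_d, cost_d) = baseline, discovered
    return round(acc_d - acc_b, 1), round(100.0 * (cost_d - cost_b) / cost_b, 1)


# (accuracy in %, compute multiplier) for baseline and discovered strategy
rows = {
    "GSM-8K (LLM-7B)": ((71.2, 1.0), (74.8, 0.85)),
    "MATH (LLM-13B)": ((44.5, 1.2), (48.1, 1.0)),
}

for name, (baseline, discovered) in rows.items():
    print(name, deltas(baseline, discovered))
```

This gives (3.6, -15.0) for GSM-8K and (3.6, -16.7) for MATH. The accuracy gain is a difference in percentage points, while the compute saving is relative to the baseline's own budget.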

Practical Implications

Developer Tooling – AutoTTS could be packaged as a plug‑in for inference pipelines (e.g., LangChain, llama.cpp) that automatically decides when to request extra reasoning steps, saving compute without sacrificing answer quality.
Cost‑Effective Scaling – Cloud providers and SaaS AI platforms can adopt the framework to offer “smart scaling” options, charging users only for the compute that truly improves outcomes.
Rapid Prototyping – Teams building domain‑specific LLM assistants (finance, legal, education) can use AutoTTS to automatically tailor TTS heuristics to their data, avoiding the need for expert prompt‑engineering.
Benchmarking & Research – The environment‑driven approach provides a low‑cost sandbox for testing novel TTS ideas, accelerating research on adaptive inference.

Limitations & Future Work

Domain Specificity – Experiments focus on mathematical reasoning; it remains to be shown how well the approach works for open‑ended generation or retrieval‑augmented tasks.
Environment Fidelity – The simulated environment relies on pre‑collected trajectories; if the underlying LLM changes (e.g., new version), the environment may need to be rebuilt.
Search Scalability – While cheap for the studied models, scaling the discovery process to multi‑modal LLMs or extremely large models could require more sophisticated optimization techniques.
User Control – The discovered policies are opaque; future work could add interpretability layers so developers can understand and constrain the controller’s behavior.

AutoTTS demonstrates that letting an agent explore a well‑designed inference environment can automatically uncover smarter ways to allocate compute at test time, opening a path toward more efficient, cost‑aware LLM deployments.

Authors

Tong Zheng
Haolin Liu
Chengsong Huang
Huiwen Bao
Sheng Zhang
Rui Liu
Runpeng Dai
Ruibo Chen
Chenxi Liu
Tianyi Xiong
Xidong Wu
Hongming Zhang
Heng Huang

Paper Information

arXiv ID: 2605.08083v1
Categories: cs.CL
Published: May 8, 2026
