[Paper] Agentic Test-Time Scaling for WebAgents

Published: February 12, 2026 at 01:58 PM EST
5 min read
Source: arXiv (2602.12276v1)

Overview

The paper introduces Confidence‑Aware Test‑Time Scaling (CATTS), a lightweight technique that lets web‑automation agents dynamically decide how much extra compute (e.g., more LLM tokens) to spend on each decision step. By allocating resources only when the model is uncertain, CATTS boosts success rates on complex, multi‑step web tasks while cutting inference cost compared to naïve “always‑bigger‑model” approaches.

Key Contributions

  • Empirical study of test‑time scaling for web agents – shows that uniformly increasing per‑step compute quickly hits diminishing returns on long‑horizon tasks.
  • Uncertainty signals from the agent’s own vote distribution (entropy, top‑1/top‑2 margin) strongly correlate with downstream task success.
  • LLM‑based Arbiter that aggregates multiple candidate actions, outperforming simple majority voting but sometimes over‑correcting high‑consensus decisions.
  • CATTS algorithm – a simple rule that triggers extra sampling only when the vote‑derived uncertainty exceeds a threshold, achieving up to 9.1% higher success on the benchmark suites while using roughly 2.3× fewer tokens than uniform scaling.
  • Interpretability – the decision to spend more compute is directly traceable to a measurable uncertainty metric, making it easy for developers to debug and tune.
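The two uncertainty signals above can be computed directly from a step's vote distribution. A minimal sketch (the function name and the use of natural‑log entropy are illustrative assumptions, not the paper's code):

```python
from collections import Counter
import math

def vote_uncertainty(actions):
    """Derive (entropy, top-1/top-2 margin) from k sampled candidate actions.

    Higher entropy means more disagreement among samples; a small margin
    means the top two candidates received nearly equal support.
    """
    counts = Counter(actions)
    k = len(actions)
    probs = [c / k for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return entropy, margin
```

For example, five identical votes yield entropy 0 and margin 1 (full consensus), while five distinct votes yield maximal entropy and margin 0.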

Methodology

  1. Baseline agents – The authors start from the popular ReAct web‑agent framework, which iteratively generates actions (click, type, etc.) using a language model.
  2. Uniform scaling experiments – They increase the number of LLM samples (or tokens) per step uniformly and record success rates across two web‑navigation benchmarks: WebArena‑Lite and GoBrowse.
  3. Vote‑based uncertainty extraction – For each step, the agent samples k candidate actions, computes a vote distribution, and derives:
    • Entropy of the distribution (higher → more disagreement)
    • Top‑1 / top‑2 margin (difference in probability between the most and second‑most voted actions)
  4. Arbiter design – An auxiliary LLM receives the top‑k candidates and the task context, then selects a final action. This tests whether a learned aggregator can beat raw voting.
  5. CATTS rule – Set a threshold on entropy (or margin). If the uncertainty is below the threshold, accept the majority vote; otherwise, invoke the Arbiter (or increase sampling) for that step only.
  6. Evaluation – Measure overall task success, token usage, and latency across the two benchmarks, comparing CATTS to:
    • Baseline ReAct (single sample)
    • Uniform scaling (fixed larger k)
    • Arbiter‑only (always invoke the LLM arbiter).
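Steps 3–5 can be condensed into a single gating function. A sketch under assumed interfaces: `sample_action` and `arbiter` are hypothetical stand‑ins for the agent's LLM sampler and the arbiter LLM, and 0.85 is the entropy threshold reported in the results table:

```python
from collections import Counter
import math

def catts_step(sample_action, arbiter, context, k=5, entropy_threshold=0.85):
    """One CATTS decision step: sample k candidates, gate on vote entropy.

    `sample_action(context)` returns one candidate action string;
    `arbiter(context, ranked_candidates)` returns a final action.
    """
    candidates = [sample_action(context) for _ in range(k)]
    counts = Counter(candidates)
    probs = [c / k for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    if entropy <= entropy_threshold:
        # High consensus: accept the majority vote, spend no extra compute.
        return counts.most_common(1)[0][0]
    # High disagreement: invoke the expensive arbiter for this step only.
    return arbiter(context, counts.most_common())
```

Note that the arbiter is reached only above the threshold, which is exactly the gating that prevents the over‑correction problem observed in the Arbiter‑only setting.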

Results & Findings

| Setting | Success ↑ (WebArena‑Lite) | Success ↑ (GoBrowse) | Tokens per episode ↓ |
| --- | --- | --- | --- |
| ReAct (1 sample) | baseline | baseline | baseline |
| Uniform scaling (k=5) | +3.2% | +2.8% | +120% |
| Arbiter (always) | +5.1% | +4.6% | +180% |
| CATTS (entropy > 0.85) | +9.1% | +8.4% | −57% (≈2.3× fewer tokens) |
  • Uncertainty metrics: Entropy > 0.85 and margin < 0.15 reliably flagged steps that later caused failures.
  • Arbiter over‑correction: When the vote was already highly consensual, the arbiter sometimes flipped a correct action, hurting performance—highlighting the need for a gating mechanism.
  • Efficiency: CATTS achieved the best trade‑off, delivering higher success while dramatically reducing token consumption and inference latency.

Practical Implications

  • Cost‑effective web automation – Developers can run sophisticated agents on modest GPU/CPU budgets by only “spending” extra compute on ambiguous steps.
  • Dynamic reliability – In production systems (e.g., automated form filling, QA bots), CATTS provides a built‑in confidence check, allowing graceful fallback or human‑in‑the‑loop escalation when uncertainty spikes.
  • Generalizable to other multi‑step LLM agents – The vote‑derived uncertainty signal is model‑agnostic, so the same gating logic can be applied to code‑generation assistants, planning agents, or any chain‑of‑thought pipeline.
  • Simplified debugging – Since the trigger is a transparent entropy threshold, engineers can log which steps caused extra sampling and inspect the underlying candidate actions.
  • Potential for adaptive latency budgets – Real‑time services can cap the maximum extra compute per step, guaranteeing response time while still reaping most of the accuracy gains.
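The adaptive‑latency idea in the last bullet can be enforced with a simple per‑episode escalation budget. A sketch with hypothetical `decide`/`escalate` callables and an arbitrary cap of three escalations (neither is specified in the paper):

```python
def run_episode(decide, escalate, steps, max_escalations=3):
    """Run one episode, allowing at most `max_escalations` expensive
    (arbiter / extra-sampling) calls; once the budget is spent, the
    cheap decision is used even on uncertain steps.
    """
    budget = max_escalations
    actions = []
    for ctx in steps:
        action, uncertain = decide(ctx)   # cheap path plus uncertainty flag
        if uncertain and budget > 0:
            budget -= 1                   # spend budget on this step only
            action = escalate(ctx)        # expensive path
        actions.append(action)
    return actions
```

Because the cap bounds the number of expensive calls, worst‑case episode latency stays predictable while most of the accuracy gain from gated escalation is retained.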

Limitations & Future Work

  • Threshold sensitivity – The entropy/margin cut‑off is hand‑tuned; a more principled, possibly learned, calibration could improve robustness across domains.
  • Arbiter scalability – The current arbiter is a separate LLM call; in high‑throughput settings this could become a bottleneck. Future work might explore lightweight classifiers or distillation of the arbiter.
  • Benchmark scope – Experiments focus on web navigation; extending to other sequential tasks (e.g., API orchestration, multi‑modal reasoning) is needed to confirm generality.
  • Long‑horizon compounding – While CATTS mitigates early errors, the paper notes that error propagation still limits performance on very long episodes (> 30 steps). More sophisticated planning or error‑correction mechanisms could be investigated.

Authors

  • Nicholas Lee
  • Lutfi Eren Erdogan
  • Chris Joseph John
  • Surya Krishnapillai
  • Michael W. Mahoney
  • Kurt Keutzer
  • Amir Gholami

Paper Information

  • arXiv ID: 2602.12276v1
  • Categories: cs.AI, cs.CL
  • Published: February 12, 2026