[Paper] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Source: arXiv - 2602.16699v1
Overview
Large language models (LLMs) are now being used as autonomous agents that interact with external tools or environments—think code‑testing loops, web searches, or data‑gathering APIs. When an agent can keep probing the environment, it must decide how much to explore before committing to a final answer, balancing the cost of each probe against the risk of a wrong answer. The paper Calibrate‑Then‑Act: Cost‑Aware Exploration in LLM Agents shows a simple yet powerful way to give LLM agents an explicit “cost‑benefit calculator” that leads to smarter, cheaper decision‑making.
Key Contributions
- Formalization of cost‑uncertainty trade‑offs for LLM agents as sequential decision‑making problems with hidden environment states.
- Calibrate‑Then‑Act (CTA) framework: a two‑step prompting recipe that first supplies the model with a prior estimate of the hidden state (the “calibration” phase) and then lets it choose an action (the “act” phase).
- Empirical validation on two domains:
- Information‑seeking question answering (retrieval‑augmented QA).
- A simplified code‑generation task where the agent decides whether to run a test before submitting code.
- Compatibility with reinforcement learning: CTA improves performance even when both baseline and CTA agents are fine‑tuned with RLHF or other RL methods.
- Open‑source implementation and prompt templates that can be dropped into existing LLM‑agent pipelines.
Methodology
- Problem framing – Each task is modeled as a Markov decision process (MDP) where the true environment state (e.g., the relevance of a retrieved document or the correctness of generated code) is hidden. The agent can take exploratory actions (e.g., issue another search query, run a test) that incur a known cost, or it can commit to an answer.
- Prior inference (Calibration) – Before any action, the LLM receives a short “calibration prompt” that contains:
- A description of the hidden state space.
- A prior distribution (e.g., “there’s a 70 % chance the first retrieved snippet is correct”).
- Any side‑information gathered so far (previous queries, partial test results).
The LLM is asked to update this prior based on the new evidence it just observed.
- Decision making (Act) – Using the updated belief, the LLM is prompted to choose among:
- Explore (e.g., ask another query, run another test).
- Commit (output the final answer).
The prompt explicitly asks the model to compute the expected utility of each option as (probability of correctness × reward) − (exploration cost) and pick the higher one.
- Training & evaluation – The authors compare three setups:
- Baseline: a single prompt that asks the model to answer directly, with optional “self‑ask” loops.
- CTA: the two‑stage calibration‑then‑act prompting.
- RL‑enhanced: both baseline and CTA agents are further fine‑tuned with reinforcement learning using the same reward signal.
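The act-phase decision rule can be sketched as a one-step expected-utility comparison. This is an illustrative reconstruction, not the paper's code; the reward, cost, and post-exploration belief values are made-up numbers for the example.

```python
# Minimal sketch of the Calibrate-Then-Act decision rule: compare the
# expected utility of committing now against one more (costly) probe.
# Reward, cost, and belief values below are illustrative assumptions.

def expected_utility_commit(p_correct: float, reward: float) -> float:
    """Expected utility of committing now: P(correct) * reward."""
    return p_correct * reward

def expected_utility_explore(p_after: float, reward: float, cost: float) -> float:
    """Expected utility of probing once more: improved belief * reward, minus probe cost."""
    return p_after * reward - cost

def cta_step(p_correct: float, p_after: float, reward: float, cost: float) -> str:
    """Choose the action with the higher expected utility."""
    if expected_utility_explore(p_after, reward, cost) > expected_utility_commit(p_correct, reward):
        return "explore"
    return "commit"

# With reward 1.0: a cheap probe (cost 0.05) that lifts the belief
# from 0.70 to 0.90 is worth taking (0.85 > 0.70).
print(cta_step(0.70, 0.90, reward=1.0, cost=0.05))  # explore
# The same belief gain at cost 0.30 is not (0.60 < 0.70).
print(cta_step(0.70, 0.90, reward=1.0, cost=0.30))  # commit
```

In the paper's setup the LLM itself performs this weighing inside the prompt; the sketch just makes the arithmetic the prompt asks for explicit.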
Results & Findings
| Task | Metric | Baseline | CTA | RL‑Baseline | RL‑CTA |
|---|---|---|---|---|---|
| Retrieval‑QA (accuracy) | Exact‑match | 68.2 % | 73.5 % | 70.1 % | 75.8 % |
| Retrieval‑QA (cost) | Avg. #queries per question | 2.8 | 2.1 | 2.6 | 1.9 |
| Simplified coding (pass rate) | Correct after final commit | 61.4 % | 68.9 % | 64.2 % | 71.3 % |
| Coding (average test runs) | #tests executed | 1.7 | 1.2 | 1.5 | 1.1 |
What it means: By making the cost‑benefit calculation explicit, agents explore fewer times while improving accuracy. The advantage persists after RL fine‑tuning, indicating that CTA is not just a prompting trick but a robust bias toward cost‑aware reasoning.
Practical Implications
- Cheaper API usage – For services that charge per call (search, code execution, external tool invocation), CTA can cut the number of calls by 20‑30 % without sacrificing quality.
- Safer code generation – Developers can embed the CTA pattern in Copilot‑style assistants to decide when to auto‑run tests, reducing noisy test spam while catching more bugs before deployment.
- Better user experience in chatbots – A customer‑support bot can ask follow‑up clarification questions only when the expected benefit outweighs the interaction cost, leading to shorter conversations.
- Plug‑and‑play – The framework only requires adding a calibration prompt and a simple utility calculation; it works with any LLM that supports few‑shot prompting (GPT‑4, Claude, Llama‑2, etc.).
- Foundation for cost‑aware RL – CTA provides a clean prior that can be used as a state feature in RL policies, opening the door to more sophisticated budget‑constrained agents.
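One concrete way to implement the calibration phase's belief update is a Bayes-rule posterior over a binary hidden state (e.g., "is the generated code correct?") after observing a test result. The sensitivity and false-pass rates below are assumptions for the sketch, not figures from the paper.

```python
# Illustrative Bayesian update for the calibration phase: the agent holds a
# prior P(correct) and revises it after seeing a test pass or fail.
# p_pass_if_correct / p_pass_if_wrong are assumed test reliabilities.

def update_belief(prior: float, test_passed: bool,
                  p_pass_if_correct: float = 0.95,
                  p_pass_if_wrong: float = 0.20) -> float:
    """Posterior P(correct | test outcome) via Bayes' rule."""
    if test_passed:
        likelihood_correct = p_pass_if_correct
        likelihood_wrong = p_pass_if_wrong
    else:
        likelihood_correct = 1.0 - p_pass_if_correct
        likelihood_wrong = 1.0 - p_pass_if_wrong
    numerator = likelihood_correct * prior
    return numerator / (numerator + likelihood_wrong * (1.0 - prior))

# Starting from the 70% prior used in the calibration-prompt example,
# a passing test sharply raises confidence in the answer.
posterior = update_belief(0.70, test_passed=True)
print(round(posterior, 3))
```

The resulting posterior is exactly the belief the act phase plugs into its expected-utility comparison, which is what makes CTA easy to drop into an existing pipeline.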
Limitations & Future Work
- Prior quality dependence – CTA assumes the calibration prompt can convey a reasonably accurate prior. If the prior is badly mis‑specified, the agent may over‑explore or under‑explore.
- Scalability of the belief space – The current experiments use low‑dimensional hidden states (binary correctness, relevance). Extending CTA to richer latent spaces (e.g., multi‑step program correctness) will require more sophisticated belief updates.
- Human‑in‑the‑loop evaluation – The paper focuses on automated metrics; real‑world user studies are needed to confirm that cost‑aware behavior aligns with user expectations.
- Integration with tool‑use APIs – Future work could explore CTA in more complex tool‑use settings (e.g., database queries, multi‑modal perception) where the cost model is non‑linear.
Bottom line: Calibrate‑Then‑Act gives developers a pragmatic recipe to turn LLM agents into cost‑conscious problem solvers, delivering higher accuracy with fewer expensive calls—a win for both product teams and end users.
Authors
- Wenxuan Ding
- Nicholas Tomlin
- Greg Durrett
Paper Information
- arXiv ID: 2602.16699v1
- Categories: cs.CL, cs.AI
- Published: February 18, 2026