[Paper] CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

Published: March 9, 2026
4 min read

Source: arXiv - 2603.08659v1

Overview

Large language models (LLMs) have shown that giving them more “thinking time” – i.e., more inference tokens – can dramatically boost performance on tough reasoning tasks. But the same extra compute is often wasted on easy questions, leading to unnecessary latency and cost. The paper CODA: Difficulty‑Aware Compute Allocation for Adaptive Reasoning tackles this imbalance by letting the model decide, on the fly, how much reasoning depth each input deserves.

Key Contributions

  • Formal utility framework for adaptive reasoning that stops generation when the expected accuracy gain no longer justifies the extra token cost.
  • CODA algorithm that extracts a difficulty signal from cheap “group rollouts” and translates it into two gating mechanisms (easy‑side & hard‑side) that bias token allocation.
  • Budget‑free operation – no external difficulty labels or user‑specified token budgets are needed; the model self‑regulates compute.
  • Empirical validation across multiple model sizes and benchmark suites, showing >60 % token savings on easy instances without hurting accuracy, and improved performance on hard instances.
  • Generalizable design that can be plugged into any decoder‑style LLM that supports token‑level rewards (e.g., via reinforcement learning from human feedback or policy gradients).
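The utility framework in the first bullet can be sketched as a simple marginal-gain rule: keep generating while the expected accuracy gain from one more reasoning token exceeds its compute cost. The gain estimates and cost constant below are illustrative assumptions, not the paper's exact formulation.

```python
def should_continue(expected_gain: float, token_cost: float) -> bool:
    """Continue decoding only while the expected accuracy gain
    from one more reasoning token exceeds its compute cost."""
    return expected_gain > token_cost

def stopping_step(gains: list[float], token_cost: float) -> int:
    """Return the index of the first step whose marginal gain
    no longer justifies the cost, i.e. the optimal stopping point."""
    for t, gain in enumerate(gains):
        if not should_continue(gain, token_cost):
            return t
    return len(gains)

# Toy example: diminishing marginal gains, per-token cost of 0.05.
gains = [0.30, 0.15, 0.08, 0.04, 0.01]
print(stopping_step(gains, 0.05))  # stops at step 3
```

In this toy run, the fourth step's gain (0.04) falls below the 0.05 cost, so decoding halts after three productive steps; easy inputs, whose gains flatten quickly, stop early for free.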

Methodology

  1. Problem formulation – Treat each inference step as a trade‑off between marginal accuracy gain and incremental compute cost. The optimal stopping point is where the gain falls below the cost.
  2. Difficulty estimation – For a given input, CODA runs a few lightweight “group rollouts” (short, cheap generations) and measures how quickly the model’s confidence stabilizes. Easy inputs converge fast; hard ones need more steps.
  3. Gate construction
    • Easy‑side gate: a non‑negative scalar that penalizes extra tokens when the difficulty estimate is low, effectively encouraging early termination.
    • Hard‑side gate: a complementary scalar that adds a length‑dependent shaping reward when difficulty is high, nudging the policy to keep reasoning longer.
  4. Reward shaping – The base reward is binary (correct/incorrect). CODA augments it with the two gates, producing a difficulty‑aware signal that the policy optimizes via standard RL (e.g., PPO).
  5. Inference loop – During generation, the model continuously evaluates the shaped reward; once the expected utility of another token drops below zero, decoding stops.
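Steps 2-4 above can be sketched together: estimate difficulty from agreement among cheap group rollouts, then shape the binary reward with an easy-side penalty and a hard-side bonus. The disagreement proxy, gate formulas, and constants below are assumptions for clarity, not the paper's exact equations.

```python
def estimate_difficulty(rollout_answers: list[str]) -> float:
    """Difficulty proxy: disagreement among cheap group rollouts.
    0.0 = all rollouts agree (easy); higher = more disagreement (hard)."""
    if not rollout_answers:
        return 1.0
    majority = max(set(rollout_answers), key=rollout_answers.count)
    agreement = rollout_answers.count(majority) / len(rollout_answers)
    return 1.0 - agreement

def shaped_reward(correct: bool, length: int, difficulty: float,
                  easy_penalty: float = 0.01, hard_bonus: float = 0.005,
                  threshold: float = 0.3) -> float:
    """Binary base reward augmented with two difficulty-aware gates:
    - easy-side gate: penalize length when estimated difficulty is low
    - hard-side gate: reward length when estimated difficulty is high."""
    reward = 1.0 if correct else 0.0
    if difficulty < threshold:           # easy-side gate
        reward -= easy_penalty * length
    else:                                # hard-side gate
        reward += hard_bonus * length
    return reward

# Easy input (rollouts agree): long answers are penalized.
d_easy = estimate_difficulty(["42", "42", "42", "42"])
print(shaped_reward(True, length=200, difficulty=d_easy))   # -1.0
# Hard input (rollouts disagree): longer reasoning is encouraged.
d_hard = estimate_difficulty(["42", "17", "9", "42"])
print(shaped_reward(True, length=200, difficulty=d_hard))   # 2.0
```

A standard policy-gradient method such as PPO can then optimize this shaped signal directly; the policy learns to terminate early on inputs where the easy-side penalty dominates.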

Results & Findings

| Setting | Token Reduction (Easy) | Accuracy (Easy) | Accuracy (Hard) |
|---|---|---|---|
| Baseline (fixed length) | N/A | 92.1 % | 78.4 % |
| CODA (adaptive) | 62 % fewer tokens | 91.8 % (≈ baseline) | +2.3 % over baseline |
  • Across model scales (7B‑65B parameters) the pattern holds: larger models still benefit from adaptive stopping.
  • On diverse reasoning benchmarks (e.g., GSM‑8K, MathQA, CommonsenseQA), CODA consistently saves compute on the “easy” split while either preserving or improving scores on the “hard” split.
  • Ablation studies show that removing either gate degrades performance: the easy‑side gate is crucial for token savings, while the hard‑side gate drives the gains on difficult items.

Practical Implications

  • Cost‑effective APIs – Cloud providers can expose an “adaptive reasoning” mode that automatically trims latency and token billing for straightforward queries while preserving depth for complex ones.
  • Real‑time assistants – Voice or chat assistants can respond faster on routine user requests (e.g., factual look‑ups) and allocate more compute only when the conversation turns ambiguous or multi‑step.
  • Energy savings – Reducing unnecessary token generation translates directly into lower GPU utilization and carbon footprint, an increasingly important metric for large‑scale deployments.
  • Plug‑and‑play – Since CODA works at the policy‑level, existing LLM services that already employ RLHF can integrate the gating mechanism with minimal engineering effort.
  • Dynamic SLAs – Service‑level agreements can be made more flexible: developers can specify a maximum acceptable latency and let CODA automatically respect it by cutting off reasoning when the utility drops.
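The "Dynamic SLAs" point can be pictured as a thin wrapper that combines CODA-style utility stopping with a hard token cap. The function and callback names below are hypothetical illustrations, not an actual API from the paper or any library.

```python
from typing import Callable

def generate_adaptive(step_fn: Callable[[list[str]], str],
                      utility_fn: Callable[[list[str]], float],
                      max_tokens: int) -> list[str]:
    """Decode token by token; stop when the expected utility of
    another token drops to zero or below (adaptive, CODA-style stop)
    or when the SLA token cap is reached (hard stop)."""
    tokens: list[str] = []
    while len(tokens) < max_tokens:
        if utility_fn(tokens) <= 0.0:   # adaptive stop
            break
        tokens.append(step_fn(tokens))  # emit one more token
    return tokens

# Toy example: each step emits "tok"; utility decays with length.
out = generate_adaptive(
    step_fn=lambda ts: "tok",
    utility_fn=lambda ts: 0.5 - 0.1 * len(ts),
    max_tokens=100,
)
print(len(out))  # 5
```

Here the adaptive rule fires well before the 100-token cap; on a genuinely hard input whose utility stays positive, the cap would provide the latency guarantee the SLA requires.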

Limitations & Future Work

  • Reliance on rollout quality – The difficulty signal comes from cheap rollouts; if those are noisy (e.g., in highly stochastic models), the gating may misclassify difficulty.
  • No explicit user budget – While budget‑free operation is a strength, some applications may still need hard caps on latency or token count, which CODA does not natively enforce.
  • Evaluation on narrow domains – Experiments focus on standard reasoning benchmarks; real‑world heterogeneous workloads (code generation, multimodal prompts) remain to be tested.
  • Future directions suggested by the authors include: (1) integrating external difficulty predictors (e.g., prompt complexity metrics), (2) extending the framework to multi‑modal or tool‑using agents, and (3) exploring hierarchical gating where coarse‑level difficulty decides whether to invoke a more expensive specialist model.

Authors

  • Siye Wu
  • Jian Xie
  • Yikai Zhang
  • Yanghua Xiao

Paper Information

  • arXiv ID: 2603.08659v1
  • Categories: cs.CL
  • Published: March 9, 2026
  • PDF: Download PDF
