[Paper] CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Source: arXiv - 2603.08659v1
Overview
Large language models (LLMs) have shown that giving them more “thinking time” – i.e., more inference tokens – can dramatically boost performance on tough reasoning tasks. But the same extra compute is often wasted on easy questions, leading to unnecessary latency and cost. The paper CODA: Difficulty‑Aware Compute Allocation for Adaptive Reasoning tackles this imbalance by letting the model decide, on the fly, how much reasoning depth each input deserves.
Key Contributions
- Formal utility framework for adaptive reasoning that stops generation when the expected accuracy gain no longer justifies the extra token cost.
- CODA algorithm that extracts a difficulty signal from cheap “group rollouts” and translates it into two gating mechanisms (easy‑side & hard‑side) that bias token allocation.
- Budget‑free operation – no external difficulty labels or user‑specified token budgets are needed; the model self‑regulates compute.
- Empirical validation across multiple model sizes and benchmark suites, showing >60 % token savings on easy instances without hurting accuracy, and improved performance on hard instances.
- Generalizable design that can be plugged into any decoder‑style LLM that supports token‑level rewards (e.g., via reinforcement learning from human feedback or policy gradients).
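The utility framework behind the first contribution can be written down informally as follows (the notation here is mine, not taken from the paper): generation continues only while the expected marginal accuracy gain of one more token exceeds its cost.

```latex
U(t) = \mathbb{E}[\mathrm{Acc}(t)] - \lambda \, t,
\qquad
t^{*} = \min \{\, t : \mathbb{E}[\mathrm{Acc}(t+1)] - \mathbb{E}[\mathrm{Acc}(t)] < \lambda \,\}
```

Here $t$ is the number of reasoning tokens, $\lambda$ is the per-token compute cost, and $t^{*}$ is the stopping point where the marginal gain first falls below $\lambda$.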
Methodology
- Problem formulation – Treat each inference step as a trade‑off between marginal accuracy gain and incremental compute cost. The optimal stopping point is where the gain falls below the cost.
- Difficulty estimation – For a given input, CODA runs a few lightweight “group rollouts” (short, cheap generations) and measures how quickly the model’s confidence stabilizes. Easy inputs converge fast; hard ones need more steps.
- Gate construction
  - Easy‑side gate: a non‑negative scalar that penalizes extra tokens when the difficulty estimate is low, effectively encouraging early termination.
  - Hard‑side gate: a complementary scalar that adds a length‑dependent shaping reward when difficulty is high, nudging the policy to keep reasoning longer.
- Reward shaping – The base reward is binary (correct/incorrect). CODA augments it with the two gates, producing a difficulty‑aware signal that the policy optimizes via standard RL (e.g., PPO).
- Inference loop – During generation, the model continuously evaluates the shaped reward; once the expected utility of another token drops below zero, decoding stops.
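The pipeline above can be sketched end to end in a toy form. Everything here is illustrative: the function names (`difficulty_estimate`, `gates`, `adaptive_decode`), the confidence curves, and the constants are my assumptions, not the paper's actual formulation; the real CODA trains the policy with RL on the shaped reward, whereas this only mimics the inference-time stopping logic.

```python
import statistics

# Toy stand-in for a model. All names and constants below are illustrative
# assumptions, not CODA's actual API or hyperparameters.

def rollout_confidences(seed: int, steps: int) -> list[float]:
    """Cheap 'group rollout': a short generation whose per-step confidence
    we track. Here a deterministic toy curve parameterized by seed."""
    rate = 0.3 * (seed % 5 + 1)
    return [1.0 - 1.0 / (1 + rate * t) for t in range(1, steps + 1)]

def difficulty_estimate(n_rollouts: int = 4, steps: int = 8) -> float:
    """Estimate difficulty from how slowly confidence stabilizes across a
    few cheap rollouts: a larger residual gap means a harder input."""
    gaps = []
    for r in range(n_rollouts):
        conf = rollout_confidences(r, steps)
        gaps.append(1.0 - conf[-1])          # distance from full confidence
    return statistics.mean(gaps)             # value in [0, 1]

def gates(difficulty: float, threshold: float = 0.3) -> tuple[float, float]:
    """Easy-side gate penalizes length when difficulty is low; hard-side
    gate rewards continued reasoning when difficulty is high."""
    easy = max(0.0, threshold - difficulty)  # non-negative scalar
    hard = max(0.0, difficulty - threshold)
    return easy, hard

def adaptive_decode(difficulty: float, token_cost: float = 0.01,
                    max_tokens: int = 64) -> int:
    """Decode until the gated marginal utility of one more token is <= 0."""
    easy, hard = gates(difficulty)
    for t in range(1, max_tokens + 1):
        marginal_gain = difficulty / (t + 1)  # toy: harder -> slower decay
        utility = marginal_gain + hard * 0.01 - token_cost - easy * 0.05
        if utility <= 0:
            return t                          # early stop
    return max_tokens
```

Running `adaptive_decode` on a low difficulty estimate terminates after a handful of tokens, while a high estimate keeps decoding to the cap, which is the qualitative behavior the two gates are designed to produce.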
Results & Findings
| Setting | Token Reduction (Easy) | Accuracy (Easy) | Accuracy (Hard) |
|---|---|---|---|
| Baseline (fixed length) | – | 92.1 % | 78.4 % |
| CODA (adaptive) | 62 % fewer tokens | 91.8 % (≈ baseline) | +2.3 % over baseline |
- Across model scales (7B‑65B parameters) the pattern holds: larger models still benefit from adaptive stopping.
- On diverse reasoning benchmarks (e.g., GSM‑8K, MathQA, CommonsenseQA), CODA consistently saves compute on the “easy” split while either preserving or improving scores on the “hard” split.
- Ablation studies show that removing either gate degrades performance: the easy‑side gate is crucial for token savings, while the hard‑side gate drives the gains on difficult items.
Practical Implications
- Cost‑effective APIs – Cloud providers can expose an “adaptive reasoning” mode that automatically trims latency and token billing for straightforward queries while preserving depth for complex ones.
- Real‑time assistants – Voice or chat assistants can respond faster on routine user requests (e.g., factual look‑ups) and allocate more compute only when the conversation turns ambiguous or multi‑step.
- Energy savings – Reducing unnecessary token generation translates directly into lower GPU utilization and carbon footprint, an increasingly important metric for large‑scale deployments.
- Plug‑and‑play – Since CODA works at the policy‑level, existing LLM services that already employ RLHF can integrate the gating mechanism with minimal engineering effort.
- Dynamic SLAs – Service‑level agreements can be made more flexible: developers can specify a maximum acceptable latency and let CODA automatically respect it by cutting off reasoning when the utility drops.
Limitations & Future Work
- Reliance on rollout quality – The difficulty signal comes from cheap rollouts; if those are noisy (e.g., in highly stochastic models), the gating may misclassify difficulty.
- No explicit user budget – While budget‑free operation is a strength, some applications may still need hard caps on latency or token count, which CODA does not natively enforce.
- Evaluation on narrow domains – Experiments focus on standard reasoning benchmarks; real‑world heterogeneous workloads (code generation, multimodal prompts) remain to be tested.
- Future directions suggested by the authors include: (1) integrating external difficulty predictors (e.g., prompt complexity metrics), (2) extending the framework to multi‑modal or tool‑using agents, and (3) exploring hierarchical gating where coarse‑level difficulty decides whether to invoke a more expensive specialist model.
Authors
- Siye Wu
- Jian Xie
- Yikai Zhang
- Yanghua Xiao
Paper Information
- arXiv ID: 2603.08659v1
- Categories: cs.CL
- Published: March 9, 2026