[Paper] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Source: arXiv - 2602.06025v1
Overview
Large‑language‑model (LLM) agents are starting to use external memory so they can reason over information that doesn’t fit in a single context window. Existing pipelines usually build this memory offline and without looking at the actual query, which can waste compute and even drop information that’s crucial for the current task. The paper Learning Query‑Aware Budget‑Tier Routing for Runtime Agent Memory introduces BudgetMem, a runtime‑centric memory system that lets developers explicitly trade off answer quality against the cost of constructing and using memory.
Key Contributions
- Budget‑tiered memory modules – each module (e.g., retrieval, summarisation, reasoning) is offered in three pre‑defined “budget” levels (Low / Mid / High) that differ in complexity, inference behavior, or model size.
- Lightweight routing policy – a compact neural controller, trained with reinforcement learning, decides per‑query which tier to use for each module, thereby shaping the overall cost‑performance curve.
- Unified testbed – the authors wrap the three budget‑tier strategies (implementation, reasoning, capacity) into a single framework, enabling systematic comparison across diverse benchmarks (LoCoMo, LongMemEval, HotpotQA).
- Empirical gains – BudgetMem outperforms strong baselines when the budget is generous and, more importantly, delivers a superior accuracy‑vs‑cost frontier when resources are tight.
- Analytical insights – the study disentangles when each tiering axis (method complexity, inference style, model capacity) is most beneficial, offering practical guidance for system designers.
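To make the tiered-module idea concrete, here is a minimal sketch of what a budget-tiered component might look like. This is an illustrative assumption, not the paper's implementation: the `TieredModule` class, the lambda stand-ins for retrieval backends, and the cost numbers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Tier(Enum):
    LOW = 0
    MID = 1
    HIGH = 2


@dataclass
class TieredModule:
    """A memory-pipeline component offered at three budget levels."""
    name: str
    variants: Dict[Tier, Callable[[str], str]]  # tier -> implementation
    cost: Dict[Tier, float]                     # relative compute cost per tier

    def run(self, tier: Tier, query: str) -> str:
        return self.variants[tier](query)


# Hypothetical retrieval module: a cheap sparse retriever at the Low tier,
# a dense retriever at the High tier (backends mocked as strings here).
retrieval = TieredModule(
    name="retrieval",
    variants={
        Tier.LOW: lambda q: f"bm25_hits({q})",
        Tier.MID: lambda q: f"hybrid_hits({q})",
        Tier.HIGH: lambda q: f"dense_hits({q})",
    },
    cost={Tier.LOW: 1.0, Tier.MID: 3.0, Tier.HIGH: 10.0},
)
```

A router then only has to pick a `Tier` per module per query; the module interface stays the same across tiers.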
Methodology
- Memory Modules – The system decomposes the agent’s memory pipeline into reusable components (e.g., document retrieval, passage summarisation, answer generation).
- Budget Tiers – three axes along which each module can be made cheaper or richer:
  - Implementation tier: same algorithmic idea but with cheaper vs. richer implementations (e.g., BM25 vs. dense retrieval).
  - Reasoning tier: different inference behaviours, such as single‑shot prompting vs. multi‑step chain‑of‑thought.
  - Capacity tier: smaller vs. larger underlying models (e.g., 7B vs. 13B parameters).
- Router Policy – A small transformer‑based policy network receives a query embedding and lightweight statistics about the current memory state, then outputs a tier choice for each module.
- Training – The router is trained with reinforcement learning where the reward balances task accuracy (e.g., exact‑match on HotpotQA) against a budget penalty proportional to compute time or token usage.
- Evaluation – Experiments sweep across three budget regimes (tight, moderate, generous) and compare against static‑tier baselines (always Low, always High) and prior runtime memory approaches.
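The reward shaping and per-query routing described above can be sketched as follows. This is a toy stand-in, not the paper's method: the actual router is a learned policy network trained with RL, whereas the heuristic `route` rule below, the `TIER_COST` and `DIFFICULTY_FLOOR` tables, and the `penalty` coefficient are all illustrative assumptions.

```python
# Hypothetical relative costs and difficulty thresholds per tier.
TIER_COST = {"Low": 1.0, "Mid": 3.0, "High": 10.0}
DIFFICULTY_FLOOR = {"High": 0.7, "Mid": 0.3, "Low": 0.0}


def reward(accuracy: float, cost: float, budget: float,
           penalty: float = 1.0) -> float:
    """RL reward: task accuracy minus a penalty for exceeding the budget."""
    overrun = max(0.0, cost - budget)
    return accuracy - penalty * overrun


def route(query_difficulty: float, budget: float) -> str:
    """Heuristic stand-in for the learned router: pick the richest tier
    that both fits the budget and is warranted by query difficulty."""
    for tier in ("High", "Mid", "Low"):
        if TIER_COST[tier] <= budget and query_difficulty >= DIFFICULTY_FLOOR[tier]:
            return tier
    return "Low"  # fall back to the cheapest tier
```

In training, the policy would be rewarded for routing hard queries to rich tiers only when the budget allows, which is what shapes the accuracy-vs-cost frontier reported below.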
Results & Findings
| Benchmark | High‑budget (max tier) | Tight‑budget (Low tier) | Budget‑aware (BudgetMem) |
|---|---|---|---|
| LoCoMo | +3.2 % EM over baseline | –1.1 % EM vs. baseline | +2.0 % EM while staying under budget |
| LongMemEval | +4.5 % F1 | –0.8 % F1 | +3.1 % F1 with 30 % less compute |
| HotpotQA | +5.0 % EM | –0.5 % EM | +4.2 % EM at 40 % lower latency |
- Accuracy‑cost frontier: BudgetMem consistently dominates static baselines, delivering higher scores for the same compute budget and lower compute for the same score.
- Tier‑axis analysis:
  - Implementation tiers shine when the budget is extremely tight (cheap retrieval still finds the right document).
  - Reasoning tiers give the biggest boost in the mid‑budget regime (chain‑of‑thought reasoning adds value without exploding cost).
  - Capacity tiers dominate only when the budget is generous, confirming that scaling model size is not the most efficient lever under constraints.
Practical Implications
- Dynamic cost control – Deployments (e.g., SaaS LLM assistants, chatbots) can expose a “performance budget” knob to customers, letting the system automatically dial up/down the memory sophistication per request.
- Resource‑aware scaling – Cloud providers can schedule cheaper memory pipelines for low‑priority queries while reserving high‑tier modules for premium or time‑critical tasks, improving overall throughput.
- Reduced hallucinations – By routing queries that need deep reasoning to higher reasoning tiers, agents can retrieve and synthesize more relevant context, mitigating common “out‑of‑scope” errors.
- Plug‑and‑play architecture – Because BudgetMem treats each memory component as a modular block, existing retrieval or summarisation services can be swapped in with minimal engineering effort.
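The “performance budget knob” could be exposed as a per-request parameter that a planner spends greedily across pipeline stages. This is a hypothetical API sketch, not from the paper: the stage names, `STAGE_COSTS` table, and the greedy `plan_pipeline` allocation rule are all assumptions.

```python
# Hypothetical per-stage, per-tier costs (arbitrary units).
STAGE_COSTS = {
    "retrieval":     {"Low": 1, "Mid": 3, "High": 10},
    "summarisation": {"Low": 1, "Mid": 4, "High": 12},
    "generation":    {"Low": 2, "Mid": 6, "High": 20},
}


def plan_pipeline(budget: float) -> dict:
    """Greedy sketch: walk the stages in order and pick the richest tier
    that still leaves enough budget to run every remaining stage at Low."""
    plan = {}
    remaining = budget
    order = list(STAGE_COSTS)
    for i, stage in enumerate(order):
        # Minimum budget the remaining stages need at their cheapest tier.
        floor = sum(STAGE_COSTS[s]["Low"] for s in order[i + 1:])
        for tier in ("High", "Mid", "Low"):
            if STAGE_COSTS[stage][tier] <= remaining - floor:
                plan[stage] = tier
                remaining -= STAGE_COSTS[stage][tier]
                break
        else:
            # Budget already exhausted: run at the cheapest tier anyway.
            plan[stage] = "Low"
            remaining -= STAGE_COSTS[stage]["Low"]
    return plan
```

A generous budget yields all-High tiers, while a tight one degrades every stage to Low, mirroring the static baselines the paper compares against; the learned router replaces this greedy rule with a query-aware decision.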
Limitations & Future Work
- Training overhead – The reinforcement‑learning router requires a separate optimisation phase; the authors note that the policy may need re‑training when new modules or datasets are added.
- Budget definition – The current experiments use compute time and token count as proxies for cost; real‑world deployments may need to incorporate memory bandwidth, GPU allocation, or monetary pricing.
- Generalisation – The router’s decisions are evaluated on the same benchmark families used for training; cross‑domain robustness (e.g., from QA to code generation) remains an open question.
- Future directions suggested include: (1) meta‑learning the router to adapt on‑the‑fly to novel tasks, (2) extending the tier space to incorporate retrieval‑augmented generation models, and (3) exploring multi‑objective optimization that jointly considers latency, energy, and user‑satisfaction metrics.
Authors
- Haozhen Zhang
- Haodong Yue
- Tao Feng
- Quanyu Long
- Jianzhu Bao
- Bowen Jin
- Weizhi Zhang
- Xiao Li
- Jiaxuan You
- Chengwei Qin
- Wenya Wang
Paper Information
- arXiv ID: 2602.06025v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 5, 2026